Environmetrics, Vol. 9, 53-65 (1998)
ANALYSIS OF VARIANCE BY RANDOMIZATION
WITH SMALL DATA SETS
LILIANA GONZALEZ¹* AND BRYAN F. J. MANLY²
¹ Finance and Quantitative Analysis Department, University of Otago, PO Box 56, Dunedin, New Zealand
² Department of Mathematics and Statistics, University of Otago, PO Box 56, Dunedin, New Zealand
SUMMARY
A simulation study has been carried out to compare the results from using different randomization methods
to assess the significance of the F-statistics for factor effects with analysis of variance. Two-way and three-way
designs with and without replication were considered, with the randomization of observations, the
restricted randomization of observations, and the randomization of different types of residuals. Data from
normal, uniform, exponential, and an empirical distribution were considered. It was found that usually all
methods of randomization gave similar results, as did the use of the usual F-distribution tables, and that no
method of analysis was clearly superior to the others under all conditions. © 1998 John Wiley & Sons, Ltd.
KEY WORDS: analysis of variance; computer intensive statistics; simulation; F-distribution; randomization inference
1. INTRODUCTION
Randomization methods for testing hypotheses are useful in environmental areas because their
justification is easy to understand by government officials and the public at large, and they are
applicable with the non-normal distributions that often occur. However, as soon as the required
analysis becomes more than something as simple as the comparison of the mean values of several
samples the appropriate randomization procedure becomes questionable. There has been controversy
with several types of randomization analysis but in this paper we only consider simple
factorial designs with analysis of variance. It was in this area that Crowley (1992) in his review of
resampling methods remarked that `contrasting views on interaction terms in factorial ANOVA
. . . needs to be resolved'.
In discussing the merits of randomization tests we start with one important premise. This is
that the major value of these tests is in situations where the data are grossly non-normal (e.g. with
several extremely large values and many tied values), and sample sizes are not large. It is in these
situations that more conventional methods are of questionable validity and the results of a
randomization test may carry more weight. This premise is important because it tells us that our
main concern should be with the performance of alternative methods on grossly non-normal data.
Theoretical or simulation studies suggesting that one method is somewhat better than another
with normally distributed data are not necessarily of much relevance.
* Correspondence to: Liliana Gonzalez, Finance and Quantitative Analysis Department, University of Otago,
PO Box 56, Dunedin, New Zealand. e-mail: [email protected]
CCC 1180-4009/98/010053-13$17.50
Received 29 December 1995
Revised 13 June 1997
In this context we address two questions:
(a) Should it be observations or residuals that are permuted and, if it is to be residuals, what
type of residuals should be used?
(b) Should restricted randomization be used where possible instead of a completely free
randomization?
Our method of approach involves looking at several quite specific situations and examining the
performance of alternative methods. Because the situations are specific there is no guarantee that
our conclusions apply to analysis of variance in general. Nevertheless, the results are clear enough
to suggest that this is the case.
In brief, our conclusion with respect to the testing of factor effects is that there is usually
similar performance with analysis of variance based on the usual F-distribution tables, the
randomization of observations, or the randomization of different types of residuals with normal
or moderately non-normal data. If anything, the use of F-distribution tables tends to give
slightly higher power than the other methods except with extremely non-normal data. On the
other hand, restricted randomization has guaranteed performance when the null hypothesis is
true, but has a tendency to give low power to detect real effects under some conditions. Overall,
therefore, it is not possible to say that any method of analysis is clearly superior to the other
methods.
2. THE SIMULATED SITUATION
To set the scene for our study, we imagine that three streams are sampled in the four seasons of a
year and some environmentally important variable is measured. In Section 3 we assume that one
reading only is taken for each stream in each season. The sample design is then suitable for a
two-factor analysis of variance without replication, with 3 × 4 = 12 observations. We generated
simulated data for analysis in four different ways:
(i) Twelve values were generated from a normal distribution and scaled to have a mean of
five and standard deviation of one. These values were then placed in a random order and
stream effects and seasonal effects were added to produce one set of data. This process
was repeated 1000 times for each of several combinations of stream effects and seasonal
effects.
(ii) The same process as described for (i) was used but with the original 12 values chosen
from a uniform distribution.
(iii) The same process as described for (i) was used but with the original 12 values chosen
from an exponential distribution.
(iv) The same process as described for (i) was used but with the original 12 values chosen
from an observed distribution of the ratio of counts of ephemeroptera to counts of
oligochaetes (a measure of pollution) in New Zealand streams, as shown in Table I.
Basically, the data generated by methods (i) and (ii) are quite `well-behaved', with no very extreme
values; the data generated by method (iii) are positively skewed with a few relatively large values;
and the data generated by method (iv) are like the worst type of data found in real life, with many
tied values that were originally zero before scaling to a mean of five and a standard deviation of
one and several relatively large values.
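The generation step described in (i) to (iv) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the seed, and the use of the sample standard deviation (n - 1 divisor) for the scaling are assumptions, since the text does not say which divisor was used.

```python
import random
import statistics

def standardize(values, mean=5.0, sd=1.0):
    """Rescale values so that their sample mean and SD equal the targets."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # n - 1 divisor: an assumption, not stated in the text
    return [mean + sd * (v - m) / s for v in values]

def random_layout(values, n_streams=3, n_seasons=4, rng=random):
    """Put the values in random order into a streams x seasons grid."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return [shuffled[i * n_seasons:(i + 1) * n_seasons]
            for i in range(n_streams)]

rng = random.Random(1)                             # arbitrary seed for repeatability
raw = [rng.expovariate(1.0) for _ in range(12)]    # scheme (iii): exponential values
base = standardize(raw)                            # mean five, standard deviation one
grid = random_layout(base, rng=rng)                # factor effects are added afterwards
```

Swapping the line that builds `raw` for normal, uniform, or resampled empirical values would give the other three schemes.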
Table I. Ratios of numbers of ephemeroptera to numbers of oligochaetes in New Zealand streams. The
three ratios shown for each stream-position-season combination come from three independent samples
taken in the same area at the same time
Stream
Position
Summer
Autumn
A
Bottom
Middle
Top
0.2,
2.6,
0.7,
0.2,
0.7,
8.5,
0.0
0. 7
7.1
0.0,
0.3,
0.5,
B
Bottom
Middle
Top
0.0,
0.3,
1.2,
0.0,
0.2,
0.7,
0. 0
0.3
0. 8
0.0,
0.0,
10.7,
C
Bottom
Middle
Top
0.1, 0.2, 0.1
1.0, 2.6, 1.9
7.3, 10.4, 9.5
0.0,
0.6,
1.1,
0. 0
0.4
1. 4
Winter
Spring
0.0,
0.0,
0.1,
0.0,
0.0,
0.1,
0.0
0. 0
0. 2
0.0,
0.0,
0.2,
0.0,
0.0,
0.1,
0.0
0.0
0.4
0.0, 0.0
0.0, 0.0
19.9, 9.4
0.0,
0.0,
2.0,
0.0,
0.0,
6.3,
0. 0
0. 0
4.8
0.0,
0.0,
2.0,
0.0,
0.0,
1.6,
0.0
0.1
1.9
0.0, 0.0, 0.1
0.0, 0.0, 0.0
46.6, 20.3, 24.0
0.4,
0.1,
1.2,
0.0,
0.2,
0.8,
0. 3
0.0
6. 1
0.0,
0.0,
0.2,
0.0,
0.0,
0.1,
0.0
0.0
0.1
After generation, the F-values for variation between streams and variation between seasons
were obtained for each set of data from a two-factor analysis of variance. The p-values were then
calculated by four different methods:
(a) The observations were randomized freely between the 12 factor combinations and the
analysis of variance was repeated. This was done 99 times and the p-value for a factor was
taken as the proportion of times that the corresponding observed F-value was equalled or
exceeded in the set of 100 F-values consisting of the observed one plus the 99 obtained by
randomization.
(b) The usual residuals for a two-way analysis of variance (observation − stream mean −
season mean + overall mean) were calculated, and these were then freely randomized
between the 12 factor combinations. This was repeated 99 times and p-values were
determined as for (a). This is the method proposed by ter Braak (1992).
(c) To test the effect of streams a restricted randomization was used, as described by
Edgington (1995, Chapter 6). This involves producing alternative sets of data by
randomly permuting observations between streams within seasons only, and thereby
allowing for possible differences between seasons. Similarly, to test the effect of seasons a
restricted randomization allowing observations to move between seasons but not between
streams was used. For each effect 99 restricted randomizations were carried out to produce
99 F-values for comparison with the observed F-value for that effect. The p-value for an
effect was then the proportion of F-values as large or larger than the observed one in the
set consisting of the observed F-value plus the 99 randomized values.
(d) The p-values for effects were determined by computing the probability of F-values as
large or larger than those observed from the usual F-distributions.
Method (a) was also used with tests based on the sums of squares of effects as proposed by Manly
(1991, Chapter 5). However the relatively poor performance of this approach when compared to
the others that were considered means that it is not worth discussing further.
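Methods (a) to (c) differ only in what is shuffled and how, so they can share one p-value routine. The Python sketch below is one possible reading of the descriptions above, not the authors' program: all function names and the example data are hypothetical, only the test for streams (rows) is shown for method (c), and for method (b) it is assumed that the observed F-value from the data is compared with F-values from the shuffled residuals.

```python
import random

def f_stats(y):
    """Return (F for rows, F for columns) in a two-way layout, one value per cell."""
    r, c = len(y), len(y[0])
    grand = sum(map(sum, y)) / (r * c)
    row_m = [sum(row) / c for row in y]
    col_m = [sum(row[j] for row in y) / r for j in range(c)]
    ss_row = c * sum((m - grand) ** 2 for m in row_m)
    ss_col = r * sum((m - grand) ** 2 for m in col_m)
    ss_err = sum((y[i][j] - row_m[i] - col_m[j] + grand) ** 2
                 for i in range(r) for j in range(c))
    if ss_err == 0.0:                       # degenerate case, e.g. heavily tied data
        return float("inf"), float("inf")
    ms_err = ss_err / ((r - 1) * (c - 1))
    return ss_row / (r - 1) / ms_err, ss_col / (c - 1) / ms_err

def residuals(y):
    """Observation - row mean - column mean + overall mean, as in method (b)."""
    r, c = len(y), len(y[0])
    grand = sum(map(sum, y)) / (r * c)
    row_m = [sum(row) / c for row in y]
    col_m = [sum(row[j] for row in y) / r for j in range(c)]
    return [[y[i][j] - row_m[i] - col_m[j] + grand for j in range(c)]
            for i in range(r)]

def shuffle_free(y, rng):
    """Methods (a) and (b): move values freely between all cells."""
    flat = [v for row in y for v in row]
    rng.shuffle(flat)
    c = len(y[0])
    return [flat[i * c:(i + 1) * c] for i in range(len(y))]

def shuffle_within_columns(y, rng):
    """Method (c), test of rows: permute values within each column only."""
    cols = [[row[j] for row in y] for j in range(len(y[0]))]
    for col in cols:
        rng.shuffle(col)
    return [[cols[j][i] for j in range(len(cols))] for i in range(len(y))]

def randomization_p(y, shuffler, index, source=None, n_rand=99, rng=None):
    """Proportion of the n_rand + 1 F-values that equal or exceed the observed one."""
    rng = rng or random.Random()
    observed = f_stats(y)[index]
    source = y if source is None else source
    hits = 1                                # the observed F-value itself
    for _ in range(n_rand):
        if f_stats(shuffler(source, rng))[index] >= observed:
            hits += 1
    return hits / (n_rand + 1)

# Rows are streams, columns are seasons; made-up data with a clear stream effect.
y = [[5.1, 4.9, 5.3, 4.7], [9.0, 9.2, 8.8, 9.1], [13.1, 12.9, 13.0, 13.2]]
rng = random.Random(2)
p_a = randomization_p(y, shuffle_free, 0, rng=rng)                       # method (a)
p_b = randomization_p(y, shuffle_free, 0, source=residuals(y), rng=rng)  # method (b)
p_c = randomization_p(y, shuffle_within_columns, 0, rng=rng)             # method (c)
```

With 99 randomizations the smallest attainable p-value is 0.01, because the observed F-value always counts as one of the 100.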
The results of this simulation allow the comparison of the different testing procedures under
quite a wide range of conditions in terms of the type of data involved. In Sections 4 and 5 slightly
more complicated designs involving two factors with replication and the introduction of a third
factor for the position in the stream are also considered. These introduce the problem of testing
for interactions between factors.
3. A TWO-FACTOR SAMPLING DESIGN WITHOUT REPLICATION
The two-factor sampling design was simulated as described in the previous section, with data
from four distributions (normal, uniform, exponential and empirical). Seven levels for the
combined effects of streams and seasons were used. Let A0 stand for no stream effects, A1 stand
for low stream effects, A2 stand for high stream effects, B0 stand for no seasonal effects, B1 stand
for low seasonal effects, and B2 stand for high seasonal effects. The seven levels are then A0 and
B0, A1 and B0, A2 and B0, A0 and B1, A0 and B2, A1 and B1, and A2 and B2, where, for
example, A1 and B0 means that there was a low level of stream effects and no seasonal effects. To
apply the levels A1, A2, B1 and B2 the 12 data values had constants added to them after being
placed in an initial random order as described in the last section. For A1 stream 2 observations
were increased by 1.0 and stream 3 observations by 2.0; for A2 stream 2 observations were
increased by 2.0 and stream 3 observations by 4.0; for B1 season 2 observations were increased by
0.7, season 3 observations were increased by 1.3, and season 4 observations were increased by 2.0;
for B2 season 2 observations were increased by 1.3, season 3 observations were increased by 2.7,
and season 4 observations were increased by 4.0.
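The constants listed above can be collected in two small lookup tables. The following Python sketch is illustrative only; the dictionary and function names are hypothetical, and the grid used in the example is a placeholder rather than simulated data.

```python
# Constants added to streams 1-3 and seasons 1-4, as listed in the text.
STREAM_EFFECTS = {"A0": (0.0, 0.0, 0.0),
                  "A1": (0.0, 1.0, 2.0),
                  "A2": (0.0, 2.0, 4.0)}
SEASON_EFFECTS = {"B0": (0.0, 0.0, 0.0, 0.0),
                  "B1": (0.0, 0.7, 1.3, 2.0),
                  "B2": (0.0, 1.3, 2.7, 4.0)}

def add_effects(grid, stream_level, season_level):
    """Return a new grid (rows = streams, columns = seasons) with effects added."""
    a = STREAM_EFFECTS[stream_level]
    b = SEASON_EFFECTS[season_level]
    return [[grid[i][j] + a[i] + b[j] for j in range(len(b))]
            for i in range(len(a))]

base = [[5.0] * 4 for _ in range(3)]       # placeholder grid, not simulated data
low_both = add_effects(base, "A1", "B1")   # the level "A1 and B1"
```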
The results of the simulations are summarized in Table II in terms of the percentage of
significant results for tests at the 5 per cent level of significance. Cases where the null hypothesis is
true and the observed result is significantly different from 5 per cent with a test at the 5 per cent
level (i.e. the result is outside the range 3.6 to 6.4 per cent for single values, and outside the range
4.4 to 5.6 per cent for the average size) are underlined once. Cases where the null hypothesis was not
true and a test gave the fewest significant results (i.e. had relatively low observed power) are
underlined twice. The average size is the mean percentage of significant results when the null
hypothesis was true. The average power is the mean percentage of significant results when the null
hypothesis was not true. It is desirable for this to be large.
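The two ranges quoted above are consistent with a normal approximation to the binomial distribution. The Python sketch below reproduces them under two assumptions that are not stated in the text: each tabulated percentage rests on 1000 independent simulated data sets, and each average size pools six such percentages (three null cases for each of the two F-tests). The function name is hypothetical.

```python
import math

def null_range(n_tests, level=5.0, z=1.96):
    """Approximate 95% range for the percentage of significant results
    among n_tests independent tests when the true rate is `level` per cent."""
    se = math.sqrt(level * (100.0 - level) / n_tests)
    return level - z * se, level + z * se

single = null_range(1000)        # one table entry: 1000 simulated data sets
average = null_range(6 * 1000)   # assumed: six null entries pooled per average
print([round(v, 1) for v in single], [round(v, 1) for v in average])
```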
Overall the testing procedures have given fairly similar results. However the following
tendencies can be noted:
1. Freely randomizing observations usually gave good power and did relatively well in terms
of maintaining an average size close to 5 per cent when the null hypothesis was true.
2. Freely randomizing residuals as proposed by ter Braak (1992) did slightly worse than freely
randomizing observations with respect to the percentage of significant results when the null
hypothesis was true and with respect to power when the null hypothesis was not true.
3. Restricted randomization as proposed by Edgington (1995) gave about 5 per cent significant
results when the null hypothesis was true, except with the empirical data, but often
had the lowest average power of all the procedures.
4. Use of F-tables for testing gave about the correct proportion of significant results when
the null hypothesis was true and had the highest power of all the tests, except with the
empirical data.
If anything the simulation results suggest that the use of F-tables is better than any of the
randomization methods, except with extremely non-normal data, in which case freely randomizing
observations gave the best results. However it should be noted that only 99 randomizations
were used with the randomization tests to keep computational times within reasonable limits.
Table II. Results from the simulation experiment with the two-factor sampling design without replication.
The tabulated values are the percentages of significant results with tests at the 5 per cent level (F1, test for a
stream effect; F2, test for a seasonal effect). Single underlined values are those where the null hypothesis is
true but the percentage of significant results is significantly different from 5 per cent (i.e. the result is outside
the range 3.6 to 6.4 per cent for single values and outside the range 4.4 to 5.6 per cent for average size), for a
test at the 5 per cent level. Double underlined values are those where the test produced the fewest significant
results when the null hypothesis was not true. The average size is for the cases where the null hypothesis was
true so that 5 per cent is the desired value. The average power is for cases where the null hypothesis was
not true so that high values are desirable

                            Observations     ter Braak     Edgington        Use of
                              randomized        method        method      F-tables
Effects added to data          F1     F2     F1     F2     F1     F2     F1     F2
Normal data
  1. None                     4.8    3.9    4.3    2.7    5.7    3.8    4.9    3.0
  2. Low for streams         36.6    5.2   35.6    5.1   33.9    5.5   39.4    5.6
  3. High for streams        97.1    3.4   98.0    4.2   93.9    4.3  100.0    4.1
  4. Low for seasons          4.1   23.3    3.5   23.4    4.2   23.6    4.0   24.9
  5. High for seasons         4.9   80.3    4.7   81.4    5.3   81.2    5.0   86.5
  6. Low for both            37.5   24.1   36.2   21.4   34.3   22.3   39.1   23.6
  7. High for both           97.7   82.1   97.1   81.6   95.1   79.6  100.0   86.1
  Average size                    4.4           4.1           4.8           4.4
  Average power                  59.8          59.3          58.0          62.5
Uniform data
  1. None                     4.9    5.4    6.2    5.6    4.6    5.0    7.1    5.4
  2. Low for streams         36.7    5.2   36.7    4.6   33.9    4.3   41.0    5.2
  3. High for streams        97.3    4.9   98.2    5.0   96.3    5.3  100.0    5.5
  4. Low for seasons          5.3   22.8    5.3   23.9    3.5   24.6    5.5   28.0
  5. High for seasons         6.8   80.8    6.6   79.0    4.9   79.7    7.4   86.1
  6. Low for both            37.0   24.1   36.4   22.2   31.6   23.2   39.0   26.0
  7. High for both           98.1   80.1   97.0   80.1   95.0   81.5  100.0   85.5
  Average size                    5.4           5.6           4.6           6.0
  Average power                  59.6          59.2          58.2          63.2
Exponential data
  1. None                     4.8    5.6    2.7    6.4    4.7    5.4    3.7    6.3
  2. Low for streams         37.1    5.2   35.8    5.8   33.6    5.5   38.2    5.9
  3. High for streams        97.0    4.8   97.2    6.1   94.7    5.2   99.9    5.5
  4. Low for seasons          3.3   22.7    2.5   23.4    4.2   20.4    3.5   24.2
  5. High for seasons         4.0   82.1    3.8   81.4    5.1   80.7    4.5   88.2
  6. Low for both            37.1   21.3   34.1   19.8   32.4   20.0   37.6   22.3
  7. High for both           97.9   83.3   98.2   80.3   94.5   80.1  100.0   87.4
  Average size                    4.6           4.6           5.0           4.9
  Average power                  59.8          58.8          57.1          62.2
Empirical data
  1. None                     2.9    2.5    0.9    2.3    0.0    0.7    0.0    2.6
  2. Low for streams         22.2    1.9    7.5    1.8   17.9    0.3    7.0    1.9
  3. High for streams        47.9    1.9   40.9    2.3   42.7    0.9   40.7    2.5
  4. Low for seasons          4.1   13.9    0.2    4.2    0.1   12.9    0.0    2.9
  5. High for seasons         1.7   31.2    0.3   21.5    0.0   33.5    0.0   18.6
  6. Low for both            17.3    9.2    9.7    3.9   17.8   11.6    8.3    2.6
  7. High for both           42.9   26.6   40.7   23.0   41.7   33.8   38.7   21.1
  Average size                    2.3           1.3           0.3           1.2
  Average power                  26.4          18.9          26.5          17.5
The use of more than 99 randomizations can be expected to increase the power of the randomization tests to some extent, and hence reduce the apparent superiority of the use of F-tables in
this respect. Some limited extra simulations that were carried out to investigate this show that
using 999 rather than 99 randomizations can increase power by up to about 2 per cent. Therefore
overall it is fair to say that in the context of this example none of the testing methods has proved
to be clearly superior to the others but randomizing observations seems slightly better than the
alternatives in terms of maintaining a size close to 5 per cent and having good power.
4. A TWO-FACTOR SAMPLING DESIGN WITH REPLICATION
The second design that we have considered is the same as the first except that replicate
observations were considered for each of the 12 stream-season combinations. Situations were simulated
with two, four and six replicates, using the same four types of data (normal, uniform, exponential
and empirical) as for the design without replicates. However, stream (A) and season (B) effects
were halved to take into account to some extent the extra power for detecting effects that was
introduced by the replication. In addition, interactions were introduced at a low level (AB1) by
adding 0.5 to all observations in stream 3 and season 3, and 1.0 to all observations in stream 3 in
season 4, and interactions at a high level (AB2) by adding twice these increments. The values used
for the empirical distribution are shown in Table I.
Nine different scenarios were simulated for the sample design: A0 B0 AB0 (no effects);
A1 B0 AB0 (low stream effects); A2 B0 AB0 (high stream effects); A0 B1 AB0
(low seasonal effects); A0 B2 AB0 (high seasonal effects); A1 B1 AB0 (low stream and
low seasonal effects); A2 B2 AB0 (high stream and high seasonal effects); A1 B1 AB1
(low stream effects, low seasonal effects and low interaction); and A2 B2 AB2 (high stream
effects, high seasonal effects and high interaction).
Tests of main effects were conducted using the methods (a) to (d) described in Section 2. In
addition, the stream by season interaction was tested by the following four methods.
(e) The observed F-value for the interaction from the analysis of variance was tested by
comparison with the distribution consisting of itself plus 99 alternative values obtained
by freely randomizing observations. An interaction was then considered to be significant
at the 5 per cent level if the observed F-value was one of the five largest in the set of 100.
(f) This was similar to (e) except that the randomized F-values were obtained by freely
reallocating residuals calculated as differences between observations and the means from the
stream-season combinations to which they belonged, as proposed by ter Braak (1992).
(g) This was similar to (e) except that residuals were calculated as deviations from the analysis
of variance model without interactions. That is to say, the residuals were calculated
as (observation − stream mean − season mean + overall mean). This approach was
suggested by Still and White (1981).
(h) The observed interaction F-value was tested using the F-distribution in the usual way.
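Methods (e) to (g) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the program used for the simulations: the function names and example data are hypothetical, the design is assumed to be balanced, and for (f) and (g) the observed F-value from the data is compared with F-values computed from the shuffled residuals.

```python
import random

def cell_means(y):
    """Mean of the replicates in every stream-season cell."""
    return [[sum(cell) / len(cell) for cell in row] for row in y]

def additive_fit(y):
    """Row mean + column mean - overall mean for each cell (balanced data)."""
    m = cell_means(y)
    a, b = len(m), len(m[0])
    row = [sum(m[i]) / b for i in range(a)]
    col = [sum(m[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(row) / a
    return [[row[i] + col[j] - grand for j in range(b)] for i in range(a)]

def interaction_f(y):
    """F-statistic for the interaction; y[i][j] is the list of replicates."""
    a, b, n = len(y), len(y[0]), len(y[0][0])
    m, fit = cell_means(y), additive_fit(y)
    ss_int = n * sum((m[i][j] - fit[i][j]) ** 2
                     for i in range(a) for j in range(b))
    ss_err = sum((v - m[i][j]) ** 2
                 for i in range(a) for j in range(b) for v in y[i][j])
    if ss_err == 0.0:                      # degenerate case, e.g. heavily tied data
        return float("inf")
    return (ss_int / ((a - 1) * (b - 1))) / (ss_err / (a * b * (n - 1)))

def residuals(y, kind):
    """kind 'f': deviations from cell means; kind 'g': from the additive model."""
    fit = cell_means(y) if kind == "f" else additive_fit(y)
    return [[[v - fit[i][j] for v in cell] for j, cell in enumerate(row)]
            for i, row in enumerate(y)]

def shuffle_free(y, rng):
    """Reallocate all values freely between cells, keeping the cell sizes."""
    a, b, n = len(y), len(y[0]), len(y[0][0])
    flat = [v for row in y for cell in row for v in cell]
    rng.shuffle(flat)
    return [[flat[(i * b + j) * n:(i * b + j + 1) * n] for j in range(b)]
            for i in range(a)]

def interaction_p(y, source=None, n_rand=99, rng=None):
    """Proportion of the n_rand + 1 F-values that equal or exceed the observed one."""
    rng = rng or random.Random()
    observed = interaction_f(y)
    source = y if source is None else source
    hits = 1 + sum(interaction_f(shuffle_free(source, rng)) >= observed
                   for _ in range(n_rand))
    return hits / (n_rand + 1)

rng = random.Random(3)
y = [[[5.0 + rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(4)]
     for _ in range(3)]                    # 3 streams x 4 seasons x 2 replicates
y[2][2] = [v + 5.0 for v in y[2][2]]       # made-up interaction in two cells
y[2][3] = [v + 10.0 for v in y[2][3]]
p_e = interaction_p(y, rng=rng)                                  # method (e)
p_f = interaction_p(y, source=residuals(y, "f"), rng=rng)        # method (f)
p_g = interaction_p(y, source=residuals(y, "g"), rng=rng)        # method (g)
```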
Because the output from the simulations in terms of percentages of significant results is very
extensive, we only present part of this and a summary of the remainder. Table III shows the
percentages significant with four replicates. As for Table II, percentages are underlined once when
they are significantly different from 5 per cent and the null hypothesis is true, and underlined
twice when they are the lowest percentage for the alternative methods of testing the particular
effects.
Table III. Results from the simulation experiment with the two-factor sampling design with four replicates. The tabulated values are the
percentages of significant results with tests at the 5 per cent level (F1, test for a stream effect; F2, test for a seasonal effect; F12, test for the
interaction effect). Single underlined and double underlined values are as defined in Table II. Tests on interactions (e) to (h) are as described in
Section 4. The average size and average power are as defined in Table II

                              Observations            ter Braak            Edgington               Use of
                                randomized    (e)        method    (f)        method    (g)      F-tables    (h)
Effects added to data            F1     F2    F12     F1     F2    F12     F1     F2    F12     F1     F2    F12
Normal data
  1. None                       5.6    4.4    5.0    5.0    4.9    5.4    4.8    4.6    4.8    5.4    4.7    5.4
  2. Low for streams           63.0    4.9    4.4   63.8    4.5    4.8   65.3    4.9    4.2   65.2    4.9    4.6
  3. High for streams         100.0    5.0    6.5   99.9    4.3    6.5  100.0    4.7    6.6  100.0    4.1    6.0
  4. Low for seasons            5.0   49.2    5.7    6.1   49.0    5.8    5.2   49.5    5.3    5.4   50.2    5.8
  5. High for seasons           3.4   99.5    4.4    4.1   99.0    5.5    3.6   99.4    5.3    3.5   99.3    4.7
  6. Low for both              64.3   46.7    6.0   63.2   45.4    5.5   63.6   46.6    5.1   65.8   48.5    5.6
  7. High for both             99.9   99.1    4.2  100.0   99.1    4.3   99.9   99.2    3.7   99.9   99.3    4.1
  8. Low with interaction      91.9   78.2    9.6   92.1   78.7   10.0   90.6   76.0    9.4   92.8   79.7    9.6
  9. High with interaction    100.0  100.0   40.4  100.0  100.0   39.8  100.0  100.0   40.7  100.0  100.0   41.9
  Average size                      4.7       5.2        4.8       5.4        4.6       5.0        4.7       5.2
  Average power                    82.7      25.0       82.5      24.9       82.5      25.1       83.4      25.8
Uniform data
  1. None                       4.5    5.3    4.6    4.8    4.6    5.1    4.1    4.6    4.8    4.4    5.0    4.8
  2. Low for streams           66.9    5.5    5.4   66.7    4.7    5.5   65.9    5.3    6.0   69.2    4.9    5.7
  3. High for streams          99.9    4.8    4.5  100.0    4.9    3.8  100.0    4.7    4.2  100.0    4.9    3.8
  4. Low for seasons            4.1   45.1    5.6    4.4   43.9    5.5    3.9   44.1    5.6    3.9   46.7    5.6
  5. High for seasons           4.5   99.8    5.3    4.4  100.0    6.7    4.4   99.9    5.2    4.4  100.0    5.6
  6. Low for both              60.4   49.4    4.9   60.9   47.0    4.8   61.9   48.3    4.3   62.4   48.9    4.6
  7. High for both             99.9   99.4    6.0  100.0   99.3    6.2   99.9   99.7    6.6   99.9   99.6    6.3
  8. Low with interaction      90.6   78.2   13.3   92.1   79.0   14.3   91.6   77.7   14.0   93.1   80.2   14.1
  9. High with interaction    100.0  100.0   37.8  100.0  100.0   37.8  100.0  100.0   38.6  100.0  100.0   39.1
  Average size                      4.8       5.2        4.6       5.4        4.5       5.2        4.6       5.2
  Average power                    82.5      25.6       82.4      26.1       82.4      26.3       83.3      26.6
Exponential data
  1. None                       6.1    6.1    5.0    5.6    5.6    5.0    5.6    5.6    5.9    5.7    5.9    5.1
  2. Low for streams           64.0    3.8    4.9   64.4    4.0    4.7   65.3    3.9    4.7   66.2    3.8    4.3
  3. High for streams         100.0    4.2    5.7  100.0    4.8    6.4  100.0    4.7    6.8  100.0    4.6    6.0
  4. Low for seasons            4.0   46.4    3.6    4.9   46.1    3.6    5.2   48.1    3.4    4.3   47.6    3.3
  5. High for seasons           4.6   99.2    4.9    5.3   99.2    4.3    5.0   99.3    4.9    4.8   99.6    4.1
  6. Low for both              65.6   47.8    2.9   64.7   48.2    2.6   65.7   50.2    3.0   66.9   48.8    2.6
  7. High for both            100.0   99.2    4.5   99.9   98.8    5.5  100.0   99.1    5.5  100.0   99.6    4.7
  8. Low with interaction      92.3   78.2   11.0   91.1   78.3   10.7   91.2   77.0   10.5   93.2   78.9   10.3
  9. High with interaction    100.0  100.0   39.9  100.0  100.0   39.6  100.0  100.0   39.9  100.0  100.0   41.0
  Average size                      4.8       4.5        5.0       4.6        5.0       4.9        4.9       4.3
  Average power                    82.7      25.5       82.6      25.2       83.0      25.2       83.4      25.7
Empirical data
  1. None                       4.9    3.9    4.1    4.0    3.7    4.5    4.5    4.3    4.3    3.2    3.0    3.1
  2. Low for streams           68.9    5.0    5.1   68.7    4.9    4.5   69.0    5.4    4.7   66.2    3.6    2.6
  3. High for streams         100.0    3.5    3.3  100.0    3.9    3.9  100.0    5.1    4.0  100.0    2.9    2.6
  4. Low for seasons            4.7   54.2    4.5    4.2   52.8    4.3    5.7   55.7    4.0    3.2   49.8    3.1
  5. High for seasons           3.9   99.8    3.2    4.7   99.8    3.9    4.9   99.5    3.8    3.4   99.9    2.0
  6. Low for both              67.5   55.0    4.0   66.9   56.1    4.1   67.2   58.0    3.9   64.5   51.5    2.6
  7. High for both            100.0   99.7    3.5  100.0   99.7    4.2  100.0  100.0    4.2  100.0   99.9    3.0
  8. Low with interaction      95.0   79.9    8.8   96.2   82.2   10.5   94.3   79.9   10.3   95.0   80.0    7.8
  9. High with interaction    100.0  100.0   36.7  100.0  100.0   44.9  100.0  100.0   41.9  100.0  100.0   38.2
  Average size                      4.3       4.0        4.2       4.2        5.0       4.1        3.2       2.7
  Average power                    85.0      22.8       85.2      27.7       85.3      26.1       83.9      23.0
A summary of all the simulation results is provided in Table IV in terms of average size
and power, with the results for the design without replication included for completeness. From
these results it is apparent that overall the differences between the randomization methods are
relatively small.
5. A THREE-FACTOR SAMPLING DESIGN
The third design that we have considered is where samples are taken at three depths (low, medium
and high), in three rivers, for four seasons, with either no replication, two replicates, four
replicates, or six replicates for each of the 3 × 3 × 4 = 36 factor combinations. Data for analysis
were again generated by starting with a fixed set of values, allocating them in a random order to
the factor combinations and then adding factor effects at various levels. The initial fixed sets of
values were also chosen as before from normal, uniform, exponential and empirical distributions,
and coded to have a mean of five and a standard deviation of one. The empirical distribution was
the same one as was used for the two-factor simulations (Table I).
The significance levels for F-statistics from analysis of variance were determined by three
methods for this design. The first method involved comparing each F-statistic with the
distribution obtained for the same statistic by randomly permuting the original data values and redoing
the analysis of variance. In this case the reference distribution consisted of 99 randomized
F-statistics plus the original statistic. Significance at the 5 per cent level was then obtained if the
F-statistic for the original data was among the largest five of the 100 statistics. The second method
used ter Braak's (1992) approach of randomizing residuals instead of observations, but was
otherwise similar to the first method. The residuals in this case were from a model without the
three-factor interaction when there was no replication, and for a model with the three-factor
interaction when there was replication. Finally, the third method involved comparing the
F-statistics with the F-distribution in the conventional manner.
Rather than devoting a good deal of space to defining the factor effects that we used and
presenting many tables of results, we have just provided an overall summary in Table V. Our
conclusions from these three-factor simulations are essentially the same as for the two-factor
simulations. For normal or moderately non-normal data the use of F-tables is quite satisfactory,
and tends to give slightly higher power than randomization. However with the grossly
non-normal empirical data the randomization of observations gave the best performance in terms of
power and size.
6. EXAMPLE
As an example we consider the ratios of species numbers in different streams, seasons, and
positions in streams that are shown in Table I. The question to be considered is whether there is
any evidence that the E/O ratio varied with the stream (A, B, C), the position within stream
(bottom, middle, top), or the season of the year (summer, autumn, winter, spring).
An analysis of variance on the ratios gives the results shown in Table VI. Here the F-ratios all
have the `error' mean square as the denominator, because fixed effects are being assumed. For all
the methods of testing, the ratios are very highly significant, corresponding to probability levels
of 0.0001 or less when compared with tables. If there had been any discrepancies then our
simulations suggest that in this situation the randomization of observations gives the most reliable
results.
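Because fixed effects are assumed, each F-ratio is simply the mean square for the effect divided by the error mean square. As a check on the arithmetic, the short Python sketch below recomputes the F-ratios from the degrees of freedom and sums of squares printed in Table VI; the variable names are hypothetical.

```python
# (source, degrees of freedom, sum of squares) as printed in Table VI
rows = [
    ("Stream", 2, 166.3),
    ("Position", 2, 753.4),
    ("Season", 3, 364.3),
    ("Stream x position", 4, 313.2),
    ("Stream x season", 6, 316.0),
    ("Position x season", 6, 724.5),
    ("Stream x position x season", 12, 640.5),
]
error_df, error_ss = 72, 542.1

ms_error = error_ss / error_df                     # error mean square
f_ratios = {name: (ss / df) / ms_error for name, df, ss in rows}
for name, f in f_ratios.items():
    print(f"{name:28s} F = {f:6.2f}")
```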
Table IV. Summary of the simulation results for a two-factor design in terms of the average size and the average power as defined for Table II,
with the averages calculated separately for stream and season effects (F1 and F2) and interaction (F12). Single underlined values are those where
the average size is for the cases where the null hypothesis was true but the percentage of significant results is significantly different from 5 per cent
(i.e. the result is outside the range 4.4 to 5.6 per cent for average size, and outside the range 4.9 to 5.1 per cent for overall results). Double
underlining is as defined in Table II

                     Observations               ter Braak               Edgington                  Use of
Number of   Average    randomized         (e)      method         (f)      method         (g)    F-tables         (h)
replicates  for         F1 and F2         F12   F1 and F2         F12   F1 and F2         F12   F1 and F2         F12
Normal data
1           Size              4.4           -         4.1           -         4.8           -         4.4           -
            Power            59.8           -        59.3           -        58.0           -        62.5           -
2           Size              5.0         5.1         4.9         5.2         5.1         4.9         4.8         5.0
            Power            57.0        10.8        57.2        11.3        59.9        11.7        58.8        11.6
4           Size              4.7         5.2         4.8         5.4         4.6         5.0         4.7         5.2
            Power            82.7        25.0        82.5        24.9        82.5        25.1        83.4        25.8
6           Size              5.1         5.1         4.7         4.9         5.0         5.1         4.8         5.0
            Power            92.2        39.2        92.1        39.4        92.1        39.3        92.7        40.9
Uniform data
1           Size              5.4           -         5.6           -         4.6           -         6.0           -
            Power            59.6           -        59.2           -        58.2           -        63.2           -
2           Size              4.7         5.2         5.2         5.7         4.6         5.4         5.1         5.3
            Power            57.0        10.1        56.8        10.6        59.4        10.2        58.5        10.8
4           Size              4.8         5.2         4.6         5.4         4.5         5.2         4.6         5.2
            Power            82.5        25.6        82.4        26.1        82.4        26.3        83.3        26.6
6           Size              5.2         5.3         5.2         5.2         4.8         5.1         5.2         5.1
            Power            92.5        40.0        92.5        40.3        92.3        41.4        93.1        42.0
Exponential data
1           Size              4.6           -         4.6           -         5.0           -         4.9           -
            Power            59.8           -        58.8           -        57.1           -        62.2           -
2           Size              5.7         6.2         6.4         7.9         5.2         6.2         5.9         6.9
            Power            56.9        11.0        57.6        13.9        60.5        11.2        58.3        12.1
4           Size              4.8         4.5         5.0         4.6         5.0         4.9         4.9         4.3
            Power            82.7        25.5        82.6        25.2        83.0        25.2        83.4        25.7
6           Size              5.4         4.7         5.2         4.5         5.6         4.8         4.9         4.2
            Power            92.4        38.8        92.5        39.7        92.4        40.1        92.7        40.2
Empirical data
1           Size              2.3           -         1.3           -         0.3           -         1.2           -
            Power            26.4           -        18.9           -        26.5           -        17.5           -
2           Size              4.9         4.1         4.9         4.4         5.4         4.4         4.4         4.3
            Power            56.3         6.7        60.4         7.9        63.7         6.7        59.0         5.7
4           Size              4.3         4.0         4.2         4.2         5.0         4.1         3.2         2.7
            Power            85.0        22.8        85.2        27.7        85.3        26.1        83.9        23.0
6           Size              3.9         3.7         3.7         3.7         4.8         4.2         2.0         2.0
            Power            93.1        38.7        93.4        45.2        93.3        45.5        91.9        38.6
Over all distributions
            Size              4.7         4.9         4.7         5.1         4.6         4.9         4.4         4.6
            Power            71.0        24.5        70.7        26.0        71.7        25.7        71.5        25.3
64
L. GONZALEZ AND B. F. J. MANLY
Table V. Summary of the results obtained from simulations with the three-factor sampling design. Overall
results are given for tests on the main effects of the position in the stream, the stream, and the season, and
the three interactions of position × stream, position × season, and stream × season. For each effect, the
average size is the mean percentage of significant results when the null hypothesis was true, and the average
power is the mean percentage of significant results when the null hypothesis was not true. All tests were at
the 5 per cent level. Single underlined values are those where the null hypothesis was true but the
percentage of significant results is significantly different from 5 per cent (i.e. the result is outside the
range 4.4 to 5.6 per cent for average size, and outside the range 4.91 to 5.09 per cent for overall results).
Double underlining is as defined in Table II

Number of   Average  Observations randomized             ter Braak method                    Use of F-table
replicates  for      F1, F2 and F3     F12, F13 and F23  F1, F2 and F3     F12, F13 and F23  F1, F2 and F3     F12, F13 and F23

Normal distribution
1           Size     5.1               5.3               5.2               5.4               5.4               5.5
            Power    13.4              5.3               13.9              5.3               14.2              5.3
2           Size     5.1               5.0               4.9               5.1               4.9               5.0
            Power    27.9              6.5               28.1              6.5               28.9              6.7
4           Size     5.2               5.1               5.1               5.2               5.1               5.1
            Power    55.2              9.4               55.2              9.1               56.6              9.9
6           Size     5.0               5.1               5.0               5.0               4.9               5.0
            Power    72.3              14.0              72.3              13.8              73.8              14.3

Uniform distribution
1           Size     5.4               4.9               5.5               5.2               5.7               5.3
            Power    13.6              4.6               13.8              4.9               14.4              5.1
2           Size     4.8               5.0               4.8               5.1               4.9               5.2
            Power    28.5              7.6               28.6              7.7               29.3              8.1
4           Size     5.1               5.0               5.2               5.1               5.2               5.1
            Power    55.7              10.3              55.7              10.1              56.9              10.5
6           Size     4.9               5.3               4.8               5.3               4.8               5.4
            Power    72.8              14.1              72.7              14.1              73.9              14.6

Exponential distribution
1           Size     4.8               5.0               4.7               4.7               4.8               4.9
            Power    13.7              5.5               13.2              5.8               13.7              5.7
2           Size     5.0               4.9               4.7               4.9               4.8               4.8
            Power    28.8              7.4               28.8              7.0               29.1              7.2
4           Size     5.1               5.0               5.1               4.9               4.9               4.8
            Power    55.7              10.8              55.4              10.5              56.4              10.4
6           Size     5.1               4.8               5.1               4.8               5.0               4.8
            Power    72.4              14.5              72.5              14.8              73.7              14.7

Empirical distribution
1           Size     4.9               4.8               3.8               3.4               3.7               3.4
            Power    14.8              6.1               12.4              4.7               12.7              4.7
2           Size     5.0               5.3               4.9               5.8               4.6               5.3
            Power    29.2              7.7               28.7              7.8               28.7              7.4
4           Size     4.9               4.6               3.9               3.9               3.0               2.8
            Power    60.4              11.9              58.8              11.5              56.7              9.1
6           Size     4.6               4.6               4.4               4.5               3.9               3.6
            Power    74.9              15.1              74.2              13.8              73.9              12.6

Over all distributions
            Size     5.0               5.0               4.8               4.9               4.7               4.7
            Power    43.1              9.4               42.8              9.2               43.3              9.1
Table VI. Results for the analysis of variance of ratios of ephemeroptera to oligochaetes using three
different methods. The first involves randomizing observations; the second randomizing residuals
(ter Braak); and the third the conventional F-test for analysis of variance

                                                                    Significance level (%)
Source of variation           d.f.   SS        MS       F        Randomizing    Randomizing   Using
                                                                 observations   residuals     F distribution
Stream                        2      166.3     83.1     11.04    0.02           0.02          0.01
Position                      2      753.4     376.7    50.03    0.02           0.02          0.00
Season                        3      364.3     121.4    16.13    0.02           0.02          0.00
Stream × position             4      313.2     78.3     10.40    0.02           0.02          0.00
Stream × season               6      316.0     52.7     7.00     0.04           0.02          0.00
Position × season             6      724.5     120.8    16.04    0.02           0.02          0.00
Stream × position × season    12     640.5     53.4     7.09     0.02           0.02          0.00
Error                         72     542.1     7.5
Total                         107    3820.3

7. DISCUSSION
Whilst it is obviously not possible to draw definitive conclusions from the simulation of a few
specific situations, it is clear that all of the randomization methods have given rather similar
results for assessing the existence of factor effects except in very extreme situations. Therefore,
from a practical point of view it is not possible to say that one method of randomization is much
better than another.
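The three reallocation schemes compared in the simulations can be sketched as follows. This is a minimal illustration only, not the authors' program: it assumes a two-factor layout without replication held in a NumPy array with streams as rows and seasons as columns, and the function names are hypothetical.

```python
import numpy as np


def permute_observations(y, rng):
    """Free randomization: reallocate all observations at random
    among the factor combinations."""
    return rng.permutation(y.ravel()).reshape(y.shape)


def permute_residuals(y, rng):
    """Residual randomization in the spirit of ter Braak (1992):
    compute observation - stream mean - season mean + overall mean,
    then reallocate these residuals freely."""
    row_means = y.mean(axis=1, keepdims=True)   # stream means
    col_means = y.mean(axis=0, keepdims=True)   # season means
    resid = y - row_means - col_means + y.mean()
    return rng.permutation(resid.ravel()).reshape(y.shape)


def permute_within_seasons(y, rng):
    """Restricted randomization for testing the stream effect:
    observations move between streams, but only within a season."""
    out = y.copy()
    for j in range(y.shape[1]):
        out[:, j] = rng.permutation(y[:, j])
    return out
```

In each case the analysis of variance would be repeated on the reallocated values and the resulting F-value compared with the observed one. Testing the season effect by restricted randomization would shuffle within rows instead of within columns.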
One reason for this is that the use of F-statistics has the effect of making all of the
randomization distributions rather similar for most data. We discovered early in our study that
this does not occur if sums of squares are used instead of F-statistics. Indeed it became very clear
that using sums of squares does not give tests with good properties. This explains, for example,
why Edgington (1995, p. 133) found somewhat different p-values of 2.8 and 1.4 per cent for
testing a factor effect using a restricted randomization and freely randomizing observations with
a small artificial two-factor set of data. Using the F-statistic when freely randomizing
observations gives a p-value of 3.2 per cent, which is much closer to the 2.8 per cent for restricted
randomization. Interestingly, randomizing residuals by ter Braak's (1992) method does not seem
sensible for this set of artificial data because the residuals are all either 1 or −1 and the p-value
is found to be 8.7 per cent.
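The contrast between the two test statistics can be made concrete with a short sketch. It freely randomizes the observations of a two-factor layout without replication and returns the randomization p-value of the row (stream) effect twice, once based on the sum of squares and once based on the F-statistic. The function names, the default of 99 randomizations, and the seed argument are assumptions made for illustration. The code is not the program used for the simulations and does not reproduce Edgington's example.

```python
import numpy as np


def row_effect_statistics(y):
    """Return (sum of squares, F-statistic) for the row factor of a
    two-factor layout without replication."""
    a, b = y.shape
    grand = y.mean()
    row_means = y.mean(axis=1, keepdims=True)
    col_means = y.mean(axis=0, keepdims=True)
    ss_rows = b * np.sum((row_means - grand) ** 2)
    resid = y - row_means - col_means + grand
    ss_error = np.sum(resid ** 2)
    f_rows = (ss_rows / (a - 1)) / (ss_error / ((a - 1) * (b - 1)))
    return ss_rows, f_rows


def randomization_p_values(y, n_rand=99, seed=0):
    """p-values for the row effect from free randomization of observations.

    Each p-value is the proportion of values as large as or larger than
    the observed one in the set made up of the observed statistic plus
    the n_rand randomized statistics.
    """
    rng = np.random.default_rng(seed)
    obs_ss, obs_f = row_effect_statistics(y)
    count_ss = 1  # the observed value is a member of the reference set
    count_f = 1
    for _ in range(n_rand):
        shuffled = rng.permutation(y.ravel()).reshape(y.shape)
        ss, f = row_effect_statistics(shuffled)
        count_ss += int(ss >= obs_ss)
        count_f += int(f >= obs_f)
    return count_ss / (n_rand + 1), count_f / (n_rand + 1)
```

With 99 randomizations the smallest attainable p-value is 1 per cent. The two p-values can differ because the sum of squares ignores how much of the remaining variation is absorbed by the other factor, whereas the F-statistic scales the effect by the error mean square.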
REFERENCES
Crowley, P. H. (1992). `Resampling methods for computation-intensive data analysis in ecology and
evolution', Annual Review of Ecology and Systematics, 23, 405–447.
Edgington, E. S. (1995). Randomization Tests, 3rd edn., Marcel Dekker, New York.
Manly, B. F. J. (1991). Randomization and Monte Carlo Methods in Biology, Chapman and Hall, London.
Still, A. W. and White, A. P. (1981). `The approximate randomization test as an alternative to the F-test in
analysis of variance', British Journal of Mathematical and Statistical Psychology, 34, 243–252.
ter Braak, C. J. F. (1992). `Permutation versus bootstrap significance tests in multiple regression and
ANOVA', in Jöckel, K. H. (ed.), Bootstrapping and Related Techniques, Springer, Berlin, pp. 79–86.
Environmetrics, 9, 53±65 (1998)
ANALYSIS OF VARIANCE BY RANDOMIZATION
WITH SMALL DATA SETS
LILIANA GONZALEZ1* AND BRYAN F. J. MANLY2
1Finance
and Quantitative Analysis Department, University of Otago, PO Box 56, Dunedin, New Zealand
Department of Mathematics and Statistics, University of Otago, PO Box 56, Dunedin, New Zealand
2
SUMMARY
A simulation study has been carried out to compare the results from using dierent randomization methods
to assess the signi®cance of the F-statistics for factor eects with analysis of variance. Two-way and threeway designs with and without replication were considered, with the randomization of observations, the
restricted randomization of observations, and the randomization of dierent types of residuals. Data from
normal, uniform, exponential, and an empirical distribution were considered. It was found that usually all
methods of randomization gave similar results, as did the use of the usual F-distribution tables, and that no
method of analysis was clearly superior to the others under all conditions. # 1998 John Wiley & Sons, Ltd.
KEY WORDS
analysis of variance; computer intensive statistics; simulation; F-distribution; randomization
inference
1. INTRODUCTION
Randomization methods for testing hypotheses are useful in environmental areas because their
justi®cation is easy to understand by government ocials and the public at large, and they are
applicable with the non-normal distributions that often occur. However, as soon as the required
analysis becomes more than something as simple as the comparison of the mean values of several
samples the appropriate randomization procedure becomes questionable. There has been controversy with several types of randomization analysis but in this paper we only consider simple
factorial designs with analysis of variance. It was in this area that Crowley (1992) in his review of
resampling methods remarked that `contrasting views on interaction terms in factorial ANOVA
. . . needs to be resolved'.
In discussing the merits of randomization tests we start with one important premise. This is
that the major value of these tests is in situations where the data are grossly non-normal (e.g. with
several extremely large values and many tied values), and sample sizes are not large. It is in these
situations that more conventional methods are of questionable validity and the results of a
randomization test may carry more weight. This premise is important because it tells us that our
main concern should be with the performance of alternative methods on grossly non-normal data.
Theoretical or simulation studies suggesting that one method is somewhat better than another
with normally distributed data are not necessarily of much relevance.
* Correspondence to: Liliana Gonzalez, Finance and Quantitative Analysis Department, University of Otago,
PO Box 56, Dunedin, New Zealand. e-mail: [email protected]
CCC 1180±4009/98/010053±13$17.50
# 1998 John Wiley & Sons, Ltd.
Received 29 December 1995
Revised 13 June 1997
54
L. GONZALEZ AND B. F. J. MANLY
In this context we address two questions:
(a)
Should it be observations or residuals that are permuted and, if it is to be residuals, what
type of residuals should be used?
(b) Should restricted randomization be used where possible instead of a completely free
randomization?
Our method of approach involves looking at several quite speci®c situations and examining the
performance of alternative methods. Because the situations are speci®c there is no guarantee that
our conclusions apply to analysis of variance in general. Nevertheless, the results are clear enough
to suggest that this is the case.
In brief, our conclusions with respect to the testing of factor eects is that there is usually
similar performance with analysis of variance based on the usual F-distribution tables, the
randomization of observations, or the randomization of dierent types of residuals with normal
or moderately non-normal data. If anything, the use of F-distribution tables tends to give
slightly higher power than the other methods except with extremely non-normal data. On the
other hand, restricted randomization has guaranteed performance when the null hypothesis is
true, but has tendency to give low power to detect real eects under some conditions. Overall,
therefore, it is not possible to say that any method of analysis is clearly superior to the other
methods.
2. THE SIMULATED SITUATION
To set the scene for our study, we imagine that three streams are sampled in the four seasons of a
year and some environmentally important variable is measured. In Section 3 we assume that one
reading only is taken for each stream in each season. The sample design is then suitable for a twofactor analysis of variance without replication, with 3 4 12 observations. We generated
simulated data for analysis in four dierent ways:
(i) Twelve values were generated from a normal distribution and scaled to have a mean of
®ve and standard deviation of one. These values were then placed in a random order and
stream eects and seasonal eects were added to produce one set of data. This process
was repeated 1000 times for each of several combinations of stream eects and seasonal
eects.
(ii) The same process as described for (i) was used but with the original 12 values chosen
from a uniform distribution.
(iii) The same process as described for (i) was used but with the original 12 values chosen
from an exponential distribution.
(iv) The same process as described for (i) was used but with the original 12 values chosen
from an observed distribution of the ratio of counts of ephemeroptera to counts of
oligochaetes (a measure of pollution) in New Zealand streams, as shown in Table I.
Basically, the data generated by methods (i) and (ii) are quite `well-behaved', with no very extreme
values; the data generated by method (iii) are positively skewed with a few relatively large values;
and the data generated by method (iv) are like the worst type of data found in real life, with many
tied values that were originally zero before scaling to a mean of ®ve and a standard deviation of
one and several relatively large values.
ENVIRONMETRICS, VOL.
9, 53±65 (1998)
# 1998 John Wiley & Sons, Ltd.
ANALYSIS OF VARIANCE BY RANDOMIZATION
55
Table I. Ratios of numbers of ephemeroptera to numbers of oligochaetes in New Zealand streams. The
three ratios shown for each stream±position±season combination come from three independent samples
taken in the same area at the same time
Stream
Position
Summer
Autumn
A
Bottom
Middle
Top
0.2,
2.6,
0.7,
0.2,
0.7,
8.5,
0.0
0. 7
7.1
0.0,
0.3,
0.5,
B
Bottom
Middle
Top
0.0,
0.3,
1.2,
0.0,
0.2,
0.7,
0. 0
0.3
0. 8
0.0,
0.0,
10.7,
C
Bottom
Middle
Top
0.1, 0.2, 0.1
1.0, 2.6, 1.9
7.3, 10.4, 9.5
0.0,
0.6,
1.1,
0. 0
0.4
1. 4
Winter
Spring
0.0,
0.0,
0.1,
0.0,
0.0,
0.1,
0.0
0. 0
0. 2
0.0,
0.0,
0.2,
0.0,
0.0,
0.1,
0.0
0.0
0.4
0.0, 0.0
0.0, 0.0
19.9, 9.4
0.0,
0.0,
2.0,
0.0,
0.0,
6.3,
0. 0
0. 0
4.8
0.0,
0.0,
2.0,
0.0,
0.0,
1.6,
0.0
0.1
1.9
0.0, 0.0, 0.1
0.0, 0.0, 0.0
46.6, 20.3, 24.0
0.4,
0.1,
1.2,
0.0,
0.2,
0.8,
0. 3
0.0
6. 1
0.0,
0.0,
0.2,
0.0,
0.0,
0.1,
0.0
0.0
0.1
After generation, the F-values for variation between streams and variation between seasons
were obtained for each set of data from a two-factor analysis of variance. The p-values were then
calculated by four dierent methods:
(a)
The observations were randomized freely between the 12 factor combinations and the
analysis of variance was repeated. This was done 99 times and the p-value for a factor was
taken as the proportion of times that the corresponding observed F-value was equalled or
exceeded in the set of 100 F-values consisting of the observed one plus the 99 obtained by
randomization.
(b) The usual residuals for a two-way analysis of variance (observations 7 stream mean 7
season mean overall mean) were calculated, and these were then freely randomized
between the 12 factor combinations. This was repeated 99 times and p-values were determined as for (a). This is the method proposed by ter Braak (1992).
(c) To test the eect of streams a restricted randomization was used, as described by
Edgington (1995, Chapter 6). This involves producing alternative sets of data by
randomly permuting observations between streams within seasons only, and thereby
allowing for possible dierences between seasons. Similarly, to test the eect of seasons a
restricted randomization allowing observations to move between seasons but not between
streams was used. For each eect 99 restricted randomizations were carried out to produce
99 F-values for comparison with the observed F-value for that eect. The p-value for an
eect was then the proportion of F-values as large or larger than the observed one in the
set consisting of the observed F-value plus the 99 randomized values.
(d) The p-values for eects were determined by computing the probability of F-values as
large or larger than those observed from the usual F-distributions.
Method (a) was also used with tests based on the sums of squares of eects as proposed by Manly
(1991, Chapter 5). However the relatively poor performance of this approach when compared to
the others that were considered means that it is not worth discussing further.
The results of this simulation allow the comparison of the dierent testing procedures under
quite a wide range of conditions in terms of the type of data involved. In Sections 4 and 5 slightly
more complicated designs involving two factors with replication and the introduction of a third
# 1998 John Wiley & Sons, Ltd.
ENVIRONMETRICS, VOL.
9, 53±65 (1998)
56
L. GONZALEZ AND B. F. J. MANLY
factor for the position in the stream are also considered. These introduce the problem of testing
for interactions between factors.
3.
A TWO-FACTOR SAMPLING DESIGN WITHOUT REPLICATION
The two-factor sampling design was simulated as described in the previous section, with data
from four distributions (normal, uniform, exponential and empirical). Seven levels for the
combined eects of streams and seasons were used. Let A0 stand for no stream eects, A1 stand
for low stream eects, A2 stand for high stream eects, B0 stand for no seasonal eects, B1 stand
for low seasonal eects, and B2 stand for high seasonal eects. The seven levels are then A0 and
B0, A1 and B0, A2 and B0, A0 and B1, A0 and B2, A1 and B1, and A2 and B2, where, for
example A1 and B0 means that there was a low level of stream eects and no seasonal eects. To
apply the levels A1, A2, B1 and B2 the 12 data values had constants added to them after being
placed in an initial random order as described in the last section. For A1 stream 2 observations
were increased by 1.0 and stream 3 observations by 2.0; for A2 stream 2 observations were
increased by 2.0 and stream 3 observations by 4.0; for B1 season 2 observations were increased by
0.7, season 3 observations were increased by 1.3, and season 4 observations were increased by 2.0;
for B2 season 2 observations were increased by 1.3, season 3 observations were increased by 2.7,
and season 4 observations were increased by 4.0.
The results of the simulations are summarized in Table II in terms of the percentage of
signi®cant results for tests at the 5 per cent level of signi®cance. Cases where the null hypothesis is
true and the observed result is signi®cantly dierent from 5 per cent with a test at the 5 per cent
level (i.e. the result is outside the range 3.6 to 6.4 per cent for single values, and outside the range
4.4 to 5.6 per cent of average size) are underlined once. Cases where the null hypothesis was not
true and a test gave the fewest signi®cant results (i.e. had relatively low observed power) are
underlined twice. The average size is the mean percentage of signi®cant results when the null
hypothesis was true. The average power is the mean percentage of signi®cant results when the null
hypothesis was not true. It is desirable for this to be large.
Overall the testing procedures have given fairly similar results. However the following
tendencies can be noted:
1. Freely randomizing observations usually gave good power and did relatively well in terms
of maintaining an average size close to 5 per cent when the null hypothesis was true.
2. Freely randomizing residuals as proposed by ter Braak (1992) did slightly worse than freely
randomizing observations with respect to the percentage of signi®cant results when the null
hypothesis was true and with respect to power when the null hypothesis was not true.
3. Restricted randomization as proposed by Edgington (1995) gave about 5 per cent signi®cant results when the null hypothesis was true, except with the empirical data, but often
had the lowest average power of all the procedures.
4. Use of F-tables for testing gave about the correct proportion of signi®cant results when
the null hypothesis was true and had the highest power of all the tests, except with the
empirical data.
If anything the simulation results suggest that the use of F-tables is better than any of the
randomization methods, except with extremely non-normal data, in which case freely randomizing observations gave the best results. However it should be noted that only 99 randomizations
were used with the randomization tests to keep computational times within reasonable limits.
ENVIRONMETRICS, VOL.
9, 53±65 (1998)
# 1998 John Wiley & Sons, Ltd.
ANALYSIS OF VARIANCE BY RANDOMIZATION
57
Table II. Results from the simulation experiment with the two-factor sampling design without replication.
The tabulated values are the percentages of signi®cant results with tests at the 5 per cent level (F1, test for a
stream eect; F2, test for a seasonal test). Single underlined values are those where the null hypothesis is
true but the percentage of signi®cant results is signi®cantly dierent from 5 per cent (i.e. the result is outside
the range 3.6 to 6.4 per cent for single values and outside the range 4.4 to 5.6 per cent for average size), for a
test at the 5 per cent level. Double underlined values are those where the test produced the fewest signi®cant
results when the null hypothesis was not true. The average size is for the cases where the null hypothesis was
true so that the 5 per cent is the desired value. The average power is for cases where the null hypothesis was
not true so that high values are desirable
Eects added to data
Normal data
1. None
2. Low for streams
3. High for streams
4. Low for seasons
5. High for seasons
6. Low for both
7. High for both
Average: size
power
Uniform data
1. None
2. Low for streams
3. High for streams
4. Low for seasons
5. High for seasons
6. Low for both
7. High for both
Average: size
power
Exponential data
1. None
2. Low for streams
3. High for streams
4. Low for seasons
5. High for seasons
6. Low for both
7. High for both
Average: size
power
Empirical data
1. None
2. Low for streams
3. High for streams
4. Low for seasons
5. High for seasons
6. Low for both
7. High for both
Average: size
power
Observations
randomized
F1
F2
ter Braak
method
F1
F2
4.8
36.6
97.1
4.1
4.9
37.5
97.7
3.9
5.2
3.4
23.3
80.3
24.1
82.1
4.3
35.6
98.0
3.5
4.7
36.2
97.1
5.4
5.2
4.9
22.8
80.8
24.1
80.1
6.2
36.7
98.2
5.3
6.6
36.4
97.0
5.6
5.2
4.8
22.7
82.1
21.3
83.3
2.7
35.8
97.2
2.5
3.8
34.1
98.2
2.5
1.9
1.9
13.9
31.2
9.2
26.6
0.9
7.5
40.9
0.2
0.3
9.7
40.7
4.9
36.7
97.3
5.3
6.8
37.0
98.1
4.8
37.1
97.0
3.3
4.0
37.1
97.9
2.9
22.2
47.9
4.1
1.7
17.3
42.9
# 1998 John Wiley & Sons, Ltd.
4. 4
59.8
5. 4
59.6
4. 6
59.8
2.3
26.4
4.1
59.3
5.6
59.2
4.6
58.8
1.3
18.9
Edgington
method
F1
F2
2.7
5. 1
4. 2
23.4
81.4
21.4
81.6
5.7
33.9
93.9
4. 2
5.3
34.3
95.1
5. 6
4. 6
5.0
23.9
79.0
22.2
80.1
4.6
33.9
96.3
3.5
4.9
31.6
95.0
6. 4
5.8
6.1
23.4
81.4
19.8
80.3
4. 7
33.6
94.7
4. 2
5.1
32.4
94.5
2.3
1.8
2.3
4. 2
21.5
3. 9
23.0
0.0
17.9
42.7
0.1
0.0
17.8
41.7
4.8
58.0
4.6
58.2
5.0
57.1
0.3
26.5
Use of
F-tables
F1
F2
3. 8
5.5
4.3
23.6
81.2
22.3
79.6
4.9
39.4
100.0
4. 0
5. 0
39.1
100.0
5. 0
4.3
5.3
24.6
79.7
23.2
81.5
7.1
41.0
100.0
5. 5
7.4
39.0
100.0
5.4
5.5
5.2
20.4
80.7
20.0
80.1
3. 7
38.2
99.9
3.5
4.5
37.6
100.0
0.7
0.3
0.9
12.9
33.5
11.6
33.8
0.0
7.0
40.7
0.0
0.0
8.3
38.7
ENVIRONMETRICS, VOL.
4. 4
62.5
6.0
63.2
4.9
62.2
1.2
17.5
3.0
5.6
4.1
24.9
86.5
23.6
86.1
5. 4
5.2
5.5
28.0
86.1
26.0
85.5
6.3
5.9
5.5
24.2
88.2
22.3
87.4
2.6
1.9
2.5
2.9
18.6
2.6
21.1
9, 53±65 (1998)
58
L. GONZALEZ AND B. F. J. MANLY
The use of more than 99 randomizations can be expected to increase the power of the randomization tests to some extent, and hence reduce the apparent superiority of the use of F-tables in
this respect. Some limited extra simulations that were carried out to investigate this show that
using 999 rather than 99 randomizations can increase power by up to about 2 per cent. Therefore
overall it is fair to say that in the context of this example none of the testing methods has proved
to be clearly superior to the others but randomizing observations seems slightly better than the
alternatives in terms of maintaining a size close to 5 per cent and having good power.
4.
A TWO-FACTOR SAMPLING DESIGN WITH REPLICATION
The second design that we have considered is the same as the ®rst except that replicate observations were considered for each of the 12 stream±season combinations. Situations were simulated
with two, four and six replicates, using the same four types of data (normal, uniform, exponential
and empirical) as for the design without replicates. However, stream (A) and season (B) eects
were halved to take into account to some extent the extra power for detecting eects that was
introduced by the replication. In addition, interactions were introduced at a low level (AB1) by
adding 0.5 to all observations in stream 3 and season 3, and 1.0 to all observations in stream 3 in
season 4, and interactions at a high level (AB2) by adding twice these increments. The values used
for the empirical distribution are shown in Table I.
Nine dierent scenarios were simulated for the sample design: A0 B0 AB0 (no eects);
A1 B0 AB0 (low stream eects); A2 B0 AB0 (high stream eects); A0 B1 AB0
(low seasonal eects); A0 B2 AB0 (high seasonal eects); A1 B1 AB0 (low stream and
low seasonal eects); A2 B2 AB0 (high stream and high seasonal eects); A1 B1 AB1
(low stream eects, low seasonal eects and low interaction); and A2 B2 AB2 (high stream
eects, high seasonal eects and high interaction).
Tests of main eects were conducted using the methods (a) to (d) described in Section 2. In
addition, the stream by season interaction was tested by the following four methods.
(e)
The observed F-value for the interaction from the analysis of variance was tested by
comparison with the distribution consisting of itself plus 99 alternative values obtained
by freely randomizing observations. An interaction was then considered to be signi®cant
at the 5 per cent level if the observed F-value was one of the largest in the set of 100.
(f ) This was similar to (e) except that the randomized F-values were obtained by freely reallocating residuals calculated as dierences between observations and the means from the
stream±season combinations to which they belonged, as proposed by ter Braak (1992).
(g) This was similar to (e) except that residuals were calculated as deviations from the analysis
of variance model without interactions. That is to say, the residuals were calculated
as (observation 7 stream mean 7 season mean overall mean). This approach was
suggested by Still and White (1981).
(h) The observed interaction F-value was tested using the F-distribution in the usual way.
Because the output from the simulations in terms of percentages of signi®cant results is very
extensive, we only present part of this and a summary of the remainder. Table III shows the percentages signi®cant with four replicates. As for Table II, percentages are underlined once when
they are signi®cantly dierent from 5 per cent and the null hypothesis is true, and underlined
twice when they are the lowest percentage for the alternative methods of testing the particular
eects.
ENVIRONMETRICS, VOL.
9, 53±65 (1998)
# 1998 John Wiley & Sons, Ltd.
Eects added to data
Normal data
1. None
2. Low for streams
3. High for streams
4. Low for seasons
5. High for seasons
6. Low for both
7. High for both
8. Low with interaction
9. High with interaction
Average: size
power
ENVIRONMETRICS, VOL.
Uniform data
1. None
2. Low for streams
3. High for streams
4. Low for seasons
5. High for seasons
6. Low for both
7. High for both
8. Low with interaction
9. High with interaction
Average: size
power
(e)
F12
ter Braak
method
F1
F2
(f)
F12
Edgington
method
F1
F2
(g)
F12
Use of
F-tables
F1
F2
(h)
F12
5.6
63.0
100.0
5.0
3.4
64.3
99.9
91.9
100.0
4. 4
4.9
5. 0
49.2
99.5
46.7
99.1
78.2
100.0
4.7
82.7
5.0
4.4
6.5
5.7
4.4
6.0
4.2
9.6
40.4
5.2
25.0
5. 0
63.8
99.9
6.1
4. 1
63.2
100.0
92.1
100.0
4.9
4.5
4. 3
49.0
99.0
45.4
99.1
78.7
100.0
4. 8
82.5
5.4
4.8
6.5
5. 8
5. 5
5. 5
4. 3
10.0
39.8
5.4
24.9
4.8
65.3
100.0
5.2
3.6
63.6
99.9
90.6
100.0
4.6
4.9
4.7
49.5
99.4
46.6
99.2
76.0
100.0
4.6
82.5
4. 8
4. 2
6.6
5. 3
5. 3
5. 1
3.7
9.4
40.7
5. 0
25.1
5.4
65.2
100.0
5. 4
3.5
65.8
99.9
92.8
100.0
4.7
4.9
4.1
50.2
99.3
48.5
99.3
79.7
100.0
4. 7
83.4
5. 4
4.6
6.0
5.8
4.7
5.6
4.1
9.6
41.9
5.2
25.8
4.5
66.9
99.9
4.1
4.5
60.4
99.9
90.6
100.0
5. 3
5.5
4.8
45.1
99.8
49.4
99.4
78.2
100.0
4.8
82.5
4.6
5.4
4.5
5.6
5.3
4.9
6.0
13.3
37.8
5.2
25.6
4.8
66.7
100.0
4.4
4.4
60.9
100.0
92.1
100.0
4.6
4. 7
4.9
43.9
100.0
47.0
99.3
79.0
100.0
4. 6
82.4
5.1
5. 5
3.8
5. 5
6.7
4. 8
6. 2
14.3
37.8
5.4
26.1
4. 1
65.9
100.0
3.9
4.4
61.9
99.9
91.6
100.0
4.6
5.3
4.7
44.1
99.9
48.3
99.7
77.7
100.0
4.5
82.4
4.8
6.0
4. 2
5. 6
5. 2
4. 3
6.6
14.0
38.6
5. 2
26.3
4.4
69.2
100.0
3. 9
4.4
62.4
99.9
93.1
100.0
5.0
4.9
4.9
46.7
100.0
48.9
99.6
80.2
100.0
4. 6
83.3
4. 8
5. 7
3.8
5.6
5.6
4.6
6.3
14.1
39.1
5.2
26.6
6.1
64.0
100.0
4.0
6. 1
3.8
4. 2
46.4
5.0
4.9
5.7
3.6
5.6
64.4
100.0
4.9
5.6
4.0
4. 8
46.1
5.0
4.7
6. 4
3. 6
5. 6
65.3
100.0
5.2
5.6
3.9
4.7
48.1
5.9
4. 7
6.8
3.4
5.7
5.9
5. 1
66.2
3.8
4.3
100.0
4.6
6.0
4. 3
47.6
3.3
Table continued on next page
59
9, 53±65 (1998)
Exponential data
1. None
2. Low for streams
3. High for streams
4. Low for seasons
Observations
randomized
F1
F2
ANALYSIS OF VARIANCE BY RANDOMIZATION
# 1998 John Wiley & Sons, Ltd.
Table III. Results from the simulation experiment with the two-factor sampling design with four replicates. The tabulated values are the
percentages of signi®cant results with tests at the 5 per cent level (F1, test for a stream eect; F2, test for a seasonal eect; F12, test for the
interaction eect). Single underlined and double underlined values are as de®ned in Table II. Tests on interactions (e) to (h) are as described in
Section 4. The average size and average power are as de®ned in Table II
60
ENVIRONMETRICS, VOL.
Table continued.
Eects added to data
9, 53±65 (1998)
5.
6.
7.
8.
9.
High for seasons
Low for both
High for both
Low with interaction
High with interaction
Average: size
power
(e)
F12
4.6
65.6
100.0
92.3
100.0
99.2
47.8
99.2
78.2
100.0
4. 8
82.7
4.9
2.9
4.5
11.0
39.9
4. 5
25.5
4.9
68.9
100.0
4.7
3.9
67.5
100.0
95.0
100.0
3. 9
5.0
3.5
54.2
99.8
55.0
99.7
79.9
100.0
4.3
85.0
4. 1
5.1
3.3
4.5
3.2
4.0
3.5
8.8
36.7
4.0
22.8
ter Braak
method
F1
F2
(f)
Edgington
method
F1
F2
(g)
F12
5.3
64.7
99.9
91.1
100.0
99.2
48.2
98.8
78.3
100.0
5. 0
82.6
4.3
2.6
5.5
10.7
39.6
4.6
25.2
5.0
65.7
100.0
91.2
100.0
99.3
50.2
99.1
77.0
100.0
5.0
83.0
4. 9
3.0
5. 5
10.5
39.9
4. 9
25.2
4.8
66.9
100.0
93.2
100.0
99.6
48.8
99.6
78.9
100.0
4.9
83.4
4.1
2.6
4.7
10.3
41.0
4.3
25.7
4. 0
68.7
100.0
4. 2
4. 7
66.9
100.0
96.2
100.0
3.7
4.9
3.9
52.8
99.8
56.1
99.7
82.2
100.0
4.2
85.2
4.5
4. 5
3. 9
4. 3
3. 9
4. 1
4.2
10.5
44.9
4.2
27.7
4.5
69.0
100.0
5.7
4.9
67.2
100.0
94.3
100.0
4.3
5.4
5.1
55.7
99.5
58.0
100.0
79.9
100.0
5.0
85.3
4. 3
4. 7
4.0
4. 0
3.8
3.9
4. 2
10.3
41.9
4.1
26.1
3.2
66.2
100.0
3.2
3.4
64.5
100.0
95.0
100.0
3.0
3.6
2.9
49.8
99.9
51.5
99.9
80.0
100.0
3.2
83.9
3.1
2.6
2.6
3.1
2.0
2.6
3.0
7.8
38.2
2.7
23.0
F12
Use of
F-tables
F1
F2
(h)
F12
L. GONZALEZ AND B. F. J. MANLY
Empirical data
1. None
2. Low for streams
3. High for streams
4. Low for seasons
5. High for seasons
6. Low for both
7. High for both
8. Low with interaction
9. High with interaction
Average: size
power
Observations
randomized
F1
F2
# 1998 John Wiley & Sons, Ltd.
ANALYSIS OF VARIANCE BY RANDOMIZATION
61
A summary of all the simulation results is provided in Table IV in terms of average size
and power, with the results for the design without replication included for completeness. From
these results it is apparent that overall the dierences between the randomization methods are
relatively small.
5.
A THREE-FACTOR SAMPLING DESIGN
The third design that we have considered is where samples are taken at three depths (low, medium
and high), in three rivers, for four seasons, with either no replication, two replicates, four
replicates, or six replicates for each of the 3 3 4 36 factor combinations. Data for analysis
were again generated by starting with a ®xed set of values, allocating them in a random order to
the factor combinations and then adding factor eects at various levels. The initial ®xed sets of
values were also chosen as before from normal, uniform, exponential and empirical distributions,
and coded to have a mean of ®ve and a standard deviation of one. The empirical distribution was
the same one as was used for the two-factor simulations (Table I).
The signi®cance levels for F-statistics from analysis of variance were determined by three
methods for this design. The ®rst method involved comparing each F-statistic with the distribution obtained for the same statistic by randomly permuting the original data values and redoing
the analysis of variance. In this case the reference distribution consisted of 99 randomized
F-statistics plus the original statistic. Signi®cance at the 5 per cent level was then obtained if the
F-statistic for the original data was among the largest ®ve of the 100 statistics. The second method
used ter Braak's (1992) approach of randomizing residuals instead of observations, but was
otherwise similar to the ®rst method. The residuals in this case were from a model without the
three-factor interaction when there was no replication, and for a model with the three-factor
interaction when there was replication. Finally, the third method involved comparing the
F-statistics with the F-distribution in the conventional manner.
Rather than devoting a good deal of space to de®ning the factor eects that we used and
presenting many tables of results, we have just provided an overall summary in Table V. Our
conclusions from these three-factor simulations is essentially the same as for the two-factor
simulations. For normal or moderately non-normal data the use of F-tables is quite satisfactory,
and tends to give slightly higher power than randomization. However with the grossly nonnormal empirical data the randomization of observations gave the best performance in terms of
power and size.
6. EXAMPLE
As an example we consider the ratios of species numbers in different streams, seasons, and
positions in streams that are shown in Table I. The question to be considered is whether there is
any evidence that the E/O ratio varied with the stream (A, B, C), the position within stream
(bottom, middle, top), or the season of the year (summer, autumn, winter, spring).
An analysis of variance on the ratios gives the results shown in Table VI. Here the F-ratios all
have the `error' mean square as the denominator, because fixed effects are being assumed. For all
the methods of testing, the ratios are very highly significant, corresponding to probability levels
of 0.0001 or less when compared with tables. If there had been any discrepancies then our
simulations suggest that in this situation the randomization of observations gives the most reliable
results.
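The arithmetic behind the F-ratios in Table VI can be checked directly from the tabulated sums of squares and degrees of freedom. The short sketch below does only that; the dictionary layout and variable names are illustrative choices, and the numbers are copied from Table VI.

```python
# Sums of squares and degrees of freedom as listed in Table VI.
table_vi = {
    "Stream": (166.3, 2),
    "Position": (753.4, 2),
    "Season": (364.3, 3),
    "Stream x position": (313.2, 4),
    "Stream x season": (316.0, 6),
    "Position x season": (724.5, 6),
    "Stream x position x season": (640.5, 12),
}
ss_error, df_error = 542.1, 72

# Fixed effects are assumed, so every F-ratio uses the error mean square.
ms_error = ss_error / df_error

f_ratios = {}
for source, (ss, df) in table_vi.items():
    ms = ss / df
    f_ratios[source] = ms / ms_error
    print(f"{source:28s} MS = {ms:6.1f}  F = {f_ratios[source]:5.2f}")

# The component sums of squares should add up to the total.
ss_total = sum(ss for ss, _ in table_vi.values()) + ss_error
print(f"Total SS = {ss_total:.1f}")
```

Each ratio is the mean square for the source divided by the error mean square of 542.1/72, which is why all seven tests share the same denominator.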
Table IV. Summary of the simulation results for a two-factor design in terms of the average size and the average power as defined for Table II,
with the averages calculated separately for stream and season effects (F1 and F2) and interaction (F12). Single underlined values are those where
the average size is for the cases where the null hypothesis was true but the percentage of significant results is significantly different from 5 per cent
(i.e. the result is outside the range 4.4 to 5.6 per cent for average size, and outside the range 4.9 to 5.1 per cent for overall results). Double
underlining is as defined in Table II

                       Observations      ter Braak         Edgington         Use of
                       randomized (e)    method (f)        method (g)        F-tables (h)
Number of   Average    F1 and F2    F12  F1 and F2    F12  F1 and F2    F12  F1 and F2    F12
replicates  for

Normal data
1           Size             4.4      –        4.1      –        4.8      –        4.4      –
            Power           59.8      –       59.3      –       58.0      –       62.5      –
2           Size             5.0    5.1        4.9    5.2        5.1    4.9        4.8    5.0
            Power           57.0   10.8       57.2   11.3       59.9   11.7       58.8   11.6
4           Size             4.7    5.2        4.8    5.4        4.6    5.0        4.7    5.2
            Power           82.7   25.0       82.5   24.9       82.5   25.1       83.4   25.8
6           Size             5.1    5.1        4.7    4.9        5.0    5.1        4.8    5.0
            Power           92.2   39.2       92.1   39.4       92.1   39.3       92.7   40.9

Uniform data
1           Size             5.4      –        5.6      –        4.6      –        6.0      –
            Power           59.6      –       59.2      –       58.2      –       63.2      –
2           Size             4.7    5.2        5.2    5.7        4.6    5.4        5.1    5.3
            Power           57.0   10.1       56.8   10.6       59.4   10.2       58.5   10.8
4           Size             4.8    5.2        4.6    5.4        4.5    5.2        4.6    5.2
            Power           82.5   25.6       82.4   26.1       82.4   26.3       83.3   26.6
6           Size             5.2    5.3        5.2    5.2        4.8    5.1        5.2    5.1
            Power           92.5   40.0       92.5   40.3       92.3   41.4       93.1   42.0

Exponential data
1           Size             4.6      –        4.6      –        5.0      –        4.9      –
            Power           59.8      –       58.8      –       57.1      –       62.2      –
2           Size             5.7    6.2        6.4    7.9        5.2    6.2        5.9    6.9
            Power           56.9   11.0       57.6   13.9       60.5   11.2       58.3   12.1
4           Size             4.8    4.5        5.0    4.6        5.0    4.9        4.9    4.3
            Power           82.7   25.5       82.6   25.2       83.0   25.2       83.4   25.7
6           Size             5.4    4.7        5.2    4.5        5.6    4.8        4.9    4.2
            Power           92.4   38.8       92.5   39.7       92.4   40.1       92.7   40.2

Empirical data
1           Size             2.3      –        1.3      –        0.3      –        1.2      –
            Power           26.4      –       18.9      –       26.5      –       17.5      –
2           Size             4.9    4.1        4.9    4.4        5.4    4.4        4.4    4.3
            Power           56.3    6.7       60.4    7.9       63.7    6.7       59.0    5.7
4           Size             4.3    4.0        4.2    4.2        5.0    4.1        3.2    2.7
            Power           85.0   22.8       85.2   27.7       85.3   26.1       83.9   23.0
6           Size             3.9    3.7        3.7    3.7        4.8    4.2        2.0    2.0
            Power           93.1   38.7       93.4   45.2       93.3   45.5       91.9   38.6

Over all distributions
            Size             4.7    4.9        4.7    5.1        4.6    4.9        4.4    4.6
            Power           71.0   24.5       70.7   26.0       71.7   25.7       71.5   25.3
Table V. Summary of the results obtained from simulations with the three-factor sampling design. Overall
results are given for tests on the main effects of the position in the stream, the stream, and the season, and
the three interactions of position × stream, position × season, and stream × season. For each effect, the
average size is the mean percentage of significant results when the null hypothesis was true, and the average
power is the mean percentage of significant results when the null hypothesis was not true. All tests were at
the 5 per cent level. Single underlined values are those where the average size is for the cases where the null
hypothesis was true but the percentage of significant results is significantly different from 5 per cent (i.e. the
result is outside the range 4.4 to 5.6 per cent for average size, and outside the range 4.91 to 5.09 per cent for
overall results). Double underlining is as defined in Table II

                         Observations        ter Braak           Use of
                         randomized          method              F-table
Number of   Average      F1, F2  F12, F13    F1, F2  F12, F13    F1, F2  F12, F13
replicates  for          and F3   and F23    and F3   and F23    and F3   and F23

Normal distribution
1           Size            5.1       5.3       5.2       5.4       5.4       5.5
            Power          13.4       5.3      13.9       5.3      14.2       5.3
2           Size            5.1       5.0       4.9       5.1       4.9       5.0
            Power          27.9       6.5      28.1       6.5      28.9       6.7
4           Size            5.2       5.1       5.1       5.2       5.1       5.1
            Power          55.2       9.4      55.2       9.1      56.6       9.9
6           Size            5.0       5.1       5.0       5.0       4.9       5.0
            Power          72.3      14.0      72.3      13.8      73.8      14.3

Uniform distribution
1           Size            5.4       4.9       5.5       5.2       5.7       5.3
            Power          13.6       4.6      13.8       4.9      14.4       5.1
2           Size            4.8       5.0       4.8       5.1       4.9       5.2
            Power          28.5       7.6      28.6       7.7      29.3       8.1
4           Size            5.1       5.0       5.2       5.1       5.2       5.1
            Power          55.7      10.3      55.7      10.1      56.9      10.5
6           Size            4.9       5.3       4.8       5.3       4.8       5.4
            Power          72.8      14.1      72.7      14.1      73.9      14.6

Exponential distribution
1           Size            4.8       5.0       4.7       4.7       4.8       4.9
            Power          13.7       5.5      13.2       5.8      13.7       5.7
2           Size            5.0       4.9       4.7       4.9       4.8       4.8
            Power          28.8       7.4      28.8       7.0      29.1       7.2
4           Size            5.1       5.0       5.1       4.9       4.9       4.8
            Power          55.7      10.8      55.4      10.5      56.4      10.4
6           Size            5.1       4.8       5.1       4.8       5.0       4.8
            Power          72.4      14.5      72.5      14.8      73.7      14.7

Empirical distribution
1           Size            4.9       4.8       3.8       3.4       3.7       3.4
            Power          14.8       6.1      12.4       4.7      12.7       4.7
2           Size            5.0       5.3       4.9       5.8       4.6       5.3
            Power          29.2       7.7      28.7       7.8      28.7       7.4
4           Size            4.9       4.6       3.9       3.9       3.0       2.8
            Power          60.4      11.9      58.8      11.5      56.7       9.1
6           Size            4.6       4.6       4.4       4.5       3.9       3.6
            Power          74.9      15.1      74.2      13.8      73.9      12.6

Over all distributions
            Size            5.0       5.0       4.8       4.9       4.7       4.7
            Power          43.1       9.4      42.8       9.2      43.3       9.1
Table VI. Results for the analysis of variance of ratios of ephemeroptera to oligochaetes using three
different methods. The first involves randomizing observations; the second randomizing residuals
(ter Braak); and the third the conventional F-test for analysis of variance

                                                                       Significance level (%)
Source of variation            d.f.       SS      MS       F    Randomizing    Randomizing          Using
                                                                observations      residuals F distribution
Stream                            2    166.3    83.1   11.04           0.02           0.02           0.01
Position                          2    753.4   376.7   50.03           0.02           0.02           0.00
Season                            3    364.3   121.4   16.13           0.02           0.02           0.00
Stream × position                 4    313.2    78.3   10.40           0.02           0.02           0.00
Stream × season                   6    316.0    52.7    7.00           0.04           0.02           0.00
Position × season                 6    724.5   120.8   16.04           0.02           0.02           0.00
Stream × position × season       12    640.5    53.4    7.09           0.02           0.02           0.00
Error                            72    542.1     7.5
Total                           107   3820.3

7. DISCUSSION
Whilst it is obviously not possible to draw definitive conclusions from the simulation of a few
specific situations, it is clear that all of the randomization methods have given rather similar
results for assessing the existence of factor effects except in very extreme situations. Therefore,
from a practical point of view it is not possible to say that one method of randomization is much
better than another.
One reason for this is that the use of F-statistics has the effect of making all of the
randomization distributions rather similar for most data. We discovered early in our study that
this does not occur if sums of squares are used instead of F-statistics. Indeed it became very clear
that using sums of squares does not give tests with good properties. This explains, for example,
why Edgington (1995, p. 133) found somewhat different p-values of 2.8 and 1.4 per cent for
testing a factor effect using a restricted randomization and freely randomizing observations with
a small artificial two-factor set of data. Using the F-statistic when freely randomizing
observations gives a p-value of 3.2 per cent, which is much closer to the 2.8 per cent for restricted
randomization. Interestingly, randomizing residuals by ter Braak's (1992) method does not seem
sensible for this set of artificial data because the residuals are all either 1 or −1 and the p-value
is found to be 8.7 per cent.
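The difference between using a sum of squares and using an F-statistic as the test statistic can be illustrated with a small sketch. This is not Edgington's example, whose data are not reproduced here; it uses synthetic data from an assumed balanced two-factor design with replication, in which the second factor has a large effect, and it freely randomizes the observations. The function names and the data are hypothetical.

```python
import numpy as np


def factor_a_stats(y):
    """Return (SS_A, F_A) for factor A in a balanced two-factor layout.

    y has shape (a, b, n) with n >= 2 replicates per cell; the F-ratio
    uses the within-cell mean square as its denominator.
    """
    a, b, n = y.shape
    grand = y.mean()
    ss_a = b * n * np.sum((y.mean(axis=(1, 2)) - grand) ** 2)
    ss_error = np.sum((y - y.mean(axis=2, keepdims=True)) ** 2)
    f_a = (ss_a / (a - 1)) / (ss_error / (a * b * (n - 1)))
    return ss_a, f_a


def compare_statistics(y, n_random=999, seed=0):
    """p-values for factor A from free randomization, using SS_A and F_A."""
    rng = np.random.default_rng(seed)
    obs = np.array(factor_a_stats(y))
    rand = np.array([
        factor_a_stats(rng.permutation(y.ravel()).reshape(y.shape))
        for _ in range(n_random)
    ])
    # Column 0 holds the sums of squares, column 1 the F-statistics.
    counts = 1 + np.sum(rand >= obs, axis=0)
    p_ss, p_f = counts / (n_random + 1)
    return p_ss, p_f


# Synthetic data: a modest factor A effect and a large factor B effect.
rng = np.random.default_rng(42)
effect_a = np.array([0.0, 1.0, 2.0]).reshape(3, 1, 1)
effect_b = np.array([0.0, 10.0, 20.0]).reshape(1, 3, 1)
y = rng.normal(size=(3, 3, 4)) + effect_a + effect_b
p_ss, p_f = compare_statistics(y)
print(f"p-value using SS_A: {p_ss:.3f}, using F_A: {p_f:.3f}")
```

In this synthetic case the two statistics need not rank the observed data in the same way, because free randomization mixes the factor B variation into the randomized values of SS_A, whereas the F-ratio rescales each randomized data set by its own error mean square.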
REFERENCES
Crowley, P. H. (1992). `Resampling methods for computation-intensive data analysis in ecology and
evolution', Annual Review of Ecology and Systematics, 23, 405–447.
Edgington, E. S. (1995). Randomization Tests, 3rd edn., Marcel Dekker, New York.
Manly, B. F. J. (1991). Randomization and Monte Carlo Methods in Biology, Chapman and Hall, London.
Still, A. W. and White, A. P. (1981). `The approximate randomization test as an alternative to the F-test in
analysis of variance', British Journal of Mathematical and Statistical Psychology, 34, 243–252.
ter Braak, C. J. F. (1992). `Permutation versus bootstrap significance tests in multiple regression and
ANOVA', in Jockel, K. H. (ed.), Bootstrapping and Related Techniques, Springer, Berlin, pp. 79–86.