13.6 Multiple Comparisons
The analysis of variance is a powerful procedure for testing the homogeneity of
a set of means. However, if we reject the null hypothesis and accept the stated alternative—that the means are not all equal—we still do not know which of the population means are equal and which are different.
524 Chapter 13 One-Factor Experiments: General
[Figure 13.4: A set of orthogonal contrasts. SAS PROC GLM printout for the dependent variable moisture: analysis-of-variance table, Type I and Type III sums of squares for aggregate, and the contrast (1,2,3,5) vs. 4.]
Often it is of interest to make several (perhaps all possible) paired comparisons among the treatments. Actually, a paired comparison may be viewed as a simple contrast, namely, a test of
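$$H_0\colon \mu_i - \mu_j = 0,$$
$$H_1\colon \mu_i - \mu_j \neq 0,$$
for all $i \neq j$. Paired comparisons among the means can be beneficial when particular complex contrasts are not known a priori. For example, in the aggregate data of Table 13.1, suppose that we wish to test
$$H_0\colon \mu_1 - \mu_5 = 0,$$
$$H_1\colon \mu_1 - \mu_5 \neq 0.$$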
The test is developed through use of an F, t, or confidence interval approach. Using t, we have
$$t = \frac{\bar{y}_{1.} - \bar{y}_{5.}}{s\sqrt{2/n}},$$
where $s$ is the square root of the mean square error and $n = 6$ is the sample size per treatment. In this case,
$$t = \frac{553.33 - 610.67}{\sqrt{4961}\,\sqrt{1/3}} = -1.41.$$
The P-value for the t-test with 25 degrees of freedom is 0.17. Thus, there is not sufficient evidence to reject $H_0$.
Relationship between T and F
In the foregoing, we displayed the use of a pooled t-test along the lines of that discussed in Chapter 10. The pooled estimate was taken from the mean square error in order to enjoy the degrees of freedom that are pooled across all five samples. In addition, we have tested a contrast. The reader should note that if the t-value is squared, the result is exactly of the same form as the value of f for a test on a contrast, discussed in the preceding section. In fact,
$$f = \frac{(\bar{y}_{1.} - \bar{y}_{5.})^2}{s^2(1/6 + 1/6)} = \frac{(553.33 - 610.67)^2}{4961(1/3)} = 1.988,$$
which, of course, is $t^2$.
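The relationship between the two statistics can be checked numerically. The following is a minimal Python sketch using only the summary values quoted in the text (the two sample means, the mean square error 4961, and n = 6); the function names `pooled_t` and `contrast_f` are illustrative and not part of the original text.

```python
import math

# Summary values quoted in the text for aggregates 1 and 5.
ybar_1, ybar_5 = 553.33, 610.67
mse = 4961.0   # mean square error s^2 (25 degrees of freedom)
n = 6          # observations per treatment


def pooled_t(mean_a, mean_b, mse, n):
    """t statistic for a paired comparison using the pooled ANOVA error."""
    return (mean_a - mean_b) / math.sqrt(mse * 2.0 / n)


def contrast_f(mean_a, mean_b, mse, n):
    """f statistic for the same simple contrast (1 numerator d.f.)."""
    return (mean_a - mean_b) ** 2 / (mse * (1.0 / n + 1.0 / n))


t = pooled_t(ybar_1, ybar_5, mse, n)
f = contrast_f(ybar_1, ybar_5, mse, n)
print(round(t, 2), round(f, 3))   # -1.41 1.988
print(math.isclose(t * t, f))     # True: squaring t reproduces f
```

The P-value is not computed here because the standard library has no t-distribution; a statistics package would be needed for that step.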
Confidence Interval Approach to a Paired Comparison
It is straightforward to solve the same problem of a paired comparison (or a contrast) using a confidence interval approach. Clearly, if we compute a $100(1 - \alpha)\%$ confidence interval on $\mu_1 - \mu_5$, we have
$$\bar{y}_{1.} - \bar{y}_{5.} \pm t_{\alpha/2}\, s\sqrt{\frac{2}{6}},$$
where $t_{\alpha/2}$ is the $100(1 - \alpha/2)$th percentile (the value leaving an area of $\alpha/2$ to the right) of a t-distribution with 25 degrees of freedom (degrees of freedom coming from $s^2$). This straightforward connection between hypothesis testing and confidence intervals should be obvious from discussions in Chapters 9 and 10. The test of the simple contrast $\mu_1 - \mu_5$ involves no more than observing whether or not the confidence interval above covers zero. Substituting the numbers, we have as the 95% confidence interval
$$(553.33 - 610.67) \pm 2.060\sqrt{4961}\sqrt{\frac{1}{3}} = -57.34 \pm 83.77.$$
Thus, since the interval covers zero, the contrast is not significant. In other words, we do not find a significant difference between the means of aggregates 1 and 5.
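The interval above can be reproduced with a few lines of Python. This is a sketch under the assumption that the tabled value 2.060 quoted in the text is supplied by hand; the helper name `paired_ci` is hypothetical.

```python
import math


def paired_ci(mean_a, mean_b, mse, n, t_crit):
    """Confidence interval for mu_a - mu_b using the pooled ANOVA error."""
    diff = mean_a - mean_b
    half_width = t_crit * math.sqrt(mse) * math.sqrt(2.0 / n)
    return diff - half_width, diff + half_width


# t_crit = 2.060 is the tabled value used in the text (25 d.f., 95% interval).
lower, upper = paired_ci(553.33, 610.67, mse=4961.0, n=6, t_crit=2.060)
covers_zero = lower < 0.0 < upper
print(round(lower, 2), round(upper, 2), covers_zero)   # -141.11 26.43 True
```

Because the interval covers zero, the sketch reaches the same conclusion as the t-test: the contrast is not significant.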
Experiment-wise Error Rate
Serious difficulties occur when the analyst attempts to make many or all possible paired comparisons. For the case of $k$ means, there will be, of course, $r = k(k - 1)/2$ possible paired comparisons. Assuming independent comparisons, the experiment-wise error rate or family error rate (i.e., the probability of false rejection of at least one of the hypotheses) is given by $1 - (1 - \alpha)^r$, where $\alpha$ is the selected probability of a type I error for a specific comparison. Clearly, this measure of experiment-wise type I error can be quite large. For example, even if there are only 6 comparisons, say, in the case of 4 means, and $\alpha = 0.05$, the experiment-wise rate is
$$1 - (0.95)^6 \approx 0.26.$$
When many paired comparisons are being tested, there is usually a need to make the effective contrast on a single comparison more conservative. That is, with the confidence interval approach, the confidence intervals would be much wider than the $\pm t_{\alpha/2}\, s\sqrt{2/n}$ used for the case where only a single comparison is being made.
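To see how quickly the experiment-wise rate grows with the number of means, the formula $1 - (1 - \alpha)^r$ can be tabulated. The sketch below assumes independent comparisons, as the text does; the function names are illustrative.

```python
def n_pairs(k):
    """Number of possible paired comparisons among k means."""
    return k * (k - 1) // 2


def experimentwise_rate(alpha, r):
    """Probability of at least one false rejection in r independent tests."""
    return 1.0 - (1.0 - alpha) ** r


for k in (3, 4, 5, 6):
    r = n_pairs(k)
    print(k, r, round(experimentwise_rate(0.05, r), 3))
# The k = 4 row gives r = 6 and a rate of about 0.265, the 0.26 of the text.
```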
Tukey’s Test
There are several standard methods for making paired comparisons that sustain the credibility of the type I error rate. We shall discuss and illustrate two of them here. The first one, called Tukey’s procedure, allows formation of simultaneous $100(1 - \alpha)\%$ confidence intervals for all paired comparisons. The method is based on the studentized range distribution. The appropriate percentile point is a function of $\alpha$, $k$, and $v$ = degrees of freedom for $s^2$. A list of upper percentage points for $\alpha = 0.05$ is shown in Table A.12. The method of paired comparisons by Tukey involves finding a significant difference between means $i$ and $j$ ($i \neq j$) if $|\bar{y}_{i.} - \bar{y}_{j.}|$ exceeds $q(\alpha, k, v)\sqrt{s^2/n}$.
Tukey’s procedure is easily illustrated. Consider a hypothetical example where we have 6 treatments in a one-factor completely randomized design, with 5 observations taken per treatment. Suppose that the mean square error taken from the analysis-of-variance table is $s^2 = 2.45$ (24 degrees of freedom). The sample means are in ascending order:
$$\begin{array}{cccccc} \bar{y}_{2.} & \bar{y}_{5.} & \bar{y}_{1.} & \bar{y}_{3.} & \bar{y}_{6.} & \bar{y}_{4.} \\ 14.50 & 16.75 & 19.84 & 21.12 & 22.90 & 23.20 \end{array}$$
With $\alpha = 0.05$, the value of $q(0.05, 6, 24)$ is 4.37. Thus, all absolute differences are to be compared to
$$4.37\sqrt{\frac{2.45}{5}} = 3.059.$$
As a result, the following represent means found to be significantly different using Tukey’s procedure:
4 and 1, 4 and 5, 4 and 2, 6 and 1, 6 and 5, 6 and 2, 3 and 5, 3 and 2, 1 and 5, 1 and 2.
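The comparison in this example can be sketched in Python. The sample means, $s^2 = 2.45$, $n = 5$, and the tabled value $q(0.05, 6, 24) = 4.37$ are taken from the text; the function name `tukey_pairs` is hypothetical, and the critical value is entered by hand rather than computed.

```python
import math
from itertools import combinations

# Sample means keyed by treatment number, as listed in the text.
means = {2: 14.50, 5: 16.75, 1: 19.84, 3: 21.12, 6: 22.90, 4: 23.20}
mse, n = 2.45, 5
q_crit = 4.37   # q(0.05, 6, 24), quoted in the text from Table A.12


def tukey_pairs(means, mse, n, q_crit):
    """Return the Tukey threshold and the pairs whose means differ by more."""
    threshold = q_crit * math.sqrt(mse / n)
    significant = [
        (a, b)
        for a, b in combinations(means, 2)
        if abs(means[a] - means[b]) > threshold
    ]
    return threshold, significant


threshold, significant = tukey_pairs(means, mse, n, q_crit)
print(round(threshold, 3))   # 3.059
print(len(significant))      # 10 of the 15 pairs exceed the threshold
```

The pair (2, 5), with a difference of 2.25, falls below the threshold and is therefore not declared different by this procedure.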
Where Does the α-Level Come From in Tukey’s Test?
We briefly alluded to the concept of simultaneous confidence intervals being employed for Tukey’s procedure. The reader will gain a useful insight into the notion of multiple comparisons if he or she gains an understanding of what is meant by simultaneous confidence intervals.
In Chapter 9, we saw that if we compute a 95% confidence interval on, say, a mean $\mu$, then the probability that the interval covers the true mean $\mu$ is 0.95. However, as we have discussed, for the case of multiple comparisons, the effective probability of interest is tied to the experiment-wise error rate, and it should be emphasized that the confidence intervals of the type $\bar{y}_{i.} - \bar{y}_{j.} \pm q(\alpha, k, v)\, s\sqrt{1/n}$ are not independent since they all involve $s$ and many involve the use of the same averages, the $\bar{y}_{i.}$. Despite the difficulties, if we use $q(0.05, k, v)$, the simultaneous confidence level is controlled at 95%. The same holds for $q(0.01, k, v)$; namely, the confidence level is controlled at 99%. In the case of $\alpha = 0.05$, there is a probability of 0.05 that at least one pair of measures will be falsely found to be different (false rejection of at least one null hypothesis). In the $\alpha = 0.01$ case, the corresponding probability will be 0.01.
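A small simulation can make the idea of a simultaneous error rate concrete. The sketch below is illustrative only: it assumes normal data with equal means and unit variance, uses the text’s setting of 6 treatments with 5 observations each and $q(0.05, 6, 24) = 4.37$, and compares Tukey’s criterion with unadjusted pairwise t-tests. The value 2.064 is the standard tabled $t_{0.025}$ for 24 degrees of freedom, which does not appear in the text; the function name, seed, and number of replications are arbitrary choices.

```python
import math
import random


def simulate_false_rejection_rates(k=6, n=5, q_crit=4.37, t_crit=2.064,
                                   n_sims=4000, seed=1):
    """Estimate how often at least one pair is falsely declared different
    when all k population means are in fact equal."""
    rng = random.Random(seed)
    tukey_hits = 0
    unadjusted_hits = 0
    for _ in range(n_sims):
        groups = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
        means = [sum(g) / n for g in groups]
        # Pooled mean square error with k(n - 1) degrees of freedom.
        sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
        mse = sse / (k * (n - 1))
        # The largest pairwise difference is the range of the sample means,
        # so "at least one rejection" reduces to a check on that range.
        max_diff = max(means) - min(means)
        if max_diff > q_crit * math.sqrt(mse / n):
            tukey_hits += 1
        if max_diff > t_crit * math.sqrt(2.0 * mse / n):
            unadjusted_hits += 1
    return tukey_hits / n_sims, unadjusted_hits / n_sims


tukey_rate, unadjusted_rate = simulate_false_rejection_rates()
print(tukey_rate, unadjusted_rate)
```

Under these assumptions the Tukey rate should land near 0.05, while the unadjusted rate is several times larger; the exact figures depend on the seed and the number of replications.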
Duncan’s Test
The second procedure we shall discuss is called Duncan’s procedure or Duncan’s multiple-range test. This procedure is also based on the general notion of studentized range. The range of any subset of $p$ sample means must exceed a certain value before any of the $p$ means are found to be different. This value is called the least significant range for the $p$ means and is denoted by $R_p$, where
$$R_p = r_p\sqrt{\frac{s^2}{n}}.$$
The values of the quantity $r_p$, called the least significant studentized range, depend on the desired level of significance and the number of degrees of freedom of the mean square error. These values may be obtained from Table A.13 for $p = 2, 3, \ldots, 10$ means.
To illustrate the multiple-range test procedure, let us consider the hypothetical example where 6 treatments are compared, with 5 observations per treatment. This is the same example used to illustrate Tukey’s test. Since $\sqrt{s^2/n} = \sqrt{2.45/5} = 0.70$, we obtain $R_p$ by multiplying each $r_p$ by 0.70. The results of these computations are summarized as follows:
$$\begin{array}{c|ccccc} p & 2 & 3 & 4 & 5 & 6 \\ \hline r_p & 2.919 & 3.066 & 3.160 & 3.226 & 3.276 \\ R_p & 2.043 & 2.146 & 2.212 & 2.258 & 2.293 \end{array}$$
Comparing these least significant ranges with the differences in ordered means, we arrive at the following conclusions:
1. Since $\bar{y}_{4.} - \bar{y}_{2.} = 8.70 > R_6 = 2.293$, we conclude that $\mu_4$ and $\mu_2$ are significantly different.
2. Comparing $\bar{y}_{4.} - \bar{y}_{5.}$ and $\bar{y}_{6.} - \bar{y}_{2.}$ with $R_5$, we conclude that $\mu_4$ is significantly greater than $\mu_5$ and $\mu_6$ is significantly greater than $\mu_2$.
3. Comparing $\bar{y}_{4.} - \bar{y}_{1.}$, $\bar{y}_{6.} - \bar{y}_{5.}$, and $\bar{y}_{3.} - \bar{y}_{2.}$ with $R_4$, we conclude that each difference is significant.
4. Comparing $\bar{y}_{4.} - \bar{y}_{3.}$, $\bar{y}_{6.} - \bar{y}_{1.}$, $\bar{y}_{3.} - \bar{y}_{5.}$, and $\bar{y}_{1.} - \bar{y}_{2.}$ with $R_3$, we find all differences significant except for $\mu_4 - \mu_3$. Therefore, $\mu_3$, $\mu_4$, and $\mu_6$ constitute a subset of homogeneous means.
5. Comparing $\bar{y}_{3.} - \bar{y}_{1.}$, $\bar{y}_{1.} - \bar{y}_{5.}$, and $\bar{y}_{5.} - \bar{y}_{2.}$ with $R_2$, we conclude that only $\mu_3$ and $\mu_1$ are not significantly different.
It is customary to summarize the conclusions above by drawing a line under any subsets of adjacent means that are not significantly different. Thus, we have
$$\begin{array}{cccccc} \bar{y}_{2.} & \bar{y}_{5.} & \bar{y}_{1.} & \bar{y}_{3.} & \bar{y}_{6.} & \bar{y}_{4.} \\ 14.50 & 16.75 & 19.84 & 21.12 & 22.90 & 23.20 \end{array}$$
with one line drawn under $\bar{y}_{1.}$ and $\bar{y}_{3.}$ and another under $\bar{y}_{3.}$, $\bar{y}_{6.}$, and $\bar{y}_{4.}$.
It is clear that in this case the results from Tukey’s and Duncan’s procedures are very similar. Tukey’s procedure did not detect a difference between treatments 2 and 5, whereas Duncan’s did.
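The step-down comparison used in Duncan’s procedure can be sketched as follows. This is a minimal Python illustration rather than a reference implementation: the means, $s^2$, $n$, and the $r_p$ values are those quoted in the text, the function name `duncan` is hypothetical, and the rule that differences inside an already homogeneous subset are not tested is coded as it is applied in the worked steps above.

```python
import math

means = {2: 14.50, 5: 16.75, 1: 19.84, 3: 21.12, 6: 22.90, 4: 23.20}
mse, n = 2.45, 5
# Least significant studentized ranges r_p quoted in the text.
r_p = {2: 2.919, 3: 3.066, 4: 3.160, 5: 3.226, 6: 3.276}


def duncan(means, mse, n, r_p):
    """Compare ranges of ordered means with R_p, widest subsets first."""
    order = sorted(means, key=means.get)          # labels by ascending mean
    scale = math.sqrt(mse / n)
    R = {p: r * scale for p, r in r_p.items()}    # least significant ranges
    k = len(order)
    homogeneous = []    # (lo, hi) index spans found not significant
    significant = []    # (larger-mean label, smaller-mean label)
    for p in range(k, 1, -1):
        for lo in range(0, k - p + 1):
            hi = lo + p - 1
            # A span inside a homogeneous subset is not tested again.
            if any(a <= lo and hi <= b for a, b in homogeneous):
                continue
            if means[order[hi]] - means[order[lo]] > R[p]:
                significant.append((order[hi], order[lo]))
            else:
                homogeneous.append((lo, hi))
    subsets = [[order[i] for i in range(a, b + 1)] for a, b in homogeneous]
    return R, significant, subsets


R, significant, subsets = duncan(means, mse, n, r_p)
print({p: round(v, 3) for p, v in R.items()})
print(subsets)             # [[3, 6, 4], [1, 3]]
print(len(significant))    # 11 pairs, including (5, 2)
```

The two homogeneous subsets correspond to the two underlines in the summary, and the pair (5, 2) is the one difference found here that Tukey’s criterion of 3.059 did not detect.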
Dunnett’s Test: Comparing Treatment with a Control
In many scientific and engineering problems, one is not interested in drawing inferences regarding all possible comparisons among the treatment means of the type $\mu_i - \mu_j$. Rather, the experiment often dictates the need to simultaneously compare each treatment with a control. A test procedure developed by C. W. Dunnett determines significant differences between each treatment mean and the control, at a single joint significance level $\alpha$. To illustrate Dunnett’s procedure, let us consider the experimental data of Table 13.6 for a one-way classification where the effect of three catalysts on the yield of a reaction is being studied. A fourth treatment, no catalyst, is used as a control.
Table 13.6: Yield of Reaction
Columns: Control, Catalyst 1, Catalyst 2, Catalyst 3 (data values missing).
In general, we wish to test the $k$ hypotheses
$$H_0\colon \mu_0 = \mu_i, \qquad H_1\colon \mu_0 \neq \mu_i, \qquad i = 1, 2, \ldots, k,$$
where $\mu_0$ represents the mean yield for the population of measurements in which the control is used. The usual analysis-of-variance assumptions, as outlined in Section 13.3, are expected to remain valid. To test the null hypotheses specified by $H_0$ against two-sided alternatives for an experimental situation in which there are $k$ treatments, excluding the control, and $n$ observations per treatment, we first calculate the values
$$d_i = \frac{\bar{y}_{i.} - \bar{y}_{0.}}{\sqrt{2s^2/n}}, \qquad i = 1, 2, \ldots, k.$$
The sample variance $s^2$ is obtained, as before, from the mean square error in the analysis of variance. Now, the critical region for rejecting $H_0$, at the $\alpha$-level of significance, is established by the inequality
$$|d_i| > d_{\alpha/2}(k, v),$$
where $v$ is the number of degrees of freedom for the mean square error. The values of the quantity $d_{\alpha/2}(k, v)$ for a two-tailed test are given in Table A.14 for $\alpha = 0.05$ and $\alpha = 0.01$ for various values of $k$ and $v$.
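The mechanics of the two-sided rule can be sketched in Python. The treatment means, control mean, and mean square error below are HYPOTHETICAL values chosen only to make the arithmetic easy to follow; they are not the data of Table 13.6. The critical value 2.59 is the tabled $d_{0.025}(3, 16)$ used in Example 13.5, and the function names are illustrative.

```python
import math


def dunnett_statistics(control_mean, treatment_means, mse, n):
    """d_i = (treatment mean - control mean) / sqrt(2 s^2 / n)."""
    se = math.sqrt(2.0 * mse / n)
    return [(m - control_mean) / se for m in treatment_means]


def dunnett_two_sided(d_values, d_crit):
    """Flag each treatment whose |d_i| exceeds the tabled critical value."""
    return [abs(d) > d_crit for d in d_values]


# HYPOTHETICAL summary values (not from Table 13.6).
control_mean = 50.0
treatment_means = [51.0, 56.0, 52.0]
mse, n = 2.5, 5       # with 4 groups of 5, the error has 16 d.f.
d_crit = 2.59         # d_{0.025}(3, 16), as in Example 13.5

d_values = dunnett_statistics(control_mean, treatment_means, mse, n)
rejected = dunnett_two_sided(d_values, d_crit)
print(d_values)   # [1.0, 6.0, 2.0]
print(rejected)   # [False, True, False]
```

With these made-up numbers only the second treatment differs significantly from the control; for a one-sided alternative the absolute value would be dropped and the critical value taken from the one-sided table instead.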
Example 13.5: For the data of Table 13.6, test hypotheses comparing each catalyst with the control, using two-sided alternatives. Choose $\alpha = 0.05$ as the joint significance level.
Solution: The mean square error with 16 degrees of freedom is obtained from the analysis-of-variance table, using all $k + 1$ treatments. The mean square error is given by
[Displayed computations of $s^2$ and of $d_1$, $d_2$, $d_3$ missing.]
From Table A.14 the critical value for $\alpha = 0.05$ is found to be $d_{0.025}(3, 16) = 2.59$. Since $|d_1| < 2.59$ and $|d_3| < 2.59$, we conclude that only the mean yield for catalyst 2 is significantly different from the mean yield of the reaction using the control.
Many practical applications dictate the need for a one-tailed test for comparing treatments with a control. Certainly, when a pharmacologist is concerned with the effect of various dosages of a drug on cholesterol level and the control is zero dosage, it is of interest to determine if each dosage produces a significantly larger reduction than the control. Table A.15 shows the critical values of $d_\alpha(k, v)$ for one-sided alternatives.
Exercises
13.12 Consider the data of Review Exercise 13.45 on page 555. Make significance tests on the following contrasts:
(a) B versus A, C, and D;
(b) C versus A and D;
(c) A versus D.
13.13 The purpose of the study The Incorporation of a Chelating Agent into a Flame Retardant Finish of a Cotton Flannelette and the Evaluation of Selected Fabric Properties conducted at Virginia Tech was to evaluate the use of a chelating agent as part of the flame-retardant finish of cotton flannelette by determining its effects upon flammability after the fabric is laundered under specific conditions. Two baths were prepared, one with carboxymethyl cellulose and one without. Twelve pieces of fabric were laundered 5 times in bath I, and 12 other pieces of fabric were laundered 10 times in bath I. This was repeated using 24 additional pieces of cloth in bath II. After the washings the lengths of fabric that burned and the burn times were measured. For convenience, let us define the following treatments:
Treatment 1: 5 launderings in bath I,
Treatment 2: 5 launderings in bath II,
Treatment 3: 10 launderings in bath I,
Treatment 4: 10 launderings in bath II.
Burn times, in seconds, were recorded as follows:
        Treatment
   1      2      3      4
 13.7    6.2   27.2   18.2
 23.0    5.4   16.8    8.8
 15.7    5.0   12.9   14.5
 25.5    4.4   14.9   14.7
 15.8    5.0   17.1   17.1
 14.0   16.0   10.8   10.6
 29.4    2.5   13.5    5.8
  9.7    1.6   25.5    7.3
 14.0    3.9   14.2   17.7
 12.3    2.5   27.4   18.3
 12.3    7.1   11.5    9.9
(a) Perform an analysis of variance, using a 0.01 level of significance, and determine whether there are any significant differences among the treatment means.
(b) Use single-degree-of-freedom contrasts with α = 0.01 to compare the mean burn time of treatment 1 versus treatment 2 and also treatment 3 versus treatment 4.
13.14 The study Loss of Nitrogen Through Sweat by Preadolescent Boys Consuming Three Levels of Dietary Protein was conducted by the Department of Human Nutrition and Foods at Virginia Tech to determine perspiration nitrogen loss at various dietary protein levels. Twelve preadolescent boys ranging in age from 7 years, 8 months to 9 years, 8 months, all judged to be clinically healthy, were used in the experiment. Each boy was subjected to one of three controlled diets in which 29, 54, or 84 grams of protein were consumed per day. The following data represent the body perspiration nitrogen loss, in milligrams, during the last two days of the experimental period:
        Protein Level
 29 Grams   54 Grams   84 Grams
(Only the values 190, 266, and 270 survive; the remaining entries are missing.)
(a) Perform an analysis of variance at the 0.05 level of significance to show that the mean perspiration nitrogen losses at the three protein levels are different.
(b) Use Tukey’s test to determine which protein levels are significantly different from each other in mean nitrogen loss.
13.15 Use Tukey’s test, with a 0.05 level of significance, to analyze the means of the five different brands of headache tablets in Exercise 13.2 on page 518.
13.16 An investigation was conducted to determine the source of reduction in yield of a certain chemical product. It was known that the loss in yield occurred in the mother liquor, that is, the material removed at the filtration stage. It was thought that different blends of the original material might result in different yield reductions at the mother liquor stage. The following are the percent reductions for 3 batches at each of 4 preselected blends:
          Blend
   1      2      3      4
 25.6   25.2   20.8   31.6
 24.3   28.6   26.7   29.8
 27.9   24.7   22.2   34.3
(a) Perform the analysis of variance at the α = 0.05 level of significance.
(b) Use Duncan’s multiple-range test to determine which blends differ.
(c) Do part (b) using Tukey’s test.
13.17 In the study An Evaluation of the Removal Method for Estimating Benthic Populations and Diversity conducted by Virginia Tech on the Jackson River, 5 different sampling procedures were used to determine the species counts. Twenty samples were selected at random, and each of the 5 sampling procedures was repeated 4 times. The species counts were recorded as follows:
                      Sampling Procedure
 Depletion   Modified Hess   Surber   Substrate Removal Kicknet   Kicknet
    85            75           31               43                  17
    55            45           20               21                  10
    40            35            9               15                   8
    77            67           37               27                  15
(a) Is there a significant difference in the average species counts for the different sampling procedures? Use a P-value in your conclusion.
(b) Use Tukey’s test with α = 0.05 to find which sampling procedures differ.
13.18 The following data are values of pressure (psi) in a torsion spring for several settings of the angle between the legs of the spring in a free position:
 Angle (°): 67 71 75 79 83
 83 84 86 87 89 90
 85 85 87 87 90 92
 85 88 88 90
 86 88 88 91
 86 88 89
(Rows as extracted; the assignment of values to angle columns is uncertain.)
Compute a one-way analysis of variance for this experiment and state your conclusion concerning the effect of angle on the pressure in the spring. (From C. R. Hicks, Fundamental Concepts in the Design of Experiments, Holt, Rinehart and Winston, New York, 1973.)
13.19 It is suspected that the environmental temperature at which batteries are activated affects their life. Thirty homogeneous batteries were tested, six at each of five temperatures, and the data are shown below (activated life in seconds). Analyze and interpret the data. (From C. R. Hicks, Fundamental Concepts in the Design of Experiments, Holt, Rinehart and Winston, New York, 1973.)
     Temperature (°C)
 55   60   70   72   65
 55   61   72   72   66
 57   60   72   72   60
 54   60   77   68   65
 56   60   77   69   65
(The temperature values heading the columns are missing.)
13.20 The following table (from A. Hald, Statistical Theory with Engineering Applications, John Wiley & Sons, New York, 1952) gives tensile strengths (in deviations from 340) for wires taken from nine cables to be used for a high-voltage network. Each cable is made from 12 wires. We want to know whether the mean strengths of the wires in the nine cables are the same. If the cables are different, which ones differ? Use a P-value in your analysis of variance.
 Cable   Tensile Strength
(Only the fragments "2 −11 −13 −8" and "7 10 7 8 1" survive; the remaining entries are missing.)
13.21 The printout in Figure 13.5 on page 532 gives information on Duncan’s test, using PROC GLM in SAS, for the aggregate data in Example 13.1. Give conclusions regarding paired comparisons using Duncan’s test results.
13.22 Do Duncan’s test for paired comparisons for the data of Exercise 13.6 on page 519. Discuss the results.
13.23 In a biological experiment, four concentrations of a certain chemical are used to enhance the growth of a certain type of plant over time. Five plants are used at each concentration, and the growth in each plant is measured in centimeters. The following growth data are taken. A control (no chemical) is also applied.
                Concentration
 Control    1      2      3      4
   6.8     8.2    7.7    6.9    5.9
   7.3     8.7    8.4    5.8    6.1
   6.9     9.2    8.1    6.8    5.7
Use Dunnett’s two-sided test at the 0.05 level of significance to simultaneously compare the concentrations with the control.
13.24 The financial structure of a firm refers to the way the firm’s assets are divided into equity and debt, and the financial leverage refers to the percentage of assets financed by debt. In the paper The Effect of Financial Leverage on Return, Tai Ma of Virginia Tech claims that financial leverage can be used to increase the rate of return on equity. To say it another way, stockholders can receive higher returns on equity with the same amount of investment through the use of financial leverage. The following data show the rates of return on equity using 3 different levels of financial leverage and a control level (zero debt) for 24 randomly selected firms:
        Financial Leverage
 Control   Low   Medium   High
(Data values missing.)
Source: Standard & Poor’s Machinery Industry Survey, 1975.
(a) Perform the analysis of variance at the 0.05 level of significance.
(b) Use Dunnett’s test at the 0.01 level of significance to determine whether the mean rates of return on equity are higher at the low, medium, and high levels of financial leverage than at the control level.
[Figure 13.5: SAS printout for Exercise 13.21. The GLM Procedure, Duncan’s Multiple Range Test for moisture. NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate. Means with the same letter are not significantly different. Recoverable values: critical ranges 83.75, 87.97, 90.69, 92.61 (for 2, 3, 4, and 5 means); Duncan grouping B, mean 465.17, N = 6, aggregate 4. The remaining entries are missing.]