Multiple Comparisons

13.6 Multiple Comparisons

  The analysis of variance is a powerful procedure for testing the homogeneity of a set of means. However, if we reject the null hypothesis and accept the stated alternative (that the means are not all equal), we still do not know which of the population means are equal and which are different.

  Chapter 13 One-Factor Experiments: General

  [SAS PROC GLM printout, dependent variable: moisture. Analysis-of-variance table (Source, DF, Sum of Squares, Mean Square, F Value, Pr > F), fit summary (R-Square, Coeff Var, Root MSE, moisture Mean), and Type I SS, Type III SS, and Contrast SS tables.]

  Figure 13.4: A set of orthogonal contrasts

  Often it is of interest to make several (perhaps all possible) paired comparisons among the treatments. Actually, a paired comparison may be viewed as a simple contrast, namely, a test of

  $$H_0\colon \mu_i - \mu_j = 0,$$
  $$H_1\colon \mu_i - \mu_j \neq 0,$$

  for all $i \neq j$. Making all possible paired comparisons among the means can be very beneficial when particular complex contrasts are not known a priori. For example, in the aggregate data of Table 13.1, suppose that we wish to test

  $$H_0\colon \mu_1 - \mu_5 = 0,$$
  $$H_1\colon \mu_1 - \mu_5 \neq 0.$$

  The test is developed through use of an F, t, or confidence interval approach. Using t, we have

  $$t = \frac{\bar{y}_{1.} - \bar{y}_{5.}}{s\sqrt{2/n}},$$

  where s is the square root of the mean square error and n = 6 is the sample size per treatment. In this case,

  $$t = \frac{553.33 - 610.67}{\sqrt{4961}\,\sqrt{1/3}} = -1.41.$$


  The P-value for the t-test with 25 degrees of freedom is 0.17. Thus, there is not sufficient evidence to reject $H_0$.

  Relationship between T and F

  In the foregoing, we displayed the use of a pooled t-test along the lines of that discussed in Chapter 10. The pooled estimate was taken from the mean squared error in order to enjoy the degrees of freedom that are pooled across all five samples. In addition, we have tested a contrast. The reader should note that if the t-value is squared, the result is exactly of the same form as the value of f for a test on a contrast, discussed in the preceding section. In fact,

  $$f = \frac{(\bar{y}_{1.} - \bar{y}_{5.})^2}{s^2(1/6 + 1/6)} = \frac{(553.33 - 610.67)^2}{4961(1/3)} = 1.988,$$

  which, of course, is $t^2$.
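  The relationship between t and f can be checked numerically. The following is a minimal sketch using only the quantities quoted in this section (the two sample means, the mean square error 4961, and n = 6); the function names `pooled_t` and `contrast_f` are illustrative and do not come from any particular library.

```python
import math

# Quantities quoted in the text for aggregates 1 and 5
ybar1, ybar5 = 553.33, 610.67   # sample means of treatments 1 and 5
mse = 4961.0                    # mean square error s^2 (25 degrees of freedom)
n = 6                           # observations per treatment


def pooled_t(mean_i, mean_j, mse, n):
    """t statistic for H0: mu_i - mu_j = 0 using the ANOVA mean square error."""
    return (mean_i - mean_j) / math.sqrt(mse * 2.0 / n)


def contrast_f(mean_i, mean_j, mse, n):
    """f statistic for the simple contrast mu_i - mu_j."""
    return (mean_i - mean_j) ** 2 / (mse * (1.0 / n + 1.0 / n))


t = pooled_t(ybar1, ybar5, mse, n)
f = contrast_f(ybar1, ybar5, mse, n)
print(f"t = {t:.2f}, t^2 = {t * t:.3f}, f = {f:.3f}")
```

  Squaring the t statistic and computing the contrast f directly give the same number, which is the point of the algebraic identity above.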

  Confidence Interval Approach to a Paired Comparison

  It is straightforward to solve the same problem of a paired comparison (or a contrast) using a confidence interval approach. Clearly, if we compute a 100(1 − α)% confidence interval on $\mu_1 - \mu_5$, we have

  $$\bar{y}_{1.} - \bar{y}_{5.} \pm t_{\alpha/2}\, s\sqrt{\frac{2}{6}},$$

  where $t_{\alpha/2}$ is the upper 100(1 − α/2)% point of a t-distribution with 25 degrees of freedom (degrees of freedom coming from $s^2$). This straightforward connection between hypothesis testing and confidence intervals should be obvious from discussions in Chapters 9 and 10. The test of the simple contrast $\mu_1 - \mu_5$ involves no more than observing whether or not the confidence interval above covers zero. Substituting the numbers, we have as the 95% confidence interval

  $$(553.33 - 610.67) \pm 2.060\sqrt{4961}\sqrt{\frac{1}{3}} = -57.34 \pm 83.77.$$

  Thus, since the interval covers zero, the contrast is not significant. In other words, we do not find a significant difference between the means of aggregates 1 and 5.
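  The interval computation can be sketched in a few lines. This is an illustration only: the critical value 2.060 is the tabled t-value quoted above for 25 degrees of freedom, and the helper name `paired_ci` is hypothetical.

```python
import math


def paired_ci(mean_i, mean_j, mse, n, t_crit):
    """Confidence interval on mu_i - mu_j based on the ANOVA mean square error."""
    center = mean_i - mean_j
    half_width = t_crit * math.sqrt(mse * 2.0 / n)
    return center - half_width, center + half_width


# Means, mean square error, n, and t-value as quoted in the text
low, high = paired_ci(553.33, 610.67, 4961.0, 6, 2.060)
half_width = (high - low) / 2.0
covers_zero = low <= 0.0 <= high
print(f"interval: ({low:.2f}, {high:.2f}); covers zero: {covers_zero}")
```

  Because zero lies inside the interval, the sketch reaches the same conclusion as the t-test: the contrast is not significant.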

  Experiment-wise Error Rate

  Serious difficulties occur when the analyst attempts to make many or all possible paired comparisons. For the case of k means, there will be, of course, $r = k(k - 1)/2$ possible paired comparisons. Assuming independent comparisons, the experiment-wise error rate or family error rate (i.e., the probability of false rejection of at least one of the hypotheses) is given by $1 - (1 - \alpha)^r$, where α is the selected probability of a type I error for a specific comparison. Clearly, this measure of experiment-wise type I error can be quite large. For example, even if there are only 6 comparisons, say, in the case of 4 means, and α = 0.05, the experiment-wise rate is

  $$1 - (0.95)^6 \approx 0.26.$$

  When many paired comparisons are being tested, there is usually a need to make the effective contrast on a single comparison more conservative. That is, with the confidence interval approach, the confidence intervals would be much wider than the $\pm t_{\alpha/2}\, s\sqrt{2/n}$ used for the case where only a single comparison is being made.
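  To see how quickly the experiment-wise rate grows with the number of means, the formula can be tabulated. The sketch below assumes independent comparisons, exactly as in the formula above; the function names are illustrative.

```python
def n_pairs(k):
    """Number of paired comparisons among k means: r = k(k - 1)/2."""
    return k * (k - 1) // 2


def experimentwise_rate(alpha, r):
    """Probability of at least one false rejection in r independent tests."""
    return 1.0 - (1.0 - alpha) ** r


alpha = 0.05
for k in (3, 4, 5, 6, 10):
    r = n_pairs(k)
    print(f"k = {k:2d}  r = {r:2d}  rate = {experimentwise_rate(alpha, r):.3f}")

rate_4_means = experimentwise_rate(alpha, n_pairs(4))
```

  For k = 4 this reproduces the value of about 0.26 given in the text; for larger k the rate climbs toward 1.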

  Tukey’s Test

  There are several standard methods for making paired comparisons that sustain the credibility of the type I error rate. We shall discuss and illustrate two of them here. The first one, called Tukey’s procedure, allows formation of simultaneous 100(1 − α)% confidence intervals for all paired comparisons. The method is based on the studentized range distribution. The appropriate percentile point is a function of α, k, and v = degrees of freedom for $s^2$. A list of upper percentage points for α = 0.05 is shown in Table A.12. The method of paired comparisons by Tukey involves finding a significant difference between means i and j ($i \neq j$) if $|\bar{y}_{i.} - \bar{y}_{j.}|$ exceeds $q(\alpha, k, v)\sqrt{s^2/n}$.

  Tukey’s procedure is easily illustrated. Consider a hypothetical example where we have 6 treatments in a one-factor completely randomized design, with 5 observations taken per treatment. Suppose that the mean square error taken from the analysis-of-variance table is $s^2 = 2.45$ (24 degrees of freedom). The sample means are in ascending order:

  ȳ2.     ȳ5.     ȳ1.     ȳ3.     ȳ6.     ȳ4.
  14.50   16.75   19.84   21.12   22.90   23.20

  With α = 0.05, the value of q(0.05, 6, 24) is 4.37. Thus, all absolute differences are to be compared to

  $$4.37\sqrt{\frac{2.45}{5}} = 3.059.$$

  As a result, the following represent means found to be significantly different using Tukey’s procedure:

  4 and 1, 4 and 5, 4 and 2, 6 and 1, 6 and 5, 6 and 2, 3 and 5, 3 and 2, 1 and 5, and 1 and 2.
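  The pairwise comparison can be sketched as follows. The values q = 4.37, $s^2$ = 2.45, and n = 5 are those of the hypothetical example; the six sample means are inputs to the sketch and are assumed values, chosen to agree with the differences quoted in this section (for example, 23.20 − 14.50 = 8.70). The function names are illustrative.

```python
import math
from itertools import combinations


def tukey_threshold(q, mse, n):
    """Critical difference q(alpha, k, v) * sqrt(s^2 / n)."""
    return q * math.sqrt(mse / n)


def tukey_pairs(means, q, mse, n):
    """Return the pairs of treatment labels whose means differ significantly."""
    cutoff = tukey_threshold(q, mse, n)
    return [
        (i, j)
        for (i, mean_i), (j, mean_j) in combinations(sorted(means.items()), 2)
        if abs(mean_i - mean_j) > cutoff
    ]


# Sample means keyed by treatment label (assumed inputs for this sketch)
means = {1: 19.84, 2: 14.50, 3: 21.12, 4: 23.20, 5: 16.75, 6: 22.90}

cutoff = tukey_threshold(4.37, 2.45, 5)
significant = tukey_pairs(means, 4.37, 2.45, 5)
print(f"critical difference = {cutoff:.3f}")
print("significantly different pairs:", significant)
```

  Every absolute difference is compared with the same critical difference, which is what distinguishes Tukey’s procedure from a range test whose threshold depends on how many ordered means a pair spans.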

  Where Does the α-Level Come From in Tukey’s Test?

  We briefly alluded to the concept of simultaneous confidence intervals being employed for Tukey’s procedure. The reader will gain a useful insight into the notion of multiple comparisons if he or she gains an understanding of what is meant by simultaneous confidence intervals.

  In Chapter 9, we saw that if we compute a 95% confidence interval on, say, a mean μ, then the probability that the interval covers the true mean μ is 0.95. However, as we have discussed, for the case of multiple comparisons, the effective probability of interest is tied to the experiment-wise error rate, and it should be emphasized that the confidence intervals of the type $\bar{y}_{i.} - \bar{y}_{j.} \pm q(\alpha, k, v)\, s\sqrt{1/n}$ are not independent since they all involve s and many involve the use of the same averages, the $\bar{y}_{i.}$. Despite the difficulties, if we use q(0.05, k, v), the simultaneous confidence level is controlled at 95%. The same holds for q(0.01, k, v); namely, the confidence level is controlled at 99%. In the case of α = 0.05, there is a probability of 0.05 that at least one pair of measures will be falsely found to be different (false rejection of at least one null hypothesis). In the α = 0.01 case, the corresponding probability will be 0.01.
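  The meaning of a simultaneous error rate can be illustrated by simulation. The sketch below is not part of Tukey’s procedure itself; it assumes normal data with equal means and equal variances (so every null hypothesis is true), uses the tabled value q(0.05, 6, 24) = 4.37 from the earlier example, and counts how often at least one pair is falsely declared different. The number of replications and the seed are arbitrary choices.

```python
import math
import random


def familywise_rate(k=6, n=5, q=4.37, reps=4000, seed=0):
    """Estimate P(at least one false rejection) for Tukey's procedure under H0."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(reps):
        # k treatments, n observations each, all drawn from the same N(0, 1)
        groups = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
        means = [sum(g) / n for g in groups]
        sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
        mse = sse / (k * (n - 1))            # mean square error, k(n - 1) df
        cutoff = q * math.sqrt(mse / n)      # Tukey critical difference
        # The largest pairwise difference is the range of the sample means,
        # so some pair is rejected exactly when the range exceeds the cutoff.
        if max(means) - min(means) > cutoff:
            false_alarms += 1
    return false_alarms / reps


rate = familywise_rate()
print(f"estimated experiment-wise error rate: {rate:.3f}")
```

  The estimate should land near 0.05 even though the 15 individual intervals share the same s and the same averages, which is the sense in which the α-level is a joint one.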

  Duncan’s Test

  The second procedure we shall discuss is called Duncan’s procedure or Duncan’s multiple-range test. This procedure is also based on the general notion of studentized range. The range of any subset of p sample means must exceed a certain value before any of the p means are found to be different. This value is called the least significant range for the p means and is denoted by $R_p$, where

  $$R_p = r_p\sqrt{\frac{s^2}{n}}.$$

  The values of the quantity $r_p$, called the least significant studentized range, depend on the desired level of significance and the number of degrees of freedom of the mean square error. These values may be obtained from Table A.13 for p = 2, 3, . . . , 10 means.

  To illustrate the multiple-range test procedure, let us consider the hypothetical example where 6 treatments are compared, with 5 observations per treatment. This is the same example used to illustrate Tukey’s test. We obtain $R_p$ by multiplying each $r_p$ by $\sqrt{s^2/n} = \sqrt{2.45/5} = 0.70$. The results of these computations are summarized as follows:

  p      2       3       4       5       6
  r_p    2.919   3.066   3.160   3.226   3.276
  R_p    2.043   2.146   2.212   2.258   2.293
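  The least significant ranges follow mechanically from the tabled $r_p$ values. A minimal sketch, assuming the $r_p$ values listed above for p = 2 through 6 with $s^2$ = 2.45 and n = 5; the function name is illustrative.

```python
import math

# Least significant studentized ranges r_p for p = 2, ..., 6 (from the text)
r_p = {2: 2.919, 3: 3.066, 4: 3.160, 5: 3.226, 6: 3.276}


def least_significant_ranges(r_p, mse, n):
    """R_p = r_p * sqrt(s^2 / n) for each subset size p."""
    scale = math.sqrt(mse / n)
    return {p: r * scale for p, r in r_p.items()}


R = least_significant_ranges(r_p, 2.45, 5)
for p in sorted(R):
    print(f"p = {p}  R_p = {R[p]:.3f}")

# The range of all six ordered means quoted in the text is 8.70
full_range_significant = 8.70 > R[6]
```

  Unlike Tukey’s single critical difference, the threshold here grows with p, so adjacent means are held to a smaller range than means that are far apart in the ordering.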

  Comparing these least significant ranges with the differences in ordered means, we arrive at the following conclusions:

  1. Since $\bar{y}_{4.} - \bar{y}_{2.} = 8.70 > R_6 = 2.293$, we conclude that $\mu_4$ and $\mu_2$ are significantly different.

  2. Comparing $\bar{y}_{4.} - \bar{y}_{5.}$ and $\bar{y}_{6.} - \bar{y}_{2.}$ with $R_5$, we conclude that $\mu_4$ is significantly greater than $\mu_5$ and $\mu_6$ is significantly greater than $\mu_2$.

  3. Comparing $\bar{y}_{4.} - \bar{y}_{1.}$, $\bar{y}_{6.} - \bar{y}_{5.}$, and $\bar{y}_{3.} - \bar{y}_{2.}$ with $R_4$, we conclude that each difference is significant.

  4. Comparing $\bar{y}_{4.} - \bar{y}_{3.}$, $\bar{y}_{6.} - \bar{y}_{1.}$, $\bar{y}_{3.} - \bar{y}_{5.}$, and $\bar{y}_{1.} - \bar{y}_{2.}$ with $R_3$, we find all differences significant except for $\mu_4 - \mu_3$. Therefore, $\mu_3$, $\mu_4$, and $\mu_6$ constitute a subset of homogeneous means.

  5. Comparing $\bar{y}_{3.} - \bar{y}_{1.}$, $\bar{y}_{1.} - \bar{y}_{5.}$, and $\bar{y}_{5.} - \bar{y}_{2.}$ with $R_2$, we conclude that only $\mu_3$ and $\mu_1$ are not significantly different.


  It is customary to summarize the conclusions above by drawing a line under any subsets of adjacent means that are not significantly different. Thus, we have

  ȳ2.     ȳ5.     ȳ1.     ȳ3.     ȳ6.     ȳ4.
                  ___________
                          ___________________

  It is clear that in this case the results from Tukey’s and Duncan’s procedures are very similar. Tukey’s procedure did not detect a difference between 2 and 5, whereas Duncan’s did.

  Dunnett’s Test: Comparing Treatment with a Control

  In many scientific and engineering problems, one is not interested in drawing inferences regarding all possible comparisons among the treatment means of the type $\mu_i - \mu_j$. Rather, the experiment often dictates the need to simultaneously compare each treatment with a control. A test procedure developed by C. W. Dunnett determines significant differences between each treatment mean and the control, at a single joint significance level α. To illustrate Dunnett’s procedure, let us consider the experimental data of Table 13.6 for a one-way classification where the effect of three catalysts on the yield of a reaction is being studied. A fourth treatment, no catalyst, is used as a control.

  Table 13.6: Yield of Reaction

  Control          Catalyst 1       Catalyst 2       Catalyst 3
  ȳ0. = 51.44      ȳ1. = 53.50      ȳ2. = 54.04      ȳ3. = 49.38

  In general, we wish to test the k hypotheses

  $$H_0\colon \mu_0 = \mu_i, \qquad H_1\colon \mu_0 \neq \mu_i, \qquad i = 1, 2, \ldots, k,$$

  where $\mu_0$ represents the mean yield for the population of measurements in which the control is used. The usual analysis-of-variance assumptions, as outlined in Section 13.3, are expected to remain valid. To test the null hypotheses specified by $H_0$ against two-sided alternatives for an experimental situation in which there are k treatments, excluding the control, and n observations per treatment, we first calculate the values

  $$d_i = \frac{\bar{y}_{i.} - \bar{y}_{0.}}{\sqrt{2s^2/n}}, \qquad i = 1, 2, \ldots, k.$$

  The sample variance $s^2$ is obtained, as before, from the mean square error in the analysis of variance. Now, the critical region for rejecting $H_0$, at the α-level of significance, is established by the inequality

  $$|d_i| > d_{\alpha/2}(k, v),$$

  where v is the number of degrees of freedom for the mean square error. The values of the quantity $d_{\alpha/2}(k, v)$ for a two-tailed test are given in Table A.14 for α = 0.05 and α = 0.01 for various values of k and v.

  Example 13.5: For the data of Table 13.6, test hypotheses comparing each catalyst with the control, using two-sided alternatives. Choose α = 0.05 as the joint significance level.

  Solution: The mean square error with 16 degrees of freedom is obtained from the analysis-of-variance table, using all k + 1 treatments. The mean square error is given by

  $$s^2 = \frac{36.812}{16} = 2.30075 \quad\text{and}\quad \sqrt{\frac{2s^2}{n}} = \sqrt{\frac{(2)(2.30075)}{5}} = 0.9593.$$

  Hence,

  $$d_1 = \frac{53.50 - 51.44}{0.9593} = 2.147, \quad d_2 = \frac{54.04 - 51.44}{0.9593} = 2.710, \quad d_3 = \frac{49.38 - 51.44}{0.9593} = -2.147.$$

  From Table A.14 the critical value for α = 0.05 is found to be $d_{0.025}(3, 16) = 2.59$. Since $|d_1| < 2.59$ and $|d_3| < 2.59$, we conclude that only the mean yield for catalyst 2 is significantly different from the mean yield of the reaction using the control.
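  The Dunnett statistics for this example can be sketched as below. The treatment means are those of Table 13.6, n = 5 follows from 16 error degrees of freedom with four groups, and 2.59 is the tabled critical value quoted above. The mean square error is passed in as an argument; the value 2.30075 used in the demonstration is an assumed input that should be replaced by the figure read from the analysis-of-variance table. The function name is illustrative.

```python
import math


def dunnett_statistics(control_mean, treatment_means, mse, n):
    """d_i = (ybar_i - ybar_0) / sqrt(2 s^2 / n) for each treatment i."""
    se = math.sqrt(2.0 * mse / n)
    return {i: (m - control_mean) / se for i, m in treatment_means.items()}


control_mean = 51.44                              # ybar_0 from Table 13.6
treatment_means = {1: 53.50, 2: 54.04, 3: 49.38}  # catalysts 1, 2, 3
mse = 2.30075                                     # assumed mean square error (16 df)
n = 5                                             # observations per treatment
d_crit = 2.59                                     # d_{0.025}(3, 16), Table A.14

d = dunnett_statistics(control_mean, treatment_means, mse, n)
significant = [i for i, value in d.items() if abs(value) > d_crit]
for i in sorted(d):
    print(f"d_{i} = {d[i]:.3f}")
print("catalysts differing from the control:", significant)
```

  Only the comparisons with the control are made, so k = 3 tests share the joint level α rather than all six pairwise comparisons among the four groups.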

  Many practical applications dictate the need for a one-tailed test for comparing treatments with a control. Certainly, when a pharmacologist is concerned with the effect of various dosages of a drug on cholesterol level and his control is zero dosage, it is of interest to determine if each dosage produces a significantly larger reduction than the control. Table A.15 shows the critical values of d α (k, v) for one-sided alternatives.

  Exercises

  13.12 Consider the data of Review Exercise 13.45 on page 555. Make significance tests on the following contrasts:

  (a) B versus A, C, and D;

  (b) C versus A and D;

  (c) A versus D.

  13.13 The purpose of the study The Incorporation of a Chelating Agent into a Flame Retardant Finish of a Cotton Flannelette and the Evaluation of Selected Fabric Properties conducted at Virginia Tech was to evaluate the use of a chelating agent as part of the flame-retardant finish of cotton flannelette by determining its effects upon flammability after the fabric is laundered under specific conditions. Two baths were prepared, one with carboxymethyl cellulose and one without. Twelve pieces of fabric were laundered 5 times in bath I, and 12 other pieces of fabric were laundered 10 times in bath I. This was repeated using 24 additional pieces of cloth in bath II. After the washings the lengths of fabric that burned and the burn times were measured. For convenience, let us define the following treatments:

  Treatment 1: 5 launderings in bath I,

  Treatment 2: 5 launderings in bath II,

  Treatment 3: 10 launderings in bath I,

  Treatment 4: 10 launderings in bath II.

  Burn times, in seconds, were recorded as follows:

  Treatment
  1       2       3       4
  13.7    6.2     27.2    18.2
  23.0    5.4     16.8    8.8
  15.7    5.0     12.9    14.5
  25.5    4.4     14.9    14.7
  15.8    5.0     17.1    17.1
  ...
  ...     7.1     11.5    9.9

  (a) Perform an analysis of variance, using a 0.01 level of significance, and determine whether there are any significant differences among the treatment means.

  (b) Use single-degree-of-freedom contrasts with α = 0.01 to compare the mean burn time of treatment 1 versus treatment 2 and also treatment 3 versus treatment 4.

  13.14 The study Loss of Nitrogen Through Sweat by Preadolescent Boys Consuming Three Levels of Dietary Protein was conducted by the Department of Human Nutrition and Foods at Virginia Tech to determine perspiration nitrogen loss at various dietary protein levels. Twelve preadolescent boys ranging in age from 7 years, 8 months to 9 years, 8 months, all judged to be clinically healthy, were used in the experiment. Each boy was subjected to one of three controlled diets in which 29, 54, or 84 grams of protein were consumed per day. The following data represent the body perspiration nitrogen loss, in milligrams, during the last two days of the experimental period:

  Protein Level
  ...

  (a) Perform an analysis of variance at the 0.05 level of significance to show that the mean perspiration nitrogen losses at the three protein levels are different.

  (b) Use Tukey’s test to determine which protein levels are significantly different from each other in mean nitrogen loss.

  13.15 Use Tukey’s test, with a 0.05 level of significance, to analyze the means of the five different brands of headache tablets in Exercise 13.2 on page 518.

  13.16 An investigation was conducted to determine the source of reduction in yield of a certain chemical product. It was known that the loss in yield occurred in the mother liquor, that is, the material removed at the filtration stage. It was thought that different blends of the original material might result in different yield reductions at the mother liquor stage. The following are the percent reductions for 3 batches at each of 4 preselected blends:

  ...

  (a) Perform the analysis of variance at the α = 0.05 level of significance.

  (b) Use Duncan’s multiple-range test to determine which blends differ.

  (c) Do part (b) using Tukey’s test.

  13.17 In the study An Evaluation of the Removal Method for Estimating Benthic Populations and Diversity conducted by Virginia Tech on the Jackson River, 5 different sampling procedures were used to determine the species counts. Twenty samples were selected at random, and each of the 5 sampling procedures was repeated 4 times. The species counts were recorded as follows:

  Sampling Procedure
  Depletion    Modified Hess    Surber    Substrate Removal Kicknet    Kicknet
  ...

  (a) Is there a significant difference in the average species counts for the different sampling procedures? Use a P-value in your conclusion.

  (b) Use Tukey’s test with α = 0.05 to find which sampling procedures differ.

  13.18 The following data are values of pressure (psi) in a torsion spring for several settings of the angle between the legs of the spring in a free position:

  Angle (°)
  ...

  Compute a one-way analysis of variance for this experiment and state your conclusion concerning the effect of angle on the pressure in the spring. (From C. R. Hicks, Fundamental Concepts in the Design of Experiments, Holt, Rinehart and Winston, New York, 1973.)

  13.19 It is suspected that the environmental temperature at which batteries are activated affects their life. Thirty homogeneous batteries were tested, six at each of five temperatures, and the data are shown below (activated life in seconds). Analyze and interpret the data. (From C. R. Hicks, Fundamental Concepts in Design of Experiments, Holt, Rinehart and Winston, New York, 1973.)

  Temperature (°C)
  ...
  55    60    70    72    65
  55    61    72    66
  57    60    72    60
  54    60    77    68    65
  56    60    77    69    65

  13.20 The following table (from A. Hald, Statistical Theory with Engineering Applications, John Wiley & Sons, New York, 1952) gives tensile strengths (in deviations from 340) for wires taken from nine cables to be used for a high-voltage network. Each cable is made from 12 wires. We want to know whether the mean strengths of the wires in the nine cables are the same. If the cables are different, which ones differ? Use a P-value in your analysis of variance.

  Cable    Tensile Strength
  ...
  ...      7  10  7  8  1

  13.21 The printout in Figure 13.5 on page 532 gives information on Duncan’s test, using PROC GLM in SAS, for the aggregate data in Example 13.1. Give conclusions regarding paired comparisons using Duncan’s test results.

  13.22 Do Duncan’s test for paired comparisons for the data of Exercise 13.6 on page 519. Discuss the results.

  13.23 In a biological experiment, four concentrations of a certain chemical are used to enhance the growth of a certain type of plant over time. Five plants are used at each concentration, and the growth in each plant is measured in centimeters. The following growth data are taken. A control (no chemical) is also applied.

  Concentration
  Control    ...

  Use Dunnett’s two-sided test at the 0.05 level of significance to simultaneously compare the concentrations with the control.

  13.24 The financial structure of a firm refers to the way the firm’s assets are divided into equity and debt, and the financial leverage refers to the percentage of assets financed by debt. In the paper The Effect of Financial Leverage on Return, Tai Ma of Virginia Tech claims that financial leverage can be used to increase the rate of return on equity. To say it another way, stockholders can receive higher returns on equity with the same amount of investment through the use of financial leverage. The following data show the rates of return on equity using 3 different levels of financial leverage and a control level (zero debt) for 24 randomly selected firms:

  Financial Leverage
  Control    Low    Medium    High
  ...

  Source: Standard & Poor’s Machinery Industry Survey, 1975.

  (a) Perform the analysis of variance at the 0.05 level of significance.

  (b) Use Dunnett’s test at the 0.01 level of significance to determine whether the mean rates of return on equity are higher at the low, medium, and high levels of financial leverage than at the control level.


  The GLM Procedure

  Duncan’s Multiple Range Test for moisture

  NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

  [Printout fields: Error Degrees of Freedom, Error Mean Square, Number of Means, Critical Range]

  Means with the same letter are not significantly different.

  [Duncan Grouping table]

  Figure 13.5: SAS printout for Exercise 13.21.