
A preliminary analysis of the data used a two-sample t test.

Solution  Computer output for these data is shown here.

TABLE 6.14  Repair estimates (in hundreds of dollars)

Car    Garage I    Garage II
 1       17.6        17.3
 2       20.2        19.1
 3       19.5        18.4
 4       11.3        11.5
 5       13.0        12.7
 6       16.3        15.8
 7       15.3        14.9
 8       16.2        15.3
 9       12.2        12.0
10       14.8        14.2
11       21.3        21.0
12       22.1        21.0
13       16.9        16.1
14       17.6        16.7
15       18.4        17.5

Totals:  $\bar{y}_1 = 16.85$, $s_1 = 3.20$        $\bar{y}_2 = 16.23$, $s_2 = 2.94$

Two-Sample T-Test and Confidence Interval

Two-sample T for Garage I vs Garage II

            N    Mean   StDev   SE Mean
Garage I   15   16.85    3.20      0.83
Garage II  15   16.23    2.94      0.76

95% CI for mu Garage I - mu Garage II: (-1.69, 2.92)
T-Test mu Garage I = mu Garage II (vs not =): T = 0.55  P = 0.59  DF = 27

From the output, we see there is a difference in the sample means ($\bar{y}_1 - \bar{y}_2 = .62$). However, this difference is rather small considering the variability of the measurements ($s_1 = 3.20$, $s_2 = 2.94$). In fact, the computed t-value (.55) has a p-value of .59, indicating very little evidence of a difference in the average claim estimates for the two garages.

A closer glance at the data in Table 6.14 indicates that something about the conclusion in Example 6.7 is inconsistent with our intuition. For all but one of the 15 cars, the estimate from garage I was higher than that from garage II. From our knowledge of the binomial distribution, the probability of observing garage I estimates higher in $y = 14$ or more of the $n = 15$ trials, assuming no difference ($p = .5$) for garages I and II, is

$$P(y = 14 \text{ or } 15) = P(y = 14) + P(y = 15) = \binom{15}{14}(.5)^{14}(.5) + \binom{15}{15}(.5)^{15} = .000488$$

Thus, if the two garages in fact have the same distribution of estimates, there is approximately a 5 in 10,000 chance of having 14 or more estimates from garage I higher than those from garage II. Using this probability, we would argue that the observed estimates are highly contradictory to the null hypothesis of equal distributions of estimates for the two garages. Why are there such conflicting results from the t test and the binomial calculation? The explanation is that one of the required conditions for the t test, that the two samples be independent of each other, has been violated by the manner in which the study was conducted. The adjusters obtained a measurement from both garages for each car. For the two samples to be independent, the adjusters would have to take a random sample of 15 cars to garage I and a different random sample of 15 cars to garage II.

As can be observed in Figure 6.6, the repair estimates for a given car are about the same value, but there is a large variability in the estimates from each garage. The large variability among the 15 estimates from each garage diminishes the relative size of any difference between the two garages. When designing the study, the adjusters recognized that the large differences in the amount of damage suffered by the cars would result in a large variability in the 15 estimates at both garages. By having both garages give an estimate on each car, the adjusters could calculate the difference between the estimates from the garages and hence reduce the large car-to-car variability.

This example illustrates a general design principle. In many situations, the available experimental units may be considerably different, prior to their random assignment to the treatments, with respect to characteristics that may affect the experimental responses. These differences will often mask true treatment differences.
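These calculations are easy to verify directly. The following Python sketch (not part of the original text) reruns the two-sample t test on the Table 6.14 data and recomputes the binomial tail probability; the list and variable names are ours, and SciPy is assumed to be available.

```python
# Sketch (not from the text): the two-sample t test and the binomial
# calculation for the repair-estimate data of Table 6.14.
from math import comb
from scipy import stats

garage1 = [17.6, 20.2, 19.5, 11.3, 13.0, 16.3, 15.3, 16.2, 12.2, 14.8,
           21.3, 22.1, 16.9, 17.6, 18.4]
garage2 = [17.3, 19.1, 18.4, 11.5, 12.7, 15.8, 14.9, 15.3, 12.0, 14.2,
           21.0, 21.0, 16.1, 16.7, 17.5]

# Two-sample (unpooled) t test that ignores the pairing -- matches T = 0.55, P = 0.59.
t_stat, p_value = stats.ttest_ind(garage1, garage2, equal_var=False)
print(f"two-sample t = {t_stat:.2f}, p-value = {p_value:.2f}")

# Binomial calculation: P(y = 14 or 15) when n = 15 and p = .5.
p_binom = sum(comb(15, y) * 0.5**15 for y in (14, 15))
print(f"P(y = 14 or 15) = {p_binom:.6f}")   # approximately .000488
```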
In the previous example, the cars had large differences in the amount of damage suffered during the accident and hence would be expected to have large differences in their repair estimates no matter which garage gave the repair estimate. When comparing two treatments or groups in which the available experimental units have important differences prior to their assignment to the treatments or groups, the samples should be paired. There are many ways to design experiments that yield paired data. One method involves having the same group of experimental units receive both treatments, as was done in the repair estimates example. A second method involves having measurements taken before and after the treatment is applied to the experimental units. For example, suppose we want to study the effect of a new medicine proposed to reduce blood pressure. We would record the blood pressure of participants before they received the medicine and then again after receiving the medicine. A third design procedure uses naturally occurring pairs, such as twins or husbands and wives. A final method pairs the experimental units with respect to factors that may mask differences in the treatments. For example, a study is proposed to evaluate two methods for teaching remedial reading. The participants could be paired based on a pretest of their reading ability. After pairing the participants, the two methods are randomly assigned to the participants within each pair.

FIGURE 6.6  Repair estimates from two garages (scatterplot of the garage I estimate versus the garage II estimate for each of the 15 cars; both axes run from 10 to 23)

A proper analysis of paired data needs to take into account the lack of independence between the two samples. The sampling distribution for the difference in the sample means, $\bar{y}_1 - \bar{y}_2$, will have mean and standard error

$$\mu_{\bar{y}_1 - \bar{y}_2} = \mu_1 - \mu_2 \quad\text{and}\quad \sigma_{\bar{y}_1 - \bar{y}_2} = \sqrt{\frac{\sigma_1^2 + \sigma_2^2 - 2\sigma_1\sigma_2 r}{n}}$$

where r measures the amount of dependence between the two samples. When the two samples produce similar measurements, r is positive and the standard error of $\bar{y}_1 - \bar{y}_2$ is smaller than what would be obtained using two independent samples. This was the case in the repair estimates data. The size and sign of r can be determined by examining the plot of the paired data values. The magnitude of r is large when the plotted points are close to a straight line. The sign of r is positive when the plotted points follow an increasing line and negative when they follow a decreasing line. From Figure 6.6, we observe that the estimates are close to an increasing line, and thus r will be positive. The use of paired data in the repair estimate study will therefore reduce the variability in the standard error of the difference in the sample means in comparison to using independent samples.
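To see how much the dependence helps numerically, the short sketch below (not part of the original text) uses the sample analog of the formula above together with the summary statistics reported for the repair data; because $s_d^2 = s_1^2 + s_2^2 - 2s_1s_2r$ holds exactly for sample quantities, the implied correlation can be backed out from $s_1$, $s_2$, and $s_d$. The rounded inputs make the results approximate, and the variable names are ours.

```python
# Sketch (not from the text): standard error of y1bar - y2bar with and
# without accounting for the pairing, using the Table 6.14 summary statistics.
import numpy as np

n = 15
s1, s2 = 3.20, 2.94      # garage I and garage II standard deviations
s_d = 0.394              # standard deviation of the 15 paired differences (Example 6.8)

# s_d^2 = s1^2 + s2^2 - 2*s1*s2*r, so the implied sample correlation is:
r = (s1**2 + s2**2 - s_d**2) / (2 * s1 * s2)

se_independent = np.sqrt((s1**2 + s2**2) / n)                 # ignores the pairing
se_paired = np.sqrt((s1**2 + s2**2 - 2 * s1 * s2 * r) / n)    # equals s_d / sqrt(n)

print(f"implied r = {r:.3f}")
print(f"SE ignoring the pairing:    {se_independent:.3f}")
print(f"SE accounting for pairing:  {se_paired:.3f}")
```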

The actual analysis of paired data requires us to compute the differences in the n pairs of measurements, $d_i = y_{1i} - y_{2i}$, and obtain $\bar{d}$ and $s_d$, the mean and standard deviation of the $d_i$s. Also, we must formulate the hypotheses about $\mu_1$ and $\mu_2$ as hypotheses about the mean of the differences, $\mu_d = \mu_1 - \mu_2$. The conditions required to develop a t procedure for testing hypotheses and constructing confidence intervals for $\mu_d$ are

1. The sampling distribution of the $d_i$s is a normal distribution.
2. The $d_i$s are independent; that is, the pairs of observations are independent.

A summary of the test procedure is given here.

Paired t test

H_0:  1. $\mu_d \le D_0$   ($D_0$ is a specified value, often 0)
      2. $\mu_d \ge D_0$
      3. $\mu_d = D_0$

H_a:  1. $\mu_d > D_0$
      2. $\mu_d < D_0$
      3. $\mu_d \ne D_0$

T.S.:  $t = \dfrac{\bar{d} - D_0}{s_d / \sqrt{n}}$

R.R.:  For a level $\alpha$ Type I error rate and with df = n - 1,
      1. Reject $H_0$ if $t \ge t_\alpha$.
      2. Reject $H_0$ if $t \le -t_\alpha$.
      3. Reject $H_0$ if $|t| \ge t_{\alpha/2}$.

Check assumptions and draw conclusions.

The corresponding 100(1 - $\alpha$)% confidence interval on $\mu_d = \mu_1 - \mu_2$ based on the paired data is shown here.

100(1 - $\alpha$)% Confidence Interval for $\mu_d$ Based on Paired Data

$$\bar{d} \pm t_{\alpha/2}\,\frac{s_d}{\sqrt{n}}$$

where n is the number of pairs of observations (and hence the number of differences) and df = n - 1.

EXAMPLE 6.8

Refer to the data of Example 6.7 and perform a paired t test. Draw a conclusion based on $\alpha = .05$.

Solution  For these data, the parts of the statistical test are

H_0: $\mu_d = \mu_1 - \mu_2 \le 0$
H_a: $\mu_d > 0$
T.S.: $t = \dfrac{\bar{d}}{s_d / \sqrt{n}}$
R.R.: For df = n - 1 = 14, reject $H_0$ if $t \ge t_{.05}$.

Before computing t, we must first calculate $\bar{d}$ and $s_d$. For the data of Table 6.14, we have the differences $d_i$ = garage I estimate - garage II estimate (see Table 6.15).

TABLE 6.15  Difference data from Table 6.14

Car    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
d_i   .3  1.1  1.1  -.2   .3   .5   .4   .9   .2   .6   .3  1.1   .8   .9   .9

The mean and standard deviation of the differences are

$$\bar{d} = .61 \quad\text{and}\quad s_d = .394$$

Substituting into the test statistic t, we have

$$t = \frac{\bar{d} - 0}{s_d/\sqrt{n}} = \frac{.61}{.394/\sqrt{15}} = 6.00$$

Indeed, t = 6.00 is far beyond all tabulated t values for df = 14, so the p-value is less than .005; in fact, the p-value is .000016. We conclude that the mean repair estimate for garage I is greater than that for garage II. This conclusion agrees with our intuitive finding based on the binomial distribution.

The point of all this discussion is not to suggest that we typically have two or more analyses that may give very conflicting results for a given situation. Rather, the point is that the analysis must fit the experimental situation; and for this experiment, the samples are dependent, demanding that we use an analysis appropriate for dependent (paired) data.

After determining that there is a statistically significant difference in the means, we should estimate the size of the difference. A 95% confidence interval for $\mu_1 - \mu_2 = \mu_d$ will provide an estimate of the size of the difference in the average repair estimate between the two garages:

$$\bar{d} \pm t_{\alpha/2}\,\frac{s_d}{\sqrt{n}} \qquad .61 \pm 2.145\,\frac{.394}{\sqrt{15}} \qquad\text{or}\qquad .61 \pm .22$$

Thus, we are 95% confident that the mean repair estimates differ by a value between .39 and .83 (hundred dollars), that is, between 39 and 83 dollars. The insurance adjusters determined that a difference of this size is of practical significance.

The reduction in the standard error of $\bar{y}_1 - \bar{y}_2$ obtained by using the differences $d_i$ in place of the observed values $y_{1i}$ and $y_{2i}$ will produce a t test having greater power and confidence intervals having smaller width. Is there any loss in using paired data experiments? Yes: the t procedures using the $d_i$s have df = n - 1, whereas the t procedures using the individual measurements have df = $n_1 + n_2 - 2 = 2(n - 1)$. Thus, when designing a study or experiment, the choice between using an independent samples experiment and a paired data experiment will depend on how much difference exists among the experimental units prior to their assignment to the treatments. If there are only small differences, then the independent samples design is more efficient. If the differences among the experimental units are extreme, then the paired data design is more efficient.
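For readers who want to check Example 6.8 in software, here is a minimal Python sketch (not part of the original text). It works directly from the differences in Table 6.15, assumes SciPy is available, and uses array names of our choosing.

```python
# Sketch (not from the text): the paired t analysis of Example 6.8
# computed from the differences of Table 6.15.
import numpy as np
from scipy import stats

d = np.array([.3, 1.1, 1.1, -.2, .3, .5, .4, .9, .2, .6, .3, 1.1, .8, .9, .9])
n = len(d)
d_bar, s_d = d.mean(), d.std(ddof=1)

# Test statistic for H0: mu_d <= 0 versus Ha: mu_d > 0.
t_stat = d_bar / (s_d / np.sqrt(n))
p_value = stats.t.sf(t_stat, df=n - 1)          # one-sided p-value
print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.3f}, t = {t_stat:.2f}, p = {p_value:.6f}")

# 95% confidence interval for mu_d: d_bar +/- t_{.025, n-1} * s_d / sqrt(n).
half_width = stats.t.ppf(0.975, df=n - 1) * s_d / np.sqrt(n)
print(f"95% CI for mu_d: ({d_bar - half_width:.2f}, {d_bar + half_width:.2f})")
```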

6.5 A Nonparametric Alternative: The Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test, which makes use of the sign and the magnitude of the rank of the differences between pairs of measurements, provides an alternative to the paired t test when the population distribution of the differences is nonnormal. The Wilcoxon signed-rank test requires that the population distribution of differences be symmetric about the unknown median M. Let $D_0$ be a specified hypothesized value of M. The test evaluates shifts in the distribution of differences to the right or left of $D_0$; in most cases, $D_0$ is 0. The computation of the signed-rank test involves the following steps:

1. Calculate the differences in the n pairs of observations.
2. Subtract $D_0$ from all the differences.
3. Delete all zero values. Let n be the number of nonzero values.
4. List the absolute values of the differences in increasing order, and assign them the ranks 1, . . . , n (or the average of the ranks for ties).

We define the following notation before describing the Wilcoxon signed-rank test:

n = the number of pairs of observations with a nonzero difference
$T^+$ = the sum of the positive ranks; if there are no positive ranks, $T^+ = 0$
$T^-$ = the sum of the negative ranks; if there are no negative ranks, $T^- = 0$
T = the smaller of $T^+$ and $T^-$

If we group together all differences assigned the same rank, and there are g such groups, the variance of T is

$$\sigma_T^2 = \frac{1}{24}\left[n(n+1)(2n+1) - \frac{1}{2}\sum_j t_j(t_j - 1)(t_j + 1)\right]$$

where $t_j$ is the number of tied ranks in the jth group. Note that if there are no tied ranks, g = n and $t_j = 1$ for all groups. The formula then reduces to

$$\sigma_T^2 = \frac{n(n+1)(2n+1)}{24}$$
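The ranking and tie-handling steps above are mechanical, so a small helper makes them concrete. The following Python sketch is not part of the original text; the function name and structure are ours, and it simply follows the four steps and the tie-corrected variance formula.

```python
# Sketch (not from the text): compute T+, T-, T, and the tie-corrected
# variance of T for a list of paired differences.
from collections import Counter

def signed_rank_summary(differences, d0=0.0):
    """Return (n, T_plus, T_minus, T, var_T) for the Wilcoxon signed-rank test."""
    # Steps 1-3: subtract D0 and drop zero differences.
    d = [x - d0 for x in differences if x - d0 != 0]
    n = len(d)

    # Step 4: rank the absolute differences, using average ranks for ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j + 2) / 2            # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    t_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    t_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    t_stat = min(t_plus, t_minus)

    # Tie-corrected variance: t_j is the size of each group of tied ranks.
    group_sizes = Counter(ranks).values()
    correction = sum(t * (t - 1) * (t + 1) for t in group_sizes) / 2
    var_t = (n * (n + 1) * (2 * n + 1) - correction) / 24
    return n, t_plus, t_minus, t_stat, var_t
```

Groups with a single member contribute nothing to the correction term, so the variance automatically reduces to $n(n+1)(2n+1)/24$ when there are no ties.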

The Wilcoxon signed-rank test is presented here. Let M be the median of the population of differences.

Wilcoxon Signed-Rank Test

H_0:  $M = D_0$   ($D_0$ is specified; generally $D_0$ is set to 0.)

H_a:  1. $M > D_0$
      2. $M < D_0$
      3. $M \ne D_0$

n ≤ 50:

T.S.:  1. $T = T^-$
       2. $T = T^+$
       3. T = smaller of $T^+$ and $T^-$

R.R.:  For a specified value of $\alpha$ (one-tailed .05, .025, .01, or .005; two-tailed .10, .05, .02, .01) and a fixed number of nonzero differences n, reject $H_0$ if the value of T is less than or equal to the appropriate entry in Table 6 in the Appendix.

n > 50:

T.S.:  Compute the test statistic

$$z = \frac{T - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}}$$

R.R.:  For cases 1 and 2, reject $H_0$ if $z < -z_\alpha$; for case 3, reject $H_0$ if $z < -z_{\alpha/2}$.

Check assumptions, place a confidence interval on the median of the differences, and state conclusions.

EXAMPLE 6.9

A city park department compared a new formulation of a fertilizer, brand A, to the previously used fertilizer, brand B, on each of 20 different softball fields. Each field was divided in half, with brand A randomly assigned to one half of the field and brand B to the other. Sixty pounds of fertilizer per acre were then applied to the fields. The effect of the fertilizer on the grass grown at each field was measured by the weight (in pounds) of grass clippings produced by mowing the grass at the fields over a 1-month period. Evaluate whether brand A tends to produce more grass than brand B. The data are given in Table 6.16.

TABLE 6.16

Field   Brand A   Brand B   Difference      Field   Brand A   Brand B   Difference
  1      211.4     186.3       25.1           11     208.9     183.6       25.3
  2      204.4     205.7       -1.3           12     208.7     188.7       20.0
  3      202.0     184.4       17.6           13     213.8     188.6       25.2
  4      201.9     203.6       -1.7           14     201.6     204.2       -2.6
  5      202.4     180.4       22.0           15     201.8     181.6       20.1
  6      202.0     202.0        0.0           16     200.3     208.7       -8.4
  7      202.4     181.5       20.9           17     201.8     181.5       20.3
  8      207.1     186.7       20.4           18     201.5     208.7       -7.2
  9      203.6     205.7       -2.1           19     212.1     186.8       25.3
 10      216.0     189.1       26.9           20     203.4     182.9       20.5

Solution  Plots of the differences in grass yields for the 20 fields are given in Figure 6.7(a)
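As a supplement (not part of the original text), the sketch below runs the Wilcoxon signed-rank test on the differences from Table 6.16 with SciPy; it assumes a SciPy version that supports the alternative= argument, and the list name is ours. The zero difference for field 6 is dropped, as in step 3 above, leaving n = 19 nonzero differences.

```python
# Sketch (not from the text): Wilcoxon signed-rank test for the brand A
# versus brand B differences of Table 6.16.
from scipy import stats

# Differences (brand A minus brand B) for the 20 fields.
d = [25.1, -1.3, 17.6, -1.7, 22.0, 0.0, 20.9, 20.4, -2.1, 26.9,
     25.3, 20.0, 25.2, -2.6, 20.1, -8.4, 20.3, -7.2, 25.3, 20.5]

# One-sided test: does brand A tend to produce more grass than brand B?
# zero_method='wilcox' discards the zero difference (field 6), leaving n = 19.
result = stats.wilcoxon(d, zero_method='wilcox', alternative='greater')
print(f"statistic = {result.statistic}, one-sided p-value = {result.pvalue:.4f}")
```

Note that the statistic SciPy reports for a one-sided alternative is a rank sum rather than the T compared against Table 6 in the Appendix, so the p-value is the quantity to interpret here.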