Inferences about μ When Population Is Nonnormal

to obtain percentiles. Also, the sample size n is relatively small, so we cannot be confident that the Central Limit Theorem applies and that the z-tables can be used to obtain percentiles for constructing confidence intervals or testing hypotheses. The bootstrap technique consists of the following steps:

1. Select a random sample $y_1, y_2, \ldots, y_n$ of size n from the population and compute the sample mean, $\bar{y}$, and sample standard deviation, $s$.
2. Select a random sample of size n, with replacement, from $y_1, y_2, \ldots, y_n$, yielding $y_1^*, y_2^*, \ldots, y_n^*$.
3. Compute the mean $\bar{y}^*$ and standard deviation $s^*$ of $y_1^*, y_2^*, \ldots, y_n^*$.
4. Compute the value of the statistic

   $$\hat{t} = \frac{\bar{y}^* - \bar{y}}{s^*/\sqrt{n}}$$

5. Repeat Steps 2–4 a large number of times B to obtain $\hat{t}_1, \hat{t}_2, \ldots, \hat{t}_B$. Use these values to obtain an approximation to the sampling distribution of $\dfrac{\bar{y} - \mu}{s/\sqrt{n}}$.

Suppose we have n = 20 and we select B = 1,000 bootstrap samples. The steps in obtaining the bootstrap approximation to the sampling distribution of $\dfrac{\bar{y} - \mu}{s/\sqrt{n}}$ are depicted here.

Obtain random sample $y_1, y_2, \ldots, y_{20}$ from population, and compute $\bar{y}$ and $s$.
First bootstrap sample: $y_1^*, y_2^*, \ldots, y_{20}^*$ yields $\bar{y}^*$, $s^*$, and $\hat{t}_1 = \dfrac{\bar{y}^* - \bar{y}}{s^*/\sqrt{20}}$
Second bootstrap sample: $y_1^*, y_2^*, \ldots, y_{20}^*$ yields $\bar{y}^*$, $s^*$, and $\hat{t}_2 = \dfrac{\bar{y}^* - \bar{y}}{s^*/\sqrt{20}}$
. . .
Bth bootstrap sample: $y_1^*, y_2^*, \ldots, y_{20}^*$ yields $\bar{y}^*$, $s^*$, and $\hat{t}_B = \dfrac{\bar{y}^* - \bar{y}}{s^*/\sqrt{20}}$

We then use the B values of $\hat{t}$: $\hat{t}_1, \hat{t}_2, \ldots, \hat{t}_B$ to obtain the approximate percentiles. For example, suppose we want to construct a 95% confidence interval for μ and B = 1,000. We need the lower and upper .025 percentiles, $\hat{t}_{.025}$ and $\hat{t}_{.975}$. Thus, we would take the 1,000(.025) = 25th smallest value of $\hat{t}$ as $\hat{t}_{.025}$ and the 1,000(1 − .025) = 975th smallest value of $\hat{t}$ as $\hat{t}_{.975}$. The approximate 95% confidence interval for μ would be

$$\left(\bar{y} - \hat{t}_{.025}\,\frac{s}{\sqrt{n}},\ \ \bar{y} + \hat{t}_{.975}\,\frac{s}{\sqrt{n}}\right)$$

where the lower limit subtracts the magnitude of $\hat{t}_{.025}$, which is typically negative.

EXAMPLE 5.18
Secondhand smoke is of great concern, especially when it involves young children. Breathing secondhand smoke can be harmful to children’s health, contributing to health problems such as asthma, Sudden Infant Death Syndrome (SIDS), bronchitis and pneumonia, and ear infections. The developing lungs of young children are severely affected by exposure to secondhand smoke. The Child Protective Services (CPS) in a city is concerned about the level of exposure to secondhand smoke for children placed by their agency in foster parents’ care. A method of determining level of exposure is to determine the urinary concentration of cotinine, a metabolite of nicotine. Unexposed children will typically have mean cotinine levels of 75 or less. A random sample of 20 children suspected of being exposed to secondhand smoke yielded the following urinary concentrations of cotinine:

29, 30, 53, 75, 89, 34, 21, 12, 58, 84,
92, 117, 115, 119, 109, 115, 134, 253, 289, 287

CPS wants an estimate of the mean cotinine level in the children under their care. From the sample of 20 children, they compute $\bar{y}$ = 105.75 and s = 82.429. Construct a 95% confidence interval for the mean cotinine level for children under the supervision of CPS.

Solution
Because the sample size is relatively small, an assessment of whether the population has a normal distribution is crucial prior to using a confidence interval procedure based on the t distribution. Figure 5.20 displays a normal probability plot for the 20 data values. From the plot, we observe that the data do not fall near the straight line, and the p-value for the test of normality is less than .01. Thus, we would conclude that the data do not appear to follow a normal distribution. The confidence interval based on the t distribution would not be appropriate; hence, we will use a bootstrap confidence interval.

FIGURE 5.20 Normal probability plot for cotinine data (vertical axis: Percent; horizontal axis: C1; legend: Mean 105.8, StDev 82.43, N 20, RJ .917, p-value .010)
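The bootstrap steps listed above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used to produce the values reported in this example: the function names, the seed, and the rule for redrawing a resample whose standard deviation is zero are all assumptions made here. The interval is computed as $\bar{y}$ plus each signed percentile times $s/\sqrt{n}$, which reproduces the arithmetic of the interval formula above when the lower percentile is negative.

```python
import math
import random
import statistics

# Urinary cotinine concentrations for the 20 sampled children (Example 5.18).
COTININE = [29, 30, 53, 75, 89, 34, 21, 12, 58, 84,
            92, 117, 115, 119, 109, 115, 134, 253, 289, 287]


def bootstrap_t_values(data, n_boot=1000, seed=0):
    """Return the sorted bootstrap statistics (ybar* - ybar) / (s* / sqrt(n))."""
    rng = random.Random(seed)  # the seed is an arbitrary choice, for repeatability
    n = len(data)
    ybar = statistics.mean(data)
    t_hats = []
    while len(t_hats) < n_boot:
        resample = rng.choices(data, k=n)      # Step 2: n draws with replacement
        s_star = statistics.stdev(resample)    # Step 3: s*
        if s_star == 0:
            continue  # assumption: redraw a degenerate resample (all values equal)
        ybar_star = statistics.mean(resample)  # Step 3: ybar*
        t_hats.append((ybar_star - ybar) / (s_star / math.sqrt(n)))  # Step 4
    return sorted(t_hats)


def bootstrap_t_interval(data, level=0.95, n_boot=1000, seed=0):
    """Approximate confidence interval for the mean from the bootstrap percentiles."""
    n = len(data)
    ybar = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)
    t_hats = bootstrap_t_values(data, n_boot, seed)
    k = max(1, round(n_boot * (1 - level) / 2))  # 25 when n_boot=1000, level=0.95
    t_lo = t_hats[k - 1]            # 25th smallest value
    t_hi = t_hats[n_boot - k - 1]   # 975th smallest value
    # The lower percentile is normally negative, so adding it lowers the limit.
    return ybar + t_lo * se, ybar + t_hi * se


if __name__ == "__main__":
    low, high = bootstrap_t_interval(COTININE)
    print(f"approximate 95% bootstrap interval: ({low:.2f}, {high:.2f})")
```

Because the resampling is random, the percentiles and the resulting interval change with the seed and will not match any particular published run exactly.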

One thousand (B = 1,000) samples of size 20 are selected with replacement from the original sample. Table 5.7 displays 5 of the 1,000 samples to illustrate the nature of the bootstrap samples.

TABLE 5.7 Bootstrap samples

Original Sample       29   30   53   75   89   34   21   12   58   84
                      92  117  115  119  109  115  134  253  289  287
Bootstrap Sample 1    29   21   12  115   21   89   29   30   21   89
                      30   84   84  134   58   30   34   89   29  134
Bootstrap Sample 2    30   92   75  109  115  117   84   89  119  289
                     115   75   21   92  109   12  289   58   92   30
Bootstrap Sample 3    53  289   30   92   30  253   89   89   75  119
                     115  117  253   53   84   34   58  289   92  134
Bootstrap Sample 4    75   21  115  287  119   75   75   53   34   29
                     117  115   29  115  115  253  289  134   53   75
Bootstrap Sample 5    89  119  109  109  115  119   12   29   84   21
                      34  134  115  134   75   58   30   75  109  134

Upon examination of Table 5.7, it can be observed that in each of the bootstrap samples there are repetitions of some of the original data values. This arises due to the sampling with replacement. The following histogram of the 1,000 values of $\hat{t} = \dfrac{\bar{y}^* - \bar{y}}{s^*/\sqrt{n}}$ illustrates the effect of the nonnormal nature of the population distribution on the sampling distribution of the t statistic. If the sample had been randomly selected from a normal distribution, the histogram would be symmetric, as was depicted in Figure 5.14. The histogram in Figure 5.21 is somewhat left-skewed.

FIGURE 5.21 Histogram of bootstrapped t-statistic (vertical axis: Frequency; horizontal axis: Values of bootstrap t)

After sorting the 1,000 values of $\hat{t}$ from smallest to largest, we obtain the 25th smallest and 25th largest values, −3.288 and 1.776, respectively.
We thus have the following percentiles:

$\hat{t}_{.025} = -3.288$ and $\hat{t}_{.975} = 1.776$

The 95% confidence interval for the mean cotinine concentration is given here using the original sample mean of $\bar{y}$ = 105.75 and sample standard deviation s = 82.429:

$$\left(\bar{y} - \hat{t}_{.025}\,\frac{s}{\sqrt{n}},\ \ \bar{y} + \hat{t}_{.975}\,\frac{s}{\sqrt{n}}\right) \Rightarrow \left(105.75 - 3.288\,\frac{82.429}{\sqrt{20}},\ \ 105.75 + 1.776\,\frac{82.429}{\sqrt{20}}\right) \Rightarrow (45.15,\ 138.48)$$

Here the lower limit subtracts the magnitude of the negative percentile $\hat{t}_{.025} = -3.288$.

A comparison of these two percentiles to the percentiles from the t distribution (Table 2 in the Appendix) reveals how much in error our confidence intervals would have been if we had directly applied the formulas from Section 5.7. From Table 2 in the Appendix, with df = 19, we have $t_{.025} = -2.093$ and $t_{.975} = 2.093$. This would yield a 95% confidence interval on μ of

$$105.75 \pm 2.093\,\frac{82.429}{\sqrt{20}} \Rightarrow (67.17,\ 144.33)$$

Note that the confidence interval using the t distribution is centered about the sample mean, whereas the bootstrap confidence interval has its lower limit further from the mean than its upper limit. This is due to the fact that the random sample from the population indicated that the population distribution was not symmetric. Thus, we would expect that the sampling distribution of our statistic would not be symmetric due to the relatively small sample size, n = 20.

We will next apply the bootstrap approximation of the test statistic $t = \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$ to obtain a test of hypotheses for the situation where n is relatively small and the population distribution is nonnormal. The method for obtaining the p-value for the bootstrap approximation to the sampling distribution of the test statistic under the null value of μ, $\mu_0$, involves the following steps. Suppose we want to test the following hypotheses:

$H_0: \mu \le \mu_0$ versus $H_a: \mu > \mu_0$

1. Select a random sample $y_1, y_2, \ldots, y_n$ of size n from the population and compute the value of $t = \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$.
2. Select a random sample of size n, with replacement, from $y_1, y_2, \ldots, y_n$, yielding $y_1^*, y_2^*, \ldots, y_n^*$, and compute the mean $\bar{y}^*$ and standard deviation $s^*$ of $y_1^*, y_2^*, \ldots, y_n^*$.
3. Compute the value of the statistic $\hat{t} = \dfrac{\bar{y}^* - \bar{y}}{s^*/\sqrt{n}}$.
4. Repeat Steps 2–3 a large number of times B to form the approximate sampling distribution of $\dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$.
5. Let m be the number of values of the statistic $\hat{t}$ that are greater than or equal to the value t computed from the original sample.
6. The bootstrap p-value is $m/B$.

When the hypotheses are $H_0: \mu \ge \mu_0$ versus $H_a: \mu < \mu_0$, the only change would be to let m be the number of values of the statistic $\hat{t}$ that are less than or equal to the value t computed from the original sample. Finally, when the hypotheses are $H_0: \mu = \mu_0$ versus $H_a: \mu \ne \mu_0$, let $m_L$ be the number of values of the statistic $\hat{t}$ that are less than or equal to the value t computed from the original sample and $m_U$ be the number of values of the statistic $\hat{t}$ that are greater than or equal to the value t computed from the original sample. Compute $p_L = m_L/B$ and $p_U = m_U/B$. Take the p-value to be the minimum of $2p_L$ and $2p_U$.

A point of clarification concerning the procedure described above: The bootstrap test statistic replaces $\mu_0$ with the sample mean from the original sample. Recall that when we calculate the p-value of a test statistic, the calculation is always done under the assumption that the null hypothesis is true. In our bootstrap procedure, this requirement results in the bootstrap test statistic having $\mu_0$ replaced with the sample mean from the original sample. This ensures that our bootstrap approximation of the sampling distribution of the test statistic is under the null value of μ, $\mu_0$.

EXAMPLE 5.19
Refer to Example 5.18. The CPS personnel wanted to determine if the mean cotinine level was greater than 75 for children under their supervision. Based on the sample of 20 children and using α = .05, do the data support the contention that the mean exceeds 75?

Solution
The set of hypotheses that we want to test are

$H_0: \mu \le 75$ versus $H_a: \mu > 75$

Because there was a strong indication that the distribution of cotinine levels in the population of children under CPS supervision was not normal and because the sample size n was relatively small, the use of the t distribution to compute the p-value may result in a very erroneous decision based on the observed data. Therefore, we will use the bootstrap procedure. First, we calculate the value of the test statistic in the original data:

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} = \frac{105.75 - 75}{82.429/\sqrt{20}} = 1.668$$

Next, we use the 1,000 bootstrap samples generated in Example 5.18 to determine the number of samples, m, with $\hat{t} = \dfrac{\bar{y}^* - 105.75}{s^*/\sqrt{20}}$ greater than 1.668. From the 1,000 values of $\hat{t}$, we find that m = 33 of the B = 1,000 values of $\hat{t}$ exceeded 1.668. Therefore, our p-value = m/B = 33/1000 = .033 < .05 = α. We conclude that there is sufficient evidence that the mean cotinine level exceeds 75 in the population of children under CPS supervision.

It is interesting to note that if we had used the t distribution with 19 degrees of freedom to compute the p-value, the result would have produced a different conclusion. From Table 2 in the Appendix,

p-value = Pr[t ≥ 1.668] = .056 > .05 = α

Using the t-tables, we would conclude there is insufficient evidence in the data to support the contention that the mean cotinine level exceeds 75. The small sample size, n = 20, and the possibility of nonnormal data would make this conclusion suspect.

Minitab Steps for Obtaining Bootstrap Sample

The steps needed to generate the bootstrap samples are relatively straightforward in most software programs. We will illustrate these steps using the Minitab software. Suppose we have a random sample of 25 observations from a population.
We want to generate 1,000 bootstrap samples, each consisting of 25 values selected randomly with replacement from the original 25 data values.

1. Insert the original 25 data values in column C1.
2. Choose Calc → Calculator.
   a. Select the expression MEAN(C1).
   b. Place K1 in the “Store result in variable:” box.
   c. Select the expression STDEV(C1).
   d. Place K2 in the “Store result in variable:” box.
   e. The constants K1 and K2 now contain the mean and standard deviation of the original data.
3. Choose Calc → Random Data → Sample From Columns.
4. Fill in the menu with the following:
   a. Check the box Sample with Replacement.
   b. Sample 1,000 rows from Columns: C1.
   c. Store samples in: Columns C2.
5. Repeat the above steps, replacing C2 with C3.
6. Continue repeating the above step until 1,000 data values have been placed in each of the columns C2–C26.
   a. The first row of columns C2–C26 represents Bootstrap Sample 1, the second row of columns C2–C26 represents Bootstrap Sample 2, . . . , and row 1,000 represents Bootstrap Sample 1,000.
7. To obtain the mean and standard deviation of each of the 1,000 samples and store them in columns C27 and C28, respectively, use the following steps:
   a. Choose Calc → Row Statistics, then fill in the menu as follows.
   b. Click on Mean.
   c. Input variables: C2–C26.
   d. Store result in: C27.
   e. Choose Calc → Row Statistics, then fill in the menu as follows.
   f. Click on Standard Deviation.
   g. Input variables: C2–C26.
   h. Store result in: C28.

The 1,000 bootstrap sample means and standard deviations are now stored in C27 and C28. The sampling distributions of the sample mean and the t statistic can now be obtained from C27 and C28 by graphing the data in C27 using a histogram and calculating the 1,000 values of the t statistic, $\hat{t} = \dfrac{\bar{y}^* - \bar{y}}{s^*/\sqrt{n}}$, using the following steps:

1. Choose Calc → Calculator.
2. Store results in C29.
3. In the Expression box: (C27-K1)/(C28/SQRT(25)).

The 1,000 values of the t statistic are now stored in C29. Next, sort the data in C29 by the following steps:

1. Select Data → Sort.
2. Column C29.
3. By C29.
4. Click on Original Columns.

The percentiles and p-values can now be obtained from these sorted values.
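For readers working outside Minitab, the same worksheet logic can be sketched in Python using only the standard library. This is a hedged illustration rather than an equivalent of any Minitab command: the function name `bootstrap_p_value`, its parameters, the seed, and the redraw of zero-variance resamples are assumptions introduced here. The comments indicate which worksheet column or constant each quantity corresponds to.

```python
import math
import random
import statistics

# Cotinine data from Example 5.18, used here only to exercise the sketch.
COTININE = [29, 30, 53, 75, 89, 34, 21, 12, 58, 84,
            92, 117, 115, 119, 109, 115, 134, 253, 289, 287]


def bootstrap_p_value(data, mu0, alternative="greater", n_boot=1000, seed=0):
    """Return (t, p): the observed t statistic and its bootstrap p-value."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability
    n = len(data)
    ybar = statistics.mean(data)   # constant K1
    s = statistics.stdev(data)     # constant K2
    t_obs = (ybar - mu0) / (s / math.sqrt(n))

    t_hats = []                    # column C29
    while len(t_hats) < n_boot:
        row = rng.choices(data, k=n)          # one row of the resampled columns
        s_star = statistics.stdev(row)        # column C28
        if s_star == 0:
            continue  # assumption: redraw a resample with no variation
        ybar_star = statistics.mean(row)      # column C27
        t_hats.append((ybar_star - ybar) / (s_star / math.sqrt(n)))

    p_upper = sum(t >= t_obs for t in t_hats) / n_boot  # m_U / B
    p_lower = sum(t <= t_obs for t in t_hats) / n_boot  # m_L / B
    if alternative == "greater":
        p = p_upper
    elif alternative == "less":
        p = p_lower
    elif alternative == "two-sided":
        p = min(2 * p_lower, 2 * p_upper)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return t_obs, p


if __name__ == "__main__":
    t_obs, p = bootstrap_p_value(COTININE, mu0=75, alternative="greater")
    print(f"t = {t_obs:.3f}, bootstrap p-value = {p:.3f}")
```

The observed statistic is deterministic, but the p-value depends on the random resamples, so it will vary from one seed to another.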

5.9 Inferences about the Median

When the population distribution is highly skewed or very heavily tailed, the median is more appropriate than the mean as a representation of the center of the population. Furthermore, as was demonstrated in Section 5.7, the t procedures for constructing confidence intervals and for tests of hypotheses for the mean are not appropriate when applied to random samples from such populations with small sample sizes. In this section, we will develop a test of hypotheses and a confidence interval for the population median that will be appropriate for all types of population distributions.

The estimator of the population median M is based on the order statistics that were discussed in Chapter 3. Recall that if the measurements from a random sample of size n are given by $y_1, y_2, \ldots, y_n$, then the order statistics are these values ordered from smallest to largest. Let $y_{(1)} \le y_{(2)} \le \ldots \le y_{(n)}$ represent the data in ordered fashion. Thus, $y_{(1)}$ is the smallest data value and $y_{(n)}$ is the largest data value. The estimator of the population median is the sample median $\hat{M}$. Recall that $\hat{M}$ is computed as follows:

If n is an odd number, then $\hat{M} = y_{(m)}$, where $m = (n + 1)/2$.
If n is an even number, then $\hat{M} = (y_{(m)} + y_{(m+1)})/2$, where $m = n/2$.

To take into account the variability of $\hat{M}$ as an estimator of M, we next construct a confidence interval for M. A confidence interval for the population median M may be obtained by using the binomial distribution with p = 0.5.

100(1 − α)% Confidence Interval for the Median
A confidence interval for M with level of confidence at least 100(1 − α)% is given by

$$(M_L,\ M_U) = \left(y_{(L_{\alpha/2})},\ y_{(U_{\alpha/2})}\right)$$

where

$$L_{\alpha/2} = C_{\alpha(2),n} + 1, \qquad U_{\alpha/2} = n - C_{\alpha(2),n}$$

Table 4 in the Appendix contains values for $C_{\alpha(2),n}$, which are percentiles from a binomial distribution with p = .5.
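As a rough sketch of how this interval can be computed without the table, the Python code below derives $C_{\alpha(2),n}$ directly from the binomial distribution. It assumes that $C_{\alpha(2),n}$ is the largest integer c with Pr[Bin(n, .5) ≤ c] ≤ α/2, which is the reading that makes the stated confidence level "at least" 100(1 − α)%; the function names are hypothetical, and the cotinine data from Example 5.18 are reused purely as an illustration.

```python
import math


def binom_half_cdf(c, n):
    """Pr[Bin(n, 0.5) <= c], computed exactly from binomial coefficients."""
    return sum(math.comb(n, k) for k in range(c + 1)) / 2 ** n


def median_confidence_interval(data, alpha=0.05):
    """Return (M_L, M_U, exact_level) for the population median."""
    n = len(data)
    ordered = sorted(data)  # order statistics y_(1) <= ... <= y_(n)
    # Assumed definition: C is the largest c with Pr[Bin(n, .5) <= c] <= alpha/2.
    c = -1
    while binom_half_cdf(c + 1, n) <= alpha / 2:
        c += 1
    if c < 0:
        raise ValueError("sample too small for the requested confidence level")
    lower_rank = c + 1      # L = C + 1
    upper_rank = n - c      # U = n - C
    exact_level = 1 - 2 * binom_half_cdf(c, n)
    return ordered[lower_rank - 1], ordered[upper_rank - 1], exact_level


if __name__ == "__main__":
    cotinine = [29, 30, 53, 75, 89, 34, 21, 12, 58, 84,
                92, 117, 115, 119, 109, 115, 134, 253, 289, 287]
    m_l, m_u, level = median_confidence_interval(cotinine, alpha=0.05)
    print(f"({m_l}, {m_u}) with exact level {level:.4f}")
```

For the 20 cotinine values with α = .05, this definition gives C = 5, so the interval is $(y_{(6)}, y_{(15)})$ = (53, 117), with an exact level of about .959.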
Because the confidence limits are computed using the binomial distribution, which is a discrete distribution, the level of confidence of $(M_L, M_U)$ will generally be somewhat larger than the specified 100(1 − α)%. The exact level of confidence is given by

$$\text{Level} = 1 - 2\Pr[\text{Bin}(n, .5) \le C_{\alpha(2),n}]$$

The following example will demonstrate the construction of the interval.

EXAMPLE 5.20
The sanitation department of a large city wants to investigate ways to reduce the amount of recyclable materials that are placed in the city’s landfill. By separating the recyclable material from the remaining garbage, the city could prolong the life of the landfill site. More important, the number of trees needed to be harvested for paper products and the aluminum needed for cans could be greatly reduced. From an analysis of recycling records from other cities, it is determined that if the average weekly amount of recyclable material is more than 5 pounds per household, a commercial recycling firm could make a profit collecting the material. To determine the feasibility of the recycling plan, a random sample of 25 households is selected. The weekly weight of recyclable material (in pounds/week) for each household is given here.

14.2   5.3   2.9   4.2   1.2
 4.3   1.1   2.6   6.7   7.8
25.9  43.8   2.7   5.6   7.8
 3.9   4.7   6.5  29.5   2.1
34.8   3.6   5.8   4.5   6.7

Determine an appropriate measure of the amount of recyclable waste from a typical household in the city.

FIGURE 5.22(a) Boxplot for waste data (vertical axis: Recyclable wastes, pounds per week)
FIGURE 5.22(b) Normal probability plot for waste data (vertical axis: Probability; horizontal axis: Recyclable waste, pounds per week)