Inferences about the Median

Because the confidence limits are computed using the binomial distribution, which is a discrete distribution, the level of confidence of $(M_L, M_U)$ will generally be somewhat larger than the specified $100(1 - \alpha)\%$. The exact level of confidence is given by

$$\text{Level} = 1 - 2\Pr[\text{Bin}(n, .5) \le C_{\alpha(2),n}]$$

The following example will demonstrate the construction of the interval.

EXAMPLE 5.20

The sanitation department of a large city wants to investigate ways to reduce the amount of recyclable materials that are placed in the city’s landfill. By separating the recyclable material from the remaining garbage, the city could prolong the life of the landfill site. More important, the number of trees that must be harvested for paper products and the amount of aluminum needed for cans could be greatly reduced. From an analysis of recycling records from other cities, it is determined that if the average weekly amount of recyclable material is more than 5 pounds per household, a commercial recycling firm could make a profit collecting the material. To determine the feasibility of the recycling plan, a random sample of 25 households is selected. The weekly weight of recyclable material (in pounds/week) for each household is given here.

14.2  5.3  2.9  4.2  1.2
4.3  1.1  2.6  6.7  7.8
25.9  43.8  2.7  5.6  7.8
3.9  4.7  6.5  29.5  2.1
34.8  3.6  5.8  4.5  6.7

Determine an appropriate measure of the amount of recyclable waste from a typical household in the city.

FIGURE 5.22(a) Boxplot for waste data (vertical axis: recyclable wastes, pounds per week)

FIGURE 5.22(b) Normal probability plot for waste data (horizontal axis: recyclable waste, pounds per week; vertical axis: probability)

Solution A boxplot and normal probability plot of the recyclable waste data (Figure 5.22(a) and (b)) reveal the extreme right skewness of the data. Thus, the mean is not an appropriate representation of the typical household’s potential recyclable material. The sample median and a confidence interval on the population median are given by the following computations. First, we order the data from smallest value to largest value:

1.1  1.2  2.1  2.6  2.7  2.9  3.6  3.9  4.2  4.3  4.5  4.7  5.3
5.6  5.8  6.5  6.7  6.7  7.8  7.8  14.2  25.9  29.5  34.8  43.8

The number of values in the data set is an odd number, so the sample median is given by

$$\hat{M} = y_{((25+1)/2)} = y_{(13)} = 5.3$$

The sample mean is calculated to be $\bar{y} = 9.53$. Thus, 20 of the 25 households have weekly recyclable waste below the sample mean. Note that 12 of the 25 waste values are less than and 12 of the 25 are greater than the sample median. Thus, the sample median is more representative of the typical household’s recyclable waste than is the sample mean.

Next we will construct a 95% confidence interval for the population median. From Table 4 in the Appendix, we find

$$C_{\alpha(2),n} = C_{.05,25} = 7$$

Thus,

$$L_{.025} = C_{.05,25} + 1 = 8$$
$$U_{.025} = n - C_{.05,n} = 25 - 7 = 18$$

The 95% confidence interval for the population median is given by

$$(M_L, M_U) = (y_{(8)}, y_{(18)}) = (3.9, 6.7)$$

Using the binomial distribution, the exact level of coverage is given by $1 - 2\Pr[\text{Bin}(25, .5) \le 7] = .957$, which is slightly larger than the desired level of 95%. Thus, we are at least 95% confident that the median amount of recyclable waste per household is between 3.9 and 6.7 pounds per week.

Large-Sample Approximation

When the sample size n is large, we can apply the normal approximation to the binomial distribution to obtain approximations to $C_{\alpha(2),n}$. The approximate value is given by

$$C_{\alpha(2),n} \approx \frac{n}{2} - z_{\alpha/2}\sqrt{\frac{n}{4}}$$

Because this approximate value for $C_{\alpha(2),n}$ is generally not an integer, we set $C_{\alpha(2),n}$ to be the largest integer that is less than or equal to the approximate value.

EXAMPLE 5.21

Using the data in Example 5.20, find a 95% confidence interval for the median using the approximation to $C_{\alpha(2),n}$.

Solution We have $n = 25$ and $\alpha = .05$. Thus, $z_{.05/2} = z_{.025} = 1.96$, and

$$C_{\alpha(2),n} \approx \frac{n}{2} - z_{\alpha/2}\sqrt{\frac{n}{4}} = \frac{25}{2} - 1.96\sqrt{\frac{25}{4}} = 7.6$$

Thus, we set $C_{\alpha(2),n} = 7$, and our confidence interval is identical to the interval constructed in Example 5.20. If n is larger than 30, the approximate and the exact value of $C_{\alpha(2),n}$ will often be the same integer.
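The interval construction in Examples 5.20 and 5.21 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the tabled percentile is assumed to be the largest integer C with Pr[Bin(n, .5) ≤ C] no greater than α/2, which reproduces the value 7 used above.

```python
from math import comb, floor, sqrt

def binom_cdf_half(c, n):
    """Pr[Bin(n, 0.5) <= c], computed exactly from binomial coefficients."""
    if c < 0:
        return 0.0
    return sum(comb(n, k) for k in range(min(c, n) + 1)) / 2 ** n

def critical_value(n, tail_prob):
    """Largest integer C with Pr[Bin(n, 0.5) <= C] <= tail_prob.

    Assumption: this is how the tabled percentile is defined.
    Returns -1 when no such C exists (n too small for the requested level).
    """
    c = -1
    while c + 1 <= n and binom_cdf_half(c + 1, n) <= tail_prob:
        c += 1
    return c

def median_confidence_interval(data, alpha=0.05):
    """Return (M_L, M_U, C, exact coverage) for the population median."""
    y = sorted(data)
    n = len(y)
    c = critical_value(n, alpha / 2)
    if c < 0:
        raise ValueError("sample too small for the requested confidence level")
    lower_rank = c + 1   # L_{alpha/2}
    upper_rank = n - c   # U_{alpha/2}
    coverage = 1 - 2 * binom_cdf_half(c, n)
    # Ranks are 1-based in the text; Python lists are 0-based.
    return y[lower_rank - 1], y[upper_rank - 1], c, coverage

def approx_critical_value(n, z):
    """Large-sample value: floor(n/2 - z * sqrt(n/4)); z is supplied by the caller."""
    return floor(n / 2 - z * sqrt(n / 4))

# Weekly recyclable waste (pounds) for the 25 households in Example 5.20.
waste = [14.2, 5.3, 2.9, 4.2, 1.2, 4.3, 1.1, 2.6, 6.7, 7.8,
         25.9, 43.8, 2.7, 5.6, 7.8, 3.9, 4.7, 6.5, 29.5, 2.1,
         34.8, 3.6, 5.8, 4.5, 6.7]

low, high, c, coverage = median_confidence_interval(waste)
print(low, high, c, round(coverage, 3))   # prints: 3.9 6.7 7 0.957
print(approx_critical_value(25, 1.96))    # prints: 7
```

Computing the coverage from the same binomial sum that defines C makes the gap between the nominal 95% and the exact level visible directly.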
In Example 5.20, the city wanted to determine whether the median amount of recyclable material was more than 5 pounds per household per week. We constructed a confidence interval for the median, but we still have not answered the question of whether the median is greater than 5. Thus, we need to develop a test of hypotheses for the median.

We will use the ideas developed for constructing a confidence interval for the median in our development of the testing procedures for hypotheses concerning a population median. In fact, a $100(1 - \alpha)\%$ confidence interval for the population median M can be used to test two-sided hypotheses about M. If we want to test $H_0: M = M_0$ versus $H_a: M \ne M_0$ at level $\alpha$, then we construct a $100(1 - \alpha)\%$ confidence interval for M. If $M_0$ is contained in the confidence interval, then we fail to reject $H_0$. If $M_0$ is outside the confidence interval, then we reject $H_0$.

For testing one-sided hypotheses about M, we will use the binomial distribution to determine the rejection region. The testing procedure is called the sign test and is constructed as follows. Let $y_1, \ldots, y_n$ be a random sample from a population having median M. Let the null value of M be $M_0$ and define $W_i = y_i - M_0$. The sign test statistic B is the number of positive $W_i$s. Note that B is simply the number of $y_i$s that are greater than $M_0$. Because M is the population median, 50% of the population values are greater than M and 50% are less than M. Now, if $M = M_0$, then there is a 50% chance that $y_i$ is greater than $M_0$ and hence a 50% chance that $W_i$ is positive. The $W_i$s are independent, each $W_i$ has a 50% chance of being positive whenever $M = M_0$, and B counts the number of positive $W_i$s. Therefore, under $H_0$, B is a binomial random variable with $p = .5$, and the percentiles from the binomial distribution with $p = .5$ given in Table 4 in the Appendix can be used to construct the rejection region for the test of hypothesis. The statistical test for a population median M is summarized next.
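Written out, the null distribution that this argument relies on is the following. The symbol b for an observed count is introduced here only for illustration, and the expression assumes that no observation is exactly equal to $M_0$:

```latex
\Pr(B = b \mid M = M_0) = \binom{n}{b}\left(\tfrac{1}{2}\right)^{n},
\qquad b = 0, 1, \ldots, n,
\qquad\text{so that}\qquad
\Pr(B \ge b \mid M = M_0) = \sum_{k=b}^{n} \binom{n}{k}\left(\tfrac{1}{2}\right)^{n}.
```

Because these probabilities involve only n, the same binomial percentiles serve for every population distribution.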
Three different sets of hypotheses are given with their corresponding rejection regions. The tests given are appropriate for any population distribution.

Summary of a Statistical Test for the Median M

Hypotheses:
Case 1. $H_0: M \le M_0$ vs. $H_a: M > M_0$ (right-tailed test)
Case 2. $H_0: M \ge M_0$ vs. $H_a: M < M_0$ (left-tailed test)
Case 3. $H_0: M = M_0$ vs. $H_a: M \ne M_0$ (two-tailed test)

T.S.: Let $W_i = y_i - M_0$ and B = number of positive $W_i$s.

R.R.: For a probability $\alpha$ of a Type I error,
Case 1. Reject $H_0$ if $B \ge n - C_{\alpha(1),n}$
Case 2. Reject $H_0$ if $B \le C_{\alpha(1),n}$
Case 3. Reject $H_0$ if $B \le C_{\alpha(2),n}$ or $B \ge n - C_{\alpha(2),n}$

The following example will illustrate the test of hypotheses for the population median.

EXAMPLE 5.22

Refer to Example 5.20. The sanitation department wanted to determine whether the median household recyclable waste was greater than 5 pounds per week. Test this research hypothesis at level $\alpha = .05$ using the data from Example 5.20.

Solution The hypotheses are

$$H_0: M \le 5 \quad\text{versus}\quad H_a: M > 5$$

The data set consisted of a random sample of $n = 25$ households. From Table 4 in the Appendix, we find $C_{\alpha(1),n} = C_{.05,25} = 7$. Thus, we will reject $H_0: M \le 5$ if $B \ge n - C_{\alpha(1),n} = 25 - 7 = 18$. Let $W_i = y_i - M_0 = y_i - 5$, which yields

−3.9  −3.8  −2.9  −2.4  −2.3  −2.1  −1.4  −1.1  −0.8  −0.7  −0.5  −0.3  0.3
0.6  0.8  1.5  1.7  1.7  2.8  2.8  9.2  20.9  24.5  29.8  38.8

The 25 values of $W_i$ contain 13 positive values. Thus, $B = 13$, which is less than 18, so we fail to reject $H_0$. We conclude that the data set fails to demonstrate that the median household level of recyclable waste is greater than 5 pounds per week.

Large-Sample Approximation

When the sample size n is larger than the values given in Table 4 in the Appendix, we can use the normal approximation to the binomial distribution to set the rejection region. The standardized version of the sign test is given by

$$B_{ST} = \frac{B - n/2}{\sqrt{n/4}}$$

When M equals $M_0$, $B_{ST}$ has approximately a standard normal distribution.
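The right-tailed sign test of Example 5.22 can be sketched as follows. The function name is hypothetical, the critical value C = 7 is simply taken from the example rather than recomputed, and the exact p-value is an addition that the example itself does not report.

```python
from math import comb

def sign_test_right_tailed(data, m0):
    """Return (B, n, exact p-value) for H0: M <= m0 versus Ha: M > m0.

    Assumption: no observation equals m0 (true for the waste data);
    the text does not state a convention for such ties.
    """
    n = len(data)
    b = sum(1 for y in data if y - m0 > 0)   # number of positive W_i
    # Under H0 with M = m0, B ~ Bin(n, 0.5), so the p-value is Pr[Bin(n, .5) >= b].
    p_value = sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
    return b, n, p_value

# Weekly recyclable waste (pounds) for the 25 households in Example 5.20.
waste = [14.2, 5.3, 2.9, 4.2, 1.2, 4.3, 1.1, 2.6, 6.7, 7.8,
         25.9, 43.8, 2.7, 5.6, 7.8, 3.9, 4.7, 6.5, 29.5, 2.1,
         34.8, 3.6, 5.8, 4.5, 6.7]

b, n, p_value = sign_test_right_tailed(waste, m0=5)
critical_count = n - 7            # n - C with C = 7, as read from Table 4 in the example
reject = b >= critical_count
print(b, critical_count, reject, p_value)   # prints: 13 18 False 0.5
```

With 13 of 25 observations above 5, the count sits almost exactly at its null expectation of 12.5, which is why the p-value is so large.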
Thus, we have the following decision rules for the three different research hypotheses:

Case 1. Reject $H_0: M \le M_0$ if $B_{ST} \ge z_\alpha$, with p-value $= \Pr(z \ge B_{ST})$
Case 2. Reject $H_0: M \ge M_0$ if $B_{ST} \le -z_\alpha$, with p-value $= \Pr(z \le B_{ST})$
Case 3. Reject $H_0: M = M_0$ if $|B_{ST}| \ge z_{\alpha/2}$, with p-value $= 2\Pr(z \ge |B_{ST}|)$

where $z_\alpha$ is the standard normal percentile.

EXAMPLE 5.23

Using the information in Example 5.22, construct the large-sample approximation to the sign test, and compare your results to those obtained using the exact sign test.

Solution Refer to Example 5.22, where we had $n = 25$ and $B = 13$. We conduct the large-sample approximation to the sign test as follows. We will reject $H_0: M \le 5$ in favor of $H_a: M > 5$ if $B_{ST} \ge z_{.05} = 1.645$.

$$B_{ST} = \frac{B - n/2}{\sqrt{n/4}} = \frac{13 - 25/2}{\sqrt{25/4}} = 0.2$$

Because $B_{ST} = 0.2$ is not greater than 1.645, we fail to reject $H_0$. The p-value $= \Pr(z \ge 0.2) = 1 - \Pr(z < 0.2) = 1 - .5793 = .4207$ using Table 1 in the Appendix. Thus, we reach the same conclusion as was obtained using the exact sign test.

In Section 5.7, we observed that the performance of the t test deteriorated when the population distribution was either very heavily tailed or highly skewed. In Table 5.8, we compute the level and power of the sign test and compare these values to the comparable values for the t test for the four population distributions depicted in Figure 5.19 in Section 5.7. Ideally, the level of the test should remain the same for all population distributions. Also, we want tests having the largest possible power values because the power of a test is its ability to detect false null hypotheses. When the population distribution is either heavy tailed or highly skewed, the level of the t test changes from its stated value of .05. In these situations, the level of the sign test stays the same because the level of the sign test is the same for all distributions.
The power of the t test is greater than the power of the sign test when sampling from a population having a normal distribution. However, the power of the sign test is greater than the power of the t test when sampling from very heavily tailed distributions or highly skewed distributions.
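Returning to the large-sample sign test of Example 5.23, the standardized statistic and its normal-approximation p-value can be sketched as below. The function names are illustrative only, and the standard normal tail is computed with the complementary error function instead of being read from Table 1.

```python
from math import erfc, sqrt

def standardized_sign_statistic(b, n):
    """B_ST = (B - n/2) / sqrt(n/4)."""
    return (b - n / 2) / sqrt(n / 4)

def upper_tail_normal(z):
    """Pr(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

# Values from Example 5.22: n = 25 households, B = 13 positive differences.
b_st = standardized_sign_statistic(13, 25)
p_value = upper_tail_normal(b_st)      # right-tailed test (Case 1)
reject = p_value <= 0.05               # equivalent to comparing B_ST with z_.05

print(round(b_st, 2), round(p_value, 4), reject)   # prints: 0.2 0.4207 False
```

Comparing the p-value with α gives the same decision as comparing $B_{ST}$ with the critical value, so the sketch does not need a table of normal percentiles.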

TABLE 5.8 Level and power values of the t test versus the sign test. For each sample size n, the Level column is followed by the power at $(M_a - M_0)/\sigma$ = .2, .6, and .8.

| Population distribution | Test statistic | n = 10: Level | n = 10: .2 | n = 10: .6 | n = 10: .8 | n = 15: Level | n = 15: .2 | n = 15: .6 | n = 15: .8 | n = 20: Level | n = 20: .2 | n = 20: .6 | n = 20: .8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Normal | t | .05 | .145 | .543 | .754 | .05 | .182 | .714 | .903 | .05 | .217 | .827 | .964 |
| Normal | Sign | .055 | .136 | .454 | .642 | .059 | .172 | .604 | .804 | .058 | .194 | .704 | .889 |
| Heavy Tailed | t | .035 | .104 | .371 | .510 | .049 | .115 | .456 | .648 | .045 | .163 | .554 | .736 |
| Heavy Tailed | Sign | .055 | .209 | .715 | .869 | .059 | .278 | .866 | .964 | .058 | .325 | .935 | .990 |
| Lightly Skewed | t | .025 | .079 | .437 | .672 | .037 | .129 | .614 | .864 | .041 | .159 | .762 | .935 |
| Lightly Skewed | Sign | .055 | .140 | .454 | .631 | .059 | .178 | .604 | .794 | .058 | .201 | .704 | .881 |
| Highly Skewed | t | .007 | .055 | .277 | .463 | .006 | .078 | .515 | .733 | .011 | .104 | .658 | .873 |
| Highly Skewed | Sign | .055 | .196 | .613 | .778 | .059 | .258 | .777 | .912 | .058 | .301 | .867 | .964 |

5.10 Research Study: Percent Calories from Fat

In Section 5.1 we introduced the potential health problems associated with obesity. The assessment and quantification of a person’s usual diet is crucial in evaluating the degree of relationship between diet and diseases. This is a very difficult task but is important in an effort to monitor dietary behavior among individuals. Rosner, Willett, and Spiegelman (1989), in “Correction of Logistic Regression Relative Risk Estimates and Confidence Intervals for Systematic Within-Person Measurement Error,” Statistics in Medicine, Vol. 8, 1051–1070, describe a nurses’ health study in which the diet of a large sample of women was examined. One of the objectives of the study was to determine the percentage of calories from fat in the diet of a population of nurses and compare this value with the recommended value of 30%. The most commonly used method in large nutritional epidemiology studies is the food frequency questionnaire (FFQ). This questionnaire uses a carefully designed series of questions to determine the dietary intakes of participants in the study. In the nurses’ health study, a sample of nurses completed a single FFQ. These women represented a random sample from a population of nurses. From the information gathered from the questionnaire, the percentage of calories from fat (PCF) was computed.

To minimize missteps in a research study, it is advisable to follow the four-step process outlined in Chapter 1. We will illustrate these steps using the percent calories from fat (PCF) study described at the beginning of this chapter. The first step is to determine the goals and objectives of the study.

Defining the Problem

The researchers in this study would need to answer questions similar to the following:

1. What is the population of interest?
2. What dietary variables may have an effect on a person’s health?
3. What characteristics of the nurses other than dietary intake may be important in studying the nurses’ health condition?
4. How should the nurses be selected to participate in the study?
5. What hypotheses are of interest to the researchers?

The researchers decided that the main variable of interest was the percentage of calories from fat (PCF) in the diet of nurses. The parameters of interest were the average of PCF values $\mu$ for the population of nurses, the standard deviation $\sigma$ of PCF for the population of nurses, and the proportion $p$ of nurses having PCF greater than 50%. They also wanted to determine if the average PCF for the population of nurses exceeded the recommended value of 30%.

In order to estimate these parameters and test hypotheses about the parameters, it was first necessary to determine the sample size required to meet certain specifications imposed by the researchers. The researchers wanted to estimate the mean PCF with a 95% confidence interval having a tolerable error of 3%. From previous studies, the values of PCF ranged from 10% to 50%. Because we want a 95% confidence interval with width 3%, $E = 3/2 = 1.5$ and $z_{\alpha/2} = z_{.025} = 1.96$. Our estimate of $\sigma$ is $\hat{\sigma} = \text{range}/4 = (50 - 10)/4 = 10$. Substituting into the formula for n, we have

$$n = \frac{(z_{\alpha/2})^2 \hat{\sigma}^2}{E^2} = \frac{(1.96)^2 (10)^2}{(1.5)^2} = 170.7$$

Thus, a random sample of 171 nurses should give a 95% confidence interval for $\mu$ with the desired width of 3%, provided 10% is a reasonable estimate of $\sigma$. Three nurses originally selected for the study did not provide information on PCF; therefore, the sample size was only 168.

Collecting Data

The researchers would need to carefully examine the data from the food frequency questionnaires to determine if the responses were recorded correctly. The data would then be transferred to computer files and prepared for analysis following the steps outlined in Chapter 2. The next step in the study would be to summarize the data through plots and summary statistics.
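The sample-size arithmetic from the Defining the Problem step can be checked with a short sketch. The function name `sample_size_for_mean` and its parameter names are illustrative only, and rounding up to the next whole number is the convention assumed here, which matches the value of 171 quoted in the text.

```python
from math import ceil

def sample_size_for_mean(sigma_hat, half_width, z):
    """Smallest n with z * sigma_hat / sqrt(n) <= half_width, i.e. ceil((z * sigma_hat / E)^2)."""
    return ceil((z * sigma_hat / half_width) ** 2)

# Planning values from the study description.
sigma_hat = (50 - 10) / 4     # range / 4 estimate of the standard deviation
half_width = 3 / 2            # E: half of the desired interval width of 3
n_required = sample_size_for_mean(sigma_hat, half_width, z=1.96)

print(sigma_hat, half_width, n_required)   # prints: 10.0 1.5 171
```

Because n grows with the square of $\hat{\sigma}/E$, halving the tolerable width would roughly quadruple the required sample.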
Summarizing Data

The PCF values for the 168 women are displayed in Figure 5.23 in a stem-and-leaf diagram along with a table of summary statistics. A normal probability plot is provided in Figure 5.24 to assess the normality of the distribution of PCF values. From the stem-and-leaf plot and normal probability plot, it appears that the data are nearly normally distributed, with PCF values ranging from 15% to 57%. The proportion of the women who have PCF greater than 50% is $\hat{p} = 4/168 = 2.4\%$. From the table of summary statistics in the output, the sample mean is $\bar{y} = 36.919$ and the sample standard deviation is $s = 6.728$.

The researchers want to draw inferences from the random sample of 168 women to the population from which they were selected. Thus, we need to place bounds on the point estimates to reflect the degree of confidence in their estimation of the population values. The researchers may also be interested in testing hypotheses about the size of the population mean PCF $\mu$ or variance $\sigma^2$. For example, many nutritional experts recommend that one’s daily diet have no more than 30% of total calories a day from fat. Thus, we would want to test the statistical hypothesis that $\mu$ is greater than 30% to determine if the average value of PCF for the population of nurses exceeds the recommended value.