Normal Approximation to the Binomial

any values of n or p, but the task becomes more difficult when n gets large. For example, suppose a sample of 1,000 voters is polled to determine sentiment toward the consolidation of city and county government. What would be the probability of observing 460 or fewer favoring consolidation if we assume that 50% of the entire population favor the change? Here we have a binomial experiment with $n = 1{,}000$ and $p$, the probability of selecting a person favoring consolidation, equal to .5. To determine the probability of observing 460 or fewer favoring consolidation in the random sample of 1,000 voters, we could compute $P(y)$ using the binomial formula for $y = 460, 459, \ldots, 0$. The desired probability would then be

$$P(y = 460) + P(y = 459) + \cdots + P(y = 0)$$

There would be 461 probabilities to calculate, each one somewhat difficult because of the factorials. For example, the probability of observing 460 favoring consolidation is

$$P(y = 460) = \frac{1{,}000!}{460!\,540!}\,(.5)^{460}(.5)^{540}$$

A similar calculation would be needed for all other values of y.

To justify the use of the Central Limit Theorem, we need to define n random variables, $I_1, \ldots, I_n$, by

$$I_i = \begin{cases} 1 & \text{if the } i\text{th trial results in a success} \\ 0 & \text{if the } i\text{th trial results in a failure} \end{cases}$$

The binomial random variable y is the number of successes in the n trials. Now, consider the sum of the random variables $I_1, \ldots, I_n$, namely $\sum_{i=1}^{n} I_i$. A 1 is placed in the sum for each S that occurs and a 0 for each F that occurs. Thus, $\sum_{i=1}^{n} I_i$ is the number of S's that occurred during the n trials. Hence, we conclude that $y = \sum_{i=1}^{n} I_i$. Because the binomial random variable y is the sum of independent random variables, each having the same distribution, we can apply the Central Limit Theorem for sums to y. Thus, the normal distribution can be used to approximate the binomial distribution when n is of an appropriate size. The normal distribution that will be used has a mean and standard deviation given by the following formula:

$$\mu = np \qquad \sigma = \sqrt{np(1 - p)}$$

These are the mean and standard deviation of the binomial random variable y.

EXAMPLE 4.25 Use the normal approximation to the binomial to compute the probability of observing 460 or fewer in a sample of 1,000 favoring consolidation if we assume that 50% of the entire population favor the change.

Solution The normal distribution used to approximate the binomial distribution will have

$$\mu = np = 1{,}000(.5) = 500 \qquad \sigma = \sqrt{np(1 - p)} = \sqrt{1{,}000(.5)(.5)} = 15.8$$

The desired probability is represented by the shaded area shown in Figure 4.25. We calculate the desired area by first computing

$$z = \frac{y - \mu}{\sigma} = \frac{460 - 500}{15.8} = -2.53$$

FIGURE 4.25 Approximating normal distribution for the binomial distribution, $\mu = 500$ and $\sigma = 15.8$ (horizontal axis y, with 460 and 500 marked; vertical axis f(y))

Referring to Table 1 in the Appendix, we find that the area under the normal curve to the left of 460 (that is, for $z = -2.53$) is .0057. Thus, the probability of observing 460 or fewer favoring consolidation is approximately .0057.

FIGURE 4.26 Normal approximation to binomial, $n = 20$ and $p = .30$

The normal approximation to the binomial distribution can be unsatisfactory if $np < 5$ or $n(1 - p) < 5$. If p, the probability of success, is small, and n, the sample size, is modest, the actual binomial distribution is seriously skewed to the right. In such a case, the symmetric normal curve will give an unsatisfactory approximation. If p is near 1, so $n(1 - p) < 5$, the actual binomial will be skewed to the left, and again the normal approximation will not be very accurate. The normal approximation, as described, is quite good when $np$ and $n(1 - p)$ exceed about 20. In the middle zone, $np$ or $n(1 - p)$ between 5 and 20, a modification called a continuity correction makes a substantial contribution to the quality of the approximation.
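The calculation in Example 4.25 can be checked numerically. The following is a minimal sketch, not part of the original text, that assumes only Python's standard library; the function names `binomial_cdf` and `normal_approx_cdf` are hypothetical and chosen for illustration. It sums the 461 exact binomial terms and compares the result with the normal approximation that uses the mean np and standard deviation sqrt(np(1 - p)).

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(y <= k) for a binomial(n, p) variable, summing k + 1 terms."""
    return sum(comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(y <= k) without a continuity correction."""
    mu = n * p                      # mean of the binomial
    sigma = sqrt(n * p * (1 - p))   # standard deviation of the binomial
    return NormalDist(mu, sigma).cdf(k)


# Setting of Example 4.25: n = 1,000 voters, p = .5, 460 or fewer in favor.
n, p, k = 1000, 0.5, 460
exact = binomial_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)

print(f"exact binomial probability : {exact:.4f}")
print(f"normal approximation       : {approx:.4f}")
```

The approximation agrees with the table-based value of .0057 to four decimal places. The direct summation works for these values of n and p, but the floating-point powers can underflow for much larger n, so it should be read as an illustration rather than a general-purpose routine.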
The point of the continuity correction is that we are using the continuous normal curve to approximate a discrete binomial distribution. A picture of the situation is shown in Figure 4.26. The binomial probability that $y \le 5$ is the sum of the areas of the rectangles above 5, 4, 3, 2, 1, and 0. This probability (area) is approximated by the area under the superimposed normal curve to the left of 5. Thus, the normal approximation ignores half of the rectangle above 5. The continuity correction simply includes the area between $y = 5$ and $y = 5.5$. For the binomial distribution with $n = 20$ and $p = .30$ (pictured in Figure 4.26), the correction is to take $P(y \le 5)$ as $P(y \le 5.5)$. Instead of

$$P(y \le 5) = P\left[z \le \frac{5 - 20(.3)}{\sqrt{20(.3)(.7)}}\right] = P(z \le -.49) = .3121$$

use

$$P(y \le 5.5) = P\left[z \le \frac{5.5 - 20(.3)}{\sqrt{20(.3)(.7)}}\right] = P(z \le -.24) = .4052$$

The actual binomial probability can be shown to be .4164. The general idea of the continuity correction is to add or subtract .5 from a binomial value before using normal probabilities. The best way to determine whether to add or subtract is to draw a picture like Figure 4.26.

Normal Approximation to the Binomial Probability Distribution

For large n and p not too near 0 or 1, the distribution of a binomial random variable y may be approximated by a normal distribution with $\mu = np$ and $\sigma = \sqrt{np(1 - p)}$. This approximation should be used only if $np \ge 5$ and $n(1 - p) \ge 5$. A continuity correction will improve the quality of the approximation in cases in which n is not overwhelmingly large.

EXAMPLE 4.26 A large drug company has 100 potential new prescription drugs under clinical test. About 20% of all drugs that reach this stage are eventually licensed for sale. What is the probability that at least 15 of the 100 drugs are eventually licensed? Assume that the binomial assumptions are satisfied, and use a normal approximation with continuity correction.

Solution The mean of y is $\mu = 100(.2) = 20$; the standard deviation is $\sigma = \sqrt{100(.2)(.8)} = 4.0$. The desired probability is that 15 or more drugs are approved. Because $y = 15$ is included, the continuity correction is to take the event as y greater than or equal to 14.5.

$$P(y \ge 14.5) = P\left(z \ge \frac{14.5 - 20}{4.0}\right) = P(z \ge -1.38) = 1 - P(z < -1.38) = 1 - .0838 = .9162$$
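The effect of the continuity correction can be reproduced with a short script. This is a sketch under stated assumptions, not the text's own procedure: it uses only Python's standard library, and the helper names `normal_cdf_binomial` and `exact_cdf` are hypothetical. It evaluates the n = 20, p = .30 illustration with and without the correction, then the drug-licensing probability of Example 4.26, where "at least 15" is rewritten as 1 - P(y ≤ 14) so that the corrected cutoff is 14.5.

```python
from math import comb, sqrt
from statistics import NormalDist


def normal_cdf_binomial(k, n, p, continuity=True):
    """Normal approximation to P(y <= k) for a binomial(n, p) variable.

    With continuity=True the cutoff is moved from k to k + 0.5.
    """
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    cutoff = k + 0.5 if continuity else k
    return NormalDist(mu, sigma).cdf(cutoff)


def exact_cdf(k, n, p):
    """Exact binomial P(y <= k), for comparison."""
    return sum(comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(k + 1))


# Illustration with n = 20, p = .30: P(y <= 5).
uncorrected = normal_cdf_binomial(5, 20, 0.30, continuity=False)
corrected = normal_cdf_binomial(5, 20, 0.30)
exact = exact_cdf(5, 20, 0.30)

# Example 4.26: P(y >= 15) = 1 - P(y <= 14), so the corrected cutoff is 14.5.
p_at_least_15 = 1 - normal_cdf_binomial(14, 100, 0.20)

print(f"uncorrected {uncorrected:.4f}, corrected {corrected:.4f}, exact {exact:.4f}")
print(f"P(at least 15 licensed) is approximately {p_at_least_15:.4f}")
```

Because the script does not round z to two decimals before evaluating the normal curve, its results differ from the table-based values (.3121, .4052, and .9162) in the third or fourth decimal place. The corrected value is still the one closer to the exact binomial probability.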

4.14 Evaluating Whether or Not a Population Distribution Is Normal

In many scientific experiments or business studies, the researcher wishes to determine if a normal distribution would provide an adequate fit to the population distribution. This would allow the researcher to make probability calculations and draw inferences about the population based on a random sample of observations from that population. Knowledge that the population distribution is not normal also may provide the researcher insight concerning the population under study. This may indicate that the physical mechanism generating the data has been altered or is of a form different from previous specifications. Many of the statistical procedures that will be discussed in subsequent chapters of this book require that the population distribution be normal or at least be adequately approximated by a normal distribution. In this section, we will provide a graphical procedure and a quantitative assessment of how well a normal distribution models the population distribution.

The graphical procedure that will be constructed to assess whether a random sample $y_1, y_2, \ldots, y_n$ was selected from a normal distribution is referred to as a normal probability plot of the data values. This plot is a variation on the quantile plot that was introduced in Chapter 3. In the normal probability plot, we compare the quantiles from the data observed from the population to the corresponding quantiles from the standard normal distribution. Recall that the quantiles from the data are just the data ordered from smallest to largest: $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$, where $y_{(1)}$ is the smallest value in the data $y_1, y_2, \ldots, y_n$, $y_{(2)}$ is the second smallest value, and so on until reaching $y_{(n)}$, which is the largest value in the data. Sample quantiles separate the sample in the same fashion as the population percentiles, which were defined in Section 4.10. Thus, the sample quantile $Q(u)$ has at least 100u% of the data values less than $Q(u)$ and has at least 100(1 − u)% of the data values greater than $Q(u)$. For example, $Q(.1)$ has at least 10% of the data values less than $Q(.1)$ and has at least 90% of the data values greater than $Q(.1)$. $Q(.5)$ has at least 50% of the data values less than $Q(.5)$ and has at least 50% of the data values greater than $Q(.5)$. Finally, $Q(.75)$ has at least 75% of the data values less than $Q(.75)$ and has at least 25% of the data values greater than $Q(.75)$. This motivates the following definition for the sample quantiles:

DEFINITION 4.14 Let $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$ be the ordered values from a data set. The $[(i - .5)/n]$th sample quantile, $Q((i - .5)/n)$, is $y_{(i)}$. That is, $y_{(1)} = Q(.5/n)$ is the $[.5/n]$th sample quantile, $y_{(2)} = Q(1.5/n)$ is the $[1.5/n]$th sample quantile, . . . , and lastly, $y_{(n)} = Q((n - .5)/n)$ is the $[(n - .5)/n]$th sample quantile.

Suppose we had a sample of $n = 20$ observations: $y_1, y_2, \ldots, y_{20}$. Then, $y_{(1)} = Q(.5/20) = Q(.025)$ is the .025th sample quantile, $y_{(2)} = Q(1.5/20) = Q(.075)$ is the .075th sample quantile, $y_{(3)} = Q(2.5/20) = Q(.125)$ is the .125th sample quantile, . . . , and $y_{(20)} = Q(19.5/20) = Q(.975)$ is the .975th sample quantile.

In order to evaluate whether a population distribution is normal, a random sample of n observations is obtained, the sample quantiles are computed, and these n quantiles are compared to the corresponding quantiles computed using the conjectured population distribution. If the conjectured distribution is the normal distribution, then we would use the normal tables to obtain the quantiles $z_{(i - .5)/n}$ for $i = 1, 2, \ldots, n$. The normal quantiles are obtained from the standard normal tables, Table 1, for the n values $.5/n, 1.5/n, \ldots, (n - .5)/n$.
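The quantile levels and the matching standard normal quantiles can also be computed directly instead of being read from Table 1. The sketch below is an illustration only, assuming Python's standard library; the function names are hypothetical, and the six data values are made up purely to show the pairing and are not taken from the text.

```python
from statistics import NormalDist


def quantile_levels(n):
    """The n levels (i - .5)/n for i = 1, ..., n used in Definition 4.14."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def normal_quantiles(n):
    """Standard normal quantiles z_{(i - .5)/n} for i = 1, ..., n."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(u) for u in quantile_levels(n)]


def quantile_pairs(data):
    """Pair each normal quantile with the ordered data value y_(i)."""
    ordered = sorted(data)  # y_(1) <= y_(2) <= ... <= y_(n)
    return list(zip(normal_quantiles(len(ordered)), ordered))


# Levels and normal quantiles for a sample of size n = 20.
levels = quantile_levels(20)
z = normal_quantiles(20)
print([round(u, 3) for u in levels[:3]], round(levels[-1], 3))
print([round(v, 3) for v in z[:3]], round(z[-1], 3))

# Hypothetical data, used only to show how the pairs are formed.
for z_i, y_i in quantile_pairs([5.1, 3.8, 4.4, 6.0, 4.9, 5.5]):
    print(f"{z_i:7.3f}  {y_i}")
```

Sorting the data first is what makes the ith pair match the ith smallest observation with the ith smallest normal quantile.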
For example, if we had $n = 20$ data values, then we would obtain the normal quantiles for $.5/20 = .025$, $1.5/20 = .075$, $2.5/20 = .125$, . . . , $(20 - .5)/20 = .975$. From Table 1, we find that these quantiles are given by $z_{.025} = -1.960$, $z_{.075} = -1.440$, $z_{.125} = -1.150$, . . . , $z_{.975} = 1.960$. The normal quantile plot is obtained by plotting the n pairs of points

$$(z_{.5/n},\, y_{(1)});\ (z_{1.5/n},\, y_{(2)});\ (z_{2.5/n},\, y_{(3)});\ \ldots;\ (z_{(n - .5)/n},\, y_{(n)})$$

If the population from which the sample of n values was randomly selected has a normal distribution, then the plotted points should fall close to a straight line. The following example will illustrate these ideas.

EXAMPLE 4.27 It is generally assumed that cholesterol readings in large populations have a normal distribution. In order to evaluate this conjecture, the cholesterol readings of $n = 20$ patients were obtained. These are given in Table 4.12, along with the corresponding normal quantile values. It is important to note that the cholesterol readings are given in an ordered fashion from smallest to largest. The smallest cholesterol reading is matched with the smallest normal quantile, the second-smallest cholesterol reading with the second-smallest quantile, and so on. Obtain the normal quantile plot for the cholesterol data and assess whether the data were selected from a population having a normal distribution.