CitGrad.dat (Minitab or similar program recommended)
a. Construct a stem-and-leaf diagram of “% graduate degree.”
b. Which cities have an unusually large percentage of workers holding graduate degrees?

2.33 Refer to Exercise 2.32. Consider the first ten cities in the first column of the list. Construct back-to-back stem-and-leaf diagrams of these ten cities and the remaining cities. Does there appear to be a difference between the two groups of cities with respect to the percentage of workers with graduate degrees?

2.6 THE NORMAL DENSITY FUNCTION

Density histograms provide good visual representations of data distributions. But the appearance of a density histogram can change as the number and width of the class intervals change. The outline of a density histogram, by construction, is not very smooth. Moreover, density histograms are somewhat awkward to use. If we want to find the proportion (relative frequency) of the data set falling in some fixed interval, we must sum the areas of the vertical bars over the chosen interval. If one or both of the endpoints of our fixed interval fall within class intervals, as in the diagram shown here, then it is necessary to interpolate to find the required proportion.

[Figure: density histogram (vertical axis: Density; horizontal axis: Variable) with the required proportion shaded over a fixed interval]

When density histograms are symmetric about a single peak and look like the outline of a bell, they can often be closely approximated by a smooth curve known as the normal density function. You may already be familiar with the bell-shaped normal density curve because it often serves as a model for the distribution of examination scores. Areas under a normal density curve can then approximate density histogram relative frequencies.

The advantage of using a single mathematical function, like the normal density function, to represent a distribution of data is that it is always available in a compact form. Histograms of data from a variety of sources may display very similar features. If so, they may all be represented by the same mathematical function.
This function can be used to make statements about the size of future measurements and to develop procedures that allow us to generalize from a sample to a population.

[Figure 2.13 Two Normal Density Functions with Different Standard Deviations: one curve with small $\sigma$ and one with large $\sigma$, both centered at $\mu$]

The normal density function with mean $\mu$ and standard deviation $\sigma$ will be denoted by $N(\mu, \sigma)$.

The normal density function is usually associated with Pierre Laplace and Carl Gauss, who, working somewhat independently in the 18th and 19th centuries, figured prominently in its development. Gauss, motivated by errors in astronomical measurements, derived the function mathematically as a distribution of errors. He called his error distribution the “normal law of errors.” Subsequent scientists and data collectors in a wide variety of fields found that their histograms exhibited the common feature of first gradually rising in height to a maximum and then decreasing in a symmetric manner. Although there are other functions exhibiting this property, the normal density seemed to “fit” the data in so many real-life situations that many of its proponents believed that if data did not conform to the normal curve, the data collection process must be suspect. In this context, Gauss’s function became known as the normal distribution, and the name held.

There are many normal distributions, but all normal curves have the same overall shape. A particular normal distribution is determined once its mean $\mu$ and standard deviation $\sigma$ are specified. The mean is the balancing point of the normal curve; because a normal distribution is symmetric, it is also the median. Changing the value of the mean changes the location of the normal curve on the horizontal axis. The standard deviation measures spread. As the standard deviation decreases, the normal curve becomes more tightly concentrated about its center (mean). Two normal density functions with the same mean but different standard deviations are shown in Figure 2.13.

The two points along the horizontal axis at which the normal curve changes from curving more steeply downward to curving less steeply downward (beginning to flatten out) are located a distance $\sigma$ on each side of the mean $\mu$. Consequently, it is possible to guess the values of $\mu$ and $\sigma$ from a graph of the normal density function.
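The roles of the mean and standard deviation described above can be explored numerically. The following is a minimal sketch, assuming Python's standard `statistics.NormalDist` class; the parameter values (mean 0, standard deviations 1 and 2) and the helper `second_difference` are illustrative choices, not taken from the text. It compares the peak heights of two curves with the same mean and checks that the curvature changes sign near a distance $\sigma$ from the mean.

```python
from statistics import NormalDist

def second_difference(dist, x, h=1e-3):
    """Numerical second difference of the density at x; its sign shows the curvature."""
    return dist.pdf(x + h) - 2.0 * dist.pdf(x) + dist.pdf(x - h)

# Hypothetical parameters chosen for illustration (not from the text).
mu = 0.0
narrow = NormalDist(mu, 1.0)   # small sigma
wide = NormalDist(mu, 2.0)     # large sigma

# Same mean, so both curves peak at mu; the smaller sigma gives the taller peak.
peak_narrow = narrow.pdf(mu)
peak_wide = wide.pdf(mu)

# Curvature is negative just inside mu + sigma and positive just outside it.
inside = second_difference(wide, mu + wide.stdev - 0.1)
outside = second_difference(wide, mu + wide.stdev + 0.1)

print(f"peak heights: {peak_narrow:.4f} (sigma=1), {peak_wide:.4f} (sigma=2)")
print(f"curvature inside/outside mu + sigma: {inside:+.2e} / {outside:+.2e}")
```

The sign change in the second difference is what lets us read $\sigma$ off a graph: it marks where the curve begins to flatten out.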
So if we want to refer to a normal distribution with $\mu = 4$ and $\sigma = 3$, we write $N(4, 3)$. Furthermore, we shall use uppercase letters, such as $X$, to represent the variable whose measurements have a theoretical distribution like the normal distribution, and we shall use lowercase letters, for example, $x$, to represent a particular measurement.

With a little mathematics, it is possible to show that the total area under any normal curve is 1. In addition, we have the following rule:

The Normal Distribution 68–95–99.7 Rule
For any normal density function,
68% of the area under the curve is contained within 1 standard deviation of the mean.
95% of the area under the curve is contained within 2 standard deviations of the mean.
99.7% of the area under the curve is contained within 3 standard deviations of the mean.

[Figure 2.14 The Normal Distribution 68–95–99.7 Rule: normal curve with the horizontal axis marked at $-3$, $-2$, $-1$, 1, 2, 3, showing 68%, 95%, and 99.7% of the area]

The 68–95–99.7 rule allows us to think about the nature of normal distributions without having to make repeated mathematical calculations. Note the similarity between this rule and the empirical rule. Recall that the empirical rule talks about the proportion of a data set that falls within 1, 2, and 3 sample standard deviations of the sample mean. Specifically, the empirical rule states that
At least 68% of the data fall within $s$ of $\bar{x}$.
At least 95% of the data fall within $2s$ of $\bar{x}$.
At least 99.7% of the data fall within $3s$ of $\bar{x}$.
In fact, the empirical rule comes from assuming that a data frequency distribution can be approximately represented by a normal distribution with a mean equal to the sample mean $\bar{x}$ and a standard deviation equal to the sample standard deviation $s$.

For small data sets, it is difficult to determine whether a normal distribution approximation is warranted because there is little information in few observations.
With large data sets, we can get a better picture of the shape of their distributions. If a normal density curve provides an adequate model for the data distribution, the empirical rule will provide an accurate summary of the variation.

Figure 2.14 illustrates the 68–95–99.7 rule for a normal distribution with $\mu = 0$ and the measurements expressed in units of $\sigma$; so, for example, 2 in the figure stands for $\mu + 2\sigma = 2\sigma$, and so forth.
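The 68–95–99.7 rule can be checked numerically for any choice of mean and standard deviation. The sketch below assumes Python's standard `statistics.NormalDist`; the helper name `area_within` and the second curve, $N(100, 5)$, are chosen only for illustration.

```python
from statistics import NormalDist

def area_within(dist, k):
    """Area under the normal curve within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

# Two normal curves used purely as illustrations.
for dist in (NormalDist(4, 3), NormalDist(100, 5)):
    areas = [area_within(dist, k) for k in (1, 2, 3)]
    print(dist, [round(a, 4) for a in areas])
```

Both curves give the same three areas, about .6827, .9545, and .9973, because the areas depend only on the number of standard deviations from the mean.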

For $\sigma = 1$, the plot in Figure 2.14 is the $N(0, 1)$ density function.

Normal distributions serve as good data models for scores on psychological tests or subject-matter examinations taken by a broad spectrum of individuals. Measurements from homogeneous biological populations that yield data on, say, bone lengths or corn production, tend to be normally distributed. Data from stable processes collected over time, such as stock rates of return, are often well represented by a normal distribution. Finally, repeated careful measurements of the same quantity, like the moisture content in portions of ground cheese from Chapter 1, are nearly normally distributed.

THE STANDARD NORMAL DISTRIBUTION

If two variables $x$ and $y$ are related by the expression
$$y = a + bx$$
then $y$ is said to be a linear transformation of $x$. The name linear transformation comes from the fact that a plot of $y = a + bx$ is a straight line.

Let $X$ be a variable whose values, theoretically, are normally distributed with mean $\mu$ and standard deviation $\sigma$, and let $Y = a + bX$ be a linear transformation of $X$. Then the values of $Y$ will have mean $a + b\mu$ and standard deviation $|b|\sigma$. In addition, $Y$ will have a normal distribution.

A Linear Transformation of a Normal Variable
If $X$ is distributed as $N(\mu, \sigma)$, then $Y = a + bX$ is distributed as $N(a + b\mu, |b|\sigma)$.

One linear transformation is particularly convenient for normal data. Define the variable¹
$$Z = \frac{X - \mu}{\sigma}$$
This variable is called a standardized variable. By construction, a standardized variable always has mean 0 and standard deviation 1. Consequently, from our previous result, if $X$ has a $N(\mu, \sigma)$ distribution, $Z$ has a $N(0, 1)$ distribution. A normal distribution with mean 0 and standard deviation 1 is called the standard normal distribution.

¹ In this section, we focus our attention on a normal variable and patterns of data that are well approximated by the bell-shaped normal curve. We will encounter other data models in this book, and we will often find it convenient to work with standardized variables in those contexts.
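The linear transformation result can be illustrated with Python's standard `statistics.NormalDist`, which supports shifting and scaling by constants. This is a sketch under assumed values: $X$ distributed as $N(4, 3)$ with $a = 1$ and $b = -2$ are illustrative choices, not values from the text.

```python
from statistics import NormalDist

# Illustrative values only: X ~ N(4, 3), a = 1, b = -2 (not taken from the text).
X = NormalDist(mu=4, sigma=3)
a, b = 1, -2

# Scaling by b and shifting by a gives the distribution of Y = a + bX.
Y = X * b + a
print(Y.mean, Y.stdev)   # mean a + b*mu, standard deviation |b|*sigma

# Standardizing: Z = (X - mu) / sigma has mean 0 and standard deviation 1.
Z = (X - X.mean) / X.stdev
print(Z.mean, Z.stdev)
```

A negative $b$ is used on purpose: the standard deviation of $Y$ is $|b|\sigma$, so it stays positive even though the mean changes sign.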
We can turn the expression $Z = (X - \mu)/\sigma$ around and write the variable $X$ in terms of the standardized variable $Z$. With a little algebra, we have
$$X = \mu + \sigma Z$$
If $\mu$ and $\sigma$ are the mean and standard deviation of the normal distribution, then this equation implies that any value of the normal variable can be written as the mean plus a multiple of the standard deviation.

In practice, data are often standardized with the sample mean, $\bar{x}$, playing the role of $\mu$ and the sample standard deviation, $s$, playing the role of $\sigma$. If there are $n$ observations, then
$$z_i = \frac{x_i - \bar{x}}{s}, \quad i = 1, 2, \ldots, n$$
are the standardized observations, and these values have sample mean 0 and sample standard deviation 1 (see Exercise 2.46). If the original data are approximately normally distributed, the standardized observations are approximately normally distributed.

AREAS UNDER A NORMAL CURVE

Areas under the standard normal curve have been tabulated. Table 3 in Appendix B is a table of the area under the standard normal curve to the left of a particular value of $z$. Thus, the table gives the area under the $N(0, 1)$ curve over the interval $(-\infty, z]$. Figure 2.15 demonstrates how to read and interpret the standard normal table for $z = 1.26$. Since the total area under any normal curve is 1, the area under the standard normal curve to the right of $z = 1.26$ is $1 - .8962 = .1038$. Moreover, as we have indicated in Figure 2.14, about .68 (actually, .6827) of the area under the curve is between $z = -1$ and $z = 1$, and about .95 (actually, .9545) of the area is between $z = -2$ and $z = 2$. Of course, since the mean is also the median, .5 of the area under the $N(0, 1)$ curve is to the left of $z = 0$ and .5 of the area is to the right. Using Table 3, the symmetry of the normal density function, and simple arithmetic operations, we can determine any area under the standard normal curve.

[Figure 2.15 Area Under the $N(0, 1)$ Curve to the Left of $z = 1.26$, and How to Read from Table 3, Appendix B, for $z = 1.26 = 1.2 + .06$: the entry in row 1.2 and column .06 is .8962]
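The table lookups and the standardization of observations described above can be mimicked in a few lines. This sketch assumes Python's standard `statistics` module; `area_left` is a hypothetical stand-in for reading a four-place entry from Table 3, and the five data values are made up purely for illustration.

```python
from statistics import NormalDist, mean, stdev

std_normal = NormalDist(0, 1)

def area_left(z):
    """Area under the N(0, 1) curve over (-infinity, z], rounded to four places."""
    return round(std_normal.cdf(z), 4)

left = area_left(1.26)                 # stands in for the Table 3 entry at z = 1.26
right = round(1 - left, 4)             # total area under the curve is 1
within_1 = round(std_normal.cdf(1) - std_normal.cdf(-1), 4)
within_2 = round(std_normal.cdf(2) - std_normal.cdf(-2), 4)
print(left, right, within_1, within_2)

# Standardized observations z_i = (x_i - xbar) / s for made-up data.
data = [9.2, 10.1, 10.8, 11.5, 12.4]
xbar, s = mean(data), stdev(data)
z_scores = [(x - xbar) / s for x in data]
print(round(mean(z_scores), 10), round(stdev(z_scores), 10))
```

The first line of output should agree with the values quoted in the text (.8962, .1038, .6827, .9545), and the standardized observations should have sample mean 0 and sample standard deviation 1 up to rounding error.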
EXAMPLE 2.14 Using the Normal Table

[Sketches: shaded areas under the standard normal curve for $z = -.53$, $z = 1.45$, and $z = 2.01$]

Find the area under the standard normal curve for the following cases:

1. Area to the left of $z = -.53$

Solution and Discussion. This area can be read directly from Table 3. The table entry corresponding to $z = -.53$ is .2981. Because of the symmetry of the standard normal curve, .2981 is also the area to the right of $z = .53$.

2. Area to the right of $z = 1.45$

Solution and Discussion. The desired area is
$$1 - (\text{Area to the left of } z = 1.45) = 1 - .9265 = .0735$$
since .9265 is the table entry corresponding to $z = 1.45$. Equivalently, the area to the right of $z = 1.45$ is equal, from symmetry, to the area to the left of $z = -1.45$. The latter area can be read directly from Table 3 and is, as expected, .0735.

3. Area between $z = 0$ and $z = 2.01$

Solution and Discussion. The area between $z = 0$ and $z = 2.01$ is given by the area to the left of $z = 2.01$ minus the area to the left of $z = 0$. We know the area to the left of $z = 0$ is .5 (verify with the table). Table 3 indicates that the area to the left of 2.01 is .9778 and, consequently, the area between 0 and 2.01 is $.9778 - .5000 = .4778$.

4. Area between $z = -1.195$ and $z = .830$

Solution and Discussion. Again, the area between $z = -1.195$ and $z = .830$ is the area to the left of .830 minus the area to the left of $-1.195$. From Table 3, the area to the left of .830 is .7967. To determine the area to the left of $-1.195$, we must interpolate since the $z$ values in the table are given to only two decimal places. Interpolating between the table entries for $-1.200$ and $-1.190$, we find the area to the left of $-1.195$ to be .1161. The required area is then $.7967 - .1161 = .6806$.

Hint: Since $-1.195$ is halfway between $-1.200$ and $-1.190$, the table entry corresponding to $-1.195$ is halfway between the table entries for $-1.200$ and $-1.190$, respectively.

When evaluating areas under a normal curve, it is a good idea to sketch the curve and then darken the required area.
This will often immediately indicate the arithmetic required (if any) to determine the area from Table 3 entries.

Notice that virtually all the area under the standard normal curve is contained between $z = -3.5$ and $z = 3.5$. Areas to the left of $-3.5$ and to the right of 3.5 are extremely small. Table 3 gives .0002 for each area. Consequently, for $z$ values more extreme than these, we typically ignore the areas to the left of the negative extreme values and to the right of the positive extreme values.

A table of standard normal curve areas and the relationship $z = (x - \mu)/\sigma$ can be used to find the area under any normal density curve. To illustrate, suppose a normal density function with mean 10 and standard deviation 2 is a good representation of a particular density histogram, and we are interested in the proportion of the data between, say, 6 and 11. This proportion is approximated by the area under the $N(10, 2)$ curve over the interval $[6, 11]$. The $x$ values of a normal variable can be converted to the $z$ values of a standard normal variable. In this case, $\mu = 10$ and $\sigma = 2$, so a value of $x = 6$ corresponds to a $z$ value of
$$z = \frac{6 - 10}{2} = -2$$
Similarly, a value of $x = 11$ converts to a standard normal value of
$$z = \frac{11 - 10}{2} = .5$$
The area under the $N(10, 2)$ function over the interval $[6, 11]$ is exactly the same as the area under the standard normal curve, $N(0, 1)$, over the interval $[-2, .5]$.

[Figure 2.16 Area Under the $N(10, 2)$ Curve over the Interval $[6, 11]$: the $x$ axis (6, 8, 10, 12, 14) is shown above the matching $z = (x - 10)/2$ axis ($-2$, $-1$, 1, 2) of the $N(0, 1)$ curve, with equal shaded areas]
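The conversion from $x$ values to $z$ values for the $N(10, 2)$ illustration can be verified numerically. The sketch below assumes Python's standard `statistics.NormalDist` in place of Table 3; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 10, 2            # the N(10, 2) curve from the text
c, d = 6, 11                 # interval of interest

# Convert the x endpoints to z values with z = (x - mu) / sigma.
z_c = (c - mu) / sigma       # -2.0
z_d = (d - mu) / sigma       # 0.5

std_normal = NormalDist(0, 1)
area_via_z = std_normal.cdf(z_d) - std_normal.cdf(z_c)

# The same area computed directly on the N(10, 2) curve, as a cross-check.
x_dist = NormalDist(mu, sigma)
area_direct = x_dist.cdf(d) - x_dist.cdf(c)

print(f"z interval: [{z_c}, {z_d}]")
print(f"area via z: {area_via_z:.4f}, direct: {area_direct:.4f}")
```

Both routes give an area of about .6687, which matches what a four-place table would give as .6915 minus .0228.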

The area over $[-2, .5]$ can be determined with the help of the standard normal table. This situation is illustrated in Figure 2.16. In general, we have the following.

Suppose we are interested in the area under the $N(\mu, \sigma)$ distribution curve between two numbers $c$ and $d$ with $c < d$. Then
$$\text{Area under } N(\mu, \sigma) \text{ over the interval } [c, d] = \text{Area under } N(0, 1) \text{ over the interval } \left[\frac{c - \mu}{\sigma}, \frac{d - \mu}{\sigma}\right]$$

Since single points have zero width, the area under a normal curve over the interval $[c, d]$ is the same as the area over the interval $(c, d)$, the interval without the endpoints. That is, the area under the curve does not change if we include or exclude one or both of the endpoints of the target interval.

EXAMPLE 2.15 Determining Areas Under Normal Curves

1. Suppose $X$ is approximately distributed as $N(100, 5)$. Determine the area under this normal density between 97 and 110.

Solution and Discussion. With $x = 97$ and $x = 110$, the area under the curve between 97 and 110 is the same as the area under the standard normal density between
$$z = \frac{97 - 100}{5} = -.60 \quad \text{and} \quad z = \frac{110 - 100}{5} = 2.00$$
Using Table 3, the latter area is the area to the left of $z = 2.00$ minus the area to the left of $z = -.60$, or $.9772 - .2743 = .7029$.

2. Suppose $X$ is approximately distributed as $N(12, 3)$. Determine the area to the right of $x = 8.5$.

Solution and Discussion. Since the total area under any normal curve is 1, the area to the right of $x = 8.5$ is equal to
$$1 - (\text{Area to the left of } x = 8.5)$$
Converting $x = 8.5$ to
$$z = \frac{8.5 - 12}{3} = -1.167$$
we calculate the required area:
$$1 - (\text{Area to the left of } z = -1.167) = 1 - .1216 = .8784$$
where .1216 is obtained by interpolating between the entries corresponding to $z = -1.16$ and $z = -1.17$ in Table 3.

3. Suppose $X$ is approximately distributed as $N(44, 6)$, and we know that a proportion .90 of the area under this curve is to the left of a value $x$. That is, $x$ is the 90th percentile of the $N(44, 6)$ distribution. Determine $x$.
Solution and Discussion. We are given
$$(\text{Area to the left of } x) = .90$$
This implies that
$$\left(\text{Area to the left of } z = \frac{x - 44}{6}\right) = .90$$
From Table 3, we can determine the value of $z$ that has .90 (or approximately .90) of the area under the standard normal density to the left of it. Using the table, we find
$$(\text{Area to the left of } z = 1.28) = .8997$$
which is nearly .90. Setting
$$\frac{x - 44}{6} = 1.28$$
and solving for $x$, we get $x = 44 + 1.28(6) = 51.68$. Consequently, $x = 51.68$ is the 90th percentile of the $N(44, 6)$ distribution.

On occasion, we might want to judge whether the observed value of a normal variable is, in some sense, unexpected. If area is determined from a $N(\mu, \sigma)$ curve, an unexpected or unusual value $x$ is one that is too far from the mean $\mu$. Equivalently, the absolute value of $z = (x - \mu)/\sigma$ is too large. A value for the standardized variable less than $-2$ or greater than 2 could be considered large because each of the tail areas is .0228 and the combined area $2(.0228) = .0456$ is small. We will continue to elaborate on this idea of “unusual” or “unexpected” as we develop the central statistical procedures.
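The three parts of Example 2.15, and the tail areas used to judge unusual values, can be reproduced as follows. This is a sketch that assumes Python's standard `statistics.NormalDist`; `table_entry` and `interpolate` are hypothetical helpers that imitate reading and interpolating a four-place table, not functions from any statistics package.

```python
from statistics import NormalDist

std_normal = NormalDist(0, 1)

def table_entry(z):
    """Four-place area to the left of z, standing in for a Table 3 entry."""
    return round(std_normal.cdf(z), 4)

def interpolate(z, z_lo, z_hi):
    """Linear interpolation between the table entries at z_lo and z_hi."""
    fraction = (z - z_lo) / (z_hi - z_lo)
    return table_entry(z_lo) + fraction * (table_entry(z_hi) - table_entry(z_lo))

# 1. N(100, 5): area between 97 and 110.
z_lo, z_hi = (97 - 100) / 5, (110 - 100) / 5
area_1 = table_entry(z_hi) - table_entry(z_lo)

# 2. N(12, 3): area to the right of 8.5, interpolating at z = -1.167.
z = round((8.5 - 12) / 3, 3)
area_2 = 1 - interpolate(z, -1.17, -1.16)

# 3. N(44, 6): 90th percentile, using z = 1.28 as read from the table,
#    and the percentile computed without rounding z, for comparison.
x_90_table = 44 + 1.28 * 6
x_90_exact = NormalDist(44, 6).inv_cdf(0.90)

# Combined tail area beyond z = -2 and z = 2.
two_tail = 2 * table_entry(-2)

print(round(area_1, 4), round(area_2, 4), round(x_90_table, 2),
      round(x_90_exact, 2), round(two_tail, 4))
```

The table-based percentile 51.68 differs slightly from the unrounded value (about 51.69) because $z = 1.28$ leaves .8997, not exactly .90, of the area to its left.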

[Photograph: “An Outlier?” Credit: John Chase]

Not all data patterns can be reasonably approximated by a normal curve. Therefore, if a normal distribution is tentatively assumed to be a plausible data model in a particular case, this assumption must be checked once the sample observations are in hand. We consider this in Chapter 7, where we discuss the normal distribution in greater detail.

EXERCISES

2.34 Find the area under the standard normal curve to the left of
a. 1.16
b. .24
c. .57
d. 2.1

2.35 Find the area under the standard normal curve to the left of
a. .77
b. 1.68
c. .21
d. 1.39

2.36 Find the area under the standard normal curve to the right of
a. .84
b. 2.25
c. 1
d. 1.595 (interpolate)

2.37 Find the area under the standard normal curve to the right of
a. .21
b. 2.03
c. .67
d. 1.115 (interpolate)

2.38 Find the area under the standard normal curve over the interval
a. 0 to .37
b. .42 to 1.06
c. $-1.62$ to .09
d. .25 to 1.966 (interpolate)

2.39 Find the area under the standard normal curve over the interval
a. $-2.07$ to .04
b. $-1.12$ to .35
c. .77 to
d. .69 to 1.893 (interpolate)

2.40 Identify the $z$ values in the following diagrams of the standard normal distributions (interpolate as needed).

[Diagrams a to f: standard normal curves with shaded areas; labels recovered from the figure include .2643, $-1$, .756, .6528, .20, .35, 1.82, and .59]

2.41

[Diagrams a to d: standard normal curves with shaded areas; labels recovered from the figure include .125, .20, .668, 2.0, and .888]

2.42 2.43