NUMERICAL SUMMARIES OF DATA DISTRIBUTIONS


2.5 NUMERICAL SUMMARIES OF DATA DISTRIBUTIONS

incomes. Compare the two density histograms. You may want to plot back-to-back density histograms. Do your conclusions about the distribution of white household incomes relative to the distribution of nonwhite household incomes change from part a? Explain.

c. Refer to your results in parts a and b. When comparing distributions over class intervals of unequal lengths, is it better to use relative frequency histograms or density histograms? Discuss.

As we have seen, data sets can be visually compared using density histograms. More succinct summaries are provided by single numbers that represent particular features of data sets. For example, we may be interested in the center of a data set, or the smallest value, or the typical distance from the center, and so forth. These single-number summaries may be of interest in their own right, or they may be used in conjunction with density histograms to allow more objective comparisons.

Why are single-number summaries important? They provide immediate impressions of order of magnitude, and they allow simple comparisons. A current U.S. unemployment rate of 6.4% provides us with an immediate indication of the overall jobless situation, particularly when this number is compared with last month's figure of 6.7%. We know that some areas of the country will have unemployment rates higher than 6.4% and some areas will have lower rates, but it is difficult to convey to the general public the nature of unemployment by publishing the entire collection of unemployment rates for, say, all the U.S. standard metropolitan areas. We need a summary measure of unemployment.

At one of the Ford Motor Company plants, it takes a total of 20.4 hours to build a new car. Do you believe that every vehicle takes exactly 20.4 hours to build? Of course not. Sometimes it takes more than 20.4 hours, sometimes it takes less. The number 20.4 is a "typical" figure. It is a useful way to summarize one aspect of productivity. It can be compared with the 19.5 hours it takes to build a vehicle at one of the Toyota plants in the United States.

Initially, we will concentrate on the following numerical measures of magnitude or location:

    Mean
    Median
    Percentiles

Later, we will consider numerical summaries of other features of data sets.

To clarify the ideas and to present effectively the associated calculations, it is convenient to use the symbols $x_1, x_2, \ldots, x_n$ to represent the $n$ measurements in the data set. We introduced this notation in Chapter 1. Now the $x_i$'s may be measurements of quantitative variables or numbers assigned to observed categories of qualitative variables. The subscripted notation allows a general discussion since we are not then anchored to a specific set of numbers.

MEASURES OF LOCATION

The two most commonly used measures of center are the mean and the median. The sample mean was introduced in Chapter 1. Recall that the sample mean is the sum of the sample measurements divided by the sample size and is denoted by $\bar{x}$. For measurements $x_1, x_2, \ldots, x_n$,

Sample mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

To understand how the sample mean indicates the center or middle, we present the following example.

EXAMPLE 2.8 Interpreting the Sample Mean

Two measures of productivity for the 10 most productive vehicle assembly operations in North America, according to a 1994 Harbour Report, are listed in Table 2.4. Construct a dot diagram for the total hours needed to build a vehicle, and indicate the sample mean on the diagram.

TABLE 2.4 Productivity in Auto Manufacturing

    Plant                              Number of Workers    Total Number of Hours
                                       per Vehicle          to Build a Vehicle
    Nissan truck    Smyrna, Tenn.      2.20                 17.6
    Nissan car      Smyrna, Tenn.      2.32                 18.6
    Toyota car      Georgetown, Ky.    2.44                 19.5
    Ford car        Kansas City, Mo.   2.48                 19.8
    Ford car        Atlanta, Ga.       2.49                 19.9
    Nummi truck     Fremont, Calif.    2.52                 20.2
    Ford car        Chicago, Ill.      2.55                 20.4
    Ford truck      Norfolk, Va.       2.70                 21.6
    Ford truck      Louisville, Ky.    2.71                 21.7
    Chrysler car    Belvidere, Ill.    2.72                 21.8

SOURCE: San Diego Union-Tribune, June 24, 1994.

Solution and Discussion. The dot diagram, with the value of the sample mean, $\bar{x} = 20.11$, indicated by a fulcrum, is shown in Figure 2.9. If we imagine the horizontal axis of the dot diagram as a weightless bar and the dots representing the data as balls of equal size and weight, the mean is the point at which the bar balances.

[Figure 2.9: Total Number of Hours to Build a Vehicle and the Location of the Sample Mean. Dot diagram; horizontal axis Tothours, 17 to 22.]

The sample mean is affected by extreme observations.
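The balancing-point interpretation can be checked numerically. The following is a minimal sketch, not part of the original example: the helper name `sample_mean` is an illustrative choice, and the replacement value 13.3 is a hypothetical change made only to show how far one observation can move the mean.

```python
# Total hours to build a vehicle, copied from Table 2.4.
hours = [17.6, 18.6, 19.5, 19.8, 19.9, 20.2, 20.4, 21.6, 21.7, 21.8]


def sample_mean(values):
    """Sum of the measurements divided by the sample size."""
    return sum(values) / len(values)


x_bar = sample_mean(hours)

# Balancing point: the deviations from the mean cancel out,
# so their sum is zero up to floating-point rounding.
net_deviation = sum(x - x_bar for x in hours)

# Hypothetical change for illustration: the smallest value 17.6 becomes 13.3.
shifted = [13.3] + hours[1:]
x_bar_shifted = sample_mean(shifted)

print(round(x_bar, 2))          # 20.11
print(round(x_bar_shifted, 2))  # 19.68
```

Lowering a single observation by 4.3 hours pulls the mean of these 10 values down by 0.43 hours, which is the sense in which the fulcrum must move to keep the bar balanced.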

Imagine, for example, that the smallest total hours figure, 17.6, is decreased (moved to the left in the figure) while the other numbers remain the same. To maintain balance, the mean (fulcrum) must decrease (move to the left). If we change 17.6 to 13.3, for example, the sample mean becomes 19.68.

Is the sample mean a good measure of center? It is, provided you interpret the center as the balancing point. For large samples, the sample mean is ordinarily not appreciably affected by a few extreme measurements.

Summary measures that are not affected by extreme values are said to be resistant or robust. One way to make the sample mean robust is not to include extreme values in its calculation. Suppose we order the observations from smallest to largest and then ignore, say, 5% of the measurements at each end. If we calculate the sample mean from the remaining observations, the result is called the 5% trimmed mean. Ignoring 10% of the observations at each end gives the 10% trimmed mean, and so forth. A trimmed mean is the balancing point or center of gravity of the measurements from which it is calculated. In this sense, its interpretation is the same as that of the sample mean. Computer programs will usually compute a trimmed mean along with the sample mean. Five percent is a typical amount of trimming.

To obtain an even more robust summary statistic, arrange the data from smallest to largest. The sample median is the value that divides the data set in half; that is, 50% of the measurements are less than the median, and 50% are larger than the median.

Sample median: The value that divides the ordered data in half

If the number of measurements is odd, the median is the middle measurement. If the number of measurements is even, the median is defined to be the average of the two middle measurements, or the value halfway between them.

To calculate the sample median:

1. Arrange the observations in numerical order, from smallest to largest.
2. If the number of observations $n$ is odd, the sample median, $M$, is the middle observation, determined by counting $(n + 1)/2$ observations up from the smallest value in the ordered set.
3. If the number of observations $n$ is even, the sample median, $M$, is the average of the two middle observations in the ordered set.

Notice that the calculation of the median is not influenced by the values of the measurements at the ends of the ordered data set. Consequently, the sample median is a robust measure of location. Moreover, the median corresponds to our intuitive notion of middle: the value that divides the ordered observations exactly in half.

The sample mean and sample median determined from the same data set will, in general, be different. This should not be surprising since they correspond to different notions of center. They measure the overall location of a data set in different ways. The sample mean is the most popular measure of location but, in cases where the mean and median are considerably different, both should be reported. A collection of incomes, for example,

    40,000   50,000   58,000   60,000   136,000

is best summarized by the sample median, $M = 58{,}000$, since it will not be influenced by exceptionally large incomes. Large incomes tend to inflate the sample mean, in this case $\bar{x} = 68{,}800$, and make it less useful as a measure of typical income.

EXAMPLE 2.9 Calculating the Sample Median for an Even Number of Observations

The total hours needed to build a vehicle are arranged from smallest to largest in Table 2.4 of Example 2.8. Calculate the sample median.

Solution and Discussion. There are $n = 10$ observations, so the sample median is the average of the two middle values, 19.9 and 20.2; that is, $M = 20.05$. The median is in position $(n + 1)/2 = 11/2 = 5.5$, or halfway between the 5th and 6th ordered observations.

EXAMPLE 2.10 Calculating the Sample Median for an Odd Number of Observations

A University Association Group Life Insurance Plan paid 31 death claims during a recent policy year. The claim amounts are given from smallest to largest in Table 2.5. Calculate the median.

TABLE 2.5 Death Claim Amounts for Group Life Insurance Plan

     1750    2800    3500    4025    4025    4375    4375    4375
     5775    5775    6125    6125    6125    6475    6825    6825
     6825    7350    7350    7350    7350    8050    9450   13125
    13125   26250   26250   54600   64750   89600   95550

Solution and Discussion. Since $n = 31$ is odd, the median is the middle observation given, in this case, by counting $(n + 1)/2 = 16$ observations from the smallest number. Thus, $M = 6{,}825$. The median indicates a central value. However, if the total of the payments for claims is important, the total is

    (Number of claims) × (Mean claim) = $31\bar{x}$,

whereas $31M$ is not related to total payments.
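The median steps and the trimmed mean can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the trimming rule (drop the whole number of observations given by rounding n × proportion down at each end) is an assumption, since software packages differ in how they handle a fractional count.

```python
import math

# Data copied from Table 2.4 (hours), Table 2.5 (claims), and the income list.
hours = [17.6, 18.6, 19.5, 19.8, 19.9, 20.2, 20.4, 21.6, 21.7, 21.8]
claims = [1750, 2800, 3500, 4025, 4025, 4375, 4375, 4375, 5775, 5775,
          6125, 6125, 6125, 6475, 6825, 6825, 6825, 7350, 7350, 7350,
          7350, 8050, 9450, 13125, 13125, 26250, 26250, 54600, 64750,
          89600, 95550]
incomes = [40000, 50000, 58000, 60000, 136000]


def sample_median(values):
    """Middle ordered value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]          # position (n + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


def trimmed_mean(values, proportion):
    """Mean after dropping floor(n * proportion) values at each end (assumed rule).

    Intended for proportions below 0.5, so that some observations remain.
    """
    ordered = sorted(values)
    k = math.floor(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


print(round(sample_median(hours), 2))   # 20.05
print(sample_median(claims))            # 6825
print(sum(incomes) / len(incomes))      # 68800.0
print(sample_median(incomes))           # 58000
print(trimmed_mean(incomes, 0.2))       # 56000.0
```

With only five incomes, a 20% trim removes one value at each end, and the trimmed mean (56,000) lands much closer to the median than the untrimmed mean does.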

Percentiles are numbers that divide the data into percentages. The sample median is the 50th percentile, because the sample median divides an ordered data set in half.

Sample 100$p$th percentile: The value in an ordered data set such that at least $100p\%$ of the data set is at or less than this value and at least $100(1 - p)\%$ of the data set is at or above this value

Setting $p = .25$, $.5$, and $.75$ generates the 25th, 50th, and 75th percentiles, respectively. These numbers, taken as a group, divide the data set into quarters and, not surprisingly, are known as the sample quartiles.

We adopt the convention of taking an observed value for the sample percentile except when two adjacent values satisfy the definition, in which case their average is taken as the percentile. This procedure is consistent with the way we calculate the sample median.

To calculate the sample 100$p$th percentile,

1. Arrange the observations in numerical order, from smallest to largest.
2. Determine the product (Sample size) × (Proportion) = $np$.
3. If $np$ is not an integer, round it up to the next integer and find the observation in this position. This value is the percentile. If $np$ is an integer, say, $k$, calculate the average of the $k$th and $(k + 1)$st ordered values. This average is the percentile.

Some statistical software packages use slight variations of our definition of percentiles. For large samples, they all tend to give essentially the same numbers.

The sample percentiles used most frequently are the median, and the first and third quartiles. The sample quartiles are summarized here in terms of the percentiles they represent. From these representations, you can see that the first and third quartiles are themselves medians. The first quartile, $Q_1$, is the median of the observations less than the sample median, and the third quartile, $Q_3$, is the median of the observations greater than the sample median.

Sample Quartiles

    First quartile               $Q_1$ = 25th percentile
    Second quartile (or median)  $Q_2 = M$ = 50th percentile
    Third quartile               $Q_3$ = 75th percentile

EXAMPLE 2.11 Calculating Sample Quartiles

To illustrate the calculation of sample quartiles, we turn once more to the productivity data listed in Table 2.4 (see Example 2.8). From Table 2.4, the $n = 10$ values of total hours needed to build a vehicle are, in order,

    17.6   18.6   19.5   19.8   19.9   20.2   20.4   21.6   21.7   21.8

The sample median (or 50th percentile, or second quartile) was calculated in Example 2.9. Recall that $M = 20.05$. Calculate the first and third quartiles.

Solution and Discussion. To calculate the first quartile, set $p = .25$. Then $np = 10(.25) = 2.5$. Since 2.5 is not an integer, round it up to the next integer, 3, and take the observation in the 3rd position as the required quartile. Thus, $Q_1 = 19.5$. Three of the 10 observations (at least 25%) are at or below 19.5, and 8 observations (at least 75%) are at or above 19.5, confirming that it is the first quartile.

Similarly, to get the third quartile, set $p = .75$ so that $np = 10(.75) = 7.5$. Round 7.5 up to the next integer, 8, and take the observation in the 8th position as the required quartile. Consequently, $Q_3 = 21.6$. Eight of the 10 observations (at least 75%) are at or below 21.6, and 3 observations (at least 25%) are at or above it.

The three quartiles, $Q_1 = 19.5$, $M = 20.05$, and $Q_3 = 21.6$, divide the data set into quarters.

If, in Example 2.11, the last number in the data set were 25.3 instead of 21.8, the quartiles would not change. Similarly, if the two smallest values were, for example, 16.9 and 18.8 instead of 17.6 and 18.6, respectively, the quartiles would not change. Percentiles in general, and quartiles in particular, are not heavily influenced by the particular values of the observations. Extreme values have no influence on percentiles located toward the center of the distribution. This is what we mean when we say that percentiles are robust measures of location.

We have discussed measures of location in terms of the original set of observations. If the data are displayed as dot diagrams, stem-and-leaf diagrams, or density histograms, measures of location can be indicated on the diagrams. We have already seen, for example, with the 800-meter data in Figure 2.3, that the statistical software identifies the median class in its version of the stem-and-leaf diagram and prints the cumulative frequencies from each end of the data distribution. This allows easy identification of the sample quartiles.

The sample mean always retains its interpretation as the balancing point. Therefore, its location on the variable axis of a dot diagram and, to a good approximation, a density histogram, is the point at which a fulcrum would just balance the configuration of points or pattern of vertical bars. Because the sample mean is not a robust measure of location, it will typically be larger than the median for a histogram with a long right-hand tail, and less than the median for a histogram with a long left-hand tail. The two measures of location will almost coincide for nearly symmetric histograms, because the balancing point and the value dividing the distribution in half are the same (see Exercise 2.22).
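The percentile procedure used in this section (round the product of sample size and proportion up when it is not an integer, average two adjacent ordered values when it is) can be sketched as follows. The function name `sample_percentile` is hypothetical, and passing the percentile as a whole number such as 25 is a design assumption made so that the integer test avoids floating-point rounding.

```python
def sample_percentile(values, percent):
    """Sample percentile by the round-up / average-adjacent-values rule.

    `percent` is a whole number strictly between 0 and 100 (25 for p = .25).
    """
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    product = n * percent              # equals 100 * n * p, kept as an integer
    if product % 100 == 0:
        k = product // 100             # n * p is the integer k
        return (ordered[k - 1] + ordered[k]) / 2   # average kth and (k+1)st
    position = product // 100 + 1      # round n * p up to the next integer
    return ordered[position - 1]


# Ordered total hours from Table 2.4.
hours = [17.6, 18.6, 19.5, 19.8, 19.9, 20.2, 20.4, 21.6, 21.7, 21.8]

q1 = sample_percentile(hours, 25)
median = sample_percentile(hours, 50)
q3 = sample_percentile(hours, 75)
print(q1, round(median, 2), q3)        # 19.5 20.05 21.6

# Robustness check: replace the largest value 21.8 by 25.3.
stretched = hours[:-1] + [25.3]
same_quartiles = (sample_percentile(stretched, 25) == q1
                  and sample_percentile(stretched, 75) == q3)
print(same_quartiles)                  # True
```

Because other software conventions interpolate between ordered values, a package may report slightly different quartiles for the same data; the sketch follows only the rule stated in this section.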

MEASURES OF VARIATION

We talked about measuring variability in Chapter 1, where we introduced the sample variance and sample standard deviation. Here, we will not only review these measures, we will also introduce the sample range and sample interquartile range as additional measures of variability.

The sample variance, $s^2$, and sample standard deviation, $s$, can be useful single-number summaries of variability. This is particularly true for relatively large, mound-shaped data sets.

Sample variance: $s^2 = \dfrac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Sample standard deviation: $s = \sqrt{s^2} = \sqrt{\dfrac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

The sample variance is essentially an average squared distance from the mean; consequently, its value can be heavily influenced by observations far from the middle. Since the standard deviation is closely connected to the variance, its value can also be heavily influenced by observations far from the middle. The sample variance and sample standard deviation are not robust measures of variability.

The number $n - 1$ in the definition of the sample variance or sample standard deviation is called the degrees of freedom because it represents the number of deviations from the mean that are "free to vary." Let's see what this means. In Chapter 1, we showed that the sum of the deviations $x_i - \bar{x}$ is always 0. Consequently, the final deviation can be determined once we know the other $n - 1$ deviations. For example, given the $n = 4$ numbers

    2   3   4   −1

you may verify that $\bar{x} = 2$, and the first three deviations from the mean are $2 - 2 = 0$, $3 - 2 = 1$, and $4 - 2 = 2$. Since the sum of all deviations must be zero, the last deviation must be $-1 - 2 = -3$. Only $3 = n - 1$ of the deviations are free to vary.

The sample range is simply the difference between the largest and smallest observations. It is the length of the interval that just contains all the data.

Sample range: Range = Largest observation − Smallest observation

The range is very easy to calculate and interpret. However, by definition, the range is extremely sensitive to the existence of even a single very large or very small value in the data set. Like the sample variance and standard deviation, it is not a robust measure of variability.
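The degrees-of-freedom argument can be followed numerically with four numbers whose mean is 2 (here 2, 3, 4, and −1). This is a minimal sketch; the function name `sample_variance` is an illustrative choice, and the divisor n − 1 follows the usual definition of the sample variance.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


numbers = [2, 3, 4, -1]
x_bar = sum(numbers) / len(numbers)            # 2.0
deviations = [x - x_bar for x in numbers]      # [0.0, 1.0, 2.0, -3.0]

# The deviations sum to zero, so the last one is fixed by the first n - 1.
last_from_others = -sum(deviations[:-1])       # -3.0

s2 = sample_variance(numbers)                  # 14 / 3
s = math.sqrt(s2)
sample_range = max(numbers) - min(numbers)     # 4 - (-1) = 5

print(deviations, last_from_others)
print(round(s2, 3), round(s, 3), sample_range)
```

The squared deviations 0, 1, 4, and 9 add to 14, and dividing by the 3 degrees of freedom gives the sample variance.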
The sample interquartile range is a measure of variability based on the first and third quartiles.

Sample interquartile range: IR = Third quartile − First quartile = $Q_3 - Q_1$

The interquartile range is the length of the interval just containing the middle 50% of the observations. This interval is not centered on the median unless the data distribution is symmetric. Because the interquartile range depends only on the first and third quartile numbers, its value is not affected by a few extreme measurements at each end of the distribution. The sample interquartile range is a robust measure of spread. It is often used to measure spread when the median is used to measure middle.

The center and the extent of spread in a data set are key pattern features. For nearly symmetric and mound-shaped data distributions, the mean and standard deviation are worthwhile measures of location and spread. Their usefulness is enhanced by the empirical rule.

The empirical rule provides intervals that contain certain proportions of the data when we know only the values of $\bar{x}$ and $s$. This rule works best with large data sets (for example, more than 30 numbers) that tend to have a mound of values around the mean and fewer values far from the mean in each direction. It gives the approximate proportion of values within 1, 2, and 3 standard deviations of the mean.

Empirical Rule

    Approximately
    68% of the data lie within $\bar{x} \pm s$
    95% of the data lie within $\bar{x} \pm 2s$
    99.7% of the data lie within $\bar{x} \pm 3s$

With only the two values $\bar{x}$ and $s$, the empirical rule allows us to create an expanding set of intervals that contain increasing proportions of the data set.

When data distributions are skewed to the right or to the left, no single measure of spread is entirely satisfactory because the nature of the variability on one side of the center is different from that on the other side. If a single measure of spread is required, it may be best to use a number that is robust to extreme values.

EXAMPLE 2.12 Summarizing Variation with the Empirical Rule, the Range, and the Interquartile Range

Table 2.6 gives the cost per kilowatt (KW) capacity in place for a particular year for 22 U.S. public utility companies. Obtain the range, the interquartile range, and the intervals $\bar{x} \pm s$ and $\bar{x} \pm 2s$. Compare the proportions of the observations in the latter intervals with the proportions suggested by the empirical rule.

TABLE 2.6 Cost per KW Capacity in Place for U.S. Public Utilities

    151   104    96   168   168   174   136   178   175   202   148
    164   197   192   111   150   199   245   113   204   252   173

Solution and Discussion. The Minitab printout follows:

                  N     MEAN   MEDIAN   TRMEAN    STDEV
    CostprKW     22   168.18   170.50   167.60    41.19

                MIN      MAX       Q1       Q3
    CostprKW  96.00   252.00   145.00   197.50

From the printout, Min = 96, Max = 252, $Q_1 = 145$, $M = 170.5$, and $Q_3 = 197.5$. We can easily calculate

    Range = Max − Min = 252 − 96 = 156

and

    IR = $Q_3 - Q_1$ = 197.5 − 145 = 52.5

All of the observations are within 156 units of one another. The middle 50% of the costs per KW capacity in place are contained in an interval of length 52.5.

The empirical rule suggests that about 68% of the costs should fall in the interval

    $\bar{x} \pm s$ = 168.18 ± 41.19   or   (126.99, 209.37)

Similarly, about 95% of the observations should fall in the interval

    $\bar{x} \pm 2s$ = 168.18 ± 82.38   or   (85.80, 250.56)

In fact, 100(16/22)% = 73% of the costs are included in the first interval, and 100(21/22)% = 95% of the costs are contained in the second interval. For this relatively small data set, the empirical rule gives a fairly accurate picture of the distribution of cost per KW capacity in place.

BOXPLOTS

Together with the smallest and largest observations, the quartiles $Q_1$, $Q_2$, $Q_3$ provide a fairly comprehensive five-number summary of a distribution of measurements. Let Min and Max represent the smallest and largest observations in the data set, respectively. The five-number summary

    Min   $Q_1$   $Q_2 = M$   $Q_3$   Max

is represented pictorially as a boxplot. Figure 2.10 is a horizontal boxplot of the cost per KW capacity in place data from Example 2.12.

[Figure 2.10: Boxplot of Costs per KW Capacity in Place for 22 U.S. Public Utilities. Horizontal axis CostprKW, 120 to 240; Min, $Q_1$, $M$, $Q_3$, and Max are marked.]

There are five vertical lines in the boxplot: the lines forming the ends of the box (rectangle), the line within the box, and the small vertical lines at the ends of the horizontal lines (whiskers) that extend in opposite directions from the box. These vertical lines correspond to the five summary numbers. Reading from the scale beneath the figure, we see that the vertical line within the box identifies the median. The ends of the box correspond to the 1st and 3rd quartiles, and the lines at the ends of the whiskers denote the minimum (Min) and maximum (Max) values. The length of the box is the interquartile range, and the distance between the Min and Max is the overall range. The median line is nearly in the middle of the box and the whiskers are nearly of the same length, so this data distribution is very nearly symmetric.

Boxplots are not as informative as stem-and-leaf plots or density histograms because they do not show the patterns of the data within the quartile boundaries. They are, however, useful for assessing symmetry or asymmetry and for comparing distributions. Figure 2.11 shows side-by-side boxplots of average Graduate Record Examination (GRE) verbal scores for students admitted to graduate study in departments classified according to the general categories displayed. The departmental averages are based on data for students who took the GRE over a five-year period.

[Figure 2.11: Boxplots of Departmental Means for GRE Verbal Scores. Categories: All departments, Natural sciences, Engineering, Social sciences, Humanities and arts, Education. Horizontal axis GRE verbal scores, 200 to 800.]

SOURCE: Schneider, L. M., and Briel, J. B. Validity of the GRE: 1988–89 Summary Report. Princeton, N.J.: Educational Testing Service, Sept. 1990.

The center, spread, and range of the distributions of average scores are immediately apparent. We see, for example, that the average GRE verbal scores for Engineering departments are tightly concentrated about a median average score of about 540. The highest median average verbal score occurs for students admitted to departments in the Humanities and Arts. The interquartile range is about the same for all the categories with the exception of Engineering, where it is smaller. Finally, although there are some differences in overall spread as measured by the range, the median scores do not vary a great deal. Boxplots for departmental averages of GRE quantitative scores are considered in Exercise 2.29.

MODIFIED BOXPLOTS

The whiskers in boxplots ordinarily extend to the smallest and largest observations. However, if some of the observations are significantly smaller or larger than the rest (potential outliers), they are not evident using this procedure. Boxplots can be modified to reveal potential outliers by extending the whiskers to the smallest and largest observations only if these points are sufficiently close to the rest of the data. If they are not, those observations far removed from the majority of cases are plotted as individual points. A common measure of closeness is 1.5 × IR.

A modified boxplot is constructed by extending the whiskers to the smallest and largest observations if these values are within 1.5 × IR of the first and third quartiles, respectively. Otherwise, the whiskers are extended to the most extreme values still contained in these limits and the remaining observations are plotted individually.

Modified boxplots work best for a moderate number of observations. If the number of observations is too large, an inordinate number of outliers may be identified.

EXAMPLE 2.13 Constructing a Modified Boxplot

The actual costs (ActualCo) in millions of dollars of 26 construction projects at a large industrial facility are given in Table 2.7. Construct a modified boxplot for these data.

TABLE 2.7 Actual Construction Costs

      .918    7.214   14.577   30.028   38.173   15.320   14.837
    51.284   34.100    2.003   20.099    4.324   10.523   13.371
     1.553    4.069   27.973    7.642    3.692   29.522   15.317
     5.292     .707    1.246    1.143   21.571

SOURCE: Schmoyer, R. L. "Asymptotically Valid Prediction Intervals for Linear Models." Technometrics, Vol. 34, Nov. 1992, pp. 399–408.

Solution and Discussion. The modified boxplot is shown in Figure 2.12 and indicates one potential outlier.

[Figure 2.12: Modified Boxplot for Construction Costs. Horizontal axis ActualCo, 15 to 60.]

For the construction costs, you may verify that Min = .707, $Q_1 = 3.692$, $M = 11.947$, $Q_3 = 21.571$, and Max = 51.284. Consequently,

    IR = $Q_3 - Q_1$ = 21.571 − 3.692 = 17.879   and   1.5 × IR = 1.5(17.879) = 26.819

The smallest observation in the data set, Min = .707, is well within 26.819 of $Q_1 = 3.692$; therefore, the left-hand whisker extends to this smallest value. The number

    $Q_3$ + 1.5 × IR = 21.571 + 26.819 = 48.390

is greater than all the construction costs except Max = 51.284. Thus the right-hand whisker extends to the largest number in the data set less than or equal to 48.390 (here 38.173), and the remaining case, 51.284, is plotted individually.

The distribution of construction costs is skewed to the right, and the extreme value, 51.284, will have a significant influence on the calculation of, for example, the sample mean. In this example, there is nothing wrong with the number 51.284, but it is highlighted as a project whose construction cost is considerably higher than that of the other projects.
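The whisker limits of a modified boxplot can be computed directly from the quartiles. The sketch below is illustrative only: the helper `quartile` is a hypothetical name, it uses the round-up convention for percentiles described earlier in this section (other software may interpolate and give slightly different quartiles), and no plotting is attempted.

```python
# Actual construction costs (millions of dollars), copied from Table 2.7.
costs = [0.918, 7.214, 14.577, 30.028, 38.173, 15.320, 14.837, 51.284,
         34.100, 2.003, 20.099, 4.324, 10.523, 13.371, 1.553, 4.069,
         27.973, 7.642, 3.692, 29.522, 15.317, 5.292, 0.707, 1.246,
         1.143, 21.571]


def quartile(values, percent):
    """Percentile by the round-up / average-adjacent rule (percent is 25, 50, or 75)."""
    ordered = sorted(values)
    product = len(ordered) * percent
    if product % 100 == 0:
        k = product // 100
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[product // 100]     # 0-based index of the rounded-up position


q1 = quartile(costs, 25)               # 3.692
q3 = quartile(costs, 75)               # 21.571
ir = q3 - q1                           # about 17.879

lower_limit = q1 - 1.5 * ir            # below every cost in the table
upper_limit = q3 + 1.5 * ir            # about 48.39

# Whiskers stop at the most extreme observations inside the limits;
# anything outside is plotted individually as a potential outlier.
inside = [c for c in costs if lower_limit <= c <= upper_limit]
left_whisker, right_whisker = min(inside), max(inside)
outliers = sorted(c for c in costs if c < lower_limit or c > upper_limit)

print(left_whisker, right_whisker, outliers)   # 0.707 38.173 [51.284]
```

The single flagged value matches the one potential outlier identified in Example 2.13; flagging it says only that the project is far from the rest, not that the number is wrong.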
