incomes. Compare the two density histograms. You may want to plot back-to-back density histograms. Do your conclusions about the distribution of white household incomes relative to the distribution of nonwhite household incomes change from part a? Explain.
c. Refer to your results in parts a and b. When comparing distributions over class intervals of unequal lengths, is it better to use relative frequency histograms or density histograms? Discuss.

2.5 NUMERICAL SUMMARIES OF DATA DISTRIBUTIONS

MEASURES OF LOCATION
As we have seen, data sets can be visually compared using density histograms. More succinct summaries are provided by single numbers that represent particular features of
data sets. For example, we may be interested in the center of a data set, or the smallest value, or the typical distance from the center, and so forth. These single-number
summaries may be of interest in their own right, or they may be used in conjunction with density histograms to allow more objective comparisons.
Why are single-number summaries important? They provide immediate impressions of order of magnitude, and they allow simple comparisons. A current U.S. unemployment rate of 6.4% provides us with an immediate indication of the overall jobless situation, particularly when this number is compared with last month’s figure of 6.7%. We know that some areas of the country will have unemployment rates higher than 6.4% and some areas will have lower rates, but it is difficult to convey to the general public the nature of unemployment by publishing the entire collection of unemployment rates for, say, all the U.S. standard metropolitan areas. We need a summary measure of unemployment.
At one of the Ford Motor Company plants, it takes a total of 20.4 hours to build a new car. Do you believe that every vehicle takes exactly 20.4 hours to build? Of course not. Sometimes it takes more than 20.4 hours, sometimes it takes less. The number 20.4 is a “typical” figure. It is a useful way to summarize one aspect of productivity. It can be compared with the 19.5 hours it takes to build a vehicle at one of the Toyota plants in the United States.
Initially, we will concentrate on the following numerical measures of magnitude or location:
  Mean
  Median
  Percentiles
Later, we will consider numerical summaries of other features of data sets.
To clarify the ideas and to present effectively the associated calculations, it is convenient to use the symbols $x_1, x_2, \ldots, x_n$ to represent the $n$ measurements in the data set. We introduced this notation in Chapter 1. Now the $x_i$’s may be measurements of quantitative variables or numbers assigned to observed categories of qualitative variables. The subscripted notation allows a general discussion since we are not then anchored to a specific set of numbers.
CHAPTER 2 DESCRIBING PATTERNS IN DATA

The two most commonly used measures of center are the mean and the median. The sample mean was introduced in Chapter 1. Recall that the sample mean is the sum of the sample measurements divided by the sample size and is denoted by $\bar{x}$.

Sample mean: For $n$ measurements $x_1, x_2, \ldots, x_n$,
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

To understand how the sample mean indicates the center or middle, we present the following example.

EXAMPLE 2.8 Interpreting the Sample Mean
Two measures of productivity for the 10 most productive vehicle assembly operations in North America, according to a 1994 Harbour Report, are listed in Table 2.4. Construct a dot diagram for the total hours needed to build a vehicle, and indicate the sample mean on the diagram.

TABLE 2.4 Productivity in Auto Manufacturing

Plant                            Number of Workers   Total Number of Hours
                                 per Vehicle         to Build a Vehicle
Nissan truck Smyrna, Tenn.       2.20                17.6
Nissan car Smyrna, Tenn.         2.32                18.6
Toyota car Georgetown, Ky.       2.44                19.5
Ford car Kansas City, Mo.        2.48                19.8
Ford car Atlanta, Ga.            2.49                19.9
Nummi truck Fremont, Calif.      2.52                20.2
Ford car Chicago, Ill.           2.55                20.4
Ford truck Norfolk, Va.          2.70                21.6
Ford truck Louisville, Ky.       2.71                21.7
Chrysler car Belvidere, Ill.     2.72                21.8

SOURCE: San Diego Union-Tribune, June 24, 1994.

Solution and Discussion. The dot diagram, with the value of the sample mean, 20.11, indicated by a fulcrum, is shown in Figure 2.9.

Figure 2.9 Total Number of Hours to Build a Vehicle and the Location of the Sample Mean [dot diagram; horizontal axis Tothours, marked from 17 to 22; fulcrum at the sample mean]

If we imagine the horizontal axis of the dot diagram as a weightless bar and the dots representing the data as balls of equal size and weight, the mean is the point at which the bar balances. The sample mean is affected by extreme observations.
Imagine, for example, that the smallest total hours figure, 17.6, is decreased (moved to the left in the figure) while the other numbers remain the same. To maintain balance, the mean (fulcrum) must decrease (move to the left). If we change 17.6 to 13.3, for example, the sample mean becomes 19.68. Is the sample mean a good measure of center? It is, provided you interpret the center as the balancing point.
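The balancing-point interpretation can be checked numerically. The following Python sketch is only an illustration; the function name sample_mean is our own choice and does not come from the text. It computes the mean of the Table 2.4 hours, confirms that the deviations from the mean cancel out, and recomputes the mean after 17.6 is changed to 13.3.

```python
def sample_mean(values):
    """Sum of the measurements divided by the sample size."""
    return sum(values) / len(values)


# Total hours to build a vehicle, from Table 2.4.
hours = [17.6, 18.6, 19.5, 19.8, 19.9, 20.2, 20.4, 21.6, 21.7, 21.8]

mean_hours = sample_mean(hours)

# Balancing point: deviations to the left and right of the mean cancel,
# so their sum is zero apart from floating-point rounding.
balance = sum(x - mean_hours for x in hours)

# Move the smallest value from 17.6 to 13.3; the fulcrum moves left too.
shifted = [13.3] + hours[1:]
mean_shifted = sample_mean(shifted)

print(round(mean_hours, 2), round(mean_shifted, 2), abs(balance) < 1e-9)
```

Changing a single observation moves the mean by the size of the change divided by the sample size, which is why one extreme value matters less in large samples.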
For large samples, the sample mean is ordinarily not appreciably affected by a few extreme measurements. Summary measures that are not affected by extreme values are said to be resistant or robust.
One way to make the sample mean robust is not to include extreme values in its calculation. Suppose we order the observations from smallest to largest and then ignore, say, 5% of the measurements at each end. If we calculate the sample mean from the remaining observations, the result is called the 5% trimmed mean. Ignoring 10% of the observations at each end gives the 10% trimmed mean, and so forth.
A trimmed mean is the balancing point or center of gravity of the measurements from which it is calculated. In this sense, its interpretation is the same as that of the
sample mean. Computer programs will usually compute a trimmed mean along with the sample mean. Five percent is a typical amount of trimming.
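A trimmed mean can be sketched in a few lines of Python. The function name trimmed_mean is hypothetical, and the rule for how many observations to drop (the sample size times the trimming proportion, rounded to the nearest whole number) is an assumption on our part, since the text does not say how fractions are handled and software packages may differ.

```python
def trimmed_mean(values, proportion):
    """Mean of the observations left after dropping a proportion at each end.

    Assumption: the number dropped at each end is n * proportion rounded
    to the nearest whole number.
    """
    ordered = sorted(values)
    n = len(ordered)
    k = int(round(n * proportion))
    if 2 * k >= n:
        raise ValueError("trimming would remove every observation")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


# Total hours to build a vehicle, from Table 2.4.
hours = [17.6, 18.6, 19.5, 19.8, 19.9, 20.2, 20.4, 21.6, 21.7, 21.8]

# The same data with the smallest value changed from 17.6 to 13.3.
shifted = [13.3] + hours[1:]

trimmed_original = trimmed_mean(hours, 0.10)
trimmed_shifted = trimmed_mean(shifted, 0.10)

print(trimmed_original, trimmed_shifted)
```

With 10 observations and 10% trimming, one value is dropped at each end, so changing the smallest value leaves the trimmed mean unchanged while the ordinary mean moves.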
To obtain an even more robust summary statistic, arrange the data from smallest to largest. The sample median is the value that divides the data set in half; that is, 50% of the measurements are less than the median, and 50% are larger than the median.

Sample median: The value that divides the ordered data in half

If the number of measurements is odd, the median is the middle measurement. If the number of measurements is even, the median is defined to be the average of the two middle measurements, or the value halfway between them.
To calculate the sample median:
1. Arrange the observations in numerical order, from smallest to largest.
2. If the number of observations is odd, the sample median, $M$, is the middle observation, determined by counting $(n + 1)/2$ observations up from the smallest value in the ordered set.
3. If the number of observations is even, the sample median, $M$, is the average of the two middle observations in the ordered set.
Notice that the calculation of the median is not influenced by the values of the measurements at the ends of the ordered data set. Consequently, the sample median
is a robust measure of location. Moreover, the median corresponds to our intuitive notion of middle: the value that divides the ordered observations exactly in half.
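The three-step calculation can be written out directly. This is a minimal Python sketch; the function name sample_median and the nine-value subset used for the odd case are our own illustrative choices.

```python
def sample_median(values):
    """Sample median following the three steps in the text."""
    if not values:
        raise ValueError("at least one observation is required")
    ordered = sorted(values)          # Step 1: order the observations
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Step 2: odd n, take the observation in position (n + 1) / 2
        # (counting from 1), which is index n // 2 counting from 0.
        return ordered[middle]
    # Step 3: even n, average the two middle observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Total hours to build a vehicle, from Table 2.4 (n = 10, even).
hours = [17.6, 18.6, 19.5, 19.8, 19.9, 20.2, 20.4, 21.6, 21.7, 21.8]
median_even = sample_median(hours)

# Illustrative odd case: the first nine values only.
median_odd = sample_median(hours[:9])

# Making the largest value far more extreme does not move the median.
median_extreme = sample_median(hours[:9] + [100.0])

print(median_even, median_odd, median_extreme)
```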
The sample mean and sample median determined from the same data set will, in general, be different. This should not be surprising since they correspond to different notions of center. They measure the overall location of a data set in different ways. The sample mean is the most popular measure of location but, in cases where the mean and median are considerably different, both should be reported. A collection of incomes, for example,
  40,000   50,000   58,000   60,000   136,000
is best summarized by the sample median, $M$ = 58,000, since it will not be influenced by exceptionally large incomes. Large incomes tend to inflate the sample mean, in this case $\bar{x}$ = 68,800, and make it less useful as a measure of typical income.

EXAMPLE 2.9 Calculating the Sample Median for an Even Number of Observations
The total number of hours needed to build a vehicle are arranged from smallest to largest in Table 2.4 of Example 2.8. Calculate the sample median.
Solution and Discussion. There are $n = 10$ observations, so the sample median is the average of the two middle values, 19.9 and 20.2; that is, $M = (19.9 + 20.2)/2 = 20.05$. The median is in position $(n + 1)/2 = 11/2 = 5.5$, or halfway between the 5th and 6th largest observations.

EXAMPLE 2.10 Calculating the Sample Median for an Odd Number of Observations
A University Association Group Life Insurance Plan paid 31 death claims during a recent policy year. The claim amounts are given from smallest to largest in Table 2.5. Calculate the median.

TABLE 2.5 Death Claim Amounts for Group Life Insurance Plan

  1750    2800    3500    4025    4025    4375    4375    4375
  5775    5775    6125    6125    6125    6475    6825    6825
  6825    7350    7350    7350    7350    8050    9450   13125
 13125   26250   26250   54600   64750   89600   95550

Solution and Discussion. Since $n = 31$ is odd, the median is the middle observation given, in this case, by counting $(n + 1)/2 = 16$ observations from the smallest number. Thus, $M$ = 6,825. The median indicates a central value. However, if the total payments for claims are important, the total is (Number of claims) $\times$ (Mean claim) $= 31\bar{x}$, whereas $31M$ is not related to total payments.
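The last point of Example 2.10 can be verified with the Table 2.5 amounts. The short Python sketch below uses variable names of our own choosing; it shows that the number of claims times the mean claim recovers the total paid, while the number of claims times the median does not.

```python
# Death claim amounts from Table 2.5, smallest to largest.
claims = [
    1750, 2800, 3500, 4025, 4025, 4375, 4375, 4375,
    5775, 5775, 6125, 6125, 6125, 6475, 6825, 6825,
    6825, 7350, 7350, 7350, 7350, 8050, 9450, 13125,
    13125, 26250, 26250, 54600, 64750, 89600, 95550,
]

n = len(claims)

# n is odd, so the median is the observation in position (n + 1) / 2.
median_claim = sorted(claims)[(n + 1) // 2 - 1]

total_paid = sum(claims)
mean_claim = total_paid / n

# n times the mean reproduces the total; n times the median does not.
from_mean = n * mean_claim
from_median = n * median_claim

print(n, median_claim, total_paid, from_mean, from_median)
```

Because a few very large claims pull the mean well above the median, the median badly understates total payments when multiplied by the number of claims.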
Percentiles are numbers that divide the data into percentages. The sample median is the 50th percentile, because the sample median divides an ordered data set in half.

Sample 100$p$th percentile: The value in an ordered data set such that at least $100p$% of the data set is at or less than this value and at least $100(1 - p)$% of the data set is at or above this value

Setting $p = .25$, $p = .5$, and $p = .75$ generates the 25th, 50th, and 75th percentiles, respectively. These numbers, taken as a group, divide the data set into quarters and, not surprisingly, are known as the sample quartiles.
We adopt the convention of taking an observed value for the sample percentile except when two adjacent values satisfy the definition, in which case, their average is taken as the percentile. This procedure is consistent with the way we calculate the sample median.
To calculate the sample 100$p$th percentile:
1. Arrange the observations in numerical order, from smallest to largest.
2. Determine the product (Sample size) $\times$ (Proportion) $= np$.
3. If $np$ is not an integer, round it up to the next integer and find the observation in this position. This value is the percentile. If $np$ is an integer, say, $k$, calculate the average of the $k$th and $(k + 1)$st ordered values. This average is the percentile.
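The percentile rule above translates almost line for line into code. The following Python sketch is one possible rendering; the function name sample_percentile is hypothetical, and restricting $p$ to values strictly between 0 and 1 is our own assumption to keep the positions inside the data set.

```python
import math


def sample_percentile(values, p):
    """Sample 100p-th percentile using the three-step rule in the text."""
    if not values:
        raise ValueError("at least one observation is required")
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    ordered = sorted(values)                 # Step 1: order the observations
    n = len(ordered)
    product = n * p                          # Step 2: sample size times proportion
    k = round(product)
    if math.isclose(product, k) and 0 < k < n:
        # Step 3, integer case: average the kth and (k + 1)st ordered values.
        return (ordered[k - 1] + ordered[k]) / 2
    # Step 3, non-integer case: round up and take that position.
    position = min(math.ceil(product), n)
    return ordered[position - 1]


# Total hours to build a vehicle, from Table 2.4.
hours = [17.6, 18.6, 19.5, 19.8, 19.9, 20.2, 20.4, 21.6, 21.7, 21.8]

first_quartile = sample_percentile(hours, 0.25)
median = sample_percentile(hours, 0.50)
third_quartile = sample_percentile(hours, 0.75)

print(first_quartile, median, third_quartile)
```

The comparison with math.isclose, rather than an exact equality test, guards against products such as 10 times 0.3 that are integers on paper but not quite in floating-point arithmetic.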
Some statistical software packages use slight variations of our definition of percentiles. For large samples, they all tend to give essentially the same numbers.
The sample percentiles used most frequently are the median, and the first and third quartiles. The sample quartiles are summarized here in terms of the percentiles they represent. From these representations, you can see that the first and third quartiles are themselves medians. The first quartile, $Q_1$, is the median of the observations less than the sample median, and the third quartile, $Q_3$, is the median of the observations greater than the sample median.

Sample Quartiles
  $Q_1$ = First quartile = 25th percentile
  $Q_2$ = Second quartile (or median) = 50th percentile
  $Q_3$ = Third quartile = 75th percentile

EXAMPLE 2.11 Calculating Sample Quartiles
To illustrate the calculation of sample quartiles, we turn once more to the productivity data listed in Table 2.4 (see Example 2.8). From Table 2.4, the $n = 10$ total number of hours needed to build a vehicle are, in order,
  17.6   18.6   19.5   19.8   19.9   20.2   20.4   21.6   21.7   21.8
The sample median (or 50th percentile or second quartile) was calculated in Example 2.9. Recall that $M = 20.05$. Calculate the first and third quartiles.
Solution and Discussion. To calculate the first quartile, set $p = .25$. Then $np = 10(.25) = 2.5$. Since 2.5 is not an integer, round it up to the next integer, 3, and take the observation in the 3rd position as the required quartile. Thus, $Q_1 = 19.5$. Three of the 10 observations (at least 25%) are at or below 19.5, and 8 observations (at least 75%) are at or above 19.5, confirming that it is the first quartile.
Similarly, to get the third quartile, set $p = .75$ so that $np = 10(.75) = 7.5$. Round 7.5 up to the next integer, 8, and take the observation in the 8th position as the required quartile. Consequently, $Q_3 = 21.6$. Eight of the 10 observations (at least 75%) are at or below 21.6, and 3 observations (at least 25%) are at or above it.
The three quartiles, $Q_1 = 19.5$, $Q_2 = M = 20.05$, and $Q_3 = 21.6$, divide the data set into quarters.
If, in Example 2.11, the last number in the data set were 25.3 instead of 21.8, the quartiles would not change. Similarly, if the two smallest values were, for example, 16.9 and 18.8 instead of 17.6 and 18.6, respectively, the quartiles would not change. Percentiles in general, and quartiles in particular, are not heavily influenced by the particular values of the observations. Extreme values have no influence on percentiles located toward the center of the distribution. This is what we mean when we say that percentiles are robust measures of location.
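The two "what if" changes described above are easy to try. In this Python sketch the helper names sample_percentile and quartiles are our own, and the percentile function simply restates the chapter's rule so that the block runs by itself.

```python
import math


def sample_percentile(values, p):
    """Chapter's rule: round np up, or average two values when np is an integer."""
    ordered = sorted(values)
    n = len(ordered)
    product = n * p
    k = round(product)
    if math.isclose(product, k) and 0 < k < n:
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[min(math.ceil(product), n) - 1]


def quartiles(values):
    """Return the first quartile, median, and third quartile as a tuple."""
    return tuple(sample_percentile(values, p) for p in (0.25, 0.50, 0.75))


# Total hours to build a vehicle, from Table 2.4.
hours = [17.6, 18.6, 19.5, 19.8, 19.9, 20.2, 20.4, 21.6, 21.7, 21.8]

# Largest value changed from 21.8 to 25.3.
larger_max = hours[:-1] + [25.3]

# Two smallest values changed to 16.9 and 18.8.
changed_low = [16.9, 18.8] + hours[2:]

original_quartiles = quartiles(hours)
print(original_quartiles == quartiles(larger_max))
print(original_quartiles == quartiles(changed_low))

# The sample mean, by contrast, does respond to the larger maximum.
mean_original = sum(hours) / len(hours)
mean_larger_max = sum(larger_max) / len(larger_max)
print(round(mean_original, 2), round(mean_larger_max, 2))
```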
We have discussed measures of location in terms of the original set of observations. If the data are displayed as dot diagrams, stem-and-leaf diagrams, or density histograms,
measures of location can be indicated on the diagrams. We have already seen, for example, with the 800-meter data in Figure 2.3, that the statistical software identifies
the median class in its version of the stem-and-leaf diagram and prints the cumulative frequencies from each end of the data distribution. This allows easy identification of
the sample quartiles.
The sample mean always retains its interpretation as the balancing point. Therefore, its location on the variable axis of a dot diagram and, to a good approximation, a
density histogram, is the point at which a fulcrum would just balance the configuration of points or pattern of vertical bars.
Because the sample mean is not a robust measure of location, it will typically be larger than the median for a histogram with a long right-hand tail, and less than the
median for a histogram with a long left-hand tail. The two measures of location will almost coincide for nearly symmetric histograms, because the balancing point and the
value dividing the distribution in half are the same (see Exercise 2.22).
MEASURES OF VARIATION

We talked about measuring variability in Chapter 1, where we introduced the sample variance and sample standard deviation. Here, we will not only review these measures but also introduce the sample range and sample interquartile range as additional measures of variability.
The sample variance, $s^2$, and sample standard deviation, $s$, can be useful single-number summaries of variability. This is particularly true for relatively large, mound-shaped data sets.

Sample variance: $s^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Sample standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

The sample variance is essentially an average squared distance from the mean; consequently, its value can be heavily influenced by observations far from the middle.
Since the standard deviation is closely connected to the variance, its value can also be heavily influenced by observations far from the middle. The sample variance and
sample standard deviation are not robust measures of variability.
The number $n - 1$ in the definition of the sample variance or sample standard deviation is called the degrees of freedom because it represents the number of deviations from the mean that are “free to vary.” Let’s see what this means. In Chapter 1, we showed that the sum of the deviations $x_i - \bar{x}$ is always 0. Consequently, the final deviation can be determined once we know the other $n - 1$ deviations. For example, given the $n = 4$ numbers
  2   3   4   $-1$
you may verify that $\bar{x} = 2$, and the first three deviations from the mean are $2 - 2 = 0$, $3 - 2 = 1$, and $4 - 2 = 2$. Since the sum of all deviations must be zero, the last deviation must be $-1 - 2 = -3$. Only $3 = n - 1$ of the deviations are free to vary.
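The four-number illustration can be reproduced in a few lines. This Python sketch uses function names of our own choosing (sample_variance and sample_std); it computes the deviations, shows that the last one is fixed by the other three, and evaluates the variance with the $n - 1$ divisor.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


numbers = [2, 3, 4, -1]
mean = sum(numbers) / len(numbers)

# Deviations from the mean; they always sum to zero.
deviations = [x - mean for x in numbers]

# Knowing the first n - 1 deviations determines the last one.
last_from_others = -sum(deviations[:-1])

variance = sample_variance(numbers)
std = sample_std(numbers)

print(deviations, last_from_others, variance, std)
```

Here the squared deviations are 0, 1, 4, and 9, so the variance is 14 divided by 3 rather than 14 divided by 4.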
The sample range is simply the difference between the largest and smallest observations. It is the length of the interval that just contains all the data.

Sample range: Largest observation $-$ Smallest observation

The range is very easy to calculate and interpret. However, by definition, the range is extremely sensitive to the existence of even a single very large or very small value in the data set. It is also not a robust measure of variability.
The sample interquartile range is a measure of variability based on the first and third quartiles.

Sample interquartile range: IR = Third quartile $-$ First quartile $= Q_3 - Q_1$

The interquartile range is the length of the interval just containing the middle 50% of the observations. This interval is not centered on the median unless the data distribution is symmetric. Because the interquartile range depends only on the first and third quartile numbers, its value is not affected by a few extreme measurements at each end of the distribution. The sample interquartile range is a robust measure of spread. It is often used to measure spread when the median is used to measure middle.

The center and the extent of spread in a data set are key pattern features. For nearly symmetric and mound-shaped data distributions, the mean and standard deviation are worthwhile measures of location and spread. Their usefulness is enhanced by the empirical rule.
The empirical rule provides intervals that contain certain proportions of the data when we know only the values of $\bar{x}$ and $s$. This rule works best with large data sets (for example, more than 30 numbers) that tend to have a mound of values around the mean and fewer values far from the mean in each direction. It gives the approximate proportion of values within 1, 2, and 3 standard deviations of the mean.

Empirical Rule
Approximately
  68% of the data lie within $\bar{x} \pm s$
  95% of the data lie within $\bar{x} \pm 2s$
  99.7% of the data lie within $\bar{x} \pm 3s$

With only the two values $\bar{x}$ and $s$, the empirical rule allows us to create an expanding set of intervals that contain increasing proportions of the data set.
When data distributions are skewed to the right or to the left, no single measure of spread is entirely satisfactory because the nature of the variability on one side of the center is different from that on the other side. If a single measure of spread is required, it may be best to use a number that is robust to extreme values.

EXAMPLE 2.12 Summarizing Variation with the Empirical Rule, the Range, and the Interquartile Range
Table 2.6 gives the cost per kilowatt (KW) capacity in place for a particular year for 22 U.S. public utility companies. Obtain the range, the interquartile range, and the intervals $\bar{x} \pm s$ and $\bar{x} \pm 2s$. Compare the proportions of the observations in the latter intervals with the proportions suggested by the empirical rule.

TABLE 2.6 Cost per KW Capacity in Place for U.S. Public Utilities

  151   104    96   168   168   174   136   178   175   202   148
  164   197   192   111   150   199   245   113   204   252   173
Solution and Discussion. The Minitab printout follows:

                 N      MEAN    MEDIAN    TRMEAN     STDEV
  CostprKW      22    168.18    170.50    167.60     41.19

               MIN       MAX        Q1        Q3
  CostprKW   96.00    252.00    145.00    197.50

From the printout, Min = 96, Max = 252, $Q_1 = 145$, $M = 170.5$, and $Q_3 = 197.5$. We can easily calculate
  Range = Max $-$ Min $= 252 - 96 = 156$
and
  IR $= Q_3 - Q_1 = 197.5 - 145 = 52.5$
All of the observations are within 156 units of one another. The middle 50% of the costs per KW capacity in place are contained in an interval of length 52.5.
The empirical rule suggests that about 68% of the costs should fall in the interval
  $\bar{x} \pm s = 168.18 \pm 41.19$, or (126.99, 209.37)
Similarly, about 95% of the observations should fall in the interval
  $\bar{x} \pm 2s = 168.18 \pm 82.38$, or (85.80, 250.56)
In fact, $100(16/22)$% = 73% of the costs are included in the first interval, and $100(21/22)$% = 95% of the costs are contained in the second interval. For this relatively small data set, the empirical rule gives a fairly accurate picture of the distribution of cost per KW capacity in place.

BOXPLOTS

Together with the smallest and largest observations, the quartiles $Q_1$, $Q_2$, $Q_3$ provide a fairly comprehensive five-number summary of a distribution of measurements. Let Min and Max represent the smallest and largest observations in the data set, respectively. The five-number summary
  Min   $Q_1$   $M$   $Q_3$   Max
is represented pictorially as a boxplot.
Figure 2.10 is a horizontal boxplot of the cost per KW capacity in place data from Example 2.12.

Figure 2.10 Boxplot of Costs per KW Capacity in Place for 22 U.S. Public Utilities [horizontal boxplot; axis CostprKW, marked at 120, 180, and 240; Min, $Q_1$, $M$, $Q_3$, and Max labeled]
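The summaries in Example 2.12 and the five-number summary can be recomputed from the Table 2.6 costs. The Python sketch below is illustrative only, and all function and variable names are our own. It applies this chapter's percentile rule, so its quartiles (and therefore its interquartile range) need not match the Minitab values of 145.00 and 197.50, which reflect that package's own quartile convention.

```python
import math

# Cost per KW capacity in place, from Table 2.6.
costs = [151, 104, 96, 168, 168, 174, 136, 178, 175, 202, 148,
         164, 197, 192, 111, 150, 199, 245, 113, 204, 252, 173]


def sample_percentile(values, p):
    """Chapter's rule: round np up, or average two values when np is an integer."""
    ordered = sorted(values)
    n = len(ordered)
    product = n * p
    k = round(product)
    if math.isclose(product, k) and 0 < k < n:
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[min(math.ceil(product), n) - 1]


def count_within(values, center, half_width):
    """Number of observations in the interval center +/- half_width."""
    return sum(1 for x in values if center - half_width <= x <= center + half_width)


n = len(costs)
mean = sum(costs) / n
std = math.sqrt(sum((x - mean) ** 2 for x in costs) / (n - 1))

# Empirical rule check: counts inside one and two standard deviations.
within_1s = count_within(costs, mean, std)
within_2s = count_within(costs, mean, 2 * std)

# Five-number summary: Min, Q1, M, Q3, Max.
five_number = (
    min(costs),
    sample_percentile(costs, 0.25),
    sample_percentile(costs, 0.50),
    sample_percentile(costs, 0.75),
    max(costs),
)
sample_range = five_number[4] - five_number[0]
interquartile_range = five_number[3] - five_number[1]

print(round(mean, 2), round(std, 2))
print(within_1s / n, within_2s / n)
print(five_number, sample_range, interquartile_range)
```

The mean, standard deviation, median, range, and the two empirical-rule proportions agree with the printout and the discussion above; only the quartiles differ, which illustrates the earlier remark that software packages use slight variations of the percentile definition.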
There are five vertical lines in the boxplot: the lines forming the ends of the box (rectangle), the line within the box, and the small vertical lines at the ends of the horizontal lines (whiskers) that extend in opposite directions from the box. These vertical lines correspond to the five summary numbers. Reading from the scale beneath the figure, we see that the vertical line within the box identifies the median. The ends of the box correspond to the 1st and 3rd quartiles, and the lines at the ends of the whiskers denote the minimum (Min) and maximum (Max) values. The length of the box is the interquartile range, and the distance between the Min and Max is the overall range. The median line is nearly in the middle of the box and the whiskers are nearly of the same length, so this data distribution is very nearly symmetric.
Boxplots are not as informative as stem-and-leaf plots or density histograms because they do not show the patterns of the data within the quartile boundaries. They are, however, useful for assessing symmetry or asymmetry and for comparing distributions. Figure 2.11 shows side-by-side boxplots of average Graduate Record Examination (GRE) verbal scores for students admitted to graduate study in departments classified according to the general categories displayed. The departmental averages are based on data for students who took the GRE over a five-year period.

Figure 2.11 Boxplots of Departmental Means for GRE Verbal Scores [side-by-side boxplots for All departments, Natural sciences, Engineering, Social sciences, Humanities and arts, and Education; axis GRE verbal scores, marked from 200 to 800]
SOURCE: Schneider, L. M., and Briel, J. B. Validity of the GRE: 1988–89 Summary Report. Princeton, N.J.: Educational Testing Service, Sept. 1990.

The center, spread, and range of the distributions of average scores are immediately apparent. We see, for example, that the average GRE verbal scores for Engineering departments are tightly concentrated about a median average score of about 540. The highest median of average verbal score occurs for students admitted to departments in the Humanities and Arts. The interquartile range is about the same for all the categories with the exception of Engineering, where it is smaller. Finally, although there are some differences in overall spread as measured by the range, the median scores do not vary a great deal. Boxplots for departmental averages of GRE quantitative scores are considered in Exercise 2.29.

MODIFIED BOXPLOTS

The whiskers in boxplots ordinarily extend to the smallest and largest observations. However, if some of the observations are significantly smaller or larger than the rest (potential outliers), they are not evident using this procedure. Boxplots can be modified to reveal potential outliers by extending the whiskers to the smallest and largest observations only if these points are sufficiently close to the rest of the data. If they are not, those observations far removed from the majority of cases are plotted as individual points. A common measure of closeness is $1.5 \times \text{IR}$. A modified boxplot is constructed by extending the whiskers to the smallest and largest observations only if these values are within $1.5 \times \text{IR}$ of the first and third quartiles, respectively. Otherwise, the whiskers are extended to the most extreme values still contained in these limits and the remaining observations are plotted individually.
Modified boxplots work best for a moderate number of observations. If the number of observations is too large, an inordinate number of outliers may be identified.

EXAMPLE 2.13 Constructing a Modified Boxplot
The actual costs (ActualCo), in millions of dollars, of 26 construction projects at a large industrial facility are given in Table 2.7. Construct a modified boxplot for these data.

TABLE 2.7 Actual Construction Costs

    .918    7.214   14.577   30.028   38.173   15.320   14.837
  51.284   34.100    2.003   20.099    4.324   10.523   13.371
   1.553    4.069   27.973    7.642    3.692   29.522
  15.317    5.292     .707    1.246    1.143   21.571

SOURCE: Schmoyer, R. L. “Asymptotically Valid Prediction Intervals for Linear Models.” Technometrics, Vol. 34, Nov. 1992, pp. 399–408.

Solution and Discussion. The modified boxplot is shown in Figure 2.12 and indicates one potential outlier. For the construction costs, you may verify that Min = .707, $Q_1 = 3.692$, $M = 11.947$, $Q_3 = 21.571$, and Max = 51.284. Consequently, $\text{IR} = 21.571 - 3.692 = 17.879$ and $1.5 \times \text{IR} = 1.5(17.879) = 26.819$. The smallest observation in the data set, Min = .707, is well within 26.819 of $Q_1 = 3.692$; therefore, the left-hand whisker extends to this smallest value. The number $Q_3 + 1.5 \times \text{IR} = 21.571 + 26.819 = 48.390$ is greater than all the construction costs except Max = 51.284. Thus the right-hand whisker extends to the largest number in the data set less than or equal to 48.390 (here 38.173), and the remaining case, 51.284, is plotted individually.

Figure 2.12 Modified Boxplot for Construction Costs [horizontal boxplot; axis ActualCo, marked at 15, 30, 45, and 60]

The distribution of construction costs is skewed to the right and the extreme value, 51.284, will have a significant influence on the calculation of, for example, the sample mean. In this example, there is nothing wrong with the number 51.284, but it is highlighted as a project whose construction cost is considerably higher than that of the other projects.
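The whisker and outlier calculations in Example 2.13 can be sketched as follows. The Python function names (sample_percentile, modified_boxplot_stats) and the dictionary keys are hypothetical; the quartiles follow this chapter's percentile rule, and the limits are the quartiles extended by 1.5 times the interquartile range.

```python
import math

# Actual construction costs (millions of dollars), from Table 2.7.
actual_cost = [
    0.918, 7.214, 14.577, 30.028, 38.173, 15.320, 14.837,
    51.284, 34.100, 2.003, 20.099, 4.324, 10.523, 13.371,
    1.553, 4.069, 27.973, 7.642, 3.692, 29.522,
    15.317, 5.292, 0.707, 1.246, 1.143, 21.571,
]


def sample_percentile(values, p):
    """Chapter's rule: round np up, or average two values when np is an integer."""
    ordered = sorted(values)
    n = len(ordered)
    product = n * p
    k = round(product)
    if math.isclose(product, k) and 0 < k < n:
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[min(math.ceil(product), n) - 1]


def modified_boxplot_stats(values):
    """Quartiles, whisker ends, and potential outliers for a modified boxplot."""
    q1 = sample_percentile(values, 0.25)
    median = sample_percentile(values, 0.50)
    q3 = sample_percentile(values, 0.75)
    step = 1.5 * (q3 - q1)                    # 1.5 times the interquartile range
    lower_limit, upper_limit = q1 - step, q3 + step
    inside = [x for x in values if lower_limit <= x <= upper_limit]
    outliers = sorted(x for x in values if x < lower_limit or x > upper_limit)
    return {
        "q1": q1, "median": median, "q3": q3,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }


stats = modified_boxplot_stats(actual_cost)
for name, value in stats.items():
    print(name, value)
```

The whiskers end at the most extreme observations inside the limits, not at the limits themselves, which is why the right-hand whisker stops at 38.173 rather than at 48.390.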