9.5 Outliers and Diagnostics
Before accepting a model as a valid representation of the relationships among the specified variables, some conditions and assumptions should first be examined. To begin, consider the concept of an outlier (Section 5.3.1, p. 106), a value of a variable far from most of the data values, and its effect on the estimated model coefficients, b_0 and b_1. The outlier of interest here is a bivariate outlier, an outlier with respect to the distribution of paired data values, defined with respect to both variables. The bivariate outlier lies outside the patterning of the points in the two-dimensional scatter plot. As discussed in the previous chapter, this patterning is an ellipse for two normally distributed variables. For the 37 employees in the Employee data set, estimate the following regression model to explain Salary in terms of Years of employment.
model estimated from data: Ŷ_Salary = 32710.90 + 3249.55 X_Years
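In lessR, this model comes from the Regression function. A minimal sketch, assuming the Employee data set loads with the same Read call used for the other data sets in the worked problems below:

> mydata <- Read("Employee", format="lessR")
> Regression(Salary ~ Years)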
Now consider changing the value of Salary for just one person, Trevon Correll, the person who has worked the longest at the company, 21 years, and has the highest Salary, $124,419.23. What is the impact on the estimated coefficients if that Salary is changed to $40,000? With the fix function (Section 3.2, p. 54), that one data value was changed and the regression analysis re-run. The result is the following model.
model estimated from outlier data: Ŷ_Salary = 4537.62 + 2394.64 X_Years
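The interactive fix editor opens a spreadsheet-style view of the data, but the same one-value change can also be scripted. A sketch, assuming the row name "Correll, Trevon" as it appears in Listing 9.3:

# change the one Salary value, then re-estimate the model
> mydata["Correll, Trevon", "Salary"] <- 40000
> Regression(Salary ~ Years)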
The resulting scatter plot with both the original and new regression lines appears in Figure 9.8. The decrease in the estimated slope coefficient is $854.91. The shift in one data value decreased the impact of each additional Year on Salary from an average of $3250 down to $2395.
[Figure: scatter plot with Salary on the vertical axis, showing the Original Line and the Line with Outlier.]
Figure 9.8 Regression line with the outlier compared to the original regression line.
As discussed, the least-squares estimation procedure for the regression coefficients minimizes the sum of the squared residuals. The problem with outliers is that the minimization applies to the squared distance between the actual and fitted values of the response variable, so the large residual of an outlier exerts a disproportionate influence on the estimated model. For this reason an estimated regression model may not well reflect an underlying process if an outlier generated by a different process is included in the analysis. For example, if an employee actually did work at the company for 21 years and still made little money, perhaps the employee is the only part-time employee in the analysis. The process that determines Salary for part-time employees is quite different from the process for full-time employees, so the two groups should be analyzed separately.
9.5.1 Influence Statistics
Some data values for the predictor and response variables have more impact on the estimation of the resulting model than do other data values. These data values may also be outliers. A way to assess these differential impacts is to identify those data points with the most influence, the effect of a specific set of values of the predictor variables and the response variable on the estimated model. A data point with disproportionate influence, an influential data point, should always be inspected, and its role in the analysis assessed. Sometimes an influential data point is just a data coding or entry error, yet one that would have changed the entire analysis had it remained. Sometimes a different process generates the influential data point than the process that generates the remaining data values; in such a situation the data point should be analyzed separately. And sometimes an influential data point may just represent an unusual event, analogous to flipping a fair coin 10 times and getting 8 heads.
Several different indices of influence are available for diagnostic checking (Belsley, Kuh, & Welsch, 1980). An influential data point is one with considerably more influence than most other data points. A large residual suggests the possibility of outliers and influential data points, but there are several difficulties with this use. One problem is how to define "large". A way to address this problem is to standardize the residual so that, presuming normality, values larger than 2 or 3 likely indicate an outlier and perhaps an influential data point. Standardization of a residual in this situation, however, is not quite so straightforward because its standard deviation depends on the value of the predictor variable. The standardization that makes this adjustment for the value of the predictor is the Studentized residual.
Another issue to consider is that the estimated regression coefficients minimize the sum of squares of these residuals. If a data point is influential, then by definition a greater adjustment than usual is made to the regression estimates to achieve this minimization. To remedy this problem, a more useful set of diagnostics are the case-deletion statistics, statistics for a data point calculated without that point in the analysis. To avoid the confounding of the residual with the adjustment to the data point, delete the data point and then re-compute the regression. Then calculate the residual of the deleted data point from this new model. Fortunately, formulas exist to make these adjustments without an actual re-computation. The result is an influence statistic based on a residual for a data point that does not contribute to the estimation of the model.
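To make the case-deletion idea concrete, the deleted residual for one data point can be computed directly with base R, although the built-in formulas make this explicit refit unnecessary in practice. A sketch, assuming the original, unmodified Employee data loaded as mydata:

# refit the model without the data point for Trevon Correll
fit.deleted <- lm(Salary ~ Years,
                  data=mydata[row.names(mydata) != "Correll, Trevon", ])

# residual of the deleted data point, computed from the new model
mydata["Correll, Trevon", "Salary"] -
  predict(fit.deleted, newdata=mydata["Correll, Trevon", ])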
The version of the Studentized residual adjusted for case-deletion is the externally Studentized residual, also called R-Student: a standardized residual calculated from the model estimated with the corresponding data point deleted from the data. This version of the residual has the additional advantage of following the t-distribution with n − k − 1 degrees of freedom, where n is the sample size and k is the number of predictor variables in the model. The t-distribution provides a standard for evaluating the size of R-Student. Except in very small samples, regardless of the original scale of measurement of the response variable Y, values of R-Student larger than 2 or smaller than −2 should be infrequent, and values larger than 2.5 or 3 or smaller than −2.5 or −3 should be rare.

Other case-deletion statistics directly assess the influence of a data point on the estimated regression model. One index, Cook's Distance or D, summarizes the overall influence of a data point on all the estimated regression coefficients: the distance between the regression coefficients calculated with a specific data point included and then deleted. Data points with larger D values than the rest of the data are those that have unusual influence. Fox and Weisberg (1991, p. 34) suggest as a cut-off for detecting influential cases values of D greater than 4/(n − k − 1), where n is the number of cases, rows of data, and k is the number of predictor variables.

Perhaps the most useful interpretation of Cook's Distance follows from a comparison of the relative sizes of its values. When one or a few data points result in a large D value, both in terms of the overall magnitude and also relative to the remaining values, then an influential case has been identified. These larger values of Cook's Distance or other influence statistics are more likely in smaller data sets.
Another direct index of the influence of a data point is its impact on the fitted value. DFFITS represents the number of standard errors that the fitted value for a data point shifts when that point is not present in the sample of data used to estimate the model, the scaled change in the fitted value when the point is deleted. Large values of DFFITS indicate influential data points. A general cutoff to consider is 2, or a recommended size-adjusted cutoff is 2√((k + 1)/n). Perhaps a more useful approach, however, is to isolate those data points with large DFFITS values relative to most of the other data points and then try to understand why these data points are so influential.
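The same three indices are available from a fitted base R lm model, which provides one way to verify the values that Regression reports. A sketch, again assuming the Employee data loaded as mydata:

fit <- lm(Salary ~ Years, data=mydata)
n <- nrow(mydata)   # number of cases, here 37
k <- 1              # number of predictor variables

r.stu <- rstudent(fit)         # externally Studentized residuals
d.fit <- dffits(fit)           # scaled change in each fitted value
cook  <- cooks.distance(fit)   # overall influence on the coefficients

# flag cases against the cutoffs discussed above
flag <- abs(r.stu) > 2 | abs(d.fit) > 2*sqrt((k+1)/n) | cook > 4/(n-k-1)
round(cbind(r.stu, d.fit, cook)[flag, , drop=FALSE], 3)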
The Regression function presents these three case-deletion influence indices, labeled rstudent, dffits, and cooks. Only those data points with relatively large values on these indices are of interest, so to conserve space these indices by default are listed only for the 20 data points with the largest value of Cook's Distance. Listed for each such row of data are the row name, the data values, the fitted value, the residual, and then the three indices.
The rows are by default sorted by Cook's Distance, as shown in Listing 9.3. The case with the largest Cook's Distance is for Trevon Correll, who makes the highest Salary. The value of Cook's Distance, 0.409, is more than twice as high as the next highest value of 0.204. This case also has the highest R-Student value of 2.330 as well as the largest dffits value of 0.961. The value of the residual for Trevon Correll's Salary is $23,467.73, which is the extent to which the Salary is larger than the value fitted by the model. Further examination of this situation beyond the regression analysis may account for this considerably larger Salary than is accounted for by the model.
                      rstudent   dffits   cooks
Correll, Trevon          2.330    0.961   0.409
Capelle, Adam           -1.233   -0.643   0.204
James, Leslie            2.022    0.645   0.191
Korhalkar, Jessica       2.208    0.630   0.178
Hoang, Binh              1.799    0.435   0.089
Billing, Susan           1.535    0.364   0.064
Singh, Niral             1.066    0.304   0.046
Skrotzki, Sara          -0.890   -0.284   0.041
Cassinelli, Anastis     -1.579   -0.268   0.035
Kralik, Laura              ...      ...     ...
Listing 9.3 Residuals and influence indices sorted by Cook’s Distance.
The default settings that control the display of the rows of data and other values can be modified. The res.rows option specifies the number of rows of data displayed for the residuals analysis. It can change the default of 20 rows to any value up to the number of rows of data, specified by the value "all". To turn this option off, specify a value of 0. The res.sort option sets the sort criterion for the residuals analysis. It can change the default value of "cooks" to "rstudent" for R-Student, "dffits" for the dffits index, or "off" to leave the rows of data in their original order.
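For example, to display the residuals analysis for all rows of data, sorted by R-Student instead of Cook's Distance:

> Regression(Salary ~ Years, res.rows="all", res.sort="rstudent")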
9.5.2 Assumptions
As with any statistical procedure, the validity of the analysis requires satisfying the underlying assumptions. The assumptions focus on the properties of the residuals, which ideally only reflect random error. Any systematic content of the residual variable violates one or more of the assumptions. If so, the model is too simple, so explicitly revise the model to account for this systematic information instead of relegating it to the error term. Often this correction includes adding one or more predictor variables, accounting for a nonlinear relationship, or using an estimation procedure other than least-squares.
The least-squares estimation procedure requires the following three assumptions.

- The average residual value should be zero for each value of the predictor variable.
- The standard deviation of the residuals should be the same for each value of the predictor variable.
- For data values collected over time, the residuals at one time value should not correlate with the corresponding residuals at other time values.
A detailed analysis of the evaluation of the assumptions of regression analysis is well beyond the scope of this book. Fortunately, the first two assumptions can be at least informally evaluated by examining a scatter plot of the residuals with the fitted values. Figure 9.9 is the Regression scatter plot for these variables.
[Figure: residuals plotted against the Fitted Values; the highlighted point, Correll, Trevon, has the largest Cook's Distance, 0.41.]
Figure 9.9 Scatter plot of the residuals against the fitted values, with the data point with the largest Cook's Distance highlighted.
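The same diagnostic plot follows from the base R fit constructed earlier, a quick check when the full Regression output is not needed:

# residuals against fitted values, with a dotted reference line at zero
plot(fitted(fit), residuals(fit), xlab="Fitted Values", ylab="Residuals")
abline(h=0, lty="dotted")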
For each value of Ŷ_i, that is, for each vertical line drawn through the scatter plot, the residuals should be approximately evenly distributed in the positive and negative regions. To facilitate this comparison the graph contains a dotted horizontal line drawn through the origin. If the residuals for individual values of Ŷ_i are not evenly balanced about the horizontal zero line, the relationship between the response and predictor variables is likely not linear as specified.
The second assumption of least-squares regression is a constant population standard deviation of the estimation errors at all values of X, the equal variances assumption. The value of Y should be no more or less difficult to predict for different values of X. Any difference in the standard deviation of residuals for different values of X should be attributable only to sampling error. That is, the variability of the values of Y around each value of X should be the same. The violation of this equal variances assumption is heteroscedasticity, in which the standard deviation of the residuals differs depending on the value of the predictor variable. Often the pattern exhibited by heteroscedasticity is a gradually increasing or decreasing variability as X gets larger or smaller. When heteroscedasticity occurs, the corresponding standard errors of the regression coefficients and associated confidence intervals are also incorrect.

The third assumption of least-squares estimation is uncorrelated residuals with any other variable, including each other. The correlation of successive residuals usually occurs over time and so typically applies to the analysis of time-oriented data, where this assumption is commonly violated. For example, sales of swimwear peak in Spring and Summer and decrease in Fall and Winter. The residuals around a regression line over time would reflect this seasonality, systematically decreasing and increasing depending on the time of year. Analysis of time-oriented data typically requires more sophisticated procedures than simply fitting a regression line to the data.
A fourth assumption of regression is that the estimation errors are normally distributed for each value of X. This assumption is not needed for the estimation procedure, but is required for the hypothesis tests and confidence intervals previously described. To facilitate this evaluation Regression provides a density plot and histogram of the residuals, which appears in Figure 9.10. Both the general density curve and the curve that presumes normality are plotted over the histogram. The residuals appear to be at least approximately normal, satisfying the assumption.
Figure 9.10 Distribution of the residuals.
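A comparable check in base R overlays the normal curve and a general density estimate on a histogram of the residuals, mirroring Figure 9.10:

res <- residuals(fit)
hist(res, freq=FALSE, xlab="Residuals", main="")
curve(dnorm(x, mean(res), sd(res)), add=TRUE)   # curve that presumes normality
lines(density(res), lty="dashed")               # general density curve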
Worked Problems
1 Consider the BodyMeas data set. Enter ?dataBodyMeas for more information.

> mydata <- Read("BodyMeas", format="lessR")
(a) Predict Weight from Height. Specify the estimated model. Is the slope coefficient significant? Interpret.
(b) Identify the obvious outlier. What data value most contributes to the status of this case as an outlier?
(c) With the Subset function drop this case from the data table.
(d) Re-estimate the model. Is the model reasonably similar or qualitatively different from the model estimated with the outlier?

2 Separate data tables for men and women.
(a) With Subset create a data table with just women.
(b) Estimate a regression model of Weight from Height for just the women.
(c) With Subset create a data table with just men.
(d) Estimate a regression model of Weight from Height for just the men.
(e) Compare the models. (Note that the more formal way to provide this comparison is with the technique of indicator variables discussed in the next chapter.)
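As a starting point for problem 2, base R's subset function performs the same row selection as lessR's Subset; the Gender coding of "F" and "M" here is an assumption about the data:

women <- subset(mydata, Gender == "F")      # assumes Gender coded "F"/"M"
men   <- subset(mydata, Gender == "M")
summary(lm(Weight ~ Height, data=women))    # model for the women
summary(lm(Weight ~ Height, data=men))      # model for the men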
3 The Cars93 data set contains much information on 93 1993 car models. Enter ?dataCars93 for more information.

> mydata <- Read("Cars93", format="lessR")
(a) Build a model to predict MPGhiway from the Weight of the car.
(b) Specify the estimated model and interpret the slope coefficient.
(c) Are there outliers?
(d) What is the prediction interval for MPGhiway for a car that weighs 2222 lbs?
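As a hint for part (d), base R's predict function returns a prediction interval at a new value of Weight from a fitted lm model:

fit <- lm(MPGhiway ~ Weight, data=mydata)
predict(fit, newdata=data.frame(Weight=2222), interval="prediction")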
CHAPTER 10
REGRESSION II