12.10 Study of Residuals and Violation of Assumptions (Model Checking)

It was suggested earlier in this chapter that the residuals, or errors in the regression fit, often carry information that is very useful to the data analyst. The residuals

$$e_i = y_i - \hat{y}_i, \qquad i = 1, 2, \ldots, n,$$

which are the numerical counterparts of the model errors $\epsilon_i$, often shed light on the possible violation of assumptions or the presence of "suspect" data points. Suppose that we let the vector $\mathbf{x}_i$ denote the values of the regressor variables corresponding to the $i$th data point, supplemented by a 1 in the initial position. That is,

$$\mathbf{x}_i' = [1, x_{1i}, x_{2i}, \ldots, x_{ki}].$$

Consider the quantity

$$h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i, \qquad i = 1, 2, \ldots, n.$$

The reader should recognize that $h_{ii}$ was used in the computation of the confidence intervals on the mean response in Section 12.5. Apart from $\sigma^2$, $h_{ii}$ represents the variance of the fitted value $\hat{y}_i$. The $h_{ii}$ values are the diagonal elements of the HAT matrix

$$\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}',$$

which plays an important role in any study of residuals and in other modern aspects of regression analysis (see Myers, 1990, listed in the Bibliography). The term HAT matrix derives from the fact that $\mathbf{H}$ generates the "$y$-hats," or fitted values, when multiplied by the vector $\mathbf{y}$ of observed responses. That is, $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$, and thus

$$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y},$$

where $\hat{\mathbf{y}}$ is the vector whose $i$th element is $\hat{y}_i$. If we make the usual assumptions that the $\epsilon_i$ are independent and normally distributed with mean 0 and variance $\sigma^2$, the statistical properties of the residuals are readily characterized. Then

$$E(e_i) = E(y_i - \hat{y}_i) = 0 \quad \text{and} \quad \sigma^2_{e_i} = (1 - h_{ii})\sigma^2, \qquad i = 1, 2, \ldots, n.$$

(See Myers, 1990, for details.) It can be shown that the HAT diagonal values are bounded according to the inequality

$$\frac{1}{n} \le h_{ii} \le 1.$$

In addition,

$$\sum_{i=1}^{n} h_{ii} = k + 1,$$

the number of regression parameters. As a result, any data point whose HAT diagonal element is large, that is, well above the average value of $(k+1)/n$, is in a position in the data set where the variance of $\hat{y}_i$ is relatively large and the variance of a residual is relatively small. Consequently, the data analyst can gain some insight into how large a residual may become before its deviation from zero can be attributed to something other than mere chance. Many commercial regression computer packages produce the set of studentized residuals.
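These HAT-matrix facts can be verified numerically on simulated data (the design matrix below is synthetic; only the relations $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$, $1/n \le h_{ii} \le 1$, and $\sum_i h_{ii} = k+1$ come from the text):

```python
import numpy as np

# Illustrative data: n = 17 points, k = 2 regressors, as in Case Study 12.1.
rng = np.random.default_rng(0)
n, k = 17, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # 1 in the initial position
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # HAT matrix H = X(X'X)^{-1}X'
y_hat = H @ y                             # H "generates the y-hats"
b = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(y_hat, X @ b)          # same fitted values as y-hat = Xb

h = np.diag(H)                            # h_ii = x_i'(X'X)^{-1}x_i
assert np.all(h >= 1 / n - 1e-12) and np.all(h <= 1 + 1e-12)  # 1/n <= h_ii <= 1
assert np.isclose(h.sum(), k + 1)         # sum of HAT diagonals = k + 1

# A common screen for high-leverage points: h_ii well above the average (k+1)/n.
large = h > 2 * (k + 1) / n
```

Because $\mathbf{H}$ is a projection matrix, it is symmetric and idempotent, which is what forces the diagonal bounds above.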

Studentized Residual

$$r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}, \qquad i = 1, 2, \ldots, n.$$

Here each residual has been divided by an estimate of its standard deviation, creating a $t$-like statistic designed to give the analyst a scale-free quantity providing information regarding the size of the residual. In addition, standard computer packages often provide values of another set of studentized-type residuals, called the R-Student values.

R-Student Residual

$$t_i = \frac{e_i}{s_{-i}\sqrt{1 - h_{ii}}}, \qquad i = 1, 2, \ldots, n,$$

where $s_{-i}$ is an estimate of the error standard deviation, calculated with the $i$th data point deleted.
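A sketch of both quantities on simulated data (the coefficients and seed are illustrative, not from the text): $s_{-i}$ can be obtained without refitting $n$ times by using the standard deletion identity $\mathrm{SSE}_{-i} = \mathrm{SSE} - e_i^2/(1 - h_{ii})$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 17, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([3.0, 1.5, -0.5])             # illustrative true coefficients
y = X @ beta + rng.normal(scale=2.0, size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b                                 # raw residuals e_i
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

s2 = e @ e / (n - (k + 1))                    # usual MSE, all points included
r = e / np.sqrt(s2 * (1 - h))                 # studentized residuals r_i

# s_{-i}^2 via the deletion identity SSE_{-i} = SSE - e_i^2/(1 - h_ii),
# divided by its degrees of freedom (n - 1) - (k + 1) = n - k - 2.
s2_del = (e @ e - e**2 / (1 - h)) / (n - k - 2)
t = e / np.sqrt(s2_del * (1 - h))             # R-Student values t_i
```

The identity reproduces exactly what an explicit refit with the $i$th point removed would give, which is why packages can report R-Student values cheaply.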

484 Chapter 12 Multiple Linear Regression and Certain Nonlinear Regression Models

There are three types of violations of assumptions that are readily detected through use of residuals or residual plots. While plots of the raw residuals, the e i , can be helpful, it is often more informative to plot the studentized residuals. The three violations are as follows:

1. Presence of outliers

2. Heterogeneous error variance

3. Model misspecification

In case 1, we choose to define an outlier as a data point for which there is a deviation from the usual assumption $E(\epsilon_i) = 0$ for a specific value of $i$. If there is

a reason to believe that a specific data point is an outlier exerting a large influence on the fitted model, $r_i$ or $t_i$ may be informative. The R-Student values can be expected to be more sensitive to outliers than the $r_i$ values.

In fact, under the condition that $E(\epsilon_i) = 0$, $t_i$ is a value of a random variable following a $t$-distribution with $n - 1 - (k + 1) = n - k - 2$ degrees of freedom. Thus,

a two-sided $t$-test can be used to provide information for detecting whether or not the $i$th point is an outlier. Although the R-Student statistic $t_i$ produces an exact $t$-test for detection of an outlier at a specific data location, the $t$-distribution would not apply for simultaneously testing for outliers at all locations. As a result, the studentized residuals or R-Student values should be used strictly as diagnostic tools, without formal hypothesis testing as the mechanism. The implication is that these statistics highlight data points where the error of fit is larger than what is expected by chance. R-Student values large in magnitude suggest a need for "checking" the data with whatever resources are possible. The practice of eliminating observations from regression data sets should not be done indiscriminately. (For further information regarding the use of outlier diagnostics, see Myers, 1990, in the Bibliography.)
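As a sketch of the single-point diagnostic just described (the values of $n$, $k$, and the R-Student statistic below are made up; the 5% level is only a conventional screen, not a formal simultaneous test):

```python
from scipy import stats

n, k = 17, 2
df = n - 1 - (k + 1)                 # = n - k - 2 degrees of freedom (13 here)
t_i = 3.2                            # hypothetical R-Student value at one suspect point

p_value = 2 * stats.t.sf(abs(t_i), df)   # two-sided tail probability under H0: E(eps_i) = 0
suspect = p_value < 0.05                 # diagnostic flag only, prompting a data check
```

Applying this screen to all $n$ locations at once inflates the chance of a false flag, which is exactly why the text recommends treating it as a diagnostic rather than a test.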

Illustration of Outlier Detection

Case Study 12.1: Method for Capturing Grasshoppers: In a biological experiment conducted at Virginia Tech by the Department of Entomology, $n = 17$ experimental runs were made with two different methods for capturing grasshoppers. The methods were drop net catch and sweep net catch. The average number of grasshoppers caught within a set of field quadrants on a given date was recorded for each of the two methods. An additional regressor variable, the average plant height in the quadrants, was also recorded. The experimental data are given in Table 12.10.

The goal is to be able to estimate grasshopper catch by using only the sweep net method, which is less costly. There was some concern about the validity of the fourth data point. The observed catch reported using the drop net method seemed unusually high given the other conditions, and, indeed, it was felt that the figure might be erroneous. Fit a model of the type

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$$

to the 17 data points and study the residuals to determine if data point 4 is an outlier.


Table 12.10: Data Set for Case Study 12.1 (columns: Observation; Drop Net Catch, $y$; Sweep Net Catch, $x_1$; Plant Height, $x_2$ (cm))

Solution: A computer package generated the fitted regression model

$$\hat{y} = 3.6870 + 4.1050x_1 - 0.0367x_2,$$

along with the statistics $R^2 = 0.9244$ and $s^2 = 5.580$. The residuals and other diagnostic information were also generated and recorded in Table 12.11. As expected, the residual at the fourth location appears to be unusually high, namely 7.769. The vital issue here is whether or not this residual is larger than one would expect by chance. The residual standard error for point 4 is 2.209. The R-Student value $t_4$ is found to be 9.9315. Viewing this as a value of a random variable having a $t$-distribution with 13 degrees of freedom, one would certainly conclude that the residual of the fourth observation is estimating something greater than 0 and that the suspected measurement error is supported by the study of residuals. Notice that no other residual results in an R-Student value that produces any cause for alarm.
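The arithmetic in this conclusion can be double-checked directly; only the quoted numbers 7.769, 2.209, 9.9315, and the 13 degrees of freedom are taken from the case study, with scipy supplying the $t$-distribution:

```python
from scipy import stats

e4, se4 = 7.769, 2.209           # residual at point 4 and its standard error
r4 = e4 / se4                    # ordinary studentized residual, about 3.52
t4, df = 9.9315, 13              # reported R-Student value; df = n - k - 2 = 17 - 2 - 2

crit = stats.t.ppf(0.975, df)    # two-sided 5% critical point, about 2.16
p4 = 2 * stats.t.sf(t4, df)      # tail probability: vanishingly small
```

Both the studentized residual and, far more dramatically, the R-Student value dwarf the critical point, consistent with the text's verdict on observation 4.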

Plotting Residuals for Case Study 12.1

In Chapter 11, we discussed, in some detail, the usefulness of plotting residuals in regression analysis. Violation of model assumptions can often be detected through these plots. In multiple regression, normal probability plotting of residuals or plotting of residuals against $\hat{y}$ may be useful. However, it is often preferable to plot studentized residuals.

Keep in mind that the preference for the studentized residuals over ordinary residuals for plotting purposes stems from the fact that since the variance of the


Table 12.11: Residual Information for the Data Set of Case Study 12.1

$i$th residual depends on the $i$th HAT diagonal, variances of residuals will differ if there is dispersion in the HAT diagonals. Thus, the appearance of a plot of residuals may seem to suggest heterogeneity because the residuals themselves do not behave, in general, in an ideal way. The purpose of using studentized residuals is to provide a type of standardization. Clearly, if $\sigma$ were known, then under ideal conditions (i.e., a correct model and homogeneous variance), we would have

$$E\left[\frac{e_i}{\sigma\sqrt{1 - h_{ii}}}\right] = 0 \quad \text{and} \quad \operatorname{Var}\left[\frac{e_i}{\sigma\sqrt{1 - h_{ii}}}\right] = 1.$$

So the studentized residuals produce a set of statistics that behave in a standard way under ideal conditions.

Figure 12.5 shows a plot of the R-Student values for the grasshopper data of Case Study 12.1. Note how the value for observation 4 stands out from the rest. The R-Student plot was generated by SAS software. The plot shows the residuals against the $\hat{y}$-values.
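The standardization described above can be checked by simulation (everything here is synthetic, with $\sigma$ known by construction): dividing each residual by $\sigma\sqrt{1 - h_{ii}}$ yields unit variance at every data position, even though the raw residuals do not share a common variance.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma, reps = 17, 2, 2.0, 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Simulate many response vectors from a true model with known sigma.
beta = np.array([1.0, 2.0, -1.0])
Y = X @ beta + sigma * rng.normal(size=(reps, n))
E = Y @ (np.eye(n) - H)                # residual vectors e = (I - H)y, row by row
Z = E / (sigma * np.sqrt(1 - h))       # standardized residuals

per_point_var = Z.var(axis=0)          # close to 1 at every position i
raw_var = E.var(axis=0)                # close to (1 - h_ii) sigma^2, which varies with i
```

The raw residual variances track $(1 - h_{ii})\sigma^2$ and so differ across positions; the standardized ones do not, which is the whole point of studentizing before plotting.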

Normality Checking

The reader should recall the importance of normality checking through the use of normal probability plotting, as discussed in Chapter 11. The same recommendation holds for the case of multiple linear regression. Normal probability plots can be generated using standard regression software. Again, however, they can be more effective when one does not use ordinary residuals but, rather, studentized residuals or R-Student values.
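One way to sketch such a check without a graphics package (the residuals here are synthetic stand-ins; `scipy.stats.probplot` computes the normal probability plot coordinates and a straight-line fit):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
resid = rng.standard_normal(30)        # stand-in for studentized or R-Student residuals

# osm: theoretical normal quantiles; osr: ordered residuals.
(osm, osr), (slope, intercept, r_corr) = stats.probplot(resid, dist="norm")
# r_corr near 1 means the points hug a straight line: no gross non-normality.
```

Plotting `osr` against `osm` reproduces the normal probability plot; with real data, a low correlation or a point far off the line (such as observation 4 in Case Study 12.1) warrants a closer look.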


Figure 12.5: R-Student values plotted against predicted values for grasshopper data of Case Study 12.1. (Vertical axis: studentized residual without current observation; horizontal axis: predicted value of $y$; the labeled point for observation 4, near 10, stands apart from the rest.)