
12.10 Study of Residuals and Violation of Assumptions (Model Checking)

It was suggested earlier in this chapter that the residuals, or errors in the regression fit, often carry information that can be very useful to the data analyst. The residuals

$$e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n,$$

which are the numerical counterparts of the model errors $\epsilon_i$, often shed light on the possible violation of assumptions or the presence of "suspect" data points. Suppose that we let the vector $\mathbf{x}_i$ denote the values of the regressor variables corresponding to the ith data point, supplemented by a 1 in the initial position. That is,

$$\mathbf{x}_i' = [1, x_{1i}, x_{2i}, \ldots, x_{ki}].$$

Consider the quantity

$$h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i, \quad i = 1, 2, \ldots, n.$$


The reader should recognize that $h_{ii}$ was used in the computation of the confidence intervals on the mean response in Section 12.5. Apart from $\sigma^2$, $h_{ii}$ represents the variance of the fitted value $\hat{y}_i$. The $h_{ii}$ values are the diagonal elements of the HAT matrix

$$\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}',$$

which plays an important role in any study of residuals and in other modern aspects of regression analysis (see Myers, 1990, listed in the Bibliography). The term HAT matrix derives from the fact that $\mathbf{H}$ generates the "y-hats," or fitted values, when multiplied by the vector $\mathbf{y}$ of observed responses. That is, $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$, and thus

$$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y},$$

where $\hat{\mathbf{y}}$ is the vector whose ith element is $\hat{y}_i$.
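To make the algebra concrete, the following is a minimal numpy sketch (an illustration, not from the text) that forms $\mathbf{H}$ from a design matrix and verifies that $\mathbf{H}\mathbf{y}$ reproduces the fitted values $\mathbf{X}\mathbf{b}$; the simulated X and y are purely illustrative.

```python
import numpy as np

def hat_matrix(X):
    """HAT matrix H = X (X'X)^{-1} X' for a design matrix X whose
    first column is the column of ones."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return X @ XtX_inv @ X.T

# Purely illustrative data: intercept column plus k = 2 regressors.
rng = np.random.default_rng(0)
n, k = 10, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)

H = hat_matrix(X)
h = np.diag(H)                          # HAT diagonals h_ii = x_i'(X'X)^{-1} x_i
b = np.linalg.solve(X.T @ X, X.T @ y)   # least squares coefficients
print(np.allclose(H @ y, X @ b))        # True: y-hat = Hy = Xb
```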

If we make the usual assumptions that the $\epsilon_i$ are independent and normally distributed with mean 0 and variance $\sigma^2$, the statistical properties of the residuals are readily characterized. Then

$$E(e_i) = 0 \quad \text{and} \quad \sigma^2_{e_i} = \sigma^2(1 - h_{ii})$$

for i = 1, 2, . . . , n. (See Myers, 1990, for details.) It can be shown that the HAT diagonal values are bounded according to the inequality

$$\frac{1}{n} \le h_{ii} \le 1.$$

In addition, $\sum_{i=1}^{n} h_{ii} = k + 1$, the number of regression parameters. As a result, any data point whose HAT diagonal element is large, that is, well above the average value of $(k + 1)/n$, is in a position in the data set where the variance of $\hat{y}_i$ is relatively large and the variance of a residual is relatively small. From this, the data analyst can gain some insight into how large a residual may become before its deviation from zero can be attributed to something other than mere chance.
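The short sketch below (again with simulated, purely illustrative data) checks these two facts numerically and flags high-leverage points; flagging any $h_{ii}$ beyond twice the average $(k + 1)/n$ is just one common working cutoff, not something prescribed by the text.

```python
import numpy as np

def hat_diagonals(X):
    """Diagonal of H = X (X'X)^{-1} X' without forming the full matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.sum((X @ XtX_inv) * X, axis=1)

# Purely illustrative design matrix: intercept plus k = 2 regressors.
rng = np.random.default_rng(1)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
h = hat_diagonals(X)

print(h.min() >= 1.0 / n, h.max() <= 1.0)   # bounds: 1/n <= h_ii <= 1
print(np.isclose(h.sum(), k + 1))           # sum of the h_ii equals k + 1

# Leverages well above the average (k + 1)/n mark positions where the
# variance of y-hat_i is large; the factor 2 is only an illustrative cutoff.
print(np.where(h > 2 * (k + 1) / n)[0])
```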

Many commercial regression computer packages produce the set of studentized residuals.

Studentized Residual

$$r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}, \quad i = 1, 2, \ldots, n.$$

Here each residual has been divided by an estimate of its standard deviation, creating a t-like statistic designed to give the analyst a scale-free quantity that conveys the size of the residual. In addition, standard computer packages often provide values of another set of studentized-type residuals, called the R-Student values.

R-Student Residual

$$t_i = \frac{e_i}{s_{-i}\sqrt{1 - h_{ii}}}, \quad i = 1, 2, \ldots, n,$$

where $s_{-i}$ is an estimate of the error standard deviation, calculated with the ith data point deleted.
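As a concrete rendering of these two definitions, here is a minimal numpy sketch (an illustration, not the textbook's code) that computes the ordinary residuals, the HAT diagonals, the studentized residuals $r_i$, and the R-Student values $t_i$, obtaining each $s_{-i}$ by literally refitting with the ith observation deleted.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Ordinary residuals, HAT diagonals, studentized residuals, and
    R-Student values for an OLS fit; X already contains the column of ones."""
    n, p = X.shape                          # p = k + 1 regression parameters
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b                           # ordinary residuals e_i
    h = np.sum((X @ XtX_inv) * X, axis=1)   # HAT diagonals h_ii
    s2 = e @ e / (n - p)                    # usual estimate of sigma^2
    r = e / np.sqrt(s2 * (1.0 - h))         # studentized residuals r_i

    # R-Student: replace s by s_{-i}, the error standard deviation
    # estimated with the ith observation deleted.
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
        ei = yi - Xi @ bi
        s2_minus_i = ei @ ei / (n - 1 - p)  # n - 1 - (k + 1) degrees of freedom
        t[i] = e[i] / np.sqrt(s2_minus_i * (1.0 - h[i]))
    return e, h, r, t
```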


  There are three types of violations of assumptions that are readily detected through use of residuals or residual plots. While plots of the raw residuals, the e i , can be helpful, it is often more informative to plot the studentized residuals. The three violations are as follows:

  1. Presence of outliers

  2. Heterogeneous error variance

3. Model misspecification

In case 1, we choose to define an outlier as a data point where there is a deviation from the usual assumption $E(\epsilon_i) = 0$ for a specific value of i. If there is a reason to believe that a specific data point is an outlier exerting a large influence on the fitted model, $r_i$ or $t_i$ may be informative. The R-Student values can be expected to be more sensitive to outliers than the $r_i$ values.

In fact, under the condition that $E(\epsilon_i) = 0$, $t_i$ is a value of a random variable following a t-distribution with $n - 1 - (k + 1) = n - k - 2$ degrees of freedom. Thus, a two-sided t-test can be used to provide information for detecting whether or not the ith point is an outlier.

Although the R-Student statistic $t_i$ produces an exact t-test for detection of an outlier at a specific data location, the t-distribution would not apply for simultaneously testing for outliers at all locations. As a result, the studentized residuals or R-Student values should be used strictly as diagnostic tools without formal hypothesis testing as the mechanism. The implication is that these statistics highlight data points where the error of fit is larger than what is expected by chance. R-Student values large in magnitude suggest a need for "checking" the data with whatever resources are possible. The practice of eliminating observations from regression data sets should not be done indiscriminately. (For further information regarding the use of outlier diagnostics, see Myers, 1990, in the Bibliography.)
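In that diagnostic spirit, a sketch like the following (a hypothetical helper, with scipy assumed available) simply screens the R-Student values against a two-sided t reference value with $n - k - 2$ degrees of freedom rather than performing a formal simultaneous test; the level 0.05 is only an illustrative choice. For instance, with n = 17 and k = 2 (13 degrees of freedom), the reference value is about 2.16.

```python
import numpy as np
from scipy import stats

def screen_r_student(t_values, n, k, alpha=0.05):
    """Flag observations whose |t_i| exceeds the two-sided t reference
    with n - k - 2 degrees of freedom.  A screening device only, not a
    simultaneous outlier test."""
    cutoff = stats.t.ppf(1.0 - alpha / 2.0, df=n - k - 2)
    return np.where(np.abs(np.asarray(t_values)) > cutoff)[0]
```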

  Illustration of Outlier Detection

Case Study 12.1: Method for Capturing Grasshoppers: In a biological experiment conducted at Virginia Tech by the Department of Entomology, n = 17 experimental runs were made with two different methods for capturing grasshoppers. The methods were drop net catch and sweep net catch. The average number of grasshoppers caught within a set of field quadrants on a given date was recorded for each of the two methods. An additional regressor variable, the average plant height in the quadrants, was also recorded. The experimental data are given in Table 12.10.

The goal is to be able to estimate grasshopper catch by using only the sweep net method, which is less costly. There was some concern about the validity of the fourth data point. The observed catch reported for the drop net method seemed unusually high given the other conditions, and indeed it was felt that the figure might be erroneous. Fit a model of the type

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$$

to the 17 data points and study the residuals to determine whether data point 4 is an outlier.


Table 12.10: Data Set for Case Study 12.1 (columns: Observation; Drop Net Catch, y; Sweep Net Catch, $x_1$; Plant Height, $x_2$ (cm))

Solution: A computer package generated the fitted regression model

$$\hat{y} = 3.6870 + 4.1050x_1 - 0.0367x_2,$$

along with the statistics $R^2 = 0.9244$ and $s^2 = 5.580$. The residuals and other diagnostic information were also generated and recorded in Table 12.11.

Table 12.11: Residual Information for the Data Set of Case Study 12.1 (residuals, residual standard errors, studentized residuals $r_i$, and R-Student values $t_i$ for the 17 observations)

As expected, the residual at the fourth location appears to be unusually high, namely 7.769. The vital issue here is whether or not this residual is larger than one would expect by chance. The residual standard error for point 4 is 2.209, and the R-Student value $t_4$ is found to be 9.9315. Viewing this as a value of a random variable having a t-distribution with 13 degrees of freedom, one would certainly conclude that the residual of the fourth observation is estimating something greater than 0 and that the suspected measurement error is supported by the study of residuals. Notice that no other residual results in an R-Student value that produces any cause for alarm.
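For readers who want to reproduce these diagnostics with standard software, the sketch below uses the statsmodels package (one possible choice; the text itself used SAS). It assumes the three columns of Table 12.10 are available as arrays y, x1, and x2, which are not listed here.

```python
import numpy as np
import statsmodels.api as sm

def case_study_12_1_diagnostics(y, x1, x2):
    """Fit y on x1, x2 with an intercept and return the quantities
    discussed above.  y, x1, x2 are assumed to hold the 17 rows of
    Table 12.10 (not reproduced here)."""
    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()
    influence = fit.get_influence()
    return {
        "b": fit.params,                            # text: roughly (3.687, 4.105, -0.0367)
        "s2": fit.mse_resid,                        # text: 5.580
        "residuals": fit.resid,                     # text: e_4 = 7.769
        "r": influence.resid_studentized_internal,  # studentized residuals r_i
        "t": influence.resid_studentized_external,  # R-Student values (text: t_4 = 9.9315)
    }
```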

  Plotting Residuals for Case Study 12.1

In Chapter 11, we discussed, in some detail, the usefulness of plotting residuals in regression analysis. Violation of model assumptions can often be detected through these plots. In multiple regression, normal probability plotting of residuals or plotting of residuals against $\hat{y}$ may be useful. However, it is often preferable to plot studentized residuals.

Keep in mind that the preference for the studentized residuals over ordinary residuals for plotting purposes stems from the fact that, since the variance of the ith residual depends on the ith HAT diagonal, the variances of the residuals will differ if there is dispersion in the HAT diagonals. Thus, the appearance of a plot of ordinary residuals may seem to suggest heterogeneity because the residuals themselves do not behave, in general, in an ideal way. The purpose of using studentized residuals is to provide a type of standardization. Clearly, if σ were known, then under ideal conditions (i.e., a correct model and homogeneous variance), we would have

$$E\!\left(\frac{e_i}{\sigma\sqrt{1 - h_{ii}}}\right) = 0 \quad \text{and} \quad \operatorname{Var}\!\left(\frac{e_i}{\sigma\sqrt{1 - h_{ii}}}\right) = 1.$$

So the studentized residuals produce a set of statistics that behave in a standard way under ideal conditions. Figure 12.5 shows a plot of the R-Student values for the grasshopper data of Case Study 12.1. Note how the value for observation 4 stands out from the rest. The R-Student plot was generated by SAS software; it shows the residuals plotted against the $\hat{y}$-values.
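A plot of this kind can be produced with a few lines of matplotlib; the sketch below is an illustration (not the SAS output) and assumes the fitted values and R-Student values have already been computed, for example by the routines sketched earlier in this section.

```python
import matplotlib.pyplot as plt

def r_student_plot(y_hat, t):
    """R-Student values against fitted values, in the spirit of Figure 12.5."""
    fig, ax = plt.subplots()
    ax.scatter(y_hat, t)
    ax.axhline(0.0, linestyle="--")   # under ideal conditions the points scatter about zero
    ax.set_xlabel("Predicted Value of Y")
    ax.set_ylabel("Studentized Residual without Current Obs")
    plt.show()
```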

  Normality Checking

  The reader should recall the importance of normality checking through the use of normal probability plotting, as discussed in Chapter 11. The same recommendation holds for the case of multiple linear regression. Normal probability plots can be generated using standard regression software. Again, however, they can be more effective when one does not use ordinary residuals but, rather, studentized residuals or R-Student values.
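A normal probability plot of the studentized residuals or R-Student values can be generated, for example, with scipy's probplot; the sketch below is illustrative and assumes the residual values are already in hand.

```python
import matplotlib.pyplot as plt
from scipy import stats

def normal_probability_plot(r):
    """Normal probability (Q-Q) plot of studentized residuals or R-Student values;
    an approximately straight line supports the normality assumption."""
    fig, ax = plt.subplots()
    stats.probplot(r, dist="norm", plot=ax)
    plt.show()
```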


Figure 12.5: R-Student values plotted against predicted values for the grasshopper data of Case Study 12.1 (vertical axis: Studentized Residual without Current Obs; horizontal axis: Predicted Value of Y).