
12.10 Study of Residuals and Violation of Assumptions (Model Checking)

It was suggested earlier in this chapter that the residuals, or errors in the regression fit, often carry information that can be very useful to the data analyst. The residuals

$$e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n,$$

which are the numerical counterparts of the model errors $\epsilon_i$, often shed light on the possible violation of assumptions or the presence of "suspect" data points. Suppose that we let the vector $\mathbf{x}_i$ denote the values of the regressor variables corresponding to the ith data point, supplemented by a 1 in the initial position. That is,

$$\mathbf{x}_i' = [1, x_{1i}, x_{2i}, \ldots, x_{ki}].$$

Consider the quantity

$$h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i, \quad i = 1, 2, \ldots, n.$$


The reader should recognize that $h_{ii}$ was used in the computation of the confidence intervals on the mean response in Section 12.5. Apart from $\sigma^2$, $h_{ii}$ represents the variance of the fitted value $\hat{y}_i$. The $h_{ii}$ values are the diagonal elements of the HAT matrix

$$\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}',$$

which plays an important role in any study of residuals and in other modern aspects of regression analysis (see Myers, 1990, listed in the Bibliography). The term HAT matrix derives from the fact that $\mathbf{H}$ generates the "y-hats," or fitted values, when multiplied by the vector $\mathbf{y}$ of observed responses. That is, $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$, and thus

$$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y},$$

where $\hat{\mathbf{y}}$ is the vector whose ith element is $\hat{y}_i$.
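To make the algebra concrete, the following is a minimal numpy sketch (an illustration, not from the text) that forms $\mathbf{H}$ from a design matrix and verifies that $\mathbf{H}\mathbf{y}$ reproduces the fitted values $\mathbf{X}\mathbf{b}$; the simulated X and y are purely illustrative.

```python
import numpy as np

def hat_matrix(X):
    """HAT matrix H = X (X'X)^{-1} X' for a design matrix X whose
    first column is the column of ones."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return X @ XtX_inv @ X.T

# Purely illustrative data: intercept column plus k = 2 regressors.
rng = np.random.default_rng(0)
n, k = 10, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)

H = hat_matrix(X)
h = np.diag(H)                          # HAT diagonals h_ii = x_i'(X'X)^{-1} x_i
b = np.linalg.solve(X.T @ X, X.T @ y)   # least squares coefficients
print(np.allclose(H @ y, X @ b))        # True: y-hat = Hy = Xb
```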

If we make the usual assumptions that the $\epsilon_i$ are independent and normally distributed with mean 0 and variance $\sigma^2$, the statistical properties of the residuals are readily characterized. Then

$$E(e_i) = 0 \quad \text{and} \quad \sigma^2_{e_i} = \sigma^2(1 - h_{ii})$$

for i = 1, 2, . . . , n. (See Myers, 1990, for details.) It can be shown that the HAT diagonal values are bounded according to the inequality

$$\frac{1}{n} \le h_{ii} \le 1.$$

In addition, $\sum_{i=1}^{n} h_{ii} = k + 1$, the number of regression parameters. As a result, any data point whose HAT diagonal element is large, that is, well above the average value of $(k + 1)/n$, is in a position in the data set where the variance of $\hat{y}_i$ is relatively large and the variance of a residual is relatively small. From this, the data analyst can gain some insight into how large a residual may become before its deviation from zero can be attributed to something other than mere chance.
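The short sketch below (again with simulated, purely illustrative data) checks these two facts numerically and flags high-leverage points; flagging any $h_{ii}$ beyond twice the average $(k + 1)/n$ is just one common working cutoff, not something prescribed by the text.

```python
import numpy as np

def hat_diagonals(X):
    """Diagonal of H = X (X'X)^{-1} X' without forming the full matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.sum((X @ XtX_inv) * X, axis=1)

# Purely illustrative design matrix: intercept plus k = 2 regressors.
rng = np.random.default_rng(1)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
h = hat_diagonals(X)

print(h.min() >= 1.0 / n, h.max() <= 1.0)   # bounds: 1/n <= h_ii <= 1
print(np.isclose(h.sum(), k + 1))           # sum of the h_ii equals k + 1

# Leverages well above the average (k + 1)/n mark positions where the
# variance of y-hat_i is large; the factor 2 is only an illustrative cutoff.
print(np.where(h > 2 * (k + 1) / n)[0])
```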

Many commercial regression computer packages produce the set of studentized residuals.

Studentized Residual

$$r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}, \quad i = 1, 2, \ldots, n.$$

Here each residual has been divided by an estimate of its standard deviation, creating a t-like statistic designed to give the analyst a scale-free quantity that conveys the size of the residual. In addition, standard computer packages often provide values of another set of studentized-type residuals, called the R-Student values.

R-Student Residual

$$t_i = \frac{e_i}{s_{-i}\sqrt{1 - h_{ii}}}, \quad i = 1, 2, \ldots, n,$$

where $s_{-i}$ is an estimate of the error standard deviation, calculated with the ith data point deleted.
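As a concrete rendering of these two definitions, here is a minimal numpy sketch (an illustration, not the textbook's code) that computes the ordinary residuals, the HAT diagonals, the studentized residuals $r_i$, and the R-Student values $t_i$, obtaining each $s_{-i}$ by literally refitting with the ith observation deleted.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Ordinary residuals, HAT diagonals, studentized residuals, and
    R-Student values for an OLS fit; X already contains the column of ones."""
    n, p = X.shape                          # p = k + 1 regression parameters
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b                           # ordinary residuals e_i
    h = np.sum((X @ XtX_inv) * X, axis=1)   # HAT diagonals h_ii
    s2 = e @ e / (n - p)                    # usual estimate of sigma^2
    r = e / np.sqrt(s2 * (1.0 - h))         # studentized residuals r_i

    # R-Student: replace s by s_{-i}, the error standard deviation
    # estimated with the ith observation deleted.
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
        ei = yi - Xi @ bi
        s2_minus_i = ei @ ei / (n - 1 - p)  # n - 1 - (k + 1) degrees of freedom
        t[i] = e[i] / np.sqrt(s2_minus_i * (1.0 - h[i]))
    return e, h, r, t
```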


  There are three types of violations of assumptions that are readily detected through use of residuals or residual plots. While plots of the raw residuals, the e i , can be helpful, it is often more informative to plot the studentized residuals. The three violations are as follows:

  1. Presence of outliers

  2. Heterogeneous error variance

3. Model misspecification

In case 1, we choose to define an outlier as a data point where there is a deviation from the usual assumption $E(\epsilon_i) = 0$ for a specific value of i. If there is a reason to believe that a specific data point is an outlier exerting a large influence on the fitted model, $r_i$ or $t_i$ may be informative. The R-Student values can be expected to be more sensitive to outliers than the $r_i$ values.

In fact, under the condition that $E(\epsilon_i) = 0$, $t_i$ is a value of a random variable following a t-distribution with $n - 1 - (k + 1) = n - k - 2$ degrees of freedom. Thus, a two-sided t-test can be used to provide information for detecting whether or not the ith point is an outlier.

Although the R-Student statistic $t_i$ produces an exact t-test for detection of an outlier at a specific data location, the t-distribution would not apply for simultaneously testing for outliers at all locations. As a result, the studentized residuals or R-Student values should be used strictly as diagnostic tools without formal hypothesis testing as the mechanism. The implication is that these statistics highlight data points where the error of fit is larger than what is expected by chance. R-Student values large in magnitude suggest a need for "checking" the data with whatever resources are possible. The practice of eliminating observations from regression data sets should not be done indiscriminately. (For further information regarding the use of outlier diagnostics, see Myers, 1990, in the Bibliography.)
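In that diagnostic spirit, a sketch like the following (a hypothetical helper, with scipy assumed available) simply screens the R-Student values against a two-sided t reference value with $n - k - 2$ degrees of freedom rather than performing a formal simultaneous test; the level 0.05 is only an illustrative choice. For instance, with n = 17 and k = 2 (13 degrees of freedom), the reference value is about 2.16.

```python
import numpy as np
from scipy import stats

def screen_r_student(t_values, n, k, alpha=0.05):
    """Flag observations whose |t_i| exceeds the two-sided t reference
    with n - k - 2 degrees of freedom.  A screening device only, not a
    simultaneous outlier test."""
    cutoff = stats.t.ppf(1.0 - alpha / 2.0, df=n - k - 2)
    return np.where(np.abs(np.asarray(t_values)) > cutoff)[0]
```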

  Illustration of Outlier Detection

Case Study 12.1: Method for Capturing Grasshoppers: In a biological experiment conducted at Virginia Tech by the Department of Entomology, n = 17 experimental runs were made with two different methods for capturing grasshoppers. The methods were drop net catch and sweep net catch. The average number of grasshoppers caught within a set of field quadrants on a given date was recorded for each of the two methods. An additional regressor variable, the average plant height in the quadrants, was also recorded. The experimental data are given in Table 12.10.

The goal is to be able to estimate grasshopper catch by using only the sweep net method, which is less costly. There was some concern about the validity of the fourth data point. The observed catch reported for the drop net method seemed unusually high given the other conditions, and indeed it was felt that the figure might be erroneous. Fit a model of the type

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$$

to the 17 data points and study the residuals to determine whether data point 4 is an outlier.


Table 12.10: Data Set for Case Study 12.1 (columns: Observation; Drop Net Catch, y; Sweep Net Catch, $x_1$; Plant Height, $x_2$ (cm))

Solution: A computer package generated the fitted regression model

$$\hat{y} = 3.6870 + 4.1050x_1 - 0.0367x_2,$$

along with the statistics $R^2 = 0.9244$ and $s^2 = 5.580$. The residuals and other diagnostic information were also generated and recorded in Table 12.11.

Table 12.11: Residual Information for the Data Set of Case Study 12.1 (residuals, residual standard errors, studentized residuals $r_i$, and R-Student values $t_i$ for the 17 observations)

As expected, the residual at the fourth location appears to be unusually high, namely 7.769. The vital issue here is whether or not this residual is larger than one would expect by chance. The residual standard error for point 4 is 2.209, and the R-Student value $t_4$ is found to be 9.9315. Viewing this as a value of a random variable having a t-distribution with 13 degrees of freedom, one would certainly conclude that the residual of the fourth observation is estimating something greater than 0 and that the suspected measurement error is supported by the study of residuals. Notice that no other residual results in an R-Student value that produces any cause for alarm.
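For readers who want to reproduce these diagnostics with standard software, the sketch below uses the statsmodels package (one possible choice; the text itself used SAS). It assumes the three columns of Table 12.10 are available as arrays y, x1, and x2, which are not listed here.

```python
import numpy as np
import statsmodels.api as sm

def case_study_12_1_diagnostics(y, x1, x2):
    """Fit y on x1, x2 with an intercept and return the quantities
    discussed above.  y, x1, x2 are assumed to hold the 17 rows of
    Table 12.10 (not reproduced here)."""
    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()
    influence = fit.get_influence()
    return {
        "b": fit.params,                            # text: roughly (3.687, 4.105, -0.0367)
        "s2": fit.mse_resid,                        # text: 5.580
        "residuals": fit.resid,                     # text: e_4 = 7.769
        "r": influence.resid_studentized_internal,  # studentized residuals r_i
        "t": influence.resid_studentized_external,  # R-Student values (text: t_4 = 9.9315)
    }
```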

  Plotting Residuals for Case Study 12.1

In Chapter 11, we discussed, in some detail, the usefulness of plotting residuals in regression analysis. Violation of model assumptions can often be detected through these plots. In multiple regression, normal probability plotting of residuals or plotting of residuals against $\hat{y}$ may be useful. However, it is often preferable to plot studentized residuals.

Keep in mind that the preference for the studentized residuals over ordinary residuals for plotting purposes stems from the fact that, since the variance of the ith residual depends on the ith HAT diagonal, the variances of the residuals will differ if there is dispersion in the HAT diagonals. Thus, the appearance of a plot of ordinary residuals may seem to suggest heterogeneity because the residuals themselves do not behave, in general, in an ideal way. The purpose of using studentized residuals is to provide a type of standardization. Clearly, if σ were known, then under ideal conditions (i.e., a correct model and homogeneous variance), we would have

$$E\!\left(\frac{e_i}{\sigma\sqrt{1 - h_{ii}}}\right) = 0 \quad \text{and} \quad \operatorname{Var}\!\left(\frac{e_i}{\sigma\sqrt{1 - h_{ii}}}\right) = 1.$$

So the studentized residuals produce a set of statistics that behave in a standard way under ideal conditions. Figure 12.5 shows a plot of the R-Student values for the grasshopper data of Case Study 12.1. Note how the value for observation 4 stands out from the rest. The R-Student plot was generated by SAS software; it shows the residuals plotted against the $\hat{y}$-values.
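A plot of this kind can be produced with a few lines of matplotlib; the sketch below is an illustration (not the SAS output) and assumes the fitted values and R-Student values have already been computed, for example by the routines sketched earlier in this section.

```python
import matplotlib.pyplot as plt

def r_student_plot(y_hat, t):
    """R-Student values against fitted values, in the spirit of Figure 12.5."""
    fig, ax = plt.subplots()
    ax.scatter(y_hat, t)
    ax.axhline(0.0, linestyle="--")   # under ideal conditions the points scatter about zero
    ax.set_xlabel("Predicted Value of Y")
    ax.set_ylabel("Studentized Residual without Current Obs")
    plt.show()
```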

  Normality Checking

  The reader should recall the importance of normality checking through the use of normal probability plotting, as discussed in Chapter 11. The same recommendation holds for the case of multiple linear regression. Normal probability plots can be generated using standard regression software. Again, however, they can be more effective when one does not use ordinary residuals but, rather, studentized residuals or R-Student values.
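A normal probability plot of the studentized residuals or R-Student values can be generated, for example, with scipy's probplot; the sketch below is illustrative and assumes the residual values are already in hand.

```python
import matplotlib.pyplot as plt
from scipy import stats

def normal_probability_plot(r):
    """Normal probability (Q-Q) plot of studentized residuals or R-Student values;
    an approximately straight line supports the normality assumption."""
    fig, ax = plt.subplots()
    stats.probplot(r, dist="norm", plot=ax)
    plt.show()
```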


Figure 12.5: R-Student values plotted against predicted values for the grasshopper data of Case Study 12.1 (vertical axis: Studentized Residual without Current Obs; horizontal axis: Predicted Value of Y).