Assessing Model Adequacy
13.1 Assessing Model Adequacy
A plot of the observed pairs $(x_i, y_i)$ is a necessary first step in deciding on the form of a mathematical relationship between x and y. It is possible to fit many functions other than a linear one ($y = b_0 + b_1 x$) to the data, using either the principle of least squares or another fitting method. Once a function of the chosen form has been fitted, it is important to check the fit of the model to see whether it is in fact appropriate. One way to study the fit is to superimpose a graph of the best-fit function on the scatterplot of the data. However, any tilt or curvature of the best-fit function may obscure some aspects of the fit that should be investigated. Furthermore, the scale on the vertical axis may make it difficult to assess the extent to which observed values deviate from the best-fit function.
Residuals and Standardized Residuals
A more effective approach to assessment of model adequacy is to compute the fitted or predicted values $\hat{y}_i$ and the residuals $e_i = y_i - \hat{y}_i$, and then plot various functions of these computed quantities. We then examine the plots either to confirm our choice of model or for indications that the model is not appropriate. Suppose the simple linear regression model is correct, and let $y = \hat{\beta}_0 + \hat{\beta}_1 x$ be the equation of the estimated regression line. Then the ith residual is $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$. To derive properties of the residuals, let $e_i = Y_i - \hat{Y}_i$ represent the ith residual as a random variable (rv) before observations are actually made. Then

$$E(Y_i - \hat{Y}_i) = E(Y_i) - E(\hat{\beta}_0 + \hat{\beta}_1 x_i) = \beta_0 + \beta_1 x_i - (\beta_0 + \beta_1 x_i) = 0 \tag{13.1}$$

Because $\hat{Y}_i$ ($= \hat{\beta}_0 + \hat{\beta}_1 x_i$) is a linear function of the $Y_j$'s, so is $Y_i - \hat{Y}_i$ (the coefficients depend on the $x_j$'s). Thus the normality of the $Y_j$'s implies that each residual is normally distributed. It can also be shown that
$$V(Y_i - \hat{Y}_i) = \sigma^2 \cdot \left[ 1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}} \right] \tag{13.2}$$
Replacing $\sigma^2$ by $s^2$ and taking the square root of Equation (13.2) gives the estimated standard deviation of a residual.
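The claim that each residual is a linear function of the $Y_j$'s can be checked numerically. The sketch below is only an illustration under stated assumptions: the x values and the value of sigma are made up, and the helper names `hat_diagonal` and `residual_coefficients` are hypothetical. It writes $Y_i - \hat{Y}_i = \sum_j c_j Y_j$, computes $\sigma^2 \sum_j c_j^2$ (valid because the $Y_j$'s are independent with common variance), and compares the result with Expression (13.2).

```python
def hat_diagonal(x, i):
    """Return 1/n + (x_i - xbar)^2 / Sxx, so that (13.2) is sigma^2 * (1 - this)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xj - xbar) ** 2 for xj in x)
    return 1 / n + (x[i] - xbar) ** 2 / sxx


def residual_coefficients(x, i):
    """Coefficients c_j such that Y_i - Yhat_i = sum_j c_j * Y_j."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xj - xbar) ** 2 for xj in x)
    return [
        (1.0 if j == i else 0.0) - 1 / n - (x[i] - xbar) * (x[j] - xbar) / sxx
        for j in range(n)
    ]


# Hypothetical design points and error standard deviation (not from the text).
x_demo = [1.0, 2.0, 4.0, 7.0, 11.0]
sigma_demo = 2.0

variance_pairs = []
for i in range(len(x_demo)):
    c = residual_coefficients(x_demo, i)
    # Variance of a linear combination of independent Y_j's with variance sigma^2.
    from_coefficients = sigma_demo ** 2 * sum(cj ** 2 for cj in c)
    # Variance according to Expression (13.2).
    from_formula = sigma_demo ** 2 * (1 - hat_diagonal(x_demo, i))
    variance_pairs.append((from_coefficients, from_formula))
    print(f"i={i + 1}: {from_coefficients:.6f} vs {from_formula:.6f}")
```

The two columns should agree for every i, and the residual variances sum to $(n - 2)\sigma^2$, which is consistent with SSE having n − 2 degrees of freedom.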
Let’s now standardize each residual by subtracting the mean value (zero) and then dividing by the estimated standard deviation.
The standardized residuals are given by

$$e_i^* = \frac{y_i - \hat{y}_i}{s\sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar{x})^2}{S_{xx}}}} \qquad i = 1, \ldots, n \tag{13.3}$$
If, for example, a particular standardized residual is 1.5, then the residual itself is 1.5 (estimated) standard deviations larger than what would be expected from fitting the correct model. Notice that the variances of the residuals differ from one another. In fact, because there is a $-$ sign in front of $(x_i - \bar{x})^2$, the variance of a residual decreases as $x_i$ moves further away from the center of the data $\bar{x}$. Intuitively, this is because the least squares line is pulled toward an observation whose $x_i$ value lies far to the right or left of other observations in the sample. Computation of the $e_i^*$'s can be tedious, but the most widely used statistical computer packages will provide these values and construct various plots involving them.
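As a rough illustration of what such a package does, the following sketch computes fitted values, residuals, and standardized residuals directly from Equation (13.3). The data are made up for illustration (they are not the data of any example in this section), and the function names `fit_line` and `standardized_residuals` are hypothetical. The sketch assumes the fit is not perfect, since s = 0 would cause a division by zero.

```python
import math


def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def standardized_residuals(x, y):
    """Return (residuals, standardized residuals, s) following Equation (13.3)."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))  # estimate of sigma
    std = [
        e / (s * math.sqrt(1 - 1 / n - (xi - xbar) ** 2 / sxx))
        for e, xi in zip(resid, x)
    ]
    return resid, std, s


# Hypothetical data, chosen only to exercise the formulas.
x_demo = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 13.0, 15.0]
y_demo = [5.1, 8.9, 11.2, 14.8, 17.1, 20.7, 27.3, 30.6]

resid_demo, std_demo, s_demo = standardized_residuals(x_demo, y_demo)
for xi, e, e_star in zip(x_demo, resid_demo, std_demo):
    print(f"x={xi:5.1f}  e={e:8.4f}  e*={e_star:8.4f}")
```

Because the factor under the square root is less than 1 and varies with $x_i$, each standardized residual is at least as large in magnitude as $e_i / s$, and the two are not exactly proportional.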
Example 13.1
Exercise 19 in Chapter 12 presented data on x = burner area liberation rate and y = NO$_x$ emissions. Here we reproduce the data and give the fitted values, residuals, and standardized residuals. The estimated regression line is $y = -45.55 + 1.71x$, and $r^2 = .961$. The standardized residuals are not a constant multiple of the residuals because the residual variances differ somewhat from one another.
Diagnostic Plots
The basic plots that many statisticians recommend for an assessment of model validity and usefulness are the following:
1. $e^*$ (or e) on the vertical axis versus x on the horizontal axis; that is, a plot of the $(x_i, e_i^*)$ pairs [or the $(x_i, e_i)$ pairs]
2. $e^*$ (or e) on the vertical axis versus $\hat{y}$ on the horizontal axis; that is, a plot of the $(\hat{y}_i, e_i^*)$ pairs [or the $(\hat{y}_i, e_i)$ pairs]
3. $\hat{y}$ on the vertical axis versus y on the horizontal axis; that is, a plot of the $(y_i, \hat{y}_i)$ pairs
4. A normal probability plot of the standardized residuals

Plots 1 and 2 are called residual plots (against the independent variable and fitted values, respectively). Since $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is a linear function of x, the general pattern of points in Plot 2 should be identical to that in Plot 1, though the horizontal scales will differ (in multiple regression, there is a Plot 1 for each predictor, and Plot 2 is a single omnibus picture that combines information from all of those). Provided that the chosen model is correct, neither residual plot should exhibit any discernible pattern. The residuals should be randomly distributed about 0 according to a normal distribution, so all or almost all $e^*$'s should lie between $-2$ and $+2$.
We hope that the fitted model will give predicted y values that are close to their observed counterparts. This would manifest itself in Plot 3 by plotted points falling close to a 45° line. Thus this plot provides a visual assessment of model effectiveness in making predictions. Plot 4 allows the analyst to assess the plausibility of assuming that the random deviation $\epsilon$ in the model equation has a normal distribution. If the pattern in the plot departs substantially from linearity, then the inferential procedures from Chapter 12 based on the $t_{n-2}$ distribution should not be used as a basis for drawing conclusions.
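The coordinates for Plot 4 can be sketched with the Python standard library alone. This is a minimal illustration, assuming the plotting positions $(i - .5)/n$ for the z percentiles; other conventions exist, and the sample of standardized residuals below is invented. The function name `normal_plot_points` is hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(values):
    """Pair each ordered value with the standard normal percentile at (i - .5)/n."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((i - 0.5) / n), v)
        for i, v in enumerate(ordered, start=1)
    ]


# Hypothetical standardized residuals (not taken from any example in the text).
e_star_demo = [0.42, -1.31, 0.05, 1.78, -0.66]

points_demo = normal_plot_points(e_star_demo)
for z, v in points_demo:
    print(f"z percentile={z:7.3f}  ordered e*={v:6.2f}")
```

Plotting the second coordinate against the first and judging whether the points fall close to a straight line gives the visual check described above.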
Example 13.2 (Example 13.1 continued)
Figure 13.1 presents a scatterplot of the data and the four plots just recommended. The plot of $\hat{y}$ versus y confirms the impression given by $r^2$ that x is effective in predicting y and also indicates that there is no observed y for which the predicted value is terribly far off the mark. Both residual plots show no unusual pattern or discrepant values. There is one standardized residual slightly outside the interval $(-2, 2)$, but this is not surprising in a sample of size 14. The normal probability plot of the standardized residuals is reasonably straight. In summary, the plots leave us with no qualms about either the appropriateness of a simple linear relationship or the fit to the given data.
[Figure 13.1 Plots for the data from Example 13.1. Recoverable panel titles: residuals vs. y, residuals vs. x, normal probability plot (horizontal axis: z percentile).]
Difficulties and Remedies
Although we hope that our analysis will yield plots like those of Figure 13.1, quite frequently the plots will suggest one or more of the following difficulties:
1. A nonlinear probabilistic relationship between x and y is appropriate.
2. The variance of $\epsilon$ (and of Y) is not a constant $\sigma^2$, but instead depends somehow on x.
3. The selected model fits the data well except for a very few discrepant or outlying data values, which may have greatly influenced the choice of the best-fit function.
4. The error variable $\epsilon$ does not have a normal distribution.
5. When the subscript i indicates the time order of the observations, the $\epsilon_i$'s exhibit dependence over time.
6. One or more relevant independent variables have been omitted from the model.

Figure 13.2 presents residual plots corresponding to items 1–3, 5, and 6. In Chapter 4, we discussed patterns in normal probability plots that cast doubt on the assumption of an underlying normal distribution. Notice that the residuals from the data in Figure 13.2(d) with the circled point included would not by themselves necessarily suggest further analysis, yet when a new line is fit with that point deleted, the new line differs considerably from the original line. This type of behavior is more difficult to identify in multiple regression. It is most likely to arise when there is a single (or very few) data point(s) with independent variable value(s) far removed from the remainder of the data.
[Figure 13.2 Plots that indicate abnormality in data: (a) nonlinear relationship; (b) nonconstant variance; (c) discrepant observation; (d) observation with large influence; (e) dependence in errors, plotted against time order of observation; (f) variable omitted, plotted against the omitted independent variable]
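The large-influence behavior described for Figure 13.2(d) can be explored by fitting the line twice, once with all points and once with the suspect point deleted. The sketch below uses invented data in which one observation has an x value far from the rest; the helper name `fit_line` is hypothetical, and the comparison is an informal check rather than a formal influence diagnostic.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Hypothetical data: the last point has an x value far from the others.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y_demo = [1.1, 1.9, 3.2, 3.9, 5.1, 8.0]

b0_full, b1_full = fit_line(x_demo, y_demo)
b0_deleted, b1_deleted = fit_line(x_demo[:-1], y_demo[:-1])  # drop the far point

# Residual of the far point from the full-sample line.
far_point_residual = y_demo[-1] - (b0_full + b1_full * x_demo[-1])

print(f"full sample:    y = {b0_full:.3f} + {b1_full:.3f} x")
print(f"point deleted:  y = {b0_deleted:.3f} + {b1_deleted:.3f} x")
print(f"residual of far point in full fit: {far_point_residual:.3f}")
```

In this made-up case the far point pulls the line toward itself, so its own residual is small even though deleting it changes the slope a great deal. That is why a residual plot alone may not reveal such a point.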
We now indicate briefly what remedies are available for the types of difficulties. For a more comprehensive discussion, one or more of the references on regression analysis should be consulted. If the residual plot looks something like that of Figure 13.2(a), exhibiting a curved pattern, then a nonlinear function of x may be fit.
The residual plot of Figure 13.2(b) suggests that, although a straight-line relationship may be reasonable, the assumption that $V(Y_i) = \sigma^2$ for each i is of doubtful validity. When the assumptions of Chapter 12 are valid, it can be shown that among all unbiased estimators of $\beta_0$ and $\beta_1$, the ordinary least squares estimators have minimum variance. These estimators give equal weight to each $(x_i, Y_i)$. If the variance of Y increases with x, then $Y_i$'s for large $x_i$ should be given less weight than those with small $x_i$. This suggests that $\beta_0$ and $\beta_1$ should be estimated by minimizing

$$f_w(b_0, b_1) = \sum w_i [y_i - (b_0 + b_1 x_i)]^2 \tag{13.4}$$

where the $w_i$'s are weights that decrease with increasing $x_i$. Minimization of Expression (13.4) yields weighted least squares estimates. For example, if the standard deviation of Y is proportional to x (for x > 0), that is, $V(Y) = kx^2$, then it can be shown that the weights $w_i = 1/x_i^2$ yield best estimators of $\beta_0$ and $\beta_1$. Weighted least squares is used quite frequently by econometricians (economists who use statistical methods) to estimate parameters.
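Minimizing Expression (13.4) has a closed form analogous to ordinary least squares, with weighted means in place of ordinary means. The sketch below is an illustration under stated assumptions: the data are invented, the weights $w_i = 1/x_i^2$ correspond to the case $V(Y) = kx^2$, and the function name `weighted_fit` is hypothetical.

```python
def weighted_fit(x, y, w):
    """Return (b0, b1) minimizing sum of w_i * (y_i - (b0 + b1 * x_i))**2."""
    sw = sum(w)
    xbar_w = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    ybar_w = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx_w = sum(wi * (xi - xbar_w) ** 2 for wi, xi in zip(w, x))
    sxy_w = sum(wi * (xi - xbar_w) * (yi - ybar_w) for wi, xi, yi in zip(w, x, y))
    b1 = sxy_w / sxx_w
    return ybar_w - b1 * xbar_w, b1


# Hypothetical data: points on y = 1 + 2x, except the largest x is off by +5.
x_demo = [1.0, 2.0, 3.0, 4.0, 10.0]
y_demo = [3.0, 5.0, 7.0, 9.0, 26.0]

equal_weights = [1.0] * len(x_demo)            # ordinary least squares
inverse_sq_weights = [1 / xi ** 2 for xi in x_demo]  # w_i = 1 / x_i^2

b0_ols, b1_ols = weighted_fit(x_demo, y_demo, equal_weights)
b0_wls, b1_wls = weighted_fit(x_demo, y_demo, inverse_sq_weights)

print(f"equal weights:  y = {b0_ols:.3f} + {b1_ols:.3f} x")
print(f"w = 1/x^2:      y = {b0_wls:.3f} + {b1_wls:.3f} x")
```

With equal weights the function reproduces the ordinary least squares line. With the decreasing weights, the observation at the largest x has less pull on the estimated slope.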
When plots or other evidence suggest that the data set contains outliers or points having large influence on the resulting fit, one possible approach is to omit these outlying points and recompute the estimated regression equation. This would certainly be correct if it were found that the outliers resulted from errors in recording data values or experimental errors. If no assignable cause can be found for the outliers, it is still desirable to report the estimated equation both with and without outliers omitted. Yet another approach is to retain possible outliers but to use an estimation principle that puts relatively less weight on outlying values than does the principle of least squares. One such principle is MAD (minimize absolute deviations), which selects $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize $\sum |y_i - (b_0 + b_1 x_i)|$. Unlike the estimates of least squares, there are no nice formulas for the MAD estimates; their values must be found by using an iterative computational procedure. Such procedures are also used when it is suspected that the $\epsilon_i$'s have a distribution that is not normal but instead has "heavy tails" (making it much more likely than for the normal distribution that discrepant values will enter the sample); robust regression procedures are those that produce reliable estimates for a wide variety of underlying error distributions. Least squares estimators are not robust in the same way that the sample mean $\bar{X}$ is not a robust estimator for $\mu$.
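For a very small data set, the MAD criterion can be illustrated without an iterative procedure. The sketch below swaps in an exhaustive search: it relies on the fact that, when at least two distinct x values are present, some line minimizing the sum of absolute deviations passes through at least two of the data points, so it tries every such line. This is not the iterative method a statistical package would use and is practical only for small n. The data and the function names `sum_abs_dev` and `mad_line` are hypothetical.

```python
from itertools import combinations


def sum_abs_dev(x, y, b0, b1):
    """MAD criterion: sum of |y_i - (b0 + b1 * x_i)|."""
    return sum(abs(yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))


def mad_line(x, y):
    """Search all lines through two data points for the smallest criterion value."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not of the form y = b0 + b1 * x
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        cost = sum_abs_dev(x, y, b0, b1)
        if best is None or cost < best[0]:
            best = (cost, b0, b1)
    return best[1], best[2]


def ols_line(x, y):
    """Ordinary least squares line, for comparison."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


# Hypothetical data: y = 2 + 3x exactly, except for one outlier at x = 6.
x_demo = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_demo = [2.0, 5.0, 8.0, 11.0, 14.0, 17.0, 40.0]

b0_mad, b1_mad = mad_line(x_demo, y_demo)
b0_ols, b1_ols = ols_line(x_demo, y_demo)
print(f"MAD: y = {b0_mad:.3f} + {b1_mad:.3f} x")
print(f"OLS: y = {b0_ols:.3f} + {b1_ols:.3f} x")
```

In this invented example the MAD line follows the six collinear points, whereas the least squares line is pulled noticeably toward the outlier.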
When a plot suggests time dependence in the error terms, an appropriate analysis may involve a transformation of the y’s or else a model explicitly including
a time variable. Lastly, a plot such as that of Figure 13.2(f), which shows a pattern in the residuals when plotted against an omitted variable, suggests that a multiple regression model that includes the previously omitted variable should be considered.
Exercises Section 13.1 (1–14)
1. Suppose the variables x = commuting distance and y = commuting time are related according to the simple linear regression model with $\sigma = 10$.
a. If n = 5 observations are made at the x values $x_1 = 5$, $x_2 = 10$, $x_3 = 15$, $x_4 = 20$, and $x_5 = 25$, calculate the standard deviations of the five corresponding residuals.
b. Repeat part (a) for $x_1 = 5$, $x_2 = 10$, $x_3 = 15$, $x_4 = 20$, and $x_5 = 50$.
c. What do the results of parts (a) and (b) imply about the deviation of the estimated line from the observation made at the largest sampled x value?
2. The x values and standardized residuals for the chlorine flow/etch rate data of Exercise 52 (Section 12.4) are displayed in the accompanying table. Construct a standardized residual plot and comment on its appearance.
[Table only partly recovered. Surviving entries: x: … 3.50 4.00; e*: … .73 −1.36 1.53 .07]

3. Example 12.6 presented the residuals from a simple linear regression of moisture content y on filtration rate x.
a. Plot the residuals against x. Does the resulting plot suggest that a straight-line regression function is a reasonable choice of model? Explain your reasoning.
b. Using s = .665, compute the values of the standardized residuals. Is $e_i^* \approx e_i / s$ for i = 1, …, n, or are the $e_i^*$'s not close to being proportional to the $e_i$'s?
c. Plot the standardized residuals against x. Does the plot differ significantly in general appearance from the plot of part (a)?

4. The accompanying data on y = normalized energy (J/m²) and x = intraocular pressure (mmHg) appeared in a scatterplot in the article "Evaluating the Risk of Eye Injuries: Intraocular Pressure During High Speed Projectile Impacts" (Current Eye Research, 2012: 43–49); an estimated regression function was superimposed on the plot.
[Data table not recovered]
a. Here is Minitab output from fitting the simple linear regression model. Does the model appear to specify a useful relationship between the two variables?
[Minitab output only partly recovered. Surviving entries: S = 3679.36  R-Sq = 90.2  R-Sq(adj) = 89.2]
b. The standardized residuals resulting from fitting the simple linear regression model (in the same order as the observations) are .98, −1.57, 1.47, .50, −.76, −.84, 1.47, −.85, −1.03, −.20, .40, and .81. Construct a plot of $e^*$ versus x and comment. [Note: The model fit in the cited article was not linear.]

5. As the air temperature drops, river water becomes supercooled and ice crystals form. Such ice can significantly affect the hydraulics of a river. The article "Laboratory Study of Anchor Ice Growth" (J. of Cold Regions Engr., 2001: 60–66) described an experiment in which ice thickness (mm) was studied as a function of elapsed time (hr) under specified conditions. The following data was read from a graph in the article: n = 33; x = .17, .33, .50, .67, …, 5.50; y = .50, 1.25, 1.50, 2.75, …
a. The $r^2$ value resulting from a least squares fit is .977. Interpret this value and comment on the appropriateness of assuming an approximate linear relationship.
b. The residuals, listed in the same order as the x values, are
[Residual table not recovered]
Plot the residuals against elapsed time. What does the plot suggest?

6. The accompanying scatterplot is based on data provided by authors of the article "Spurious Correlation in the USEPA Rating Curve Method for Estimating Pollutant Loads" (J. of Envir. Engr., 2008: 610–618); here discharge is in ft³/s as opposed to m³/s used in the article. The point on the far right of the plot corresponds to the observation (140, 1529.35). The resulting standardized residual is 3.10. Minitab flags the observation with an R for large residual and an X for potentially influential observation.
[Scatterplot of Load (Kg N/day) versus Discharge (cfs), with fitted line Load = −13.58 + 9.905 Discharge; S 69.0107, R-Sq 92.5, R-Sq(adj) 92.4]
Here is some information on the estimated slope:
[Table only partly recovered. Surviving entries: Full sample; $s_{\hat{\beta}_1}$; .3806]
Does this observation appear to have had a substantial impact on the estimated slope? Explain.

7. Composite honeycomb sandwich panels are widely used in various aerospace structural applications such as ribs, flaps, and rudders. The article "Core Crush Problem in Manufacturing of Composite Sandwich Structures: Mechanisms and Solutions" (Amer. Inst. of Aeronautics and Astronautics J., 2006: 901–907) fit a line to the following data on x = prepreg thickness (mm) and y = core crush (%):
x  .246  .250  .251  .251  .254  .262  .264  .270
y  [values not recovered]
x  .272  .277  .281  .289  .290  .292  .293
y  [values not recovered]
a. Fit the simple linear regression model. What proportion of the observed variation in core crush can be attributed to the model relationship?
b. Construct a scatterplot. Does the plot suggest that a linear probabilistic relationship is appropriate?
c. Obtain the residuals and standardized residuals, and then construct residual plots. What do these plots suggest? What type of function should provide a better fit to the data than does a straight line?

8. Continuous recording of heart rate can be used to obtain information about the level of exercise intensity or physical strain during sports participation, work, or other daily activities. The article "The Relationship Between Heart Rate and Oxygen Uptake During Non-Steady State Exercise" (Ergonomics, 2000: 1578–1592) reported on a study to investigate using heart rate response (x, as a percentage of the maximum rate) to predict oxygen uptake (y, as a percentage of maximum uptake) during exercise. The accompanying data was read from a graph in the article.
HR   43.5  44.0  44.0  44.5  44.0  45.0  48.0  49.0
VO2  22.0  21.0  22.0  21.5  25.5  24.5  30.0  28.0
[Remaining rows of the table not recovered]
Use a statistical software package to perform a simple linear regression analysis, paying particular attention to the presence of any unusual or influential observations.

9. Consider the following four (x, y) data sets; the first three have the same x values, so these values are listed only once (Frank Anscombe, "Graphs in Statistical Analysis," Amer. Statistician, 1973: 17–21):
[Data table not recovered. Surviving header row: Data Set 1–3, 1, 2, 3, 4, 4]
For each of these four data sets, the values of the summary statistics $\sum x_i$, $\sum x_i^2$, $\sum y_i$, $\sum y_i^2$, and $\sum x_i y_i$ are virtually identical, so all quantities computed from these five will be essentially identical for the four sets: the least squares line ($y = 3 + .5x$), SSE, $s^2$, $r^2$, t intervals, t statistics, and so on. The summary statistics provide no way of distinguishing among the four data sets. Based on a scatterplot and a residual plot for each set, comment on the appropriateness or inappropriateness of fitting a straight-line model; include in your comments any specific suggestions for how a "straight-line analysis" might be modified or qualified.

10. a. Show that $\sum_{i=1}^{n} e_i = 0$ when the $e_i$'s are the residuals from a simple linear regression.
b. Are the residuals from a simple linear regression independent of one another, positively correlated, or negatively correlated? Explain.
c. Show that $\sum_{i=1}^{n} x_i e_i = 0$ for the residuals from a simple linear regression. (This result along with part (a) shows that there are two linear restrictions on the $e_i$'s, resulting in a loss of 2 df when the squared residuals are used to estimate $\sigma^2$.)
d. Is it true that $\sum_{i=1}^{n} e_i^* = 0$? Give a proof or a counterexample.

11. a. Express the ith residual $Y_i - \hat{Y}_i$ (where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$) in the form $\sum c_j Y_j$, a linear function of the $Y_j$'s. Then use rules of variance to verify that $V(Y_i - \hat{Y}_i)$ is given by Expression (13.2).
b. It can be shown that $\hat{Y}_i$ and $Y_i - \hat{Y}_i$ (the ith predicted value and residual) are independent of one another. Use this fact, the relation $Y_i = \hat{Y}_i + (Y_i - \hat{Y}_i)$, and the expression for $V(\hat{Y})$ from Section 12.4 to again verify Expression (13.2).
c. As $x_i$ moves farther away from $\bar{x}$, what happens to $V(\hat{Y}_i)$ and to $V(Y_i - \hat{Y}_i)$?

12. a. Could a linear regression result in residuals 23, −27, 5, 17, −8, 9, and 15? Why or why not?
b. Could a linear regression result in residuals 23, −27, 5, 17, −8, −12, and 2 corresponding to x values 3, −4, 8, 12, −14, −20, and 25? Why or why not? [Hint: See Exercise 10.]

13. Recall that $\hat{\beta}_0 + \hat{\beta}_1 x$ has a normal distribution with expected value $\beta_0 + \beta_1 x$ and variance

$$\sigma^2 \left\{ \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right\}$$

so that

$$Z = \frac{\hat{\beta}_0 + \hat{\beta}_1 x - (\beta_0 + \beta_1 x)}{\sigma \left( \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)^{1/2}}$$

has a standard normal distribution. If $S = \sqrt{SSE/(n - 2)}$ is substituted for $\sigma$, the resulting variable has a t distribution with n − 2 df. By analogy, what is the distribution of any particular standardized residual? If n = 25, what is the probability that a particular standardized residual falls outside the interval (−2.50, 2.50)?

14. If there is at least one x value at which more than one observation has been made, there is a formal test procedure for testing
$H_0$: $\mu_{Y \cdot x} = \beta_0 + \beta_1 x$ for some values $\beta_0$, $\beta_1$ (the true regression function is linear)
versus
$H_a$: $H_0$ is not true (the true regression function is not linear)
Suppose observations are made at $x_1, x_2, \ldots, x_c$. Let $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$ denote the $n_1$ observations when $x = x_1$; …; $Y_{c1}, Y_{c2}, \ldots, Y_{cn_c}$ denote the $n_c$ observations when $x = x_c$. With $n = \sum n_i$ (the total number of observations), SSE has n − 2 df. We break SSE into two pieces, SSPE (pure error) and SSLF (lack of fit), as follows:

$$SSPE = \sum_i \sum_j (Y_{ij} - \bar{Y}_{i\cdot})^2 = \sum_i \sum_j Y_{ij}^2 - \sum_i n_i \bar{Y}_{i\cdot}^2$$

$$SSLF = SSE - SSPE$$

The $n_i$ observations at $x_i$ contribute $n_i - 1$ df to SSPE, so the number of degrees of freedom for SSPE is $\sum_i (n_i - 1) = n - c$, and the degrees of freedom for SSLF is $n - 2 - (n - c) = c - 2$. Let $MSPE = SSPE/(n - c)$ and $MSLF = SSLF/(c - 2)$. Then it can be shown that whereas $E(MSPE) = \sigma^2$ whether or not $H_0$ is true, $E(MSLF) = \sigma^2$ if $H_0$ is true and $E(MSLF) > \sigma^2$ if $H_0$ is false.
The test statistic is $F = MSLF/MSPE$, and the corresponding P-value is the area under the $F_{c-2,\,n-c}$ curve to the right of f.
The following data comes from the article "Changes in Growth Hormone Status Related to Body Weight of Growing Cattle" (Growth, 1977: 241–247), with x = body weight and y = metabolic clearance rate/body weight.
x  110  110  110  230  230  230  360
y  235  198  173  174  149  124  115
x  360  360  360  505  505  505  505
y  [values not recovered]
(So c = 4, $n_1 = n_2 = 3$, $n_3 = n_4 = 4$.)
a. Test $H_0$ versus $H_a$ at level .05 using the lack-of-fit test just described.
b. Does a scatterplot of the data suggest that the relationship between x and y is linear? How does this compare with the result of part (a)? (A nonlinear regression function was used in the article.)
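The arithmetic of the lack-of-fit procedure in Exercise 14 can be sketched as follows. The data here are invented (they are not the cattle data, whose table is incomplete above), the function name `lack_of_fit` is hypothetical, and repeated x values are assumed to be recorded as exactly equal numbers so that they can be grouped. The sketch stops at the F statistic and its degrees of freedom; the P-value would come from an F table or a statistical library, for example `scipy.stats.f.sf(f, c - 2, n - c)` if SciPy is available.

```python
from collections import defaultdict


def lack_of_fit(x, y):
    """Return SSE, SSPE, SSLF, the F statistic, and its (numerator, denominator) df."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    # Group the y values by their x value to measure pure error.
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)

    sspe = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        sspe += sum((v - group_mean) ** 2 for v in ys)

    sslf = sse - sspe
    mspe = sspe / (n - c)
    mslf = sslf / (c - 2)
    return {"SSE": sse, "SSPE": sspe, "SSLF": sslf,
            "F": mslf / mspe, "df": (c - 2, n - c)}


# Hypothetical data with two observations at each of three x values.
x_demo = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y_demo = [1.0, 3.0, 5.0, 7.0, 3.0, 5.0]

result_demo = lack_of_fit(x_demo, y_demo)
print(result_demo)
```

For these made-up numbers the least squares line is y = 2 + x, the group means are 2, 6, and 4, and the decomposition is SSE = 18 = SSPE + SSLF = 6 + 12, giving f = 6 on (1, 3) df.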