Assessing Model Adequacy

13.1 Assessing Model Adequacy

  A plot of the observed pairs $(x_i, y_i)$ is a necessary first step in deciding on the form of a mathematical relationship between x and y. It is possible to fit many functions other than a linear one ($y = b_0 + b_1 x$) to the data, using either the principle of least squares or another fitting method. Once a function of the chosen form has been fitted, it is important to check the fit of the model to see whether it is in fact appropriate. One way to study the fit is to superimpose a graph of the best-fit function on the scatterplot of the data. However, any tilt or curvature of the best-fit function may obscure some aspects of the fit that should be investigated. Furthermore, the scale on the vertical axis may make it difficult to assess the extent to which observed values deviate from the best-fit function.

  Residuals and Standardized Residuals

  A more effective approach to assessment of model adequacy is to compute the fitted or predicted values $\hat{y}_i$ and the residuals $e_i = y_i - \hat{y}_i$, and then plot various functions of these computed quantities. We then examine the plots either to confirm our choice of model or for indications that the model is not appropriate. Suppose the simple linear regression model is correct, and let $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ be the equation of the estimated regression line. Then the ith residual is $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$. To derive properties of the residuals, let $e_i = Y_i - \hat{Y}_i$ represent the ith residual as a random variable (rv) before observations are actually made. Then

  $$E(Y_i - \hat{Y}_i) = E(Y_i) - E(\hat{\beta}_0 + \hat{\beta}_1 x_i) = \beta_0 + \beta_1 x_i - (\beta_0 + \beta_1 x_i) = 0 \tag{13.1}$$

  Because $\hat{Y}_i$ ($= \hat{\beta}_0 + \hat{\beta}_1 x_i$) is a linear function of the $Y_j$'s, so is $Y_i - \hat{Y}_i$ (the coefficients depend on the $x_j$'s). Thus the normality of the $Y_j$'s implies that each residual is normally distributed. It can also be shown that

  $$V(Y_i - \hat{Y}_i) = \sigma^2 \cdot \left[ 1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}} \right] \tag{13.2}$$

  Replacing $\sigma^2$ by $s^2$ and taking the square root of Equation (13.2) gives the estimated standard deviation of a residual.

  Let's now standardize each residual by subtracting the mean value (zero) and then dividing by the estimated standard deviation.

  The standardized residuals are given by

  $$e_i^* = \frac{y_i - \hat{y}_i}{s \sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar{x})^2}{S_{xx}}}} \qquad i = 1, \ldots, n \tag{13.3}$$

  If, for example, a particular standardized residual is 1.5, then the residual itself is 1.5 (estimated) standard deviations larger than what would be expected from fitting the correct model. Notice that the variances of the residuals differ from one another. In fact, because there is a minus sign in front of $(x_i - \bar{x})^2$, the variance of a residual decreases as $x_i$ moves further away from the center of the data $\bar{x}$. Intuitively, this is because the least squares line is pulled toward an observation whose $x_i$ value lies far to the right or left of other observations in the sample. Computation of the $e_i^*$'s can be tedious, but the most widely used statistical computer packages will provide these values and construct various plots involving them.
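  The computation in Expression (13.3) can be sketched in a few lines. The following is a minimal illustration using only the Python standard library; the function name `standardized_residuals` and the data values are hypothetical and are not taken from the text.

```python
import math

def standardized_residuals(x, y):
    """Fit y = b0 + b1*x by least squares and return
    (b0, b1, residuals, standardized residuals) per Expression (13.3)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # s estimates sigma: s^2 = SSE / (n - 2)
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    std_resid = [
        e / (s * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx))
        for xi, e in zip(x, resid)
    ]
    return b0, b1, resid, std_resid

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
b0, b1, resid, std_resid = standardized_residuals(x, y)
for xi, e, es in zip(x, resid, std_resid):
    print(f"x={xi:4.1f}  e={e:8.4f}  e*={es:8.4f}")
```

  Because the estimated standard deviation of a residual is smallest at the extreme x values, the ratio $e_i^*/e_i$ is largest there, which is why the standardized residuals are not a constant multiple of the residuals.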

  Example 13.1

  Exercise 19 in Chapter 12 presented data on x = burner area liberation rate and y = NO$_x$ emissions. Here we reproduce the data and give the fitted values, residuals, and standardized residuals. The estimated regression line is $\hat{y} = -45.55 + 1.71x$, and $r^2 = .961$. The standardized residuals are not a constant multiple of the residuals because the residual variances differ somewhat from one another.

  Diagnostic Plots

  The basic plots that many statisticians recommend for an assessment of model validity and usefulness are the following:

  1. $e^*$ (or e) on the vertical axis versus x on the horizontal axis, that is, a plot of the $(x_i, e_i^*)$ pairs [or the $(x_i, e_i)$ pairs]

  2. $e^*$ (or e) on the vertical axis versus $\hat{y}$ on the horizontal axis, that is, a plot of the $(\hat{y}_i, e_i^*)$ pairs [or the $(\hat{y}_i, e_i)$ pairs]

  3. $\hat{y}$ on the vertical axis versus y on the horizontal axis, that is, a plot of the $(y_i, \hat{y}_i)$ pairs

  4. A normal probability plot of the standardized residuals

  Plots 1 and 2 are called residual plots (against the independent variable and fitted values, respectively). Since $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is a linear function of x, the general pattern of points in Plot 2 should be identical to that in Plot 1 (mirror-reversed when the estimated slope is negative), though the horizontal scales will differ. In multiple regression, there is a Plot 1 for each predictor, and Plot 2 is a single omnibus picture that combines information from all of those. Provided that the chosen model is correct, neither residual plot should exhibit any discernible pattern. The residuals should be randomly distributed about 0 according to a normal distribution, so all or almost all standardized residuals $e^*$ should lie between $-2$ and $+2$.

  We hope that the fitted model will give predicted y values that are close to their observed counterparts. This would manifest itself in Plot 3 by plotted points falling close to a 45° line. Thus this plot provides a visual assessment of model effectiveness in making predictions. Plot 4 allows the analyst to assess the plausibility of assuming that the random deviation $\epsilon$ in the model equation has a normal distribution. If the pattern in the plot departs substantially from linearity, then the inferential procedures from Chapter 12 based on the $t_{n-2}$ distribution should not be used as a basis for drawing conclusions.
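  The coordinates for Plot 4 can be computed without special software. The sketch below pairs the ordered standardized residuals with standard normal percentiles; the plotting positions $(i - .5)/n$ are an assumption (one common convention), and the function names and residual values are hypothetical.

```python
from statistics import NormalDist

def normal_plot_points(std_resid):
    """Return (z percentile, ordered standardized residual) pairs.

    Assumes plotting positions (i - 0.5)/n, one common convention."""
    n = len(std_resid)
    ordered = sorted(std_resid)
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, ordered))

def count_outside(std_resid, bound=2.0):
    """Count standardized residuals falling outside (-bound, bound)."""
    return sum(1 for e in std_resid if abs(e) > bound)

# Hypothetical standardized residuals, for illustration only
e_star = [0.4, -1.2, 2.3, 0.1, -0.6, 0.9, -0.3, 1.1, -1.6, 0.2]
points = normal_plot_points(e_star)
for z, e in points:
    print(f"z={z:7.3f}  e*={e:5.2f}")
print("outside (-2, 2):", count_outside(e_star))
```

  If the plotted pairs fall close to a straight line, the normality assumption is plausible; a count of values outside $(-2, 2)$ that is large relative to n is a signal to look more closely at the residual plots.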

  Example 13.2 (Example 13.1 continued)

  Figure 13.1 presents a scatterplot of the data and the four plots just recommended. The plot of $\hat{y}$ versus y confirms the impression given by $r^2$ that x is effective in predicting y and also indicates that there is no observed y for which the predicted value is terribly far off the mark. Both residual plots show no unusual pattern or discrepant values. There is one standardized residual slightly outside the interval $(-2, 2)$, but this is not surprising in a sample of size 14. The normal probability plot of the standardized residuals is reasonably straight. In summary, the plots leave us with no qualms about either the appropriateness of a simple linear relationship or the fit to the given data.

  Figure 13.1 Plots for the data from Example 13.1 [surviving panel titles: residuals vs. $\hat{y}$; residuals vs. x; normal probability plot (horizontal axis: z percentile)]


  Difficulties and Remedies

  Although we hope that our analysis will yield plots like those of Figure 13.1, quite frequently the plots will suggest one or more of the following difficulties:

  1. A nonlinear probabilistic relationship between x and y is appropriate.

  2. The variance of $\epsilon$ (and of Y) is not a constant $\sigma^2$, but instead depends somehow on x.

  3. The selected model fits the data well except for a very few discrepant or outlying data values, which may have greatly influenced the choice of the best-fit function.

  4. The error variable $\epsilon$ does not have a normal distribution.

  5. When the subscript i indicates the time order of the observations, the $\epsilon_i$'s exhibit dependence over time.

  6. One or more relevant independent variables have been omitted from the model.

  Figure 13.2 presents residual plots corresponding to items 1–3, 5, and 6. In Chapter 4, we discussed patterns in normal probability plots that cast doubt on the assumption of an underlying normal distribution. Notice that the residuals from the data in Figure 13.2(d) with the circled point included would not by themselves necessarily suggest further analysis, yet when a new line is fit with that point deleted, the new line differs considerably from the original line. This type of behavior is more difficult to identify in multiple regression. It is most likely to arise when there is a single (or very few) data point(s) with independent variable value(s) far removed from the remainder of the data.

  Figure 13.2 Plots that indicate abnormality in data: (a) nonlinear relationship; (b) nonconstant variance; (c) discrepant observation; (d) observation with large influence; (e) dependence in errors; (f) variable omitted [horizontal axis of panel (e): time order of observation; horizontal axis of panel (f): omitted independent variable]


  We now indicate briefly what remedies are available for the types of difficulties. For a more comprehensive discussion, one or more of the references on regression analysis should be consulted. If the residual plot looks something like that of Figure 13.2(a), exhibiting a curved pattern, then a nonlinear function of x may be fit.

  The residual plot of Figure 13.2(b) suggests that, although a straight-line relationship may be reasonable, the assumption that $V(Y_i) = \sigma^2$ for each i is of doubtful validity. When the assumptions of Chapter 12 are valid, it can be shown that among all unbiased estimators of $\beta_0$ and $\beta_1$, the ordinary least squares estimators have minimum variance. These estimators give equal weight to each $(x_i, Y_i)$. If the variance of Y increases with x, then $Y_i$'s for large $x_i$ should be given less weight than those with small $x_i$. This suggests that $\beta_0$ and $\beta_1$ should be estimated by minimizing

  $$f_w(b_0, b_1) = \sum w_i [y_i - (b_0 + b_1 x_i)]^2 \tag{13.4}$$

  where the $w_i$'s are weights that decrease with increasing $x_i$. Minimization of Expression (13.4) yields weighted least squares estimates. For example, if the standard deviation of Y is proportional to x (for $x > 0$), that is, $V(Y) = kx^2$, then it can be shown that the weights $w_i = 1/x_i^2$ yield best estimators of $\beta_0$ and $\beta_1$. Weighted least squares is used quite frequently by econometricians (economists who use statistical methods) to estimate parameters.
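  Minimizing Expression (13.4) has a closed-form solution that mirrors the ordinary least squares formulas with weighted means in place of ordinary means. The sketch below is a minimal illustration; the function name `weighted_least_squares` and the data are hypothetical, and the weights $w_i = 1/x_i^2$ correspond to the case $V(Y) = kx^2$ mentioned above.

```python
def weighted_least_squares(x, y, w):
    """Return (b0, b1) minimizing sum of w_i * (y_i - (b0 + b1*x_i))**2."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data whose scatter grows with x, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 4.8, 7.6, 8.1, 12.9, 11.2]
weights = [1 / xi ** 2 for xi in x]  # appropriate when V(Y) = k * x**2

b0_w, b1_w = weighted_least_squares(x, y, weights)
b0_o, b1_o = weighted_least_squares(x, y, [1.0] * len(x))  # equal weights = OLS
print(f"weighted: b0={b0_w:.3f}, b1={b1_w:.3f}")
print(f"ordinary: b0={b0_o:.3f}, b1={b1_o:.3f}")
```

  Setting all weights equal recovers the ordinary least squares estimates, so the same function can be used to compare the two fits.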

  When plots or other evidence suggest that the data set contains outliers or points having large influence on the resulting fit, one possible approach is to omit these outlying points and recompute the estimated regression equation. This would certainly be correct if it were found that the outliers resulted from errors in recording data values or experimental errors. If no assignable cause can be found for the outliers, it is still desirable to report the estimated equation both with and without outliers omitted. Yet another approach is to retain possible outliers but to use an estimation principle that puts relatively less weight on outlying values than does the principle of least squares.

  One such principle is MAD (minimize absolute deviations), which selects $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize $\sum |y_i - (b_0 + b_1 x_i)|$. Unlike the estimates of least squares, there are no nice formulas for the MAD estimates; their values must be found by using an iterative computational procedure. Such procedures are also used when it is suspected that the $\epsilon_i$'s have a distribution that is not normal but instead has "heavy tails" (making it much more likely than for the normal distribution that discrepant values will enter the sample); robust regression procedures are those that produce reliable estimates for a wide variety of underlying error distributions. Least squares estimators are not robust in the same way that the sample mean $\bar{X}$ is not a robust estimator for $\mu$.
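  One iterative procedure of the kind mentioned above is iteratively reweighted least squares, which repeatedly solves a weighted least squares problem with weights $1/|e_i|$ taken from the previous fit. The text does not specify a particular algorithm, so this choice is an assumption made for illustration; the function names, the iteration count, the small floor `eps`, and the data are all hypothetical.

```python
def _wls(x, y, w):
    """Closed-form weighted least squares fit of y = b0 + b1*x."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def mad_fit(x, y, iterations=50, eps=1e-8):
    """Approximate the MAD (least absolute deviations) line by
    iteratively reweighted least squares, starting from the OLS fit."""
    b0, b1 = _wls(x, y, [1.0] * len(x))
    for _ in range(iterations):
        # Weight 1/|residual| turns squared error into absolute error;
        # eps guards against division by zero for exactly fitted points.
        w = [1.0 / max(abs(yi - (b0 + b1 * xi)), eps) for xi, yi in zip(x, y)]
        b0, b1 = _wls(x, y, w)
    return b0, b1

# Hypothetical data: exact line y = 2 + 3x with one gross outlier at x = 5
x = [float(v) for v in range(1, 11)]
y = [2 + 3 * v for v in x]
y[4] += 50.0

ols_b0, ols_b1 = _wls(x, y, [1.0] * len(x))
mad_b0, mad_b1 = mad_fit(x, y)
print(f"least squares: b0={ols_b0:.3f}, b1={ols_b1:.3f}")
print(f"MAD (IRLS):    b0={mad_b0:.3f}, b1={mad_b1:.3f}")
```

  On this constructed data the least squares line is pulled toward the outlier, while the MAD fit stays close to the line through the other nine points.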

  When a plot suggests time dependence in the error terms, an appropriate analysis may involve a transformation of the y's or else a model explicitly including a time variable. Lastly, a plot such as that of Figure 13.2(f), which shows a pattern in the residuals when plotted against an omitted variable, suggests that a multiple regression model that includes the previously omitted variable should be considered.

  Exercises Section 13.1 (1–14)

  1. Suppose the variables x = commuting distance and y = commuting time are related according to the simple linear regression model with $\sigma = 10$.

     a. If n = 5 observations are made at the x values $x_1 = 5$, $x_2 = 10$, $x_3 = 15$, $x_4 = 20$, and $x_5 = 25$, calculate the standard deviations of the five corresponding residuals.

     b. Repeat part (a) for $x_1 = 5$, $x_2 = 10$, $x_3 = 15$, $x_4 = 20$, and $x_5 = 50$.

     c. What do the results of parts (a) and (b) imply about the deviation of the estimated line from the observation made at the largest sampled x value?


  2. The x values and standardized residuals for the chlorine flow/etch rate data of Exercise 52 (Section 12.4) are displayed in the accompanying table. Construct a standardized residual plot and comment on its appearance.

     [Only part of the accompanying table survives: x values 3.50, 4.00 and standardized residuals .73, −1.36, 1.53, .07.]

  3. Example 12.6 presented the residuals from a simple linear regression of moisture content y on filtration rate x.

     a. Plot the residuals against x. Does the resulting plot suggest that a straight-line regression function is a reasonable choice of model? Explain your reasoning.

     b. Using s = .665, compute the values of the standardized residuals. Is $e_i^* \approx e_i/s$ for i = 1, …, n, or are the $e_i^*$'s not close to being proportional to the $e_i$'s?

     c. Plot the standardized residuals against x. Does the plot differ significantly in general appearance from the plot of part (a)?

  4. The accompanying data on y = normalized energy (J/m$^2$) and x = intraocular pressure (mmHg) appeared in a scatterplot in the article "Evaluating the Risk of Eye Injuries: Intraocular Pressure During High Speed Projectile Impacts" (Current Eye Research, 2012: 43–49); an estimated regression function was superimposed on the plot.

     a. Here is Minitab output from fitting the simple linear regression model. Does the model appear to specify a useful relationship between the two variables?

        S = 3679.36   R-Sq = 90.2%   R-Sq(adj) = 89.2%

     b. The standardized residuals resulting from fitting the simple linear regression model (in the same order as the observations) are .98, −1.57, 1.47, .50, −.76, −.84, 1.47, −.85, −1.03, −.20, .40, and .81. Construct a plot of $e^*$ versus x and comment. [Note: The model fit in the cited article was not linear.]

  5. As the air temperature drops, river water becomes supercooled and ice crystals form. Such ice can significantly affect the hydraulics of a river. The article "Laboratory Study of Anchor Ice Growth" (J. of Cold Regions Engr., 2001: 60–66) described an experiment in which ice thickness (mm) was studied as a function of elapsed time (hr) under specified conditions. The following data was read from a graph in the article: n = 33; x = .17, .33, .50, .67, …, 5.50; y = .50, 1.25, 1.50, 2.75,

     a. The $r^2$ value resulting from a least squares fit is .977. Interpret this value and comment on the appropriateness of assuming an approximate linear relationship.

     b. The residuals, listed in the same order as the x values, are

        Plot the residuals against elapsed time. What does the plot suggest?

  6. The accompanying scatterplot is based on data provided by authors of the article "Spurious Correlation in the USEPA Rating Curve Method for Estimating Pollutant Loads" (J. of Envir. Engr., 2008: 610–618); here discharge is in ft$^3$/s as opposed to m$^3$/s used in the article. The point on the far right of the plot corresponds to the observation (140, 1529.35). The resulting standardized residual is 3.10. Minitab flags the observation with an R for large residual and an X for potentially influential observation. Here is some information on the estimated slope:

     [Only fragments of the table survive: a column labeled "Full sample" and the entry $s_{\hat{\beta}_1}$ = .3806.]

     Does this observation appear to have had a substantial impact on the estimated slope? Explain.

     [Scatterplot: Load (Kg N/day) versus Discharge (cfs), with fitted line Load = −13.58 + 9.905 Discharge; S = 69.0107, R-Sq = 92.5%, R-Sq(adj) = 92.4%]

  7. Composite honeycomb sandwich panels are widely used in various aerospace structural applications such as ribs, flaps, and rudders. The article "Core Crush Problem in Manufacturing of Composite Sandwich Structures: Mechanisms and Solutions" (Amer. Inst. of Aeronautics and Astronautics J., 2006: 901–907) fit


  a line to the following data on x = prepreg thickness (mm) and y = core crush (%):

     x   .246  .250  .251  .251  .254  .262  .264  .270
     y

     x   .272  .277  .281  .289  .290  .292  .293
     y

     a. Fit the simple linear regression model. What proportion of the observed variation in core crush can be attributed to the model relationship?

     b. Construct a scatterplot. Does the plot suggest that a linear probabilistic relationship is appropriate?

     c. Obtain the residuals and standardized residuals, and then construct residual plots. What do these plots suggest? What type of function should provide a better fit to the data than does a straight line?

  8. Continuous recording of heart rate can be used to obtain information about the level of exercise intensity or physical strain during sports participation, work, or other daily activities. The article "The Relationship Between Heart Rate and Oxygen Uptake During Non-Steady State Exercise" (Ergonomics, 2000: 1578–1592) reported on a study to investigate using heart rate response (x, as a percentage of the maximum rate) to predict oxygen uptake (y, as a percentage of maximum uptake) during exercise. The accompanying data was read from a graph in the article.

     HR     43.5  44.0  44.0  44.5  44.0  45.0  48.0  49.0
     VO$_2$   22.0  21.0  22.0  21.5  25.5  24.5  30.0  28.0

     Use a statistical software package to perform a simple linear regression analysis, paying particular attention to the presence of any unusual or influential observations.

  9. Consider the following four (x, y) data sets; the first three have the same x values, so these values are listed only once (Frank Anscombe, "Graphs in Statistical Analysis," Amer. Statistician, 1973: 17–21):

     Data Set   1–3   1   2   3   4   4

     For each of these four data sets, the values of the summary statistics $\sum x_i$, $\sum x_i^2$, $\sum y_i$, $\sum y_i^2$, and $\sum x_i y_i$ are virtually identical, so all quantities computed from these five will be essentially identical for the four sets: the least squares line ($y = 3 + .5x$), SSE, $s^2$, $r^2$, t intervals, t statistics, and so on. The summary statistics provide no way of distinguishing among the four data sets. Based on a scatterplot and a residual plot for each set, comment on the appropriateness or inappropriateness of fitting a straight-line model; include in your comments any specific suggestions for how a "straight-line analysis" might be modified or qualified.

  10. a. Show that $\sum_{i=1}^{n} e_i = 0$ when the $e_i$'s are the residuals from a simple linear regression.

      b. Are the residuals from a simple linear regression independent of one another, positively correlated, or negatively correlated? Explain.

      c. Show that $\sum_{i=1}^{n} x_i e_i = 0$ for the residuals from a simple linear regression. (This result along with part (a) shows that there are two linear restrictions on the $e_i$'s, resulting in a loss of 2 df when the squared residuals are used to estimate $\sigma^2$.)

      d. Is it true that $\sum_{i=1}^{n} e_i^* = 0$? Give a proof or a counterexample.

  11. a. Express the ith residual $Y_i - \hat{Y}_i$ (where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$) in the form $\sum c_j Y_j$, a linear function of the $Y_j$'s. Then use rules of variance to verify that $V(Y_i - \hat{Y}_i)$ is given by Expression (13.2).

      b. It can be shown that $\hat{Y}_i$ and $Y_i - \hat{Y}_i$ (the ith predicted value and residual) are independent of one another. Use this fact, the relation $Y_i = \hat{Y}_i + (Y_i - \hat{Y}_i)$, and the expression for $V(\hat{Y})$ from Section 12.4 to again verify Expression (13.2).

      c. As $x_i$ moves farther away from $\bar{x}$, what happens to $V(\hat{Y}_i)$ and to $V(Y_i - \hat{Y}_i)$?

  12. a. Could a linear regression result in residuals 23, −27, 5, 17, −8, 9, and 15? Why or why not?

      b. Could a linear regression result in residuals 23, −27, 5, 17, −8, −12, and 2 corresponding to x values 3, −4, 8, 12, −14, −20, and 25? Why or why not? [Hint: See Exercise 10.]

  13. Recall that $\hat{\beta}_0 + \hat{\beta}_1 x$ has a normal distribution with expected value $\beta_0 + \beta_1 x$ and variance

      $$\sigma^2 \left\{ \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right\}$$

      so that

      $$Z = \frac{\hat{\beta}_0 + \hat{\beta}_1 x - (\beta_0 + \beta_1 x)}{\sigma \left( \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)^{1/2}}$$

      has a standard normal distribution. If $S = \sqrt{SSE/(n - 2)}$ is substituted for $\sigma$, the resulting variable has a t distribution with n − 2 df. By analogy, what is the distribution of any particular standardized residual? If n = 25, what is


  the probability that a particular standardized residual falls outside the interval (−2.50, 2.50)?

  14. If there is at least one x value at which more than one observation has been made, there is a formal test procedure for testing

      $H_0$: $\mu_{Y \cdot x} = \beta_0 + \beta_1 x$ for some values $\beta_0$, $\beta_1$ (the true regression function is linear)

      versus

      $H_a$: $H_0$ is not true (the true regression function is not linear)

      Suppose observations are made at $x_1, x_2, \ldots, x_c$. Let $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$ denote the $n_1$ observations when $x = x_1$; …; $Y_{c1}, Y_{c2}, \ldots, Y_{cn_c}$ denote the $n_c$ observations when $x = x_c$. With $n = \sum n_i$ (the total number of observations), SSE has n − 2 df. We break SSE into two pieces, SSPE (pure error) and SSLF (lack of fit), as follows:

      $$SSPE = \sum_i \sum_j (Y_{ij} - \bar{Y}_{i\cdot})^2 = \sum_i \sum_j Y_{ij}^2 - \sum_i n_i \bar{Y}_{i\cdot}^2$$

      $$SSLF = SSE - SSPE$$

      The $n_i$ observations at $x_i$ contribute $n_i - 1$ df to SSPE, so the number of degrees of freedom for SSPE is $\sum_i (n_i - 1) = n - c$, and the degrees of freedom for SSLF is $n - 2 - (n - c) = c - 2$. Let $MSPE = SSPE/(n - c)$ and $MSLF = SSLF/(c - 2)$. Then it can be shown that whereas $E(MSPE) = \sigma^2$ whether or not $H_0$ is true, $E(MSLF) = \sigma^2$ if $H_0$ is true and $E(MSLF) > \sigma^2$ if $H_0$ is false.

      The test statistic is $F = MSLF/MSPE$, and the corresponding P-value is the area under the $F_{c-2,\,n-c}$ curve to the right of f.

      The following data comes from the article "Changes in Growth Hormone Status Related to Body Weight of Growing Cattle" (Growth, 1977: 241–247), with x = body weight and y = metabolic clearance rate/body weight.

      x   110  110  110  230  230  230  360
      y   235  198  173  174  149  124  115

      x   360  360  360  505  505  505  505
      y

      (So c = 4, $n_1 = n_2 = 3$, $n_3 = n_4 = 4$.)

      a. Test $H_0$ versus $H_a$ at level .05 using the lack-of-fit test just described.

      b. Does a scatterplot of the data suggest that the relationship between x and y is linear? How does this compare with the result of part (a)? (A nonlinear regression function was used in the article.)
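  The arithmetic of the lack-of-fit procedure described in Exercise 14 can be sketched as follows. This is a minimal illustration on hypothetical data (not the data of the exercise); the function name `lack_of_fit` is an assumption, and the P-value is omitted because the standard library has no F distribution.

```python
from collections import defaultdict

def lack_of_fit(x, y):
    """Return (f, df_num, df_den) for the lack-of-fit test:
    f = MSLF / MSPE with c - 2 and n - c degrees of freedom."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    # Group the y values by distinct x value to obtain pure error
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)
    sspe = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        sspe += sum((v - mean) ** 2 for v in ys)

    sslf = sse - sspe
    mspe = sspe / (n - c)
    mslf = sslf / (c - 2)
    return mslf / mspe, c - 2, n - c

# Hypothetical replicated data following a curve, for illustration only
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1.1, 0.9, 4.1, 3.9, 9.1, 8.9, 16.1, 15.9]
f, df_num, df_den = lack_of_fit(x, y)
print(f"f = {f:.2f} on ({df_num}, {df_den}) df")
```

  A large value of f relative to the $F_{c-2,\,n-c}$ distribution indicates that the variation of the group means about the fitted line is too large to be explained by pure error alone.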
