Test for Linearity of Regression: Data with Repeated Observations

11.9 Test for Linearity of Regression: Data with Repeated Observations

In certain kinds of experimental situations, the researcher has the capability of obtaining repeated observations on the response for each value of x. Although it is not necessary to have these repetitions in order to estimate $\beta_0$ and $\beta_1$, repetitions nevertheless enable the experimenter to obtain quantitative information concerning the appropriateness of the model. In fact, if repeated observations are generated, the experimenter can make a significance test to aid in determining whether or not the model is adequate.


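The regression equation is
COD = 3.83 + 0.904 Per_Red

Predictor Constant: T = 2.17, P = 0.038 (Coef and SE Coef not recovered)
Predictor Per_Red: Coef = 0.90364, SE Coef = 0.05012, T = 18.03, P = 0.000

S = 3.22954   R-Sq = 91.3%   R-Sq(adj) = 91.0%

Analysis of Variance: Residual Error, 31 degrees of freedom
[Remaining analysis-of-variance entries and the observation-level columns (Obs, Per_Red, COD, Fit, SE Fit, Residual, St Resid) were not recovered.]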

  Figure 11.14: MINITAB printout of simple linear regression for chemical oxygen demand reduction data; part I.

Let us select a random sample of n observations using k distinct values of x, say $x_1, x_2, \ldots, x_k$, such that the sample contains $n_1$ observed values of the random variable $Y_1$ corresponding to $x_1$, $n_2$ observed values of $Y_2$ corresponding to $x_2$, ..., $n_k$ observed values of $Y_k$ corresponding to $x_k$. Of necessity, $n = \sum_{i=1}^{k} n_i$.

  Chapter 11 Simple Linear Regression and Correlation

  Figure 11.15: MINITAB printout of simple linear regression for chemical oxygen demand reduction data; part II.

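We define

$$y_{ij} = \text{the } j\text{th value of the random variable } Y_i,$$

$$T_{i.} = \sum_{j=1}^{n_i} y_{ij}, \qquad \bar{y}_{i.} = \frac{T_{i.}}{n_i}.$$

Hence, if $n_4 = 3$ measurements of Y were made corresponding to $x = x_4$, we would indicate these observations by $y_{41}$, $y_{42}$, and $y_{43}$. Then $T_{4.} = y_{41} + y_{42} + y_{43}$.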
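The grouping notation can be made concrete with a short Python sketch. The data values and the helper name `group_by_x` are hypothetical, chosen only for illustration. The sketch groups repeated observations by their x value and reports the group size, the group total (the sum of the responses at that x), and the group mean.

```python
from collections import defaultdict

def group_by_x(xs, ys):
    """Group responses by distinct x value: {x_i: [y_i1, ..., y_in_i]}."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return dict(sorted(groups.items()))

# Hypothetical data (not from the text): k = 3 distinct x values, n = 7.
xs = [1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0]
ys = [2.1, 1.9, 4.2, 3.8, 4.0, 6.1, 5.9]

groups = group_by_x(xs, ys)
n_i = {x: len(v) for x, v in groups.items()}       # group sizes n_i
T_i = {x: sum(v) for x, v in groups.items()}       # group totals
ybar_i = {x: T_i[x] / n_i[x] for x in groups}      # group means
k = len(groups)                                    # number of distinct x values
n = sum(n_i.values())                              # n is the sum of the n_i

for x in groups:
    print(f"x = {x}: n_i = {n_i[x]}, total = {T_i[x]:.2f}, mean = {ybar_i[x]:.3f}")
```

Keeping the groups in a dictionary keyed by x makes the later sums over i and j a pair of nested loops over `groups.values()`.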

  Concept of Lack of Fit

The error sum of squares consists of two parts: the amount due to the variation between the values of Y within given values of x and a component that is normally called the lack-of-fit contribution. The first component reflects mere random variation, or pure experimental error, while the second component is a measure of the systematic variation brought about by higher-order terms. In our case, these are terms in x other than the linear, or first-order, contribution. Note that in choosing a linear model we are essentially assuming that this second component does not exist and hence our error sum of squares is completely due to random errors. If this should be the case, then $s^2 = SSE/(n-2)$ is an unbiased estimate of $\sigma^2$. However, if the model does not adequately fit the data, then the error sum of squares is inflated and produces a biased estimate of $\sigma^2$. Whether or not the model fits the data, an unbiased estimate of $\sigma^2$ can always be obtained when we have repeated observations simply by computing

$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n_i - 1}, \qquad i = 1, 2, \ldots, k,$$

for each of the k distinct values of x, where $\bar{y}_{i.}$ is the mean of the $n_i$ observations at $x_i$, and then pooling these variances to get

$$s^2 = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{n - k} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n - k}.$$

The numerator of $s^2$ is a measure of the pure experimental error. A computational procedure for separating the error sum of squares into the two components representing pure error and lack of fit is as follows:

Computation of Lack-of-Fit Sum of Squares

1. Compute the pure error sum of squares, the sum of squared deviations of the observations from their group means,

$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2.$$

This sum of squares has n − k degrees of freedom associated with it, and the resulting mean square is our unbiased estimate $s^2$ of $\sigma^2$.

2. Subtract the pure error sum of squares from the error sum of squares SSE, thereby obtaining the sum of squares due to lack of fit. The degrees of freedom for lack of fit are obtained by simply subtracting (n − 2) − (n − k) = k − 2.
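The two-step procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the text's own implementation: the function names `fit_line` and `lack_of_fit_partition` and the data values are hypothetical, and the straight line is fitted by ordinary least squares.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def lack_of_fit_partition(xs, ys):
    """Split SSE into pure error and lack of fit for a straight-line fit."""
    b0, b1 = fit_line(xs, ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n, k = len(xs), len(groups)

    # Step 1: pure error SS, squared deviations from each group mean.
    sse_pure = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        sse_pure += sum((y - mean) ** 2 for y in values)

    # Step 2: lack-of-fit SS by subtraction, with k - 2 degrees of freedom.
    return {
        "SSE": sse,
        "SSE_pure": sse_pure,
        "SS_lof": sse - sse_pure,
        "df_pure": n - k,
        "df_lof": k - 2,
        "s2": sse_pure / (n - k),   # pooled estimate of sigma^2
    }

# Hypothetical data (not from the text): two replicates at each of four x values.
xs = [1, 1, 2, 2, 3, 3, 4, 4]
ys = [2.0, 2.4, 3.9, 4.3, 6.2, 5.8, 7.5, 7.9]
result = lack_of_fit_partition(xs, ys)
for name, value in result.items():
    print(f"{name}: {value}")
```

The sketch needs at least one x value with repeated observations (so that n − k > 0) and at least three distinct x values (so that k − 2 > 0); it does not guard against either case.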

The computations required for testing hypotheses in a regression problem with repeated measurements on the response may be summarized as shown in Table 11.3.

Figures 11.16 and 11.17 display the sample points for the “correct model” and “incorrect model” situations. In Figure 11.16, where the $\mu_{Y|x}$ fall on a straight line, there is no lack of fit when a linear model is assumed, so the sample variation around the regression line is pure error resulting from the variation that occurs among repeated observations. In Figure 11.17, where the $\mu_{Y|x}$ clearly do not fall on a straight line, the lack of fit from erroneously choosing a linear model accounts for a large portion of the variation around the regression line, supplementing the pure error.
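The contrast between the two situations can be illustrated numerically. The following Python sketch uses hypothetical, deterministic data: three replicates at each of five x values, with identical within-group deviations, once with group means on a straight line and once with group means on a parabola. The helper name `lof_components` is an assumption made for this illustration.

```python
def lof_components(xs, ys):
    """Return (pure error SS, lack-of-fit SS) for a straight-line fit."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    pure = sum(sum((y - sum(g) / len(g)) ** 2 for y in g)
               for g in groups.values())
    return pure, sse - pure

# Hypothetical setup: the same within-group scatter in both cases,
# so the pure error is identical and only the lack of fit differs.
levels = [1, 2, 3, 4, 5]
deviations = [-0.3, 0.0, 0.3]
xs = [x for x in levels for _ in deviations]

ys_linear = [2.0 + 1.5 * x + d for x in levels for d in deviations]       # means on a line
ys_curved = [2.0 + 0.5 * x ** 2 + d for x in levels for d in deviations]  # means on a parabola

pure_lin, lof_lin = lof_components(xs, ys_linear)
pure_cur, lof_cur = lof_components(xs, ys_curved)
print(f"means on a line:     pure error = {pure_lin:.3f}, lack of fit = {abs(lof_lin):.3f}")
print(f"means on a parabola: pure error = {pure_cur:.3f}, lack of fit = {lof_cur:.3f}")
```

In the first case the lack-of-fit sum of squares is zero up to rounding, while in the second it dominates the error sum of squares even though the scatter among repeated observations is unchanged.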


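Table 11.3: Analysis of Variance for Testing Linearity of Regression

Source of       Sum of            Degrees of   Mean
Variation       Squares           Freedom      Square                        Computed f
Regression      SSR               1            SSR                           SSR / s^2
Error           SSE               n − 2
  Lack of fit   SSE − SSE(pure)   k − 2        [SSE − SSE(pure)] / (k − 2)   [SSE − SSE(pure)] / [s^2 (k − 2)]
  Pure error    SSE(pure)         n − k        s^2 = SSE(pure) / (n − k)
Total           SST               n − 1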
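As a sketch of how the rows of such an analysis-of-variance table fit together, the following Python function assembles them from summary quantities. The function name `linearity_anova`, the dictionary layout, and the input numbers are hypothetical. The regression row uses SSR divided by the pure-error mean square as its computed f, which is an assumption about the table layout made for this illustration.

```python
def linearity_anova(ssr, sse, sse_pure, n, k):
    """Rows of an ANOVA table separating lack of fit from pure error."""
    s2 = sse_pure / (n - k)          # pure-error mean square
    ss_lof = sse - sse_pure          # lack-of-fit sum of squares
    ms_lof = ss_lof / (k - 2)        # lack-of-fit mean square
    return [
        {"source": "Regression", "ss": ssr, "df": 1, "ms": ssr, "f": ssr / s2},
        {"source": "Error", "ss": sse, "df": n - 2, "ms": None, "f": None},
        {"source": "  Lack of fit", "ss": ss_lof, "df": k - 2, "ms": ms_lof, "f": ms_lof / s2},
        {"source": "  Pure error", "ss": sse_pure, "df": n - k, "ms": s2, "f": None},
        {"source": "Total", "ss": ssr + sse, "df": n - 1, "ms": None, "f": None},
    ]

def fmt(value):
    """Format a number to four decimals; blank for empty cells."""
    return "" if value is None else f"{value:.4f}"

# Hypothetical summary values (not from the text): n = 12 observations, k = 4 x values.
rows = linearity_anova(ssr=50.0, sse=4.0, sse_pure=3.0, n=12, k=4)
print(f"{'Source':<14}{'SS':>12}{'df':>6}{'MS':>12}{'f':>12}")
for row in rows:
    print(f"{row['source']:<14}{fmt(row['ss']):>12}{row['df']:>6}"
          f"{fmt(row['ms']):>12}{fmt(row['f']):>12}")
```

The degrees of freedom for regression, lack of fit, and pure error add up to n − 1, which gives a quick consistency check on any table built this way.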

Figure 11.16: Correct linear model with no lack-of-fit component.

Figure 11.17: Incorrect linear model with lack-of-fit component.

  What Is the Importance in Detecting Lack of Fit?

The concept of lack of fit is extremely important in applications of regression analysis. In fact, the need to construct or design an experiment that will account for lack of fit becomes more critical as the problem and the underlying mechanism involved become more complicated. Surely, one cannot always be certain that his or her postulated structure, in this case the linear regression model, is correct or even an adequate representation. The following example shows how the error sum of squares is partitioned into the two components representing pure error and lack of fit. The adequacy of the model is tested at the α-level of significance by comparing the lack-of-fit mean square divided by $s^2$ with $f_\alpha(k-2, n-k)$; the hypothesis that the linear model is adequate is rejected when this ratio exceeds the critical value.
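The test can be sketched in Python without external libraries. Instead of looking up the critical value in a table, this sketch computes the upper-tail probability of the F distribution through the regularized incomplete beta function (evaluated with a standard continued fraction) and rejects when that probability falls below α, which is equivalent to the computed ratio exceeding the critical value. All function names and the input numbers are hypothetical.

```python
from math import exp, lgamma, log

def _beta_cf(a, b, x, max_iter=500, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_upper_tail(f, d1, d2):
    """P(F > f) for an F distribution with (d1, d2) degrees of freedom."""
    if f <= 0.0:
        return 1.0
    return reg_inc_beta(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))

def lack_of_fit_test(ss_lof, sse_pure, n, k, alpha=0.05):
    """Return (f ratio, P-value, reject?) for the lack-of-fit test."""
    s2 = sse_pure / (n - k)
    f = (ss_lof / (k - 2)) / s2
    p = f_upper_tail(f, k - 2, n - k)
    return f, p, p < alpha

# Hypothetical sums of squares (not from the text): n = 12, k = 4.
f, p, reject = lack_of_fit_test(ss_lof=1.0, sse_pure=3.0, n=12, k=4)
print(f"f = {f:.4f}, P-value = {p:.4f}, reject linearity: {reject}")
```

With 2 numerator degrees of freedom the tail probability has the closed form $(1 + 2f/d_2)^{-d_2/2}$, which offers a hand check on the numerical routine.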

Example 11.8: Observations of the yield of a chemical reaction taken at various temperatures were recorded in Table 11.4. Estimate the linear model $\mu_{Y|x} = \beta_0 + \beta_1 x$ and test for lack of fit.

Solution: Results of the computations are shown in Table 11.5.

  Conclusion: The partitioning of the total variation in this manner reveals a significant variation accounted for by the linear model and an insignificant amount of variation due to lack of fit. Thus, the experimental data do not seem to suggest the need to consider terms higher than first order in the model, and the null hypothesis is not rejected.


  Table 11.4: Data for Example 11.8

  Table 11.5: Analysis of Variance on Yield-Temperature Data

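[Table entries not recovered. Surviving column headings: Source of Variation, Sum of Squares, Degrees of Freedom, Computed f, P-Values. Surviving row labels: Lack of fit, Pure error.]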

  Annotated Computer Printout for Test for Lack of Fit

Figure 11.18 is an annotated computer printout showing analysis of the data of Example 11.8 with SAS. Note the “LOF” with 2 degrees of freedom, representing the quadratic and cubic contribution to the model, and the P-value of 0.22, suggesting that the linear (first-order) model is adequate.

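[SAS printout, numeric entries not recovered. Dependent Variable: yield. Analysis-of-variance table with columns Source, DF, Sum of Squares, Mean Square, F Value, ending in Corrected Total. Summary statistics: R-Square, Coeff Var, Root MSE, yield Mean. Second table with columns DF, Type I SS, Mean Square, F Value.]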

  Figure 11.18: SAS printout, showing analysis of data of Example 11.8.

  Exercises

11.31 Test for linearity of regression in Exercise 11.3 on page 398. Use a 0.05 level of significance. Comment.

11.32 Test for linearity of regression in Exercise 11.8 on page 399. Comment.

11.33 Suppose we have a linear equation through the origin (Exercise 11.28) $\mu_{Y|x} = \beta x$.
(a) Estimate the regression line passing through the origin for the following data:

x   [values not recovered]
y   [values not recovered]

(b) Suppose it is not known whether the true regression should pass through the origin. Estimate the linear model $\mu_{Y|x} = \beta_0 + \beta_1 x$ and test the hypothesis that $\beta_0 = 0$, at the 0.10 level of significance, against the alternative that $\beta_0 \neq 0$.

11.34 Use an analysis-of-variance approach to test the hypothesis that $\beta_1 = 0$ against the alternative hypothesis $\beta_1 \neq 0$ in Exercise 11.5 on page 398 at the 0.05 level of significance.

11.35 The following data are a result of an investigation as to the effect of reaction temperature x on percent conversion of a chemical process y. (See Myers, Montgomery and Anderson-Cook, 2009.) Fit a simple linear regression, and use a lack-of-fit test to determine if the model is adequate. Discuss.

Observation   Temperature (°C), x   Conversion (%), y
[Data table not fully recovered. Surviving conversion values, in order of appearance: 43, 78, 69, 73, 78, 65.]

11.36 Transistor gain between emitter and collector in an integrated circuit device (hFE) is related to two variables (Myers, Montgomery and Anderson-Cook, 2009) that can be controlled at the deposition process, emitter drive-in time ($x_1$, in minutes) and emitter dose ($x_2$, in ions × 10^14). Fourteen samples were observed following deposition, and the resulting data are shown in the table below. We will consider linear regression models using gain as the response and emitter drive-in time or emitter dose as the regressor variable.

Obs.   x1 (drive-in time, min)   x2 (dose, ions × 10^14)   y (gain, or hFE)
[Data table not recovered.]

(a) Determine if emitter drive-in time influences gain in a linear relationship. That is, test $H_0\colon \beta_1 = 0$, where $\beta_1$ is the slope of the regressor variable.
(b) Do a lack-of-fit test to determine if the linear relationship is adequate. Draw conclusions.
(c) Determine if emitter dose influences gain in a linear relationship. Which regressor variable is the better predictor of gain?

11.37 Organophosphate (OP) compounds are used as pesticides. However, it is important to study their effect on species that are exposed to them. In the laboratory study Some Effects of Organophosphate Pesticides on Wildlife Species, by the Department of Fisheries and Wildlife at Virginia Tech, an experiment was conducted in which different dosages of a particular OP pesticide were administered to 5 groups of 5 mice (peromysius leucopus). The 25 mice were females of similar age and condition. One group received no chemical. The basic response y was a measure of activity in the brain. It was postulated that brain activity would decrease with an increase in OP dosage. The data are as follows:

Animal   Dose, x (mg/kg body weight)   Activity, y
[Data table not fully recovered. One surviving row appears to read: animal 11, dose 4.6, activity 10.6.]

(a) Using the model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, 2, \ldots, 25$, find the least squares estimates of $\beta_0$ and $\beta_1$.
(b) Construct an analysis-of-variance table in which the lack of fit and pure error have been separated. Determine if the lack of fit is significant at the 0.05 level. Interpret the results.

11.38 Heat treating is often used to carburize metal parts such as gears. The thickness of the carburized layer is considered an important feature of the gear, and it contributes to the overall reliability of the part. Because of the critical nature of this feature, a lab test is performed on each furnace load. The test is a destructive one, where an actual part is cross sectioned and soaked in a chemical for a period of time. This test involves running a carbon analysis on the surface of both the gear pitch (top of the gear tooth) and the gear root (between the gear teeth). The data below are the results of the pitch carbon-analysis test for 19 parts.

Soak Time   Pitch   Soak Time   Pitch
[Data table not recovered.]

(a) Fit a simple linear regression relating the pitch carbon analysis y against soak time. Test $H_0\colon \beta_1 = 0$.
(b) If the hypothesis in part (a) is rejected, determine if the linear model is adequate.

11.39 A regression model is desired relating temperature and the proportion of impurities passing through solid helium. Temperature is listed in degrees centigrade. The data are as follows:

Temperature (°C)   Proportion of Impurities
[Data table not fully recovered. Surviving rows appear to read: −265.0, 0.475; −270.0, 0.705; −272.5, 0.935; a row beginning −272.6 is incomplete.]

(a) Fit a linear regression model.
(b) Does it appear that the proportion of impurities passing through helium increases as the temperature approaches −273 degrees centigrade?
(c) Find $R^2$.
(d) Based on the information above, does the linear model seem appropriate? What additional information would you need to better answer that question?

11.40 It is of interest to study the effect of population size in various cities in the United States on ozone concentrations. The data consist of the 1999 population in millions and the amount of ozone present per hour in ppb (parts per billion). The data are as follows.

Ozone (ppb/hour), y
[Data table not recovered.]

(a) Fit the linear regression model relating ozone concentration to population. Test $H_0\colon \beta_1 = 0$ using the ANOVA approach.
(b) Do a test for lack of fit. Is the linear model appropriate based on the results of your test?
(c) Test the hypothesis of part (a) using the pure mean square error in the F-test. Do the results change? Comment on the advantage of each test.

11.41 Evaluating nitrogen deposition from the atmosphere is a major role of the National Atmospheric Deposition Program (NADP), a partnership of many agencies. NADP is studying atmospheric deposition and its effect on agricultural crops, forest surface waters, and other resources. Nitrogen oxides may affect the ozone in the atmosphere and the amount of pure nitrogen in the air we breathe. The data are as follows:

Year   Nitrogen Oxide
[Data table not fully recovered. Surviving fragments: year 1984; nitrogen oxide values 2.77 and 4.39, with their years not recoverable.]

(a) Plot the data.
(b) Fit a linear regression model and find $R^2$.
(c) What can you say about the trend in nitrogen oxide across time?

11.42 For a particular variety of plant, researchers wanted to develop a formula for predicting the quantity of seeds (in grams) as a function of the density of plants. They conducted a study with four levels of the factor x, the number of plants per plot. Four replications were used for each level of x. The data are shown as follows:

Plants per Plot, x   Quantity of Seeds, y
[Data table not recovered.]

Is a simple linear regression model adequate for analyzing this data set?