Test for Linearity of Regression: Data with Repeated Observations
11.9 Test for Linearity of Regression: Data with Repeated Observations
In certain kinds of experimental situations, the researcher has the capability of obtaining repeated observations on the response for each value of x. Although these repetitions are not necessary for estimating $\beta_0$ and $\beta_1$, they enable the experimenter to obtain quantitative information concerning the appropriateness of the model. In fact, if repeated observations are generated, the experimenter can make a significance test to aid in determining whether or not the model is adequate.
The regression equation is
COD = 3.83 + 0.904 Per_Red

Predictor     Coef  SE Coef      T      P
Constant                      2.17  0.038
Per_Red    0.90364  0.05012  18.03  0.000

S = 3.22954   R-Sq = 91.3%   R-Sq(adj) = 91.0%

Analysis of Variance
Residual Error   DF = 31
[Remaining analysis-of-variance rows and the table of observations (Obs, Per_Red, COD, Fit, SE Fit, Residual, St Resid) are missing.]
Figure 11.14: MINITAB printout of simple linear regression for chemical oxygen demand reduction data; part I.
Let us select a random sample of n observations using k distinct values of x, say $x_1, x_2, \ldots, x_k$, such that the sample contains $n_1$ observed values of the random variable $Y_1$ corresponding to $x_1$, $n_2$ observed values of $Y_2$ corresponding to $x_2$, ..., $n_k$ observed values of $Y_k$ corresponding to $x_k$. Of necessity, $n = \sum_{i=1}^{k} n_i$.
Chapter 11 Simple Linear Regression and Correlation
Figure 11.15: MINITAB printout of simple linear regression for chemical oxygen demand reduction data; part II.
We define
$y_{ij}$ = the jth value of the random variable $Y_i$,
$y_{i.} = T_{i.} = \sum_{j=1}^{n_i} y_{ij}$,
$\bar{y}_{i.} = T_{i.}/n_i$.
Hence, if $n_4 = 3$ measurements of Y were made corresponding to $x = x_4$, we would indicate these observations by $y_{41}$, $y_{42}$, and $y_{43}$. Then $T_{4.} = y_{41} + y_{42} + y_{43}$.
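The notation above can be made concrete with a short Python sketch that groups replicated responses by their x value and reports, for each distinct $x_i$, the count $n_i$, the total $T_{i.}$, and the mean $\bar{y}_{i.} = T_{i.}/n_i$. The data and the function name `group_replicates` are hypothetical and serve only as an illustration; they are not taken from the text.

```python
from collections import defaultdict


def group_replicates(x, y):
    """Return {x_i: (n_i, T_i, ybar_i)} for each distinct value of x.

    Responses are grouped by exact equality of x, so replicated
    settings must be recorded with identical values.
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)

    summary = {}
    for xi, ys in sorted(groups.items()):
        n_i = len(ys)          # number of repeated observations at x_i
        total = sum(ys)        # T_i. = y_i1 + y_i2 + ... + y_in_i
        summary[xi] = (n_i, total, total / n_i)
    return summary


# Hypothetical data: k = 4 distinct x values with 2 replicates each.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [2.0, 2.4, 3.9, 4.1, 6.2, 5.8, 7.6, 8.0]

summary = group_replicates(x, y)
n = sum(n_i for n_i, _, _ in summary.values())  # n is the sum of the n_i
```

Here `summary[1]` holds the triple for the first setting, and `n` recovers the total sample size as the sum of the group sizes.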
Concept of Lack of Fit
The error sum of squares consists of two parts: the amount due to the variation between the values of Y within given values of x and a component that is normally called the lack-of-fit contribution. The first component reflects mere random variation, or pure experimental error, while the second component is a measure of the systematic variation brought about by higher-order terms. In our case, these are terms in x other than the linear, or first-order, contribution. Note that in choosing a linear model we are essentially assuming that this second component does not exist and hence our error sum of squares is completely due to random errors. If this should be the case, then $s^2 = \mathrm{SSE}/(n - 2)$ is an unbiased estimate of $\sigma^2$. However, if the model does not adequately fit the data, then the error sum of squares is inflated and produces a biased estimate of $\sigma^2$. Whether or not the model fits the data, an unbiased estimate of $\sigma^2$ can always be obtained when we have repeated observations simply by computing

$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n_i - 1}, \quad i = 1, 2, \ldots, k,$$

for each of the k distinct values of x, where $\bar{y}_{i.}$ is the mean of the $n_i$ responses observed at $x_i$, and then pooling these variances to get

$$s^2 = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{n - k} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n - k}.$$

The numerator of $s^2$ is a measure of the pure experimental error. A computational procedure for separating the error sum of squares into the two components representing pure error and lack of fit is as follows:
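To see why the pooled estimate is unbiased whether or not the straight-line model holds, it may help to write out its expectation. The sketch below assumes only that the $n_i$ responses at each $x_i$ are independent with common variance $\sigma^2$ about their own mean $\mu_{Y|x_i}$; nothing is assumed about how those means depend on x. Here $s_i^2$ denotes the sample variance of the responses at $x_i$ and $s^2$ the pooled variance with $n - k$ degrees of freedom.

```latex
\begin{align*}
E\left[(n_i - 1)\,s_i^2\right]
  &= E\left[\sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_{i.}\right)^2\right]
   = (n_i - 1)\,\sigma^2, \\
E\left[s^2\right]
  &= \frac{1}{n - k} \sum_{i=1}^{k} (n_i - 1)\,\sigma^2
   = \frac{(n - k)\,\sigma^2}{n - k}
   = \sigma^2 .
\end{align*}
```

Because each within-group deviation is measured from the group's own mean, the form of the regression function never enters the calculation.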
Computation of Lack-of-Fit Sum of Squares
1. Compute the pure error sum of squares

$$\mathrm{SSE(pure)} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2.$$

This sum of squares has n − k degrees of freedom associated with it, and the resulting mean square is our unbiased estimate $s^2$ of $\sigma^2$.
2. Subtract the pure error sum of squares from the error sum of squares SSE, thereby obtaining the sum of squares due to lack of fit. The degrees of freedom for lack of fit are obtained by simply subtracting (n − 2) − (n − k) = k − 2.
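The two-step computation above can be sketched in Python as follows. The function fits the least squares line, computes SSE from the residuals, computes the pure error sum of squares within each group of repeated x values, and obtains the lack-of-fit sum of squares by subtraction. The function name `partition_sse` and the data are hypothetical; this is a minimal illustration, not a reference implementation.

```python
from collections import defaultdict


def partition_sse(x, y):
    """Split SSE of a simple linear regression into pure error and lack of fit."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n

    # Least squares estimates of the slope and intercept.
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar

    # Error sum of squares about the fitted line (n - 2 degrees of freedom).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    # Step 1: pure error sum of squares, computed within each distinct x.
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    sse_pure = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        sse_pure += sum((yij - group_mean) ** 2 for yij in ys)

    # Step 2: lack of fit by subtraction, with k - 2 degrees of freedom.
    k = len(groups)
    return {
        "SSE": sse,
        "SSE_pure": sse_pure,
        "SS_lack_of_fit": sse - sse_pure,
        "df_pure": n - k,
        "df_lack_of_fit": k - 2,
    }


# Hypothetical data: 4 distinct x values, 2 replicates each.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [2.0, 2.4, 3.9, 4.1, 6.2, 5.8, 7.6, 8.0]
result = partition_sse(x, y)
```

For these made-up data the group means lie very nearly on a straight line, so almost all of SSE is pure error and the lack-of-fit component is small.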
The computations required for testing hypotheses in a regression problem with repeated measurements on the response may be summarized as shown in Table 11.3.
Figures 11.16 and 11.17 display the sample points for the “correct model” and “incorrect model” situations. In Figure 11.16, where the $\mu_{Y|x}$ fall on a straight line, there is no lack of fit when a linear model is assumed, so the sample variation around the regression line is pure error resulting from the variation that occurs among repeated observations. In Figure 11.17, where the $\mu_{Y|x}$ clearly do not fall on a straight line, the lack of fit from erroneously choosing a linear model accounts for a large portion of the variation around the regression line, supplementing the pure error.
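Table 11.3: Analysis of Variance for Testing Linearity of Regression

Source of       Sum of            Degrees of   Mean                          Computed f
Variation       Squares           Freedom      Square
Regression      SSR               1            SSR                           SSR / s^2
Error           SSE               n − 2
  Lack of fit   SSE − SSE(pure)   k − 2        [SSE − SSE(pure)] / (k − 2)   [SSE − SSE(pure)] / [s^2 (k − 2)]
  Pure error    SSE(pure)         n − k        s^2 = SSE(pure) / (n − k)
Total           SST               n − 1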
Figure 11.16: Correct linear model with no lack-of-fit component.
Figure 11.17: Incorrect linear model with lack-of-fit component.
What Is the Importance in Detecting Lack of Fit?
The concept of lack of fit is extremely important in applications of regression analysis. In fact, the need to construct or design an experiment that will account for lack of fit becomes more critical as the problem and the underlying mechanism involved become more complicated. Surely, one cannot always be certain that the postulated structure, in this case the linear regression model, is correct or even an adequate representation. The following example shows how the error sum of squares is partitioned into the two components representing pure error and lack of fit. The adequacy of the model is tested at the α-level of significance by comparing the lack-of-fit mean square divided by $s^2$ with $f_\alpha(k - 2, n - k)$; the hypothesis that the linear model is adequate is rejected when the computed ratio exceeds this critical value.
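The test just described reduces to a ratio of two mean squares. The Python sketch below computes that ratio from the sums of squares and compares it with a critical value supplied by the user, for instance one read from an F table. The function names, the input numbers, and the critical value 6.94 for $f_{0.05}(2, 4)$ are assumptions chosen for illustration only.

```python
def lack_of_fit_f(sse, sse_pure, n, k):
    """Return the lack-of-fit mean square divided by s^2.

    sse      : error sum of squares from the fitted line (n - 2 df)
    sse_pure : pure error sum of squares (n - k df)
    n, k     : number of observations and of distinct x values
    """
    if k <= 2 or n <= k:
        raise ValueError("need k > 2 distinct x values and n > k observations")
    ms_lack_of_fit = (sse - sse_pure) / (k - 2)
    s2 = sse_pure / (n - k)  # unbiased estimate of sigma^2
    return ms_lack_of_fit / s2


def linear_model_adequate(f_value, f_critical):
    """True when the computed f does not exceed the tabled critical value."""
    return f_value <= f_critical


# Hypothetical inputs: SSE = 0.276, SSE(pure) = 0.26, n = 8, k = 4.
f_value = lack_of_fit_f(0.276, 0.26, n=8, k=4)

# Critical value f_0.05(2, 4), taken here from a standard F table.
F_CRITICAL = 6.94
adequate = linear_model_adequate(f_value, F_CRITICAL)
```

Passing the critical value in as an argument keeps the sketch free of any statistical library; a routine that evaluates the F distribution could be substituted where one is available.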
Example 11.8: Observations of the yield of a chemical reaction taken at various temperatures were recorded in Table 11.4. Estimate the linear model $\mu_{Y|x} = \beta_0 + \beta_1 x$ and test for lack of fit.
Solution: Results of the computations are shown in Table 11.5.
Conclusion: The partitioning of the total variation in this manner reveals a significant variation accounted for by the linear model and an insignificant amount of variation due to lack of fit. Thus, the experimental data do not seem to suggest the need to consider terms higher than first order in the model, and the null hypothesis is not rejected.
Table 11.4: Data for Example 11.8
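Table 11.5: Analysis of Variance on Yield-Temperature Data
[Columns: Source of Variation, Sum of Squares, Degrees of Freedom, Mean Square, Computed f, P-Value. Rows include Lack of fit and Pure error. Numerical entries are missing.]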
Annotated Computer Printout for Test for Lack of Fit
Figure 11.18 is an annotated computer printout showing analysis of the data of Example 11.8 with SAS. Note the “LOF” with 2 degrees of freedom, representing the quadratic and cubic contribution to the model, and the P-value of 0.22, suggesting that the linear (first-order) model is adequate.
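[SAS output. Dependent Variable: yield. The analysis-of-variance table has columns Source, DF, Sum of Squares, Mean Square, and F Value, ending with the Corrected Total row; the fit statistics reported are R-Square, Coeff Var, Root MSE, and yield Mean; a second table lists DF, Type I SS, Mean Square, and F Value for each term. Numerical entries are missing.]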
Figure 11.18: SAS printout, showing analysis of data of Example 11.8.
Exercises
11.31 Test for linearity of regression in Exercise 11.3 on page 398. Use a 0.05 level of significance. Comment.

11.32 Test for linearity of regression in Exercise 11.8 on page 399. Comment.

11.33 Suppose we have a linear equation through the origin (Exercise 11.28) $\mu_{Y|x} = \beta x$.
(a) Estimate the regression line passing through the origin for the following data:
[Data table with rows x and y; values missing.]
(b) Suppose it is not known whether the true regression should pass through the origin. Estimate the linear model $\mu_{Y|x} = \beta_0 + \beta_1 x$ and test the hypothesis that $\beta_0 = 0$, at the 0.10 level of significance, against the alternative that $\beta_0 \neq 0$.

11.34 Use an analysis-of-variance approach to test the hypothesis that $\beta_1 = 0$ against the alternative hypothesis $\beta_1 \neq 0$ in Exercise 11.5 on page 398 at the 0.05 level of significance.

11.35 The following data are a result of an investigation as to the effect of reaction temperature x on percent conversion of a chemical process y. (See Myers, Montgomery and Anderson-Cook, 2009.) Fit a simple linear regression, and use a lack-of-fit test to determine if the model is adequate. Discuss.
[Data table: Observation; Temperature (°C), x; Conversion (%), y. Only the conversion values 43, 78, 69, 73, 78, 65 remain; the other entries are missing.]

11.36 Transistor gain between emitter and collector in an integrated circuit device (hFE) is related to two variables (Myers, Montgomery and Anderson-Cook, 2009) that can be controlled at the deposition process, emitter drive-in time ($x_1$, in minutes) and emitter dose ($x_2$, in ions $\times 10^{14}$). Fourteen samples were observed following deposition, and the resulting data are shown in the table below. We will consider linear regression models using gain as the response and emitter drive-in time or emitter dose as the regressor variable.
[Data table: Obs.; $x_1$ (drive-in time, min); $x_2$ (dose, ions $\times 10^{14}$); y (gain, or hFE); values missing.]
(a) Determine if emitter drive-in time influences gain in a linear relationship. That is, test $H_0$: $\beta_1 = 0$, where $\beta_1$ is the slope of the regressor variable.
(b) Do a lack-of-fit test to determine if the linear relationship is adequate. Draw conclusions.
(c) Determine if emitter dose influences gain in a linear relationship. Which regressor variable is the better predictor of gain?

11.37 Organophosphate (OP) compounds are used as pesticides. However, it is important to study their effect on species that are exposed to them. In the laboratory study Some Effects of Organophosphate Pesticides on Wildlife Species, by the Department of Fisheries and Wildlife at Virginia Tech, an experiment was conducted in which different dosages of a particular OP pesticide were administered to 5 groups of 5 mice (Peromyscus leucopus). The 25 mice were females of similar age and condition. One group received no chemical. The basic response y was a measure of activity in the brain. It was postulated that brain activity would decrease with an increase in OP dosage. The data are as follows:
[Data table: Animal; Dose, x (mg/kg body weight); Activity, y. Only one row remains (animal 11, dose 4.6, activity 10.6); the other entries are missing.]
(a) Using the model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, 2, \ldots, 25$, find the least squares estimates of $\beta_0$ and $\beta_1$.
(b) Construct an analysis-of-variance table in which the lack of fit and pure error have been separated. Determine if the lack of fit is significant at the 0.05 level. Interpret the results.

11.38 Heat treating is often used to carburize metal parts such as gears. The thickness of the carburized layer is considered an important feature of the gear, and it contributes to the overall reliability of the part. Because of the critical nature of this feature, a lab test is performed on each furnace load. The test is a destructive one, where an actual part is cross sectioned and soaked in a chemical for a period of time. This test involves running a carbon analysis on the surface of both the gear pitch (top of the gear tooth) and the gear root (between the gear teeth). The data below are the results of the pitch carbon-analysis test for 19 parts.
[Data table: Soak Time and Pitch for each part; values missing.]
(a) Fit a simple linear regression relating the pitch carbon analysis y against soak time. Test $H_0$: $\beta_1 = 0$.
(b) If the hypothesis in part (a) is rejected, determine if the linear model is adequate.

11.39 A regression model is desired relating temperature and the proportion of impurities passing through solid helium. Temperature is listed in degrees centigrade. The data are as follows:
[Data table: Temperature (°C); Proportion of Impurities. Remaining entries: temperatures −265.0, −270.0, −272.5, −272.6 and proportions 0.475, 0.705, 0.935; the other entries are missing.]
(a) Fit a linear regression model.
(b) Does it appear that the proportion of impurities passing through helium increases as the temperature approaches −273 degrees centigrade?
(c) Find $R^2$.
(d) Based on the information above, does the linear model seem appropriate? What additional information would you need to better answer that question?

11.40 It is of interest to study the effect of population size in various cities in the United States on ozone concentrations. The data consist of the 1999 population in millions and the amount of ozone present per hour in ppb (parts per billion). The data are as follows.
[Data table: Ozone (ppb/hour), y, against population; values missing.]
(a) Fit the linear regression model relating ozone concentration to population. Test $H_0$: $\beta_1 = 0$ using the ANOVA approach.
(b) Do a test for lack of fit. Is the linear model appropriate based on the results of your test?
(c) Test the hypothesis of part (a) using the pure mean square error in the F-test. Do the results change? Comment on the advantage of each test.

11.41 Evaluating nitrogen deposition from the atmosphere is a major role of the National Atmospheric Deposition Program (NADP), a partnership of many agencies. NADP is studying atmospheric deposition and its effect on agricultural crops, forest surface waters, and other resources. Nitrogen oxides may affect the ozone in the atmosphere and the amount of pure nitrogen in the air we breathe. The data are as follows:
[Data table: Year; Nitrogen Oxide. Remaining entries: year 1984 and nitrogen oxide values 2.77 and 4.39; the other entries are missing.]
(a) Plot the data.
(b) Fit a linear regression model and find $R^2$.
(c) What can you say about the trend in nitrogen oxide across time?

11.42 For a particular variety of plant, researchers wanted to develop a formula for predicting the quantity of seeds (in grams) as a function of the density of plants. They conducted a study with four levels of the factor x, the number of plants per plot. Four replications were used for each level of x. The data are shown as follows:
[Data table: Plants per Plot, x; Quantity of Seeds, y (grams); values missing.]
Is a simple linear regression model adequate for analyzing this data set?