Test for Linearity of Regression: Data with Repeated Observations
11.9 Test for Linearity of Regression: Data with Repeated Observations
In certain kinds of experimental situations, the researcher has the capability of obtaining repeated observations on the response for each value of $x$. Although it is not necessary to have these repetitions in order to estimate $\beta_0$ and $\beta_1$, repetitions enable the experimenter to obtain quantitative information concerning the appropriateness of the model. In fact, if repeated observations are generated, the experimenter can make a significance test to aid in determining whether or not the model is adequate.
Figure 11.14: MINITAB printout of simple linear regression for chemical oxygen demand reduction data; part I. Legible values from the printout: fitted equation COD = 3.83 + 0.904 Per_Red; Per_Red coefficient 0.90364 with SE Coef 0.05012 (18.03, 0.000); S = 3.22954; R-Sq = 91.3%; R-Sq(adj) = 91.0%; analysis of variance with 1 and 31 degrees of freedom, regression sum of squares 3390.6 and 325.08 (0.000).
Let us select a random sample of $n$ observations using $k$ distinct values of $x$, say $x_1, x_2, \ldots, x_k$, such that the sample contains $n_1$ observed values of the random variable $Y_1$ corresponding to $x_1$, $n_2$ observed values of $Y_2$ corresponding to $x_2$, ..., $n_k$ observed values of $Y_k$ corresponding to $x_k$. Of necessity, $n = \sum_{i=1}^{k} n_i$.
Figure 11.15: MINITAB printout of simple linear regression for chemical oxygen demand reduction data; part II. Legible values from the printout: 0.576, (35.185, 37.537), (29.670, 43.052).
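We define

$$y_{ij} = \text{the } j\text{th value of the random variable } Y_i,$$
$$y_{i.} = T_{i.} = \sum_{j=1}^{n_i} y_{ij},$$
$$\bar{y}_{i.} = \frac{T_{i.}}{n_i}.$$

Hence, if $n_4 = 3$ measurements of $Y$ were made corresponding to $x = x_4$, we would indicate these observations by $y_{41}$, $y_{42}$, and $y_{43}$. Then $T_{4.} = y_{41} + y_{42} + y_{43}$.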
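The notation can be made concrete with a short sketch. The data below are invented purely for illustration, and the helper names `group_by_x` and `group_summaries` are hypothetical; the sketch simply collects the responses observed at each distinct $x$ and reports $n_i$, $T_{i.}$, and $\bar{y}_{i.}$.

```python
from collections import defaultdict

# Hypothetical (x, y) pairs, invented for illustration only.
data = [(1.0, 2.0), (1.0, 4.0), (2.0, 5.0), (2.0, 6.0), (2.0, 7.0), (3.0, 9.0)]


def group_by_x(pairs):
    """Collect the repeated responses y_ij observed at each distinct x_i."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return dict(sorted(groups.items()))


def group_summaries(groups):
    """Map each x_i to (n_i, T_i., ybar_i.)."""
    return {x: (len(ys), sum(ys), sum(ys) / len(ys)) for x, ys in groups.items()}


groups = group_by_x(data)
summaries = group_summaries(groups)

k = len(groups)                                     # number of distinct x values
n = sum(n_i for n_i, _, _ in summaries.values())    # n is the sum of the n_i

for x, (n_i, total, mean) in summaries.items():
    print(f"x = {x}: n_i = {n_i}, T_i. = {total}, ybar_i. = {mean}")
print(f"k = {k}, n = {n}")
```

Keeping the responses grouped by $x$ in this way is convenient because every later quantity in the lack-of-fit computation is built from the group sizes and group means.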
Concept of Lack of Fit
The error sum of squares consists of two parts: the amount due to the variation between the values of $Y$ within given values of $x$ and a component that is normally called the lack-of-fit contribution. The first component reflects mere random variation, or pure experimental error, while the second component is a measure of the systematic variation brought about by higher-order terms. In our case, these are terms in $x$ other than the linear, or first-order, contribution. Note that in choosing a linear model we are essentially assuming that this second component does not exist and hence our error sum of squares is completely due to random errors. If this should be the case, then $s^2 = SSE/(n - 2)$ is an unbiased estimate of $\sigma^2$. However, if the model does not adequately fit the data, then the error sum of squares is inflated and produces a biased estimate of $\sigma^2$. Whether or not the model fits the data, an unbiased estimate of $\sigma^2$ can always be obtained when we have repeated observations simply by computing

$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n_i - 1}, \quad i = 1, 2, \ldots, k,$$

for each of the $k$ distinct values of $x$ and then pooling these variances to get

$$s^2 = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{n - k} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n - k}.$$

The numerator of $s^2$ is a measure of the pure experimental error. A computational procedure for separating the error sum of squares into the two components representing pure error and lack of fit is as follows:
Computation of Lack-of-Fit Sum of Squares

1. Compute the pure error sum of squares

   $$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2.$$

   This sum of squares has $n - k$ degrees of freedom associated with it, and the resulting mean square is our unbiased estimate $s^2$ of $\sigma^2$.
2. Subtract the pure error sum of squares from the error sum of squares SSE, thereby obtaining the sum of squares due to lack of fit. The degrees of freedom for lack of fit are obtained by simply subtracting: $(n - 2) - (n - k) = k - 2$.
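The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the data are invented, the function names `pure_error_ss` and `linear_sse` are hypothetical, and SSE is obtained from an ordinary least squares line fitted inside the sketch.

```python
from collections import defaultdict


def pure_error_ss(xs, ys):
    """Step 1: squared deviations of each y_ij from its group mean ybar_i."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    ss = 0.0
    for group in groups.values():
        mean = sum(group) / len(group)
        ss += sum((y - mean) ** 2 for y in group)
    return ss, len(groups)  # SSE(pure) and k, the number of distinct x values


def linear_sse(xs, ys):
    """Error sum of squares SSE about the least squares line b0 + b1 * x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration: k = 3 distinct x values, n = 6.
xs = [1, 1, 2, 2, 3, 3]
ys = [1, 3, 2, 4, 6, 8]

n = len(xs)
sse = linear_sse(xs, ys)
sse_pure, k = pure_error_ss(xs, ys)

s2 = sse_pure / (n - k)        # pure error mean square, estimate of sigma^2
ss_lof = sse - sse_pure        # Step 2: lack-of-fit sum of squares
df_lof = (n - 2) - (n - k)     # equals k - 2

print(f"SSE = {sse}, SSE(pure) = {sse_pure}, s^2 = {s2}")
print(f"lack-of-fit SS = {ss_lof} on {df_lof} degree(s) of freedom")
```

For these invented numbers the fitted line is $\hat{y} = -1 + 2.5x$, so SSE = 9, the pure error sum of squares is 6 on 3 degrees of freedom, and the lack-of-fit sum of squares is 3 on 1 degree of freedom.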
The computations required for testing hypotheses in a regression problem with repeated measurements on the response may be summarized as shown in Table 11.3.

Figures 11.16 and 11.17 display the sample points for the "correct model" and "incorrect model" situations. In Figure 11.16, where the $\mu_{Y|x}$ fall on a straight line, there is no lack of fit when a linear model is assumed, so the sample variation around the regression line is a pure error resulting from the variation that occurs among repeated observations. In Figure 11.17, where the $\mu_{Y|x}$ clearly do not fall on a straight line, the lack of fit from erroneously choosing a linear model accounts for a large portion of the variation around the regression line, supplementing the pure error.
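Table 11.3: Analysis of Variance for Testing Linearity of Regression

Source of       Sum of             Degrees of   Mean
Variation       Squares            Freedom      Square                        Computed f
Regression      SSR                1            SSR                           SSR / s^2
Error           SSE                n − 2
  Lack of fit   SSE − SSE(pure)    k − 2        [SSE − SSE(pure)] / (k − 2)   [SSE − SSE(pure)] / [s^2 (k − 2)]
  Pure error    SSE(pure)          n − k        s^2 = SSE(pure) / (n − k)
Total           SST                n − 1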
Figure 11.16: Correct linear model with no lack-of-fit component.

Figure 11.17: Incorrect linear model with lack-of-fit component.
What Is the Importance in Detecting Lack of Fit?
The concept of lack of fit is extremely important in applications of regression analysis. In fact, the need to construct or design an experiment that will account for lack of fit becomes more critical as the problem and the underlying mechanism involved become more complicated. Surely, one cannot always be certain that his or her postulated structure, in this case the linear regression model, is correct or even an adequate representation. The following example shows how the error sum of squares is partitioned into the two components representing pure error and lack of fit. The adequacy of the model is tested at the $\alpha$-level of significance by comparing the lack-of-fit mean square divided by $s^2$ with $f_\alpha(k - 2, n - k)$.

Example 11.8: Observations of the yield of a chemical reaction taken at various temperatures were recorded in Table 11.4. Estimate the linear model $\mu_{Y|x} = \beta_0 + \beta_1 x$ and test for lack of fit.

Solution: Results of the computations are shown in Table 11.5.

Conclusion: The partitioning of the total variation in this manner reveals a significant variation accounted for by the linear model and an insignificant amount of variation due to lack of fit. Thus, the experimental data do not seem to suggest the need to consider terms higher than first order in the model, and the null hypothesis is not rejected.
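As a rough sketch of how the test is carried out once the sums of squares are in hand, the following Python fragment assembles the rows of an analysis-of-variance table for testing linearity and compares the lack-of-fit ratio with a critical value. Several things here are assumptions rather than part of the text: the sums of squares are invented (they are not the values of Example 11.8), the function names are hypothetical, the regression mean square is also compared with $s^2$, and the critical value $f_{0.05}(1, 3) = 10.13$ is supplied by hand from a table of the F-distribution.

```python
def linearity_anova(ssr, sse, sse_pure, n, k):
    """Rows (source, SS, df, MS, f); None marks entries that are not computed."""
    s2 = sse_pure / (n - k)           # pure error mean square
    ss_lof = sse - sse_pure           # lack-of-fit sum of squares
    ms_lof = ss_lof / (k - 2)         # lack-of-fit mean square
    return [
        ("Regression", ssr, 1, ssr, ssr / s2),
        ("Error", sse, n - 2, None, None),
        ("Lack of fit", ss_lof, k - 2, ms_lof, ms_lof / s2),
        ("Pure error", sse_pure, n - k, s2, None),
        ("Total", ssr + sse, n - 1, None, None),
    ]


def lack_of_fit_significant(table, f_critical):
    """Reject adequacy of the linear model when the lack-of-fit f exceeds f_critical."""
    f_lof = table[2][4]
    return f_lof > f_critical


# Hypothetical sums of squares for n = 6 observations at k = 3 distinct x values.
table = linearity_anova(ssr=25.0, sse=9.0, sse_pure=6.0, n=6, k=3)

# Assumed table lookup: f_0.05(k - 2, n - k) = f_0.05(1, 3) = 10.13.
F_CRIT = 10.13
reject = lack_of_fit_significant(table, F_CRIT)

for row in table:
    print(row)
print("lack of fit significant at the 0.05 level:", reject)
```

With these invented inputs the lack-of-fit ratio is 1.5, well below the critical value, so the linear model would not be judged inadequate. Passing the critical value in explicitly keeps the sketch free of any statistical library; a package that provides the F-distribution could supply it instead.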
Table 11.4: Data for Example 11.8

y (%)    x (°C)    y (%)    x (°C)
Table 11.5: Analysis of Variance on Yield-Temperature Data

Source of       Sum of     Degrees of   Mean
Variation       Squares    Freedom      Square    Computed f   P-Values
Regression
  Lack of fit                                     1.81         0.2241
  Pure error
Annotated Computer Printout for Test for Lack of Fit
Figure 11.18 is an annotated computer printout showing analysis of the data of Example 11.8 with SAS. Note the "LOF" with 2 degrees of freedom, representing the quadratic and cubic contribution to the model, and the P-value of 0.22, suggesting that the linear (first-order) model is adequate.
Figure 11.18: SAS printout, showing analysis of data of Example 11.8 (dependent variable: yield).
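The 2 degrees of freedom attached to "LOF" follow from the count $k - 2$ given earlier. As a brief sketch (the value $k = 4$ is inferred here from the printout's degrees of freedom rather than stated in the text), the lack-of-fit component corresponds to the terms beyond first order in a polynomial of degree $k - 1$:

```latex
\[
k - 2 = 2 \;\Longrightarrow\; k = 4,
\qquad
\mu_{Y|x} = \beta_0 + \beta_1 x
  + \underbrace{\beta_2 x^2 + \beta_3 x^3}_{\text{lack of fit, 2 degrees of freedom}} .
\]
```

A polynomial of degree $k - 1 = 3$ can pass through the means at all four distinct values of $x$, so once the quadratic and cubic terms are included, only pure error remains.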
Exercises
11.31 Test for linearity of regression in Exercise 11.3 on page 398. Use a 0.05 level of significance. Comment.

11.32 Test for linearity of regression in Exercise 11.8 on page 399. Comment.

11.33 Suppose we have a linear equation through the origin (Exercise 11.28) $\mu_{Y|x} = \beta x$.
(a) Estimate the regression line passing through the origin for the following data:

0.5  1.5  3.2  4.2  5.1  6.5

(b) Suppose it is not known whether the true regression should pass through the origin. Estimate the linear model $\mu_{Y|x} = \beta_0 + \beta_1 x$ and test the hypothesis that $\beta_0 = 0$, at the 0.10 level of significance, against the alternative that $\beta_0 \neq 0$.

11.34 Use an analysis-of-variance approach to test the hypothesis that $\beta_1 = 0$ against the alternative hypothesis $\beta_1 \neq 0$ at the 0.05 level of significance.

11.35 The following data are a result of an investigation as to the effect of reaction temperature $x$ on percent conversion of a chemical process $y$. (See Myers, Montgomery and Anderson-Cook, 2009.) Fit a simple linear regression, and use a lack-of-fit test to determine if the model is adequate. Discuss.

Observation   Temperature (°C), x   Conversion (%), y
1             200                   43
2             250                   78
3             200                   69
4             250                   73
5             189.65                48
6             260.35                78
7             225
8             225
                                    65
                                    81

11.36 Transistor gain between emitter and collector in an integrated circuit device (hFE) is related to two variables (Myers, Montgomery and Anderson-Cook, 2009) that can be controlled at the deposition process, emitter drive-in time ($x_1$, in minutes) and emitter dose ($x_2$, in ions $\times 10^{14}$). Fourteen samples were observed following deposition, and the resulting data are shown in the table below. We will consider linear regression models using gain as the response and emitter drive-in time or emitter dose as the regressor variable.

Obs.   x1 (drive-in time, min)   x2 (dose, ions × 10^14)   y (gain, or hFE)
13     255
14     340

(a) Determine if emitter drive-in time influences gain in a linear relationship. That is, test $H_0$: $\beta_1 = 0$, where $\beta_1$ is the slope of the regressor variable.
(b) Do a lack-of-fit test to determine if the linear relationship is adequate. Draw conclusions.
(c) Determine if emitter dose influences gain in a linear relationship. Which regressor variable is the better predictor of gain?

11.37 Organophosphate (OP) compounds are used as pesticides. However, it is important to study their effect on species that are exposed to them. In the laboratory study Some Effects of Organophosphate Pesticides on Wildlife Species, by the Department of Fisheries and Wildlife at Virginia Tech, an experiment was conducted in which different dosages of a particular OP pesticide were administered to 5 groups of 5 mice (Peromyscus leucopus). The 25 mice were females of similar age and condition. One group received no chemical. The basic response $y$ was a measure of activity in the brain. It was postulated that brain activity would decrease with an increase in OP dosage. The data are as follows:

Animal   Dose, x (mg/kg body weight)   Activity, y (moles/liter/min)
5        0.0                           9.0
6        2.3                           11.0
7        2.3                           11.3
8        2.3                           9.9
9        2.3                           9.2
10       2.3                           10.1
11       4.6                           10.6
12       4.6                           10.4
13       4.6                           8.8
14       4.6                           11.1
17       9.2                           7.8

(a) Using the model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, 2, \ldots, 25$, find the least squares estimates of $\beta_0$ and $\beta_1$.
(b) Construct an analysis-of-variance table in which the lack of fit and pure error have been separated.
Determine if the lack of fit is significant at the 0.05 level. Interpret the results.

11.38 Heat treating is often used to carburize metal parts such as gears. The thickness of the carburized layer is considered an important feature of the gear, and it contributes to the overall reliability of the part. Because of the critical nature of this feature, a lab test is performed on each furnace load. The test is a destructive one, where an actual part is cross sectioned and soaked in a chemical for a period of time. This test involves running a carbon analysis on the surface of both the gear pitch (top of the gear tooth) and the gear root (between the gear teeth). The data below are the results of the pitch carbon-analysis test for 19 parts.

Soak Time   Pitch
0.58        0.013
0.66        0.016
0.66        0.016
0.66        0.015
1.00        0.014
1.17        0.021
1.17        0.018
1.17        0.019

(a) Fit a simple linear regression relating the pitch carbon analysis $y$ against soak time. Test $H_0$: $\beta_1 = 0$.
(b) If the hypothesis in part (a) is rejected, determine if the linear model is adequate.

11.39 A regression model is desired relating temperature and the proportion of impurities passing through solid helium. Temperature is listed in degrees centigrade. The data are as follows:

Temperature (°C)   Proportion of Impurities

(a) Fit a linear regression model.
(b) Does it appear that the proportion of impurities passing through helium increases as the temperature approaches −273 degrees centigrade?
(c) Find $R^2$.
(d) Based on the information above, does the linear model seem appropriate? What additional information would you need to better answer that question?

11.40 It is of interest to study the effect of population size in various cities in the United States on ozone concentrations. The data consist of the 1999 population in millions and the amount of ozone present per hour in ppb (parts per billion). The data are as follows.

Ozone (ppb/hour), y   Population, x
                      0.6
                      4.9
                      0.2
                      0.5
                      1.1
                      0.1
                      1.1
                      2.3
                      2.3

(a) Fit the linear regression model relating ozone concentration to population. Test $H_0$: $\beta_1 = 0$ using the ANOVA approach.
(b) Do a test for lack of fit. Is the linear model appropriate based on the results of your test?
(c) Test the hypothesis of part (a) using the pure mean square error in the F-test. Do the results change? Comment on the advantage of each test.

11.41 Evaluating nitrogen deposition from the atmosphere is a major role of the National Atmospheric Deposition Program (NADP), a partnership of many agencies. NADP is studying atmospheric deposition and its effect on agricultural crops, forest surface waters, and other resources. Nitrogen oxides may affect the ozone in the atmosphere and the amount of pure nitrogen in the air we breathe. The data are as follows:

Nitrogen Oxide
3.95
3.14
3.44
3.63
4.50
3.95
5.24
3.30

(a) Plot the data.
(b) Fit a linear regression model and find $R^2$.
(c) What can you say about the trend in nitrogen oxide across time?

11.42 For a particular variety of plant, researchers wanted to develop a formula for predicting the quantity of seeds (in grams) as a function of the density of plants. They conducted a study with four levels of the factor $x$, the number of plants per plot. Four replications were used for each level of $x$. The data are shown as follows:

Plants per Plot, x   Quantity of Seeds, y (grams)
10                   12.6   11.0   12.1   10.9
20                   15.3   16.1   14.9   15.6
30                   17.9   18.3   18.6   17.8
40                   19.2   19.6   18.9   20.0

Is a simple linear regression model adequate for analyzing this data set?