Test for Linearity of Regression: Data with Repeated Observations

11.9 Test for Linearity of Regression: Data with Repeated Observations

In certain kinds of experimental situations, the researcher has the capability of obtaining repeated observations on the response for each value of x. Although it is not necessary to have these repetitions in order to estimate $\beta_0$ and $\beta_1$, repetitions nevertheless enable the experimenter to obtain quantitative information concerning the appropriateness of the model. In fact, if repeated observations are generated, the experimenter can make a significance test to aid in determining whether or not the model is adequate.

Figure 11.14: MINITAB printout of simple linear regression for chemical oxygen demand reduction data; part I. (Fitted equation: COD = 3.83 + 0.904 Per_Red; S = 3.22954; R-Sq = 91.3%; R-Sq(adj) = 91.0%.)

Let us select a random sample of n observations using k distinct values of x, say $x_1, x_2, \ldots, x_k$, such that the sample contains $n_1$ observed values of the random variable $Y_1$ corresponding to $x_1$, $n_2$ observed values of $Y_2$ corresponding to $x_2$, ..., $n_k$ observed values of $Y_k$ corresponding to $x_k$. Of necessity, $n = \sum_{i=1}^{k} n_i$.

Figure 11.15: MINITAB printout of simple linear regression for chemical oxygen demand reduction data; part II.

We define

$y_{ij}$ = the jth value of the random variable $Y_i$,

$T_{i.} = \sum_{j=1}^{n_i} y_{ij}$,

$\bar{y}_{i.} = T_{i.}/n_i$.

Hence, if $n_4 = 3$ measurements of Y were made corresponding to $x = x_4$, we would indicate these observations by $y_{41}$, $y_{42}$, and $y_{43}$. Then $T_{4.} = y_{41} + y_{42} + y_{43}$.
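The notation can be made concrete with a short sketch. It groups the responses by their x value and reports $n_i$, $T_{i.}$, and $\bar{y}_{i.}$ for each group. The function name `group_summaries` is hypothetical, and the data are the seed-yield values from Exercise 11.42 in this section, used here only as an illustration.

```python
from collections import defaultdict


def group_summaries(xs, ys):
    """Return {x_i: (n_i, T_i, ybar_i)} for each distinct x value.

    Observations are grouped by exact equality of x, which is an
    assumption that suits designed experiments with repeated settings.
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {
        x: (len(vals), sum(vals), sum(vals) / len(vals))
        for x, vals in sorted(groups.items())
    }


# Exercise 11.42 data: four replications at each of k = 4 plant densities.
xs = [10] * 4 + [20] * 4 + [30] * 4 + [40] * 4
ys = [12.6, 11.0, 12.1, 10.9,
      15.3, 16.1, 14.9, 15.6,
      17.9, 18.3, 18.6, 17.8,
      19.2, 19.6, 18.9, 20.0]

summaries = group_summaries(xs, ys)
for x, (n_i, total, mean) in summaries.items():
    print(f"x = {x}: n_i = {n_i}, T_i. = {total:.1f}, ybar_i. = {mean:.3f}")
```

The group sizes sum to n, matching the requirement that $n = \sum n_i$.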

Concept of Lack of Fit

The error sum of squares consists of two parts: the amount due to the variation between the values of Y within given values of x and a component that is normally called the lack-of-fit contribution. The first component reflects mere random variation, or pure experimental error, while the second component is a measure of the systematic variation brought about by higher-order terms. In our case, these are terms in x other than the linear, or first-order, contribution. Note that in choosing a linear model we are essentially assuming that this second component does not exist and hence our error sum of squares is completely due to random errors. If this should be the case, then $s^2 = SSE/(n-2)$ is an unbiased estimate of $\sigma^2$. However, if the model does not adequately fit the data, then the error sum of squares is inflated and produces a biased estimate of $\sigma^2$. Whether or not the model fits the data, an unbiased estimate of $\sigma^2$ can always be obtained when we have repeated observations simply by computing

$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n_i - 1}, \qquad i = 1, 2, \ldots, k,$$

for each of the k distinct values of x and then pooling these variances to get

$$s^2 = \frac{\sum_{i=1}^{k} (n_i - 1)\, s_i^2}{n - k} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n - k}.$$

The numerator of $s^2$ is a measure of the pure experimental error. A computational procedure for separating the error sum of squares into the two components representing pure error and lack of fit is as follows:

Computation of Lack-of-Fit Sum of Squares

1. Compute the pure error sum of squares

$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2.$$

This sum of squares has n − k degrees of freedom associated with it, and the resulting mean square is our unbiased estimate $s^2$ of $\sigma^2$.

2. Subtract the pure error sum of squares from the error sum of squares SSE, thereby obtaining the sum of squares due to lack of fit. The degrees of freedom for lack of fit are obtained by simply subtracting (n − 2) − (n − k) = k − 2.
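The two steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `lack_of_fit_partition` and the returned dictionary keys are hypothetical, SSE is obtained from the usual least squares identity $SSE = S_{yy} - S_{xy}^2/S_{xx}$, and the data are again the seed-yield values from Exercise 11.42.

```python
from collections import defaultdict


def lack_of_fit_partition(xs, ys):
    """Split SSE of a simple linear regression into pure error and lack of fit."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    syy = sum((y - ybar) ** 2 for y in ys)
    sse = syy - sxy ** 2 / sxx  # error sum of squares, n - 2 df

    # Step 1: pure error sum of squares, pooled over the k distinct x values.
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    k = len(groups)
    sse_pure = 0.0
    for vals in groups.values():
        group_mean = sum(vals) / len(vals)
        sse_pure += sum((y - group_mean) ** 2 for y in vals)

    # Step 2: lack of fit is what remains of SSE after removing pure error.
    return {
        "sse": sse,
        "sse_pure": sse_pure,
        "ss_lof": sse - sse_pure,
        "df_pure": n - k,
        "df_lof": k - 2,
    }


# Exercise 11.42 data: n = 16 observations at k = 4 distinct x values.
xs = [10] * 4 + [20] * 4 + [30] * 4 + [40] * 4
ys = [12.6, 11.0, 12.1, 10.9,
      15.3, 16.1, 14.9, 15.6,
      17.9, 18.3, 18.6, 17.8,
      19.2, 19.6, 18.9, 20.0]

parts = lack_of_fit_partition(xs, ys)
for name, value in parts.items():
    print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

Grouping by exact equality of x is a simplification; measured (rather than set) x values would need a tolerance or an explicit group label.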

The computations required for testing hypotheses in a regression problem with repeated measurements on the response may be summarized as shown in Table 11.3. Figures 11.16 and 11.17 display the sample points for the “correct model” and “incorrect model” situations. In Figure 11.16, where the $\mu_{Y|x}$ fall on a straight line, there is no lack of fit when a linear model is assumed, so the sample variation around the regression line is pure error resulting from the variation that occurs among repeated observations. In Figure 11.17, where the $\mu_{Y|x}$ clearly do not fall on a straight line, the lack of fit from erroneously choosing a linear model accounts for a large portion of the variation around the regression line, supplementing the pure error.
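The split of the variation around the fitted line into these two sources can be written as an identity. In the sketch below, $\hat{y}_i$ denotes the fitted value of the least squares line at $x_i$; that symbol is an assumption of this note, since it is not defined in the section itself.

```latex
\begin{align*}
SSE &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \hat{y}_i)^2 \\
    &= \underbrace{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}_{\text{pure error},\ n-k \text{ df}}
     + \underbrace{\sum_{i=1}^{k} n_i\, (\bar{y}_{i.} - \hat{y}_i)^2}_{\text{lack of fit},\ k-2 \text{ df}}
\end{align*}
```

The cross-product term vanishes because $\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.}) = 0$ within every group. When the group means lie close to the fitted line, as in Figure 11.16, the second term is small; when they curve away from it, as in Figure 11.17, it is large.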

Table 11.3: Analysis of Variance for Testing Linearity of Regression

Source of Variation   Sum of Squares     Degrees of Freedom   Mean Square                 Computed f
Regression            SSR                1                    SSR                         SSR/s^2
Error                 SSE                n − 2
  Lack of fit         SSE − SSE(pure)    k − 2                [SSE − SSE(pure)]/(k − 2)   [SSE − SSE(pure)]/[s^2 (k − 2)]
  Pure error          SSE(pure)          n − k                s^2 = SSE(pure)/(n − k)
Total                 SST                n − 1

Figure 11.16: Correct linear model with no lack-of-fit component.

Figure 11.17: Incorrect linear model with lack-of-fit component.

What Is the Importance in Detecting Lack of Fit?

The concept of lack of fit is extremely important in applications of regression analysis. In fact, the need to construct or design an experiment that will account for lack of fit becomes more critical as the problem and the underlying mechanism involved become more complicated. Surely, one cannot always be certain that the postulated structure, in this case the linear regression model, is correct or even an adequate representation. The following example shows how the error sum of squares is partitioned into the two components representing pure error and lack of fit. The adequacy of the model is tested at the α-level of significance by comparing the lack-of-fit mean square divided by $s^2$ with $f_\alpha(k-2, n-k)$.

Example 11.8: Observations of the yield of a chemical reaction taken at various temperatures were recorded in Table 11.4. Estimate the linear model $\mu_{Y|x} = \beta_0 + \beta_1 x$ and test for lack of fit.

Solution: Results of the computations are shown in Table 11.5.

Conclusion: The partitioning of the total variation in this manner reveals a significant variation accounted for by the linear model and an insignificant amount of variation due to lack of fit. Thus, the experimental data do not seem to suggest the need to consider terms higher than first order in the model, and the null hypothesis is not rejected.
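The test statistic itself is a one-line computation once the sums of squares are known. The sketch below is illustrative only: the function name `lack_of_fit_f` is hypothetical, and the critical value must be supplied by the reader from an F table or statistical software. The inputs SSE = 10.47 and SSE(pure) = 3.955 are what the two-step partition yields for the seed-yield data of Exercise 11.42 (n = 16, k = 4), and 3.89 is the tabulated value of $f_{0.05}(2, 12)$.

```python
def lack_of_fit_f(sse, sse_pure, n, k):
    """Return (f, df_lof, df_pure) for the lack-of-fit test.

    f is the lack-of-fit mean square divided by the pure error mean
    square s^2, to be compared with f_alpha(k - 2, n - k).
    """
    if k <= 2 or n <= k:
        raise ValueError("need k > 2 distinct x values and n > k observations")
    df_lof = k - 2
    df_pure = n - k
    ms_lof = (sse - sse_pure) / df_lof
    s2 = sse_pure / df_pure
    return ms_lof / s2, df_lof, df_pure


# Sums of squares for the Exercise 11.42 data (assumed inputs, see lead-in).
f_value, df_lof, df_pure = lack_of_fit_f(sse=10.47, sse_pure=3.955, n=16, k=4)

F_CRIT_05 = 3.89  # f_0.05(2, 12) from a standard F table (assumed lookup)
reject_linearity = f_value > F_CRIT_05

print(f"f = {f_value:.2f} with ({df_lof}, {df_pure}) degrees of freedom")
print("lack of fit significant at 0.05:", reject_linearity)
```

Passing the critical value in by hand keeps the sketch free of third-party dependencies; a statistics library would normally supply either the critical value or a P-value directly.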

Table 11.4: Data for Example 11.8

y (%)   x (°C)   y (%)   x (°C)

Table 11.5: Analysis of Variance on Yield-Temperature Data

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   Computed f   P-Values
Regression
Lack of fit                                                               1.81         0.2241
Pure error

Annotated Computer Printout for Test for Lack of Fit

Figure 11.18 is an annotated computer printout showing analysis of the data of Example 11.8 with SAS. Note the “LOF” with 2 degrees of freedom, representing the quadratic and cubic contribution to the model, and the P-value of 0.22, suggesting that the linear (first-order) model is adequate.

Figure 11.18: SAS printout, showing analysis of data of Example 11.8.

Exercises

11.31 Test for linearity of regression in Exercise 11.3 on page 398. Use a 0.05 level of significance. Comment.

11.32 Test for linearity of regression in Exercise 11.8 on page 399. Comment.

11.33 Suppose we have a linear equation through the origin (Exercise 11.28) $\mu_{Y|x} = \beta x$.
(a) Estimate the regression line passing through the origin for the following data:

0.5   1.5   3.2   4.2   5.1   6.5

(b) Suppose it is not known whether the true regression should pass through the origin. Estimate the linear model $\mu_{Y|x} = \beta_0 + \beta_1 x$ and test the hypothesis that $\beta_0 = 0$, at the 0.10 level of significance, against the alternative that $\beta_0 \neq 0$.

11.34 Use an analysis-of-variance approach to test the hypothesis that $\beta_1 = 0$ against the alternative hypothesis $\beta_1 \neq 0$ at the 0.05 level of significance.

11.35 The following data are a result of an investigation as to the effect of reaction temperature x on percent conversion of a chemical process y. (See Myers, Montgomery and Anderson-Cook, 2009.) Fit a simple linear regression, and use a lack-of-fit test to determine if the model is adequate. Discuss.

Observation   Temperature (°C), x   Conversion (%), y
1             200                   43
2             250                   78
3             200                   69
4             250                   73
5             189.65                48
6             260.35                78
7             225                   65
8             225
...
                                    81

11.36 Transistor gain between emitter and collector in an integrated circuit device (hFE) is related to two variables (Myers, Montgomery and Anderson-Cook, 2009) that can be controlled at the deposition process, emitter drive-in time ($x_1$, in minutes) and emitter dose ($x_2$, in ions × 10^14). Fourteen samples were observed following deposition, and the resulting data are shown in the table below. We will consider linear regression models using gain as the response and emitter drive-in time or emitter dose as the regressor variable.

Obs.   x1 (drive-in time, min)   x2 (dose, ions × 10^14)   y (gain, or hFE)
...
13     255
14     340

(a) Determine if emitter drive-in time influences gain in a linear relationship. That is, test $H_0\colon \beta_1 = 0$, where $\beta_1$ is the slope of the regressor variable.
(b) Do a lack-of-fit test to determine if the linear relationship is adequate. Draw conclusions.
(c) Determine if emitter dose influences gain in a linear relationship. Which regressor variable is the better predictor of gain?

11.37 Organophosphate (OP) compounds are used as pesticides. However, it is important to study their effect on species that are exposed to them. In the laboratory study Some Effects of Organophosphate Pesticides on Wildlife Species, by the Department of Fisheries and Wildlife at Virginia Tech, an experiment was conducted in which different dosages of a particular OP pesticide were administered to 5 groups of 5 mice (Peromyscus leucopus). The 25 mice were females of similar age and condition. One group received no chemical. The basic response y was a measure of activity in the brain. It was postulated that brain activity would decrease with an increase in OP dosage. The data are as follows:

Animal   Dose, x (mg/kg body weight)   Activity, y (moles/liter/min)
...
5        0.0                           9.0
6        2.3                           11.0
7        2.3                           11.3
8        2.3                           9.9
9        2.3                           9.2
10       2.3                           10.1
11       4.6                           10.6
12       4.6                           10.4
13       4.6                           8.8
14       4.6                           11.1
...
17       9.2                           7.8
...

(a) Using the model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, 2, \ldots, 25$, find the least squares estimates of $\beta_0$ and $\beta_1$.
(b) Construct an analysis-of-variance table in which the lack of fit and pure error have been separated. Determine if the lack of fit is significant at the 0.05 level. Interpret the results.

11.38 Heat treating is often used to carburize metal parts such as gears. The thickness of the carburized layer is considered an important feature of the gear, and it contributes to the overall reliability of the part. Because of the critical nature of this feature, a lab test is performed on each furnace load. The test is a destructive one, where an actual part is cross sectioned and soaked in a chemical for a period of time. This test involves running a carbon analysis on the surface of both the gear pitch (top of the gear tooth) and the gear root (between the gear teeth). The data below are the results of the pitch carbon-analysis test for 19 parts.

Soak Time   Pitch
0.58        0.013
0.66        0.016
0.66        0.016
0.66        0.015
1.00        0.014
1.17        0.021
1.17        0.018
1.17        0.019
...

(a) Fit a simple linear regression relating the pitch carbon analysis y against soak time. Test $H_0\colon \beta_1 = 0$.
(b) If the hypothesis in part (a) is rejected, determine if the linear model is adequate.

11.39 A regression model is desired relating temperature and the proportion of impurities passing through solid helium. Temperature is listed in degrees centigrade. The data are as follows:

Temperature (°C)   Proportion of Impurities

(a) Fit a linear regression model.
(b) Does it appear that the proportion of impurities passing through helium increases as the temperature approaches −273 degrees centigrade?
(c) Find $R^2$.
(d) Based on the information above, does the linear model seem appropriate? What additional information would you need to better answer that question?

11.40 It is of interest to study the effect of population size in various cities in the United States on ozone concentrations. The data consist of the 1999 population in millions and the amount of ozone present per hour in ppb (parts per billion). The data are as follows.

Ozone (ppb/hour), y   Population, x
                      0.6
                      4.9
                      0.2
                      0.5
                      1.1
                      0.1
                      1.1
                      2.3
                      2.3

(a) Fit the linear regression model relating ozone concentration to population. Test $H_0\colon \beta_1 = 0$ using the ANOVA approach.
(b) Do a test for lack of fit. Is the linear model appropriate based on the results of your test?
(c) Test the hypothesis of part (a) using the pure mean square error in the F-test. Do the results change? Comment on the advantage of each test.

11.41 Evaluating nitrogen deposition from the atmosphere is a major role of the National Atmospheric Deposition Program (NADP), a partnership of many agencies. NADP is studying atmospheric deposition and its effect on agricultural crops, forest surface waters, and other resources. Nitrogen oxides may affect the ozone in the atmosphere and the amount of pure nitrogen in the air we breathe. The data are as follows:

Nitrogen Oxide
...
3.95
3.14
3.44
3.63
4.50
3.95
5.24
3.30
...

(a) Plot the data.
(b) Fit a linear regression model and find $R^2$.
(c) What can you say about the trend in nitrogen oxide across time?

11.42 For a particular variety of plant, researchers wanted to develop a formula for predicting the quantity of seeds (in grams) as a function of the density of plants. They conducted a study with four levels of the factor x, the number of plants per plot. Four replications were used for each level of x. The data are shown as follows:

Plants per Plot, x   Quantity of Seeds, y (grams)
10                   12.6   11.0   12.1   10.9
20                   15.3   16.1   14.9   15.6
30                   17.9   18.3   18.6   17.8
40                   19.2   19.6   18.9   20.0

Is a simple linear regression model adequate for analyzing this data set?
