12.11 Cross Validation, $C_p$, and Other Criteria for Model Selection

For many regression problems, the experimenter must choose among various alternative models or model forms that are developed from the same data set. Quite often, the model that best predicts or estimates mean response is required. The experimenter should take into account the relative sizes of the $s^2$-values for the candidate models and certainly the general nature of the confidence intervals on the mean response. One must also consider how well the model predicts response values that were not used in building the candidate models. The models should be subjected to cross validation. What are required, then, are cross-validation errors rather than fitting errors. Such errors in prediction are the PRESS residuals

$$\delta_i = y_i - \hat{y}_{i,-i}, \qquad i = 1, 2, \ldots, n,$$

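where $\hat{y}_{i,-i}$ is the prediction of the $i$th data point by a model that did not make use of the $i$th point in the calculation of the coefficients. These PRESS residuals are calculated from the formula

$$\delta_i = \frac{e_i}{1 - h_{ii}}, \qquad i = 1, 2, \ldots, n,$$

where $e_i$ is the ordinary residual and $h_{ii}$ is the $i$th HAT diagonal value.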

(The derivation can be found in Myers, 1990.)
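The shortcut formula can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original text: it computes the PRESS residuals from the HAT diagonals and compares them with the residuals obtained by actually refitting the model $n$ times, each time leaving out one observation. The data are synthetic, and the function names `press_residuals` and `loo_residuals` are illustrative assumptions.

```python
import numpy as np

def press_residuals(X, y):
    """PRESS residuals e_i / (1 - h_ii) from a single least squares fit.

    X is the n x p model matrix and must already contain a column of ones
    if the model has an intercept.
    """
    # HAT matrix H = X (X'X)^{-1} X', written with the pseudoinverse.
    H = X @ np.linalg.pinv(X)
    e = y - H @ y                      # ordinary residuals
    return e / (1.0 - np.diag(H))

def loo_residuals(X, y):
    """The same quantities by brute force: refit n times, omitting point i."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ b       # prediction error for the held-out point
    return out

# Synthetic data (an assumption for illustration; not the punting data).
rng = np.random.default_rng(0)
n = 13
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x[:, 0] - 0.3 * x[:, 1] + rng.normal(scale=0.2, size=n)

delta_fast = press_residuals(X, y)
delta_slow = loo_residuals(X, y)
print(np.allclose(delta_fast, delta_slow))
```

The practical point is that cross validation of this leave-one-out kind costs only one fit, since the $n$ refits are never needed once the HAT diagonals are available.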

Use of the PRESS Statistic

The motivation for PRESS and the utility of PRESS residuals are very simple to understand. The purpose of extracting or setting aside data points one at a time is to allow the use of separate methodologies for fitting and assessment of a specific model. For assessment of a model, the “$-i$” indicates that the PRESS residual gives a prediction error where the observation being predicted is independent of the model fit. Criteria that make use of the PRESS residuals are given by

$$\sum_{i=1}^{n} |\delta_i| \qquad \text{and} \qquad \text{PRESS} = \sum_{i=1}^{n} \delta_i^2.$$

(The term PRESS is an acronym for prediction sum of squares.) We suggest that both of these criteria be used. It is possible for PRESS to be dominated by one or only a few large PRESS residuals. Clearly, the criterion on $\sum_{i=1}^{n} |\delta_i|$ is less sensitive to a small number of large values. In addition to the PRESS statistic itself, the analyst can simply compute an $R^2$-like statistic reflecting prediction performance. The statistic is often called $R^2_{\text{pred}}$ and is given as follows:

$R^2$ of Prediction: Given a fitted model with a specific value for PRESS, $R^2_{\text{pred}}$ is given by

$$R^2_{\text{pred}} = 1 - \frac{\text{PRESS}}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$

Note that $R^2_{\text{pred}}$ is merely the ordinary $R^2$ statistic with SSE replaced by the PRESS statistic. In the following case study, an illustration is provided in which many candidate models are fitted to a set of data and the best model is chosen. The sequential procedures described in Section 12.9 are not used. Rather, the role of the PRESS residuals and other statistical values in selecting the best regression equation is illustrated.
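As an illustration of how the three criteria relate, the sketch below (Python with NumPy, synthetic data, and a hypothetical helper named `press_criteria`) computes the sum of absolute PRESS residuals, PRESS, the ordinary $R^2$, and $R^2_{\text{pred}}$ from one fit. It is an assumption-laden example rather than the output of any particular statistics package.

```python
import numpy as np

def press_criteria(X, y):
    """Return sum|delta_i|, PRESS, R^2, and R^2_pred for a least squares fit.

    X is the n x p model matrix, including the column of ones.
    """
    H = X @ np.linalg.pinv(X)              # HAT matrix
    e = y - H @ y                          # ordinary residuals
    delta = e / (1.0 - np.diag(H))         # PRESS residuals
    sst = float(np.sum((y - y.mean()) ** 2))
    sse = float(e @ e)
    press = float(delta @ delta)
    return {
        "sum_abs_delta": float(np.abs(delta).sum()),
        "PRESS": press,
        "R2": 1.0 - sse / sst,
        "R2_pred": 1.0 - press / sst,      # SSE replaced by PRESS
    }

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(2)
n = 20
x = rng.normal(size=(n, 3))
X = np.column_stack([np.ones(n), x])
y = 2.0 + x @ np.array([1.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=n)

crit = press_criteria(X, y)
for name, value in crit.items():
    print(f"{name:14s} {value:.4f}")
```

Because each $h_{ii}$ lies between 0 and 1, every PRESS residual is at least as large in magnitude as the corresponding ordinary residual, so PRESS is never smaller than SSE and $R^2_{\text{pred}}$ never exceeds $R^2$.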

Case Study 12.2: Football Punting: Leg strength is a necessary characteristic of a successful punter in American football. One measure of the quality of a good punt is the “hang time.” This is the time that the ball hangs in the air before being caught by the punt returner. To determine what leg strength factors influence hang time and to develop an empirical model for predicting this response, a study on The Relationship Between Selected Physical Performance Variables and Football Punting Ability was conducted by the Department of Health, Physical Education, and Recreation at Virginia Tech. Thirteen punters were chosen for the experiment, and each punted a football 10 times. The average hang times, along with the strength measures used in the analysis, were recorded in Table 12.12.

Each regressor variable is defined as follows:

1. RLS, right leg strength (pounds)

2. LLS, left leg strength (pounds)

3. RHF, right hamstring muscle flexibility (degrees)

4. LHF, left hamstring muscle flexibility (degrees)

5. Power, overall leg strength (foot-pounds)

Determine the most appropriate model for predicting hang time.

Table 12.12: Data for Case Study 12.2

Punter   Hang Time, y (sec)   RLS   LLS   RHF   LHF   Power
…        …                    …     …     95    95    240.57

Solution: In the search for the best of the candidate models for predicting hang time, the

information in Table 12.13 was obtained from a regression computer package. The models are ranked in ascending order of the values of the PRESS statistic. This display provides enough information on all possible models to enable the user to eliminate from consideration all but a few models. The model containing $x_2$ and $x_5$ (LLS and Power), denoted by $x_2x_5$, appears to be superior for predicting punter hang time. Also note that all models with low PRESS, low $s^2$, low $\sum_{i=1}^{n} |\delta_i|$, and high $R^2$-values contain these two variables. In order to gain some insight from the residuals of the fitted regression

$$\hat{y}_i = b_0 + b_2 x_{2i} + b_5 x_{5i},$$

the residuals and PRESS residuals were generated. The actual prediction model (see Exercise 12.47 on page 494) is given by

$$\hat{y} = 1.10765 + 0.01370x_2 + 0.00429x_5.$$

Residuals, HAT diagonal values, and PRESS values are listed in Table 12.14. Note the relatively good fit of the two-variable regression model to the data. The PRESS residuals reflect the capability of the regression equation to predict hang time if independent predictions were to be made. For example, for punter number 4, the hang time of 4.180 would encounter a prediction error of 0.039 if the model constructed by using the remaining 12 punters were used. For this model, the average prediction error or cross-validation error is

$$\frac{1}{13}\sum_{i=1}^{13} |\delta_i| = 0.1489 \text{ second},$$

which is small compared to the average hang time for the 13 punters.

Table 12.13: Comparing Different Regression Models

Model                     $s^2$       $\sum|\delta_i|$   PRESS   $R^2$
$x_2x_5$                  0.036907
$x_1x_2x_5$               0.041001
$x_2x_4x_5$               0.037708
$x_2x_3x_5$               0.039636
$x_1x_2x_4x_5$            0.042265
$x_1x_2x_3x_5$            0.044578
$x_2x_3x_4x_5$            0.042421
$x_1x_3x_5$               0.053664
$x_1x_4x_5$               0.056279
$x_1x_5$                  0.059621
$x_2x_3$                  0.056153
$x_1x_3$                  0.059400
$x_1x_2x_3x_4x_5$         0.048302
$x_3x_5$                  0.065678
$x_1x_2$                  0.068402
$x_1x_3x_4$               0.065414
$x_2x_3x_4$               0.062082
$x_2x_4$                  0.063744
$x_1x_2x_3$               0.059670
$x_3x_4$                  0.080605
$x_1x_4$                  0.069965
$x_1x_3x_4x_5$            0.059169
$x_1x_2x_4$               0.064143
$x_3x_4x_5$               0.072505
$x_1x_2x_3x_4$            0.066088
$x_4x_5$                  0.105648

We indicated in Section 12.9 that the use of all possible subset regressions is often advisable when searching for the best model. Most commercial statistics software packages contain an all possible regressions routine. These algorithms compute various criteria for all subsets of model terms. Obviously, criteria such as $R^2$, $s^2$, and PRESS are reasonable for choosing among candidate subsets. Another very popular and useful statistic, particularly for areas in the physical sciences and engineering, is the $C_p$ statistic, described below.
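To make the idea of an all possible regressions routine concrete, here is a minimal sketch in Python with NumPy. It is not the algorithm of any commercial package: it simply enumerates every nonempty subset of the regressors, fits each by least squares, and ranks the models in ascending order of PRESS, as in Table 12.13. The data are synthetic (13 observations and 5 regressors, chosen only to mirror the size of the case study), and the names `fit_summary` and `all_subsets` are assumptions.

```python
import itertools
import numpy as np

def fit_summary(X, y):
    """s^2, sum|delta_i|, PRESS, and R^2 for one candidate model matrix X."""
    n, p = X.shape
    H = X @ np.linalg.pinv(X)              # HAT matrix
    e = y - H @ y                          # ordinary residuals
    delta = e / (1.0 - np.diag(H))         # PRESS residuals
    sse = float(e @ e)
    sst = float(np.sum((y - y.mean()) ** 2))
    return {
        "s2": sse / (n - p),
        "sum_abs_delta": float(np.abs(delta).sum()),
        "PRESS": float(delta @ delta),
        "R2": 1.0 - sse / sst,
    }

def all_subsets(x, y):
    """Fit every nonempty subset of the columns of x; sort by PRESS."""
    n, k = x.shape
    results = []
    for size in range(1, k + 1):
        for cols in itertools.combinations(range(k), size):
            X = np.column_stack([np.ones(n), x[:, list(cols)]])
            label = " ".join(f"x{j + 1}" for j in cols)
            results.append((label, fit_summary(X, y)))
    return sorted(results, key=lambda item: item[1]["PRESS"])

# Synthetic data: only x2 and x5 truly affect the response (an assumption).
rng = np.random.default_rng(1)
n, k = 13, 5
x = rng.normal(size=(n, k))
y = 4.0 + 0.3 * x[:, 1] + 0.4 * x[:, 4] + rng.normal(scale=0.2, size=n)

ranked = all_subsets(x, y)
for label, s in ranked[:5]:
    print(f"{label:16s} s2={s['s2']:.4f}  sum|d|={s['sum_abs_delta']:.4f}  "
          f"PRESS={s['PRESS']:.4f}  R2={s['R2']:.4f}")
```

With five regressors the enumeration produces $2^5 - 1 = 31$ models. Exhaustive enumeration grows exponentially in the number of regressors, so this direct approach is practical only for a modest number of candidate variables.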


Table 12.14: PRESS Residuals

The $C_p$ Statistic

Quite often, the choice of the most appropriate model involves many considerations. Obviously, the number of model terms is important; the matter of parsimony is a consideration that cannot be ignored. On the other hand, the analyst cannot be pleased with a model that is too simple, to the point where there is serious underspecification. A single statistic that represents a nice compromise in this regard is the $C_p$ statistic. (See Mallows, 1973, in the Bibliography.)

The $C_p$ statistic appeals nicely to common sense and is developed from considerations of the proper compromise between excessive bias incurred when one underfits (chooses too few model terms) and excessive prediction variance produced when one overfits (has redundancies in the model). The $C_p$ statistic is a simple function of the total number of parameters in the candidate model and the mean square error $s^2$.

We will not present the entire development of the $C_p$ statistic. (For details, the reader is referred to Myers, 1990, in the Bibliography.) The $C_p$ for a particular subset model is an estimate of the following:

$$\Gamma_{(p)} = \frac{1}{\sigma^2}\sum_{i=1}^{n} \text{Var}(\hat{y}_i) + \frac{1}{\sigma^2}\sum_{i=1}^{n} (\text{Bias } \hat{y}_i)^2.$$

It turns out that under the standard least squares assumptions indicated earlier in this chapter, and assuming that the “true” model is the model containing all candidate variables,

$$\frac{1}{\sigma^2}\sum_{i=1}^{n} \text{Var}(\hat{y}_i) = p \qquad \text{(number of parameters in the candidate model)}$$

(see Review Exercise 12.63) and an unbiased estimate of

$$\frac{1}{\sigma^2}\sum_{i=1}^{n} (\text{Bias } \hat{y}_i)^2 \qquad \text{is given by} \qquad \frac{1}{\sigma^2}\sum_{i=1}^{n} (\widehat{\text{Bias}}\ \hat{y}_i)^2 = \frac{(s^2 - \sigma^2)(n - p)}{\sigma^2}.$$

In the above, $s^2$ is the mean square error for the candidate model and $\sigma^2$ is the population error variance. Thus, if we assume that some estimate $\hat{\sigma}^2$ is available for $\sigma^2$, $C_p$ is given by the following equation:

$C_p$ Statistic

$$C_p = p + \frac{(s^2 - \hat{\sigma}^2)(n - p)}{\hat{\sigma}^2},$$

where $p$ is the number of model parameters, $s^2$ is the mean square error for the candidate model, and $\hat{\sigma}^2$ is an estimate of $\sigma^2$.

Obviously, the scientist should adopt models with small values of $C_p$. The reader should note that, unlike the PRESS statistic, $C_p$ is scale-free. In addition, one can gain some insight concerning the adequacy of a candidate model by observing its value of $C_p$. For example, $C_p > p$ indicates a model that is biased due to being an underfitted model, whereas $C_p \approx p$ indicates a reasonable model.

There is often confusion concerning where $\hat{\sigma}^2$ comes from in the formula for $C_p$. Obviously, the scientist or engineer does not have access to the population quantity $\sigma^2$. In applications where replicated runs are available, say in an experimental design situation, a model-independent estimate of $\sigma^2$ is available (see Chapters 11 and 15). However, most software packages use $\hat{\sigma}^2$ as the mean square error from the most complete model. Obviously, if this is not a good estimate, the bias portion of the $C_p$ statistic can be negative. Thus, $C_p$ can be less than $p$.

Example 12.12: Consider the data set in Table 12.15, in which a maker of asphalt shingles is interested in the relationship between sales for a particular year and factors that influence sales. (The data were taken from Kutner et al., 2004, in the Bibliography.)

Of the possible subset models, three are of particular interest. These three are $x_2x_3$, $x_1x_2x_3$, and $x_1x_2x_3x_4$. The following represents pertinent information for comparing the three models. We include the PRESS statistics for the three models to supplement the decision making.

Model              $R^2$     $R^2_{\text{pred}}$   $s^2$      PRESS      $C_p$
$x_2x_3$           0.9940    0.9913                44.5552    782.1896   11.4013
$x_1x_2x_3$        0.9970    0.9928                24.7956    643.3578   3.4075
$x_1x_2x_3x_4$     0.9971    0.9917                26.2073    741.7557   5.0

It seems clear from the information in the table that the model $x_1, x_2, x_3$ is preferable to the other two. Notice that, for the full model, $C_p = 5.0$. This occurs since the bias portion is zero, and $\hat{\sigma}^2 = 26.2073$ is the mean square error from the full model. Figure 12.6 is a SAS PROC REG printout showing information for all possible regressions. Here we are able to show comparisons of other models with $(x_1, x_2, x_3)$. Note that $(x_1, x_2, x_3)$ appears to be quite good when compared to all models. As a final check on the model $(x_1, x_2, x_3)$, Figure 12.7 shows a normal probability plot of the residuals for this model.
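The $C_p$ values quoted in this example can be reproduced directly from the formula. The short Python sketch below uses the $s^2$ values from the comparison of the three models and takes $\hat{\sigma}^2 = 26.2073$ from the full model. The sample size $n = 15$ is an assumption here, since the data table itself is not reproduced; it is the value for which the formula returns the reported $C_p$ figures. The function name `mallows_cp` is illustrative.

```python
def mallows_cp(s2, sigma2_hat, n, p):
    """C_p = p + (s^2 - sigma_hat^2)(n - p) / sigma_hat^2."""
    return p + (s2 - sigma2_hat) * (n - p) / sigma2_hat

n = 15                 # assumed number of observations (see lead-in)
sigma2_hat = 26.2073   # mean square error of the full model x1 x2 x3 x4

# model label -> (p, s^2); p counts the intercept plus the regressors.
candidates = {
    "x2 x3": (3, 44.5552),
    "x1 x2 x3": (4, 24.7956),
    "x1 x2 x3 x4": (5, 26.2073),
}

cp = {name: mallows_cp(s2, sigma2_hat, n, p)
      for name, (p, s2) in candidates.items()}

for name, value in cp.items():
    print(f"{name:12s} C_p = {value:.4f}")
```

For the full model the second term vanishes because $s^2 = \hat{\sigma}^2$, which is why $C_p$ equals $p = 5$ there regardless of $n$. For $x_1x_2x_3$ the candidate's $s^2$ is smaller than $\hat{\sigma}^2$, so the estimated bias portion is negative and $C_p$ falls below $p = 4$.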


Table 12.15: Data for Example 12.12

District   Promotional Accounts, $x_1$   Active Accounts, $x_2$   Competing Brands, $x_3$   Potential, $x_4$   Sales, $y$ (thousands)

Dependent Variable: sales

Number in                        Adjusted
  Model     C(p)    R-Square     R-Square           MSE    Variables in Model
                                               24.79560    x1 x2 x3
                                               26.20728    x1 x2 x3 x4
                                               44.55518    x2 x3
                                               48.54787    x2 x3 x4
                                             2526.96144    x1 x3 x4
                                             2384.14286    x3 x4
                                             2673.83349    x1 x3
                                             3956.75275    x1 x2 x4
                                             3663.99357    x1 x2
                                             3699.64814    x2 x4
                                             6603.45109    x1 x4

Figure 12.6: SAS printout of all possible subsets on sales data for Example 12.12.

[Plot: Sample Quantiles against Theoretical Quantiles]

Figure 12.7: Normal probability plot of residuals using the model $x_1x_2x_3$ for Example 12.12.

Exercises

12.47 Consider the “hang time” punting data given in Case Study 12.2, using only the variables $x_2$ and $x_5$.

(a) Verify the regression equation shown on page 489.

(b) Predict punter hang time for a punter with LLS = 180 pounds and Power = 260 foot-pounds.

(c) Construct a 95% confidence interval for the mean hang time of a punter with LLS = 180 pounds and Power = 260 foot-pounds.

12.48 For the data of Exercise 12.15 on page 452, use the techniques of

(a) forward selection with a 0.05 level of significance to choose a linear regression model;

(b) backward elimination with a 0.05 level of significance to choose a linear regression model;

(c) stepwise regression with a 0.05 level of significance to choose a linear regression model.

12.49 Use the techniques of backward elimination with $\alpha = 0.05$ to choose a prediction equation for the data of Table 12.8.

12.50 For the punter data in Case Study 12.2, an additional response, “punting distance,” was also recorded. The average distance values for each of the 13 punters are given.

(a) Using the distance data rather than the hang times, estimate a multiple linear regression model of the type

$$\mu_{Y|x_1,x_2,x_3,x_4,x_5} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$$

for predicting punting distance.

(b) Use stepwise regression with a significance level of 0.10 to select a combination of variables.

(c) Generate values for $s^2$, $R^2$, PRESS, and $\sum_{i=1}^{13} |\delta_i|$ for the entire set of 31 models. Use this information to determine the best combination of variables for predicting punting distance.

(d) For the final model you choose, plot the standardized residuals against $Y$ and do a normal probability plot of the ordinary residuals. Comment.

Punter   Distance, y (ft)
1        162.50
2        144.00
3        147.50
4        163.50
5        192.00
6        171.75
7        162.00
8        104.93
9        105.67
10       117.59
11       140.25
12       150.17
13       165.16

12.51 The following is a set of data for $y$, the amount of money (in thousands of dollars) contributed to the alumni association at Virginia Tech by the Class of 1960, and $x$, the number of years following graduation:

y: 812.52, 1211.50, 1348.00, 2567.50, 2526.50, …
x: …, 12, 13, 15, 16, 17

(a) Fit a regression model of the type

$$\mu_{Y|x} = \beta_0 + \beta_1 x.$$

(b) Fit a quadratic model of the type

$$\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_{11} x^2.$$

(c) Determine which of the models in (a) or (b) is preferable. Use $s^2$, $R^2$, and the PRESS residuals to support your decision.

12.52 For the model of Exercise 12.50(a), test the hypothesis

Use a P-value in your conclusion.

12.53 For the quadratic model of Exercise 12.51(b), give estimates of the variances and covariances of the estimates of $\beta_1$ and $\beta_{11}$.

12.54 A client from the Department of Mechanical Engineering approached the Consulting Center at Virginia Tech for help in analyzing an experiment dealing with gas turbine engines. The voltage output of engines was measured at various combinations of blade speed and sensor extension.

y (volts)   Speed, $x_1$ (in./sec)   Extension, $x_2$ (in.)
1.95        6336
2.50        7099
2.93        8026
1.69        6230
1.55        6522
1.94        7310

(a) Fit a multiple linear regression to the data.

(b) Compute t-tests on coefficients. Give P-values.

(c) Comment on the quality of the fitted model.

12.55 Rayon whiteness is an important factor for scientists dealing in fabric quality. Whiteness is affected by pulp quality and other processing variables. Some of the variables include acid bath temperature, °C ($x_1$); cascade acid concentration, % ($x_2$); water temperature, °C ($x_3$); sulfide concentration, % ($x_4$); amount of chlorine bleach, lb/min ($x_5$); and blanket finish temperature, °C ($x_6$). A set of data from rayon specimens is given here. The response, $y$, is the measure of whiteness.

y      $x_1$   $x_2$    $x_3$   $x_4$    $x_5$    $x_6$
88.7   43      0.211    85      0.243    0.606    48
89.3   42      0.604    89      0.237    0.600    55
83.4   52      0.370    93      0.198    0.485    54
47.3   51      0.702    86      0.198    0.478    63
87.9   43      0.525    85      0.199    0.437    63
90.3   45      0.486    84      0.189    0.499    58

(a) Use the criteria MSE, $C_p$, and PRESS to find the “best” model from among all subset models.

(b) Plot standardized residuals against $Y$ and do a normal probability plot of residuals for the “best” model. Comment.

12.56 In an effort to model executive compensation for the year 1979, 33 firms were selected, and data were gathered on compensation, sales, profits, and employment. The following data were gathered for the year 1979.

Firm   Compensation, $y$ (thousands)   Sales, $x_1$ (millions)   Profits, $x_2$ (millions)   Employment, $x_3$

Consider the model

(a) Fit the regression with the model above.

(b) Is a model with a subset of the variables preferable to the full model?

12.57 The pull strength of a wire bond is an important characteristic. The following data give information on pull strength $y$, die height $x_1$, post height $x_2$, loop height $x_3$, wire length $x_4$, bond width on the die $x_5$, and bond width on the post $x_6$. (From Myers, Montgomery, and Anderson-Cook, 2009.)

(a) Fit a regression model using all independent variables.

(b) Use stepwise regression with input significance level 0.25 and removal significance level 0.05. Give your final model.

(c) Use all possible regression models and compute $R^2$, $C_p$, $s^2$, and adjusted $R^2$ for all models.

(d) Give the final model.

(e) For your model in part (d), plot studentized residuals (or R-Student) and comment.

12.58 For Exercise 12.57, test $H_0\colon \beta_1 = \beta_6 = 0$. Give P-values and comment.

12.59 In Exercise 12.28, page 462, we have the following data concerning wear of a bearing:

y (wear)   $x_1$ (oil viscosity)   $x_2$ (load)

(a) The following model may be considered to describe the data:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_{12} x_{1i} x_{2i} + \epsilon_i,$$

for $i = 1, 2, \ldots, 6$. The $x_1x_2$ is an “interaction” term. Fit this model and estimate the parameters.

(b) Use the models $(x_1)$, $(x_1, x_2)$, $(x_2)$, $(x_1, x_2, x_1x_2)$ and compute PRESS, $C_p$, and $s^2$ to determine the “best” model.
