
Example 12-14 Wine Quality  Table 12-13 presents data on taste-testing 38 brands of pinot noir wine (the data were first reported in an article by Kwan, Kowalski, and Skogerboe in the Journal of Agricultural and Food Chemistry (1979, Vol. 27), and they also appear as one of the default data sets in the Minitab software package). The response variable is y = quality, and we wish to find the “best” regression equation that relates quality to the other five regressors.

Figure 12-12 is the matrix of scatter plots for the wine quality data. We notice some indications of possible linear relationships between quality and the regressors, but there is no obvious visual impression of which regressors would be appropriate. Table 12-14 lists the all-possible-regressions output from the software. In this analysis, we asked the software to present the best three equations for each subset size. Note that the software reports the values of $R^2$, $R^2_{\text{adj}}$, $C_p$, and $S = \sqrt{MS_E}$ for each model. From Table 12-14 we see that the three-variable equation with $x_2$ = aroma, $x_4$ = flavor, and $x_5$ = oakiness produces the minimum-$C_p$ equation, whereas the four-variable model, which adds $x_1$ = clarity to the previous three regressors, results in the maximum $R^2_{\text{adj}}$ (or minimum $MS_E$). The three-variable model is

$$\hat{y} = 6.47 + 0.580\,x_2 + 1.20\,x_4 - 0.602\,x_5$$

  and the four-variable model is

$$\hat{y} = 4.99 + 1.79\,x_1 + 0.530\,x_2 + 1.26\,x_4 - 0.659\,x_5$$

TABLE 12-13 Wine Quality Data

$x_1$ (Clarity)   $x_2$ (Aroma)   $x_3$ (Body)   $x_4$ (Flavor)   $x_5$ (Oakiness)   $y$ (Quality)
[The 38 rows of taste-test data are not reproduced here.]


FIGURE 12-12  A matrix of scatter plots from computer software for the wine quality data.

TABLE 12-14 All Possible Regressions Computer Output for the Wine Quality Data

  Best Subsets Regression: Quality versus Clarity, Aroma, . . .

  Response is quality

[Best-subsets output not reproduced in full: for each subset size the three best models are listed, with columns R-Sq, R-Sq (adj), C-p, and S, and an X marking which of Clarity, Aroma, Body, Flavor, and Oakiness enter each model.]
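The kind of ranking shown in Table 12-14 is straightforward to reproduce. The sketch below is illustrative rather than any package's actual implementation: it assumes numpy arrays `X` (the 38 × 5 matrix of clarity, aroma, body, flavor, and oakiness scores) and `y` (the quality scores) are already loaded, fits every subset by least squares, and sorts the subsets by $C_p = SS_E(p)/\hat{\sigma}^2 - n + 2p$, where $\hat{\sigma}^2$ is $MS_E$ from the full five-regressor model and $p$ counts the intercept plus the regressors.

```python
import itertools
import numpy as np

def fit_sse(cols, X, y):
    """Least-squares fit with an intercept; return the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    r = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(r @ r)

def best_subsets_cp(X, y):
    """Rank every regressor subset by Mallows' C_p (smaller is better)."""
    n, k = X.shape
    sigma2 = fit_sse(range(k), X, y) / (n - k - 1)   # MS_E of the full model
    ranked = []
    for r in range(1, k + 1):
        for cols in itertools.combinations(range(k), r):
            p = r + 1                                # parameters incl. intercept
            cp = fit_sse(cols, X, y) / sigma2 - n + 2 * p
            ranked.append((cp, cols))
    return sorted(ranked)

# The smallest-C_p entries should match the leading rows of Table 12-14;
# in particular, the (aroma, flavor, oakiness) subset is columns (1, 3, 4).
```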


These models should now be evaluated further using residual plots and the other techniques discussed earlier in the chapter to see whether either model is satisfactory with respect to the underlying assumptions and to determine whether one of them is preferable. It turns out that the residual plots do not reveal any major problems with either model. The value of PRESS for the three-variable model is 56.0524, and for the four-variable model it is 60.3327. Because the three-regressor model has both the smaller PRESS and the smaller number of predictors, it would likely be the preferred choice.
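As discussed earlier in the chapter, PRESS does not require refitting the model $n$ times: the $i$th PRESS residual equals $e_i/(1 - h_{ii})$, where $e_i$ is the ordinary residual and $h_{ii}$ is the $i$th diagonal element of the hat matrix. A minimal numpy sketch of that computation, with illustrative names and the same assumed `X` and `y` arrays as above:

```python
import numpy as np

def press(cols, X, y):
    """PRESS = sum of squared leave-one-out prediction residuals,
    computed from ordinary residuals and hat-matrix diagonals."""
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    H = A @ np.linalg.inv(A.T @ A) @ A.T     # hat matrix
    e = y - H @ y                            # ordinary residuals, e = (I - H) y
    h = np.diag(H)
    return float(np.sum((e / (1.0 - h)) ** 2))

# press([1, 3, 4], X, y)    -> three-variable model (aroma, flavor, oakiness)
# press([0, 1, 3, 4], X, y) -> four-variable model (adds clarity)
# These calls should reproduce the values 56.0524 and 60.3327 quoted above.
```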

Stepwise Regression  Stepwise regression is probably the most widely used variable selection technique. The procedure iteratively constructs a sequence of regression models by adding or removing variables at each step. The criterion for adding or removing a variable at any step is usually expressed in terms of a partial F-test. Let $f_{\text{in}}$ be the value of the F-random variable for adding a variable to the model, and let $f_{\text{out}}$ be the value of the F-random variable for removing a variable from the model. We must have $f_{\text{in}} \ge f_{\text{out}}$, and usually $f_{\text{in}} = f_{\text{out}}$.

Stepwise regression begins by forming a one-variable model using the regressor variable that has the highest correlation with the response variable $Y$. This will also be the regressor producing the largest F-statistic. For example, suppose that at this step, $x_1$ is selected. At the second step, the remaining $K - 1$ candidate variables are examined, and the variable for which the partial F-statistic

$$F_j = \frac{SS_R(\beta_j \mid \beta_1, \beta_0)}{MS_E(x_j, x_1)} \qquad (12\text{-}49)$$

is a maximum is added to the equation, provided that $f_j > f_{\text{in}}$. In Equation 12-49, $MS_E(x_j, x_1)$ denotes the mean square for error for the model containing both $x_1$ and $x_j$. Suppose that this procedure indicates that $x_2$ should be added to the model. Now the stepwise regression algorithm determines whether the variable $x_1$ added at the first step should be removed. This is done by calculating the F-statistic

$$F_1 = \frac{SS_R(\beta_1 \mid \beta_2, \beta_0)}{MS_E(x_1, x_2)} \qquad (12\text{-}50)$$

If the calculated value $f_1 < f_{\text{out}}$, the variable $x_1$ is removed; otherwise it is retained, and we would attempt to add a regressor to the model containing both $x_1$ and $x_2$.
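Equations 12-49 and 12-50 are ordinary extra-sum-of-squares F-tests, so each can be evaluated by fitting the reduced and full models and comparing their residual sums of squares. A minimal sketch with hypothetical helper names, assuming the same `X` and `y` arrays as in the earlier snippets:

```python
import numpy as np

def sse(cols, X, y):
    """Residual sum of squares for a model with an intercept and X[:, cols]."""
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    r = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(r @ r)

def partial_f(j, model, X, y):
    """Partial F-statistic of Eqs. 12-49/12-50 for adding regressor j to 'model'.
    The numerator SS_R(beta_j | rest) is the drop in SS_E when x_j enters;
    the denominator is MS_E of the larger model."""
    full = model + [j]
    ms_e = sse(full, X, y) / (len(y) - len(full) - 1)
    return (sse(model, X, y) - sse(full, X, y)) / ms_e

# Equation 12-50 for this example: should x_1 stay once x_2 is in the model?
# f_1 = partial_f(0, [1], X, y)   # 0-indexed columns: x_1 -> 0, x_2 -> 1
```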

In general, at each step the set of remaining candidate regressors is examined, and the regressor with the largest partial F-statistic is entered, provided that the observed value of $f$ exceeds $f_{\text{in}}$. Then the partial F-statistic for each regressor in the model is calculated, and the regressor with the smallest observed value of $F$ is deleted if the observed $f < f_{\text{out}}$. The procedure continues until no other regressors can be added to or removed from the model.
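The full procedure then fits in a short loop. The sketch below illustrates the logic just described rather than any particular program's implementation; it reuses the hypothetical `partial_f` helper from the previous snippet and takes fixed numerical cutoffs `f_in` and `f_out`:

```python
def stepwise(X, y, f_in=4.0, f_out=4.0):
    """Stepwise selection by partial F-tests (uses partial_f() defined above).
    Requiring f_in >= f_out discourages cycling a variable in and out."""
    k = X.shape[1]
    model, changed = [], True
    while changed:                 # stop when nothing is entered or removed
        changed = False
        # Entry step: add the candidate with the largest partial F-statistic.
        candidates = [j for j in range(k) if j not in model]
        if candidates:
            f_best, j_best = max((partial_f(j, model, X, y), j)
                                 for j in candidates)
            if f_best > f_in:
                model.append(j_best)
                changed = True
        # Removal step: delete the in-model regressor with the smallest
        # partial F-statistic if it falls below f_out.
        if len(model) > 1:
            f_worst, j_worst = min(
                (partial_f(j, [i for i in model if i != j], X, y), j)
                for j in model)
            if f_worst < f_out:
                model.remove(j_worst)
                changed = True
    return sorted(model)
```

The cutoffs here are illustrative; as discussed next, the choice of $f_{\text{in}}$ and $f_{\text{out}}$ is how the analyst controls the procedure.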

Stepwise regression is almost always performed using a computer program. The analyst exercises control over the procedure by the choice of $f_{\text{in}}$ and $f_{\text{out}}$. Some stepwise regression computer programs require that numerical values be specified for $f_{\text{in}}$ and $f_{\text{out}}$. Because the number of degrees of freedom on $MS_E$ depends on the number of variables in the model, which changes from step to step, a fixed value of $f_{\text{in}}$ and $f_{\text{out}}$ causes the type I and type II error rates to vary. Some computer programs allow the analyst to specify the type I error levels for $f_{\text{in}}$ and $f_{\text{out}}$. However, the “advertised” significance level is not the true level, because the variable selected is the one that maximizes (or minimizes) the partial F-statistic at that stage. It is sometimes useful to experiment with different values of $f_{\text{in}}$ and $f_{\text{out}}$ (or different advertised type I error rates) in several runs to see whether this substantially affects the choice of the final model.