Example 12-14  Wine Quality

Table 12-13 presents data on taste-testing 38 brands of pinot noir wine (the data were first reported in an article by Kwan, Kowalski, and Skogenboe in the Journal of Agricultural and Food Chemistry (1979, Vol. 27), and the data set also appears as one of the default data sets in the Minitab software package). The response variable is y = quality, and we wish to find the "best" regression equation that relates quality to the other five parameters.
Figure 12-12 is the matrix of scatter plots for the wine quality data. We notice some indications of possible linear relationships between quality and the regressors, but there is no obvious visual impression of which regressors would be appropriate. Table 12-14 lists the all-possible-regressions output from the software. In this analysis, we asked the computer software to present the best three equations for each subset size. Note that the software reports the values of R^2, R^2_adj, C_p, and S = √MS_E for each model. From Table 12-14 we see that the three-variable equation with x_2 = aroma, x_4 = flavor, and x_5 = oakiness produces the minimum-C_p equation, whereas the four-variable model, which adds x_1 = clarity to the previous three regressors, results in the maximum R^2_adj (or minimum MS_E). The three-variable model is
    ŷ = 6.47 + 0.580 x_2 + 1.20 x_4 − 0.602 x_5
and the four-variable model is
    ŷ = 4.99 + 1.79 x_1 + 0.530 x_2 + 1.26 x_4 − 0.659 x_5
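As a rough sketch of how an all-possible-regressions table such as Table 12-14 is produced, the following Python fragment enumerates every subset of the regressors and computes R^2_adj and C_p for each. The data here are synthetic stand-ins (the wine data are not reproduced in this section), so the particular subsets selected are illustrative only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data: five regressors, response driven by x_2, x_4, x_5
n, k = 38, 5
X = rng.normal(size=(n, k))
y = 6.5 + 0.6 * X[:, 1] + 1.2 * X[:, 3] - 0.6 * X[:, 4] \
    + rng.normal(scale=0.8, size=n)

def fit_sse(cols):
    """Least-squares fit on the chosen columns (plus intercept); returns SSE."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

sst = float(((y - y.mean()) ** 2).sum())
sigma2_full = fit_sse(list(range(k))) / (n - k - 1)   # MS_E of the full model

rows = []
for size in range(1, k + 1):
    for cols in itertools.combinations(range(k), size):
        sse = fit_sse(list(cols))
        p = size + 1                                   # parameters incl. intercept
        r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))
        cp = sse / sigma2_full - (n - 2 * p)           # Mallows' C_p
        rows.append((cols, r2_adj, cp))

best = min(rows, key=lambda r: r[2])                   # smallest C_p
print("min C_p subset:", best[0], "C_p =", round(best[2], 2))
```

Note that, by construction, the full model always has C_p = k + 1, so C_p is useful only for comparing subsets against each other.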
Table 12-13  Wine Quality Data
(Columns: x_1 = Clarity, x_2 = Aroma, x_3 = Body, x_4 = Flavor, x_5 = Oakiness, y = Quality; the 38 rows of data values are not reproduced here.)
FIGURE 12-12  A matrix of scatter plots from computer software for the wine quality data.
Table 12-14  All Possible Regressions Computer Output for the Wine Quality Data
Best Subsets Regression: Quality versus Clarity, Aroma, . . .
Response is quality
(For the best three subsets of each size, the output lists R-Sq, R-Sq (adj), C_p, and S, with X marks indicating which of the five regressors appear in each model; the numerical entries are not reproduced here.)
These models should now be evaluated further using residual plots and the other techniques discussed earlier in the chapter to see whether either model is satisfactory with respect to the underlying assumptions and to determine whether one of them is preferable. It turns out that the residual plots do not reveal any major problems with either model. The value of PRESS for the three-variable model is 56.0524, and for the four-variable model, it is 60.3327. Because PRESS is smaller in the model with three regressors, and because it is the model with the smallest number of predictors, it would likely be the preferred choice.
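The PRESS values quoted above can be computed without actually refitting the model n times, because each leave-one-out residual equals e_i/(1 − h_ii), where h_ii is the i-th diagonal element (leverage) of the hat matrix. A minimal numpy sketch of this shortcut, run on synthetic data since the wine data are not reproduced here:

```python
import numpy as np

def press_statistic(X, y):
    """PRESS via the leverage shortcut: each leave-one-out residual
    equals e_i / (1 - h_ii), so no model refitting is required."""
    A = np.column_stack([np.ones(len(y)), X])    # add intercept column
    H = A @ np.linalg.pinv(A.T @ A) @ A.T        # hat matrix
    e = y - H @ y                                # ordinary residuals
    h = np.diag(H)                               # leverages h_ii
    return float(np.sum((e / (1 - h)) ** 2))

# Tiny illustrative data set (hypothetical, not the wine data)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 1.0 + X @ np.array([0.5, 1.2, -0.6]) + rng.normal(scale=0.3, size=20)
print("PRESS =", round(press_statistic(X, y), 4))
```

To compare two candidate models, as in the example above, one would simply call `press_statistic` with each model's regressor columns and prefer the smaller value.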
Stepwise Regression

Stepwise regression is probably the most widely used variable selection technique. The procedure iteratively constructs a sequence of regression models by adding or removing variables at each step. The criterion for adding or removing a variable at any step is usually expressed in terms of a partial F-test. Let f_in be the value of the F-random variable for adding a variable to the model, and let f_out be the value of the F-random variable for removing a variable from the model. We must have f_in ≥ f_out, and usually f_in = f_out.
Stepwise regression begins by forming a one-variable model using the regressor variable that has the highest correlation with the response variable Y. This will also be the regressor producing the largest F-statistic. For example, suppose that at this step, x_1 is selected. At the second step, the remaining K − 1 candidate variables are examined, and the variable for which the partial F-statistic
    F_j = SS_R(β_j | β_1, β_0) / MS_E(x_j, x_1)        (12-49)

is a maximum is added to the equation, provided that f_j > f_in. In Equation 12-49, MS_E(x_j, x_1) denotes the mean square for error for the model containing both x_1 and x_j. Suppose that this
procedure indicates that x_2 should be added to the model. Now the stepwise regression algorithm determines whether the variable x_1 added at the first step should be removed. This is
done by calculating the F-statistic
    F_1 = SS_R(β_1 | β_2, β_0) / MS_E(x_1, x_2)        (12-50)
If the calculated value f_1 < f_out, the variable x_1 is removed; otherwise it is retained, and we would attempt to add a regressor to the model containing both x_1 and x_2.
In general, at each step the set of remaining candidate regressors is examined, and the regressor with the largest partial F-statistic is entered, provided that the observed value of f exceeds f_in. Then the partial F-statistic for each regressor in the model is calculated, and the regressor with the smallest observed value of f is deleted if the observed f < f_out. The procedure continues until no other regressors can be added to or removed from the model.
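The add/remove logic just described can be sketched in Python. This is a minimal illustration on synthetic data with f_in = f_out = 4.0; the function names and data are hypothetical, not taken from any particular software package's implementation.

```python
import numpy as np

def sse(X_cols, y):
    """SSE of a least-squares fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + X_cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def stepwise(X, y, f_in=4.0, f_out=4.0):
    """Forward/backward stepwise selection driven by partial F statistics."""
    n, k = X.shape
    in_model = []
    for _ in range(2 * k + 2):             # cap iterations to guard against cycling
        changed = False
        # Forward step: add the candidate with the largest partial F, if > f_in.
        candidates = [j for j in range(k) if j not in in_model]
        if candidates:
            sse_cur = sse([X[:, c] for c in in_model], y)
            best_j, best_f = None, -np.inf
            for j in candidates:
                trial = in_model + [j]
                sse_new = sse([X[:, c] for c in trial], y)
                ms_e = sse_new / (n - len(trial) - 1)
                f = (sse_cur - sse_new) / ms_e        # partial F for adding x_j
                if f > best_f:
                    best_j, best_f = j, f
            if best_f > f_in:
                in_model.append(best_j)
                changed = True
        # Backward step: drop the in-model regressor with the smallest
        # partial F, if that F falls below f_out.
        if len(in_model) > 1:
            sse_cur = sse([X[:, c] for c in in_model], y)
            ms_e = sse_cur / (n - len(in_model) - 1)
            worst_j, worst_f = None, np.inf
            for j in in_model:
                reduced = [c for c in in_model if c != j]
                f = (sse([X[:, c] for c in reduced], y) - sse_cur) / ms_e
                if f < worst_f:
                    worst_j, worst_f = j, f
            if worst_f < f_out:
                in_model.remove(worst_j)
                changed = True
        if not changed:
            break
    return sorted(in_model)

# Synthetic data: only x_1, x_3, x_4 (0-based indices) influence y.
rng = np.random.default_rng(2)
X = rng.normal(size=(38, 5))
y = 6.5 + 0.6 * X[:, 1] + 1.2 * X[:, 3] - 0.6 * X[:, 4] \
    + rng.normal(scale=0.5, size=38)
print("selected regressors:", stepwise(X, y))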
Stepwise regression is almost always performed using a computer program. The analyst exercises control over the procedure by the choice of f_in and f_out. Some stepwise regression computer programs require that numerical values be specified for f_in and f_out. Because the number of degrees of freedom on MS_E depends on the number of variables in the model, which changes from step to step, a fixed value of f_in and f_out causes the type I and type II error rates to vary. Some computer programs allow the analyst to specify the type I error levels for f_in and f_out. However, the "advertised" significance level is not the true level, because the variable selected is the one that maximizes (or minimizes) the partial F-statistic at that stage. Sometimes it is useful to experiment with different values of f_in and f_out (or different advertised type I error rates) in several different runs to see whether this substantially affects the choice of the final model.