10.2 The Multiple Regression Model
The multiple regression model contains two or more predictor variables to account for the values of a single response variable. Express the fitted value of Y as a function, here linear, of a set of m predictor variables with corresponding regression coefficients b_0, b_1, ..., b_m.

Ŷ = b_0 + b_1 X_1 + b_2 X_2 + ... + b_m X_m

The form of the model remains the same as with the single-predictor models from the previous chapter, just with more predictor variables. The estimation procedure is the same least-squares procedure defined for a single-predictor model (Section 9.3, p. 209). This procedure provides the regression coefficients that minimize the sum of the squared residuals, where the residual for the ith data point is Y_i − Ŷ_i. The lessR function Regression provides the same output as for the single-predictor model: the model estimates, model fit, residuals, and prediction intervals. There are also some new considerations that result from multiple predictor variables that contribute to the output.
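The least-squares machinery can be sketched directly. The following Python/numpy sketch uses made-up numbers, not the book's Reading data or the lessR workflow: it builds a design matrix with a leading column of 1s for the intercept, solves for the coefficients that minimize the sum of squared residuals, and computes the fitted values and residuals.

```python
import numpy as np

# Hypothetical data: 6 rows, two predictors, with the response generated
# exactly from known coefficients so the fit can be checked.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = 3.0 + 2.0 * X1 - 1.0 * X2           # b_0=3, b_1=2, b_2=-1

# Design matrix: column of 1s for the intercept, then the predictors
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least squares: coefficients that minimize the sum of squared residuals
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ b                            # fitted values Y-hat
residuals = Y - Y_hat                    # Y_i - Y-hat_i for each row
```

Because the response here is an exact linear function of the predictors, the recovered coefficients match the generating values and every residual is zero; with real data the residuals are nonzero and their squared sum is what the procedure minimizes.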
Why add more predictor variables to the model? As discussed in the previous chapter, the analysis of a regression model has two primary purposes:

Ŷ: Forecast unknown values of Y, given the value of each of the predictor variables

b_j: Relate how the values of the response variable change as the values of the predictor variables change
These goals of regression analysis remain the same for multiple regression, which typically achieves them better than a single-predictor model.
To benefit from adding new predictors, ideally each new predictor variable adds new, relevant information to the model. That is, choose predictor variables that satisfy the following two conditions.

New Information: A proposed predictor variable is relatively uncorrelated with the predictor variables already in the model.

Relevant Information: A proposed predictor variable correlates with Y.
Each additional predictor variable generally leads to better fit with a decreased sum of the squared residuals. This decrease indicates improved fit with a lower standard deviation of the residuals, s_e (Section 9.3.1, p. 210), and a larger R² (Section 9.3.2, p. 211). Unlike R², R²_adj accounts for the number of predictor variables in the model relative to the overall sample size. As such, R²_adj has the desirable property that adding predictor variables that contribute little to the overall reduction in the sum of squared residuals may even lower its value, indicating worse fit.
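The adjustment is a simple function of the sample size n and the number of predictors m: R²_adj = 1 − (1 − R²)(n − 1)/(n − m − 1). As a check, the sketch below plugs in the values reported later in this chapter's Reading example (Listing 10.2: R² = 0.448, n = 100, m = 3) and reproduces the listed adjusted value of about 0.430.

```python
# Adjusted R-squared penalizes each added predictor for the degree of
# freedom it consumes: R2_adj = 1 - (1 - R2) * (n - 1) / (n - m - 1).
def adjusted_r2(r2, n, m):
    """r2: plain R-squared, n: sample size, m: number of predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)

# Values from the chapter's Reading example (Listing 10.2)
r2_adj = adjusted_r2(r2=0.448, n=100, m=3)
```

Note that for the same R², a model with fewer predictors yields a larger R²_adj, which is exactly the penalty that makes the index useful for comparing models of different sizes.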
A distinguishing characteristic of multiple regression is that each additional predictor variable also changes the size and meaning of the slope coefficients for the remaining predictor variables. The regression coefficient b_j isolates the relation of predictor X_j to response Y with the effects of the remaining predictor variables held constant: the average change in Y for each unit increase in X_j with the values of all other predictors held constant. The analysis controls the effects of the remaining predictor variables as if their values were identical for all rows of data in the analysis. Each slope coefficient partials out the effects of the other predictor variables in the model, so these slope coefficients are called partial slope coefficients.
Regression II 225
Consider the slope coefficient in a simple (one-predictor) model. The slope coefficient indicates the impact of a one-unit increase in the value of the single predictor variable X on the average change in the response variable Y. The extent of this change, however, results from two different sources. First, the coefficient summarizes the effect of any direct causal impact that X has on Y. Second, the coefficient also summarizes the influence on Y of all variables related to X. Change the value of the predictor variable X and all the values of variables correlated with X also change. The changes in all of these related variables also influence Y. The subsequent average change in the value of the response variable, the gross effect of X on Y, reflects the cumulative impact of all of these changes.

For a multiple regression model, consider the partial slope coefficient for predictor variable X_j. The size of this partial slope coefficient reflects the impact of a change in predictor X_j on Y with the values of the other variables in the model held constant. As a result, the addition of a new predictor variable to a model generally changes the values of the remaining coefficients. As more potential extraneous influences are removed, the average change in Y that results from a change in X_j more closely resembles the direct causal impact of X_j on Y, referred to as a net effect. The extent of a net effect, however, is relative. If all variables correlated with X_j that impact Y are included in the model, then the net effect reflects the direct causal impact of X_j on Y. In reality some of these extraneous variables, perhaps most, are not included in the model and so are not controlled.

Which interpretation, net effect or gross effect, is preferred? The answer depends on the question asked. Is the focus on the effect of a direct change in X_j on Y with all other variables held constant? Or should this effect be assessed with the other predictors freely varying? If the goal is to understand the direct causal impact of a predictor variable on the response variable, then these other influences should be included in the model as additional predictor variables. If these other variables correlated with the predictor variable do not appear in the model, the corresponding slope coefficient, perhaps inadvertently, includes their impact on the response variable. If the goal, however, is to assess the overall change in the response variable, then the one-predictor regression coefficient provides the desired estimate.
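The difference between a gross and a net effect can be demonstrated numerically. The following Python/numpy sketch uses simulated data (an illustration, not the book's example): X_2 is built to correlate with X_1, and both affect Y. The one-predictor slope for X_1 (gross effect) absorbs X_2's influence; once X_2 enters the model, the partial slope for X_1 (net effect) shrinks toward its direct contribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: x2 is correlated with x1, and both affect y
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # correlated with x1
y  = 1.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

def ols(cols, y):
    """Least-squares coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Gross effect: one-predictor model, x2 left free to vary with x1
gross_b1 = ols([x1], y)[1]

# Net effect: x2 in the model, so its influence is held constant
net_b1 = ols([x1, x2], y)[1]
```

With these generating values the gross slope lands near 1.8 (the direct effect of 1.0 plus 0.8 × x2's effect of 1.0, since x2 moves with x1), while the net slope lands near the direct effect of 1.0.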
The control achieved with multiple regression is statistical control: the analysis holds constant the values of potentially confounding variables. For example, and illustrated later in more detail, consider the relation of primary interest between Gender and Salary. Statistical control allows this relationship to be examined while holding constant variables such as the number of Years employed. When experimental control with random assignment (Section 6.3.4, p. 138) is not possible, statistical control is the next best alternative for isolating direct causal influences. Unfortunately, not all potential confounding variables can be identified, measured, and then entered into the regression model. Holding the values of the other variables constant is an attempt, usually not entirely successful, to achieve the equivalence attained with experimental control.
10.2.1 Multiple Regression Example
An educational researcher investigates the conditions that contribute to reading success. As part of this (simulated) research project at the end of the school year, 100 students in the same grade were administered a test of their Reading ability with results on a percentage scale from
0 to 100. Verbal aptitude was measured with a separate test, again on a scale of 0 to 100. Also measured was the number of days Absent from school during the school year and family Income in $1000s. The data are in the file called Reading that is downloaded as part of lessR .
mydata <- Read("Reading", format="lessR")
Gross Effects
Separately regressing the measure of Reading ability on each of the remaining variables demonstrates a statistically significant relation between Reading ability and each predictor variable.

Ŷ = 38.407 + 0.523 X_Verbal, for H_0: β_Verbal = 0, p-value = 0.000
Ŷ = 83.477 − 2.833 X_Absent, for H_0: β_Absent = 0, p-value = 0.000
Ŷ = 62.282 + 0.170 X_Income, for H_0: β_Income = 0, p-value = 0.000

The gross effect for each predictor variable is significant, with the corresponding p-value less than α = 0.05. More specifically, each p-value is approximately equal to 0 to within three decimal digits.
A change in each of the predictor variables leads to a change in Reading ability as expressed by the gross effect for each predictor. In this sample of 100 students, an increase in one point on the Verbal aptitude test yields, on average, an increase of a little more than one-half point on the Reading test. Each additional day of being Absent yields an average decrease of more than 2.8 points on the Reading test. And each additional $1000 of family income results in an average increase of 0.17 points on the Reading test.
Net Effects
There is, however, no control imposed on the estimation of these three slope coefficients, each a gross effect. When one of these predictor variables varies, so do all the variables correlated with it, including the values of the other two predictor variables in the study.
Scenario: Statistically control for the effects of other predictor variables
Assess these relationships regarding Reading ability with the imposition of statistical control. Use multiple regression estimates that convey the net effect of each predictor on Reading ability, relative to the other predictor variables.
Only with a control that holds the values of these correlated variables constant can the underlying direct impact of a predictor variable on the response variable be estimated.
lessR Input: Multiple regression
> Regression(Reading ~ Verbal + Absent + Income)
The resulting partial slope coefficients and the associated hypothesis tests and confidence intervals are found in Listing 10.1 .
[Listing output: Estimate, Std Err, t-value, p-value, and lower and upper 95% confidence limits for the intercept and each predictor]

Listing 10.1 Estimated model coefficients, hypothesis tests, and confidence intervals for the three-predictor model of Reading ability.
The estimated multiple regression model follows.
Ŷ_Reading = 64.437 + 0.204 X_Verbal − 2.043 X_Absent + 0.033 X_Income
The pattern of results of the estimated net effects with multiple regression differs from that obtained with the estimated gross effects from three separate one-predictor regressions. Consider Income, which has an estimated net effect coefficient considerably less than the corresponding estimated gross effect. The gross effect is significant, but the partial slope coefficient for Income is not significant: it is not statistically distinguishable from zero.

Net effect of Income on Reading:
p-value = 0.364 > α = 0.05, so do not reject H_0: β_Income = 0

For students who have the same Verbal aptitude and are Absent the same number of days of school, there is no detected effect of Income on Reading ability. The net effect of Income cannot be distinguished from zero.
The net effect of Verbal aptitude also drops considerably from the gross effect, to less than half: from 0.523 to 0.204. The effect, however, remains significant.

Net effect of Verbal aptitude on Reading:
p-value = 0.037 < α = 0.05, so reject H_0: β_Verbal = 0

When the effects of the variables Absent and Income are controlled, that is, held constant, a 1-point increase in the Verbal aptitude test score results in an average increase in Reading ability of 0.2 points.
Controlling for Verbal aptitude and family Income only slightly mitigates the effect of days Absent on Reading ability. The estimate of the gross effect is −2.833; the net effect estimate is −2.043.

Net effect of days Absent on Reading:
p-value = 0.000 < α = 0.05, so reject H_0: β_Absent = 0
The effect of days Absent applies for all students with the same Verbal aptitude and family Income. Days Absent appears closer to having a direct causal impact on Reading ability than do the other two predictor variables. Days Absent probably directly contributes to lower Reading ability because of missed classroom instruction and practice. The variable is also likely a proxy for a more abstract concept, general Motivation: the motivated student is both less likely to miss class and more interested in learning, attending class and paying more attention when in class. This additional hypothesis could be tested in a later study that also included measures of Motivation beyond just class attendance.
The previous chapter introduced the standardize option (Section 9.2.2, p. 207), set to TRUE. The effect of this specification is to first standardize each variable before estimating the model. Though not applied here, this option also applies to the multiple regression model and can be invoked for any analysis.
Model Fit and Validity
As discussed in the previous chapter, the estimated model coefficients cannot be properly interpreted without several other considerations satisfied. One consideration is the model fit in terms of the size of the residuals. The relevant output appears in Listing 10.2 .
Standard deviation of residuals: 10.19 for 96 degrees of freedom

If normal, the approximate 95% range of residuals about each
fitted value is 2*t-cutoff*10.19, with a 95% interval t-cutoff
of 1.985
95% range of variation: 40.46

R-squared: 0.448
Adjusted R-squared: 0.430

Null hypothesis that population R-squared=0
F-statistic: 25.934
df: 3 and 96
p-value: 0.000

Listing 10.2 Model fit in terms of fit indices based on the size of the residuals.
The population R² is apparently larger than zero because the corresponding p-value equals 0 within three decimal digits. Its sample value of R² = 0.448 is of moderate size and compares favorably with the values typically obtained in the social and behavioral sciences. The difficulty of precise prediction is also consistent with many other studies. Here the 95% range of variation based on the sample residuals spans over 40 points on a 100-point scale of Reading ability. The sizes of the actual prediction intervals are a little larger, up to a value of 42.85 from the Regression output.
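The 95% range of variation in Listing 10.2 is simple arithmetic: twice the t-cutoff times the standard deviation of the residuals. Using the rounded values displayed in the listing (t = 1.985 for 96 degrees of freedom, s_e = 10.19), the sketch below reproduces the listed figure of 40.46 up to rounding of the displayed inputs.

```python
# 95% range of residual variation = 2 * (t cutoff) * s_e,
# computed from the rounded values shown in Listing 10.2
t_cutoff = 1.985   # two-sided 95% cutoff for df = 96
s_e = 10.19        # standard deviation of residuals
range_95 = 2 * t_cutoff * s_e
```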
In terms of influential data points and outliers there do not appear to be any in this data set. From the Regression output the largest R-student value in magnitude is 2.35 and the largest Cook’s Distance is only 0.12. In terms of assumptions, the residuals are at least approximately normally distributed and, as shown in Figure 10.1 , the plot of the residuals and fitted values does not indicate any noticeable patterns.
In terms of fit, influence, and assumptions the interpretation of the estimated coefficients appears valid.
10.2.2 Collinearity
Ideally, each predictor variable relates to the response variable and contributes new information compared to the remaining predictor variables in the regression model. A potential problem is that two or more highly related predictor variables provide redundant information. When a predictor variable linearly relates to one or more of the remaining predictor variables, the predictor variables are collinear. Little gain in predictive efficiency results from adding to the model a new predictor variable that substantially correlates with an existing predictor variable.
Figure 10.1 Diagnostic plot of Residuals vs. Fitted Values. The largest Cook's Distance, 0.12, is highlighted.
Conversely, dropping one of the redundant variables from the model does not substantially diminish predictive accuracy.
Another issue is that multiple regression analysis estimates the net effect of a unit change in each predictor variable on the response variable, with the values of all other predictor variables held constant. The more collinearity, the more difficult it is to separate the effects of the collinear predictor variables from each other. This difficulty in separating effects increases the standard errors of the slope coefficients for the collinear predictor variables. When multiple predictor variables represent different names for essentially the same concept, the standard errors of the estimates are inflated because the effects cannot be easily disentangled from each other. The large standard errors for the individual coefficients preclude statistical significance.
A large correlation between two predictor variables results in collinearity. A more general method for detecting collinearity considers the relationship between each predictor variable and all the remaining predictors. To evaluate this relationship, create a regression model in which the jth predictor, X_j, becomes the response variable and all remaining predictors are the predictor variables. If X_j is collinear with one or more of the remaining predictors, then the resulting R²_j would be large, such as over the somewhat arbitrary but conventional threshold of 0.8.

From this value of R²_j for the jth predictor, define two equivalent indices of collinearity, Tolerance and the variance inflation factor, or VIF:

Tolerance(b_j) = 1 − R²_j

VIF(b_j) = 1 / Tolerance = 1 / (1 − R²_j)
Tolerance of the jth predictor variable is R²_j subtracted from 1, so R²_j larger than 0.8 stipulates the same collinearity threshold as Tolerance less than 0.2. VIF is just the reciprocal of Tolerance, so the same threshold is stipulated when VIF is larger than 1/0.2 = 5. The Regression function reports the values of Tolerance and VIF for each of the predictor variables, here shown in Listing 10.3.
Listing 10.3 Collinearity indicators for each of the three predictor variables.
All the values of Tolerance are well above 0.2, and all the values of VIF are accordingly well below 5. Collinearity is not an issue for this analysis.
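The Tolerance and VIF computation can be sketched from first principles: regress each predictor on the remaining predictors, take R²_j from that fit, and apply the two formulas. The Python/numpy sketch below uses made-up predictors (not the Reading data), with one predictor deliberately constructed to be nearly a copy of another so the collinearity diagnostics flag it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical predictors: p3 is nearly a copy of p1, so it is collinear
p1 = rng.normal(size=n)
p2 = rng.normal(size=n)
p3 = p1 + rng.normal(scale=0.1, size=n)

def r2_of_regression(y, X):
    """R-squared from regressing y on the columns of X (with intercept)."""
    X = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    total = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / total

def tolerance_and_vif(predictors, j):
    """Regress predictor j on the remaining predictors to get R2_j."""
    others = [p for i, p in enumerate(predictors) if i != j]
    r2_j = r2_of_regression(predictors[j], np.column_stack(others))
    tol = 1.0 - r2_j          # Tolerance = 1 - R2_j
    return tol, 1.0 / tol     # VIF = 1 / Tolerance

tol1, vif1 = tolerance_and_vif([p1, p2, p3], 0)  # collinear with p3
tol2, vif2 = tolerance_and_vif([p1, p2, p3], 1)  # unrelated to the others
```

For p1 the Tolerance falls well below the 0.2 threshold (VIF well above 5) because p3 nearly duplicates it, while the unrelated p2 has Tolerance near 1.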
10.2.3 Model Selection
Collinearity occurs when one or more predictor variables do not add new information beyond the remaining predictor variables. A task related to collinearity analysis is to pare down an initial set of predictor variables to the minimal set of predictors that each provide new and relatively unique information, and that also relate to the response variable. For example, if two predictor variables are collinear, one possibility is to remove from the model the collinear predictor that is least related to the response variable. The task of selecting an appropriate set of predictor variables from a larger initial set submitted to a regression analysis is model selection.

Several different procedures exist for model selection. Perhaps the most useful is all possible subsets, an analysis of the fit of all possible models from a given set of predictor variables. The procedure computes a fit index such as R²_adj (Section 9.3.2, p. 212) for every possible combination of predictor variables in the model. The function that provides this analysis is leaps, written by Lumley and Miller (2009), from the package of the same name. The leaps package is automatically downloaded and installed with the lessR package. The Regression function automatically accesses the leaps function and then displays the modified output by default, as shown in Listing 10.4.
Verbal Absent Income  R2adj  X's
     1      1      0  0.431    2

Listing 10.4 All possible regression models evaluated in terms of R²_adj.
Each row of the output in Listing 10.4 represents the analysis of a different model. A column with a 1 indicates that the considered model contains the corresponding variable, and a 0 indicates the variable is deleted from the model. The R2adj column lists the model's R²_adj, and the X's column lists the number of predictor variables in the corresponding model.
The goal of model selection is parsimony: select the model that best balances the two competing criteria of good fit and a small number of predictor variables. If a complex model with many predictor variables achieves only slightly better fit in the training data (Section 9.4, p. 212) than a simpler model, the simpler model is preferred. Models should become more complex only when meaningful gains in fit are obtained and this improved fit extends to new situations with new data. It is too easy to overfit a complex model to the training data, in which case the improved fit only applies to the sample data from which the model was estimated, with little or no increased predictive power for new data, the test sample.

Each additional predictor variable in practice increments at least slightly the value of R² from the previous model without that predictor. R²_adj typically, but not necessarily, increases for an additional predictor; add a weak new predictor variable to the model and R²_adj may actually drop in value because it accounts for the loss of an additional degree of freedom from the model. Each additional predictor variable removes one more degree of freedom.
This drop occurs in this example, in which the two-predictor model of Verbal and Absent has R²_adj = 0.431. The fit index drops slightly to 0.430 for the model with all three predictors. This evidence indicates that adding Income to the two-predictor model not only does not increase fit, it slightly decreases fit. Dropping Income from the model is also consistent with the non-significant result for the estimate of its slope coefficient.

After Income is dropped from the model, should the model contain Verbal aptitude and days Absent (R²_adj = 0.431) or just days Absent (R²_adj = 0.411)? In terms of the balance between fit and number of predictors, perhaps the model with just Absent as a predictor variable provides the best trade-off. The fit only drops from R²_adj = 0.431 to 0.411, which is not much of a drop for deleting a single predictor variable.
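The all-possible-subsets idea is straightforward to sketch without leaps. The Python/numpy sketch below uses simulated stand-ins for the three predictors (hypothetical data, not the book's Reading file): only the first two carry real signal, mimicking the chapter's pattern. It enumerates every non-empty subset of predictors and records each model's R²_adj.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n = 100

# Hypothetical stand-ins for the three predictors; only the first
# two contribute real signal to the simulated response
verbal = rng.normal(size=n)
absent = rng.normal(size=n)
income = rng.normal(size=n)
reading = 1.0 * verbal - 2.0 * absent + rng.normal(scale=2.0, size=n)

predictors = {"Verbal": verbal, "Absent": absent, "Income": income}

def adj_r2(y, cols):
    """Adjusted R-squared of a least-squares fit with intercept."""
    X = np.column_stack([np.ones(n)] + cols)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    m = len(cols)
    return 1 - (1 - r2) * (n - 1) / (n - m - 1)

# Evaluate every non-empty subset of predictors, as leaps does
results = {}
for k in range(1, len(predictors) + 1):
    for names in combinations(predictors, k):
        results[names] = adj_r2(reading, [predictors[nm] for nm in names])

best = max(results, key=results.get)
```

With three candidate predictors there are 2³ − 1 = 7 models to compare, and the best subset by R²_adj contains the two signal-bearing predictors, matching the logic the chapter applies to Listing 10.4.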
A formal test of the contribution of one or more predictor variables to a model is a hypothesis test of a nested model, with the variables deleted, compared to the full model with all of the predictor variables: the predictor variables of the nested model are a subset of those of the full model. The test of the nested model against the full model examines the contribution of the variables deleted to obtain the nested model. If the analysis detects no difference between the models, then the deleted variables did not significantly contribute to the model.
Scenario: Test the effectiveness of specified predictor variables
Create a nested model with the specified predictor variables deleted from the full model. Then compare the two models.
The lessR function Nest accomplishes the test of the difference between the models. To use the function, list the response variable, the subset of the predictor variables for the nested model, and then the larger set of predictor variables in the full model. The Nest function directly compares the residual sums of squares of the two models. The process is similar to the concept underlying R², which compares the residual sum of squares of the full model to that of the null model (Section 9.3.2, p. 211). Here the result is more general in that the nested model is not as severe as the null model, but still not as complete as the full model.

The function Nest analyzes the full model first, and then analyzes the nested model with the same data analyzed by the full model. The full model contains more variables, and some of these additional variables not in the nested model may have missing data. If the full data table were submitted to two separate analyses, any cases (rows of data) that only had missing values on the variables not present in the nested model would be deleted from the full-model analysis but not from the nested-model analysis. The comparison between models is only valid when the same data are analyzed for both models.
Another strategy is to use the R function na.omit to purge from the data frame all cases with at least one missing data value.

mydata <- na.omit(mydata)

This strategy is too broad, however, because data values may be missing for variables in the data table that are present in neither model. The result is the deletion of too much data, leaving an analysis based on a smaller sample size than necessary.
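The listwise-deletion behavior of na.omit can be sketched outside of R. The following Python/numpy sketch (hypothetical numbers, not the Reading data) keeps only the rows of a data table with no missing values, where missing values are coded as NaN.

```python
import numpy as np

# Hypothetical data table: 4 rows, 3 variables, with two missing values
data = np.array([
    [70.0, 3.0,    50.0],
    [65.0, np.nan, 40.0],
    [80.0, 1.0,    np.nan],
    [55.0, 8.0,    30.0],
])

# Analogue of R's na.omit: drop any row that has at least one NaN
complete = data[~np.isnan(data).any(axis=1)]
```

Only the first and last rows survive, which illustrates the chapter's caution: a missing value in any column removes the whole row, even if that column appears in neither model being compared.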
Invoke the Nest function to compare a nested model to the full model.
lessR Input: Test a nested regression model against the full model
> Nest(Reading, c(Absent), c(Verbal,Absent,Income))
The result appears in Listing 10.5 .
Model 1: Reading ~ Absent
Model 2: Reading ~ Verbal + Absent + Income

[Listing output: Res.Df, RSS, df, SumSq, F, and p-value for the model comparison]

Listing 10.5 The comparison of a nested model to the more general full model.
From this analysis, dropping both Verbal and Income yields a model that is not statistically significantly different from the one-predictor model.

Effect of both Verbal and Income:
p-value = 0.074 > α = 0.05, so do not reject H_0: β_Verbal = β_Income = 0

The p-value, however, just misses the boundary of α = 0.05. This situation is "backwards" from the typical test of a null hypothesis, which leads to detection of a difference by rejecting the null hypothesis. In this situation not rejecting the null hypothesis leads to a conclusion to delete variables. So a lack of power, the lack of sensitivity of the test to find a difference that truly exists, would lead to the conclusion that there is no difference, and so to deleting predictor variables that do contribute to the model. Given the relatively small sample size of 100, a larger sample may, but not necessarily, lead to a significant difference.
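The statistic behind the Nest comparison is an F-ratio built from the two residual sums of squares: F = ((RSS_nested − RSS_full)/d) / (RSS_full/(n − m_full − 1)), where d is the number of deleted predictors. The Python/numpy sketch below computes it on simulated data (an illustration, not a reproduction of Listing 10.5) where only one predictor truly matters and two candidates are dropped.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Hypothetical data: only x1 truly matters; x2 and x3 are candidates to drop
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)

def rss(y, cols):
    """Residual sum of squares of a least-squares fit with intercept."""
    X = np.column_stack([np.ones(n)] + cols)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

rss_nested = rss(y, [x1])            # nested model: y ~ x1
rss_full = rss(y, [x1, x2, x3])      # full model:   y ~ x1 + x2 + x3

df_diff = 2                          # two predictors were deleted
df_full = n - 3 - 1                  # n - m - 1 for the full model
F = ((rss_nested - rss_full) / df_diff) / (rss_full / df_full)
```

Because the full model can never fit worse than a model nested within it, RSS_full ≤ RSS_nested and F ≥ 0; a small F (large p-value) supports keeping the simpler nested model, which is the logic applied to the Reading example above.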
Our conclusion is that, solely in terms of the prediction of Reading ability, days Absent in this sample of 100 data points is probably sufficient compared to a model with one or both of the remaining predictor variables. This conclusion, however, is tentative. In terms of understanding the relationships, and potentially the causes, of Reading ability or lack thereof, a model with both days Absent and Verbal aptitude is probably the best model for this situation.
Note, however, that whenever a model is re-specified and then re-analyzed with the same data from which it was originally estimated, the training data, the new model should be cross-validated on a new sample of data, the test data: re-estimate the model on new data to verify that the original estimates generalize. The smaller the sample size of the training data, the greater the need to apply the model to a new data sample. Models re-specified on small data sets are at risk of overfitting the model to the data, modeling random sampling error instead of components that apply to the population as a whole.
10.2.4 Three-dimensional Scatter Plot
A two-predictor model presents another option, available from John Fox's car package (Fox & Weisberg, 2011) and referenced directly from the Regression function: an interactive three-dimensional scatter plot and associated regression plane. The vertical axis is for the response variable and the two horizontal axes are for the two predictor variables. The regression plane is the three-dimensional analogue of the two-dimensional regression line: the line is the linear graph in two dimensions and the plane is the linear graph in three dimensions.
To access the car function for the scatter plot and regression plane from Regression, set the scatter.3D option to TRUE for a two-predictor model.

lessR Input: 3D scatter plot about the regression plane
> Regression(Reading ~ Verbal + Absent, scatter.3D=TRUE)
The resulting graph appears in Figure 10.2. When viewed on the computer screen, rotate the entire graph as desired by dragging over an open area with the mouse. The graphics window can be re-sized by dragging one of its corners or edges.

Figure 10.2 Three-dimensional scatter plot provided by the car package.

The rather considerable variation of the data points about the regression plane is apparent from Figure 10.2. Also apparent is the downward slope of Reading ability as the number of days Absent increases. The Verbal aptitude effect leads to a less pronounced increase in Reading ability, though the angle of presentation of this graph tends to obscure this effect. Obtain different perspectives by rotating the graph.