
10.2 The Multiple Regression Model

The multiple regression model contains two or more predictor variables to account for the values of a single response variable. Express the fitted value of Y as a function, here linear, of a set of m predictor variables with corresponding regression coefficients b_0, b_1, ..., b_m.

Ŷ = b_0 + b_1 X_1 + b_2 X_2 + ... + b_m X_m

The form of the model remains the same as with the single-predictor models from the previous chapter, just more predictor variables. The estimation procedure is the same least-squares procedure defined for a single-predictor model (Section 9.3, p. 209). This procedure provides the regression coefficients that minimize the sum of the squared residuals. Define the residual for the ith data point as Y_i − Ŷ_i. The lessR function Regression provides the same output as for the single-predictor model: the model estimates, model fit, residuals, and prediction intervals. There are also some new considerations that result from multiple predictor variables that contribute to the output.
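This same least-squares machinery is what base R's lm function implements. The following sketch uses illustrative names (Y, X1, X2, and a data frame mydata) rather than anything from this chapter's data, just to show the quantities involved:

# Minimal base-R sketch of least-squares estimation (illustrative names)
fit <- lm(Y ~ X1 + X2, data=mydata)   # two-predictor model
b   <- coef(fit)                      # b0, b1, b2
res <- residuals(fit)                 # Y_i - Yhat_i for each row of data
sum(res^2)                            # sum of squared residuals, minimized by least squares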

Why add more predictor variables to the model? As discussed in the previous chapter, the analysis of a regression model has two primary purposes:

Ŷ: Forecast unknown values of Y, given the value of each of the predictor variables.

b_j: Relate how the values of the response variable change as the values of the predictor variables change.

purposes of regression analysis: Forecast unknown response values and relate the response variable to the predictors.

These goals of regression analysis remain the same for multiple regression, which typically can better achieve these goals compared to the use of only a single-predictor model.

To benefit from adding new predictors, ideally each new predictor variable adds new, relevant information to the model. That is, choose predictor variables that satisfy the following two conditions.

New Information: A proposed predictor variable is relatively uncorrelated with the predictor variables already in the model.

Relevant Information: A proposed predictor variable correlates with Y.

new predictor variable criterion: The predictor provides new, relevant information.
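A correlation matrix gives a quick, informal screen on both conditions. A sketch with base R, assuming a data frame mydata whose numeric columns are Y and the candidate predictors:

# Informal screen: predictor-Y correlations (relevance) and
# predictor-predictor correlations (redundancy); numeric variables only
round(cor(mydata), 2)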

Each additional predictor variable generally leads to better fit with a decreased sum of the squared residuals. This decrease indicates improved fit with a lower standard deviation of the residuals, s_e (Section 9.3.1, p. 210), and a larger R² (Section 9.3.2, p. 211). Unlike R², R²_adj accounts for the number of predictor variables in the model relative to the overall sample size. As such, R²_adj has the desirable property that adding predictor variables that do not contribute much to the overall reduction in the sum of squared residuals may even indicate worse fit with a lower value.
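One standard form of the adjustment, stated here for reference rather than taken from this section, penalizes R² for the degrees of freedom lost to the m predictors in a sample of n rows of data; as a short R sketch:

# Adjusted R-squared from R-squared, n rows of data, and m predictors
adj_r2 <- function(R2, n, m) 1 - (1 - R2) * (n - 1) / (n - m - 1)

With the values reported later in Listing 10.2 (R² = 0.448, n = 100, m = 3), this function returns roughly 0.43, in line with the Adjusted R-squared reported there.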

A distinguishing characteristic of multiple regression is that each additional predictor variable also changes the size and meaning of the slope coefficients for the remaining predictor variables. The regression coefficient b_j isolates the relation of predictor X_j to response Y with the effects of the remaining predictor variables held constant. The analysis controls the effects of the remaining predictor variables as if their values were identical for all rows of data in the analysis. Each slope coefficient partials out the effects of the other predictor variables in the model, so these slope coefficients are called partial slope coefficients.

partial slope coefficient b_j: Average change in Y for each unit increase in X_j with the values of all other predictors held constant.


Consider the slope coefficient in a simple (one-predictor) model. The slope coefficient indicates the impact of a one-unit increase in the value of the single predictor variable X on the average change in the response variable Y. The extent of this change, however, results from two different sources. First, the coefficient summarizes the effect of any direct causal impact that X has on Y. Second, the coefficient also summarizes the influence of all variables on Y related to X. Change the value of the predictor variable X and all the values of variables correlated with X also change. The changes in all of these related variables also influence Y. The subsequent average change in the value of the response variable, the gross effect of X on Y, reflects the cumulative impact of all of these changes.

gross effect: Effect of a change in the value of the predictor variable and all related variables on the average change of the response variable.

For a multiple regression model, consider the partial slope coefficient for predictor variable X_j. The size of this partial slope coefficient reflects the impact of a change in predictor X_j on Y with the values of the other variables in the model held constant. As a result, the addition of a new predictor variable to a model generally changes the values of the remaining coefficients. As more potential extraneous influences are removed, the average change in Y that results from a change in X_j more closely resembles the direct causal impact of X_j on Y, referred to as a net effect. The extent of a net effect, however, is relative. If all variables correlated with X_j that impact Y are included in the model, then the net effect reflects the direct causal impact of X_j on Y. In reality some of these extraneous variables, perhaps most, are not included in the model and so are not controlled.

net effect: Effect of one variable on the response variable with the values of related variables held constant.

Which interpretation, net effect or gross effect, is preferred? The answer depends on the question asked. Is the focus on the effect of a direct change in X_j on Y with all other variables held constant? Or should this effect be assessed with the other predictors freely varying? If the goal is to understand the direct causal impact of a predictor variable on the response variable, then these other influences should be included in the model as additional predictor variables. If these other variables correlated with the predictor variable do not appear in the model, the corresponding slope coefficient, perhaps inadvertently, includes their impact on the response variable. If the goal, however, is to assess the overall change in the response variable, then the one-predictor model regression coefficient provides the desired estimate.

The control achieved with multiple regression is statistical control. For example, and illustrated later in more detail, consider the relation of primary interest between Gender and Salary. Statistical control allows this relationship to be examined by holding constant variables such as the number of Years employed. When experimental control with random assignment is not possible, statistical control is the next best alternative for isolating direct causal influences (experimental control, Section 6.3.4, p. 138). Unfortunately, not all potential confounding variables can be identified, measured, and then entered into the regression model. Holding the values of the other variables constant is an attempt, usually not entirely successful, to achieve the equivalence attained with experimental control.

statistical control: Multiple regression controls the values of potentially confounding variables.
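In model form, such statistical control is just another multiple regression. A sketch of the kind of call involved, with Salary, Gender, and Years as assumed variable names in a hypothetical employee data frame, not variables from this chapter's data:

# Hypothetical illustration: Gender effect on Salary with Years employed held constant
Regression(Salary ~ Gender + Years)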

10.2.1 Multiple Regression Example

An educational researcher investigates the conditions that contribute to reading success. As part of this (simulated) research project at the end of the school year, 100 students in the same grade were administered a test of their Reading ability, with results on a percentage scale from 0 to 100. Verbal aptitude was measured with a separate test, again on a scale of 0 to 100. Also measured were the number of days Absent from school during the school year and family Income in $1000s. The data are in the file called Reading that is downloaded as part of lessR.

mydata <- Read("Reading", format="lessR")


Gross Effects

Separately regressing the measure of Reading ability on each of the remaining variables demonstrates a statistically significant relation between Reading ability and each predictor variable.
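A sketch of the three separate one-predictor calls that produce these gross-effect estimates, using the mydata data frame read above (output not shown here):

# Gross effects: one simple regression per predictor
Regression(Reading ~ Verbal)
Regression(Reading ~ Absent)
Regression(Reading ~ Income)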

Ŷ = 38.407 + 0.523 X_Verbal,  for H_0: β_Verbal = 0, p-value = 0.000
Ŷ = 83.477 − 2.833 X_Absent,  for H_0: β_Absent = 0, p-value = 0.000
Ŷ = 62.282 + 0.170 X_Income,  for H_0: β_Income = 0, p-value = 0.000

The gross effect for each predictor variable is significant, with the corresponding p-value less than α = 0.05. More specifically, each p-value is approximately equal to 0 to within three decimal digits.

A change in each of the predictor variables leads to a change in Reading ability as expressed by the gross effect for each predictor. In this sample of 100 students, an increase in one point on the Verbal aptitude test yields, on average, an increase of a little more than one-half point on the Reading test. Each additional day of being Absent yields an average decrease of more than 2.8 points on the Reading test. And each additional $1000 of family income results in an average increase of 0.17 points on the Reading test.

Net Effects

There is, however, no control on the estimation of each of these three slope coefficients of their respective gross effects. When one of these predictor variables varies, so do all the variables correlated with the predictor variable, including the values of the other two predictor variables in the study.

Scenario: Statistically control for the effects of other predictor variables
Assess these relationships regarding Reading ability with the imposition of statistical control. Use multiple regression estimates that convey the net effect of each predictor on Reading ability, relative to the other predictor variables.

Only with a control that holds the values of these correlated variables constant can the underlying direct impact of a predictor variable on the response variable be estimated.

lessR Input: Multiple regression
> Regression(Reading ~ Verbal + Absent + Income)

The resulting partial slope coefficients and the associated hypothesis tests and confidence intervals are found in Listing 10.1 .


[Listing 10.1 output not reproduced here: the Estimate, Std Err, t-value, p-value, and lower/upper 95% confidence interval for the intercept and each predictor.]

Listing 10.1 Estimated model coefficients, hypothesis tests, and confidence intervals for the three-predictor model of Reading ability.

The estimated multiple regression model follows.

Ŷ_Reading = 64.437 + 0.204 X_Verbal − 2.043 X_Absent + 0.033 X_Income

The pattern of results of the estimated net effects with multiple regression differs from that obtained with the estimated gross effects from three separate one-predictor regressions. Consider Income, which has an estimated net effect coefficient considerably less than the corresponding estimated gross effect. The gross effect is significant, but the partial slope coefficient for Income is not significant, that is, not statistically distinguishable from zero.

Net effect of Income on Reading:
p-value = 0.364 > α = 0.05, so do not reject H_0: β_Income = 0

For students who have the same Verbal aptitude and are Absent the same number of days of school, there is no detected effect of Income on Reading ability. The net effect of Income cannot be distinguished from zero.

The net effect for Verbal aptitude also drops considerably from the gross effect, to less than half its value, from 0.523 to 0.204. The effect, however, does remain significant.

Net effect of Verbal aptitude on Reading:
p-value = 0.037 < α = 0.05, so reject H_0: β_Verbal = 0

When the effects of the variables Absent and Income are controlled, that is, held constant, a 1-point increase in the Verbal aptitude test score results in an average increase in Reading ability of 0.2 points.

Controlling for Verbal aptitude and family Income only slightly mitigates the effect of days Absent on Reading ability. The estimate of the gross effect is −2.833. The net effect estimate is −2.043.

Net effect of days Absent on Reading:
p-value = 0.000 < α = 0.05, so reject H_0: β_Absent = 0

The effect of days Absent applies for all students with the same Verbal aptitude and family Income. Days Absent appears to be closer to having a direct causal impact on Reading ability than do the other two predictor variables. Days Absent probably directly contributes to less Reading ability because of missed classroom instruction and practice. The variable is also likely a proxy for a more abstract concept, general Motivation. The motivated student is both more likely to attend class and more interested in learning, paying more attention when in class. This additional hypothesis would be tested in a later study that also included measures of Motivation beyond just class attendance.

The previous chapter introduced the standardize option, set to TRUE (Section 9.2.2, p. 207). The effect of this specification is to first standardize each variable before estimating the model. Though not applied here, this option applies to the multiple regression model and can be invoked for any analysis.
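For example, the full three-predictor model could be re-estimated from standardized variables with a call such as the following sketch (not run here):

> Regression(Reading ~ Verbal + Absent + Income, standardize=TRUE)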

Model Fit and Validity

As discussed in the previous chapter, the estimated model coefficients cannot be properly interpreted without several other considerations satisfied. One consideration is the model fit in terms of the size of the residuals. The relevant output appears in Listing 10.2 .

Standard deviation of residuals: 10.19 for 96 degrees of freedom

If normal, the approximate 95% range of residuals about each fitted value
is 2*t-cutoff*10.19, with a 95% interval t-cutoff of 1.985
95% range of variation: 40.46

R-squared: 0.448
Adjusted R-squared: 0.430

Null hypothesis that population R-squared=0
F-statistic: 25.934    df: 3 and 96    p-value: 0.000

Listing 10.2 Model fit in terms of fit indices based on the size of the residuals.

The population R² is apparently larger than zero because the corresponding p-value is equal to 0 within three decimal digits. Its sample value of R² = 0.448 is of moderate size and compares favorably with the value typically obtained in the social and behavioral sciences. The difficulty in precise prediction is also consistent with many other studies. Here the 95% range of variation based on the sample residuals spans over 40 points on a 100-point scale for Reading ability. The sizes of the actual prediction intervals are a little larger, up to a value of 42.85 from the Regression output.
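The 40-point figure follows directly from the residual standard deviation and the t-cutoff reported in Listing 10.2; a quick check in R:

# Approximate 95% range of residual variation about a fitted value
2 * 1.985 * 10.19   # about 40.45; the reported 40.46 uses unrounded values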

In terms of influential data points and outliers there do not appear to be any in this data set. From the Regression output the largest R-student value in magnitude is 2.35 and the largest Cook’s Distance is only 0.12. In terms of assumptions, the residuals are at least approximately normally distributed and, as shown in Figure 10.1 , the plot of the residuals and fitted values does not indicate any noticeable patterns.
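These diagnostics can also be pulled directly from a base-R fit of the same model; a sketch, assuming the mydata data frame from above:

# Influence diagnostics corresponding to the Regression() output
fit <- lm(Reading ~ Verbal + Absent + Income, data=mydata)
max(abs(rstudent(fit)))    # largest R-student residual in magnitude
max(cooks.distance(fit))   # largest Cook's Distance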

In terms of fit, influence, and assumptions the interpretation of the estimated coefficients appears valid.

10.2.2 Collinearity

Ideally, each predictor variable relates to the response variable and contributes new information compared to the remaining predictor variables in the regression model. A potential problem is that two or more highly related predictor variables provide redundant information. When a predictor variable linearly relates to one or more of the remaining predictor variables, the predictor variables are collinear. Little gain in predictive efficiency results from the addition to the model of a new predictor variable that substantially correlates with an existing predictor variable.

collinearity: Property of two or more highly related variables.


Figure 10.1 Diagnostic plot of Residuals vs. Fitted Values. The largest Cook's Distance, 0.12, is highlighted.

Conversely, dropping one of the redundant variables from the model does not substantially diminish predictive accuracy.

Another issue is that multiple regression analysis estimates the net effect of a unit change in each predictor variable on the response variable, with the values of all other predictor variables held constant. The more collinearity, the more difficult it is to separate the effects of the collinear predictor variables from each other. This difficulty in separating effects increases the standard errors of the slope coefficients for the collinear predictor variables. When multiple predictor variables represent different names for essentially the same concept, the standard errors of the estimates are inflated because the effects cannot be easily disentangled from each other. The large standard errors for the individual coefficients preclude statistical significance (null model, Section 9.3.2, p. 211).

A large correlation between two predictor variables results in collinearity. A more general method for detecting collinearity considers the relationship between each predictor variable with all the remaining predictors. To evaluate this relationship, create a regression model in which the jth predictor, X_j, becomes the response variable, and all remaining predictors are the predictor variables. If X_j is collinear with one or more of the remaining predictors, then the resulting R²_j would be large, such as over the somewhat arbitrary but conventional threshold of 0.8.

From this value of R²_j for the jth predictor, define two equivalent indices of collinearity: Tolerance and the variance inflation factor, or VIF.

Tolerance(b_j) = 1 − R²_j    and    VIF(b_j) = 1 / Tolerance(b_j) = 1 / (1 − R²_j)

collinearity indicators of a predictor variable: Tolerance and VIF, each based on the R²_j from regressing the predictor on all others.


Tolerance of the jth predictor variable is R²_j subtracted from 1. So R²_j larger than 0.8 stipulates the same threshold of collinearity for Tolerance less than 0.2. VIF is just the reciprocal of Tolerance, so the same threshold is stipulated when VIF is larger than 1/0.2 = 5. The Regression function reports the values of Tolerance and VIF for each of the predictor variables, here shown in Listing 10.3.

Listing 10.3 Collinearity indicators for each of the three predictor variables.

All the values of Tolerance are well above 0.2 and all the values of VIF, then, well below 5. Collinearity is not an issue for this analysis.
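As a cross-check on the values reported in Listing 10.3, the same indicators can be computed directly in base R; a sketch for the Verbal predictor, assuming the mydata data frame from above:

# Collinearity indicators for Verbal: regress it on the other predictors
r2_j      <- summary(lm(Verbal ~ Absent + Income, data=mydata))$r.squared
tolerance <- 1 - r2_j        # collinearity flagged when below 0.2
vif       <- 1 / tolerance   # collinearity flagged when above 5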

10.2.3 Model Selection

Collinearity occurs when one or more predictor variables do not add new information to the remaining predictor variables. A task related to collinearity analysis is to pare down an initial set of predictor variables to the minimal set of predictors that each provide new and relatively unique information, and also relate to the response variable. For example, if two predictor variables are collinear, one possibility is to remove from the model the collinear predictor that is the least related to the response variable. The task of selecting an appropriate set of predictor variables from a larger initial set of predictor variables submitted to a regression analysis is model selection.

model selection: Choose the minimal set of predictors that achieves satisfactory fit relative to all possible models.

Several different procedures exist for model selection. Perhaps the most useful procedure is all possible subsets, an analysis of the fit of all possible models from a given set of predictor variables. The procedure computes a fit index such as R²_adj (Section 9.3.2, p. 212) for every possible combination of predictor variables in the model. The procedure that provides this analysis is the function leaps written by Lumley and Miller (2009) from the package of the same name. The leaps package is automatically downloaded and installed with the lessR package. The Regression function automatically accesses the leaps function and then displays the modified output by default, as shown in Listing 10.4.

all possible subsets: Model selection procedure that lists the fit of all possible models from the submitted model.

Verbal  Absent  Income   R2adj   X's
   1       1       0     0.431    2
[rows for the remaining candidate models appear in the full output]

Listing 10.4 All possible regression models evaluated in terms of R²_adj.

Each row of the output in Listing 10.4 is for the analysis of a different model. A column with a 1 indicates that the considered model contains the corresponding variable and a 0 indicates the variable is deleted from the model. The R2adj column lists the model's R²_adj, and the X's column lists the number of predictor variables in the corresponding model.
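The same all-possible-subsets analysis can also be run with the leaps package that Regression calls internally; a sketch, with regsubsets as the assumed interface and mydata as the data frame from above:

# All possible subsets with the leaps package (accessed by Regression internally)
library(leaps)
fits <- regsubsets(Reading ~ Verbal + Absent + Income, data=mydata)
summary(fits)$which   # which predictors appear in each candidate model
summary(fits)$adjr2   # adjusted R-squared for each candidate model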

The goal of model selection is parsimony, to select the model that balances the two competing criteria of the best possible fit against the smallest number of predictor variables. If a complex model with many predictor variables achieves slightly better fit in the training data (training sample, Section 9.4, p. 212) than a simpler model, the simpler model is preferred. Models should become more complex only when meaningful gains in fit are obtained and this improved fit extends to new situations with new data. It can be too easy to overfit a complex model to the training data, in which case the improved fit only applies to the sample data from which the model is estimated, with little or no increased predictive power for new data, the test sample.

parsimony: Model selection optimizes both fit and simplicity.

model overfit: The model contains too many predictor variables, and so fits random error.

Each additional predictor variable in practice increments at least slightly the value of R² from the previous model without that predictor. R²_adj typically, but not necessarily, increases for an additional predictor, but add a weak, new predictor variable to the model and R²_adj may actually drop in value because it accounts for the loss of an additional degree of freedom from the model. Each additional predictor variable removes one more degree of freedom.

This drop occurs in this example, in which a two-predictor model of Verbal and Absent has an R²_adj = 0.431. The fit index drops slightly to 0.430 for a model with all three predictors. This evidence indicates that adding Income to the two-predictor model not only does not increase fit, it produces a slight decrease. Dropping Income from the model is also consistent with the non-significant result for the estimate of its slope coefficient.

After Income is dropped from the model, should the model contain Verbal aptitude and days Absent (R²_adj = 0.431) or just days Absent (R²_adj = 0.411)? In terms of the balance between fit and number of predictors, perhaps the model with just Absent as a predictor variable provides the best trade-off. The fit only drops from R²_adj = 0.431 to 0.411, which is not much of a drop for deleting a single predictor variable.

A formal test of the contribution of one or more predictor variables to a model is an hypothesis test of a nested model, with deleted variables, compared to the full model with all of the predictor variables. The test of the nested model against the full model examines the contribution of the variables deleted to obtain the nested model. If the analysis detects no difference between the models, then the deleted variables did not significantly contribute to the model.

nested model: Model with predictor variables that are a subset of the more complete model.

Scenario: Test the effectiveness of specified predictor variables
Create a nested model with the specified predictor variables deleted from the full model. Then compare the two models.

The lessR function Nest accomplishes the test of the difference between the models. To use the function, list the response variable, the subset of the predictor variables for the nested model, and then the larger set of predictor variables in the full model. The Nest function directly compares the residual sum of squares of the two models. The process is similar to the concept underlying R², which compares the residual sum of squares of the full model to the null model (Section 9.3.2, p. 211). Here the result is more general in that the nested model is not as severe as the null model, but still not as complete as the full model.

Nest function: Compare a nested model with the corresponding full model.

The function Nest analyzes the full model first, and then analyzes the data for the nested model with the same data analyzed by the full model. The full model contains more variables, and some of these additional variables not in the nested model may have missing data. If the full data table was submitted to two separate analyses, any cases, rows of data, that only had missing values on the variables not present in the nested model would be deleted from the full model analysis, but not from the nested model analysis. The comparison between models is only valid when the same data are analyzed for both models.
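For reference, base R offers the same residual-sum-of-squares comparison through anova(); a sketch, assuming both models are fit to the same rows of mydata:

# Base-R equivalent of the Nest comparison
reduced <- lm(Reading ~ Absent, data=mydata)
full    <- lm(Reading ~ Verbal + Absent + Income, data=mydata)
anova(reduced, full)   # F test of the predictors dropped from the full model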

Another strategy is to use the R function na.omit to purge all cases from the data frame with at least one missing data value.

na.omit function: Remove all rows of data with any missing data values.

mydata <- na.omit(mydata)

This strategy is too broad, however, because data values may be missing for variables in the data table not present in either of the models. The result is the deletion of too much data, leaving an analysis based on a smaller sample size than necessary.
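A more targeted alternative, sketched here with the variable names from this example, is to restrict na.omit to just the variables that appear in the full model before fitting either model:

# Drop only rows with missing values on the modeled variables
vars   <- c("Reading", "Verbal", "Absent", "Income")
mydata <- na.omit(mydata[, vars])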

Invoke the Nest function to compare a nested model to the full model.

lessR Input: Test a nested regression model against the full model
> Nest(Reading, c(Absent), c(Verbal,Absent,Income))

The result appears in Listing 10.5 .

Model 1: Reading ~ Absent
Model 2: Reading ~ Verbal + Absent + Income

[Listing 10.5 output not reproduced here: Res.Df, RSS, df, SumSq, F, and p-value for the model comparison.]

Listing 10.5 The comparison of a nested model to the more general full model.

From this analysis, dropping both Verbal and Income yields a one-predictor model that is not statistically significantly different from the full three-predictor model.

Effect of both Verbal and Income:
p-value = 0.074 > α = 0.05, so do not reject H_0: β_Verbal = β_Income = 0

The p-value, however, just misses the boundary of α = 0.05. This situation is “backwards” from the typical test of a null hypothesis that leads to detection of a difference by rejecting the null hypothesis. In this situation not rejecting the null hypothesis leads to a conclusion to delete variables. So a lack of power, the lack of the sensitivity of the test to find a difference that truly exists, would lead to the conclusion that there is no difference, and so to deleting variables that do contribute. Given the relatively small sample size of 100, a larger sample may, but not necessarily, lead to a significant difference.

Our conclusion is that, solely in terms of the prediction of Reading ability, the days Absent in this sample of 100 data points is probably sufficient compared to a model with one or both of the remaining predictor variables. This conclusion, however, is tentative. In terms of understanding the relationships, and potentially the causes, of Reading ability or lack thereof, a model with both days Absent and Verbal ability is probably the best model for this situation.

cross-validation: Re-estimate a model on new data to verify that the original estimates generalize to a new sample.


Note, however, that whenever a model is re-specified and then re-analyzed with the same data from which it was originally estimated, the training data, the new model should be cross-validated on a new sample of data, the test data. The smaller the sample size of the training data, the more the need to apply the model to a new data sample, the test data. Models re-specified on small data sets are at risk of overfitting the model to the data, modeling random sampling error instead of components that apply to the population as a whole.

10.2.4 Three-dimensional Scatter Plot

A two-predictor model presents another option available from John Fox’s car package (Fox & Weisberg, 2011), an option that can be referenced directly from the Regression function: an interactive three-dimensional scatter plot and associated regression plane. The vertical axis is for the response variable and the two horizontal axes are for the two predictor variables. The regression plane is the three-dimensional analogue of the two-dimensional line. The line is the linear graph in two dimensions and the plane is the linear graph in three dimensions.

regression plane: Three-dimensional linear surface.

To access the car function for the scatter plot and regression plane from Regression, set the scatter.3D option to TRUE.

scatter.3D option: For a two-predictor model, set to TRUE to specify a three-dimensional scatter plot and regression plane.

lessR Input: 3D scatter plot about the regression plane
> Regression(Reading ~ Verbal + Absent, scatter.3D=TRUE)

The resulting graph appears in Figure 10.2 . When viewed on the computer screen, rotate the entire graph as desired by dragging over an open area with the mouse. The graphics window can be re-sized by dragging one of its corners or edges.

Figure 10.2 Three-dimensional scatter plot provided by the car package.

The rather considerable variation of the data points about the regression plane is apparent

from Figure 10.2. Also apparent is the downward slope of Reading ability as the number of days Absent increases. The Verbal aptitude effect leads to a less pronounced increase in Reading ability, though the angle of the presentation of this graph tends to obscure this effect. Obtain different perspectives by rotating the graph.