
7.2 Multiple Regression

7.2.1 General Linear Regression Model

Assuming the existence of p − 1 predictor variables, the general linear regression model is the direct generalisation of 7.1:

$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i = \sum_{k=0}^{p-1} \beta_k x_{ik} + \varepsilon_i$,  7.34

with $x_{i0} = 1$. In the following we always consider normal regression models with i.i.d. errors $\varepsilon_i \sim N(0, \sigma)$. Note that:

– The general linear regression model implies that the observations are independent normal variables.

– When the $x_{ik}$ represent values of different predictor variables, the model is called a first-order model, in which there are no interaction effects between the predictor variables.

– The general linear regression model also encompasses qualitative predictors. For example:

$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$.  7.35

$x_{i1}$ = patient's weight;
$x_{i2}$ = 1 if the patient is female, 0 if the patient is male.

Patient is male: $Y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i$.

Patient is female: $Y_i = (\beta_0 + \beta_2) + \beta_1 x_{i1} + \varepsilon_i$.

Multiple linear regression can be performed with SPSS, STATISTICA, MATLAB and R with the same commands and functions listed in Commands 7.1.
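As a brief illustration of how a model such as 7.35 can be fitted in practice, the following R sketch uses an assumed data frame patients with columns y (the response), weight and sex; these names are illustrative only and are not taken from any dataset used in this book:

# model 7.35: weight plus a 0/1 dummy variable for sex (illustrative names)
patients$female <- as.numeric(patients$sex == "female")  # x_i2: 1 if female, 0 if male
fit <- lm(y ~ weight + female, data = patients)
summary(fit)  # the coefficient of 'female' estimates beta_2, the intercept shift for female patients
# equivalently, lm(y ~ weight + sex, data = patients) creates the dummy variable automatically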

7.2.2 General Linear Regression in Matrix Terms

In order to understand the computations performed to fit the general linear regression model to the data, it is convenient to study the normal equations 7.3 in matrix form.

We start by expressing the general linear model (generalisation of 7.1) in matrix terms as:

$y = X\beta + \varepsilon$,  7.36


where:

– y is an n × 1 matrix (i.e., a column vector) of the observations of the predicted (dependent) variable;
– X is an n × p matrix of the p − 1 predictor values plus a bias (of value 1) for the n predictor levels;
– β is a p × 1 matrix of the coefficients;
– ε is an n × 1 matrix of the errors.

For instance, the multiple regression expressed by formula 7.35 is represented as follows in matrix form, assuming n = 3 predictor levels:

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{bmatrix}.$$

We assume, as for the simple regression, that the errors are i.i.d. with zero mean and equal variance:

$V[\varepsilon] = \sigma^2 I$.

Thus: $E[y] = X\beta$.

The least squares estimation of the coefficients starts by computing the total error:

$E = \sum_i \varepsilon_i^2 = \varepsilon'\varepsilon = (y - Xb)'(y - Xb) = y'y - (X'y)'b - b'X'y + b'X'Xb$.  7.37

Next, the error is minimised by setting to zero its derivatives with respect to the coefficients, obtaining the normal equations in matrix terms:

$b = (X'X)^{-1}X'y = X^*y$,  7.38

where X * is the so-called pseudo-inverse matrix of X. The fitted values can now be computed as:

$\hat{y} = Xb$.

Note that this formula, using the predictors and the estimated coefficients, can also be expressed in terms of the predictors and the observations, substituting the vector of the coefficients given in 7.38.
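A minimal R sketch of formula 7.38, assuming X is an n × p design matrix whose first column is a column of ones and y is the n × 1 vector of observations (both assumed to exist in the workspace):

b    <- solve(t(X) %*% X, t(X) %*% y)  # b = (X'X)^{-1} X'y, solution of the normal equations
yhat <- X %*% b                        # fitted values, y^ = Xb
coef(lm(y ~ X - 1))                    # lm (via a QR decomposition) returns the same coefficients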

Let us consider the normal equations:

$b = (X'X)^{-1}X'Y$.


For the standardised model (i.e., using standardised variables) we have:

$$X'X = r_{xx} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1,p-1} \\ r_{21} & 1 & \cdots & r_{2,p-1} \\ \vdots & \vdots & & \vdots \\ r_{p-1,1} & r_{p-1,2} & \cdots & 1 \end{bmatrix};  \quad 7.39$$

$$X'y = r_{yx} = \begin{bmatrix} r_{y1} \\ r_{y2} \\ \vdots \\ r_{y,p-1} \end{bmatrix}.  \quad 7.40$$

Hence:

$$b = \begin{bmatrix} b_1 \\ \vdots \\ b_{p-1} \end{bmatrix} = r_{xx}^{-1}\, r_{yx},  \quad 7.41$$

where b is the vector containing the point estimates of the beta coefficients (compare with formula 7.7 in section 7.1.2), r xx is the symmetric matrix of the predictor correlations (see A.8.2) and r yx is the vector of the correlations between Y and each of the predictor variables.

Example 7.10

Q: Consider the following six cases of the Foetal Weight dataset:

Variable    Case #1    Case #2    Case #3    Case #4    Case #5    Case #6

Determine the beta coefficients of the linear regression of FW (foetal weight in grams) depending on CP (cephalic perimeter in mm) and AP (abdominal perimeter in mm), performing the computations expressed by formula 7.41.

A: We can use MATLAB function corrcoef or appropriate SPSS, STATISTICA and R commands to compute the correlation coefficients. Using MATLAB and denoting by fw the matrix containing the above data with cases along the rows and variables along the columns, we obtain:

» c = corrcoef(fw(:,:))
c =


We now apply formula 7.41 as follows:

» rxx = c(1:2,1:2); ryx = c(1:2,3);
» b = inv(rxx)*ryx
b =
    0.3847
    0.5151

These are also the values obtained with SPSS, STATISTICA and R. It is interesting to note that the beta coefficients for the 414 cases of the Foetal Weight dataset are 0.3 and 0.64 respectively.
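An equivalent R computation of formula 7.41, assuming fw is a numeric matrix laid out as in the MATLAB example (predictors CP and AP in the first two columns, FW in the third):

c_mat <- cor(fw)          # correlation matrix of CP, AP and FW
rxx   <- c_mat[1:2, 1:2]  # correlations among the predictors
ryx   <- c_mat[1:2, 3]    # correlations of each predictor with FW
solve(rxx, ryx)           # beta coefficients (formula 7.41)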

Example 7.11

Q: Determine the multiple linear regression coefficients of the previous example.

A: Since the beta coefficients are the regression coefficients of the standardised model, we have:

$b_1 = 0.3847\, \dfrac{s_{FW}}{s_{CP}} = 181.44$;

$b_2 = 0.5151\, \dfrac{s_{FW}}{s_{AP}} = 135.99$.

These computations can be easily carried out in MATLAB or R. For instance, in MATLAB b2 is computed as b2=0.5151*std(fw(:,3))/std(fw(:,2)). The same values can of course be obtained with the commands listed in Commands 7.1.
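The corresponding R one-liners, under the same assumptions about the fw matrix layout:

b1 <- 0.3847 * sd(fw[, 3]) / sd(fw[, 1])  # raw coefficient of CP
b2 <- 0.5151 * sd(fw[, 3]) / sd(fw[, 2])  # raw coefficient of AP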

7.2.3 Multiple Correlation

Let us go back to the R-square statistic described in section 7.1, which represented the square of the correlation between the independent and the dependent variables. It also happens that it represents the square of the correlation between the dependent and predicted variable, i.e., the square of:


$$r = \frac{\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_i (y_i - \bar{y})^2 \sum_i (\hat{y}_i - \bar{\hat{y}})^2}}.  \quad 7.42$$

In multiple regression this quantity represents the correlation between the dependent variable and the predicted variable explained by all the predictors; it is therefore appropriately called the multiple correlation coefficient. For p − 1 predictors we will denote this quantity as $r_{Y|X_1,\ldots,X_{p-1}}$.

Figure 7.6. The regression linear model (plane) describing FW as a function of (CP,AP) using the dataset of Example 7.10. The observations are the solid balls. The predicted values are the open balls (lying on the plane). The multiple correlation corresponds to the correlation between the observed and predicted values.

Example 7.12

Q: Compute the multiple correlation coefficient for the Example 7.10 regression, using formula 7.42.

A: In MATLAB the computations can be carried out with the matrix fw of Example 7.10 as follows:

» fw = [fw(:,1) ones(6,1) fw(:,2:3)];
» [b,bint,r,rint,stats] = regress(fw(:,1),fw(:,2:4));
» y = fw(:,1); ystar = y - r;
» corrcoef(y,ystar)
ans =


The first line includes the intercept term (a column of ones) in the fw matrix in order to compute a linear regression model with intercept. The third line computes the predicted values in the ystar vector. The square of the multiple correlation coefficient computed in the fourth line, $r^2_{FW|CP,AP}$ = 0.893, coincides, as it should, with the R-square value returned by regress in the second line (the first element of stats). Figure 7.6 illustrates this multiple correlation situation.
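The same check can be sketched in R, assuming a data frame foetal with columns FW, CP and AP (assumed names):

fit <- lm(FW ~ CP + AP, data = foetal)
r   <- cor(foetal$FW, fitted(fit))  # multiple correlation coefficient
r^2                                 # equals summary(fit)$r.squared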

7.2.4 Inferences on Regression Parameters

Inferences on parameters in the general linear model are carried out similarly to the inferences in section 7.1.3. Here, we review the main results:

– Interval estimation of $\beta_k$: $b_k \pm t_{n-p,1-\alpha/2}\, s_{b_k}$.
– Confidence interval for $E[Y_k]$: $\hat{y}_k \pm t_{n-p,1-\alpha/2}\, s_{\hat{y}_k}$.
– Confidence region for the regression hyperplane: $\hat{y}_k \pm W s_{\hat{y}_k}$, with $W^2 = p F_{p,\,n-p,\,1-\alpha}$.
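The first two interval estimates are directly available in R; a minimal sketch, assuming a data frame foetal with columns FW, CP and AP (assumed names):

fit <- lm(FW ~ CP + AP, data = foetal)
confint(fit, level = 0.95)             # b_k +/- t(n-p, 1-alpha/2) * s_bk
predict(fit, interval = "confidence")  # confidence interval for E[Y_k] at each observed predictor level
# the confidence region for the whole hyperplane replaces the t percentile by W (Working-Hotelling)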

Example 7.13

Q: Consider the Foetal Weight dataset, containing foetal echographic measurements, such as the biparietal diameter (BPD), the cephalic perimeter (CP), the abdominal perimeter (AP), etc., and the respective weight-just-after-delivery, FW. Determine the linear regression model needed to predict the newborn weight, FW, using the three variables BPD, CP and AP. Discuss the results.

A: Having filled in the three variables BPD, CP and AP as predictor or independent variables and the variable FW as the dependent variable, one can obtain with STATISTICA the result summary table shown in Figure 7.7.

The standardised beta coefficients have the same meaning as in 7.1.2. Since these reflect the contribution of standardised variables, they are useful for comparing the relative contribution of each variable. In this case, variable AP has the highest contribution and variable CP the lowest. Notice the high coefficient of multiple determination, R², and that in the last column of the table all t tests are found significant. Similar results are obtained with the commands listed in Commands 7.1 for SPSS, MATLAB and R.

Figure 7.8 shows line plots of the true (observed) values and predicted values of the foetal weight using the multiple linear regression model. The horizontal axis of these line plots is the case number. The true foetal weights were previously sorted by increasing order. Figure 7.9 shows the scatter plot of the observed and predicted values obtained with the Multiple Regression command of STATISTICA.


Figure 7.7. Estimation results obtained with STATISTICA of the trivariate linear regression of the foetal weight data.

Figure 7.8. Plot obtained with STATISTICA of the predicted (dotted line) and observed (solid line) foetal weights with a trivariate (BPD, CP, AP) linear regression model.

Figure 7.9. Plot obtained with STATISTICA of the observed versus predicted foetal weight values with fitted line and 95% confidence interval.


7.2.5 ANOVA and Extra Sums of Squares

The simple ANOVA test presented in 7.1.4, corresponding to the decomposition of the total sum of squares as expressed by formula 7.24, can be generalised in a straightforward way to the multiple regression model.

Example 7.14

Q: Apply the simple ANOVA test to the foetal weight regression in Example 7.13.

A: Table 7.2 lists the results of the simple ANOVA test, obtainable with SPSS, STATISTICA, or R, for the foetal weight data, showing that the regression model is statistically significant (p ≈ 0).

Table 7.2. ANOVA test for Example 7.13.

        Sum of Squares    df     Mean Squares    F           p
SSR     128252147         3      42750716        501.9254    0.00
SSE     34921110          410    85173
SST     163173257         413
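The same overall F test can be reproduced in R; a minimal sketch, again assuming a data frame foetal with columns FW, BPD, CP and AP (assumed names):

fit <- lm(FW ~ BPD + CP + AP, data = foetal)
summary(fit)  # reports R-square and the overall F test, (SSR/3)/(SSE/410) for the 414 cases
anova(fit)    # sequential sums of squares; they add up to SSR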

It is also possible to apply the ANOVA test for lack of fit in the same way as was done in 7.1.4. However, when several predictor variables exert their influence in the regression model, it is useful to assess their individual contributions by means of the so-called extra sums of squares. An extra sum of squares measures the marginal reduction in the error sum of squares when one or several predictor variables are added to the model.

We now illustrate this concept using the foetal weight data. Table 7.3 shows the regression lines, SSE and SSR for models with one, two or three predictors. Notice how the model with (BPD,CP) has a decreased error sum of squares, SSE, when compared with either the model with BPD or CP alone, and has an increased regression sum of squares. The same happens to the other models. As one adds more predictors one expects the linear fit to improve. As a consequence, SSE and SSR are monotonically decreasing and increasing functions, respectively, of the number of variables in the model. Moreover, since SST is fixed, any decrease of SSE is reflected by an equal increase of SSR.

We now define the following extra sum of squares, SSR(X2|X1), which measures the improvement obtained by adding a second variable X2 to a model that already has X1:

SSR(X2|X1) = SSE(X1) − SSE(X1,X2) = SSR(X1,X2) − SSR(X1).


Table 7.3. Computed models with SSE, SSR and respective degrees of freedom for the foetal weight data (sums of squares divided by 10⁶).

Abstract model       Computed model                                               SSE     df     SSR     df
Y = g(X1)            FW(BPD) = −4229.1 + 813.3 BPD                                76.0    412    87.1    1
Y = g(X2)            FW(CP) = −5096.2 + 253.8 CP                                  73.1    412    90.1    1
Y = g(X3)            FW(AP) = −2518.5 + 173.6 AP                                  46.2    412    117.1   1
Y = g(X1, X2)        FW(BPD,CP) = −5464.7 + 412.0 BPD + … CP                      65.8    411    97.4    2
Y = g(X1, X3)        FW(BPD,AP) = …                                                       411            2
Y = g(X2, X3)        FW(CP,AP) = −4476.2 + 102.9 CP + 130.7 AP                    38.5    411    124.7   2
Y = g(X1, X2, X3)    FW(BPD,CP,AP) = −4765.7 + 292.3 BPD + 36.0 CP + 127.7 AP     34.9    410    128.3   3

X1 ≡ BPD; X2 ≡ CP; X3 ≡ AP

For the data of Table 7.3 we have: SSR(CP | BPD) = SSE(BPD) − SSE(BPD, CP) = 76 − 65.8 = 10.2, which is practically the same as SSR(BPD, CP) − SSR(BPD) = 97.4 − 87.1 = 10.3 (the difference is only due to numerical rounding).
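In R, an extra sum of squares such as SSR(CP|BPD) is obtained by comparing the two nested models; a sketch with the assumed foetal data frame:

m1  <- lm(FW ~ BPD, data = foetal)       # model with BPD only
m12 <- lm(FW ~ BPD + CP, data = foetal)  # model with BPD and CP
anova(m1, m12)  # the 'Sum of Sq' entry is SSE(BPD) - SSE(BPD,CP) = SSR(CP|BPD)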

Similarly, one can define:

SSR(X3|X1,X2) = SSE(X1,X2) − SSE(X1,X2,X3) = SSR(X1,X2,X3) − SSR(X1,X2).

SSR(X2,X3|X1) = SSE(X1) − SSE(X1,X2,X3) = SSR(X1,X2,X3) − SSR(X1).

The first extra sum of squares, SSR(X3|X1,X2), represents the improvement obtained when adding a third variable, X3, to a model that already has two variables, X1 and X2. The second extra sum of squares, SSR(X2,X3|X1), represents the improvement obtained when adding two variables, X2 and X3, to a model that has only one variable, X1. The extra sums of squares are especially useful for performing tests on the regression coefficients and for detecting multicollinearity situations, as explained in the following sections.

With the extra sums of squares it is also possible to easily compute the so-called partial correlations, measuring the degree of linear relationship between two variables after including other variables in a regression model. Let us illustrate this topic with the foetal weight data. Imagine that the only predictors were BPD, CP and AP, as in Table 7.3, and that we wanted to build a regression model of FW by successively entering into the model the predictor which is most correlated with the predicted variable. In the beginning there are no variables in the model and we choose the predictor with the highest correlation with the dependent variable FW. Looking at Table 7.4 we see that, based on this rule, AP enters the model. Now we


must ask which of the remaining variables, BPD or CP, has the higher correlation with the predicted variable of the model that already has AP. Answering this question amounts to computing the partial correlation of a candidate variable, say X2, with the predicted variable of a model that already has X1, $r_{Y,X_2|X_1}$. The respective formula is:

$$r_{Y,X_2|X_1}^2 = \frac{SSR(X_2|X_1)}{SSE(X_1)} = \frac{SSE(X_1) - SSE(X_1,X_2)}{SSE(X_1)}.$$

For the foetal weight dataset the computations with the values in Table 7.3 are as follows (for instance, for CP given AP):

$$r_{FW,CP|AP}^2 = \frac{SSE(AP) - SSE(AP,CP)}{SSE(AP)} = \frac{46.2 - 38.5}{46.2} = 0.167,$$

resulting in the partial correlation values listed in Table 7.4. We therefore select BPD as the next predictor to enter the model. This process could go on had we more predictors. For instance, the partial correlation of the remaining variable CP with the predicted variable of a model that already has AP and BPD, $r_{FW,CP|BPD,AP}$, is computed in the same way, using SSE(BPD,AP) and SSE(BPD,CP,AP), and yields the value 0.119 listed in Table 7.4.

Further details on the meaning and statistical significance testing of partial correlations can be found in (Kleinbaum DG et al., 1988).

Table 7.4. Correlations and partial correlations for the foetal weight dataset.

Variables in the model    Variables to enter the model    Correlation          Sample value
(None)                    BPD                             r_FW,BPD             0.731
(None)                    CP                              r_FW,CP              0.743
(None)                    AP                              r_FW,AP              0.847
AP                        BPD                             r_FW,BPD|AP          0.552
AP                        CP                              r_FW,CP|AP           0.408
AP, BPD                   CP                              r_FW,CP|BPD,AP       0.119
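Partial correlations can also be computed in base R from residuals, without using extra sums of squares; a sketch for r_FW,BPD|AP with the assumed foetal data frame:

res_fw  <- resid(lm(FW ~ AP, data = foetal))   # FW after removing the AP effect
res_bpd <- resid(lm(BPD ~ AP, data = foetal))  # BPD after removing the AP effect
cor(res_fw, res_bpd)                           # r_FW,BPD|AP (cf. Table 7.4)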


7.2.5.1 Tests for Regression Coefficients

We will only present the test for a single coefficient, formalised as:

H0: $\beta_k = 0$;

H1: $\beta_k \neq 0$.

The statistic appropriate for this test is:

$$t = \frac{b_k}{s_{b_k}} \sim t_{n-p}.  \quad 7.43$$

We may also use, as in section 7.1.4, the ANOVA test approach. As an illustration, let us consider a model with three variables, X1, X2, X3, and, furthermore, let us assume that we want to test whether "H0: β3 = 0" can be accepted or rejected. For this purpose, we first compute the error sum of squares for the full model:

SSE(F) = SSE(X1, X2, X3), with df_F = n − 4.

The reduced model, corresponding to H0, has the following error sum of squares:

SSE(R) = SSE(X1, X2), with df_R = n − 3.

The ANOVA test assessing whether or not any benefit is derived from adding X3 to the model is then based on the computation of:

$$F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \bigg/ \frac{SSE(F)}{df_F} = \frac{SSR(X_3|X_1,X_2)}{1} \bigg/ \frac{SSE(X_1,X_2,X_3)}{n-4}.$$

In general, we have:

$$F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \bigg/ \frac{SSE(F)}{df_F} \sim F_{df_R - df_F,\, df_F}.  \quad 7.44$$

The F test using this sampling distribution is equivalent to the t test expressed by 7.43. This F test is known as the partial F test.
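The partial F test is precisely this comparison of reduced and full models; a minimal R sketch with the assumed foetal data frame, testing whether AP adds anything to a model that already has BPD and CP:

reduced <- lm(FW ~ BPD + CP, data = foetal)       # SSE(R), df_R = n - 3
full    <- lm(FW ~ BPD + CP + AP, data = foetal)  # SSE(F), df_F = n - 4
anova(reduced, full)  # partial F test; its p value equals that of the t test on the AP coefficient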

7.2.5.2 Multicollinearity and its Effects

If the predictor variables are uncorrelated, the regression coefficients remain constant, irrespective of whether or not another predictor variable is added to the


model. The same applies to the sums of squares. For instance, for a model with two uncorrelated predictor variables, the following should hold:

SSR(X1|X2) = SSE(X2) − SSE(X1,X2) = SSR(X1);  7.45a
SSR(X2|X1) = SSE(X1) − SSE(X1,X2) = SSR(X2).  7.45b

On the other hand, if there is a perfect correlation between X1 and X2 (in other words, if X1 and X2 are collinear), we would be able to determine an infinite number of regression solutions (planes) intersecting at the straight line relating X1 and X2.

Multicollinearity leads to imprecise determination coefficients, imprecise fitted values and imprecise tests on the regression coefficients.

In practice, when predictor variables are correlated, the marginal contribution of any predictor variable in reducing the error sum of squares varies, depending on which variables are already in the regression model.
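This order dependence is easy to see in R, whose anova function reports sequential sums of squares; a sketch with the assumed foetal data frame:

anova(lm(FW ~ BPD + CP, data = foetal))  # rows give SSR(BPD) and SSR(CP|BPD)
anova(lm(FW ~ CP + BPD, data = foetal))  # rows give SSR(CP) and SSR(BPD|CP): a different partition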

Example 7.15

Q: Consider the trivariate regression of the foetal weight in Example 7.13. Use formulas 7.45 to assess the collinearity of CP given BPD, and of AP given BPD and given BPD and CP.

A: Applying formulas 7.45 to the results displayed in Table 7.3, we obtain:

SSR(CP) = 90 × 10⁶.

SSR(CP | BPD) = SSE(BPD) − SSE(CP, BPD) = 76 × 10⁶ − 66 × 10⁶ = 10 × 10⁶.

We see that SSR(CP|BPD) is small compared with SSR(CP), which is a symptom that BPD and CP are highly correlated. Thus, when BPD is already in the model, the marginal contribution of CP in reducing the error sum of squares is small because BPD contains much of the same information as CP.

In the same way, we compute:

SSR(AP) = 117 × 10⁶.

SSR(AP | BPD) = SSE(BPD) − SSE(BPD, AP) = 41 × 10⁶.
SSR(AP | BPD, CP) = SSE(BPD, CP) − SSE(BPD, CP, AP) = 31 × 10⁶.

We see that AP seems to bring a definite contribution to the regression model by reducing the error sum of squares.

7.2.6 Polynomial Regression and Other Models

Polynomial regression models may contain squared, cross-terms and higher order terms of the predictor variables. These models can be viewed as a generalisation of the multivariate linear model.

As an example, consider the following second order model:

$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$.  7.46


The $Y_i$ can also be linearly modelled as:

$Y_i = \beta_0 + \beta_1 u_{i1} + \beta_2 u_{i2} + \varepsilon_i$, with $u_{i1} = x_i$; $u_{i2} = x_i^2$.

As a matter of fact, many complex dependency models can be transformed into the general linear model after suitable transformation of the variables. The general linear model also encompasses interaction effects, as in the following example:

$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \varepsilon_i$,  7.47

which can be transformed into the general linear model by using the extra variable $x_{i3} = x_{i1} x_{i2}$ for the cross-term $x_{i1} x_{i2}$.

Frequently, when dealing with polynomial models, the predictor variables are previously centred, replacing $x_i$ by $x_i - \bar{x}$. The reason is that, for instance, X and X² will often be highly correlated. Using centred variables reduces multicollinearity and tends to avoid computational difficulties. Note that in all the previous examples the model is linear in the parameters $\beta_k$. When this condition is not satisfied, we are dealing with a non-linear model, as in the following example of the so-called exponential regression:

$Y_i = \beta_0 \exp(\beta_1 x_i) + \varepsilon_i$.  7.48

Unlike linear models, it is not generally possible to find analytical expressions for the estimates of the coefficients of non-linear models, similar to the normal equations 7.3. These have to be found using standard numerical search procedures. The statistical analysis of these models is also a lot more complex. For instance, if we linearise the model 7.48 using a logarithmic transformation, the errors will no longer be normal and with equal variance.

Commands 7.3. SPSS, STATISTICA, MATLAB and R commands used to perform polynomial and non-linear regression.

SPSS         Analyze; Regression; Curve Estimation
             Analyze; Regression; Nonlinear

STATISTICA   Statistics; Advanced Linear/Nonlinear Models; General Linear Models; Polynomial Regression
             Statistics; Advanced Linear/Nonlinear Models; Non-Linear Estimation

MATLAB       [p,S] = polyfit(X,y,n)
             [y,delta] = polyconf(p,X,S)
             [beta,r,J] = nlinfit(X,y,FUN,beta0)

R            lm(formula) | glm(formula)
             nls(formula, start, algorithm, trace)


The MATLAB polyfit function computes a polynomial fit of degree n using the predictor data X and the observed data vector y. The function returns a vector p with the polynomial coefficients and a matrix S to be used with the polyconf function, producing confidence intervals y ± delta at alpha confidence level (95% if alpha is omitted). The nlinfit function returns the coefficients beta and residuals r of a nonlinear fit y = f(X, beta), whose formula is specified by a string FUN and whose initial coefficient estimates are beta0.

The R glm function operates much in the same way as the lm function, with the support of extra parameters. The parameter formula is used to express a polynomial dependency of the dependent variable with respect to the predictors, such as y ~ x + I(x^2), where the function I inhibits the interpretation of "^" as a formula operator, so that it is used as an arithmetical operator. The nls function for nonlinear regression is used with a start vector of initial estimates, an algorithm parameter specifying the algorithm to use and a trace logical value indicating whether a trace of the iteration progress should be printed. An example is: nls(y~1/(1+exp((a-log(x))/b)), start=list(a=0, b=1), alg="plinear", trace=TRUE).
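To complement these commands, a brief R sketch fits the second-order model 7.46 on a centred predictor and the exponential model 7.48; the vectors x and y and the nls starting values are illustrative assumptions:

xc   <- x - mean(x)                     # centring reduces the correlation between x and its square
fit2 <- lm(y ~ xc + I(xc^2))            # second-order polynomial model (formula 7.46)
predict(fit2, interval = "confidence")  # fitted values with 95% confidence limits
fitnl <- nls(y ~ b0 * exp(b1 * x), start = list(b0 = 1, b1 = 0.1))  # exponential model 7.48
summary(fitnl)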

Example 7.16

Q: Consider the Stock Exchange dataset (see Appendix E). Design and evaluate a second order polynomial model, without interaction effects, for the SONAE share values depending on the predictors EURIBOR and USD.

A: Table 7.5 shows the estimated parameters of this second order model, along with the results of t tests. From these results, we conclude that all coefficients make an important contribution to the designed model. The simple ANOVA test also gives significant results. However, Figure 7.10 suggests that there is some trend of the residuals as a function of the observed values. This is a symptom that some lack of fit may be present. In order to investigate this issue we now perform the ANOVA test for lack of fit. We may use STATISTICA for this purpose, in the same way as in the example described in section 7.1.4.

Table 7.5. Results obtained with STATISTICA for a second order model, with predictors EURIBOR and USD, in the regression of SONAE share values (Stock Exchange dataset).

Effect       SONAE Param.    SONAE Std.Err    t       p       −95% Cnf.Lmt    +95% Cnf.Lmt
Intercept    −236008
EURIBOR      13938           1056             13.2    0.00    11860           16015
EURIBOR^2    −1767                                                            −1491
USD          560661          49041            11.4    0.00    464164          657159
USD^2        −294445


Figure 7.10. Residuals versus observed values in the Stock Exchange example.

First, note that there are p − 1 = 4 predictor variables in the model; therefore, p = 5. Secondly, in order to have enough replicates for STATISTICA to be able to compute the pure error, we use two new variables derived from EURIBOR and USD by rounding them to two and three significant digits, respectively. We then obtain (removing a 10³ factor):

SSE = 345062; df = n − p = 308; MSE = 1120.
SSPE = 87970; df = n − c = 208; MSPE = 423.

From these results we compute:

SSLF = SSE – SSPE = 257092; df = c – p = 100; MSLF = 2571.

F* = MSLF/MSPE = 6.1.

The 95th percentile of $F_{100,208}$ is 1.3. Since F* > 1.3, we conclude that the model exhibits lack of fit.
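The same lack-of-fit test can be sketched in R, assuming a data frame stock with columns SONAE, EURIBOR and USD (assumed names); as above, replicates are defined by rounding the predictors:

stock$e2 <- signif(stock$EURIBOR, 2)  # EURIBOR rounded to two significant digits
stock$u3 <- signif(stock$USD, 3)      # USD rounded to three significant digits
reg   <- lm(SONAE ~ e2 + I(e2^2) + u3 + I(u3^2), data = stock)  # second order model (reduced)
cells <- lm(SONAE ~ interaction(e2, u3), data = stock)          # cell-means model, giving the pure error
anova(reg, cells)                     # lack-of-fit F test (MSLF/MSPE)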