
7.3.3 Case Study

We have already used the foetal weight prediction task to illustrate specific topics on regression. We now consider this task in more detail, so that the reader can appreciate the application of several of the previously described topics in a complete worked-out case study.

7.3.3.1 Determining a Linear Model

We start with the solution obtained by forward stepwise search, summarised in Figure 7.11. Table 7.6 shows the coefficients of the model. The beta values indicate that the predictors make different contributions. All t tests are significant; therefore, no coefficient is discarded at this phase. The ANOVA test, shown in Table 7.7, also gives a good indication of the goodness of fit of the model.

Table 7.6. Parameters and t tests of the trivariate linear model for the foetal weight example.

(Rows: Intercept, BPD, CP, AP. Columns: Beta, Std. Err. of Beta, B, Std. Err. of B, t410, p.)

Table 7.7. ANOVA test of the trivariate linear model for the foetal weight example.

              Sum of Squares    df    Mean Squares    F          p
  Regress.    128252147           3   42750716        501.9254   0.00
  Residual     34921110         410      85173
  Total       163173257
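The following R sketch reproduces this step. The data frame fw and its column names (FW, BPD, CP, AP) are assumed for illustration; they are not necessarily the names used in the book's data files.

  # Fit the trivariate linear model of FW on BPD, CP and AP
  model <- lm(FW ~ BPD + CP + AP, data = fw)
  summary(model)  # coefficients, standard errors and t tests (cf. Table 7.6);
                  # the overall F test at the bottom corresponds to Table 7.7
  anova(model)    # sequential (Type I) sums of squares of the predictors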


Figure 7.14. Distribution of the residuals for the foetal weight example: a) Normal probability plot (expected normal value vs. residual); b) Histogram.

7.3.3.2 Evaluating the Linear Model

Distribution of the Residuals

In order to assess whether the errors can be assumed to be normally distributed, one can use graphical inspection, as in Figure 7.14, and also perform the distribution fitting tests described in Chapter 5. In the present case, the assumption of normally distributed errors seems a reasonable one.
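In R, the graphical inspection of Figure 7.14 and a distribution fitting test can be carried out as sketched below (the Shapiro-Wilk test is one readily available choice, not necessarily the test used elsewhere in the book):

  e <- resid(model)
  qqnorm(e); qqline(e)  # normal probability plot (cf. Figure 7.14a)
  hist(e)               # histogram of the residuals (cf. Figure 7.14b)
  shapiro.test(e)       # a possible distribution fitting test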

The constancy of the residual variance can be assessed using the following modified Levene test:

1. Divide the data set into two groups: one with comparatively low predictor values and the other with comparatively high predictor values. The objective is to compare the residual variance in the two groups. In the present case, we divide the cases into the two groups corresponding to observed weights below and above 3000 g. The sample sizes are n1 = 118 and n2 = 296, respectively.

2. Compute the medians of the residuals ei in the two groups: med1 and med2. In the present case, med1 = −182.32 and med2 = 59.87.

3. Let di1 = |ei1 − med1| and di2 = |ei2 − med2| represent the absolute deviations of the residuals around the medians in each group. We now compute the respective sample means, d̄1 and d̄2, of these absolute deviations, which in our case study are: d̄1 = 187.37, d̄2 = 221.42.

4. Compute:

   t* = (d̄1 − d̄2) / (s √(1/n1 + 1/n2)),

   with

   s² = [Σ(di1 − d̄1)² + Σ(di2 − d̄2)²] / (n1 + n2 − 2).

In the present case the computed t value is t* = −1.83 and the 0.975 percentile of t412 is 1.97. Since |t*| < t412,0.975, we accept that the residual variance is constant.
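These four steps translate almost literally into R (a sketch, with fw and model as in the previous snippets):

  low <- fw$FW < 3000                                  # group 1: weights below 3000 g
  e1  <- resid(model)[low];  e2 <- resid(model)[!low]
  d1  <- abs(e1 - median(e1));  d2 <- abs(e2 - median(e2))
  n1  <- length(d1);  n2 <- length(d2)
  s2  <- (sum((d1 - mean(d1))^2) + sum((d2 - mean(d2))^2)) / (n1 + n2 - 2)
  tstar <- (mean(d1) - mean(d2)) / sqrt(s2 * (1/n1 + 1/n2))
  abs(tstar) < qt(0.975, n1 + n2 - 2)                  # TRUE: constant variance accepted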

Test of Fit

We now proceed to evaluate the goodness of fit of the model, using the method described in 7.1.4, based on the computation of the pure error sum of squares. Using SPSS, STATISTICA, MATLAB or R, we determine:

n = 414; c = 381; n − c = 33; c − 2 = 379.
SSPE = 1846345.8; MSPE = SSPE/(n − c) = 55949.9.
SSE = 34921109.

Based on these values, we now compute:

SSLF = SSE − SSPE = 33074763.2; MSLF = SSLF/(c − 2) = 87268.5.

Thus, the computed F* is: F* = MSLF/MSPE = 1.56. On the other hand, the 95% percentile of F379,33 is 1.6. Since F* < F379,33, we do not reject the goodness-of-fit hypothesis.
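A sketch of this computation in R follows; replicates are taken to be cases sharing identical values of the three predictors, and the degrees of freedom are those used in the text:

  groups <- interaction(fw$BPD, fw$CP, fw$AP, drop = TRUE)
  n <- nrow(fw);  c <- nlevels(groups)   # `c` mirrors the text's notation
  sspe <- sum(tapply(fw$FW, groups, function(y) sum((y - mean(y))^2)))
  sse  <- sum(resid(model)^2)
  sslf <- sse - sspe
  Fstar <- (sslf / (c - 2)) / (sspe / (n - c))
  Fstar < qf(0.95, c - 2, n - c)         # TRUE: goodness of fit not rejected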

Detecting Outliers

The detection of outliers was already performed in 7.3.2.1, where eighteen cases were identified as outliers. The evaluation of the model without these outlier cases is usually performed at a later phase. We leave the repetition of the preceding evaluation steps, after removing the outliers, as an exercise.

Assessing Multicollinearity

Multicollinearity can be assessed either using the extra sums of squares, as described in 7.2.5.2, or using the VIF factors described in 7.3.2.2. The latter method is particularly fast and easy to apply.

Using SPSS, STATISTICA, MATLAB or R, one can easily obtain the coefficients of determination for each predictor variable regressed on the other ones. Table 7.8 shows the values obtained for our case study.

Table 7.8. VIF factors obtained for the foetal weight data.

          BPD(CP,AP)   CP(BPD,AP)   AP(BPD,CP)
  r²      0.6818       0.7275       0.4998
  VIF     3.14         3.67         2.00


Although no single VIF is larger than 10, the mean VIF is 2.9, larger than 1 and, therefore, indicative that some degree of multicollinearity may be present.
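The r² and VIF values of Table 7.8 can be obtained in R by regressing each predictor on the other two (a sketch; the vif function of the car package is an alternative):

  r2 <- c(BPD = summary(lm(BPD ~ CP + AP, data = fw))$r.squared,
          CP  = summary(lm(CP ~ BPD + AP, data = fw))$r.squared,
          AP  = summary(lm(AP ~ BPD + CP, data = fw))$r.squared)
  vif <- 1 / (1 - r2)   # VIF = 1 / (1 - r²)
  vif
  mean(vif)             # mean VIF is about 2.9 here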

Cross-Validating the Linear Model

Until now we have assessed the regression performance using the same data set that was used for the design. As we have already seen in Chapter 6, when discussing data classification, assessing the performance in the design (training) set yields, on average, optimistic results. We need to evaluate the ability of our model to generalise when applied to an independent test set. For that purpose we apply a cross-validation method along the same lines as in section 6.6.

Let us illustrate this procedure by applying two-fold cross-validation to our FW(AP,BPD,CP) model. For that purpose we randomly select approximately half of the cases for training and the other half for testing, and then switch the roles. This can be implemented in SPSS, STATISTICA, MATLAB and R by setting up a filter variable with random 0s and 1s. Denoting the two sets by D0 and D1, we obtained in one experiment the results shown in Table 7.9. Based on the F tests and on the proximity of the RMS values, we conclude that the model generalises well.

Table 7.9. Two-fold cross-validation results. The test set results are in italic. Rows: design with D0 (204 cases); design with D1 (210 cases). Columns: F (p) and RMS evaluated on D0 and on D1 (first entry: F = 272.6).
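A sketch of this two-fold cross-validation in R (a different random split will, of course, produce slightly different numbers than those of Table 7.9):

  set.seed(0)                          # for a reproducible random split
  filter <- rbinom(nrow(fw), 1, 0.5)   # filter variable with random 0s and 1s
  d0 <- fw[filter == 0, ];  d1 <- fw[filter == 1, ]
  rms <- function(m, d) sqrt(mean((d$FW - predict(m, newdata = d))^2))
  m0 <- lm(FW ~ BPD + CP + AP, data = d0)
  m1 <- lm(FW ~ BPD + CP + AP, data = d1)
  c(design = rms(m0, d0), test = rms(m0, d1))   # design with D0, test on D1
  c(design = rms(m1, d1), test = rms(m1, d0))   # design with D1, test on D0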

7.3.3.3 Determining a Polynomial Model

We now proceed to determine a third order polynomial model for the foetal weight regressed on the same predictors, but without interaction terms. As previously mentioned in 7.2.6, in order to avoid numerical problems, we use centred predictors, obtained by subtracting the respective means. We then use the following predictor variables:

X1 = BPD − mean(BPD);  X11 = X1²;  X111 = X1³.
X2 = CP − mean(CP);  X22 = X2²;  X222 = X2³.
X3 = AP − mean(AP);  X33 = X3²;  X333 = X3³.
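In R these centred predictors can be created directly (a sketch; scale(..., scale = FALSE) is an equivalent way of performing the centring):

  fw$X1 <- fw$BPD - mean(fw$BPD);  fw$X11 <- fw$X1^2;  fw$X111 <- fw$X1^3
  fw$X2 <- fw$CP  - mean(fw$CP);   fw$X22 <- fw$X2^2;  fw$X222 <- fw$X2^3
  fw$X3 <- fw$AP  - mean(fw$AP);   fw$X33 <- fw$X3^2;  fw$X333 <- fw$X3^3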

With SPSS and STATISTICA, in order to perform the forward stepwise search, the predictor variables must first be created before applying the respective regression commands. Table 7.10 shows some results obtained with the forward stepwise search. Note that although six predictors were included in the model using the threshold of 1 for the “F to enter”, the last three predictors do not have significant F tests, and the predictors X222 and X11 also fail the respective t tests (at the 5% significance level).

Let us now apply the backward search process. Figure 7.15 shows the summary table of this search process, obtained with STATISTICA, using a threshold of “F to remove” = 10 (one more than the number of initial predictors). The variables are removed consecutively, by increasing order of their F contribution, until the process ends with two included variables, X1 and X3. Notice, however, that variable X2 is found significant in the F test, and therefore it should probably be included too.
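R has no direct counterpart of the “F to enter”/“F to remove” thresholds; its built-in step function performs the stepwise searches with an AIC criterion instead, which usually leads to comparable models (a sketch under that substitution):

  full <- lm(FW ~ X1 + X2 + X3 + X11 + X22 + X222 + X111 + X33 + X333, data = fw)
  step(full, direction = "backward")                    # backward search
  step(lm(FW ~ 1, data = fw),                           # forward search,
       scope = formula(full), direction = "forward")    # starting from the empty model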

Table 7.10. Parameters of a third order polynomial regression model found with a forward stepwise search for the foetal weight data (using SPSS or STATISTICA).

(Rows: the intercept and each entered predictor. Columns: Beta, Std. Err. of Beta, F to Enter, t410, p.)

Figure 7.15. Parameters and tests obtained with STATISTICA for the third order polynomial regression model (foetal weight example) using the backward stepwise search procedure.

7.3.3.4 Evaluating the Polynomial Model

We now evaluate the polynomial model found by the forward search, including the six predictors X1, X2, X3, X11, X22, X222. This is done for illustration purposes only, since we saw in the previous section that the backward search procedure found a simpler linear model. Whenever a simpler (using fewer predictors) and similarly performing model is found, it should be preferred, for the same generalisation reasons that were explained in the previous chapter.

The distribution of the residuals is similar to what is displayed in Figure 7.14. Since the backward search cast some doubt on whether some of these predictors make a valid contribution, we will now use the methods based on the extra sums of squares, in order to evaluate whether each regression coefficient can be assumed to be zero and to assess the multicollinearity of the model. As a final result of this evaluation, we will conclude that the polynomial model does not bring about any significant improvement compared to the previous linear model with three predictors.

Table 7.11. Results of the test using extra sums of squares for assessing the contribution of each predictor in the polynomial model (foetal weight example).

  Reduced Model          SSE(R) (/10³)   SSR = SSE(R) − SSE(F) (/10³)   F* = SSR/MSE   Reject H0
  X2,X3,X11,X22,X222     37966           3563                           42.15          Yes
  X1,X3,X11,X22,X222     36163           1760                           20.82          Yes
  X1,X2,X11,X22,X222     36162           1759                           20.81          Yes
  X1,X2,X3,X22,X222      34552            149                            1.76          No
  X1,X2,X3,X11,X222      34763            360                            4.26          Yes
  X1,X2,X3,X11,X22       34643            240                            2.84          No

Each row's reduced model omits, in turn, X1, X2, X3, X11, X22 and X222 from the full model.

Testing whether individual regression coefficients are zero

We use the partial F test described in section 7.2.5.1 as expressed by formula 7.44. As a preliminary step, we determine with SPSS, STATISTICA, MATLAB or R the SSE and MSE of the model:

SSE = 34402739; MSE = 84528.

We now use the 95% percentile of F1,407 = 3.86 to perform the individual tests, as summarised in Table 7.11. According to these tests, variables X11 and X222 should be removed from the model.
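Each of these partial F tests compares the full six-predictor model with the reduced model omitting the predictor under test; in R this is a nested-model anova (a sketch for X11, the other predictors being handled analogously):

  full <- lm(FW ~ X1 + X2 + X3 + X11 + X22 + X222, data = fw)
  red  <- lm(FW ~ X1 + X2 + X3 + X22 + X222, data = fw)   # X11 removed
  anova(red, full)   # partial F test of H0: coefficient of X11 is zero
  qf(0.95, 1, 407)   # the 3.86 threshold used in the text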

Assessing multicollinearity

We use the test described in section 7.2.5.2 with the same SSE and MSE as before. Table 7.12 summarises the individual computations. According to Table 7.12, the larger differences between SSE(X) and SSE(X|R) occur for variables X11, X22 and X222. These variables have a strong influence on the multicollinearity of the model and should therefore be removed. In other words, we arrive at the first model of Example 7.17.

Table 7.12. Sums of squares for each predictor in the polynomial model (foetal weight example) using the full and reduced models.

  Variable   SSE(X) (/10³)   Reduced Model          SSE(R) (/10³)   SSE(X|R) = SSE(R) − SSE (/10³)
  X1         76001           X2,X3,X11,X22,X222     37966           3563
  X2         73062           X1,X3,X11,X22,X222     36163           1760
  X3         46206           X1,X2,X11,X22,X222     36162           1759
  X11        131565          X1,X2,X3,X22,X222      34552            149
  X22        130642          X1,X2,X3,X11,X222      34763            360
  X222       124828          X1,X2,X3,X11,X22       34643            240

The larger differences between SSE(X) and SSE(X|R) occur for X11, X22 and X222.
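The quantities of Table 7.12 can be computed in R along the following lines (a sketch for X11; the remaining columns follow analogously):

  sse <- function(f) sum(resid(lm(f, data = fw))^2)
  sse_full <- sse(FW ~ X1 + X2 + X3 + X11 + X22 + X222)   # SSE of the full model
  sse_x    <- sse(FW ~ X11)                               # SSE(X11): X11 alone
  sse_r    <- sse(FW ~ X1 + X2 + X3 + X22 + X222)         # SSE(R): X11 removed
  sse_x_r  <- sse_r - sse_full                            # SSE(X11 | R)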