7.3.1 Building the Model

When several variables are candidates for predictor variables in a regression model, it would be tedious to try every possible combination of variables. In such situations, one needs a search procedure operating in the variable space in order to build up the regression model, much in the same way as we performed feature selection in Chapter 6. The search procedure must also use an appropriate criterion for predictor selection. Many such criteria have been published in the literature. We indicate here just a few:

– SSE (minimisation)
– R square (maximisation)
– t statistic (maximisation)
– F statistic (maximisation)
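As an illustration of these four criteria, the following minimal Python sketch computes them for a single candidate predictor in a simple linear regression. The function name and the data values are hypothetical and made up for illustration; they are not taken from the foetal weight dataset.

```python
import math

def simple_regression_criteria(x, y):
    """Return SSE, R square, t and F for the least-squares line y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope estimate
    b0 = mean_y - b1 * mean_x           # intercept estimate
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    ssr = sst - sse                     # regression sum of squares
    mse = sse / (n - 2)                 # error mean square, n - 2 degrees of freedom
    return {
        "SSE": sse,
        "R2": ssr / sst,
        "t": b1 / math.sqrt(mse / sxx), # t statistic of the slope
        "F": ssr / mse,                 # MSR / MSE, with MSR = SSR / 1
    }

# Made-up illustrative data (assumption, not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]
crit = simple_regression_criteria(x, y)
print(crit)
```

With a single predictor, the F statistic equals the square of the t statistic of the slope, so both criteria rank candidate variables in the same order.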

When building the model, these criteria can be used in a stepwise manner, in the same way as we performed sequential feature selection in Chapter 6: either by adding consecutive variables to the model (the so-called forward search method) or by removing variables from an initial set (the so-called backward search method).

For instance, a very popular method is to build up the model by forward stepwise search using the F statistic, as follows:

1. Initially, the variable with maximum $F_k = \mathrm{MSR}(X_k)/\mathrm{MSE}(X_k)$ enters the model, say $X_1$; this F value must be above a certain specified level.

2. Next, the variable with maximum partial $F_k = \mathrm{MSR}(X_k \mid X_1)/\mathrm{MSE}(X_k, X_1)$ is added, provided this value is above a certain specified level.

3. The Step 2 procedure is repeated, conditioning on all variables already in the model, until no remaining variable has a partial F above the specified level.
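The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any of the packages discussed here: the function names (sse_of_fit, forward_stepwise), the f_to_enter parameter name and the data are hypothetical. The partial F of a candidate is computed as the decrease in SSE obtained by adding it (one degree of freedom) divided by the MSE of the enlarged model.

```python
def sse_of_fit(columns, y):
    """SSE of the least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Augmented matrix of the normal equations (X'X) b = X'y
    a = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting (assumes X'X is non-singular)
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(p):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [u - factor * v for u, v in zip(a[r], a[c])]
    b = [a[j][p] / a[j][j] for j in range(p)]
    return sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))

def forward_stepwise(predictors, y, f_to_enter=1.0):
    """Return [(name, partial F)] in order of entry; predictors maps name -> values.

    Assumes more cases than predictors plus one, and no perfect fit.
    """
    n = len(y)
    mean_y = sum(y) / n
    sse_current = sum((yi - mean_y) ** 2 for yi in y)  # intercept-only model
    selected, history = [], []
    while len(selected) < len(predictors):
        best = None
        for name in predictors:
            if name in selected:
                continue
            cols = [predictors[s] for s in selected] + [predictors[name]]
            sse_new = sse_of_fit(cols, y)
            df_error = n - len(cols) - 1
            f = (sse_current - sse_new) / (sse_new / df_error)  # partial F
            if best is None or f > best[1]:
                best = (name, f, sse_new)
        if best[1] <= f_to_enter:       # no candidate is above the specified level
            break
        selected.append(best[0])
        history.append((best[0], best[1]))
        sse_current = best[2]
    return history

# Made-up illustrative data (assumption): y depends mainly on x1, weakly on x2
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "x3": [0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, -0.4],
}
y = [2.6, 3.4, 6.55, 7.45, 10.4, 11.6, 14.55, 15.45]
history = forward_stepwise(predictors, y, f_to_enter=1.0)
for name, f in history:
    print(name, round(f, 2))
```

Solving the normal equations directly keeps the sketch dependency-free; for real data a numerically more stable least-squares routine would be preferable.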

Example 7.17

Q: Apply the forward stepwise procedure to the foetal weight data (see Example 7.13), using as initial predictor sets {BPD, CP, AP} and {MW, MH, BPD, CP, AP, FL}.

A: Figure 7.11 shows the evolution of the model when the forward stepwise method is applied to {BPD, CP, AP}. The first variable to be included, the one with the highest F, is AP. The variables included next have a decreasing F contribution, but still higher than the specified level of “F to Enter”, equal to 1. These results confirm the findings on partial correlation coefficients discussed in Section 7.2.5 (Table 7.4).

Figure 7.11. Forward stepwise regression (obtained with STATISTICA) for the foetal weight example, using {BPD, CP, AP} as initial predictor set.

7.3 Building and Evaluating the Regression Model 305

Let us now assume that the initial set of predictors is {MW, MH, BPD, CP, AP, FL}. Figure 7.12 shows the evolution of the model at each step. Notice that one of the variables, MH, was not included in the model, and that the last one to enter, CP, has a non-significant F test (p > 0.05) and should therefore also be excluded.

Figure 7.12. Forward stepwise regression (obtained with STATISTICA) for the foetal weight example, using {MW, MH, BPD, CP, AP, FL} as initial predictor set.

Commands 7.4. SPSS, STATISTICA, MATLAB and R commands used to perform stepwise linear regression.

SPSS: Analyze; Regression; Linear; Method Forward

STATISTICA: Statistics; Multiple Regression; Advanced; Forward Stepwise

MATLAB: stepwise(X,y)

R: step(object, direction = c("both", "backward", "forward"), trace)

With SPSS and STATISTICA the user can specify the level of F in order to enter or remove variables.

The MATLAB stepwise function fits a regression model of y depending on X, displaying figure windows for interactively controlling the stepwise addition and removal of model terms.

The R step function performs the stepwise selection of a model, represented by the parameter object and generated by the R lm or glm functions. The selection is based on a more sophisticated criterion than the ANOVA F, namely the Akaike information criterion (AIC). The parameter direction specifies the direction (forward, backward or a combination of both) of the stepwise search. The parameter trace, when left at its default value, makes step print information while it runs.
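To give an idea of how an AIC-type criterion, the default in R's step, trades goodness of fit against model size, here is a minimal Python sketch. It assumes the form commonly used for least-squares models, n ln(SSE/n) + 2p, which is defined only up to an additive constant. The function name and the SSE values are hypothetical and chosen for illustration only.

```python
import math

def aic_least_squares(sse, n, n_params):
    """AIC of a least-squares model, up to an additive constant.

    sse: error sum of squares; n: number of cases;
    n_params: number of estimated coefficients (including the intercept).
    """
    return n * math.log(sse / n) + 2 * n_params

# Hypothetical SSE values for two nested models fitted to n = 20 cases
aic_small = aic_least_squares(sse=2.0, n=20, n_params=3)
aic_large = aic_least_squares(sse=1.9, n=20, n_params=4)

# The model with the lower AIC is preferred
preferred = "small" if aic_small < aic_large else "large"
print(round(aic_small, 2), round(aic_large, 2), preferred)
```

In this made-up case the extra predictor lowers SSE only slightly, so the penalty of 2 per added coefficient outweighs the gain and the smaller model is retained.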

7 Data Regression