7.3.2 Evaluating the Model

7.3.2.1 Identifying Outliers

Outliers are cases that deviate strongly from the fitted regression curve and can therefore have a harmful influence on the fitting of the model to the data. Outliers are usually identified, for their eventual removal from the dataset, using the so-called semistudentised residuals (or standard residuals), defined as:

$$ e_i^* = \frac{e_i}{\sqrt{\mathrm{MSE}}} = \frac{y_i - \hat{y}_i}{\sqrt{\mathrm{MSE}}} \tag{7.49} $$

Cases whose semistudentised residuals exceed a certain threshold in magnitude (usually 2) are considered outliers and are candidates for removal.
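The threshold rule above can be illustrated with a minimal sketch in Python using NumPy. The function name semistudentised_residuals and the synthetic data are assumptions made for illustration; they are not part of the foetal weight example, and MSE is computed here as SSE/(n − p), with p the number of estimated parameters including the intercept.

```python
import numpy as np

def semistudentised_residuals(X, y):
    """Return e_i / sqrt(MSE) for a least squares fit with intercept.

    Hypothetical helper written for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    # Design matrix with a leading column of ones for the intercept.
    A = np.column_stack([np.ones(n), X])
    p = A.shape[1]  # number of estimated parameters
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta           # ordinary residuals
    mse = e @ e / (n - p)      # mean square error
    return e / np.sqrt(mse)

# Synthetic data (assumed, not the foetal weight dataset).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=30)
y[5] += 15.0  # inject an artificial outlier at index 5

r = semistudentised_residuals(x.reshape(-1, 1), y)
outliers = np.flatnonzero(np.abs(r) > 2)  # indices flagged by the rule
print(outliers)
```

The threshold of 2 is the usual value mentioned in the text; other thresholds can be substituted in the last comparison.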

Example 7.18

Q: Detect the outliers of the first model designed in Example 7.13, using semistudentised residuals.

A: Figure 7.13 shows the partial listing, obtained with STATISTICA, of the 18 outliers for the foetal weight regression with the three predictors AP, BPD and CP. Notice that the magnitudes of the Standard Residual column are all above 2.

Figure 7.13. Outlier list obtained with STATISTICA for the foetal weight example.

There are other ways to detect outliers, such as:

– Use of deleted residuals: the residual is computed for the respective case, assuming that it was not included in the regression analysis. If the deleted residual differs greatly from the original residual (i.e., with the case included) then the case is, possibly, an outlier. Note in Figure 7.13 how case 86 has a deleted residual that exhibits a large difference from the original residual, when compared with similar differences for cases with smaller standard residual.

– Cook’s distance: measures the distance between beta values with and without the respective case. If there are no outlier cases, these distances are of approximately equal amplitude. Note in Figure 7.13 how the Cook’s distance for case 86 is quite different from the distances of the other cases.
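Both measures listed above can be sketched in Python without refitting the model for every case. The sketch relies on the standard leverage identities (deleted residual e_i/(1 − h_ii), and Cook's distance e_i² h_ii / (p · MSE · (1 − h_ii)²)), which are not stated in this section and are assumed here; the function name deletion_diagnostics and the synthetic data are also hypothetical.

```python
import numpy as np

def deletion_diagnostics(X, y):
    """Return (residuals, deleted residuals, Cook's distances).

    Hypothetical helper; uses leverage-based closed forms rather than
    refitting the regression with each case left out.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept
    p = A.shape[1]
    H = A @ np.linalg.pinv(A)             # hat (projection) matrix
    h = np.diag(H)                        # leverages h_ii
    e = y - H @ y                         # ordinary residuals
    mse = e @ e / (n - p)
    deleted = e / (1.0 - h)               # residual with case i left out
    cooks = e**2 * h / (p * mse * (1.0 - h) ** 2)
    return e, deleted, cooks

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(25, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=25)
y[7] += 10.0  # artificial outlier at index 7

e, deleted, cooks = deletion_diagnostics(X, y)
# Cases where the deleted residual departs most from the original one.
gap = np.abs(deleted - e)
print(np.argmax(gap), np.argmax(cooks))
```

A large gap between the deleted and the original residual, or a Cook's distance that stands out from the rest, points to a possible outlier in the sense described above.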

7.3.2.2 Assessing Multicollinearity

Besides the methods described in 7.2.5.2, multicollinearity can also be assessed using the so-called variance inflation factors (VIF), which are defined for each predictor variable as:

$$ \mathrm{VIF}_k = \left(1 - r_k^2\right)^{-1}, \tag{7.50} $$

where $r_k^2$ is the coefficient of multiple determination when $x_k$ is regressed on the $p - 2$ remaining variables in the model. An $r_k^2$ near 1, indicating strong correlation with the remaining variables, will result in a large value of VIF. A VIF larger than 10 is usually taken as an indicator of multicollinearity.

For assessing multicollinearity, the mean of the VIF values is also computed:

$$ \overline{\mathrm{VIF}} = \sum_{k=1}^{p-1} \mathrm{VIF}_k \,/\, (p - 1). \tag{7.51} $$

A mean VIF considerably larger than 1 is indicative of serious multicollinearity problems.
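As an illustration of the two VIF criteria, the following Python sketch regresses each predictor on the remaining ones and applies the definition directly. The function name variance_inflation_factors and the synthetic predictors are assumptions made for this example only.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_k = 1 / (1 - r_k^2) for each column of X.

    Hypothetical helper; X holds the p - 1 predictors, one per column,
    without an intercept column.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    vif = np.empty(m)
    for k in range(m):
        target = X[:, k]
        others = np.delete(X, k, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        sst = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - resid @ resid / sst  # coefficient of determination
        vif[k] = 1.0 / (1.0 - r2)
    return vif

# Synthetic predictors (assumed): x3 is nearly a sum of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + 0.1 * rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

vif = variance_inflation_factors(X)
mean_vif = vif.mean()
print(vif, mean_vif)
```

Because x3 is constructed as an almost exact linear combination of the other two predictors, the VIF values in this sketch come out well above the threshold of 10.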

Commands 7.5. SPSS, STATISTICA, MATLAB and R commands used to evaluate regression models.

SPSS: Analyze; Regression; Linear; Statistics; Model Fit

STATISTICA: Statistics; Multiple regression; Advanced; ANOVA

MATLAB: regstats(y,X)

R: influence.measures

The MATLAB regstats function generates a set of regression diagnostic measures, such as the studentised residuals and Cook's distance. The function creates a window with a check box for each diagnostic measure and a Calculate Now button. Clicking Calculate Now opens another window where the user can specify the names of the variables in which the computed measures are stored.

The R influence.measures function provides a suite of regression diagnostics, including those described above, such as deleted residuals and Cook's distance.