7.5 Ridge Regression

Imagine that we had the dataset shown in Figure 7.17a and that we knew it to be the result of some process with an unknown polynomial response function plus added zero-mean, constant standard deviation normal noise. Let us further assume that we didn’t know the order of the polynomial function; we only knew that it didn’t exceed the 9th order. Searching for a 9th order polynomial fit, we would obtain the regression solution shown with a dotted line in Figure 7.17a. The fit is quite good (the R-square is 0.99), but do we really need a 9th order fit? Does the 9th order fit we have found for the data of Figure 7.17a generalise to a new dataset generated under the same conditions?

We find here again the same “training set” – “test set” issue that we encountered in Chapter 6 when dealing with data classification. It is, therefore, a good idea to get a new dataset and try to fit the found polynomial to it. As an alternative, we may also fit a new polynomial to the new dataset and compare both solutions. Figure 7.17b shows a possible instance of a new dataset, generated by the same process for the same predictor values, with the respective 9th order polynomial fit. Again the fit is quite good (the R-square is 0.98), although the large downward peak at the right end looks quite suspicious.

Table 7.14 shows the polynomial coefficients for both datasets. We note that, with the exception of the first two coefficients, there is a large discrepancy between the corresponding coefficient values of the two solutions. This is an often encountered problem in regression with over-fitted models (roughly, models of higher order than the data “justifies”): a small variation of the noise may produce a large variation of the model parameters and, therefore, of the predicted values. In Figure 7.17 the downward peak at the right end leads us to rightly suspect that we are in the presence of an over-fitted model and, consequently, to try a lower order. Such visual clues, however, are more often the exception than the rule.
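This coefficient instability is easy to reproduce. The following MATLAB sketch simulates two noise instances of a hypothetical low-order polynomial process (the true response function, noise level and x range behind Figure 7.17 are not listed in the text, so the values below are only illustrative) and compares the two 9th order fits:

```matlab
% Illustrative sketch only: the response function, noise level and x range are
% assumptions, not the values behind Figure 7.17.
rng(0);                                   % reproducible noise
x  = linspace(0, 1.5, 21)';               % 21 predictor values (hypothetical range)
ytrue = 3 - x + 0.5*x.^2 - 0.8*x.^3;      % hypothetical low-order "unknown" process
y1 = ytrue + 0.2*randn(size(x));          % first noise instance  (analogue of Fig. 7.17a)
y2 = ytrue + 0.2*randn(size(x));          % second noise instance (analogue of Fig. 7.17b)

p1 = polyfit(x, y1, 9);                   % 9th order least squares fit, first dataset
p2 = polyfit(x, y2, 9);                   % 9th order least squares fit, second dataset
disp([p1' p2'])                           % compare the two coefficient sets side by side
                                          % (a conditioning warning here is itself a
                                          %  symptom of over-fitting)
```

The two coefficient columns typically agree only in the low-order terms, mirroring the kind of discrepancy seen in Table 7.14.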

One way to deal with the problem of over-fitted models is to add to the error function 7.37 an extra term that penalises the norm of the regression coefficients:

E = (y − Xb)’(y − Xb) + rb’b = SSE + R .    7.57

When minimising the new error function 7.57 with the added term R = rb’b (called a regularizer), we are constraining the regression coefficients to be as small as possible, driving the coefficients of unimportant terms towards zero. The parameter r controls the degree of penalisation of the square norm of b and is called the ridge factor. The new regression solution obtained by minimising 7.57 is known as ridge regression and leads to the following ridge parameter vector b_R:

b_R = (X’X + r I)⁻¹ X’y .    7.58
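As a minimal sketch, 7.58 can be computed directly with the matrix backslash operator; x and y1 are assumed to be the hypothetical data of the previous sketch:

```matlab
% Direct computation of equation 7.58 for a 9th order polynomial model.
X  = x.^(0:9);                       % design matrix [1 x x.^2 ... x.^9] (implicit expansion)
r  = 1;                              % ridge factor
bR = (X'*X + r*eye(10)) \ (X'*y1);   % b_R = (X'X + r I)^(-1) X'y, coefficients [a0; ...; a9]
yR = X*bR;                           % smoothed ridge predictions
```

Note that the identity matrix in 7.58 penalises all coefficients, including the intercept term a0.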

Figure 7.17. A set of 21 points (solid circles) with 9th order polynomial fits (dotted lines). In both cases the x values and the noise statistics are the same; only the y values correspond to different noise instances.

Table 7.14. Coefficients of the polynomial fit of Figures 7.17a and 7.17b.

               a0     a1     a2     a3     a4      a5      a6     a7     a8     a9
Figure 7.17a   3.21  −0.93   0.31   8.51  −3.27   −9.27   −0.47   3.05   0.94   0.03
Figure 7.17b   3.72  −1.21  −6.98  20.87  19.98  −30.92  −31.57   6.18  12.48   2.96

Figure 7.18. Ridge regression solutions with r = 1 for the Figure 7.17 datasets.

Figure 7.18 shows the ridge regression solutions for the Figure 7.17 datasets using a ridge factor r = 1. We see that the two solutions are similar to each other and have a smoother aspect; the downward peak of Figure 7.17b has disappeared. Table 7.15 shows the respective polynomial coefficients, where we observe a much smaller discrepancy between the two sets of coefficients, as well as a decreasing influence of the higher order terms.

Table 7.15. Coefficients of the polynomial fit of Figures 7.18a and 7.18b.

               a0     a1     a2     a3     a4     a5     a6     a7     a8     a9
Figure 7.18a   2.96   0.62  −0.43   0.79  −0.55   0.36  −0.17  −0.32   0.08   0.07
Figure 7.18b

One can also penalise selected coefficients by using in 7.58 an adequate diagonal matrix of penalties, P, instead of I, leading to:

b_R = (X’X + r P)⁻¹ X’y .    7.59
Figure 7.19 shows the regression solution of the Figure 7.17b dataset, using as P a matrix with diagonal [1 1 1 1 10 10 1000 1000 1000 1000] and r = 1. Table 7.16 shows the computed and the true coefficients. We have now almost retrieved the true coefficients. The notion of an “over-fitted” model should now be clear.
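A sketch of 7.59 for this penalty pattern, reusing the design matrix X and the second dataset y2 of the previous sketches, could read:

```matlab
% Equation 7.59 with a diagonal penalty matrix: the higher order terms are
% penalised far more heavily than the lower order ones.
P  = diag([1 1 1 1 10 10 1000 1000 1000 1000]);   % penalties for a0, a1, ..., a9
r  = 1;
bP = (X'*X + r*P) \ (X'*y2);                      % b_R = (X'X + r P)^(-1) X'y
```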

Table 7.16. Coefficients of the polynomial fit of Figure 7.19 and true coefficients.

               a0     a1     a2     a3     a4     a5     a6     a7     a8     a9
Figure 7.19
True

Let us now discuss how to choose the ridge factor when performing ridge regression with 7.58 (regression with 7.59 is much less popular). We can gain some insight into this issue by considering the very simple dataset shown in Figure 7.20, constituted by only 3 points, to which we fit a least squares linear model (the dotted line) and a second-order model (the parabola represented with a solid line) using a ridge factor.

The regression line satisfies property iv of section 7.1.2: the sum of the residuals is zero. In Figure 7.20a the ridge factor is zero; therefore, the parabola passes exactly through the 3 points. This will always happen no matter where the observed values are positioned. In other words, the second-order solution is in this case an over-fitted solution, tightly attached to the “training set” and unable to generalise to another independent set (think of the addition of i.i.d. noise to the observed values).

The b vector is in this case b = [0 3.5 −1.5]’, with no independent term and a large second-order term.

Figure 7.19. Ridge regression solution of the Figure 7.17b dataset, using a diagonal matrix of penalties (see text).

Let us now add a regularizer. As we increase the ridge factor, the second-order term decreases and the independent term increases. With r = 0.6 we get the solution shown in Figure 7.20b, with b = [0.42 0.74 −0.16]’. We are now quite near the regression line, with a large independent term and a reduced second-order term. The addition of i.i.d. noise with small amplitude should not change, on average, this solution: on average we expect some compensation of the errors and a solution that passes roughly halfway between the points. In Figure 7.20c the regularizer weighs as much as the classic least squares error. We get b = [0.38 0.53 −0.05]’ and “almost” a line, passing below the “halfway” position. Usually, when performing ridge regression, we go as far as r = 1. If we go beyond this value the square norm of b is driven to very small values and we may get strange solutions, such as the one shown in Figure 7.20d for r = 50, corresponding to b = [0.020 0.057 0.078]’, i.e., a dominant second-order term.
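The whole experiment is easy to reproduce with the backslash operator. The three observed points of Figure 7.20 are not listed in the text, so the values used below are purely hypothetical; only the ridge factors match the figure:

```matlab
% Hypothetical 3-point dataset; only the ridge factors match Figure 7.20.
x3 = [0.3; 1.0; 1.7];                          % predictor values (assumed)
y3 = [0.8; 1.2; 0.9];                          % observed values (assumed)
X2 = [ones(3,1) x3 x3.^2];                     % second-order design matrix [1 x x^2]
for r = [0 0.6 1 50]                           % the ridge factors of Figure 7.20a-d
    b = (X2'*X2 + r*eye(3)) \ (X2'*y3);        % equation 7.58
    fprintf('r = %5.1f   b = [%7.3f %7.3f %7.3f]\n', r, b);
end
% With r = 0 the parabola interpolates the three points exactly; as r grows the
% coefficient vector is shrunk towards zero.
```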

Figure 7.21 shows for r ∈ [0, 2] the SSE curve together with the curve of the following error:

SSE(L) = Σ ( ŷi − ŷiL )² ,

where the ŷi are, as usual, the predicted values (second-order model) and the ŷiL are the predicted values of the linear model, which is the preferred model in this case. The minimum of SSE(L) (L for Linear) occurs at r = 0.6, where the SSE curve starts to saturate.
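Continuing the previous sketch (same hypothetical three-point data x3, y3 and design matrix X2, so only the shape of the curves is comparable to Figure 7.21), the two curves can be traced as follows:

```matlab
% SSE and SSE(L) as functions of the ridge factor, for the same x3, y3, X2.
XL    = [ones(3,1) x3];                       % linear model design matrix
bL    = XL \ y3;                              % ordinary least squares linear fit
yhatL = XL*bL;                                % its predicted values
rgrid = 0:0.05:2;
SSE   = zeros(size(rgrid));
SSEL  = zeros(size(rgrid));
for i = 1:numel(rgrid)
    b       = (X2'*X2 + rgrid(i)*eye(3)) \ (X2'*y3);
    yhat    = X2*b;
    SSE(i)  = sum((y3 - yhat).^2);            % classic least squares error
    SSEL(i) = sum((yhat - yhatL).^2);         % distance to the linear model
end
plot(rgrid, SSE, '-', rgrid, SSEL, ':'); xlabel('r');
```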

We may, therefore, choose the best r by graphical inspection of the estimated SSE (or MSE) and the estimated coefficients as functions of r, the so-called ridge traces. One usually selects the value of r that corresponds to the beginning of a “stable” evolution of the MSE and coefficients.

Besides its use in the selection of “smooth”, non-over-fitted models, ridge regression is also used as a remedy to decrease the effects of multicollinearity as illustrated in the following Example 7.20. In this application one must select a ridge factor corresponding to small values of the VIF factors.

Figure 7.20. Fitting a second-order model to a very simple dataset (3 points represented by solid circles) with ridge factor: a) 0; b) 0.6; c) 1; d) 50.

Figure 7.21. SSE (solid line) and SSE(L) (dotted line) curves for the ridge regression solutions of the Figure 7.20 dataset.

Example 7.20

Q: Determine the ridge regression solution for the foetal weight prediction model designed in Example 7.13.

A: Table 7.17 shows the evolution with r of the MSE, coefficients and VIF for the linear regression model of the foetal weight data using the predictors BPD, AP and CP. The mean VIF is also included in Table 7.17.

Table 7.17. Values of MSE, coefficients, VIF and mean VIF for several values of the ridge parameter in the multiple linear regression of the foetal weight data.

r          0       0.10    0.20    0.30    0.40    0.50    0.60
MSE        291.8   318.2   338.8   355.8   370.5   383.3   394.8
BPD  b     292.3   269.8   260.7   254.5   248.9   243.4   238.0
     VIF   3.14    2.72    2.45    2.62    2.12    2.00    1.92
CP   b     36.00   54.76   62.58   66.19   67.76   68.21   68.00
     VIF   3.67    3.14    2.80    2.55    3.09    1.82    2.16
AP   b     124.7   108.7   97.8    89.7    83.2    78.0    73.6
     VIF   2.00    1.85    1.77    1.71    1.65    1.61    1.57
Mean VIF   2.90    2.60    2.34    2.17    2.29    1.80    1.88

Figure 7.22. a) Plot of the foetal weight regression MSE and coefficients for several values of the ridge parameter; b) Plot of the mean VIF factor for several values of the ridge parameter.

Figure 7.22 shows the ridge traces for the MSE and the three coefficients, as well as the evolution of the Mean VIF factor. The ridge traces do not give, in this case, a clear indication of the best r value, although the CP curve suggests a “stable” evolution starting at around r = 0.2. We do not show the values and the curve corresponding to the intercept term since they are not informative. The evolution of the VIF and Mean VIF factors (the Mean VIF is shown in Figure 7.22b) suggests the solutions r = 0.3 and r = 0.5 as the most appropriate.

Figure 7.23 shows the predicted FW values with r = 0 and r = 0.3. The two solutions are close to each other. However, the ridge regression solution has decreased multicollinearity effects (reduced VIF factors) at the cost of only a small increase of the MSE.

Figure 7.23. Predicted versus observed FW values with r = 0 (solid circles) and r = 0.3 (open circles).
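Ridge VIF traces such as those of Table 7.17 can be obtained, for standardised predictors, from the diagonal of (R + rI)⁻¹R(R + rI)⁻¹, where R is the correlation matrix of the predictors. This is one common definition and not necessarily the exact computation behind Table 7.17; a sketch, assuming the matrix Xp holds the BPD, CP and AP columns, follows:

```matlab
% Ridge VIF traces for the three predictors (one common definition, hedged).
R = corrcoef(Xp);                                 % correlation matrix of BPD, CP, AP
for r = 0:0.1:0.6
    V   = (R + r*eye(3)) \ R / (R + r*eye(3));    % (R + rI)^(-1) R (R + rI)^(-1)
    vif = diag(V);                                % ridge VIF factors; r = 0 gives the usual VIFs
    fprintf('r = %.1f   VIF = %4.2f %4.2f %4.2f   mean VIF = %4.2f\n', ...
            r, vif, mean(vif));
end
```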

Commands 7.6. SPSS, STATISTICA and MATLAB commands used to perform ridge regression.

SPSS        Ridge Regression Macro
STATISTICA  Statistics; Multiple Regression; Advanced; Ridge
MATLAB      b=ridge(y,X,k)   (k is the ridge parameter)
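A brief usage note on the MATLAB command above: by default ridge centres and scales the columns of X; passing a fourth argument equal to 0 restores the coefficients to the original data scale and includes the intercept as the first element of b. A short sketch, assuming X and y hold the predictors and the response:

```matlab
k = 0.3;                              % ridge parameter (the r of this section)
b = ridge(y, X, k, 0);                % [intercept; b1; ...; bp] on the original scale
yhat = [ones(size(X,1),1) X] * b;     % predicted values
```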