
7.5 Logit and Probit Models

Logit and probit regression models are adequate for those situations where the dependent variable of the regression problem is binary, i.e., it has only two possible outcomes, e.g., "success"/"failure" or "normal"/"abnormal". We assume that these binary outcomes are coded as 1 and 0. The application of linear regression models to such problems would not be satisfactory, since the fitted predicted response would ignore the restriction of binary values for the observed data.

A simple regression model for this situation is:

Y_i = g(x_i) + ε_i,  with y_i ∈ {0, 1}.    7.60


Let us consider Y_i to be a Bernoulli random variable with p_i = P(Y_i = 1). Then, as explained in Appendix A and presented in B.1.1, we have:

E[Y_i] = p_i.

On the other hand, assuming that the errors have zero mean, we have from 7.60:

E[Y_i] = g(x_i).

Therefore, no matter which regression model we are using, the mean response for each predictor value represents the probability that the corresponding observed variable is one.

In order to handle the binary-valued response we apply a mapping from the predictor domain onto the [0, 1] interval. The logit and probit regression models are popular examples of such a mapping. The logit model uses the so-called logistic function, which is expressed as:

E[Y_i] = exp(β_0 + β_1 x_{i1} + … + β_{p−1} x_{i,p−1}) / [1 + exp(β_0 + β_1 x_{i1} + … + β_{p−1} x_{i,p−1})].

The probit model uses the standard normal cumulative distribution function, N_{0,1}, as mapping function:

E[Y_i] = N_{0,1}(β_0 + β_1 x_{i1} + … + β_{p−1} x_{i,p−1}).

Note that both mappings are examples of S-shaped functions (see Figure 7.24 and Figure A.7.b), also called sigmoidal functions. Both models are examples of non-linear regression.
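Both mappings can be evaluated with the Python standard library alone; the standard normal distribution function follows from the error function as N_{0,1}(x) = (1 + erf(x/√2))/2. A minimal sketch, not tied to any dataset in the text:

```python
import math

def logistic(x):
    """Logistic (sigmoid) function: exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))

def probit_mean(x):
    """Standard normal cumulative distribution function, via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Both mappings send the real line onto (0, 1) and are S-shaped.
for x in (-3.0, 0.0, 3.0):
    print(f"x={x:+.1f}  logistic={logistic(x):.4f}  normal CDF={probit_mean(x):.4f}")
```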

The logistic response enjoys the interesting property of simple linearization. As a matter of fact, denoting as before the mean response by the probability p_i, if we apply the logit transformation:

 p i  p i = ln 

 , 7.65  1 − p i  we obtain:

p * i = β 0 + β 1 x i 1 + K + β p − 1 x ip − 1 .
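This linearization is easy to verify numerically: generating mean responses p_i from the logistic model and applying transformation 7.65 recovers the linear predictor exactly. The coefficients below are arbitrary illustrative values, not estimates from any dataset:

```python
import math

b0, b1 = -2.0, 0.5  # arbitrary illustrative coefficients

for x in (0.0, 2.0, 4.0, 8.0):
    eta = b0 + b1 * x                          # linear predictor
    p = math.exp(eta) / (1.0 + math.exp(eta))  # logistic mean response
    p_star = math.log(p / (1.0 - p))           # logit transformation, 7.65
    assert abs(p_star - eta) < 1e-12           # linearized exactly
```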

Since the mean binary responses can be interpreted as probabilities, a suitable method to estimate the coefficients for the logit and probit models is the maximum likelihood method, explained in Appendix C, instead of the previously used least squares method. Let us see how this method is applied in the case of the simple logit model. We start by assuming a Bernoulli random variable associated with each observation y_i; therefore, the joint distribution of the n observations is (see B.1.1):

p(y_1, …, y_n) = ∏_i p_i^{y_i} (1 − p_i)^{1 − y_i}.

Taking the natural logarithm of this likelihood, we obtain:

ln p(y_1, …, y_n) = ∑_i y_i ln( p_i / (1 − p_i) ) + ∑_i ln(1 − p_i).

Using formulas 7.62, 7.63 and 7.64, the logarithm of the likelihood (log-likelihood), which is a function of the coefficients, L(β), can be expressed as:

L(β) = ∑_i y_i (β_0 + β_1 x_i) − ∑_i ln[1 + exp(β_0 + β_1 x_i)].    7.69

The maximization of the L( β) function can now be carried out using one of many numerical optimisation methods, such as the quasi-Newton method, which iteratively improves current estimates of function maxima using estimates of its first and second order derivatives.
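As an illustration of this optimisation step, the simple logit log-likelihood 7.69 can be maximised with a few Newton-Raphson iterations (a second-order scheme closely related to the quasi-Newton idea), since its gradient and Hessian have closed forms. The data below are made up for the sketch and are not the Clays dataset:

```python
import math

def fit_simple_logit(x, y, iters=25):
    """Maximise L(beta) = sum y_i(b0 + b1 x_i) - sum ln(1 + exp(b0 + b1 x_i))
    by Newton-Raphson iterations, using the closed-form gradient and Hessian."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # gradient of L
        a00 = a01 = a11 = 0.0    # negative Hessian (information matrix)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += xi * (yi - p)
            a00 += w; a01 += w * xi; a11 += w * xi * xi
        det = a00 * a11 - a01 * a01
        b0 += ( a11 * g0 - a01 * g1) / det   # solve A d = g for the update
        b1 += (-a01 * g0 + a00 * g1) / det
    return b0, b1

# Illustrative data only (not the Clays dataset):
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_simple_logit(x, y)
```

Because L(β) is concave, these iterations converge quickly whenever the two classes overlap, so that a finite maximum exists.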

The estimation of the probit model coefficients follows a similar approach. Both models tend to yield similar solutions, although the probit model is more complex to deal with, particularly concerning inference procedures and the handling of multiple predictors.

Example 7.21

Q: Consider the Clays’ dataset, which includes 94 samples of analysed clays from a certain region of Portugal. The clays are categorised according to their geological age as being pliocenic (y_i = 1; 69 cases) or holocenic (y_i = 0; 25 cases). Imagine that one wishes to estimate the probability of a given clay (from that region) being pliocenic, based on its content in high graded grains (variable HG). Design simple logit and probit models for that purpose. Compare both solutions.

A: Let AgeB represent the binary dependent variable. Using STATISTICA or SPSS (see Commands 7.7), the fitted logistic and probit responses are:

AgeB = exp(−2.646 + 0.23 × HG) / [1 + exp(−2.646 + 0.23 × HG)];
AgeB = N_{0,1}(−1.54 + 0.138 × HG).

Figure 7.24 shows the fitted response for the logit model and the observed data.

A similar figure is obtained for the probit model. Also shown is the 0.5 threshold line. Any response above this line is assigned the value 1, and below the line, the value 0. One can, therefore, establish a training-set classification matrix for the predicted versus the observed values, as shown in Table 7.18, which can be obtained using either the SPSS or STATISTICA commands. Incidentally, note how the logit and probit models afford a regression solution to classification problems and constitute an alternative to the statistical classification methods described in Chapter 6.
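The thresholding rule just described is straightforward to apply programmatically. The sketch below uses the logit coefficients fitted in Example 7.21, but the HG values and observed labels passed to it are invented purely for illustration:

```python
import math

def predict_age(hg):
    """Fitted logit response from Example 7.21, thresholded at 0.5."""
    p = math.exp(-2.646 + 0.23 * hg) / (1.0 + math.exp(-2.646 + 0.23 * hg))
    return 1 if p > 0.5 else 0

def classification_matrix(hg_values, observed):
    """Counts of (observed, predicted) classes: a training-set matrix."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for hg, obs in zip(hg_values, observed):
        counts[(obs, predict_age(hg))] += 1
    return counts

# Hypothetical sample: response crosses 0.5 near HG = 2.646/0.23 ≈ 11.5.
print(classification_matrix([5, 8, 20, 30, 12, 3], [0, 0, 1, 1, 1, 0]))
```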

When dealing with binary responses, we are confronted with the fact that the regression errors can no longer be assumed normal and of equal variance. Therefore, the statistical tests for model evaluation described in the preceding sections are no longer applicable. For the logit and probit models, some sort of the chi-square test described in Chapter 5 is usually applied in order to assess the goodness of fit of the model. SPSS and STATISTICA afford another type of chi-square test based on the log-likelihood of the model. Let L_0 represent the log-likelihood for the null model, i.e., where all slope parameters are zero, and L_1 the log-likelihood of the fitted model. In the test used by STATISTICA, the following quantity is computed:

L = −2(L_0 − L_1),

which, under the null hypothesis that the null model perfectly fits the data, has a chi-square distribution with p − 1 degrees of freedom. The test used by SPSS is similar, using only the quantity −2L_1, which, under the null hypothesis, has a chi-square distribution with n − p degrees of freedom. In Example 7.21, the chi-square test is significant for both the logit and probit models; therefore, we reject the null hypothesis that the null model fits the data perfectly. In other words, the estimated parameters b_1 (0.23 and 0.138 for the logit and probit models, respectively) make a significant contribution to the fitted models.
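Given the two log-likelihoods, the statistic and its p-value are immediate to compute; for one degree of freedom (a single slope parameter) the chi-square tail probability reduces to 1 − erf(√(L/2)). The log-likelihood values below are hypothetical, not the Clays-dataset values:

```python
import math

def lr_test_1df(L0, L1):
    """Likelihood-ratio statistic L = -2(L0 - L1) and its p-value
    under a chi-square distribution with 1 degree of freedom."""
    stat = -2.0 * (L0 - L1)
    p_value = 1.0 - math.erf(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods of the null and the fitted model:
stat, p = lr_test_1df(L0=-54.0, L1=-38.5)
# stat = 31.0, a value far in the chi-square tail, so the null model is rejected.
```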

Figure 7.24. Logistic response for the clay classification problem, using variable HG (obtained with STATISTICA). The circles represent the observed data.

Table 7.18. Classification matrix for the clay dataset, using predictor HG in the logit or probit models.

                    Predicted Age = 1   Predicted Age = 0   Correct (%)
Observed Age = 1           65                   4               94.2
Observed Age = 0           —                   —                 —


Example 7.22

Q: Redo the previous example using forward search in the set of all original clay features.

A: STATISTICA (Generalized Linear/Nonlinear Models) and SPSS afford forward and backward search in the predictor space when building a logit or probit model. Figure 7.25 shows the response function of a bivariate logit model built with the forward search procedure, using the predictors HG and TiO2. In order to derive the predicted Age values, one would have to determine the cases above and below the 0.5 plane. Table 7.19 displays the corresponding classification matrix, which shows some improvement compared with the situation of using the predictor HG alone. The error rates of Table 7.19, however, are training-set estimates. In order to evaluate the performance of the model one would have to compute test-set estimates using the same methods as in section 7.3.3.2.

Table 7.19. Classification matrix for the clay dataset, using predictors HG and TiO2 in the logit model.

                    Predicted Age = 1   Predicted Age = 0   Correct (%)
Observed Age = 1           66                   3               95.7
Observed Age = 0           —                   —                 —

Figure 7.25. 3-D plot of the bivariate logit model for the Clays’ dataset. The solid circles are the observed values.


Commands 7.7. SPSS and STATISTICA commands used to perform logit and probit regression.

SPSS:        Analyze; Regression; Binary Logistic | Probit

STATISTICA:  Statistics; Advanced Linear/Nonlinear Models; Nonlinear Estimation; Quick Logit | Quick Probit
             Statistics; Advanced Linear/Nonlinear Models; Generalized Linear/Nonlinear Models; Logit | Probit

Exercises

7.1 The Flow Rate dataset contains daily measurements of flow rates in two Portuguese Dams, denoted AC and T. Consider the estimation of the flow rate at AC by linear regression of the flow rate at T:
a) Estimate the regression parameters.
b) Assess the normality of the residuals.
c) Assess the goodness of fit of the model.
d) Predict the flow rate at AC when the flow rate at T is 4 m³/s.

7.2 Redo the previous Exercise 7.1 using quadratic regression, confirming a better fit with higher R².

7.3 Redo Example 7.3 without the intercept term, proving the goodness of fit of the model.

7.4 In Exercises 2.18 and 4.8 the correlations between HFS and a transformed variable of I0 were studied. Using polynomial regression, determine a transformed variable of I0 with higher correlation with HFS.

7.5 Using the Clays’ dataset, show that the percentage of low grading material depends on its composition of K2O and Al2O3. Use for that purpose a stepwise regression approach with the chemical constituents as predictor candidates. Furthermore, perform the following analyses:
a) Assess the contribution of the predictors using appropriate inference tests.
b) Assess the goodness of fit of the model.
c) Assess the degree of multicollinearity of the predictors.

7.6 Consider the Services firms of the Firms dataset. Using stepwise search of a linear regression model estimating the capital revenue, CAPR, of the firms with the predictor candidates {GI, CA, NW, P, A/C, DEPR}, perform the following analyses:
a) Show that the best predictor of CAPR is the apparent productivity, P.
b) Check the goodness of fit of the model.
c) Obtain the regression line plot with the 95% confidence interval.


7.7 Using the Forest Fires dataset, show that, in the conditions of the sample, it is possible to predict the yearly AREA of burnt forest using the number of reported fires as predictor, with an r² over 80%. Also, perform the following analyses:
a) Use ridge regression in order to obtain better parameter estimates.
b) Cross-validate the obtained model using a partition of even/odd years.

7.8 The search of a prediction model for the foetal weight in section 7.3.3.3 contemplated a third order model. Perform a stepwise search contemplating the interaction effects X12 = X1X2, X13 = X1X3, X23 = X2X3, and show that these interactions make no valid contribution.

7.9 The following Shepard’s formula is sometimes used to estimate the foetal weight:

log10(FW) = 1.2508 + 0.166 BPD + 0.046 AP − 0.002646 (BPD)(AP).

Try to obtain this formula using the Foetal Weight dataset and linear regression.

7.10 Variable X22 was found to be a good predictor candidate in the forward search process in section 7.3.3.3. Study in detail the model with predictors X1, X2, X3, X22, namely assessing: the multicollinearity; the goodness of fit; and the detection of outliers.

7.11 Consider the Wines dataset. Design a classifier for the white vs. red wines using features ASP, GLU and PHE and logistic regression. Check if a better subset of features can be found.

7.12 In Example 7.16, the second order regression of the SONAE share values (Stock Exchange dataset) was studied. Determine multiple linear regression solutions for the SONAE variable using the other variables of the dataset as predictors and forward and backward search methods. Perform the following analyses:
a) Compare the goodness of fit of the forward and backward search solutions.
b) For the best solution found in a), assess the multicollinearity and the contribution of the various predictors and determine an improved model. Test this model using a cross-validation scheme and identify the outliers.

7.13 Determine a multiple linear regression solution that will allow forecasting the temperature one day ahead in the Weather dataset (Data 1 worksheet). Use today’s temperature as one of the predictors and evaluate the model.