
10.4 Logistic Regression

The previous section (Section 10.3, p. 234) showed how categorical predictor variables can be included in a multiple regression model. This section shows how to analyze a model with a categorical response variable that is binary, that is, has only two categories. Examples of such binary or yes/no variables include Gender (male, female), Cigarette Use (smoke, do not smoke), Begin Therapy (begin, do not begin), and Home Ownership (own, do not own).

To illustrate, consider a data table of various body measurements for 340 motorcyclists, 170 men and 170 women, fitted for motorcycle clothing. The values for the variable Gender are entered in the data file as M or F. As these values are read into R, the variable Gender is automatically coded as a factor (Section 2.2.2, p. 34), a non-numerical categorical variable. Measurements are provided for Weight to the nearest pound; Height, Waist, Hips, and Chest to the nearest inch; Hand circumference to the nearest quarter of an inch; and Shoe size to the nearest half size. The data table is part of lessR and can be accessed via its name dataBodyMeas or the abbreviation BodyMeas.

Read the lessR data (Section 2.3.5, p. 44):

> mydata <- Read("BodyMeas", format="lessR")

The values of a binary variable can be coded as an indicator variable with numeric values 0 and 1. A model with a binary response variable with this encoding can therefore be submitted for analysis to a traditional least-squares regression program such as Regression (estimation, Section 9.3, p. 209). The results of this least-squares analysis, however, are problematic, as shown in Figure 10.4, in which the response variable Gender is regressed on Hand circumference in inches. (See the note for how to obtain this figure.¹) The response variable Gender has only two values, 0 and 1, yet the fitted values from the resulting estimated regression line are continuous. What does a fitted value of 0.75 mean? Or how about an out-of-range value such as Ŷ_Gender = −0.38?

Another problem is that the least-squares estimation procedure applied to a binary response variable necessarily violates some assumptions. First, from the mathematics of the variance of a binary variable, the residuals of a binary response variable cannot have the same variance for different values of the predictor variables. Also, the residuals cannot be normally distributed because each residual is calculated with a subtraction from only either 0 or 1.

Residuals for a binary response: e_i = Y_i − Ŷ_i = 0 − Ŷ_i or 1 − Ŷ_i

240 Regression II

Figure 10.4 Least-squares regression fit and scatter plot of Hand circumference and binary response variable Gender.

Instead of being continuously distributed across their range according to a normal distribution, the residuals cluster only around these two values.

10.4.1 The Logistic Regression Model

The key to properly modeling a binary response variable is to switch the focus from the actual values of the response variable to their probabilities. For example, given a Hand circumference, what is the probability that the person is Male? Probabilities vary continuously from and including 0 to 1, so all fitted values in that range would be meaningful. Fitted values from a straight-line model, however, would still extend beyond that range.

Rather than directly model the probability of the response variable, invoke the equivalent concept of the odds of an event: the ratio of the probability of the event occurring to the probability of it not occurring.

odds = p / (1 − p)

Half of the people represented in the BodyMeas data table are men and half are women. The probability of randomly selecting a Male is 0.5, which means that the odds of this selection are 0.5/0.5 = 1, often expressed as 1 to 1. If 10% of the cases in the data table were from men, then the odds for this selection would be 0.1/0.9 ≈ 0.11, or 0.11 to 1. Similarly, for a probability of 0.9 the associated odds are 0.9/0.1 = 9, or 9 to 1. The greater the probability of an event, the greater its odds. The smallest probabilities yield odds close to 0, and the closer the probability is to 1, the larger the odds.
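These probability-to-odds conversions can be verified directly at the R console; the odds helper function below is only for illustration and is not part of lessR.

```r
# Odds of an event from its probability: p / (1 - p)
odds <- function(p) p / (1 - p)

odds(0.5)  # 1, that is, 1 to 1
odds(0.1)  # about 0.11 to 1
odds(0.9)  # 9 to 1
```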

We wish to model the values of one or more variables over all possible values. To transform the range of odds from 0 to infinity to all values from negative to positive infinity, take the logarithm of the odds, called the logit: the logarithm of the odds for the occurrence of an event. The odds of an event that has a probability of 0.5 is 1, with the corresponding logarithm of 0, the boundary between an event with either less or more than a 0.5 probability. A probability close to zero yields a very large negative value. A probability closer to one yields a very large positive value.


This logarithm of the odds is the logit transformation, the response variable for logistic regression. With some algebra the logit transformation can be expressed directly as a linear function of the predictor variables, where ln is the natural logarithm.

logit(p) = ln( p / (1 − p) ) = b_0 + b_1 X_1 + b_2 X_2 + ... + b_m X_m

The logistic model specifies how the logarithm of the odds changes for a one-unit increase in each of the predictor variables, with the values of the other predictor variables held constant. The odds are expressed for obtaining a value of the response variable of 1. This is the model estimated by a logistic regression analysis.
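R provides the logit and its inverse as the built-in functions qlogis and plogis, which can illustrate the transformation between a probability and its log odds:

```r
p <- 0.8
log(p / (1 - p))   # logit computed by hand, about 1.386
qlogis(p)          # the same value from R's built-in logit function
plogis(qlogis(p))  # inverting the logit recovers the probability, 0.8
```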

There is no mathematical solution for these estimates by the least-squares principle previously discussed. Fortunately, the coefficients can be estimated by a method called maximum likelihood: choose the estimated values for the model that would have most likely produced the observed data. These conditional probabilities of the data for a given set of parameters are likelihoods. The likelihood is calculated from the data for one set of initial estimates. Then the estimates are successively adjusted to produce a greater likelihood at each iteration. The output of a logistic regression includes the number of iterations processed before the best solution is reached from the initial estimates.

10.4.2 Logistic Regression Coefficients

Now apply a logistic regression analysis to the BodyMeas data.

Scenario Predict Gender from body measurements
Sometimes a customer's gender is not recorded. How well can Gender be predicted from available body measurements?

Gender is a binary variable, so the resulting regression model should be a logistic regression. The lessR function Logit accomplishes this regression. The syntax of Logit is the same as for Regression. By default R orders the levels of a factor alphabetically before converting to 0 and 1, so the default coding is 0 for Female and 1 for Male. First try a logistic regression model with only a single predictor, here Hand circumference in inches.

lessR Input Logistic regression with a single predictor
> Logit(Gender ∼ Hand)

The first part of the output appears in Listing 10.10 . The form of the output is identical to that of the usual least-squares regression analyses previously presented. Each estimate of the regression model is presented, with its standard error, hypothesis test that the corresponding population value is 0, and associated 95% confidence interval. What is new is the number of iterations the algorithm required to achieve an optimal maximum likelihood solution.


              Estimate  Std Err  z-value  p-value  Lower 95%  Upper 95%
 (Intercept)  -26.9237      ...      ...      ...        ...        ...
        Hand    3.2023      ...      ...      ...        ...     3.8904

Number of Fisher Scoring iterations: 6

Listing 10.10 Estimated coefficients from logistic regression analysis.

Each estimate is evaluated with the null hypothesis of a zero population value for the corresponding slope coefficient. The sample estimate from the logit model is b_Hand = 3.202.

Effect of Hand size: p-value < α = 0.05, so reject H0: β_Hand = 0

The direct interpretation of each estimated slope coefficient from this model, however, is not straightforward. As shown, although the model for logit(p) is linear, the response variable for this analysis is the logit, the logarithm of the odds. Consistent with the interpretation of a linear model, for a one-unit increase in Hand circumference the expected change in the logit is 3.20. But what does this mean?

10.4.3 The Odds Ratio

Fortunately, a simple expression permits a straightforward interpretation of how a change in the value of X impacts the odds of a value of 1 for the response variable, here Male. The algebra is to apply the exponential function to each side of the model, accomplished in R with the exp function. The exponential function converts a subtraction to a division, so the comparison of a change in the odds from changing the value of the predictor variable is expressed as a ratio. The result is the odds ratio: the ratio of change in the odds that the binary response variable equals 1 as the value of the predictor variable is increased one unit. For example, exponentiate the estimated slope coefficient for Hand with exp(3.2023), which yields 24.59.
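This exponentiation can be reproduced at the R console with the slope estimate from Listing 10.10:

```r
b_Hand <- 3.2023   # estimated slope coefficient for Hand
exp(b_Hand)        # odds ratio, about 24.59
```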

An odds ratio of 1.0 indicates no relationship between predictor and response variables. An odds ratio of 0.5 indicates that a value of 1 for the response variable is half as likely with an increase of the predictor variable by one unit, an inverse relationship: as the predictor value increases, the probability that the value of the response variable is 1 decreases. Values of the odds ratio over 1.0 indicate a positive relationship of the predictor to the probability that the value of the response variable is 1. The Logit function automatically displays the odds ratios and the corresponding 95% confidence intervals, shown in Listing 10.11.

       Odds Ratio  Lower 95%  Upper 95%
 Hand       24.59      13.53      48.93

Listing 10.11 The estimated odds ratio for each coefficient and associated 95% confidence interval.

The odds ratio in Listing 10.11 is considerably larger than 1.0, so there is a positive relationship of Hand circumference to being Male. The odds are for a value of the response variable equal to 1, that is, that a Male is randomly selected from the sample of 340 people.


The odds of selecting a Male are almost 25 times as large, 24.59, for each additional inch of Hand circumference. In the population this odds ratio, with 95% confidence, is somewhere between 13.53 and 48.93.

The odds ratio is so much larger than 1.0 because of the unit of measurement: each inch is a substantial percentage of total hand size. Measuring hand size in inches yields a range of sizes between 6 and 12, so each inch spans a considerable portion of that range. The result is a dramatic increase in the odds that a random selection of a person from among those with the same, one-inch-larger Hand size yields a Male.

To illustrate, convert the Hand measurements from inches to centimeters by multiplying each Hand measurement by 2.54 and re-run the analysis.

> mydata <- Transform(Hand.cm=2.54*Hand)
> Logit(Gender ∼ Hand.cm)

A centimeter is much smaller than an inch, so the size of the resulting odds ratio decreases dramatically, from 24.59 to 3.53. The odds of selecting a Male still increase for each additional centimeter of Hand circumference, but not by as much as for an increase of the much larger inch. The two analyses use different units, but express the same relationship.
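The change in the odds ratio follows directly from rescaling the slope coefficient, which can be verified numerically:

```r
b_inch <- 3.2023          # estimated slope per inch of Hand circumference
b_cm   <- b_inch / 2.54   # equivalent slope per centimeter
exp(b_cm)                 # odds ratio per centimeter, about 3.53
```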

10.4.4 Assessment of Fit

For a traditional least-squares regression analysis, assessment of fit is based on the minimization of the sum of squared residuals. The standard deviation of the residuals and the R² fit statistic follow from this minimization. For a maximum likelihood solution such as for logistic regression, there is no least-squares minimization from which to obtain these statistics, nor is there a direct analogy to R², though there are several possibilities for fit indices (Tjur, 2009). An intuitive fit index is the percentage of correctly predicted values of the response variable from the corresponding probabilities. If the probability of a 1 for the value of the response variable is greater than or equal to 0.5, assign the case to that group. If the probability is less than 0.5, assign the case to the group with the response variable equal to 0. Then determine how many cases are correctly classified and compare to the baseline probability, which is the larger percentage of cases with either a 1 or with a 0.
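The classification rule just described can be sketched in a few lines of R; the fitted probabilities and observed responses below are hypothetical values for illustration only.

```r
# Hypothetical fitted probabilities and observed 0/1 responses
p_hat <- c(0.91, 0.23, 0.67, 0.08, 0.55)
y     <- c(1,    0,    1,    0,    0)

# Assign each case to group 1 when its probability is at least 0.5
y_pred <- ifelse(p_hat >= 0.5, 1, 0)

mean(y_pred == y)   # proportion correctly classified, here 0.8
```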

For the BodyMeas data set men and women are equally represented, so the baseline rate of correct predictions, from what could be called the null model (Section 9.3.2, p. 211), is 50%, shown in Listing 10.12 provided by the Logit function. The use of Hand circumference to predict Gender increases the percentage of correct predictions from 50% to 88.2%.

               Baseline               Predicted
---------------------------------------------------
Gender     Total   %Tot         F      M   %Correct
---------------------------------------------------
     ...     ...    ...       ...    ...        ...
Total

Listing 10.12 Classification table from Hand circumference predicting Gender.

The output of Logit for a single-predictor model also includes a graph, Figure 10.5. The graph shows the predicted probability that Gender=1 for each Hand size, that is, the probability of selecting a Male. These probabilities are obtained by inverting the previously provided expression for logit(p): apply the exponential function to the model with the specific estimates of b_0, b_1, and so forth, and then solve for the probability that Gender=1.
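Solving logit(p) = b_0 + b_1 × Hand for p yields p = 1/(1 + exp(−(b_0 + b_1 × Hand))). With the estimates from Listing 10.10, the probability of a Male for a given Hand circumference follows directly:

```r
b0 <- -26.9237   # estimated intercept
b1 <- 3.2023     # estimated slope for Hand
hand <- 9        # Hand circumference in inches

p_male <- 1 / (1 + exp(-(b0 + b1 * hand)))
p_male           # about 0.87 for a 9-inch hand
```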


Figure 10.5 Logistic fit expressed as a probability curve and scatter plot of Hand circumference and Gender.

10.4.5 Outliers

The Logit function also provides the outlier analysis (Section 9.5, p. 216) that accompanies the least-squares regression analyses. Examination of the scatter plot of the data in Figure 10.5 indicates that the most deviant data values are for Males with a Hand circumference of 7 inches, and for Females with a value of 9.5 inches. These are also the four values that have an R-Studentized residual larger in magnitude than 2, shown in Listing 10.13. They also have the largest values of Cook's Distance (influence statistics, Section 9.5.1, p. 217).

      Hand  Gender   fitted  residual  rstudent   dffits    cooks
 152   9.5       F   0.9706   -0.9706    -2.684  -0.2256  0.07555
 170   9.5       F   0.9706   -0.9706    -2.684  -0.2256  0.07555
 ...

Listing 10.13 The four cases with the magnitude of the R-Studentized residual larger than 2 and also the largest values of Cook's Distance.

Listing 10.13 lists the most extreme misclassifications. If a re-examination were possible, it would be advisable to double-check the assignment of Gender for these four cases. The data values could be correct, but the possibility of a transcription error should be explored.


10.4.6 Logistic Multiple Regression

The potential for improved model fit and understanding of the relations among the variables in the model applies to all multiple regression models regardless of their means of estimation, least-squares or maximum likelihood. With additional predictor variables that tend to be uncorrelated with each other but correlated with the response variable, the model will demonstrate improved fit and provide a richer understanding of the underlying relationships. Here consider a variety of other measurements intended to help differentiate between Male and Female.

lessR Input Logistic multiple regression

> Logit(Gender ∼ Height + Weight + Waist + Hips + Chest + Hand + Shoe)

The model specifies seven predictors to account for Gender. The estimated regression coefficients from the logistic regression model, similar to those in Listing 10.10, provide the information needed to compute the odds ratios. For this logistic multiple regression we move directly to this output, which appears in Listing 10.14. The interpretations of these values are similar to those for the one-predictor model already considered, except that each coefficient is interpreted with the values of all remaining predictor variables held constant.

         Odds Ratio  Lower 95%  Upper 95%
 Height         ...     0.9546     1.4571
    ...

Listing 10.14 Odds ratios and associated confidence intervals.

As seen from the output in Listing 10.10, the Logit function provides the estimated coefficients and hypothesis tests of each regression coefficient. Although not shown here, in this analysis the partial slope coefficients for three variables were not significantly different from 0, that is, the corresponding p-values were all larger than α = 0.05. These three variables are Height, Waist, and Chest. This lack of significance can also be gleaned from the 95% confidence intervals of the odds ratios reported in Listing 10.14. The confidence intervals of the odds ratios for these three variables with non-significant coefficients all cross the neutral value of 1.0. For example, the lower bound of this interval for Height is 0.9546, which means that for each one-inch increase in Height, the odds of randomly selecting a Male decrease by 1 − 0.9546, or about 4.5%, with the values for all other variables held constant. Yet the upper bound of the same confidence interval is 1.4571, which means that the same odds increase by 45.7%. Because the same confidence interval contains values below 1.0 and above 1.0, the corresponding relationship between Height and Gender=1 cannot be shown to exist, with the values of all other variables held constant.
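The percentage interpretations of the confidence bounds follow from subtracting 1.0 from each odds ratio:

```r
height_ci <- c(0.9546, 1.4571)   # 95% CI for the Height odds ratio
(height_ci - 1) * 100            # percent change in odds per inch: -4.54 and 45.71
```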

10.4.7 Assessment of Fit

A primary purpose for adding more predictor variables to a model is to enhance the model's ability to predict the response variable. The classification table for this seven-predictor model appears in Listing 10.15. The percentage of correct classifications has increased from 88.2% with the one-predictor model to 93.2% with the seven-predictor model.

Gender     Total   %Tot         F      M   %Correct
   ...

Listing 10.15 Classification table based on seven predictors of Gender.

The addition of six more variables to the original logistic model with just Hand circumference as the predictor variable has enhanced predictive efficiency. Are all seven predictor variables necessary? The logistic multiple regression indicated that three of the variables had non-significant partial slope coefficients. This lack of significance suggests that these three coefficients do not contribute to the predictive efficiency of the model. This suggestion can be more formally evaluated by comparing the fit of the reduced or nested model with four predictors to the full model with seven predictors.

As with least-squares regression, use the lessR function Nest to conduct this hypothesis test. The null hypothesis is that all of the deleted variables have zero partial slope coefficients. By default the function assumes a least-squares solution. To specify that logistic regression is the applicable procedure, invoke the method option, set to "logit".

lessR Input Compare nested models
> Nest(Gender, c(Weight, Hips, Hand, Shoe),
       c(Height, Weight, Waist, Hips, Chest, Hand, Shoe), method="logit")

The result of the comparison of the two models with Nest is shown in Listing 10.16. The Deviance index is the maximum likelihood analogy to the sum of the squared residuals for a least-squares solution (Nest applied to least-squares solutions, Section 10.2.3, p. 231).

Here the assessment is of the reduction of Deviance from the nested model to the full model, a reduction of 6.5019. The result is not significant, so the addition of the three predictor variables to the nested four-predictor model does not significantly contribute to the reduction in Deviance.

Effect of Height, Waist, and Chest: p-value = 0.090 > α = 0.05,
so do not reject H0: β_Height = β_Waist = β_Chest = 0
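The p-value for this deviance test can be reproduced from the chi-square distribution, with degrees of freedom equal to the number of deleted predictors:

```r
# Reduction in Deviance of 6.5019, evaluated against chi-square with df = 3
pchisq(6.5019, df=3, lower.tail=FALSE)   # about 0.090
```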


Model 1: Gender ~ Weight + Hips + Hand + Shoe
Model 2: Gender ~ Height + Weight + Waist + Hips + Chest + Hand + Shoe

   Resid.df  Resid.Dev  df  Deviance  p-value
1       ...        ...
2       ...        ...   3    6.5019    0.090

Listing 10.16 Direct comparison of a nested model to the full model with logistic regression.

The classification table provides further support for the viability of the four-predictor model. Re-running the model with only these four predictors yields a classification table identical to that from the seven-predictor model in Listing 10.15. With Weight, Hips, Hand circumference, and Shoe size already in the model, the addition of Height, Waist, and Chest does not improve model fit.

The odds ratios for these four predictors in Listing 10.17 are not much changed from those of the seven-predictor model presented in Listing 10.14. A one-inch increase in Hand size increases the odds of selecting a Male by a factor of more than 8.5, with the values of all other predictor variables held constant.

       Odds Ratio  Lower 95%  Upper 95%
  ...

Listing 10.17 Odds ratios and confidence intervals for the four-predictor model.

10.4.8 Predictions from New Data

Given the acceptance of the four-predictor model, predictions can now be made from new data.

Scenario Predict Gender from new data with the logistic regression model
Suppose the Gender of a customer is unknown and so is predicted from the corresponding body measurements. Also suppose that the customer is known to wear a medium size glove, but the exact hand size is unknown. Obtain the predicted value of Gender from the known measurements of Weight, Hips, and Shoe size, combined in turn with each of the three Hand circumference values that fit a medium size glove.

The analysis is done with the same set of X1.new, X2.new options and so forth as for the least-squares regression models (prediction in least-squares models, Section 10.3.1, p. 238).

lessR Input Logistic regression prediction from new data
> Logit(Gender ∼ Weight + Hips + Hand + Shoe,
        X1.new=155, X2.new=39, X3.new=c(8,8.5,9), X4.new=10)


The predictions for the three different sets of measurements appear in Listing 10.18 . In all three cases the probability is high or very high that Y=1, that is, the person is a Male.

Weight  Hips  Hand  Shoe   Ynew  predict  fitted  std.err
   ...

Listing 10.18 Probabilities of a Male for three sets of new values for the four predictor variables.

In summary, very good differentiation between men and women is obtained simply from knowledge of the circumference of the Hand, with an 88.2% correct classification of Gender from this information alone. Adding the predictor variables of Weight, Hips, and Shoe size to the model increases this classification accuracy. The result for this sample of 170 men and 170 women is an overall correct classification of Gender of 93.2% on the basis of these four body measurements.

Worked Problems

1 Return to the BodyMeas data set (?dataBodyMeas for more information).

> mydata <- Read("BodyMeas", format="lessR")

(a) Predict Weight from Height and Waist. Specify the estimated model. Specify the variables with significant partial slope coefficients. Interpret the largest coefficient.
(b) Identify the obvious outlier. What data value most contributes to the status of this case as an outlier?
(c) Using the Subset function, drop this case from the data table.
(d) Re-estimate the model. Is the model reasonably similar or qualitatively different from the model estimated with the outlier?
(e) Predict Weight from Height, Waist, Hips, Chest, Hand circumference, and Shoe size. Specify the estimated model. Specify the variables with significant partial slope coefficients. Interpret the largest coefficient.
(f) Evaluate collinearity.
(g) Examine all possible subset regressions. List two models with five predictors that have an R²adj > 0.80 and also two such models with only two predictors.
(h) Relate your answers for the two previous questions.

2 The Cars93 data set contains much information on 93 1993 car models (?dataCars93 for more information). One variable is Source, which has two values, 0 for a foreign car and 1 for a car manufactured in the USA.

> mydata <- Read("Cars93", format="lessR")

(a) Use the variable HP for horsepower to account for the Source of the automobile. Does horsepower successfully differentiate between non-USA and USA manufactured cars? If the odds ratio is interpretable, provide the interpretation.
(b) Use the variable Width to account for the Source of the automobile. Does Width successfully differentiate between foreign and USA manufactured cars? If the odds ratio is interpretable, provide the interpretation.
(c) For the one-predictor Width model, what is the predicted Source for a car with a Width of 60.5 inches?
(d) How much improvement in prediction is there with the following predictor variables: Width, PassCap, Wheelbase, Engine, MPGhiway? Are all these predictor variables relevant? Interpret the largest odds ratio.
