Logistic Regression
10.4 Logistic Regression
The previous section (Section 10.3, p. 234) showed how categorical predictor variables can be included in a multiple regression model. This section shows how to analyze a model with a categorical response variable, a binary variable that has only two categories. Examples of such binary or yes/no variables include Gender (male, female), Cigarette Use (smoke, do not smoke), Begin Therapy (begin, do not begin), and Home Ownership (own, do not own).
To illustrate, consider a data table of various body measurements for 340 motorcyclists, 170 men and 170 women, fitted for motorcycle clothing. The values for the variable Gender are entered in the data file as M or F. As these values are read into R, the variable Gender is automatically coded as a factor variable, a non-numerical categorical variable (Section 2.2.2, p. 34). Measurements are provided for Weight to the nearest pound; Height, Waist, Hips, and Chest to the nearest inch; Hand circumference to the nearest quarter of an inch; and Shoe size to the nearest half size. The data table is part of lessR and can be accessed via its name dataBodyMeas or the abbreviation BodyMeas.
Read the lessR data (Section 2.3.5, p. 44).

> mydata <- Read("BodyMeas", format="lessR")
The values of a binary variable can be coded as an indicator variable with numeric values
0 and 1. So a model with a binary response variable with this encoding can be submitted for analysis to a traditional least-squares regression program such as Regression. The results of this traditional least-squares estimation (Section 9.3, p. 209), however, are problematic, as shown in Figure 10.4. The response variable Gender is regressed on Hand circumference in inches. (See the note for how to obtain this figure.¹) The response variable Gender has only two values, 0 and 1, yet the fitted values from the resulting estimated regression line are continuous. What does a fitted value of 0.75 mean? Or how about an out-of-range value such as Ŷ_Gender = −0.38?
Another problem is that the least-squares estimation procedure applied to a binary response variable necessarily violates some assumptions. First, from the mathematics of the variance of
a binary variable, the residuals of a binary response variable cannot have the same variance for different values of the predictor variables. Also, the residuals cannot be normally distributed because each residual is calculated with a subtraction from only either 0 or 1.
Residuals for a binary response: e_i = Y_i − Ŷ_i = 0 − Ŷ_i or 1 − Ŷ_i
240 Regression II
Figure 10.4 Least-squares regression fit and scatter plot of Hand circumference and binary response variable Gender.
Instead of being continuously distributed across their range according to a normal distribution, the residuals cluster around only these two values.
10.4.1 The Logistic Regression Model
The key to properly modeling a binary response variable is to switch the focus from the actual values of the response variable to their probabilities. For example, given a Hand circumference, what is the probability that the person is Male? Probabilities vary continuously from and including 0 to 1, so all fitted values in that range would be meaningful. Fitted values from a linear least-squares model, however, would still extend beyond that range.
Rather than directly model the probability of the response variable, invoke the equivalent concept of the odds of an event: the ratio of the probability of the event occurring to the probability of it not occurring.
Half of the people represented in the BodyMeas data table are men and half are women. The probability of randomly selecting a Male is 0.5, which means that the odds of this selection are 0.5/0.5 = 1, often expressed as 1 to 1. If 10% of the cases in the data table were from men, then the odds for this selection would be 0.1/0.9 ≈ 0.11, or 0.11 to 1. Similarly, for a probability of 0.9 the associated odds are 0.9/0.1 = 9, or 9 to 1. The greater the probability of an event, the greater its odds. The smallest probabilities yield odds close to 0, and the closer the probability is to 1, the larger the odds.
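The probability-to-odds conversions above can be verified directly in R. The odds helper function here is just for illustration; it is not part of lessR:

```r
# Odds of an event: probability of occurring divided by probability of not occurring
odds <- function(p) p / (1 - p)

odds(0.5)  # 1, i.e. 1 to 1
odds(0.1)  # about 0.11
odds(0.9)  # 9, i.e. 9 to 1
```

As the probability moves from 0 toward 1, the odds sweep from 0 toward infinity, which motivates the logarithm taken in the next step.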
We wish to model the values of one or more variables over all possible values. To transform the range of odds from 0 to infinity to all values from negative to positive infinity, take the logarithm of the odds, called the logit. The odds of an event that has a probability of 0.5 are 1, with a corresponding logarithm of 0, the boundary between an event with either less or more than a 0.5 probability. A probability close to zero yields a very large negative value. A probability closer to one yields a very large positive value.
Regression II 241
This logarithm of the odds is the logit transformation, the response variable for the logistic regression. With some algebra the logit transformation can be expressed directly as a linear function of the predictor variables, with ln the natural logarithm of the resulting expression.
logit(p) = ln(p / (1 − p)) = b_0 + b_1 X_1 + b_2 X_2 + ... + b_m X_m
The logistic model specifies how the logarithm of the odds change for a one-unit increase in each of the predictor variables, with the values of the other predictor variables held constant. The odds are expressed for obtaining the value of the response variable of 1. This is the model estimated by a logistic regression analysis.
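The logit transformation and its inverse can each be written as a one-line function; this sketch is only to make the transformation concrete, not part of the lessR analysis:

```r
# logit: probability -> log odds; inv_logit maps log odds back to a probability
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

logit(0.5)               # 0: a 50/50 event has odds 1, so log odds 0
inv_logit(logit(0.73))   # recovers 0.73
```

Because inv_logit always returns a value strictly between 0 and 1, fitted values from the logistic model are always interpretable as probabilities.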
There is no mathematical solution for these estimates by the least-squares principle previously discussed. Fortunately, there is a solution to estimate these coefficients by a method called maximum likelihood: choose the estimated values for the model that would have most likely produced the observed data. The method maximizes the conditional probability of the data for any given set of parameters; these conditional probabilities are likelihoods. The likelihood is calculated from the data for one set of initial estimates. Then the estimates are successively adjusted to produce a greater likelihood at each iteration. The output of a logistic regression includes the number of iterations processed before the best solution is reached from the initial estimates.
10.4.2 Logistic Regression Coefficients
Now apply a logistic regression analysis to the BodyMeas data.
Scenario Predict Gender from body measurements
Sometimes a customer's gender is not recorded. How well can Gender be predicted from available body measurements?
Gender is a binary variable so the resulting regression model should be a logistic regression. The lessR function Logit accomplishes this regression. The syntax of Logit is the same as for Regression . By default R orders the levels of the factors alphabetically before converting to a 0 and 1, so the default coding is 0 for Female and 1 for Male. First try a logistic regression model with only a single predictor, here Hand circumference in inches.
lessR Input Logistic regression with a single predictor

> Logit(Gender ~ Hand)
The first part of the output appears in Listing 10.10 . The form of the output is identical to that of the usual least-squares regression analyses previously presented. Each estimate of the regression model is presented, with its standard error, hypothesis test that the corresponding population value is 0, and associated 95% confidence interval. What is new is the number of iterations the algorithm required to achieve an optimal maximum likelihood solution.
242 Regression II
             Estimate   Std Err   z-value   p-value   Lower 95%   Upper 95%
(Intercept)  -26.9237    3.8904       ...       ...         ...         ...
Hand           3.2023       ...       ...       ...         ...         ...

Number of Fisher Scoring iterations: 6
Listing 10.10 Estimated coefficients from logistic regression analysis.
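The same estimates can be obtained with base R's glm function and family=binomial, the machinery that underlies a logistic regression such as the one Logit runs. The data below are simulated, with made-up group means, so the sketch is self-contained:

```r
# Simulated stand-in for the BodyMeas data: two groups of hand sizes
set.seed(42)
hand   <- c(rnorm(170, mean = 7.2, sd = 0.5),   # simulated "female" hand sizes
            rnorm(170, mean = 8.8, sd = 0.5))   # simulated "male" hand sizes
gender <- rep(c(0, 1), each = 170)              # 0 = F, 1 = M

# Logistic regression via base R: binomial family with the default logit link
fit <- glm(gender ~ hand, family = binomial)
coef(fit)   # intercept and slope, on the log-odds (logit) scale
```

With real data, summary(fit) also reports the standard errors, z-values, and iteration count shown in Listing 10.10.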
Each estimate is evaluated with the null hypothesis of a zero population value for the corresponding slope coefficient. The sample estimate from the logit model is b_Hand = 3.2023.

Effect of Hand size: p-value < α = 0.05, so reject H0: β_Hand = 0

The direct interpretation of each estimated slope coefficient from this model, however, is not straightforward. As shown, although the model for logit(p) is linear, the response variable for this analysis is the logit, the logarithm of the odds. Consistent with the interpretation of a linear model, for a one-unit increase in Hand circumference the expected change in the logit is 3.20. But what does this mean?
10.4.3 The Odds Ratio
Fortunately, a simple expression permits a straightforward interpretation of how a change in the value of X impacts the odds of a value of 1 for the response variable, here Male. The algebra is to apply the exponential function to each side of the model, accomplished in R with the exp function. The exponential function converts a subtraction to a division, so the comparison of a change in the odds from changing the value of the predictor variable is expressed as a ratio. The result is the odds ratio: the ratio of change in the odds that the binary response variable equals 1 as the value of the predictor variable is increased one unit. For example, exponentiate the estimated slope coefficient for Hand with exp(3.2023), which yields 24.59.

An odds ratio of 1.0 indicates no relationship between predictor and response variables. An odds ratio less than 1.0 indicates an inverse relationship; an odds ratio of 0.5, for example, indicates that a value of 1 for the response variable is half as likely with an increase of the predictor variable by one unit. As the predictor value increases, the probability that the response variable equals 1 decreases. Values of the odds ratio over 1.0 indicate a positive relationship of the predictor to the probability that the value of the response variable is 1. The Logit function automatically displays the odds ratios and the corresponding 95% confidence intervals, shown in Listing 10.11.
        Odds Ratio   Lower 95%   Upper 95%
Hand         24.59       13.53       48.93
Listing 10.11 The estimated odds ratio for each coefficient and associated 95% confidence interval.
The odds ratio in Listing 10.11 is considerably larger than 1.0, so there is a positive relationship of Hand circumference to being Male. The odds are for a value of the response variable equal to 1, that is, that a Male is randomly selected from the sample of 340 people.
Regression II 243
The odds of a Male increase by a factor of almost 25, specifically 24.59, for each additional inch of Hand circumference. In the population this odds ratio, with 95% confidence, is somewhere between 13.53 and 48.93.
The odds ratio is so much larger than 1.0 because of the unit of measurement, inches, as a percentage of hand size. Measuring hand size in inches yields a range of sizes between 6 and 12 inches, so each one-inch increase spans a considerable portion of that range. The result is a dramatic increase in the odds that a random selection from among people with the same, larger Hand size yields a Male.
To illustrate, convert the Hand measurements from inches to centimeters by multiplying each Hand measurement by 2.54 and re-run the analysis.

> mydata <- Transform(Hand.cm=2.54*Hand)
> Logit(Gender ~ Hand.cm)

A centimeter is much smaller than an inch, so the size of the resulting odds ratio decreases dramatically, from 24.59 to 3.53. The odds of selecting a Male still increase for each one-centimeter increase in Hand circumference, but not by as much as for an increase of the much larger inch. There are different units for the two analyses, but the same relationship.
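The effect of the change of units follows directly from dividing the slope by the conversion factor before exponentiating, which can be checked numerically:

```r
b_hand_in <- 3.2023       # estimated slope with Hand measured in inches

exp(b_hand_in)            # about 24.59: odds ratio per one-inch increase
exp(b_hand_in / 2.54)     # about 3.53:  odds ratio per unit on the rescaled Hand
```

Rescaling a predictor changes its slope, and therefore its odds ratio, but not the underlying relationship: the fitted probabilities are identical in both analyses.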
10.4.4 Assessment of Fit
For a traditional least-squares regression analysis, assessment of fit is based on the minimization of the sum of squared residuals. The standard deviation of the residuals and the R² fit statistic follow from this minimization. For a maximum likelihood solution such as for logistic regression, there is no least-squares minimization from which to obtain these statistics, nor is there a direct analogy to R², though there are several possibilities for fit indices (Tjur, 2009).

An intuitive fit index is the percentage of correctly predicted values of the response variable from the corresponding probabilities. If the probability of a 1 for the value of the response variable is greater than or equal to 0.5, assign the case to that group. If the probability is less than 0.5, assign the case to the group with the response variable equal to 0. Then determine how many cases are correctly classified and compare to the baseline probability, which is the larger percentage of cases with either a 1 or with a 0.

For the BodyMeas data set men and women are equally represented, so the baseline rate of correct predictions, from what could be called the null model (Section 9.3.2, p. 211), is 50%, shown in Listing 10.12 provided by the Logit function. The use of Hand circumference to predict Gender increases the percentage of correct predictions from 50% to 88.2%.
Listing 10.12 Classification table from Hand circumference predicting Gender.
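The classification rule just described — assign each case to the group whose predicted probability is at least 0.5, then count correct assignments — takes only a few lines of R. The probabilities and observed codes below are made up for illustration:

```r
# Illustrative fitted probabilities of Gender = 1 (Male) for six cases
p_male <- c(0.92, 0.31, 0.77, 0.08, 0.55, 0.46)
actual <- c(1,    0,    1,    0,    0,    0)   # observed Gender codes

predicted <- ifelse(p_male >= 0.5, 1, 0)   # 0.5 cutoff classification rule
mean(predicted == actual)                  # proportion correctly classified: 5 of 6
```

Comparing this proportion to the 50% baseline of the null model shows how much the predictor improves classification.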
The output of Logit for a single-predictor model also includes a graph, Figure 10.5. The graph is the predicted probability that Gender=1 for each Hand size, that is, the probability of selecting a Male. These probabilities are obtained by inverting the previously provided expression for logit(p): apply the exponential function to the model with the specific estimates of b_0, b_1, and so forth, and then solve for the probability that Gender=1.
Figure 10.5 Logistic fit expressed as a probability curve and scatter plot of Hand circumference and Gender.
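The curve in Figure 10.5 is the inverse logit of the estimated linear predictor. Using the estimates reported earlier, an intercept of −26.9237 and a Hand slope of 3.2023, the computed probability at a 9.5-inch Hand matches the fitted value of 0.9706 reported for the outlier cases in Listing 10.13:

```r
b0 <- -26.9237   # estimated intercept from the one-predictor model
b1 <-   3.2023   # estimated slope for Hand

# Inverse logit: probability that Gender = 1 (Male) for a given Hand size
p_male <- function(hand) 1 / (1 + exp(-(b0 + b1 * hand)))

p_male(9.5)   # about 0.97: a 9.5-inch hand almost certainly belongs to a Male
p_male(7.0)   # near 0: a 7-inch hand almost certainly belongs to a Female
```

This is exactly the curve Logit draws across the observed range of Hand sizes.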
10.4.5 Outliers
The Logit function also provides the outlier analysis (Section 9.5, p. 216) that accompanies the least-squares regression analyses. Examination of the scatter plot of the data in Figure 10.5 indicates that the most deviant data values are for Males with a Hand circumference of 7 inches, and for Females with a value of 9.5 inches. These are also the four values that have an R-Studentized residual larger in magnitude than 2, shown in Listing 10.13. They also have the largest values of Cook's Distance (influence statistics, Section 9.5.1, p. 217).
     Hand Gender  fitted residual rstudent   dffits    cooks
152   9.5      F  0.9706  -0.9706   -2.684  -0.2256  0.07555
170   9.5      F  0.9706  -0.9706   -2.684  -0.2256  0.07555
Listing 10.13 The four cases with the magnitude of the R-Student residual larger than 2 and also the largest values of Cook’s Distance.
The information in Listing 10.13 lists the most extreme misclassifications. The data values could be correct, but if re-examination is possible, the assignment of Gender in these four cases should be double-checked for a transcription error.
Regression II 245
10.4.6 Logistic Multiple Regression
The potential for improvement of model fit and understanding of the relations among the variables in the model applies to all multiple regression models regardless of their means of estimation, least-squares or maximum likelihood. With additional predictor variables that tend to be uncorrelated with each other but correlated with the response variable, the model will demonstrate improved fit and a richer understanding of the underlying relationships. Here consider a variety of other measurements intended to help differentiate between Male and Female.
lessR Input Logistic multiple regression
> Logit(Gender ~ Height + Weight + Waist + Hips + Chest + Hand + Shoe)
The model specifies seven predictors to account for Gender. The estimated regression coefficients from the logistic regression model, similar to those in
Listing 10.10 , provide the information needed to compute the odds ratios. For this logistic multiple regression we move directly to this output, which appears in Listing 10.14 . The interpretations of these values are similar to that for the one-predictor model already considered, except that each coefficient is interpreted with the values of all remaining predictor variables held constant.
Listing 10.14 Odds ratios and associated confidence intervals.
As seen from the output in Listing 10.10, the Logit function provides the estimated coefficients and hypothesis tests of each regression coefficient. Although not shown here, in this analysis the partial slope coefficients for three variables were not significantly different from 0; that is, the corresponding p-values were all larger than α = 0.05. These three variables are Height, Waist, and Chest. This lack of significance can also be gleaned from the 95% confidence intervals of the odds ratios reported in Listing 10.14. The confidence intervals of the odds ratios for these three variables with non-significant coefficients all cross the neutral value of 1.0. For example, the lower bound of this interval for Height is 0.9546, which means that for each one-inch increase in Height, the odds of a random selection of a Male decrease by 1 − 0.9546, or 4.5%, with the values for all other variables held constant. Yet the upper bound of the same confidence interval is 1.4571, which means that the same odds increase by 45.7%. Because the same confidence interval contains values below 1.0 and above 1.0, the corresponding
relationship between Height and Gender=1 cannot be shown to exist, with the values of all other variables held constant.
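The percentage interpretations of the two confidence bounds come directly from the distance of each odds ratio from the neutral value 1.0:

```r
lower <- 0.9546   # lower 95% bound of the Height odds ratio
upper <- 1.4571   # upper 95% bound of the Height odds ratio

(1 - lower) * 100   # about 4.5: odds of Male could decrease ~4.5% per inch of Height
(upper - 1) * 100   # about 45.7: or increase ~45.7% per inch of Height
```

Since the interval allows both a decrease and an increase, no direction of relationship can be established for Height.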
10.4.7 Assessment of Fit
A primary purpose for adding more predictor variables to a model is to enhance the model's ability to predict the response variable. The classification table for this seven-predictor model appears in Listing 10.15. The percentage of correct classifications has increased from 88.2% with the one-predictor model to 93.2% with the seven-predictor model.
Listing 10.15 Classification table based on seven predictors of Gender.
The addition of six more variables to the original logistic model with just Hand circum- ference as the predictor variable has enhanced predictive efficiency. Are all seven predictor variables necessary? The logistic multiple regression indicated that three of the variables had non-significant partial slope coefficients. This lack of significance suggests that these three coefficients do not contribute to the predictive efficiency of the model. This concept can be more formally evaluated by comparing the fit of the reduced or nested model with four predictors to the full model with seven predictors.
As with least-squares regression, use the lessR function Nest to conduct this hypothesis test. The null hypothesis is that all of the deleted variables have zero partial slope coefficients. By default the function assumes a least-squares solution. To specify that logistic regression is the applicable procedure, invoke the method option, set to "logit".

lessR Input Compare nested models

> Nest(Gender, c(Weight, Hips, Hand, Shoe),
       c(Height, Weight, Waist, Hips, Chest, Hand, Shoe), method="logit")
The result of the comparison of the two models with Nest is shown in Listing 10.16. The Deviance index is the analogy for a maximum likelihood solution to the sum of the squared residuals for a least-squares solution. The Nest function applied to least-squares solutions is discussed in Section 10.2.3, p. 231.
Here the assessment is of the reduction of Deviance from the nested model to the full model, a reduction of 6.5019. The result is not significant, so the addition of the three predictor variables to the nested four-predictor model does not significantly contribute to the reduction in Deviance.
Effect of Height, Waist, and Chest: p-value = 0.090 > α = 0.05, so do not reject H0: β_Height = β_Waist = β_Chest = 0
Model 1: Gender ~ Weight + Hips + Hand + Shoe
Model 2: Gender ~ Height + Weight + Waist + Hips + Chest + Hand + Shoe

  Resid.df   Resid.Dev   df   Deviance   p-value
Listing 10.16 Direct comparison of a nested model to the full model with logistic regression.
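The same nested-model test can be run in base R by fitting both models with glm and comparing them with anova(..., test = "Chisq"), which tests the drop in Deviance just as Nest does. The data below are simulated, with one deliberately irrelevant predictor, so the sketch runs on its own:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.8 * x1 + 0.8 * x2))  # x3 does not affect y

reduced <- glm(y ~ x1 + x2,      family = binomial)  # nested model
full    <- glm(y ~ x1 + x2 + x3, family = binomial)  # full model

# Chi-squared test of the reduction in Deviance from reduced to full
anova(reduced, full, test = "Chisq")
```

A non-significant p-value, as in Listing 10.16, indicates that the extra predictors do not meaningfully reduce the Deviance.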
The classification table further supports the viability of the four-predictor model. Re-running the model with only these four predictors confirms this analysis: the resulting table is identical to that from the seven-predictor model in Listing 10.15. With Weight, Hips, Hand circumference, and Shoe size already in the model, the addition of Height, Waist, and Chest does not improve model fit.
The odds ratios for these four predictors in Listing 10.17 change little from those for the seven-predictor model presented in Listing 10.14. A one-inch increase in Hand size increases the odds of selecting a Male by a factor of more than 8.5, with the values of all other predictor variables held constant.
Listing 10.17 Odds ratios and confidence intervals for the four-predictor model.
10.4.8 Predictions from New Data
Given the acceptance of the four-predictor model, predictions can now be made from new data.
Scenario Predict Gender from new data with the logistic regression model
Suppose the Gender of a customer is unknown and so is predicted from his or her corresponding body measurements. Also suppose that it is known that the customer has a medium size glove, but his or her hand size is unknown. Obtain the predicted value of Gender from the known measurements of Weight, Hips, and Shoe size, and then the three Hand circumference values that fit the medium size glove.
The analysis is done with the same set of X1.new, X2.new options and so forth as for the least-squares regression models (prediction in least-squares models, Section 10.3.1, p. 238).

lessR Input Logistic regression prediction from new data

> Logit(Gender ~ Weight + Hips + Hand + Shoe,
        X1.new=155, X2.new=39, X3.new=c(8,8.5,9), X4.new=10)
The predictions for the three different sets of measurements appear in Listing 10.18 . In all three cases the probability is high or very high that Y=1, that is, the person is a Male.
Weight Hips Hand Shoe Ynew predict fitted std.err
Listing 10.18 Probabilities of a Male for three sets of new values for the four predictor variables.
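In base R, the same kind of prediction for new cases comes from predict() with type = "response", which returns fitted probabilities rather than log odds. Simulated one-predictor data again stand in for BodyMeas so the sketch is self-contained:

```r
set.seed(7)
hand   <- c(rnorm(170, 7.2, 0.5), rnorm(170, 8.8, 0.5))  # simulated hand sizes
gender <- rep(c(0, 1), each = 170)                       # 0 = F, 1 = M
fit    <- glm(gender ~ hand, family = binomial)

# Probability that Gender = 1 (Male) for three hypothetical new hand sizes
predict(fit, newdata = data.frame(hand = c(8, 8.5, 9)), type = "response")
```

Each returned value lies between 0 and 1 and increases with hand size, matching the pattern of the fitted probabilities in Listing 10.18.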
In summary, very good differentiation between men and women is obtained simply from knowledge of the circumference of the Hand, with an 88.2% correct classification of Gender on this information alone. Classification accuracy increases by adding the predictor variables of Weight, Hips, and Shoe size to the model. The result for this sample of 170 men and 170 women is an overall correct classification of Gender of 93.2% on the basis of these four body measurements.
Worked Problems
1 Return to the BodyMeas data set (enter ?dataBodyMeas for more information).

> mydata <- Read("BodyMeas", format="lessR")
(a) Predict Weight from Height and Waist. Specify the estimated model. Specify the variables with significant partial slope coefficients. Interpret the largest coefficient.
(b) Identify the obvious outlier. What data value most contributes to the status of this case as an outlier?
(c) Using the Subset function drop this case from the data table.
(d) Re-estimate the model. Is the model reasonably similar or qualitatively different from the model estimated with the outlier?
(e) Predict Weight from Height, Waist, Hips, Chest, Hand circumference, and Shoe size. Specify the estimated model. Specify the variables with significant partial slope coefficients. Interpret the largest coefficient.
(f) Evaluate collinearity.
(g) Examine all possible subset regressions. List two models with five predictors that have an R²adj > 0.80 and also two such models with only two predictors.
(h) Relate your answers for the two previous questions.

2 The Cars93 data set contains much information on 93 1993 car models (enter ?dataCars93 for more information). One variable is Source, which has two values, 0 for a foreign car and 1 for a car manufactured in the USA.

> mydata <- Read("Cars93", format="lessR")
(a) Use the variable HP for horsepower to account for the Source of the automobile. Does horsepower successfully differentiate between non-USA and USA manufactured cars? If the odds ratio is interpretable provide the interpretation.
(b) Use the variable Width to account for the Source of the automobile. Does Width successfully differentiate between foreign and USA manufactured cars? If the odds ratio is interpretable provide the interpretation.
(c) For the one-predictor Width model, what is the predicted Source for a car with a Width of 60.5 inches?
(d) How much improvement in prediction is there with the following predictor variables: Width, PassCap, Wheelbase, Engine, MPGhiway? Are all these predictor variables relevant? Interpret the largest odds ratio.