
10.3 Indicator Variables

The previous examples of regression analysis involved continuous variables. Categorical variables can also appear as predictor variables in a regression analysis. How is that possible given that categorical variables have non-numeric data values? The answer is to convert the levels, that is, categories, to numeric variables.

The numeric versions of a categorical variable are called indicator variables, also sometimes called dummy variables. Create an indicator variable for each category. Each indicator variable is a binary variable, that is, a variable with only two unique values, such as 0 and 1. For example, assign a value of 1 if the level is present for a given case and a value of 0 if it is not. Gender has two levels, Male and Female, so two indicator variables could be created for Gender, illustrated in Table 10.1. Score the indicator variable for Male a 1 if the person is a man and a 0 if the person is a woman. Similarly, score the indicator variable for Female a 1 for a woman and a 0 for a man.

Table 10.1 Data table of the categorical variable Gender and its two indicator variables for four people.

Gender    Male    Female

The benefit of an indicator variable is that despite the non-numeric values of Gender, the 0 and 1 values of the resulting indicator variables are numeric. In general, if there are k categories, or levels, of the categorical variable, then only k − 1 indicator variables are needed to describe the values of the categorical variable. The variable Gender has just two levels, so to know the value of either indicator variable for Gender is to know the value of the remaining indicator variable. If the value of one of the Gender indicator variables for a person is 0, then the value of the remaining indicator is 1, and vice versa.
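As a concrete illustration, this 0/1 coding can be constructed directly in base R. The following is a minimal sketch with a small, hypothetical Gender vector, not the Employee data.

> Gender <- factor(c("M", "F", "F", "M"))   # four illustrative cases
> Male   <- ifelse(Gender == "M", 1, 0)     # 1 for a man, 0 for a woman
> Female <- ifelse(Gender == "F", 1, 0)     # 1 for a woman, 0 for a man
> model.matrix(~ Gender)                    # the k-1 indicator coding R constructs itself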

Gender can be included in a regression analysis in terms of either one of its two indicator variables. The interpretation of the slope coefficient remains the same as for any regression model: the average change in the response variable for a one-unit increase in the corresponding predictor variable, with the values of all other variables in the model held constant. For an indicator variable of Gender, the slope coefficient represents the average change in the response variable when moving from Female to Male, or from Male to Female, depending on which indicator variable was entered into the model.

The core R regression function is lm, for "linear model". This function automatically creates these indicator variables from a categorical variable encoded as a factor (Section 1.6.3, p. 22) when the factor is submitted to a regression analysis. The lessR function Regression invokes lm and then accesses its output, so categorical variables can be entered directly into a function call to Regression. In this example the variable Gender from the Employee data table (Figure 1.7, p. 21) has values M and F.
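For comparison, here is a minimal sketch of the two routes, assuming the lessR package is loaded, that the Employee data distributed with lessR can be read with Read("Employee"), and that the data reside in the default data frame d (named mydata in earlier lessR versions).

> library(lessR)
> d <- Read("Employee")                  # Employee data set included with lessR
> summary(lm(Salary ~ Gender, data=d))   # base R: lm builds the GenderM indicator itself
> Regression(Salary ~ Gender)            # lessR: calls lm, then formats and extends the output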


lessR Input  Regression analysis with a categorical variable
> Regression(Salary ~ Gender)

R names the resulting indicator variable GenderM, a juxtaposition of the name of the categorical variable and the name of the category or level, M. By default R orders the levels of a categorical variable, a factor (Section 3.3.2, p. 59), alphabetically, so F is the first level of Gender and M is the second. R uses the convention of naming the Gender effect with the last of the two levels. The y-intercept in this analysis is the average value of Salary for the first level of Gender, F. The slope coefficient for the indicator variable GenderM represents the "male effect", the average change in Salary moving from the first level of Gender, F, to the second level, M.
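If a "female effect" is preferred instead, the reference level of the factor can be changed before the analysis. A brief base R sketch, assuming the Employee data are in the data frame d; the releveled copy is a hypothetical working variable, not part of the original data.

> levels(d$Gender)                   # "F" "M": F is the reference level by default
> Gender2 <- relevel(d$Gender, ref="M")
> levels(Gender2)                    # "M" "F": the regression indicator would now code the F level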

The estimated regression coefficients and analyses appear in Listing 10.6 .

[Output: Estimate, Std Err, t-value, p-value, and 95% confidence limits for the Intercept and GenderM; the Intercept estimate is 56830.598.]

Listing 10.6 Estimated model for Salary regressed on Gender.

The output in Listing 10.6 provides the estimated model.

$\hat Y_{\text{Salary}} = \$56830.60 + \$14316.86\, X_{\text{GenderM}}$

Both coefficients are significant, with p-values less than 0.05. In this sample the average increase in Salary for a one-unit increase in the indicator variable, moving from Female to Male, is $14317. The accompanying scatter plot from the Regression function in Figure 10.3 illustrates this difference between the mean Salary levels, with horizontal lines drawn through the mean of each group. This is the default form of the lessR scatter plot function (Section 8.2.1, p. 182) when the variable plotted on the horizontal axis is a factor. The Regression function automatically provides the call to ScatterPlot.

Figure 10.3 Scatter plot of the factor variable Gender with Salary.

The estimated slope coefficient $b_1 = \$14317$ is the sample mean difference, $\bar Y_M - \bar Y_F$. The p-value from a t-test analysis of the mean difference (ttest function, Section 6.3, p. 130) is identical to the p-value of the sample slope coefficient from a least-squares regression analysis, as are the corresponding confidence intervals. This comparison is from the Regression output in Listing 10.6 and the ttest output in Listing 10.7.
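A quick base R cross-check of this equivalence, assuming the Employee data in data frame d; the pooled-variance option is specified here to match the independent-groups t-test reported by lessR.

> means <- tapply(d$Salary, d$Gender, mean)        # group means for F and M
> means["M"] - means["F"]                          # equals the GenderM slope, about $14317
> t.test(Salary ~ Gender, data=d, var.equal=TRUE)  # same test; sign reflects the F minus M ordering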

Hypothesis Test of 0 Mean Diff:  t = 2.088,  df = 35,  p-value = 0.044
Margin of Error for 95% Confidence Level:  13921.454
95% Confidence Interval for Mean Difference:  395.406 to 28238.314

Listing 10.7 Independent groups t-test analysis of the mean difference of Salary for men and women.

This equivalence of results also extends to the analysis of variance.

lessR Input  Three equivalent analyses of a mean difference
> ttest(Salary ~ Gender)
> ANOVA(Salary ~ Gender)
> Regression(Salary ~ Gender)

Each of these three analyses yields identical results for the hypothesis test of the mean difference of Salary for men and women. A summary of the output for all three analyses follows, where $\mu_M$ and $\mu_F$ are the respective population mean Salaries for men and women.

Gender effect on Salary:
t-test for $H_0\!: \mu_M - \mu_F = 0$,  p-value $= 0.044 < \alpha = 0.05$
ANOVA for $H_0\!: \mu_M = \mu_F$,  p-value $= 0.044 < \alpha = 0.05$
Regression for $H_0\!: \beta_{Gender} = 0$,  p-value $= 0.044 < \alpha = 0.05$

The result of each analysis is the detection of a difference between average Salary of men and women. That is, the difference is statistically significant. Men have a higher average salary than women at this company.
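The same three analyses can also be run with base R functions, a hedged sketch assuming the Employee data in data frame d; all three return the same p-value of 0.044 for the Gender effect.

> t.test(Salary ~ Gender, data=d, var.equal=TRUE)   # two-group t-test
> summary(aov(Salary ~ Gender, data=d))             # one-way ANOVA, where F = t^2
> summary(lm(Salary ~ Gender, data=d))              # regression on the GenderM indicator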

Of the three analyses, each successive one is more general. The t-test (Section 6.3.1, p. 130) compares the means of the response variable across two groups with a t-value. The one-way between-groups ANOVA procedure (Section 7.2, p. 150) compares two or more means by a ratio of two variances, an F-value. The regression procedure can also evaluate the response variable for changes in the value of a continuous predictor variable, again with a t-value.

This analysis established the difference between average men's and women's Salary at this company. This slope coefficient, however, is a gross effect (Section 10.2, p. 225). How much average Salary changes between the two values of Gender does not separate the direct effect of Gender on Salary from potential indirect effects. That is, the correlation between Salary and Gender could result to some extent from a causal relation to Salary of other variables with which Gender is correlated.


Another question of interest is the net effect of Gender on Salary, controlling for the values of other potentially confounding variables. One such potential variable is Years of experience working at the company. To evaluate the net effect of Gender relative to Years of experience, consider the following multiple regression.

> Regression(Salary ~ Years + Gender)

The output appears in Listing 10.8.

[Output: Estimate, Std Err, t-value, and p-value for the Intercept, Years, and GenderM; the GenderM estimate is -5170.610.]

Listing 10.8 Estimated multiple regression model of net effects of Years experience and Gender.

From Listing 10.8 , write the model as follows.

$\hat Y_{\text{Salary}} = \$33103.74 + \$3467.77\, X_{\text{Years}} - \$5170.61\, X_{\text{GenderM}}$

In particular, the net effect of Gender is not significant.

Net effect of Gender on Salary:
$p$-value $= 0.246 > \alpha = 0.05$ for $H_0\!: \beta_{Gender} = 0$, do not reject $H_0$

The effect of Years, however, is significant.

Net effect of Years on Salary:
$p$-value $= 0.000 < \alpha = 0.05$ for $H_0\!: \beta_{Years} = 0$, reject $H_0$

In this sample each additional year of employment at the company leads to an average increase in Salary of $3468, regardless of Gender. With 95% confidence, the true average increase is somewhere between $2681 and $4255. This effect is perhaps a direct causal impact on the determination of Salary. The lack of significance for the net effect of Gender indicates that there is no detected difference in Salary between men and women among employees who have worked at the company for the same number of years.
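The same net-effect model and interval can be reproduced in base R, a sketch again assuming the Employee data in data frame d.

> fit <- lm(Salary ~ Years + Gender, data=d)
> summary(fit)   # GenderM is no longer significant once Years is in the model
> confint(fit)   # 95% interval for the Years coefficient, roughly $2681 to $4255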

Investigating further, for some reason women in this company have worked fewer years on average than men. The two sample averages for Years employed are $\bar Y_M = 12.24$ and $\bar Y_F =$ .84. The corresponding independent groups t-test (Section 6.3, p. 130) with the ttest function indicates that this difference in average Years worked is significant, with a p-value equal to 0.003. The reason for this pattern is not clear from the data. Perhaps management used to be dominated by chauvinists who did not hire women, but now all the old guys are dead, retired, or in jail.
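A hedged base R version of this follow-up test, again assuming the Employee data in data frame d:

> t.test(Years ~ Gender, data=d, var.equal=TRUE)   # difference in mean Years, p-value about 0.003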

The analysis detected no overt discrimination against women in the company at this time. After some number of years, however, the revised data should be analyzed to verify that equal Salary by Gender is in fact becoming the norm as women gain more work experience at the company. Currently the second highest paid employee, Leslie James, is a woman, so perhaps this trend will be realized. As illustrated, the implementation of multiple regression allows for statistical control, which then permits a more sophisticated examination of causal influences.

10.3.1 Prediction from New Data

As discussed in the previous chapter, true forecasting or prediction occurs with new data (Section 9.4.2, p. 214). The values of the response variable are already known for the data from which the model is estimated. Only by coincidence are new values of the predictor variables equal to the existing values. The fitted values and corresponding prediction intervals need to be calculated for any specified values of the predictors, not just for the existing values in the original data set.

Scenario  Prediction from new data
To assess the potential gender discrimination in the company, calculate predicted values of Salary for men and women separately for various Years of experience working in the company. Even though not as many women have extended Years of experience as the men, the model provides the values of Salary consistent with additional experience. Are there Gender differences?

To provide the predictions and 95% prediction intervals for new, specified data values, use the X1.new and X2.new options of the Regression function. Use X1.new to specify the values for the first predictor variable and X2.new for the second predictor variable listed in the function call. These values may be specified for up to the arbitrary cutoff of five predictor variables, that is, up to X5.new.

lessR Input  Prediction from new data
> Regression(Salary ~ Years + Gender,
      X1.new=c(10,15,20), X2.new=c("F","M"))
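A base R counterpart of these new-data predictions, sketched under the same assumption that the Employee data are in data frame d; predict() with interval="prediction" returns the fitted values and prediction limits.

> fit <- lm(Salary ~ Years + Gender, data=d)
> new <- expand.grid(Years=c(10, 15, 20), Gender=c("F", "M"))
> cbind(new, predict(fit, newdata=new, interval="prediction"))   # columns fit, lwr, upr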

The output from Regression is identical to the previous analysis except for the section that provides the predicted values, which appears in Listing 10.9 .

Years  Gender  Salary  fitted  ci:lwr  ci:upr  pi:lwr  pi:upr  width

Listing 10.9 Predicted values and prediction intervals of Salary from Years experience and Gender for specified values of new data.

The output for the predicted values contains no values for the response variable Salary because at this time these values are unknown. The widths of the prediction intervals are around $50000, so prediction is not expected to be accurate. The fitted values, however, are perfectly consistent with the two-predictor-variable model. Based on the current distribution of Salary among men and women at the company, women are predicted to have a larger Salary than men.

The larger predicted Salaries for women indicate from another perspective the information obtained from a multiple regression model, which necessarily assesses the net effects of the predictor variables instead of the gross effects. The predicted Salaries for men and then for women based on Gender alone are their respective average Salaries in this sample. The respective men's and women's average Salaries are $\bar Y_M = \$71147.46$ and $\bar Y_F = \$56830.60$, a difference of $14316.86. When the number of Years worked is controlled, however, the Gender effect is no longer significant, although women are predicted to make more than men at all levels of experience. Again, what needs to be verified is that this model remains applicable as women actually gain more experience working at the company.
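As a quick arithmetic check of the fitted values, evaluate the estimated model at, say, Years = 10.

For a woman ($X_{\text{GenderM}} = 0$):  $\hat Y_{\text{Salary}} = 33103.74 + 3467.77(10) = \$67781.44$
For a man ($X_{\text{GenderM}} = 1$):  $\hat Y_{\text{Salary}} = 67781.44 - 5170.61 = \$62610.83$

At any fixed value of Years, the fitted Salary for a woman exceeds that for a man by exactly $5170.61, the (non-significant) GenderM coefficient.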