Categorical or Indicator Variables

12.8 Categorical or Indicator Variables

An extremely important special-case application of multiple linear regression occurs when one or more of the regressor variables are categorical, indicator, or dummy variables. In a chemical process, the engineer may wish to model the process yield against regressors such as process temperature and reaction time. However, there is interest in using two different catalysts and somehow including “the catalyst” in the model. The catalyst effect cannot be measured on a continuum and is hence a categorical variable. An analyst may wish to model the price of homes against regressors that include square feet of living space $x_1$, the land acreage $x_2$, and age of the house $x_3$. These regressors are clearly continuous in nature. However, the price of homes may vary substantially from one area of the country to another. If data are collected on homes in the east, midwest, south, and west, we have an indicator variable with four categories. In the chemical process example, if two catalysts are used, we have an indicator variable with two categories. In a biomedical example in which a drug is to be compared to a placebo, all subjects are evaluated on several continuous measurements such as age, blood pressure, and so on, as well as gender, which of course is categorical with two categories. So, included along with the continuous variables are two indicator variables: treatment with two categories (active drug and placebo) and gender with two categories (male and female).

Model with Categorical Variables

Let us use the chemical processing example to illustrate how indicator variables are involved in the model. Suppose $y$ = yield, $x_1$ = temperature, and $x_2$ = reaction time. Now let us denote the indicator variable by $z$. Let $z = 0$ for catalyst 1 and $z = 1$ for catalyst 2. The assignment of the (0, 1) indicator to the catalyst is arbitrary. As a result, the model becomes

\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 z_i + \epsilon_i, \quad i = 1, 2, \ldots, n. \]
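As a minimal sketch of how the two-catalyst model above could be fitted by least squares, the following Python code builds a design matrix with an intercept column, the two continuous regressors, and the 0/1 indicator. The function name `fit_indicator_model`, the variable names, and all data values are illustrative assumptions; the data are synthetic and noise-free, not taken from the text.

```python
import numpy as np

def fit_indicator_model(x1, x2, z, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 + b3*z + error."""
    # Design matrix: intercept, two continuous regressors, one 0/1 indicator.
    X = np.column_stack([np.ones(len(y)), x1, x2, z])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic, noise-free data (illustrative values only, not from the text).
rng = np.random.default_rng(0)
n = 12
temperature = rng.uniform(150.0, 200.0, size=n)   # x1
reaction_time = rng.uniform(1.0, 5.0, size=n)     # x2
catalyst = np.array([0, 1] * (n // 2))            # z: 0 = catalyst 1, 1 = catalyst 2
true_beta = np.array([10.0, 0.2, 1.5, 4.0])       # assumed coefficients
yield_ = (true_beta[0] + true_beta[1] * temperature
          + true_beta[2] * reaction_time + true_beta[3] * catalyst)

b = fit_indicator_model(temperature, reaction_time, catalyst, yield_)
print(np.round(b, 3))
```

Because the synthetic response contains no noise, the fitted coefficients should reproduce the assumed values; the coefficient on `z` is the shift in yield between catalyst 2 and catalyst 1 at fixed temperature and reaction time.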

Three Categories

The estimation of coefficients by the method of least squares continues to apply. In the case of three levels or categories of a single indicator variable, the model will include two regressors, say $z_1$ and $z_2$, where the (0, 1) assignment is as follows:

Category   z_1   z_2
1          1     0
2          0     1
3          0     0

where 0 and 1 are vectors of 0's and 1's, respectively. In other words, if there are $\ell$ categories, the model includes $\ell - 1$ actual model terms.
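The rule that a categorical variable contributes one fewer model term than it has categories can be sketched in Python as follows. The helper `indicator_columns` and the labels `cat1`, `cat2`, `cat3` are hypothetical names chosen for illustration, and the sketch assumes the chosen baseline is one of the observed labels.

```python
import numpy as np

def indicator_columns(labels, baseline):
    """Return (levels, Z): one 0/1 column per category except the baseline.

    Assumes `baseline` is one of the values in `labels`.
    """
    levels = [lev for lev in sorted(set(labels)) if lev != baseline]
    Z = np.array([[1 if lab == lev else 0 for lev in levels] for lab in labels])
    return levels, Z

# Three categories, so two indicator columns; the baseline gets (0, 0).
labels = ["cat1", "cat2", "cat3", "cat1", "cat3", "cat2"]
levels, Z = indicator_columns(labels, baseline="cat3")
print(levels)
print(Z)
```

The baseline category is represented by a row of zeros, so its effect is absorbed into the intercept, which is why only two columns are needed for three categories.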

It may be instructive to look at a graphical representation of the model with three categories. For the sake of simplicity, let us assume a single continuous variable $x$. As a result, the model is given by

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 z_{1i} + \beta_3 z_{2i} + \epsilon_i. \]

Thus, Figure 12.2 reflects the nature of the model. The following are model expressions for the three categories.

\[ E(y) = (\beta_0 + \beta_2) + \beta_1 x, \quad \text{category 1 } (z_1 = 1,\ z_2 = 0), \]
\[ E(y) = (\beta_0 + \beta_3) + \beta_1 x, \quad \text{category 2 } (z_1 = 0,\ z_2 = 1), \]
\[ E(y) = \beta_0 + \beta_1 x, \quad \text{category 3 } (z_1 = 0,\ z_2 = 0). \]

As a result, the model involving categorical variables essentially involves a change in the intercept as we change from one category to another. Here of course we are assuming that the coefficients of continuous variables remain the same across the categories.
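The intercept shift can be made concrete with a short Python sketch. It assumes the coding in which category 1 has $(z_1, z_2) = (1, 0)$, category 2 has $(0, 1)$, and category 3 has $(0, 0)$; the function name `category_lines` and the coefficient values are illustrative only.

```python
def category_lines(beta):
    """Map (b0, b1, b2, b3) of y = b0 + b1*x + b2*z1 + b3*z2 to per-category lines.

    Returns {category: (intercept, slope)} under the assumed coding
    category 1 -> (z1, z2) = (1, 0), category 2 -> (0, 1), category 3 -> (0, 0).
    """
    b0, b1, b2, b3 = beta
    return {
        1: (b0 + b2, b1),  # z1 = 1 adds b2 to the intercept
        2: (b0 + b3, b1),  # z2 = 1 adds b3 to the intercept
        3: (b0, b1),       # baseline category keeps the intercept b0
    }

# Illustrative coefficients, not estimates from any data set in the text.
lines = category_lines((5.0, 2.0, 3.0, -1.0))
for category, (intercept, slope) in lines.items():
    print(category, intercept, slope)
```

All three lines share the slope `b1`, which is the parallelism assumption built into the additive model.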

Figure 12.2: Case of three categories (y plotted against x, with one line each for Category 1, Category 2, and Category 3).

Example 12.9: Consider the data in Table 12.7. The response $y$ is the amount of suspended solids in a coal cleansing system. The variable $x$ is the pH of the system. Three different polymers are used in the system. Thus, “polymer” is categorical with three categories and hence produces two model terms. The model is given by

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 z_{1i} + \beta_3 z_{2i} + \epsilon_i, \quad i = 1, 2, \ldots, 18. \]

Chapter 12 Multiple Linear Regression and Certain Nonlinear Regression Models

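Here we have

\[ z_1 = \begin{cases} 1, & \text{for polymer 1}, \\ 0, & \text{otherwise}, \end{cases} \qquad z_2 = \begin{cases} 1, & \text{for polymer 2}, \\ 0, & \text{otherwise}. \end{cases} \]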

From the analysis in Figure 12.3, the following conclusions are drawn. The coefficient $b_1$ for pH is the estimate of the common slope that is assumed in the regression analysis. All model terms are statistically significant. Thus, pH and the nature of the polymer have an impact on the amount of cleansing. The signs and magnitudes of the coefficients of $z_1$ and $z_2$ indicate that polymer 1 is most effective (producing higher suspended solids) for cleansing, followed by polymer 2. Polymer 3 is least effective.

Table 12.7: Data for Example 12.9

x (pH)   y (amount of suspended solids)   Polymer

Slope May Vary with Indicator Categories

In the discussion given here, we have assumed that the indicator variable model terms enter the model in an additive fashion. This suggests that the slopes, as in Figure 12.2, are constant across categories. Obviously, this is not always going to be the case. We can account for the possibility of varying slopes and indeed test for this condition of parallelism by including product or interaction terms between indicator terms and continuous variables. For example, suppose a model with one continuous regressor and an indicator variable with two levels is chosen. The model is given by

\[ y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 x z + \epsilon. \]


Source             DF    Sum of Squares    Mean Square    F Value
Corrected Total    17    85260.44444

R-Square    Coeff Var    Root MSE    y Mean

Parameter    Estimate        Standard Error    t Value    Pr > |t|
Intercept    -161.8973333    37.43315576

Figure 12.3: SAS printout for Example 12.9.

This model suggests that for category 1 ($z = 1$),

\[ E(y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x, \]

while for category 2 ($z = 0$),

\[ E(y) = \beta_0 + \beta_1 x. \]

Thus, we allow for varying intercepts and slopes for the two categories. Figure 12.4 displays the regression lines with varying slopes for the two categories.
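A minimal Python sketch of the varying-slope model is given here, assuming a two-level indicator `z` and a single continuous regressor `x`. The function name `fit_interaction_model` and the data are hypothetical; the synthetic, noise-free response is generated from two known lines so that the recovered intercepts and slopes can be checked by eye.

```python
import numpy as np

def fit_interaction_model(x, z, y):
    """Least-squares fit of y = b0 + b1*x + b2*z + b3*x*z + error."""
    # The product column x*z lets the slope differ between the two categories.
    X = np.column_stack([np.ones(len(y)), x, z, x * z])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic, noise-free data (illustrative values only, not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
z = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# z = 1 follows y = 6 + 1.5x; z = 0 follows y = 2 + 2x.
y = np.where(z == 1, 6.0 + 1.5 * x, 2.0 + 2.0 * x)

b0, b1, b2, b3 = fit_interaction_model(x, z, y)
line_z1 = (b0 + b2, b1 + b3)   # (intercept, slope) when z = 1
line_z0 = (b0, b1)             # (intercept, slope) when z = 0
print(np.round(line_z1, 3), np.round(line_z0, 3))
```

In this construction the assumed coefficients are $\beta_0 = 2$, $\beta_1 = 2$, $\beta_2 = 4$, and $\beta_3 = -0.5$, so the two fitted lines differ in both intercept and slope.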

Figure 12.4: Nonparallelism in categorical variables (Category 1: slope = $\beta_1 + \beta_3$; Category 2: slope = $\beta_1$).

In this case, $\beta_0$, $\beta_1$, and $\beta_2$ are positive while $\beta_3$ is negative with $|\beta_3| < \beta_1$. Obviously, if the interaction coefficient $\beta_3$ is insignificant, we are back to the common slope model.
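One way to check whether the interaction coefficient matters is the usual t statistic for a single regression coefficient, sketched below in Python. This is an assumption about how the check might be carried out, not a procedure prescribed in the text; the function name `interaction_t_statistic`, the noise level, and the data are illustrative and synthetic.

```python
import numpy as np

def interaction_t_statistic(x, z, y):
    """Return (b3, t, df) for testing beta3 = 0 in y = b0 + b1*x + b2*z + b3*x*z + error."""
    X = np.column_stack([np.ones(len(y)), x, z, x * z])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    df = len(y) - X.shape[1]              # error degrees of freedom
    s2 = resid @ resid / df               # estimate of the error variance
    cov = s2 * np.linalg.inv(X.T @ X)     # estimated covariance of the coefficients
    t = coef[3] / np.sqrt(cov[3, 3])      # estimate divided by its standard error
    return coef[3], t, df

# Synthetic data with clearly different slopes and small noise (illustrative only).
rng = np.random.default_rng(1)
x = np.tile(np.arange(1.0, 9.0), 2)
z = np.repeat([1, 0], 8)
y = np.where(z == 1, 6.0 + 1.5 * x, 2.0 + 2.0 * x) + rng.normal(0.0, 0.1, size=16)

b3, t, df = interaction_t_statistic(x, z, y)
print(round(b3, 3), round(t, 2), df)
```

A t statistic that is large in absolute value relative to a t distribution with `df` degrees of freedom suggests the slopes differ; a small one is consistent with the common slope model.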


Exercises

12.45 A study was done to assess the cost effectiveness of driving a four-door sedan instead of a van or an SUV (sports utility vehicle). The continuous variables are odometer reading and octane of the gasoline used. The response variable is miles per gallon. The data are presented here.

MPG    Car Type   Odometer   Octane
34.5   sedan                 87.5
33.3   sedan                 87.5
                             78.0
                             90.0

(a) Fit a linear regression model including two indicator variables. Use (0, 0) to denote the four-door sedan.

(b) Which type of vehicle appears to get the best gas mileage?

(c) Discuss the difference between a van and an SUV in terms of gas mileage.

12.46 A study was done to determine whether the gender of the credit card holder was an important factor in generating profit for a certain credit card company. The variables considered were income, the number of family members, and the gender of the card holder. The data are as follows:

Profit   Income   Gender   Members

(a) Fit a linear regression model using the variables available. Based on the fitted model, would the company prefer male or female customers?

(b) Would you say that income was an important factor in explaining the variability in profit?
