Chapter 14 - Introduction to Multiple Regression
Statistics for Managers Using Microsoft® Excel, 5th Edition
Learning Objectives

In this chapter, you learn:
- How to develop a multiple regression model
- How to interpret the regression coefficients
- How to determine which independent variables to include in the regression model
- How to determine which independent variables are most important in predicting a dependent variable
- How to use categorical variables in a regression model
The Multiple Regression Model

Idea: examine the linear relationship between one dependent variable (Y) and two or more independent variables (Xi).

Multiple regression model with k independent variables:

Yi = β0 + β1 X1i + β2 X2i + … + βk Xki + εi

where β0 is the Y-intercept, β1 … βk are the population slopes, and εi is the random error.
Multiple Regression Equation

The coefficients of the multiple regression model are estimated using sample data. The multiple regression equation with k independent variables is:

Ŷi = b0 + b1 X1i + b2 X2i + … + bk Xki

where b0 is the estimated intercept, b1 … bk are the estimated slope coefficients, and Ŷi is the estimated (or predicted) value of Y.

In this chapter we will always use Excel to obtain the regression slope coefficients and other regression summary measures.
Multiple Regression Equation

Example with two independent variables:

Ŷ = b0 + b1 X1 + b2 X2

(Figure: the regression plane over the X1 and X2 axes, with b1 the slope for variable X1 and b2 the slope for variable X2.)
Multiple Regression Equation: 2 Variable Example

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand.
- Dependent variable: pie sales (units per week)
- Independent variables: price (in $), advertising ($100s)
Data are collected for 15 weeks.
Multiple Regression Equation: 2 Variable Example

Sales = b0 + b1(Price) + b2(Advertising)

where X1 = Price and X2 = Advertising.

Week  Pie Sales  Price ($)  Advertising ($100s)
 1    350        5.50       3.3
 2    460        7.50       3.3
 3    350        8.00       3.0
 4    430        8.00       4.5
 5    350        6.80       3.0
 6    380        7.50       4.0
 7    430        4.50       3.0
 8    470        6.40       3.7
 9    450        7.00       3.5
10    490        5.00       4.0
11    340        7.20       3.5
12    300        7.90       3.2
13    440        5.90       4.0
14    450        5.00       3.5
15    300        7.00       2.7

Excel output:

Regression Statistics
Multiple R         0.72213
R Square           0.52148
Adjusted R Square  0.44172
Standard Error     47.46341
Observations       15

ANOVA        df   SS          MS          F        Significance F
Regression    2   29460.027   14730.013   6.53861  0.01201
Residual     12   27033.306    2252.776
Total        14   56493.333

             Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%
Intercept     306.52619    114.25389        2.68285  0.01993   57.58835  555.46404
Price         -24.97509     10.83213       -2.30565  0.03979  -48.57626   -1.37392
Advertising    74.13096     25.96732        2.85478  0.01449   17.55303  130.70888

Multiple regression equation: Sales = 306.526 − 24.975(X1) + 74.131(X2)
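The Excel coefficients can be reproduced outside Excel as a cross-check; a minimal sketch in Python using NumPy's least-squares solver (not part of the textbook's Excel workflow):

```python
import numpy as np

# Weekly pie sales data from the 15-week example above.
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                        3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

# Design matrix with an intercept column; solve ordinary least squares.
X = np.column_stack([np.ones(len(sales)), price, advertising])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)

print(b)  # intercept, price slope, advertising slope
```

The solution matches the Excel output: b0 ≈ 306.526, b1 ≈ −24.975, b2 ≈ 74.131.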
Multiple Regression Equation: 2 Variable Example

Sales = 306.526 − 24.975(Price) + 74.131(Advertising)

where Sales is in number of pies per week, Price is in $, and Advertising is in $100s.

- b1 = −24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising
- b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price
Multiple Regression Equation: 2 Variable Example

Predict sales for a week in which the selling price is $5.50 and advertising is $350:

Sales = 306.526 − 24.975(Price) + 74.131(Advertising)
      = 306.526 − 24.975(5.50) + 74.131(3.5)
      = 428.62

Predicted sales is 428.62 pies. Note that Advertising is in $100s, so $350 means X2 = 3.5.
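The prediction above is direct arithmetic on the fitted coefficients; a short sketch (the helper name predict_sales is ours, not from the text):

```python
# Fitted coefficients from the Excel output (rounded as on the slide).
b0, b1, b2 = 306.526, -24.975, 74.131

def predict_sales(price, advertising_100s):
    """Predicted weekly pie sales; advertising is measured in $100s."""
    return b0 + b1 * price + b2 * advertising_100s

# $5.50 price, $350 of advertising -> X2 = 3.5
print(round(predict_sales(5.50, 3.5), 2))  # 428.62
```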
Coefficient of Multiple Determination

Reports the proportion of total variation in Y explained by all X variables taken together:

r² = SSR / SST = regression sum of squares / total sum of squares
Coefficient of Multiple Determination (Excel)

From the regression output, SSR = 29460.0 and SST = 56493.3, so:

r² = SSR / SST = 29460.0 / 56493.3 = 0.52148

52.1% of the variation in pie sales is explained by the variation in price and advertising.
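The ratio can be checked directly from the ANOVA sums of squares:

```python
# Sums of squares from the ANOVA table.
ssr = 29460.027   # regression sum of squares
sst = 56493.333   # total sum of squares

r_squared = ssr / sst
print(round(r_squared, 5))  # 0.52148
```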
Adjusted r²

- r² never decreases when a new X variable is added to the model
- This can be a disadvantage when comparing models
- What is the net effect of adding a new variable? We lose a degree of freedom when a new X variable is added
- Did the new X variable add enough independent power to offset the loss of one degree of freedom?
Adjusted r²

Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

r²adj = 1 − [(1 − r²) (n − 1) / (n − k − 1)]

(where n = sample size, k = number of independent variables)

- Smaller than r²
- Penalizes excessive use of unimportant independent variables
- Useful in comparing models
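For the pie sales model (n = 15, k = 2) the adjustment works out as follows:

```python
n, k = 15, 2
r_squared = 29460.027 / 56493.333   # SSR / SST from the ANOVA table

r_squared_adj = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(round(r_squared_adj, 5))  # 0.44172
```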
Adjusted r² (Excel)

From the regression output: r²adj = 0.44172.

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables.
F-Test for Overall Significance

The F-test for overall significance of the model shows whether there is a linear relationship between all of the X variables considered together and Y. Use the F test statistic.

Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βj ≠ 0 (at least one independent variable affects Y)
F-Test for Overall Significance

Test statistic:

F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))

where F has k (numerator) and (n − k − 1) (denominator) degrees of freedom.
F-Test for Overall Significance (Excel)

From the regression output:

F = MSR / MSE = 14730.0 / 2252.8 = 6.5386

The p-value for the F-test (Significance F) is 0.0120.
F-Test for Overall Significance

H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 12
Critical value: F.05 = 3.885

Test statistic: F = MSR / MSE = 6.5386

Decision: since the F test statistic is in the rejection region (F = 6.5386 > 3.885; p-value = .0120 < .05), reject H0.

Conclusion: there is evidence that at least one independent variable affects Y.
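A minimal numeric check of the decision rule, using the mean squares and the tabled critical value from the slide:

```python
msr = 14730.013   # mean square regression (SSR / k)
mse = 2252.776    # mean square error (SSE / (n - k - 1))

f_stat = msr / mse
f_crit = 3.885    # F critical value for alpha = .05, df = (2, 12)

print(round(f_stat, 4), f_stat > f_crit)  # 6.5386 True -> reject H0
```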
Residuals in Multiple Regression

Two variable model: Ŷ = b0 + b1 X1 + b2 X2

For a sample observation Yi, the residual is the difference between the observed and predicted value:

ei = (Yi − Ŷi)

(Figure: the regression plane over the X1 and X2 axes, with the residual shown as the vertical distance from the observed point at (x1i, x2i) down to the plane.)

The best-fitting linear regression equation, Ŷ, is found by minimizing the sum of squared errors, Σe².
Multiple Regression Assumptions

Errors (residuals) from the regression model: ei = (Yi − Ŷi)

Assumptions:
- The errors are independent
- The errors are normally distributed
- Errors have an equal variance
Multiple Regression Assumptions

These residual plots are used in multiple regression:
- Residuals vs. Ŷi
- Residuals vs. X1i
- Residuals vs. X2i
- Residuals vs. time (if time-series data)

Use the residual plots to check for violations of regression assumptions.
Individual Variables: Tests of Hypothesis

Use t-tests of individual variable slopes. These show whether there is a linear relationship between the variable Xj and Y.

Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)
Individual Variables: Tests of Hypothesis

H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)

Test statistic:

t = bj / S_bj   (df = n − k − 1)

From the Excel output:
- t-value for Price is t = −2.306, with p-value .0398
- t-value for Advertising is t = 2.855, with p-value .0145
Individual Variables: Tests of Hypothesis

             Coefficients  Standard Error  t Stat    P-value
Price         -24.97509     10.83213       -2.30565  0.03979
Advertising    74.13096     25.96732        2.85478  0.01449

H0: βj = 0
H1: βj ≠ 0
d.f. = 15 − 2 − 1 = 12, α = .05
Critical values: t α/2 = ±2.1788

Decision: the test statistic for each variable falls in the rejection region (p-values < .05), so reject H0 for each variable.

Conclusion: there is evidence that both Price and Advertising affect pie sales at α = .05.
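Each t statistic is simply the coefficient divided by its standard error; a quick check against the Excel output:

```python
# (coefficient, standard error) pairs from the Excel output.
coefficients = {"Price": (-24.97509, 10.83213),
                "Advertising": (74.13096, 25.96732)}

t_crit = 2.1788  # two-tail critical value for alpha = .05, df = 12

for name, (b, se) in coefficients.items():
    t = b / se
    print(name, round(t, 3), abs(t) > t_crit)  # both exceed the critical value
```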
Confidence Interval Estimate for the Slope

Confidence interval for the population slope βj:

bj ± t(n−k−1) S_bj

where t has (n − k − 1) d.f. Here, t has (15 − 2 − 1) = 12 d.f.

             Coefficients  Standard Error
Intercept     306.52619    114.25389
Price         -24.97509     10.83213
Advertising    74.13096     25.96732

Example: form a 95% confidence interval for the effect of changes in price (X1) on pie sales, holding constant the effects of advertising:

−24.975 ± (2.1788)(10.832), so the interval is (−48.576, −1.374)
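The interval arithmetic above, spelled out:

```python
# Price coefficient and standard error from the Excel output (rounded).
b_price, se_price = -24.975, 10.832
t_crit = 2.1788  # t for 95% confidence, df = 12

margin = t_crit * se_price
lower, upper = b_price - margin, b_price + margin
print(round(lower, 3), round(upper, 3))  # -48.576 -1.374
```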
Confidence Interval Estimate for the Slope

Excel output also reports these interval endpoints:

             Coefficients  Standard Error  …  Lower 95%  Upper 95%
Intercept     306.52619    114.25389       …   57.58835  555.46404
Price         -24.97509     10.83213       …  -48.57626   -1.37392
Advertising    74.13096     25.96732       …   17.55303  130.70888

Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each $1 increase in the selling price, holding constant the effects of advertising.
Testing Portions of the Multiple Regression Model

Contribution of a single independent variable Xj:

SSR(Xj | all variables except Xj) = SSR(all variables) − SSR(all variables except Xj)

This measures the contribution of Xj in explaining the total variation in Y (SST).
Testing Portions of the Multiple Regression Model

Contribution of a single independent variable X1, assuming all other variables are already included (consider here a 3-variable model):

SSR(X1 | X2 and X3) = SSR(all variables) − SSR(X2 and X3)

SSR(all variables) comes from the ANOVA section of the regression for Ŷ = b0 + b1 X1 + b2 X2 + b3 X3, and SSR(X2 and X3) from the ANOVA section of the regression for Ŷ = b0 + b2 X2 + b3 X3. This measures the contribution of X1 in explaining SST.

The Partial F-Test Statistic

Consider the hypothesis test:
H0: variable Xj does not significantly improve the model after all other variables are included
H1: variable Xj significantly improves the model after all other variables are included
Test using the F-test statistic (with 1 and n − k − 1 d.f.):

F = SSR(Xj | all variables except j) / MSE

Testing Portions of Model: Example

Example: frozen dessert pies. Test at the α = .05 level to determine whether the price variable significantly improves the model, given that advertising is included.
Testing Portions of Model: Example

H0: X1 (price) does not improve the model with X2 (advertising) included
H1: X1 does improve the model
α = .05, df = 1 and 12, F critical value = 4.75

ANOVA (for X1 and X2)
            df   SS            MS
Regression   2   29460.02687   14730.01343
Residual    12   27033.30647    2252.775539
Total       14   56493.33333

ANOVA (for X2 only)
            df   SS
Regression   1   17484.22249
Residual    13   39009.11085
Total       14   56493.33333
Testing Portions of Model: Example

F = SSR(X1 | X2) / MSE(all) = (29460.03 − 17484.22) / 2252.78 = 5.316

Conclusion: since 5.316 > 4.75, reject H0; adding X1 does improve the model.
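The partial F computation, using the sums of squares from the two ANOVA tables:

```python
# Regression sums of squares from the two fitted models.
ssr_full = 29460.02687      # model with Price and Advertising
ssr_reduced = 17484.22249   # model with Advertising only
mse_full = 2252.775539      # MSE of the full model

f_partial = (ssr_full - ssr_reduced) / mse_full
f_crit = 4.75  # F critical value, alpha = .05, df = (1, 12)

print(round(f_partial, 3), f_partial > f_crit)  # 5.316 True -> reject H0
```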
Relationship Between Test Statistics

The partial F test statistic developed in this section and the t test statistic are both used to determine the contribution of an independent variable to a multiple regression model. The hypothesis tests associated with these two statistics always result in the same decision (that is, the p-values are identical):

t²a = F(1, a)

where a = degrees of freedom.
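For the pie sales model this identity can be checked numerically: the squared t statistic for Price equals the partial F statistic for Price computed above.

```python
t_price = -2.30565          # t statistic for Price (df = 12)
f_partial_price = 5.316     # partial F for Price (df = 1, 12)

# t squared equals F when the numerator has 1 degree of freedom.
print(round(t_price ** 2, 2))  # 5.32
```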
Coefficient of Partial Determination for a k Variable Model

r²Yj.(all variables except j) = SSR(Xj | all variables except j) / [SST − SSR(all variables) + SSR(Xj | all variables except j)]

This measures the proportion of variation in the dependent variable that is explained by Xj while controlling for (holding constant) the other independent variables.

Using Dummy Variables

A dummy variable is a categorical independent variable with two levels:
- yes or no, on or off, male or female
- coded as 0 or 1
It assumes equal slopes for the other variables. If there are more than two levels, the number of dummy variables needed is (number of levels − 1).
Dummy Variable Example

Let:
Y = pie sales
X1 = price
X2 = holiday (X2 = 1 if a holiday occurred during the week; X2 = 0 if there was no holiday that week)
Dummy Variable Example

Holiday:    Ŷ = b0 + b1 X1 + b2 (1) = (b0 + b2) + b1 X1
No holiday: Ŷ = b0 + b1 X1 + b2 (0) = b0 + b1 X1

The two lines have different intercepts (b0 + b2 vs. b0) but the same slope (b1).

(Figure: Y (sales) plotted against X1 (price), with parallel lines for holiday weeks (X2 = 1) and no-holiday weeks (X2 = 0).)

If H0: β2 = 0 is rejected, then "Holiday" has a significant effect on pie sales.
Dummy Variable Example

Sales = 300 − 30(Price) + 15(Holiday)

where:
Sales = number of pies sold per week
Price = pie price in $
Holiday = 1 if a holiday occurred during the week, 0 if no holiday occurred

b2 = 15: on average, sales were 15 pies greater in weeks with a holiday than in weeks without a holiday, given the same price.
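A small sketch of how the dummy shifts the intercept, using the coefficients from the example above (the helper name predict_sales is ours):

```python
def predict_sales(price, holiday):
    """Weekly pie sales; holiday is 1 if a holiday occurred, else 0."""
    return 300 - 30 * price + 15 * holiday

# At the same price, a holiday week is predicted to sell 15 more pies.
diff = predict_sales(6.00, 1) - predict_sales(6.00, 0)
print(diff)  # 15.0
```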
Dummy Variable Model: More Than Two Levels

The number of dummy variables is one less than the number of levels.

Example:
Y = house price; X1 = square feet

If the style of the house is also thought to matter:
Style = ranch, split level, colonial
Three levels, so two dummy variables are needed.
Dummy Variable Model: More Than Two Levels

Example: let "colonial" be the default category, and let X2 and X3 be used for the other two categories:

Y = house price
X1 = square feet
X2 = 1 if ranch, 0 otherwise
X3 = 1 if split level, 0 otherwise

The multiple regression equation is:

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3

Dummy Variable Model: More Than Two Levels
Consider the regression equation:

Ŷ = 20.43 + 0.045 X1 + 23.53 X2 + 18.84 X3

For a colonial:    X2 = X3 = 0, so Ŷ = 20.43 + 0.045 X1
For a ranch:       X2 = 1, X3 = 0, so Ŷ = 20.43 + 0.045 X1 + 23.53
For a split level: X2 = 0, X3 = 1, so Ŷ = 20.43 + 0.045 X1 + 18.84

With the same square feet, a ranch will have an estimated average price of 23.53 thousand dollars more than a colonial, and a split level will have an estimated average price of 18.84 thousand dollars more than a colonial.

Interaction Between Independent Variables

Hypothesizes interaction between pairs of X variables:
- The response to one X variable may vary at different levels of another X variable
The model contains a two-way cross-product term:

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3
  = b0 + b1 X1 + b2 X2 + b3 (X1 X2)

where X3 = X1 X2 is the interaction term.
2 Effect of Interaction
Y β β X β X β
X X ε 1 1 2 2 3 1 2 Given:
1 measured by β
Without interaction term, effect of X on Y is
1 With interaction term, effect of X on Y is measured
1 by β + β
X
1
3
2 Effect changes as X changes
2
Interaction Example

Suppose X2 is a dummy variable and the estimated regression equation is:

Ŷ = 1 + 2 X1 + 3 X2 + 4 X1 X2

X2 = 1: Ŷ = 1 + 2 X1 + 3(1) + 4 X1 (1) = 4 + 6 X1
X2 = 0: Ŷ = 1 + 2 X1 + 3(0) + 4 X1 (0) = 1 + 2 X1

(Figure: the two lines plotted against X1; the slopes are different because the effect of X1 on Y depends on the value of X2.)
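The changing slope can be seen by evaluating the estimated equation at the two dummy levels (the helper name y_hat is ours):

```python
def y_hat(x1, x2):
    """Estimated regression equation with an interaction term."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Slope in X1 at each level of the dummy X2 (rise over a unit run).
slope_x2_0 = y_hat(1, 0) - y_hat(0, 0)
slope_x2_1 = y_hat(1, 1) - y_hat(0, 1)

print(slope_x2_0, slope_x2_1)  # 2 6
```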
Significance of Interaction Term

- Can perform a partial F-test for the contribution of a variable to see if the addition of an interaction term improves the model
- Multiple interaction terms can be included; use a partial F-test for the simultaneous contribution of multiple variables to the model

Simultaneous Contribution of Independent Variables

Use a partial F-test for the simultaneous contribution of multiple variables to the model. Let m variables be an additional set of variables added simultaneously. To test the hypothesis that the set of m variables improves the model:

F = { [SSR(all) − SSR(all except new set of m variables)] / m } / MSE(all)

(where F has m and n − k − 1 d.f.)
Chapter Summary

In this chapter, we have:
- Developed the multiple regression model
- Tested the significance of the multiple regression model
- Discussed adjusted r²
- Discussed using residual plots to check model assumptions
- Tested portions of the regression model
- Tested individual regression coefficients
- Used dummy variables
- Evaluated interaction effects