
11.5 Inferences Concerning the Regression Coefficients

Aside from merely estimating the linear relationship between x and Y for purposes of prediction, the experimenter may also be interested in drawing certain inferences about the slope and intercept. In order to allow for the testing of hypotheses and the construction of confidence intervals on β_0 and β_1, one must be willing to make the further assumption that each ε_i, i = 1, 2, ..., n, is normally distributed. This assumption implies that Y_1, Y_2, ..., Y_n are also normally distributed, each with probability distribution n(y_i; β_0 + β_1 x_i, σ).

From Section 11.4 we know that B_1 follows a normal distribution. It turns out that under the normality assumption, a result very much analogous to that given in Theorem 8.4 allows us to conclude that (n − 2)S^2/σ^2 is a chi-squared variable with n − 2 degrees of freedom, independent of the random variable B_1. Theorem 8.5 then assures us that the statistic

$$ T = \frac{(B_1 - \beta_1)/(\sigma/\sqrt{S_{xx}})}{\sqrt{S^2/\sigma^2}} = \frac{B_1 - \beta_1}{S/\sqrt{S_{xx}}} $$

has a t-distribution with n − 2 degrees of freedom. The statistic T can be used to construct a 100(1 − α)% confidence interval for the coefficient β_1.

Confidence Interval

A 100(1 − α)% confidence interval for the parameter β_1 in the regression line μ_{Y|x} = β_0 + β_1 x is

$$ b_1 - t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}} < \beta_1 < b_1 + t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}, $$

where t_{α/2} is a value of the t-distribution with n − 2 degrees of freedom.

Example 11.2: Find a 95% confidence interval for β_1 in the regression line μ_{Y|x} = β_0 + β_1 x, based on the pollution data of Table 11.1.

Solution: From the results given in Example 11.1 we find that S_xx = 4152.18 and S_xy = 3752.09. In addition, we find that S_yy = 3713.88. Recall that b_1 = 0.903643. Hence,

$$ s^2 = \frac{S_{yy} - b_1 S_{xy}}{n - 2} = \frac{3713.88 - (0.903643)(3752.09)}{31} = 10.4299. $$

Therefore, taking the square root, we obtain s = 3.2295. Using Table A.4, we find t_{0.025} ≈ 2.045 for 31 degrees of freedom. Therefore, a 95% confidence interval for β_1 is

$$ 0.903643 - \frac{(2.045)(3.2295)}{\sqrt{4152.18}} < \beta_1 < 0.903643 + \frac{(2.045)(3.2295)}{\sqrt{4152.18}}, $$

which simplifies to

0.8012 < β_1 < 1.0061.
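The interval arithmetic above is easy to reproduce from the summary statistics alone. The short Python sketch below is ours, not part of the text; it assumes only the quantities S_xx, S_xy, S_yy, b_1, and n quoted in Examples 11.1 and 11.2, and scipy supplies the t critical value, which differs slightly from the tabled 2.045 used in the text.

```python
from scipy import stats

# Summary statistics quoted in Examples 11.1 and 11.2
n, Sxx, Sxy, Syy, b1 = 33, 4152.18, 3752.09, 3713.88, 0.903643

# s^2 = (Syy - b1*Sxy) / (n - 2), the estimate of sigma^2
s = ((Syy - b1 * Sxy) / (n - 2)) ** 0.5        # approximately 3.2295

# 95% confidence interval for beta_1
t_crit = stats.t.ppf(0.975, df=n - 2)          # approximately 2.04 (the text uses the tabled 2.045)
half_width = t_crit * s / Sxx ** 0.5
print(f"{b1 - half_width:.4f} < beta_1 < {b1 + half_width:.4f}")
# agrees with 0.8012 < beta_1 < 1.0061 to about three decimal places
```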


  Hypothesis Testing on the Slope

To test the null hypothesis H_0 that β_1 = β_10 against a suitable alternative, we again use the t-distribution with n − 2 degrees of freedom to establish a critical region and then base our decision on the value of

$$ t = \frac{b_1 - \beta_{10}}{s/\sqrt{S_{xx}}}. $$

The method is illustrated by the following example.

Example 11.3: Using the estimated value b_1 = 0.903643 of Example 11.1, test the hypothesis that β_1 = 1.0 against the alternative that β_1 < 1.0.

Solution: The hypotheses are H_0: β_1 = 1.0 and H_1: β_1 < 1.0. So

$$ t = \frac{0.903643 - 1.0}{3.2295/\sqrt{4152.18}} = -1.92, $$

with n − 2 = 31 degrees of freedom (P ≈ 0.03).

Decision: The t-value is significant at the 0.03 level, suggesting strong evidence that β_1 < 1.0.
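The same one-sided test can be carried out from the summary statistics; the following is a minimal sketch (hypothetical variable names, not part of the original example).

```python
from scipy import stats

# Summary statistics from Examples 11.1 and 11.2
n, Sxx, b1, s = 33, 4152.18, 0.903643, 3.2295
beta10 = 1.0                               # hypothesized slope under H0

se_b1 = s / Sxx ** 0.5                     # standard error of b1, about 0.0501
t = (b1 - beta10) / se_b1                  # about -1.92
p_value = stats.t.cdf(t, df=n - 2)         # lower-tail P-value for H1: beta_1 < 1.0
print(f"t = {t:.2f}, P = {p_value:.3f}")   # P is roughly 0.03
```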

One important t-test on the slope is the test of the hypothesis

H_0: β_1 = 0 versus H_1: β_1 ≠ 0.

When the null hypothesis is not rejected, the conclusion is that there is no significant linear relationship between E(Y) and the independent variable x. The plot of the data for Example 11.1 would suggest that a linear relationship exists. However, in some applications in which σ^2 is large and thus considerable "noise" is present in the data, a plot, while useful, may not produce clear information for the researcher. Rejection of H_0 above implies that a significant linear regression exists.

Figure 11.7 displays a MINITAB printout showing the t-test for

H_0: β_1 = 0 versus H_1: β_1 ≠ 0,

for the data of Example 11.1. Note the regression coefficient (Coef), standard error (SE Coef), t-value (T), and P-value (P). The null hypothesis is rejected. Clearly, there is a significant linear relationship between mean chemical oxygen demand reduction and solids reduction. Note that the t-statistic is computed as

$$ t = \frac{\text{coefficient}}{\text{standard error}} = \frac{b_1}{s/\sqrt{S_{xx}}}. $$

The failure to reject H_0: β_1 = 0 suggests that there is no linear relationship between Y and x. Figure 11.8 is an illustration of the implication of this result. It may mean that changing x has little impact on changes in Y, as seen in (a). However, it may also indicate that the true relationship is nonlinear, as indicated by (b).

When H_0: β_1 = 0 is rejected, there is an implication that the linear term in x residing in the model explains a significant portion of variability in Y. The two plots in Figure 11.9 illustrate possible scenarios.


Regression Analysis: COD versus Per_Red

The regression equation is
COD = 3.83 + 0.904 Per_Red

Predictor      Coef   SE Coef      T      P
Constant      3.830     1.768   2.17  0.038
Per_Red     0.90364   0.05012  18.03  0.000

S = 3.22954   R-Sq = 91.3%   R-Sq(adj) = 91.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  3390.6  3390.6  325.08  0.000
Residual Error  31   323.3    10.4
Total           32  3713.9

Figure 11.7: MINITAB printout for t-test for data of Example 11.1.
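Readers working in Python rather than MINITAB can obtain an equivalent summary with the statsmodels package. The sketch below is ours, not part of the original text; the two arrays are our transcription of the 33 observations of Table 11.1 (not reprinted in this section), and they are consistent with the summary statistics quoted in the examples (for instance, the sum of the squared x-values is 41,086 and S_xx = 4152.18).

```python
import statsmodels.api as sm

# Table 11.1: solids reduction x (%) and chemical oxygen demand reduction y (%)
x = [3, 7, 11, 15, 18, 27, 29, 30, 30, 31, 31, 32, 33, 33, 34, 36, 36,
     36, 37, 38, 39, 39, 39, 40, 41, 42, 42, 43, 44, 45, 46, 47, 50]
y = [5, 11, 21, 16, 16, 28, 27, 25, 35, 30, 40, 32, 34, 32, 34, 37, 38,
     34, 36, 38, 37, 36, 45, 39, 41, 40, 44, 37, 44, 46, 46, 49, 51]

X = sm.add_constant(x)            # design matrix with an intercept column
model = sm.OLS(y, X).fit()        # ordinary least squares fit
print(model.summary())            # coefficients, SE Coef, t, P, R-squared, ANOVA quantities
```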

Figure 11.8: The hypothesis H_0: β_1 = 0 is not rejected.

As depicted in Figure 11.9(a), rejection of H_0 may suggest that the relationship is, indeed, linear. As indicated in (b), it may suggest that while the model does contain a linear effect, a better representation may be found by including a polynomial (perhaps quadratic) term (i.e., terms that supplement the linear term).

Figure 11.9: The hypothesis H_0: β_1 = 0 is rejected.
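For situations like Figure 11.9(b), the straight-line fit can be supplemented with a quadratic term. The sketch below is illustrative only: it uses simulated data (not from the text) and numpy's polynomial fitting to show that the extra term reduces the error sum of squares when curvature is present.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)  # curved relationship

linear = np.polyfit(x, y, deg=1)                        # straight-line fit
quadratic = np.polyfit(x, y, deg=2)                     # adds the x^2 term

sse_linear = np.sum((y - np.polyval(linear, x)) ** 2)
sse_quadratic = np.sum((y - np.polyval(quadratic, x)) ** 2)
print(sse_linear, sse_quadratic)                        # the quadratic fit has the smaller SSE
```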

  Statistical Inference on the Intercept

Confidence intervals and hypothesis testing on the coefficient β_0 may be established from the fact that B_0 is also normally distributed. It is not difficult to show that

$$ T = \frac{B_0 - \beta_0}{S\sqrt{\sum_{i=1}^{n} x_i^2/(n S_{xx})}} $$

has a t-distribution with n − 2 degrees of freedom, from which we may construct a 100(1 − α)% confidence interval for β_0.

Confidence Interval

A 100(1 − α)% confidence interval for the parameter β_0 in the regression line μ_{Y|x} = β_0 + β_1 x is

$$ b_0 - t_{\alpha/2}\,\frac{s}{\sqrt{n S_{xx}}}\sqrt{\sum_{i=1}^{n} x_i^2} < \beta_0 < b_0 + t_{\alpha/2}\,\frac{s}{\sqrt{n S_{xx}}}\sqrt{\sum_{i=1}^{n} x_i^2}, $$

where t_{α/2} is a value of the t-distribution with n − 2 degrees of freedom.

Example 11.4: Find a 95% confidence interval for β_0 in the regression line μ_{Y|x} = β_0 + β_1 x, based on the data of Table 11.1.

Solution: In Examples 11.1 and 11.2, we found that

S_xx = 4152.18 and s = 3.2295.

From Example 11.1 we had

$$ \sum_{i=1}^{n} x_i^2 = 41{,}086 $$

and b_0 = 3.829633. Using Table A.4, we find t_{0.025} ≈ 2.045 for 31 degrees of freedom. Therefore, a 95% confidence interval for β_0 is

$$ 3.829633 - \frac{(2.045)(3.2295)\sqrt{41{,}086}}{\sqrt{(33)(4152.18)}} < \beta_0 < 3.829633 + \frac{(2.045)(3.2295)\sqrt{41{,}086}}{\sqrt{(33)(4152.18)}}, $$

which simplifies to 0.2132 < β_0 < 7.4461.
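Example 11.4 can be reproduced in the same way from the summary quantities; a minimal sketch (variable names ours) follows.

```python
from scipy import stats

# Summary quantities from Examples 11.1, 11.2, and 11.4
n, Sxx, sum_x2 = 33, 4152.18, 41086
b0, s = 3.829633, 3.2295

t_crit = stats.t.ppf(0.975, df=n - 2)          # approximately 2.04 (the text uses the tabled 2.045)
se_b0 = s * (sum_x2 / (n * Sxx)) ** 0.5        # standard error of b0, about 1.77
half_width = t_crit * se_b0
print(f"{b0 - half_width:.4f} < beta_0 < {b0 + half_width:.4f}")
# close to the text's 0.2132 < beta_0 < 7.4461, which uses the tabled critical value
```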


To test the null hypothesis H_0 that β_0 = β_00 against a suitable alternative, we can use the t-distribution with n − 2 degrees of freedom to establish a critical region and then base our decision on the value of

$$ t = \frac{b_0 - \beta_{00}}{s\sqrt{\sum_{i=1}^{n} x_i^2/(n S_{xx})}}. $$

Example 11.5: Using the estimated value b_0 = 3.829633 of Example 11.1, test the hypothesis that β_0 = 0 at the 0.05 level of significance against the alternative that β_0 ≠ 0.

Solution: The hypotheses are H_0: β_0 = 0 and H_1: β_0 ≠ 0. So

$$ t = \frac{3.829633 - 0}{3.2295\sqrt{41{,}086/[(33)(4152.18)]}} = 2.17, $$

with 31 degrees of freedom. Thus the P-value ≈ 0.038, and we conclude that β_0 ≠ 0. Note that this t-value is merely Coef/SE Coef, as we see in the MINITAB printout in Figure 11.7. The SE Coef is the standard error of the estimated intercept.
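The same standard error yields the test statistic of Example 11.5; a short sketch (names ours):

```python
from scipy import stats

n, Sxx, sum_x2 = 33, 4152.18, 41086
b0, s = 3.829633, 3.2295

se_b0 = s * (sum_x2 / (n * Sxx)) ** 0.5        # the SE Coef of the intercept, about 1.77
t = b0 / se_b0                                 # about 2.17
p_value = 2 * stats.t.sf(abs(t), df=n - 2)     # two-sided P-value, about 0.038
print(f"t = {t:.2f}, P = {p_value:.3f}")
```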

  A Measure of Quality of Fit: Coefficient of Determination

Note in Figure 11.7 that an item denoted by R-Sq is given with a value of 91.3%. This quantity, R^2, is called the coefficient of determination. This quantity is a measure of the proportion of variability explained by the fitted model. In Section 11.8, we shall introduce the notion of an analysis-of-variance approach to hypothesis testing in regression. The analysis-of-variance approach makes use of the error sum of squares

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

and the total corrected sum of squares

$$ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2. $$

The latter represents the variation in the response values that ideally would be explained by the model. The SSE value is the variation due to error, or variation unexplained. Clearly, if SSE = 0, all variation is explained. The quantity that represents variation explained is SST − SSE.

The R^2 is

$$ \text{Coefficient of determination:}\qquad R^2 = 1 - \frac{SSE}{SST}. $$

Note that if the fit is perfect, all residuals are zero, and thus R^2 = 1.0. But if SSE is only slightly smaller than SST, R^2 ≈ 0. Note from the printout in Figure 11.7 that the coefficient of determination suggests that the model fit to the data explains 91.3% of the variability observed in the response, the reduction in chemical oxygen demand.
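The 91.3% figure follows directly from the sums of squares already computed in Example 11.2; a short sketch, using the identity SSE = S_yy − b_1 S_xy for the straight-line fit:

```python
# Sums of squares from Examples 11.1 and 11.2
Syy, Sxy, b1 = 3713.88, 3752.09, 0.903643

SST = Syy                        # total corrected sum of squares
SSE = Syy - b1 * Sxy             # error sum of squares, about 323.3
R2 = 1 - SSE / SST
print(f"R-squared = {R2:.3f}")   # about 0.913, i.e., 91.3%
```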

Figure 11.10 provides an illustration of a good fit (R^2 ≈ 1.0) in plot (a) and a poor fit (R^2 ≈ 0) in plot (b).

Figure 11.10: Plots depicting a very good fit and a poor fit.

Pitfalls in the Use of R^2

Analysts quote values of R^2 quite often, perhaps due to its simplicity. However, there are pitfalls in its interpretation. The reliability of R^2 is a function of the size of the regression data set and the type of application. Clearly, 0 ≤ R^2 ≤ 1, and the upper bound is achieved when the fit to the data is perfect (i.e., all of the residuals are zero). What is an acceptable value for R^2? This is a difficult question to answer. A chemist, charged with doing a linear calibration of a high-precision piece of equipment, certainly expects to experience a very high R^2-value (perhaps exceeding 0.99), while a behavioral scientist, dealing in data impacted by variability in human behavior, may feel fortunate to experience an R^2 as large as 0.70. An experienced model fitter senses when a value is large enough, given the situation confronted. Clearly, some scientific phenomena lend themselves to modeling with more precision than others.

The R^2 criterion is dangerous to use for comparing competing models for the same data set. Adding additional terms to the model (e.g., an additional regressor) decreases SSE and thus increases R^2 (or at least does not decrease it). This implies that R^2 can be made artificially high by an unwise practice of overfitting (i.e., the inclusion of too many model terms). Thus, the inevitable increase in R^2 enjoyed by adding an additional term does not imply the additional term was needed. In fact, the simple model may be superior for predicting response values. The role of overfitting and its influence on prediction capability will be discussed at length in Chapter 12 as we visit the notion of models involving more than a single regressor. Suffice it to say at this point that one should not subscribe to a model selection process that solely involves the consideration of R^2.
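The monotone behavior described above is easy to demonstrate numerically. The sketch below uses simulated data (not from the text) and ordinary least squares via numpy to show that appending even a pure-noise regressor never lowers R^2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 33
x = rng.uniform(0, 50, size=n)
y = 4.0 + 0.9 * x + rng.normal(scale=3.0, size=n)    # a true straight-line relationship

def r_squared(design, response):
    """R^2 from a least-squares fit of the response on the given design matrix."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    sse = np.sum((response - design @ beta) ** 2)
    sst = np.sum((response - response.mean()) ** 2)
    return 1 - sse / sst

X1 = np.column_stack([np.ones(n), x])                # intercept + x
X2 = np.column_stack([X1, rng.normal(size=n)])       # add a regressor of pure noise
print(r_squared(X1, y) <= r_squared(X2, y))          # True: R^2 cannot decrease
```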