
11.5 Inferences Concerning the Regression Coefficients

Aside from merely estimating the linear relationship between x and Y for purposes of prediction, the experimenter may also be interested in drawing certain inferences about the slope and intercept. In order to allow for the testing of hypotheses and the construction of confidence intervals on β0 and β1, one must be willing to make the further assumption that each ε_i, i = 1, 2, . . . , n, is normally distributed. This assumption implies that Y1, Y2, . . . , Yn are also normally distributed, each with probability distribution n(y_i; β0 + β1 x_i, σ).

From Section 11.4 we know that B1 follows a normal distribution. It turns out that under the normality assumption, a result very much analogous to that given in Theorem 8.4 allows us to conclude that (n − 2)S²/σ² is a chi-squared variable with n − 2 degrees of freedom, independent of the random variable B1. Theorem 8.5 then assures us that the statistic

T = [(B1 − β1)/(σ/√Sxx)] / (S/σ) = (B1 − β1)/(S/√Sxx)

has a t-distribution with n − 2 degrees of freedom. The statistic T can be used to construct a 100(1 − α)% confidence interval for the coefficient β1.

Confidence Interval

A 100(1 − α)% confidence interval for the parameter β1 in the regression line μ_Y|x = β0 + β1 x is

b1 − t α/2 s/√Sxx < β1 < b1 + t α/2 s/√Sxx,

where t α/2 is a value of the t-distribution with n − 2 degrees of freedom.

Example 11.2: Find a 95% confidence interval for β1 in the regression line μ_Y|x = β0 + β1 x, based on the pollution data of Table 11.1.
Solution: From the results given in Example 11.1 we find that Sxx = 4152.18 and Sxy = 3752.09. In addition, we find that Syy = 3713.88. Recall that b1 = 0.903643. Hence,

s² = (Syy − b1 Sxy)/(n − 2) = [3713.88 − (0.903643)(3752.09)]/31 = 10.4299.

Therefore, taking the square root, we obtain s = 3.2295. Using Table A.4, we find t 0.025 ≈ 2.045 for 31 degrees of freedom. Therefore, a 95% confidence interval for β1 is

0.903643 − (2.045)(3.2295)/√4152.18 < β1 < 0.903643 + (2.045)(3.2295)/√4152.18,

which simplifies to

0.8012 < β1 < 1.0061.
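The interval above can be checked numerically from the summary statistics alone. The following Python sketch does so; the variable names are ours, and scipy.stats is used in place of Table A.4 for the t critical value, so the endpoints differ slightly from the rounded hand calculation.

```python
from scipy import stats

# Summary statistics quoted in Examples 11.1 and 11.2 (n = 33 observations)
n, Sxx, Sxy, Syy = 33, 4152.18, 3752.09, 3713.88
b1 = Sxy / Sxx                              # least squares slope, ~0.903643

# Estimate of sigma from the residual sum of squares
s = ((Syy - b1 * Sxy) / (n - 2)) ** 0.5     # s ~ 3.2295

# 95% confidence interval for beta_1
t_crit = stats.t.ppf(0.975, df=n - 2)       # ~2.040; the text uses 2.045 from Table A.4
half_width = t_crit * s / Sxx ** 0.5
print(f"{b1 - half_width:.4f} < beta_1 < {b1 + half_width:.4f}")
# about 0.801 < beta_1 < 1.006, in agreement with the interval above
```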


Hypothesis Testing on the Slope

To test the null hypothesis H0 that β1 = β10 against a suitable alternative, we again use the t-distribution with n − 2 degrees of freedom to establish a critical region and then base our decision on the value of

t = (b1 − β10)/(s/√Sxx).

The method is illustrated by the following example.

Example 11.3: Using the estimated value b1 = 0.903643 of Example 11.1, test the hypothesis that β1 = 1.0 against the alternative that β1 < 1.0.
Solution: The hypotheses are H0: β1 = 1.0 and H1: β1 < 1.0. So

t = (0.903643 − 1.0)/(3.2295/√4152.18) = −1.92,

with n − 2 = 31 degrees of freedom (P ≈ 0.03).
Decision: The t-value is significant at the 0.03 level, suggesting strong evidence that β1 < 1.0.
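A minimal sketch of the same one-sided test, under the same assumptions as the previous snippet (our variable names; scipy.stats supplies the lower-tail P-value that Table A.4 only approximates):

```python
from scipy import stats

n, Sxx = 33, 4152.18
b1, s = 0.903643, 3.2295
beta10 = 1.0                                # hypothesized slope under H0

se_b1 = s / Sxx ** 0.5                      # estimated standard error of b1, ~0.0501
t_value = (b1 - beta10) / se_b1             # ~ -1.92
p_value = stats.t.cdf(t_value, df=n - 2)    # lower-tail P-value for H1: beta_1 < 1.0
print(f"t = {t_value:.2f}, P = {p_value:.3f}")   # t = -1.92, P ~ 0.03
```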

One important t-test on the slope is the test of the hypothesis

H0: β1 = 0 versus H1: β1 ≠ 0.

When the null hypothesis is not rejected, the conclusion is that there is no significant linear relationship between E(Y) and the independent variable x. The plot of the data for Example 11.1 would suggest that a linear relationship exists. However, in some applications in which σ² is large and thus considerable “noise” is present in the data, a plot, while useful, may not produce clear information for the researcher. Rejection of H0 above implies that a significant linear regression exists. Figure 11.7 displays a MINITAB printout showing the t-test for

H0: β1 = 0 versus H1: β1 ≠ 0

for the data of Example 11.1. Note the regression coefficient (Coef), standard error (SE Coef), t-value (T), and P -value (P). The null hypothesis is rejected. Clearly, there is a significant linear relationship between mean chemical oxygen demand reduction and solids reduction. Note that the t-statistic is computed as

t = coefficient / standard error = b1 / (s/√Sxx).

The failure to reject H 0 :β 1 = 0 suggests that there is no linear relationship between Y and x. Figure 11.8 is an illustration of the implication of this result. It may mean that changing x has little impact on changes in Y , as seen in (a). However, it may also indicate that the true relationship is nonlinear, as indicated by (b).

When H0: β1 = 0 is rejected, there is an implication that the linear term in x residing in the model explains a significant portion of variability in Y.


Regression Analysis: COD versus Per_Red

The regression equation is
COD = 3.83 + 0.904 Per_Red

Predictor      Coef   SE Coef      T      P
Constant      3.830     1.768   2.17  0.038
Per_Red     0.90364   0.05012  18.03  0.000

S = 3.2295   R-Sq = 91.3%   R-Sq(adj) = 91.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  3390.6  3390.6  325.08  0.000
Residual Error  31   323.3    10.4
Total           32  3713.9

Figure 11.7: MINITAB printout for t-test for data of Example 11.1.
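The Per_Red row and the R-Sq value in Figure 11.7 can be reproduced from the summary statistics of Example 11.1. The sketch below is ours (hypothetical variable names); with the raw data of Table 11.1 in hand, a regression routine such as OLS in the statsmodels package would print an equivalent table directly.

```python
from scipy import stats

# Summary statistics from Example 11.1
n, Sxx, Sxy, Syy = 33, 4152.18, 3752.09, 3713.88
b1 = Sxy / Sxx                              # Coef for Per_Red
SSE = Syy - b1 * Sxy                        # residual sum of squares, ~323.3
s = (SSE / (n - 2)) ** 0.5                  # S in the printout, ~3.2295

se_b1 = s / Sxx ** 0.5                      # SE Coef, ~0.0501
t_b1 = b1 / se_b1                           # T for H0: beta_1 = 0, ~18.0
p_b1 = 2 * stats.t.sf(abs(t_b1), df=n - 2)  # two-sided P-value, ~0.000
r_sq = 1 - SSE / Syy                        # R-Sq, ~0.913

print(f"Coef = {b1:.5f}, SE Coef = {se_b1:.5f}, T = {t_b1:.2f}, "
      f"P = {p_b1:.3f}, R-Sq = {r_sq:.1%}")
```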

Figure 11.8: The hypothesis H0: β1 = 0 is not rejected (panels (a) and (b); see text).

The two plots in Figure 11.9 illustrate possible scenarios. As depicted in (a) of the figure, rejection of H0 may suggest that the relationship is, indeed, linear. As indicated in (b), it may suggest that while the model does contain a linear effect, a better representation may be found by including a polynomial (perhaps quadratic) term (i.e., terms that supplement the linear term).

Statistical Inference on the Intercept

Confidence intervals and hypothesis testing on the coefficient β 0 may be established from the fact that B 0 is also normally distributed. It is not difficult to show that

T = (B0 − β0) / [S·√(Σ x_i²/(n Sxx))]


Figure 11.9: The hypothesis H0: β1 = 0 is rejected (panels (a) and (b); see text).

has a t-distribution with n − 2 degrees of freedom, from which we may construct a 100(1 − α)% confidence interval for β0.

Confidence Interval

A 100(1 − α)% confidence interval for the parameter β0 in the regression line μ_Y|x = β0 + β1 x is

b0 − t α/2 (s/√(n Sxx))·√(Σ x_i²) < β0 < b0 + t α/2 (s/√(n Sxx))·√(Σ x_i²),

where t α/2 is a value of the t-distribution with n − 2 degrees of freedom.

Example 11.4: Find a 95% confidence interval for β0 in the regression line μ_Y|x = β0 + β1 x, based on the data of Table 11.1.
Solution: In Examples 11.1 and 11.2, we found that

Sxx = 4152.18 and s = 3.2295.

From Example 11.1 we had

Σ x_i² = 41,086.

Recall from Example 11.1 that b0 = 3.829633. Using Table A.4, we find t 0.025 ≈ 2.045 for 31 degrees of freedom. Therefore, a 95% confidence interval for β0 is

3.829633 − (2.045)(3.2295)√41,086 / √[(33)(4152.18)] < β0 < 3.829633 + (2.045)(3.2295)√41,086 / √[(33)(4152.18)],

which simplifies to 0.2132 < β0 < 7.4461.
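As before, the interval can be verified numerically; this is a sketch with our own variable names, using the exact t critical value from scipy.stats rather than the rounded Table A.4 entry, so the endpoints differ slightly from those above.

```python
from scipy import stats

n, Sxx = 33, 4152.18
b0, s = 3.829633, 3.2295
sum_x2 = 41_086                              # sum of the squared x_i from Example 11.1

se_b0 = s * (sum_x2 / (n * Sxx)) ** 0.5      # estimated standard error of b0, ~1.768
t_crit = stats.t.ppf(0.975, df=n - 2)        # ~2.040; the text uses 2.045 from Table A.4
print(f"{b0 - t_crit * se_b0:.4f} < beta_0 < {b0 + t_crit * se_b0:.4f}")
# roughly 0.22 < beta_0 < 7.44, close to the 0.2132 and 7.4461 obtained above
```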

To test the null hypothesis H0 that β0 = β00 against a suitable alternative, we can use the t-distribution with n − 2 degrees of freedom to establish a critical region and then base our decision on the value of

t = (b0 − β00) / [s·√(Σ x_i²/(n Sxx))].

Example 11.5: Using the estimated value b0 = 3.829633 of Example 11.1, test the hypothesis that β0 = 0 at the 0.05 level of significance against the alternative that β0 ≠ 0.
Solution: The hypotheses are H0: β0 = 0 and H1: β0 ≠ 0. So

t = (3.829633 − 0) / [(3.2295)√(41,086/[(33)(4152.18)])] = 2.17,

with 31 degrees of freedom. Thus, P = P-value ≈ 0.038, and we conclude that β0 ≠ 0. Note that this is merely the coefficient divided by its standard error, as shown in Figure 11.7. The SE Coef is the standard error of the estimated intercept.
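A short check of Example 11.5 under the same assumptions (our variable names; scipy.stats for the two-sided P-value):

```python
from scipy import stats

n, Sxx, sum_x2 = 33, 4152.18, 41_086
b0, s = 3.829633, 3.2295

se_b0 = s * (sum_x2 / (n * Sxx)) ** 0.5           # ~1.768, the SE Coef of the Constant row
t_value = b0 / se_b0                              # ~2.17
p_value = 2 * stats.t.sf(abs(t_value), df=n - 2)  # two-sided P-value, ~0.038
print(f"t = {t_value:.2f}, P = {p_value:.3f}")
```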

A Measure of Quality of Fit: Coefficient of Determination

Note in Figure 11.7 that an item denoted by R-Sq is given with a value of 91.3%. This quantity, R², is called the coefficient of determination. It is a measure of the proportion of variability explained by the fitted model. In Section 11.8, we shall introduce the notion of an analysis-of-variance approach to hypothesis testing in regression. The analysis-of-variance approach makes use of the error sum of squares SSE = Σ (y_i − ŷ_i)² and the total corrected sum of squares SST = Σ (y_i − ȳ)², with both sums taken over i = 1, . . . , n. The latter represents the variation in the response values that ideally would be explained by the model. The SSE value is the variation due to error, or variation unexplained. Clearly, if SSE = 0, all variation is explained. The quantity that represents variation explained is SST − SSE. The coefficient of determination is given by

Coeff. of determination: R² = 1 − SSE/SST.

Note that if the fit is perfect, all residuals are zero, and thus R 2 = 1.0. But if SSE is only slightly smaller than SST , R 2 ≈ 0. Note from the printout in Figure 11.7 that the coefficient of determination suggests that the model fit to the data explains 91.3% of the variability observed in the response, the reduction in chemical oxygen demand.
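As a quick arithmetic check, R² for these data follows directly from the two sums of squares quoted earlier (a sketch, not part of the original example):

```python
SSE, SST = 323.3, 3713.88   # error and total corrected sums of squares for Example 11.1
r_squared = 1 - SSE / SST   # ~0.913, the 91.3% reported as R-Sq in Figure 11.7
print(f"R^2 = {r_squared:.3f}")
```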

Figure 11.10 provides an illustration of a good fit (R² ≈ 1.0) in plot (a) and a poor fit (R² ≈ 0) in plot (b).

Pitfalls in the Use of R 2

Analysts quote values of R 2 quite often, perhaps due to its simplicity. However, there are pitfalls in its interpretation. The reliability of R 2 is a function of the


Figure 11.10: Plots depicting a very good fit, (a) R² ≈ 1.0, and a poor fit, (b) R² ≈ 0.

size of the regression data set and the type of application. Clearly, 0 ≤ R 2 ≤1 and the upper bound is achieved when the fit to the data is perfect (i.e., all of the residuals are zero). What is an acceptable value for R 2 ? This is a difficult question to answer. A chemist, charged with doing a linear calibration of a high- precision piece of equipment, certainly expects to experience a very high R 2 -value (perhaps exceeding 0.99), while a behavioral scientist, dealing in data impacted by variability in human behavior, may feel fortunate to experience an R 2 as large as 0.70. An experienced model fitter senses when a value is large enough, given the situation confronted. Clearly, some scientific phenomena lend themselves to modeling with more precision than others.

The R² criterion is dangerous to use for comparing competing models for the same data set. Adding additional terms to the model (e.g., an additional regressor) decreases SSE and thus increases R² (or at least does not decrease it). This implies that R² can be made artificially high by an unwise practice of overfitting (i.e., the inclusion of too many model terms). Thus, the inevitable increase in R² enjoyed by adding an additional term does not imply the additional term was needed. In fact, the simple model may be superior for predicting response values. The role of overfitting and its influence on prediction capability will be discussed at length in Chapter 12 as we visit the notion of models involving more than a single regressor. Suffice it to say at this point that one should not subscribe to a model selection process that solely involves the consideration of R².
