11.8 Analysis-of-Variance Approach

Often the problem of analyzing the quality of the estimated regression line is handled by an analysis-of-variance (ANOVA) approach: a procedure whereby the total variation in the dependent variable is subdivided into meaningful components that are then observed and treated in a systematic fashion. The analysis of variance, discussed in Chapter 13, is a powerful resource that is used for many applications.

Suppose that we have $n$ experimental data points in the usual form $(x_i, y_i)$ and that the regression line is estimated. In our estimation of $\sigma^2$ in Section 11.4, we established the identity

$$S_{yy} = b_1 S_{xy} + SSE.$$

An alternative and perhaps more informative formulation is

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

We have achieved a partitioning of the total corrected sum of squares of $y$ into two components that should convey particular meaning to the experimenter. We shall indicate this partitioning symbolically as

$$SST = SSR + SSE.$$

The first component on the right, SSR, is called the regression sum of squares,

and it reflects the amount of variation in the y-values explained by the model, in this case the postulated straight line. The second component is the familiar error sum of squares, which reflects variation about the regression line.
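As a sketch of this partition, the following Python snippet fits a least-squares line to made-up illustrative data (not the values of Table 11.1) and verifies numerically that the total corrected sum of squares splits into the regression and error components:

```python
import numpy as np

# Hypothetical (x, y) data for illustration only -- not Table 11.1.
x = np.array([3.0, 7.0, 11.0, 15.0, 18.0, 27.0, 29.0, 30.0])
y = np.array([5.0, 11.0, 21.0, 16.0, 16.0, 28.0, 27.0, 25.0])

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx                         # least-squares slope
b0 = y.mean() - b1 * x.mean()          # least-squares intercept
y_hat = b0 + b1 * x                    # fitted values

SST = np.sum((y - y.mean()) ** 2)      # total corrected sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares

assert np.isclose(SST, SSR + SSE)      # the partition SST = SSR + SSE
```

The identity holds exactly (up to rounding) for any least-squares fit, whatever the data.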

Suppose that we are interested in testing the hypothesis

$$H_0\colon \beta_1 = 0 \quad \text{versus} \quad H_1\colon \beta_1 \neq 0,$$

where the null hypothesis says essentially that the model is $\mu_{Y|x} = \beta_0$. That is, the variation in $Y$ results from chance or random fluctuations which are independent of the values of $x$. This condition is reflected in Figure 11.10(b). Under the conditions

of this null hypothesis, it can be shown that $SSR/\sigma^2$ and $SSE/\sigma^2$ are values of independent chi-squared variables with 1 and $n-2$ degrees of freedom, respectively, and then by Theorem 7.12 it follows that $SST/\sigma^2$ is also a value of a chi-squared variable with $n-1$ degrees of freedom. To test the hypothesis above, we compute

$$f = \frac{SSR}{SSE/(n-2)} = \frac{SSR}{s^2}$$

and reject $H_0$ at the $\alpha$-level of significance when $f > f_\alpha(1, n-2)$. The computations are usually summarized by means of an analysis-of-variance table, as in Table 11.2. It is customary to refer to the various sums of squares divided by their respective degrees of freedom as the mean squares.

Table 11.2: Analysis of Variance for Testing $\beta_1 = 0$

Source of     Sum of     Degrees of    Mean                    Computed
Variation     Squares    Freedom       Square                  f
Regression    SSR        1             SSR                     SSR/s²
Error         SSE        n − 2         s² = SSE/(n − 2)
Total         SST        n − 1

When the null hypothesis is rejected, that is, when the computed F-statistic exceeds the critical value $f_\alpha(1, n-2)$, we conclude that there is a significant amount of variation in the response accounted for by the postulated model, the straight-line function. If the F-statistic falls in the fail-to-reject region, we conclude that the data do not provide sufficient evidence to support the postulated model.
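A minimal sketch of this F-test in Python, using hypothetical data and scipy's F distribution for the critical value $f_\alpha(1, n-2)$:

```python
import numpy as np
from scipy.stats import f as f_dist

# Hypothetical, strongly linear data for illustration -- not Table 11.1.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0])
y = np.array([3.1, 5.0, 6.8, 9.3, 10.9, 13.2, 14.8])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
y_hat = y.mean() + b1 * (x - x.mean())   # fitted line through (x_bar, y_bar)

SSR = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares
SSE = np.sum((y - y_hat) ** 2)           # error sum of squares

s2 = SSE / (n - 2)                       # mean square error
f_stat = SSR / s2                        # computed f with 1 and n-2 df

alpha = 0.05
f_crit = f_dist.ppf(1.0 - alpha, 1, n - 2)
reject_H0 = f_stat > f_crit              # reject H0: beta1 = 0 when f > f_alpha
```

With data this close to a straight line, the computed f far exceeds the critical value and $H_0\colon \beta_1 = 0$ is rejected.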

In Section 11.5, a procedure was given whereby the statistic

$$T = \frac{b_1 - \beta_{10}}{s/\sqrt{S_{xx}}}$$

is used to test the hypothesis

$$H_0\colon \beta_1 = \beta_{10} \quad \text{versus} \quad H_1\colon \beta_1 \neq \beta_{10},$$

where $T$ follows the t-distribution with $n-2$ degrees of freedom. The hypothesis is rejected if $|t| > t_{\alpha/2}$ for an $\alpha$-level of significance. It is interesting to note that

in the special case in which we are testing

$$H_0\colon \beta_1 = 0 \quad \text{versus} \quad H_1\colon \beta_1 \neq 0,$$

the value of our T-statistic becomes

$$t = \frac{b_1}{s/\sqrt{S_{xx}}},$$

and the hypothesis under consideration is identical to that being tested in Table 11.2. Namely, the null hypothesis states that the variation in the response is due merely to chance. The analysis of variance uses the F-distribution rather than the t-distribution. For the two-sided alternative, the two approaches are identical. This we can see by writing

$$t^2 = \frac{b_1^2 S_{xx}}{s^2} = \frac{b_1 S_{xy}}{s^2} = \frac{SSR}{s^2},$$

which is identical to the f-value used in the analysis of variance. The basic relationship between the t-distribution with $v$ degrees of freedom and the F-distribution with 1 and $v$ degrees of freedom is

$$t^2 = f(1, v).$$

Of course, the t-test allows for testing against a one-sided alternative while the F-test is restricted to testing against a two-sided alternative.
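The equivalence $t^2 = f$ for the two-sided test can be checked numerically. The sketch below, with hypothetical data, computes both statistics from the same fit and confirms they agree:

```python
import numpy as np

# Hypothetical data (not Table 11.1) illustrating t^2 = f for H0: beta1 = 0.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
y_hat = y.mean() + b1 * (x - x.mean())

SSE = np.sum((y - y_hat) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
s = np.sqrt(SSE / (n - 2))               # residual standard error

t_stat = b1 / (s / np.sqrt(Sxx))         # t-statistic for H0: beta1 = 0
f_stat = SSR / (SSE / (n - 2))           # f-statistic from the ANOVA table

assert np.isclose(t_stat ** 2, f_stat)   # t^2 = f(1, n-2)
```

The agreement is exact up to rounding, since $SSR = b_1 S_{xy} = b_1^2 S_{xx}$ makes the two statistics algebraically identical.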

Annotated Computer Printout for Simple Linear Regression

Consider again the chemical oxygen demand reduction data of Table 11.1. Figures 11.14 and 11.15 show more complete annotated computer printouts, again produced with MINITAB software. The t-ratio column indicates tests of the null hypotheses that the individual parameters are zero. The term "Fit" denotes the $\hat{y}$-values, often called fitted values. The term "SE Fit" is used in computing confidence intervals on the mean response. The item $R^2$ is computed as $(SSR/SST) \times 100$ and signifies the proportion of variation in $y$ explained by the straight-line regression. Also shown are confidence intervals on the mean response and prediction intervals on a new observation.
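The $R^2$ figure reported on such printouts can be reproduced directly from the sum-of-squares partition. A short sketch, again with hypothetical data rather than the Table 11.1 values:

```python
import numpy as np

# Hypothetical data for illustration -- not Table 11.1.
x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([2.0, 4.5, 6.0, 9.1, 10.5])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
y_hat = y.mean() + b1 * (x - x.mean())

SST = np.sum((y - y.mean()) ** 2)        # total corrected sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares

r_squared_pct = 100.0 * SSR / SST        # R^2 = (SSR/SST) x 100, in percent
```

Since $0 \le SSR \le SST$, the value always lies between 0 and 100, with values near 100 indicating that the straight line explains most of the variation in $y$.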