9.3 Residuals and Model Fit

The points in the scatter plot, which represent the paired data values, tend not to lie on the regression line that summarizes the relationship between the specified variables. The vertical distance that separates each plotted point from the regression line is a key attribute of the analysis. This distance, from the fitted value on the regression line to the corresponding data value of the response variable, is the $i$th residual, denoted $e_i$.

residual: Given the value $X_i$, the difference between the data value $Y_i$ and the fitted value $\hat{Y}_i$.

Residual of the $i$th value of X: $e_i = Y_i - \hat{Y}_i$

For example, consider the data in the Employee data set for the 16th row of data, the data for Laura Kralik, who has worked for the company for 10 years and earns $Y_{16} = \$82{,}681.19$.

As previously computed (Figure 9.2, p. 206), the fitted value of Salary for $X = 10$ years of employment is $\hat{Y}_{16} = \$65{,}206.42$. From these values the 16th residual can be calculated, that is, for $i = 16$.

Example residual: $e_{16} = Y_{16} - \hat{Y}_{16} = 82681.19 - 65206.42 = 17474.77$

In this example her actual Salary is $17,474.77 larger than the fitted value, as illustrated in Figure 9.3.

[Figure: scatter plot of Salary against Years, annotating one data value, its fitted value on the regression line, and the residual between them.]

Figure 9.3 Example of a data value, fitted value, and associated residual.
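This arithmetic is straightforward to reproduce. The following is a minimal sketch in base R with the lm() function, assuming the Employee data reside in a data frame named d with variables Salary and Years (names assumed for illustration); the book's Regression function produces the same fitted values.

```r
# Fit the regression and recover the 16th fitted value and residual
# (sketch; assumes a data frame d with variables Salary and Years)
fit <- lm(Salary ~ Years, data=d)

y.hat.16 <- fitted(fit)[16]     # fitted Salary for 10 years, about 65206.42
d$Salary[16] - y.hat.16         # residual, about 17474.77
residuals(fit)[16]              # the same residual, computed by lm
```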

least-squares estimation: Regression coefficients minimize the sum of squared residuals.

The residual is the basis for how the model is estimated, that is, how the regression coefficients are computed. According to the widely used least-squares estimation procedure, the estimated coefficients $b_0 = \$32{,}710.90$ and $b_1 = \$3{,}249.55$ are, from all possible pairs of numbers, the pair that minimizes the sum of the squared residuals for the data set entered into the regression analysis.

least-squares estimation: Choose $b_0$ and $b_1$ to minimize the sum of squared residuals, $\sum e_i^2$.
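For a single predictor the minimizing pair has a closed-form solution. As a worked check, the following sketch computes the coefficients directly from the standard least-squares formulas, again assuming the Employee data in a data frame d:

```r
# Closed-form least-squares estimates for one predictor
# (sketch; assumes a data frame d with variables Salary and Years)
x <- d$Years;  y <- d$Salary

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

c(b0, b1)     # about 32710.90 and 3249.55, the values in the text
```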

Calculate the residual for each pair of data values in the data set analyzed by the regression program. Then square each residual and sum the squared values. For the least-squares estimates, the result is the smallest possible such sum. In the Regression output this value appears in the Analysis of Variance section, shown in Figure 9.4.

Analysis of Variance

                df           Sum Sq          Mean Sq   F-value   p-value
Years            1  12107157290.292  12107157290.292    90.265    0.0000
Residuals       34   4560399502.217    134129397.124

Figure 9.4 Annotated analysis of variance output for the regression model, with the sum of the squared residuals.

The sum of the squared residuals can be read directly from the output.

Sum of squared residuals: $\sum e_i^2 = 4560399502.217$
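With a fitted model object, this value can be verified directly. A minimal sketch with base R's lm(), under the same assumed data frame d as before:

```r
# Verify the sum of squared residuals against the ANOVA output
# (sketch; assumes a data frame d with variables Salary and Years)
fit <- lm(Salary ~ Years, data=d)

sum(residuals(fit)^2)               # about 4560399502.217

# The same value appears in the Residuals row of the ANOVA table
anova(fit)["Residuals", "Sum Sq"]
```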

The resulting value is large, but its scale reflects the unit of analysis, a single dollar, applied to salaries in the tens of thousands of dollars. The hypothesis test in Figure 9.4 is redundant with the hypothesis test already discussed for the slope coefficient of Years.

Least-squares estimation ensures that the sum of squared residuals has been minimized, but a related issue is how good that minimum is. Is there much or little scatter about the regression line? With too much scatter the line poorly summarizes the relationship between X and Y, even if it is the best line for the given data.

9.3.1 Standard Deviation of the Residuals

One method of assessing fit is to calculate the standard deviation of the residuals. A small standard deviation indicates good fit; a large value, poor fit. This standard deviation is $11,581 as reported in the next section of the Regression output, called Model Fit, shown in Listing 9.1. Assuming the residuals are normally distributed, a range of about two standard deviations on either side of their mean, which is zero, contains about 95% of the values of the distribution. The Regression function also reports the size of that range. Most of the values about the regression line vary across the considerable span of about $47,000.
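Both quantities follow from values already reported. A minimal worked sketch, taking the residual sum of squares and degrees of freedom from Figure 9.4:

```r
# Standard deviation of the residuals and the approximate 95% range,
# from the ANOVA values in Figure 9.4
sse      <- 4560399502.217
df.resid <- 34

s.e <- sqrt(sse / df.resid)      # 11581.42, as in Listing 9.1

# With normal residuals, about 95% lie within t-cutoff standard
# deviations of their mean of zero
t.cut <- qt(0.975, df=df.resid)  # 2.032
2 * t.cut * s.e                  # about 47072.57, the 95% range
```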

9.3.2 $R^2$ Fit Statistic

null model: Regression model with no predictor variables.

The other primary fit statistic for a regression analysis is $R^2$, here reported as 0.726 in Listing 9.1. This statistic compares the scatter about the regression line for two different models: the current model and the null model. The null model is the model without X, or any other variable, as a predictor. Prediction is still possible even without a predictor variable.


Standard deviation of residuals: 11581.42 for 34 degrees of freedom
If normal, the approximate 95% range of residuals about each fitted
  value is 2*t-cutoff*11581.42, with a 95% interval t-cutoff of 2.032
95% range of variation: 47072.57

R-squared: 0.726
Adjusted R-squared: 0.718

F-statistic for null hypothesis that population R-squared=0: 90.2648
Degrees of freedom: 1 and 34
p-value: 0.0000

Listing 9.1 Fit indices.

The fitted value of Y for each value of X without X in the model is just the mean of all the values of Y, $\bar{Y}$. The null regression line and two of the corresponding residuals are illustrated in Figure 9.5. As can be seen by comparing Figure 9.3 with Figure 9.5, the residuals from the actual model are considerably smaller than those from the null model.

[Figure: scatter plot of Salary against Years with the horizontal null regression line at $\bar{Y}$ and two residuals marked.]

Figure 9.5 Scatter plot with the null regression line and two illustrated residuals from that line.
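The null model is itself a regression model, fit in base R with an intercept-only formula. A minimal sketch, under the same assumed data frame d:

```r
# Null model: intercept only, so every fitted value is the mean of Y
# (sketch; assumes a data frame d with variable Salary)
null.fit <- lm(Salary ~ 1, data=d)

unique(fitted(null.fit))      # a single value: mean(d$Salary)

# Null-model residuals are deviations from the mean, so their sum of
# squares is the total sum of squares of the response
sst <- sum(residuals(null.fit)^2)
all.equal(sst, sum((d$Salary - mean(d$Salary))^2))   # TRUE
```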

The $R^2$ statistic assesses how much using X in the model reduces the amount of scatter about the $\hat{Y}$ generated for each value of X, the extent of the residuals, compared to the amount of scatter that results without X, the scatter about $\bar{Y}$. An additional interpretation follows from the meaning of the residuals of the data values of Y from the null model. The sum of squared residuals from $\bar{Y}$ is the basis for the variance, and its square root the standard deviation, of the response variable Y, which describes the total variability of Y. So $R^2$ is also referred to as the percent of variance in the response variable accounted for by the predictor variable. Values of $R^2$ above 0.5 or 0.6 are generally considered rather high. Many published analyses have an $R^2$ of about 0.3 or below. The corresponding hypothesis test is of the null hypothesis, $H_0$, that the population $R^2$ is zero. This test is usually significant, as is the case in this example.

Test of population $R^2 = 0$: $p\text{-value} < \alpha = 0.05$, so reject $H_0$
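The 0.726 value can be recovered directly from the two sums of squares in Figure 9.4. A minimal worked sketch:

```r
# R-squared as the proportion of total variability in Salary accounted
# for by the model, from the sums of squares in Figure 9.4
ss.model <- 12107157290.292   # sum of squares for Years
sse      <- 4560399502.217    # sum of squared residuals
sst      <- ss.model + sse    # total sum of squares about the mean

1 - sse/sst                   # R-squared: 0.726, as in Listing 9.1
```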


The sample $R^2 = 0.726$ is large. Further, the null hypothesis of a zero population value is rejected. So conclude that the size and extent of the residuals for this sample are reduced by adding the predictor variable to the model.

Unfortunately, the size of $R^2$ does not straightforwardly reflect the improvement in predictive accuracy obtained by including the predictor variable(s) in the model. $R^2$, as well as the standard deviation of the residuals, is a descriptive statistic. These statistics describe properties of the sample but do not indicate performance of the model in new samples. That is, these fit statistics do not account for sampling error. A high $R^2$ is a goal, but is not sufficient to indicate a useful model for prediction.

The issue of sampling error is most salient for small samples. With $R^2$ there is the additional consideration that adding predictor variables to a model necessarily increases $R^2$ relative to the sample size. This bias increases to the extent that if the number of predictors equals the sample size, then $R^2 = 1.0$. Particularly for a small sample size with a relatively large number of predictor variables, the estimation procedure minimizes the sum of squared residuals by taking advantage of random variation that will not replicate in another sample.

To account for this upward bias, an adjustment is needed that explicitly accounts for the increase of $R^2$ as the number of predictor variables increases. This companion statistic is the adjusted $R^2$, or $R^2_{adj}$, an improvement on the original $R^2$ that should always be reported in conjunction with $R^2$. The distinction is that $R^2_{adj}$ adjusts for the size of the sample compared to the number of predictor variables. The adjustment divides each of the two sums of squares in the definition of the statistic by its corresponding degrees of freedom. The result is that $R^2_{adj}$ provides a downward adjustment and a more realistic assessment of the comparison of the proposed model to the null model. In very small samples the value of $R^2_{adj}$ may be considerably reduced from the value of $R^2$. In larger samples $R^2_{adj}$ will still be smaller, but usually not by much.
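The adjustment can be verified from the Figure 9.4 sums of squares: 34 residual degrees of freedom and, with n = 36 employees, 35 total degrees of freedom. A minimal worked sketch:

```r
# Adjusted R-squared divides each sum of squares by its degrees of
# freedom before forming the ratio (values from Figure 9.4)
sse <- 4560399502.217
sst <- 12107157290.292 + 4560399502.217

1 - sse/sst              # R-squared: 0.726
1 - (sse/34)/(sst/35)    # adjusted R-squared: 0.718, as in Listing 9.1
```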