9.4 Prediction Intervals

One of the two primary purposes of regression analysis is to enter a value of the predictor variable X into the estimated model to calculate the prediction or forecast, Ŷ, of the value of the response variable Y. Up until this section the calculated value of Ŷ is called a fitted value because it is calculated from the same data from which the model is estimated, the training sample. There is no prediction here in the calculation of Ŷ because the value of the response variable Y is already known for training data. There is nothing to predict. Instead use the neutral term “fitted” value in this context in place of “predicted” value.

fitted value calculation, Section 9.2.2, p. 206

training sample: Sample of data from which the regression model is estimated.

Accomplish true prediction by entering data values for the predictor variables from a new sample of data, the test sample or validation sample, into the model estimated from the training sample. By coincidence some of these new values of the predictor variable may duplicate values from the original data, but in general they will not. At the time the forecast is made the researcher knows the value of the predictor variable, but not the corresponding value of the response variable. The true value will not be known until some later time, when the accuracy of the prediction can be assessed.

test data: Sample of data from which the regression model generates predictions.
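The distinction between fitted values and true predictions can be sketched in base R. This is a minimal illustration that assumes synthetic data generated on the spot (not the Salary data of this chapter) and uses base R's lm and predict rather than the lessR Regression function. The variable names and the 30/10 split are hypothetical choices.

```r
# Synthetic data for illustration only; not the Salary data from the text
set.seed(1)
n <- 40
x <- runif(n, min = 1, max = 24)
y <- 30000 + 3000 * x + rnorm(n, sd = 10000)
dat <- data.frame(x = x, y = y)

# Split the rows into a training sample and a test sample
train_rows <- sample(n, size = 30)
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

# Estimate the model from the training sample only
fit <- lm(y ~ x, data = train)

# Fitted values: the response is already known, so nothing is predicted
fitted_train <- fitted(fit)

# True predictions: enter new predictor values into the estimated model
pred_test <- predict(fit, newdata = test)

# Prediction error can be assessed only once the actual values are known
pred_error <- test$y - pred_test
summary(pred_error)
```

In practice the test values of the response would arrive later; here they are available immediately only because the data are simulated.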

As is true of any statistical result, such as a predicted value, the presence of error confounds the result. Unfortunately, two forms of error underlie a prediction. First, as discussed in the previous section, is the residual from the model, the distance in the training sample of the fitted value of the response variable Y from its actual value. These residuals, the plotted data values scattered about the regression line, indicate a lack of fit of the model. The model does not account for all of the variation in the response variable, so another name for the residuals from the training sample is modeling errors.

modeling error: The residual, the difference of the actual value and fitted value from the original data.

Second, any statistical estimation process, such as the estimation of b0 and b1 for the regression line, also necessarily involves sampling error. For each new sample of paired data values for the response and predictor variables, Y and X, a different set of estimates for the regression coefficients b0 and b1 would be obtained. The regression line randomly fluctuates from sample to sample, and so does the point on the line for any single value of the predictor variable X.
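A small simulation can make this sample-to-sample fluctuation concrete. The sketch below assumes a hypothetical population line and error standard deviation chosen only for illustration, draws many samples, and records the estimated coefficients along with the point on the line at two values of X.

```r
# Hypothetical population model, for illustration only
set.seed(2)
x <- seq(1, 24, length.out = 30)   # predictor values, held fixed across samples
x_mid <- mean(x)                   # a value at the center of the data
x_end <- max(x)                    # a value at the edge of the data

one_sample <- function() {
  # Draw a new sample of responses and re-estimate the line
  y <- 30000 + 3000 * x + rnorm(length(x), sd = 10000)
  b <- unname(coef(lm(y ~ x)))
  c(b0 = b[1], b1 = b[2],
    at_mid = b[1] + b[2] * x_mid,
    at_end = b[1] + b[2] * x_end)
}

# Each row holds the estimates from one simulated sample
sims <- t(replicate(2000, one_sample()))

# Standard deviation of each quantity across samples: the sampling error
sds <- apply(sims, 2, sd)
round(sds, 1)
```

Every column varies across samples, and the point on the line at the edge of the data varies more than the point at the center.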

Prediction necessarily involves new data, which means a new sample, the test sample. A problem here is that the regression model applied to the test sample to obtain the prediction was estimated from the original data, the training sample. So the regression model was optimized by choosing the regression coefficients that resulted in the smallest possible sum of the squared residuals, but only for the training sample, not the test sample.

The consequence is that the encountered level of prediction error for a true forecast of an unknown value of Y cumulates both modeling error and sampling error. The prominence of modeling error is summarized by the indicators of fit already discussed, the standard deviation of the residuals and R², preferably in its adjusted form. The extent of prediction error is larger than indicated by these fit indices because it will consist of the influences of both modeling error and sampling error. The smaller the sample, the more pronounced the effect of the sampling error on the size of the prediction interval.

prediction error: Difference between a prediction on new data and the actual value later obtained.
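The way the two sources of error combine can be written out with the standard textbook result for a regression with one predictor. The symbols are assumptions introduced here, not notation from the text: s_e is the standard deviation of the residuals, n the training sample size, x_0 the value of X for which the forecast is made, s_Ŷ the standard error of the point on the regression line, and s_p the standard error of the forecast.

```latex
% Sampling error: variability of the point on the estimated line at x_0
s_{\hat{Y}}^{2} = s_e^{2}\left[\frac{1}{n}
      + \frac{(x_0-\bar{x})^{2}}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\right]

% Prediction error: modeling error plus sampling error
s_p^{2} = s_e^{2} + s_{\hat{Y}}^{2}

% 95% prediction interval for the value of Y at x_0
\hat{Y} \pm t_{.025,\,n-2}\; s_p
```

Both terms inside the brackets shrink as n grows, which is why the contribution of sampling error is most pronounced in small samples, while the s_e² term for modeling error remains regardless of sample size.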

The concept of prediction error is made practical by providing a 95% interval, the prediction interval, for each predicted value. There is a 95% confidence that the actual value of Y later obtained will be contained within this prediction interval. The size of these intervals is not the same for all values of the predictor variable X. Instead, the closer the value of the predictor variable is to its mean, the smaller the interval. This is because as the regression line fluctuates across samples, values at the extremes of the line vary more than do values in its middle, similar to a teeter-totter where sitting on the end provides much more up and down motion than does sitting further inward.

95% prediction interval: Range of values that with 95% confidence contains the actual value of Y later obtained.
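The dependence of the interval width on the distance from the mean of the predictor can be checked directly. The sketch below uses base R's predict with interval = "prediction" on the built-in cars data set, a stand-in chosen only because it ships with R. It is not the Salary example, and it does not use the lessR Regression function.

```r
# Built-in data set used as a stand-in for illustration
fit <- lm(dist ~ speed, data = cars)

# Three predictor values: smallest, mean, and largest observed speed
new_x <- data.frame(speed = c(min(cars$speed), mean(cars$speed), max(cars$speed)))

# 95% prediction intervals at those values
pint <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

# Width of each interval
width <- unname(pint[, "upr"] - pint[, "lwr"])
cbind(new_x, pint, width)
```

The interval at the mean is the narrowest of the three, and the intervals widen toward either extreme.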

9.4.1 Prediction from Existing Data Values

The Regression function by default provides two different analyses for these prediction intervals. First the function displays the lower and upper bounds of the intervals as part of the standard text output. The intervals by default are sorted from the smallest lower bound of the prediction intervals to the largest lower bound. To leave the rows of data in their original order, specify the pred.sort="off" option. To avoid voluminous output only representative prediction intervals are provided: intervals for the lowest values of the lower bound of the interval, for the middle values of the lower bound, and for the largest values. If the sample is sufficiently small, less than 25, or if the pred.rows option is set to "all", then the prediction intervals for all the rows of data are displayed.

pred.sort="off" option: Do not sort rows of data by the prediction interval lower bounds.

pred.rows option: Number of displayed prediction intervals for first, middle, and last intervals, or set to "all".

Annotated output for the prediction intervals of Salary from Years appears in Figure 9.6. The fitted values and the 95% prediction intervals are highlighted. Also provided are the corresponding data values, the width of each prediction interval, and the 95% confidence intervals of the point on the regression line. To save space in this figure the decimal digits are not displayed, accomplished by setting digits.d=0 in the function call to Regression.

digits.d option, Section 1.3.5, p. 14
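For readers who want to see what such a listing involves, the following is a rough base R analogue, not the lessR implementation. It assumes the built-in cars data as a stand-in, hypothetical column names patterned after the output described above, and a simple rule for choosing the middle rows that may differ from the rule Regression applies.

```r
# Built-in data set used as a stand-in for illustration
fit <- lm(dist ~ speed, data = cars)

# Confidence and prediction intervals for each row of data
ci   <- predict(fit, newdata = cars, interval = "confidence")
pint <- predict(fit, newdata = cars, interval = "prediction")

out <- data.frame(speed = cars$speed, dist = cars$dist,
                  fitted = ci[, "fit"],
                  ci_lwr = ci[, "lwr"], ci_upr = ci[, "upr"],
                  pi_lwr = pint[, "lwr"], pi_upr = pint[, "upr"])
out$pi_wdh <- out$pi_upr - out$pi_lwr

# Sort rows by the lower bound of the prediction interval
out <- out[order(out$pi_lwr), ]

# Keep only the first, middle, and last four rows (hypothetical rule)
n <- nrow(out)
mid <- floor(n / 2)
rows <- c(1:4, (mid - 1):(mid + 2), (n - 3):n)

# Drop the decimal digits to save space
shown <- round(out[rows, ], 0)
shown
```

Omitting the order step leaves the rows in their original order, which corresponds to what the pred.sort="off" option is described as doing.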


                     Years  Salary  fitted  ci:lwr  ci:upr  pi:lwr  pi:upr  pi:wdh
Hamide, Bita             1   41037   35960   28933   42988   11397   60524   49126
Singh, Niral             2   51055   39210   32747   45673   14803   63617   48815
Korhalkar, Jessica       2   62502   39210   32747   45673   14803   63617   48815
Anastasiou, Crystal      2   46508   39210   32747   45673   14803   63617   48815

... for the middle 4 rows of sorted data ...
                     Years  Salary  fitted  ci:lwr  ci:upr  pi:lwr  pi:upr  pi:wdh
Kimball, Claire          8   51357   58707   54668   62747   34827   82588   47761
Saechao, Suzanne         8   45545   58707   54668   62747   34827   82588   47761
Tian, Fang               9   61084   61957   58025   65889   38094   85819   47725
Stanley, Grayson         9   59625   61957   58025   65889   38094   85819   47725

... for the last 4 rows of sorted data ...
                     Years  Salary  fitted  ci:lwr  ci:upr  pi:lwr  pi:upr  pi:wdh
Skrotzki, Sara          18   81352   91203   84046   98359   66603  115803   49200
James, Leslie           18  112563   91203   84046   98359   66603  115803   49200
Correll, Trevon         21  124419  100951   91978  109925   75763  126140   50378
Capelle, Adam           24   98138  110700   99813  121587   84768  136633   51865

Figure 9.6 Annotated 95% prediction intervals for representative rows of data.

The confidence intervals in Figure 9.6 reflect the sampling error, the variation of the corresponding point on the regression line from sample to sample. Larger sampling errors contribute to larger prediction errors. Assuming normality, the 95% range of the residuals provides the extent of the modeling error, a value already reported as $47,073 in Listing 9.1. Because the prediction intervals reflect both modeling and sampling error, they are larger than the corresponding 95% range of the residuals. From Figure 9.6, the smallest prediction interval is $47,725 wide, from $38,094 to $85,819 for 9 years of employment. The largest prediction interval is $51,865 wide, from $84,768 to $136,633 for 24 years of employment.
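The reported widths can be checked with a little arithmetic. The sketch below assumes that the 95% range of the residuals is computed with the same t cutoff as the intervals, in which case the squared width of a prediction interval is the squared range of the residuals (modeling error) plus the squared width of the confidence interval (sampling error). The text does not state this relation explicitly, but the numbers in Figure 9.6 are consistent with it.

```r
# Values copied from the text and Figure 9.6
resid_range <- 47073              # 95% range of the residuals (modeling error)
ci_wdh_9    <- 65889 - 58025      # confidence interval width at 9 years
ci_wdh_24   <- 121587 - 99813     # confidence interval width at 24 years

# Combine modeling error and sampling error on the squared scale
pi_wdh_9  <- sqrt(resid_range^2 + ci_wdh_9^2)
pi_wdh_24 <- sqrt(resid_range^2 + ci_wdh_24^2)

round(c(pi_wdh_9, pi_wdh_24))
```

The two results agree to within a dollar with the reported widths of $47,725 and $51,865. Because the confidence interval is much narrower than the range of the residuals, most of the width of each prediction interval comes from modeling error.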

The second type of result the Regression function provides for prediction intervals is an enhanced scatter plot that illustrates the size of the prediction intervals. This scatter plot, in Figure 9.7, also contains the regression line, the confidence intervals that reflect variability of the regression line, and the wider prediction intervals. The two (slightly) curved lines that bound the prediction intervals define many such intervals: the lower and upper bound of the interval for each value of the predictor variable, Years.

As is frequently encountered in the estimation of regression models, the prediction intervals are wide. Precise prediction is not easy. A larger sample will reduce the effect of the sampling error, but the effect of the modeling error can only be reduced by improving the model, such as adding new predictor variables, the topic of the following chapter.

9.4.2 Prediction from Specified Data Values

The prediction intervals provided by the Regression function are for each row of the data table, the existing values of the predictor variable. As noted, prediction occurs from new values of the predictor variable, which generally do not equal the existing values. There also needs to be a way to obtain these prediction intervals for specified new values.


[Scatter plot of Salary (vertical axis) against Years (horizontal axis), with the upper and lower boundaries of the prediction intervals annotated.]

Figure 9.7 Annotated scatter plot with prediction intervals of Salary, and the regression line and the confidence intervals for the variability of the regression line.

Scenario: Obtain predictions for new values of the predictor variable

The data include employees who have worked at the company for each of 1 to 10 years. The interval from 10 to 20 years, however, contains gaps such as for 12 years and 16 years, for which the default analysis does not provide predictions. Generate a list of predictions for all integer values of Years from 10 to 20. Then a prospective employee can be provided an estimate of the Salary at the company after any specified number of Years employed.

The Regression function provides an option X1.new for listing specified values of the predictor variable from which to obtain a prediction of the response variable and associated interval. The X1 refers to the first predictor variable, which is the only predictor variable in this example. Specify the range of values of the predictor variable as 10:20. Or, invoke the c function to specify a more customized list of values.

X1.new option: Obtain predictions for specified values of the predictor variable.

c function, Section 1.3.6, p. 15

lessR Input  Prediction intervals for specified values of the predictor variable

> Regression(Salary ~ Years, X1.new=10:20)

The resulting predictions and associated 95% prediction intervals appear in Listing 9.2 .


Years Salary fitted ci:lwr ci:upr pi:lwr pi:upr pi:wdh

Listing 9.2 Specified predictions and prediction intervals for Salary for values of Years employed from 10 to 20.

The remainder of the Regression output is identical to what is obtained without the X1.new option. The only distinction is the section for the prediction intervals. This section now analyzes the new, specified values of the predictor variables. The value for the response variable in this section is blank because it is not yet known.
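The same idea can be sketched outside of lessR with base R's predict function, which accepts new predictor values through its newdata argument. The example assumes the built-in cars data as a stand-in for the Salary data, so the values 10:20 here are speeds rather than Years, and the customized list of values is hypothetical.

```r
# Built-in data set used as a stand-in for illustration
fit <- lm(dist ~ speed, data = cars)

# A range of new predictor values, analogous to X1.new=10:20
new_range  <- data.frame(speed = 10:20)
pred_range <- predict(fit, newdata = new_range, interval = "prediction")

# A customized list of new values built with the c function
new_custom  <- data.frame(speed = c(12, 16, 22.5))
pred_custom <- predict(fit, newdata = new_custom, interval = "prediction")

# No response column appears: the actual values are not yet known
cbind(new_range, pred_range)
cbind(new_custom, pred_custom)
```

Each row of the result pairs a specified predictor value with its prediction and the lower and upper bounds of the 95% prediction interval.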