
11.10 Data Plots and Transformations

In this chapter, we deal with building regression models where there is one independent, or regressor, variable. In addition, we are assuming, through model formulation, that both x and y enter the model in a linear fashion. Often it is advisable to work with an alternative model in which either x or y (or both) enters in a nonlinear way. A transformation of the data may be indicated because of theoretical considerations inherent in the scientific study, or a simple plotting of the data may suggest the need to reexpress the variables in the model. The need to perform a transformation is rather simple to diagnose in the case of simple linear regression because two-dimensional plots give a true pictorial display of how each variable enters the model.

A model in which x or y is transformed should not be viewed as a nonlinear regression model. We normally refer to a regression model as linear when it is linear in the parameters. In other words, suppose the complexion of the data or other scientific information suggests that we should regress y* against x*, where each is a transformation on the natural variables x and y. Then the model of the form

$y_i^* = \beta_0 + \beta_1 x_i^* + \epsilon_i$

is a linear model since it is linear in the parameters $\beta_0$ and $\beta_1$. The material given in Sections 11.2 through 11.9 remains intact, with $y_i^*$ and $x_i^*$ replacing $y_i$ and $x_i$.

A simple and useful example is the log-log model $\log y_i = \beta_0 + \beta_1 \log x_i + \epsilon_i$.

Although this model is not linear in x and y, it is linear in the parameters and is thus treated as a linear model. On the other hand, an example of a truly nonlinear model is

$y_i = \beta_0 + \beta_1 x_i^{\beta_2} + \epsilon_i,$

where the parameter $\beta_2$ (as well as $\beta_0$ and $\beta_1$) is to be estimated. The model is not linear in $\beta_2$.
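Fitting such a model requires iterative nonlinear least squares rather than the closed-form formulas of Sections 11.2 through 11.9. The following is a minimal sketch of this contrast, assuming SciPy is available and using simulated data (not from the text):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, b0, b1, b2):
    # y = b0 + b1 * x**b2; nonlinear in the parameter b2
    return b0 + b1 * np.power(x, b2)

# Simulated data for illustration only
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 40)
y = 2.0 + 3.0 * x**1.5 + rng.normal(0.0, 1.0, x.size)

# Iterative least squares, starting from the guess p0
popt, pcov = curve_fit(model, x, y, p0=[1.0, 1.0, 1.0])
print(popt)  # estimates of (b0, b1, b2), roughly (2, 3, 1.5)
```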

Many transformations can enhance the fit and predictability of a model. For a thorough discussion of transformations, the reader is referred to Myers (1990, see the Bibliography). We choose here to indicate a few of them and to show the appearance of the graphs that serve as a diagnostic tool. Consider Table 11.6, which gives several functions describing relationships between $y$ and $x$ that can produce a linear regression through the transformations indicated.

In addition, for the sake of completeness, the reader is given the dependent and independent variables to use in the resulting simple linear regression. Figure 11.19 depicts the functions listed in Table 11.6. These serve as a guide for the analyst in choosing a transformation from the observation of the plot of y against x.

Table 11.6: Some Useful Transformations to Linearize

Functional Form                                    Proper Transformation              Form of Simple Linear Regression
Exponential: $y = \beta_0 e^{\beta_1 x}$           $y^* = \ln y$                      Regress $y^*$ against $x$
Power: $y = \beta_0 x^{\beta_1}$                   $y^* = \log y$; $x^* = \log x$     Regress $y^*$ against $x^*$
Reciprocal: $y = \beta_0 + \beta_1 (1/x)$          $x^* = 1/x$                        Regress $y$ against $x^*$
Hyperbolic: $y = \frac{x}{\beta_0 + \beta_1 x}$    $y^* = 1/y$; $x^* = 1/x$           Regress $y^*$ against $x^*$
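In practice, one can simply try each transformation in Table 11.6 and see which yields the most nearly linear relationship. The following is a minimal sketch, assuming NumPy and using made-up data that roughly follow $y = e^x$ (so the exponential row should score best):

```python
import numpy as np

# Made-up data, roughly y = e^x (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4])

# (x*, y*) pairs implied by each row of Table 11.6
transforms = {
    "exponential": (x, np.log(y)),          # regress ln y on x
    "power":       (np.log(x), np.log(y)),  # regress log y on log x
    "reciprocal":  (1.0 / x, y),            # regress y on 1/x
    "hyperbolic":  (1.0 / x, 1.0 / y),      # regress 1/y on 1/x
}
for name, (xs, ys) in transforms.items():
    r = np.corrcoef(xs, ys)[0, 1]  # sample correlation of transformed pairs
    print(f"{name:12s} |r| = {abs(r):.4f}")
```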

[Figure 11.19: Diagrams depicting the functions listed in Table 11.6: (a) exponential function, (b) power function, (c) reciprocal function, (d) hyperbolic function.]

What Are the Implications of a Transformed Model?

The foregoing is intended as an aid for the analyst when it is apparent that a transformation will provide an improvement. However, before we provide an example, two important points should be made. The first one revolves around the formal writing of the model when the data are transformed. Quite often the analyst does not think about this. He or she merely performs the transformation without any concern about the model form before and after the transformation. The exponential model serves as a good illustration. The model in the natural (untransformed) variables that produces an additive error model in the transformed variables is given by

$y_i = \beta_0 e^{\beta_1 x_i} \cdot \epsilon_i,$

which is a multiplicative error model. Clearly, taking logs produces

$\ln y_i = \ln \beta_0 + \beta_1 x_i + \ln \epsilon_i.$

As a result, it is on $\ln \epsilon_i$ that the basic assumptions are made. The purpose of this presentation is merely to remind the reader that one should not view a transformation as simply an algebraic manipulation with an error added. Often a model in the transformed variables that has a proper additive error structure is the result of a model in the natural variables with a different type of error structure.

The second important point deals with the notion of measures of improvement. Obvious measures of comparison are, of course, $R^2$ and the residual mean square, $s^2$. (Other measures of performance used to compare competing models are given in Chapter 12.) Now, if the response $y$ is not transformed, then clearly $s^2$ and $R^2$ can be used in measuring the utility of the transformation. The residuals will be in the same units for both the transformed and the untransformed models. But when $y$ is transformed, performance criteria for the transformed model should be based on values of the residuals in the metric of the untransformed response so that the comparisons that are made are proper. The example that follows provides an illustration.
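As a schematic illustration of this point, the sketch below (assuming NumPy, with made-up data) fits both an untransformed straight line and an exponential model via $\ln y$, and compares the residual sums of squares in the original $y$ metric:

```python
import numpy as np

# Made-up data exhibiting roughly exponential growth (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 5.2, 8.7, 15.1, 24.8, 41.9])

# Model 1: untransformed straight line y = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
sse_line = np.sum((y - (b0 + b1 * x)) ** 2)

# Model 2: exponential fit via the transformed response ln y,
# with residuals computed back in the original y metric
c1, c0 = np.polyfit(x, np.log(y), 1)
sse_exp = np.sum((y - np.exp(c0 + c1 * x)) ** 2)

print(sse_line, sse_exp)  # both in the same (untransformed) units
```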

Example 11.9: The pressure P of a gas corresponding to various volumes V is recorded, and the data are given in Table 11.7.

Table 11.7: Data for Example 11.9

V (cm³):     50     60     70     90     100
P (kg/cm²):  64.7   51.3   40.5   25.9   7.8

The ideal gas law is given by the functional form $PV^\gamma = C$, where $\gamma$ and $C$ are constants. Estimate the constants $C$ and $\gamma$.

Solution: Let us take natural logs of both sides of the model

$P_i V_i^\gamma = C \cdot \epsilon_i, \quad i = 1, 2, \ldots, 5.$

As a result, a linear model can be written

$\ln P_i = \ln C - \gamma \ln V_i + \epsilon_i^*, \quad i = 1, 2, \ldots, 5,$

where $\epsilon_i^* = \ln \epsilon_i$. The following represents results of the simple linear regression:

Intercept: $\widehat{\ln C} = 14.7589$, so $\hat{C} = 2{,}568{,}862.88$; Slope: $\hat{\gamma} = 2.65347221$.

The following information is taken from the regression analysis.

[Table of $\ln P_i$, $\ln V_i$, the fitted values $\ln \hat{P}_i$, and the residuals $e_i = P_i - \hat{P}_i$ for each observation.]
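The estimates above can be reproduced in a few lines of code. The following is a minimal sketch, assuming NumPy, which fits $\ln P$ against $\ln V$ and then computes the residuals $e_i = P_i - \hat{P}_i$ in the metric of the untransformed response, as recommended earlier:

```python
import numpy as np

# Data from Table 11.7
V = np.array([50.0, 60.0, 70.0, 90.0, 100.0])  # volume, cm^3
P = np.array([64.7, 51.3, 40.5, 25.9, 7.8])    # pressure, kg/cm^2

# Least-squares fit of ln P = ln C - gamma * ln V
slope, intercept = np.polyfit(np.log(V), np.log(P), 1)
gamma_hat = -slope            # approximately 2.6535
C_hat = np.exp(intercept)     # approximately 2.57e6

# Residuals in the original (untransformed) pressure metric
P_hat = C_hat * V ** (-gamma_hat)
e = P - P_hat
print(gamma_hat, C_hat, e)
```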

It is instructive to plot the data and the regression equation. Figure 11.20 shows a plot of the data in the untransformed pressure and volume and the curve representing the regression equation.

Figure 11.20: Pressure and volume data and fitted regression.

Diagnostic Plots of Residuals: Graphical Detection of Violation of Assumptions

Plots of the raw data can be extremely helpful in determining the nature of the model that should be fit to the data when there is a single independent variable. We have attempted to illustrate this in the foregoing. Detection of proper model form is, however, not the only benefit gained from diagnostic plotting. As in much of the material associated with significance testing in Chapter 10, plotting methods can illustrate and detect violations of assumptions. The reader should recall that much of what is illustrated in this chapter requires assumptions made on the model errors, the $\epsilon_i$. In fact, we assume that the $\epsilon_i$ are independent $N(0, \sigma)$ random variables. Now, of course, the $\epsilon_i$ are not observed. However, the residuals $e_i = y_i - \hat{y}_i$ are the errors in the fit of the regression line and thus serve to mimic the $\epsilon_i$. Thus, the general complexion of these residuals can often highlight difficulties. Ideally, of course, the plot of the residuals is as depicted in Figure 11.21. That is, they should truly show random fluctuations around a value of zero.
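Such a residual plot is straightforward to produce. The following is a minimal sketch, assuming NumPy and Matplotlib, using simulated data that satisfy the model assumptions, so the plot should resemble the ideal pattern of Figure 11.21:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data satisfying the model assumptions (illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, x.size)

# Fit the simple linear regression and form residuals e_i = y_i - y_hat_i
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
residuals = y - y_hat

# Residuals plotted against fitted values; ideally a random band about zero
plt.scatter(y_hat, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()
```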

Nonhomogeneous Variance

Homogeneous variance is an important assumption made in regression analysis. Violations can often be detected through the appearance of the residual plot. Increasing error variance with an increase in the regressor variable is a common condition in scientific data. Large error variance produces large residuals, and hence a residual plot like the one in Figure 11.22 is a signal of nonhomogeneous variance. More discussion regarding these residual plots and information regarding different types of residuals appear in Chapter 12, where we deal with multiple linear regression.

[Figure 11.21: Ideal residual plot (residuals plotted against $\hat{y}$).]

[Figure 11.22: Residual plot depicting heterogeneous error variance (residuals plotted against $\hat{y}$).]

Normal Probability Plotting

The assumption that the model errors are normal is made when the data analyst deals with either hypothesis testing or confidence interval estimation. Again, the numerical counterparts to the $\epsilon_i$, namely the residuals, are the subjects of diagnostic plotting to detect any extreme violations. In Chapter 8, we introduced normal quantile-quantile plots and briefly discussed normal probability plots. These plots on residuals are illustrated in the case study introduced in the next section.
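As a final sketch (assuming SciPy and Matplotlib, with simulated residuals standing in for those from a real fit), a normal probability plot of the residuals can be produced as follows; points falling close to a straight line support the normality assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated residuals from a fitted regression (illustration only)
rng = np.random.default_rng(2)
residuals = rng.normal(0.0, 1.0, 50)

# Normal probability (quantile-quantile) plot of the residuals;
# approximate linearity supports the normality assumption
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```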