12.13 Potential Misconceptions and Hazards; Relationship to Material in Other Chapters

There are several procedures discussed in this chapter for use in the “attempt” to find the best model. However, one of the most important misconceptions under which naïve scientists or engineers labor is that there is a true linear model and that it can be found. In most scientific phenomena, relationships between scientific variables are nonlinear in nature and the true model is unknown. Linear statistical models are empirical approximations.

At times, the choice of the model to be adopted may depend on what information needs to be derived from the model. Is it to be used for prediction? Is it to be used for the purpose of explaining the role of each regressor? This “choice” can be made difficult in the presence of collinearity. It is true that for many regression problems there are multiple models that are very similar in performance. See the Myers reference (1990) for details.
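The effect of collinearity on model choice can be illustrated with a small simulation. The following is a minimal sketch under assumed conditions, not a procedure from this chapter: the data are synthetic, the helper `fit_ols`, the seed, and the coefficient values are hypothetical, and the second regressor is constructed as a near copy of the first.

```python
import numpy as np

def fit_ols(X, y):
    """Return (coefficients, R^2) for a least squares fit of y on X with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return beta, r2

rng = np.random.default_rng(42)          # arbitrary seed, for reproducibility only
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # assumed: x2 is nearly a copy of x1
y = 1.0 + x1 + x2 + rng.normal(size=n)   # assumed true relationship

corr = np.corrcoef(x1, x2)[0, 1]
beta_1, r2_1 = fit_ols(x1, y)                              # x1 only
beta_2, r2_2 = fit_ols(x2, y)                              # x2 only
beta_12, r2_12 = fit_ols(np.column_stack([x1, x2]), y)     # both regressors

print(f"corr(x1, x2) = {corr:.4f}")
print(f"R^2: x1 only {r2_1:.4f}, x2 only {r2_2:.4f}, both {r2_12:.4f}")
print("coefficients with both regressors:", np.round(beta_12, 2))
```

In a setup like this the three candidate models fit almost equally well, so the data give little basis for preferring one for prediction, while the individual coefficients in the two-regressor model are poorly determined and say little about the separate role of each regressor.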

One of the most damaging misuses of the material in this chapter is to assign too much importance to R² in the choice of the so-called best model. It is important to remember that for any data set, one can obtain an R² as large as one desires, within the constraint 0 ≤ R² ≤ 1. Too much attention to R² often leads to overfitting.
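The claim that R² can be pushed as high as one likes can be checked numerically. The sketch below is illustrative only: the data are simulated, the helper `r_squared` and all constants are assumptions, and the extra regressors are pure noise with no relationship to the response.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least squares fit of y on X (an intercept column is added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    return 1.0 - sse / sst

rng = np.random.default_rng(0)             # arbitrary seed, for reproducibility only
n = 20
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)    # assumed true relationship

noise = rng.normal(size=(n, n - 2))        # regressors unrelated to y

# R^2 after adding k = 0, 1, ..., n - 2 noise regressors to the model in x.
r2_path = [
    r_squared(np.column_stack([x[:, None], noise[:, :k]]), y)
    for k in range(n - 1)
]

for k in (0, 5, 10, n - 2):
    print(f"{k:2d} noise regressors: R^2 = {r2_path[k]:.4f}")
```

R² never decreases as terms are added, and once the number of fitted parameters equals the number of observations the fit is exact. None of the added terms carries information about the response, which is why a large R² by itself does not identify a good model.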

Much attention was given in this chapter to outlier detection. A classical serious misuse of statistics centers around the decision made concerning the detection of outliers. We hope it is clear that the analyst should absolutely not carry out the exercise of detecting outliers, eliminating them from the data set, fitting a new model, reporting outlier detection, and so on. This is a tempting and disastrous procedure for arriving at a model that fits the data well, with the result being an example of how to lie with statistics. If an outlier is detected, the history of the data should be checked for possible clerical or procedural error before it is eliminated from the data set. One must remember that an outlier by definition is a data point that the model did not fit well. The problem may not be in the data but rather in the model selection. A changed model may result in the point not being detected as an outlier.
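To see how a point can be flagged under one model and not under another, consider the following sketch. It rests on several assumptions: the data are synthetic and deterministic (exponential growth with a small alternating perturbation), the helper `studentized_residuals` is hypothetical, the "changed model" is a log transformation of the response, and the cutoff of 2 in absolute value is only a common rule of thumb.

```python
import numpy as np

def studentized_residuals(x, y):
    """Internally studentized residuals e_i / (s * sqrt(1 - h_ii)) for y on x."""
    A = np.column_stack([np.ones(len(y)), x])
    H = A @ np.linalg.pinv(A)              # hat matrix
    resid = y - H @ y
    dof = len(y) - A.shape[1]
    s = np.sqrt(resid @ resid / dof)
    return resid / (s * np.sqrt(1.0 - np.diag(H)))

# Assumed data: exponential growth with a small alternating perturbation.
x = np.arange(11.0)
wiggle = 0.05 * (-1.0) ** np.arange(11)
y = np.exp(0.5 * x + wiggle)

r_line = studentized_residuals(x, y)           # straight line in the original response
r_log = studentized_residuals(x, np.log(y))    # straight line in log(y)

print(f"last point, straight-line model: {r_line[-1]:.2f}")
print(f"last point, log-response model:  {r_log[-1]:.2f}")
```

Under the straight-line model the last observation has a studentized residual above 2 in absolute value and would be a candidate for deletion, yet nothing is wrong with that observation. With the response on the log scale the same point is fit without difficulty, so deleting it would have hidden a model inadequacy rather than removed a bad measurement.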

There are many types of responses that occur naturally in practice but can’t be used in an analysis of standard least squares because classic least squares assumptions do not hold. The assumptions that often fail are those of normal errors and homogeneous variance. For example, if the response is a proportion, say proportion defective, the response distribution is related to the binomial distribution. A second response that occurs often in practice is that of Poisson counts. Clearly the distribution is not normal, and the response variance, which is equal to the Poisson mean, will vary from observation to observation. For more details on these nonideal conditions, see Myers et al. (2008) in the Bibliography.
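The failure of the homogeneous variance assumption for these two response types can be checked by simulation. This is a minimal sketch with assumed values: the Poisson means, the defect probabilities, the number of items per sample, and the seed are all chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)     # arbitrary seed, for reproducibility only
n_sim = 100_000

# Poisson counts at three assumed mean levels: the variance tracks the mean.
means = np.array([2.0, 10.0, 50.0])
counts = rng.poisson(lam=means, size=(n_sim, 3))
count_var = counts.var(axis=0)

# Proportion defective in samples of 50 items at three assumed defect rates:
# the variance of the proportion is p * (1 - p) / n_items.
n_items = 50
p = np.array([0.05, 0.20, 0.50])
props = rng.binomial(n_items, p, size=(n_sim, 3)) / n_items
prop_var = props.var(axis=0)
prop_var_theory = p * (1 - p) / n_items

print("Poisson mean:        ", means)
print("simulated variance:  ", np.round(count_var, 2))
print("binomial p:          ", p)
print("simulated variance:  ", np.round(prop_var, 5))
print("p(1 - p)/n:          ", np.round(prop_var_theory, 5))
```

In both cases the variance of the response changes with its mean, so observations taken at different regressor settings do not share a common error variance, which is the condition standard least squares relies on.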