Example 12.16 In some locations, there is a strong association between concentrations of two different pollutants. The article “The Carbon Component of the Los Angeles Aerosol: Source Apportionment and Contributions to the Visibility Budget” (J. of Air Pollution Control Fed., 1984: 643–650) reports the accompanying data on ozone concentration x (ppm) and secondary carbon concentration y (μg/m³).
CHAPTER 12 Simple Linear Regression and Correlation
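The summary quantities are $n = 16$, $\sum x_i = 1.656$, $\sum y_i = 170.6$, $\sum x_i^2 = .196912$, $\sum x_i y_i = 20.0397$, and $\sum y_i^2 = 2253.56$, from which

$$r = \frac{20.0397 - (1.656)(170.6)/16}{\sqrt{.196912 - (1.656)^2/16}\,\sqrt{2253.56 - (170.6)^2/16}} = \frac{2.3826}{(.1597)(20.8456)} = .716$$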
The point estimate of the population correlation coefficient $\rho$ between ozone concentration and secondary carbon concentration is $\hat{\rho} = r = .716$.
■
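The arithmetic in Example 12.16 can be checked with a short Python sketch. It uses only the summary quantities quoted in the example; the function name `sample_correlation` and its argument names are illustrative choices, not something taken from the text.

```python
import math

# Summary quantities quoted in Example 12.16
n = 16
sum_x = 1.656
sum_y = 170.6
sum_x2 = 0.196912
sum_xy = 20.0397
sum_y2 = 2253.56


def sample_correlation(n, sum_x, sum_y, sum_x2, sum_xy, sum_y2):
    """Sample correlation coefficient r computed from summary sums."""
    s_xy = sum_xy - sum_x * sum_y / n   # corrected sum of cross products
    s_xx = sum_x2 - sum_x ** 2 / n      # corrected sum of squares for x
    s_yy = sum_y2 - sum_y ** 2 / n      # corrected sum of squares for y
    return s_xy / math.sqrt(s_xx * s_yy)


r = sample_correlation(n, sum_x, sum_y, sum_x2, sum_xy, sum_y2)
print(round(r, 3))  # 0.716, matching the value reported in the example
```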
The small-sample intervals and test procedures presented in Chapters 7–9 were based on an assumption of population normality. To test hypotheses about $\rho$, an analogous assumption about the distribution of pairs of (x, y) values in the population is required. We are now assuming that both X and Y are random, whereas much of our regression work focused on x fixed by the experimenter.
ASSUMPTION
The joint probability distribution of (X, Y) is specified by

$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{\left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\,\frac{(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \left(\frac{y-\mu_2}{\sigma_2}\right)^2}{2(1-\rho^2)}\right\} \qquad -\infty < x < \infty,\ -\infty < y < \infty \tag{12.9}$$

where $\mu_1$ and $\sigma_1$ are the mean and standard deviation of X, and $\mu_2$ and $\sigma_2$ are the mean and standard deviation of Y; f(x, y) is called the bivariate normal probability distribution.
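As a rough illustration of (12.9), the density can be evaluated directly in Python. This is a minimal sketch: the function name `bivariate_normal_pdf` and the parameter values in the usage line are hypothetical choices made for illustration only.

```python
import math


def bivariate_normal_pdf(x, y, mu1, sigma1, mu2, sigma2, rho):
    """Evaluate the bivariate normal density at the point (x, y)."""
    if sigma1 <= 0 or sigma2 <= 0 or not -1 < rho < 1:
        raise ValueError("need sigma1 > 0, sigma2 > 0 and -1 < rho < 1")
    zx = (x - mu1) / sigma1                      # standardized x
    zy = (y - mu2) / sigma2                      # standardized y
    quad = zx * zx - 2 * rho * zx * zy + zy * zy  # bracketed term in the exponent
    norm = 2 * math.pi * sigma1 * sigma2 * math.sqrt(1 - rho * rho)
    return math.exp(-quad / (2 * (1 - rho * rho))) / norm


# Hypothetical parameters: standard marginals with rho = 0.5.
# The density is evaluated at the point of means, where the surface peaks.
peak = bivariate_normal_pdf(0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.5)
print(peak)
```

At the point of means the exponent is zero, so the value printed is just the normalizing constant $1 / (2\pi\sigma_1\sigma_2\sqrt{1-\rho^2})$.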
The bivariate normal distribution is obviously rather complicated, but for our purposes we need only a passing acquaintance with several of its properties. The surface determined by f(x, y) lies entirely above the x, y plane [$f(x, y) \ge 0$] and has a three-dimensional bell- or mound-shaped appearance, as illustrated in Figure 12.21. If we slice through the surface with any plane perpendicular to the x, y plane and look at the shape of the curve sketched out on the “slicing plane,” the result is a normal curve. More precisely, if $X = x$, it can be shown that the (conditional) distribution of Y is normal with mean $\mu_{Y \cdot x} = \mu_2 - \rho\mu_1\sigma_2/\sigma_1 + \rho\sigma_2 x/\sigma_1$ and variance $(1 - \rho^2)\sigma_2^2$. This is exactly the model used in simple linear regression with $\beta_0 = \mu_2 - \rho\mu_1\sigma_2/\sigma_1$, $\beta_1 = \rho\sigma_2/\sigma_1$, and $\sigma^2 = (1 - \rho^2)\sigma_2^2$ independent of x. The implication is that if the observed pairs $(x_i, y_i)$ are actually drawn from a bivariate normal distribution, then the simple linear regression model is an appropriate way of studying the behavior of Y for fixed x. If $\rho = 0$, then $\mu_{Y \cdot x} = \mu_2$ independent of x; in fact, when $\rho = 0$, the joint probability density function f(x, y) of (12.9) can be factored as $f_1(x) f_2(y)$, which implies that X and Y are independent variables.
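The link between the bivariate normal parameters and the regression model can be sketched in Python. Under the conditional-distribution result stated above, the intercept is $\mu_2 - \rho\mu_1\sigma_2/\sigma_1$, the slope is $\rho\sigma_2/\sigma_1$, and the error variance is $(1 - \rho^2)\sigma_2^2$. The function names and the numerical parameter values below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def regression_parameters(mu1, sigma1, mu2, sigma2, rho):
    """Return (beta0, beta1, sigma_sq) implied by bivariate normal parameters."""
    beta1 = rho * sigma2 / sigma1             # slope of the conditional mean
    beta0 = mu2 - beta1 * mu1                 # intercept of the conditional mean
    sigma_sq = (1 - rho ** 2) * sigma2 ** 2   # conditional variance, free of x
    return beta0, beta1, sigma_sq


def conditional_mean(x, mu1, sigma1, mu2, sigma2, rho):
    """Mean of Y given X = x under the bivariate normal model."""
    beta0, beta1, _ = regression_parameters(mu1, sigma1, mu2, sigma2, rho)
    return beta0 + beta1 * x


# Hypothetical parameter values, for illustration only
beta0, beta1, sigma_sq = regression_parameters(2.0, 1.0, 5.0, 2.0, 0.5)
print(beta0, beta1, sigma_sq)
print(conditional_mean(2.0, 2.0, 1.0, 5.0, 2.0, 0.5))  # at x = mu1 the mean is mu2
```

Setting `rho` to 0 in this sketch gives a slope of 0 and an intercept of `mu2`, which is the "independent of x" case described in the text.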
Figure 12.21 A graph of the bivariate normal pdf

Assuming that the pairs are drawn from a bivariate normal distribution allows us to test hypotheses about $\rho$ and to construct a CI. There is no completely satisfactory way to check the plausibility of the bivariate normality assumption. A partial check involves constructing two separate normal probability plots, one for the sample $x_i$’s and another for the sample $y_i$’s, since bivariate normality implies that the marginal distributions of both X and Y are normal. If either plot deviates substantially from a straight-line pattern, the following inferential procedures should not be used for small n.
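The partial check described above could be sketched as follows. The code pairs each ordered observation with a standard normal percentile and reports how close the pairs come to a straight line. Several things here are assumptions rather than part of the text: the $(i - .5)/n$ percentile convention, the use of a correlation coefficient as a numerical summary of straightness, the function names, and the sample values, which are made up purely to exercise the code.

```python
import math
from statistics import NormalDist


def normal_plot_points(sample):
    """Pair standard normal percentiles with the ordered observations."""
    n = len(sample)
    ordered = sorted(sample)
    # Assumed convention: the i-th smallest value goes with the
    # 100(i - .5)/n-th percentile of the standard normal distribution.
    scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(scores, ordered))


def plot_correlation(points):
    """Correlation between normal scores and ordered observations."""
    n = len(points)
    mean_z = sum(z for z, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    s_zv = sum((z - mean_z) * (v - mean_v) for z, v in points)
    s_zz = sum((z - mean_z) ** 2 for z, _ in points)
    s_vv = sum((v - mean_v) ** 2 for _, v in points)
    return s_zv / math.sqrt(s_zz * s_vv)


# Made-up values, NOT the data from the article
hypothetical_sample = [6.2, 4.1, 8.3, 5.0, 7.4, 5.6, 9.5, 6.9]
points = normal_plot_points(hypothetical_sample)
straightness = plot_correlation(points)
print(straightness)  # values near 1 suggest a roughly linear plot
```

In practice the same function would be applied once to the $x_i$'s and once to the $y_i$'s, and the plotted points would be inspected visually as well.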
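Testing for the Absence of Correlation

When $H_0\colon \rho = 0$ is true, the test statistic

$$T = \frac{R\sqrt{n-2}}{\sqrt{1-R^2}}$$

has a t distribution with $n - 2$ df.

Alternative Hypothesis      Rejection Region for Level $\alpha$ Test
$H_a\colon \rho > 0$        $t \ge t_{\alpha, n-2}$
$H_a\colon \rho < 0$        $t \le -t_{\alpha, n-2}$
$H_a\colon \rho \ne 0$      either $t \ge t_{\alpha/2, n-2}$ or $t \le -t_{\alpha/2, n-2}$

A P-value based on $n - 2$ df can be calculated as described previously.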
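As an illustration, the test can be sketched in Python using $r = .716$ and $n = 16$ from Example 12.16. The statistic is $t = r\sqrt{n-2}/\sqrt{1-r^2}$, referred to a t distribution with $n - 2$ df. The two-sided alternative is an assumption made here for illustration, since the text does not carry out this test for that example. The P-value is obtained by integrating the t density numerically with Simpson's rule, which stands in for a table lookup or a statistics library; the function names are hypothetical.

```python
import math


def t_statistic(r, n):
    """Test statistic for H0: rho = 0 based on the sample correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail_area(t, df, steps=2000):
    """P(T >= t): 0.5 minus the Simpson's-rule integral of the density over [0, t]."""
    if t < 0:
        return 1 - upper_tail_area(-t, df, steps)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


r, n = 0.716, 16                # values from Example 12.16
t_obs = t_statistic(r, n)
p_two_sided = 2 * upper_tail_area(abs(t_obs), n - 2)  # assumes Ha: rho != 0
print(t_obs, p_two_sided)
```

For a one-sided alternative, the P-value would instead be `upper_tail_area(t_obs, n - 2)` for $H_a\colon \rho > 0$ or `1 - upper_tail_area(t_obs, n - 2)` for $H_a\colon \rho < 0$.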