a. Construct a scatter plot. Does the simple linear regression model appear to be plausible?
b. Carry out a test of model utility.
c. Estimate true average yield when distance upslope is 75 by giving an interval of plausible values.
55. Verify that $V(\hat{\beta}_0 + \hat{\beta}_1 x)$ is indeed given by the expression in the text. [Hint: $V\left(\sum d_i Y_i\right) = \sum d_i^2 \cdot V(Y_i)$.]
56. The article “Bone Density and Insertion Torque as Predictors of Anterior Cruciate Ligament Graft Fixation Strength” (The Amer. J. of Sports Med., 2004: 1421–1429) gave the accompanying data on maximum insertion torque (N · m) and yield load (N), the latter being one measure of graft strength, for 15 different specimens.
a. Is it plausible that yield load is normally distributed?
b. Estimate true average yield load by calculating a confidence interval with a confidence level of 95%, and interpret the interval.
c. Here is output from Minitab for the regression of yield load on torque. Does the simple linear regression model specify a useful relationship between the variables?

Predictor     Coef   SE Coef     T      P
Constant    152.44     91.17  1.67  0.118
Torque      178.23     45.97  3.88  0.002

S = 73.2141   R-Sq = 53.6%   R-Sq(adj) = 50.0%

Source           DF      SS     MS      F      P
Regression        1   80554  80554  15.03  0.002
Residual Error   13   69684   5360
Total            14  150238

d. The authors of the cited paper state, “Consequently, we cannot but conclude that simple regression analysis-based methods are not clinically sufficient to predict individual fixation strength.” Do you agree? [Hint: Consider predicting yield load when torque is 2.0.]
x           10    20    30    45    50    70
y    500   590   410   470   450   480   510

x     80   100   120   140   160   170   190
y    450   360   400   300   410   280   350
Torque   1.8   2.2   1.9   1.3   2.1   2.2   1.6   2.1
Load     491   477   598   361   605   671   466   431

Torque   1.2   1.8   2.6   2.5   2.5   1.7   1.6
Load     384   422   554   577   642   348   446
12.5 Correlation
There are many situations in which the objective in studying the joint behavior of two variables is to see whether they are related, rather than to use one to predict the value of the other. In this section, we first develop the sample correlation coefficient r as a measure of how strongly related two variables x and y are in a sample and then relate r to the correlation coefficient $\rho$ defined in Chapter 5.
The Sample Correlation Coefficient r
Given n numerical pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, it is natural to speak of x and y as having a positive relationship if large x’s are paired with large y’s and small x’s with small y’s. Similarly, if large x’s are paired with small y’s and small x’s with large y’s, then a negative relationship between the variables is implied. Consider the quantity

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

Then if the relationship is strongly positive, an $x_i$ above the mean $\bar{x}$ will tend to be paired with a $y_i$ above the mean $\bar{y}$, so that $(x_i - \bar{x})(y_i - \bar{y}) > 0$, and this product will also be positive whenever both $x_i$ and $y_i$ are below their respective means. Thus a positive relationship implies that $S_{xy}$ will be positive. An analogous argument shows that when the relationship is negative, $S_{xy}$ will be negative, since most of the products $(x_i - \bar{x})(y_i - \bar{y})$ will be negative. This is illustrated in Figure 12.19.
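The two expressions for $S_{xy}$ can be checked numerically. The sketch below uses made-up data and two helper names of our own (`sxy_definition` and `sxy_shortcut`, neither of which comes from the text) to show that the deviation form and the computational form agree, and that the sign of $S_{xy}$ follows the direction of the relationship.

```python
def sxy_definition(x, y):
    """S_xy as the sum of products of deviations from the two means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))


def sxy_shortcut(x, y):
    """S_xy via the computational formula: sum(x*y) - sum(x) * sum(y) / n."""
    n = len(x)
    return sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n


# Hypothetical data, chosen only for illustration.
x_vals = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.1, 2.9, 4.2, 4.8, 6.1]    # tends to increase with x
y_down = [6.0, 5.1, 3.9, 3.2, 1.8]  # tends to decrease with x

print(sxy_definition(x_vals, y_up), sxy_shortcut(x_vals, y_up))
print(sxy_definition(x_vals, y_down), sxy_shortcut(x_vals, y_down))
```

The first printed pair should be positive and the second negative, with the two forms matching up to floating-point rounding.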
Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook andor eChapters. Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
Although $S_{xy}$ seems a plausible measure of the strength of a relationship, we do not yet have any idea of how positive or negative it can be. Unfortunately, $S_{xy}$ has a serious defect: By changing the unit of measurement for either x or y, $S_{xy}$ can be made either arbitrarily large in magnitude or arbitrarily close to zero. For example, if $S_{xy} = 25$ when x is measured in meters, then $S_{xy} = 25{,}000$ when x is measured in millimeters and .025 when x is expressed in kilometers. A reasonable condition to impose on any measure of how strongly x and y are related is that the calculated measure should not depend on the particular units used to measure them. This condition is achieved by modifying $S_{xy}$ to obtain the sample correlation coefficient.
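The unit dependence of $S_{xy}$ is easy to see numerically. In the sketch below, the data and the helper name `s_xy` are hypothetical; only the factor-of-1000 behavior is the point.

```python
def s_xy(x, y):
    """Computational formula: sum(x*y) - sum(x) * sum(y) / n."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n


# Hypothetical measurements: x in meters, y in some fixed unit.
x_m = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 4.0, 8.0]

x_mm = [1000.0 * v for v in x_m]  # same lengths expressed in millimeters
x_km = [v / 1000.0 for v in x_m]  # same lengths expressed in kilometers

base = s_xy(x_m, y)
print(base, s_xy(x_mm, y), s_xy(x_km, y))
```

Rescaling x by a factor c rescales $S_{xy}$ by the same factor, so its magnitude alone says nothing about the strength of the relationship.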
Figure 12.19 (a) Scatter plot with $S_{xy}$ positive; (b) scatter plot with $S_{xy}$ negative [+ means $(x_i - \bar{x})(y_i - \bar{y}) > 0$, and − means $(x_i - \bar{x})(y_i - \bar{y}) < 0$]
DEFINITION
The sample correlation coefficient for the n pairs $(x_1, y_1), \ldots, (x_n, y_n)$ is

$$r = \frac{S_{xy}}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}} \tag{12.8}$$

Example 12.15
An accurate assessment of soil productivity is critical to rational land-use planning. Unfortunately, as the author of the article “Productivity Ratings Based on Soil Series” (Prof. Geographer, 1980: 158–163) argues, an acceptable soil productivity index is not so easy to come by. One difficulty is that productivity is determined partly by which crop is planted, and the relationship between the yield of two different crops planted in the same soil may not be very strong. To illustrate, the article presents the accompanying data on corn yield x and peanut yield y (mT/Ha) for eight different types of soil.

x    2.4   3.4   4.6   3.7   2.2   3.3   4.0   2.1
y   1.33  2.12  1.80  1.65  2.00  1.76  2.11  1.63

With $\sum x_i = 25.7$, $\sum y_i = 14.40$, $\sum x_i^2 = 88.31$, $\sum x_i y_i = 46.856$, and $\sum y_i^2 = 26.4324$,

$$S_{xx} = 88.31 - \frac{(25.7)^2}{8} = 5.75 \qquad S_{yy} = 26.4324 - \frac{(14.40)^2}{8} = .5124$$

$$S_{xy} = 46.856 - \frac{(25.7)(14.40)}{8} = .5960$$

from which

$$r = \frac{.5960}{\sqrt{5.75}\,\sqrt{.5124}} = .347$$ ■
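The arithmetic in Example 12.15 can be reproduced directly from the eight data pairs. The following is a minimal sketch in plain Python; the variable names are ours, not the article's.

```python
from math import sqrt

# Data from Example 12.15: corn yield (x) and peanut yield (y), eight soil types.
corn = [2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1]
peanut = [1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63]
n = len(corn)

sum_x, sum_y = sum(corn), sum(peanut)

# Computational formulas for S_xx, S_yy, and S_xy.
s_xx = sum(v * v for v in corn) - sum_x ** 2 / n
s_yy = sum(v * v for v in peanut) - sum_y ** 2 / n
s_xy = sum(a * b for a, b in zip(corn, peanut)) - sum_x * sum_y / n

# Sample correlation coefficient, Equation (12.8).
r = s_xy / (sqrt(s_xx) * sqrt(s_yy))

print(f"Sxx = {s_xx:.4f}, Syy = {s_yy:.4f}, Sxy = {s_xy:.4f}, r = {r:.3f}")
```

Up to rounding, the printed values should match those in the example, with r close to .347.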
Properties of r
The most important properties of r are as follows:
1. The value of r does not depend on which of the two variables under study is labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.
3. $-1 \le r \le 1$
4. $r = 1$ if and only if (iff) all $(x_i, y_i)$ pairs lie on a straight line with positive slope, and $r = -1$ iff all $(x_i, y_i)$ pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model; in symbols, $(r)^2 = r^2$.

Property 1 stands in marked contrast to what happens in regression analysis, where virtually all quantities of interest (the estimated slope, estimated y-intercept, $s^2$, etc.) depend on which of the two variables is treated as the dependent variable. However, Property 5 shows that the proportion of variation in the dependent variable explained by fitting the simple linear regression model does not depend on which variable plays this role.

Property 2 is equivalent to saying that r is unchanged if each $x_i$ is replaced by $cx_i$ and if each $y_i$ is replaced by $dy_i$ (a change in the scale of measurement, with $c > 0$ and $d > 0$), as well as if each $x_i$ is replaced by $x_i - a$ and $y_i$ by $y_i - b$ (which changes the location of zero on the measurement axis). This implies, for example, that r is the same whether temperature is measured in °F or °C.

Property 3 tells us that the maximum value of r, corresponding to the largest possible degree of positive relationship, is $r = 1$, whereas the most negative relationship is identified with $r = -1$. According to Property 4, the largest positive and largest negative correlations are achieved only when all points lie along a straight line. Any other configuration of points, even if the configuration suggests a deterministic relationship between variables, will yield an r value less than 1 in absolute magnitude. Thus r measures the degree of linear relationship among variables. A value of r near 0 is not evidence of the lack of a strong relationship, but only the absence of a linear relation, so that such a value of r must be interpreted with caution. Figure 12.20 illustrates several configurations of points associated with different values of r.
Figure 12.20 Data plots for different values of r: (a) r near +1; (b) r near −1; (c) r near 0, no apparent relationship; (d) r near 0, nonlinear relationship
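Each of the five properties can be checked on a small data set. The sketch below is illustrative only: the temperature and response values are invented, and `corr` and `coef_determination` are helper names of our own.

```python
from math import sqrt


def corr(x, y):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def coef_determination(x, y):
    """1 - SSE/SST for the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - y_bar) ** 2 for b in y)
    return 1 - sse / sst


# Hypothetical temperatures (deg C) and an arbitrary response.
temp_c = [10.0, 15.0, 20.0, 25.0, 30.0]
resp = [12.0, 15.0, 14.0, 19.0, 21.0]
temp_f = [1.8 * t + 32.0 for t in temp_c]  # same temperatures in deg F

r_c = corr(temp_c, resp)
r_swapped = corr(resp, temp_c)             # Property 1: labels swapped
r_f = corr(temp_f, resp)                   # Property 2: units changed

# Property 4: points exactly on a line give r = +1 or r = -1.
line_up = [3.0 + 2.0 * t for t in temp_c]
line_down = [3.0 - 2.0 * t for t in temp_c]

# Deterministic but nonlinear: y = x^2 with x symmetric about 0 gives r = 0.
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_sq = [v * v for v in x_sym]

print(r_c, r_swapped, r_f)
print(corr(temp_c, line_up), corr(temp_c, line_down), corr(x_sym, y_sq))
print(r_c ** 2, coef_determination(temp_c, resp))  # Property 5
```

The last configuration mirrors panel (d) of Figure 12.20: y is completely determined by x, yet r is 0 because the relationship is not linear.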
A frequently asked question is, “When can it be said that there is a strong correlation between the variables, and when is the correlation weak?” Here is an informal rule of thumb for characterizing the value of r:

Weak        $-.5 \le r \le .5$
Moderate    either $-.8 < r < -.5$ or $.5 < r < .8$
Strong      either $r \ge .8$ or $r \le -.8$

It may surprise you that an r as substantial as .5 or −.5 goes in the weak category. The rationale is that if $r = .5$ or $-.5$, then $r^2 = .25$ in a regression with either variable playing the role of y. A regression model that explains at most 25% of observed variation is not in fact very impressive. In Example 12.15, the correlation between corn yield and peanut yield would be described as weak.
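The rule of thumb translates into a few comparisons. The function below is one possible encoding; its name, `describe_correlation`, is our own, and the cutoffs are exactly the informal ones stated above. Printing $r^2$ alongside each label shows why $|r| = .5$ still counts as weak.

```python
def describe_correlation(r):
    """Label r as weak, moderate, or strong using the informal rule of thumb."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size <= 0.5:
        return "weak"
    if size < 0.8:
        return "moderate"
    return "strong"


# 0.347 is the value from Example 12.15; the other values are arbitrary.
for r in (0.347, 0.5, -0.65, 0.8, -0.93):
    print(f"r = {r:+.3f}   r^2 = {r * r:.3f}   {describe_correlation(r)}")
```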
Inferences About the Population Correlation Coefficient
The correlation coefficient r is a measure of how strongly related x and y are in the observed sample. We can think of the pairs $(x_i, y_i)$ as having been drawn from a bivariate population of pairs, with $(X_i, Y_i)$ having some joint pmf or pdf. In Chapter 5, we defined the correlation coefficient $\rho(X, Y)$ by

$$\rho = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

where

$$\mathrm{Cov}(X, Y) = \begin{cases} \displaystyle\sum_x \sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y) & (X, Y) \text{ discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy & (X, Y) \text{ continuous} \end{cases}$$

If we think of $p(x, y)$ or $f(x, y)$ as describing the distribution of pairs of values within the entire population, $\rho$ becomes a measure of how strongly related x and y are in that population. Properties of $\rho$ analogous to those for r were given in Chapter 5.

The population correlation coefficient $\rho$ is a parameter or population characteristic, just as $\mu_X$, $\mu_Y$, $\sigma_X$, and $\sigma_Y$ are, so we can use the sample correlation coefficient to make various inferences about $\rho$. In particular, r is a point estimate for $\rho$, and the corresponding estimator is

$$\hat{\rho} = R = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}$$
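A small simulation can illustrate the idea of R as an estimator of $\rho$. The sketch below makes assumptions that go beyond the passage above: it takes the population to be standard bivariate normal with an arbitrarily chosen $\rho = .6$, and the helper names `sample_corr` and `draw_pairs` are hypothetical.

```python
import random
from math import sqrt


def sample_corr(pairs):
    """Observed value r of the estimator R for a list of (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    s_yy = sum((y - y_bar) ** 2 for _, y in pairs)
    return s_xy / sqrt(s_xx * s_yy)


def draw_pairs(rho, n, rng):
    """Draw n pairs from a standard bivariate normal population (assumed model).

    With Z1, Z2 independent standard normals, X = Z1 and
    Y = rho * Z1 + sqrt(1 - rho^2) * Z2 have correlation rho.
    """
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pairs.append((z1, rho * z1 + sqrt(1.0 - rho ** 2) * z2))
    return pairs


rng = random.Random(12345)  # fixed seed so the run is repeatable
rho = 0.6                   # assumed population correlation

r_small = sample_corr(draw_pairs(rho, 10, rng))
r_large = sample_corr(draw_pairs(rho, 20000, rng))
print(r_small, r_large)
```

With only 10 pairs the observed r can land some distance from $\rho$, whereas with many pairs it typically falls close to the population value.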
Example 12.16
In some locations, there is a strong association between concentrations of two different pollutants. The article “The Carbon Component of the Los Angeles Aerosol: Source Apportionment and Contributions to the Visibility Budget” (J. of Air Pollution Control Fed., 1984: 643–650) reports the accompanying data on ozone concentration x (ppm) and secondary carbon concentration y ($\mu g/m^3$).