Calculate a 95% CI for

a. Construct a scatter plot. Does the simple linear regression model appear to be plausible?
b. Carry out a test of model utility.
c. Estimate true average yield when distance upslope is 75 by giving an interval of plausible values.

x  10 20 30 45 50 70
y  500 590 410 470 450 480 510
x  80 100 120 140 160 170 190
y  450 360 400 300 410 280 350

55. Verify that $V(\hat{\beta}_0 + \hat{\beta}_1 x)$ is indeed given by the expression in the text. [Hint: $V(\sum d_i Y_i) = \sum d_i^2 \cdot V(Y_i)$.]

56. The article “Bone Density and Insertion Torque as Predictors of Anterior Cruciate Ligament Graft Fixation Strength” (The Amer. J. of Sports Med., 2004: 1421–1429) gave the accompanying data on maximum insertion torque (N · m) and yield load (N), the latter being one measure of graft strength, for 15 different specimens.

Torque  1.8 2.2 1.9 1.3 2.1 2.2 1.6 2.1
Load    491 477 598 361 605 671 466 431
Torque  1.2 1.8 2.6 2.5 2.5 1.7 1.6
Load    384 422 554 577 642 348 446

a. Is it plausible that yield load is normally distributed?
b. Estimate true average yield load by calculating a confidence interval with a confidence level of 95%, and interpret the interval.
c. Here is output from Minitab for the regression of yield load on torque. Does the simple linear regression model specify a useful relationship between the variables?

Predictor   Coef     SE Coef   T      P
Constant    152.44   91.17     1.67   0.118
Torque      178.23   45.97     3.88   0.002

S = 73.2141   R-Sq = 53.6%   R-Sq(adj) = 50.0%

Source           DF   SS       MS      F       P
Regression        1   80554    80554   15.03   0.002
Residual Error   13   69684    5360
Total            14   150238

d. The authors of the cited paper state, “Consequently, we cannot but conclude that simple regression analysis-based methods are not clinically sufficient to predict individual fixation strength.” Do you agree? [Hint: Consider predicting yield load when torque is 2.0.]

12.5 Correlation

There are many situations in which the objective in studying the joint behavior of two variables is to see whether they are related, rather than to use one to predict the value of the other. In this section, we first develop the sample correlation coefficient r as a measure of how strongly related two variables x and y are in a sample and then relate r to the correlation coefficient $\rho$ defined in Chapter 5.

The Sample Correlation Coefficient r

Given n numerical pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, it is natural to speak of x and y as having a positive relationship if large x’s are paired with large y’s and small x’s with small y’s. Similarly, if large x’s are paired with small y’s and small x’s with large y’s, then a negative relationship between the variables is implied. Consider the quantity

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

Then if the relationship is strongly positive, an $x_i$ above the mean $\bar{x}$ will tend to be paired with a $y_i$ above the mean $\bar{y}$, so that $(x_i - \bar{x})(y_i - \bar{y}) > 0$, and this product will also be positive whenever both $x_i$ and $y_i$ are below their respective means. Thus a positive relationship implies that $S_{xy}$ will be positive. An analogous argument shows that when the relationship is negative, $S_{xy}$ will be negative, since most of the products $(x_i - \bar{x})(y_i - \bar{y})$ will be negative. This is illustrated in Figure 12.19.

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapters. Editorial review has deemed that any suppressed content does not materially affect the overall learning experience.
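The two expressions for $S_{xy}$ given above (the deviation form and the computational form) and the sign argument can be checked numerically. The following is a minimal sketch; the function names and the five toy data pairs are hypothetical and chosen only for illustration, not taken from the text.

```python
def s_xy_deviation(xs, ys):
    """S_xy as the sum of products of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))


def s_xy_computational(xs, ys):
    """S_xy as sum(x*y) - (sum x)(sum y)/n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


# Hypothetical toy data: ys_up tends to rise with x, ys_down tends to fall.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_up = [2.0, 4.0, 5.0, 4.0, 6.0]
ys_down = [6.0, 4.0, 5.0, 4.0, 2.0]

positive_case = s_xy_deviation(xs, ys_up)     # positive relationship
negative_case = s_xy_deviation(xs, ys_down)   # negative relationship
print(positive_case, s_xy_computational(xs, ys_up))
print(negative_case, s_xy_computational(xs, ys_down))
```

Both forms agree up to floating-point rounding, and the sign of the result matches the direction of the relationship in each toy data set.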
Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.

Although $S_{xy}$ seems a plausible measure of the strength of a relationship, we do not yet have any idea of how positive or negative it can be. Unfortunately, $S_{xy}$ has a serious defect: By changing the unit of measurement for either x or y, $S_{xy}$ can be made either arbitrarily large in magnitude or arbitrarily close to zero. For example, if $S_{xy} = 25$ when x is measured in meters, then $S_{xy} = 25{,}000$ when x is measured in millimeters and .025 when x is expressed in kilometers. A reasonable condition to impose on any measure of how strongly x and y are related is that the calculated measure should not depend on the particular units used to measure them. This condition is achieved by modifying $S_{xy}$ to obtain the sample correlation coefficient.

Figure 12.19 (a) Scatter plot with $S_{xy}$ positive; (b) scatter plot with $S_{xy}$ negative [+ means $(x_i - \bar{x})(y_i - \bar{y}) > 0$, and − means $(x_i - \bar{x})(y_i - \bar{y}) < 0$]

DEFINITION
The sample correlation coefficient for the n pairs $(x_1, y_1), \ldots, (x_n, y_n)$ is

$$r = \frac{S_{xy}}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}} \qquad (12.8)$$

Example 12.15
An accurate assessment of soil productivity is critical to rational land-use planning. Unfortunately, as the author of the article “Productivity Ratings Based on Soil Series” (Prof. Geographer, 1980: 158–163) argues, an acceptable soil productivity index is not so easy to come by. One difficulty is that productivity is determined partly by which crop is planted, and the relationship between the yield of two different crops planted in the same soil may not be very strong. To illustrate, the article presents the accompanying data on corn yield x and peanut yield y (mT/Ha) for eight different types of soil.
x  2.4   3.4   4.6   3.7   2.2   3.3   4.0   2.1
y  1.33  2.12  1.80  1.65  2.00  1.76  2.11  1.63

With $\sum x_i = 25.7$, $\sum y_i = 14.40$, $\sum x_i^2 = 88.31$, $\sum x_i y_i = 46.856$, and $\sum y_i^2 = 26.4324$,

$$S_{xx} = 88.31 - \frac{(25.7)^2}{8} = 5.75 \qquad S_{yy} = 26.4324 - \frac{(14.40)^2}{8} = .5124$$

$$S_{xy} = 46.856 - \frac{(25.7)(14.40)}{8} = .5960$$

from which

$$r = \frac{.5960}{\sqrt{5.75}\,\sqrt{.5124}} = .347 \qquad ■$$

Properties of r

The most important properties of r are as follows:
1. The value of r does not depend on which of the two variables under study is labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.

3. $-1 \le r \le 1$
4. $r = 1$ if and only if (iff) all $(x_i, y_i)$ pairs lie on a straight line with positive slope, and $r = -1$ iff all $(x_i, y_i)$ pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model; in symbols, $(r)^2 = r^2$.

Property 1 stands in marked contrast to what happens in regression analysis, where virtually all quantities of interest (the estimated slope, estimated y-intercept, $s^2$, etc.) depend on which of the two variables is treated as the dependent variable. However, Property 5 shows that the proportion of variation in the dependent variable explained by fitting the simple linear regression model does not depend on which variable plays this role.

Property 2 is equivalent to saying that r is unchanged if each $x_i$ is replaced by $cx_i$ and if each $y_i$ is replaced by $dy_i$ (a change in the scale of measurement, with c and d positive), as well as if each $x_i$ is replaced by $x_i - a$ and $y_i$ by $y_i - b$ (which changes the location of zero on the measurement axis). This implies, for example, that r is the same whether temperature is measured in °F or °C.

Property 3 tells us that the maximum value of r, corresponding to the largest possible degree of positive relationship, is $r = 1$, whereas the most negative relationship is identified with $r = -1$. According to Property 4, the largest positive and largest negative correlations are achieved only when all points lie along a straight line. Any other configuration of points, even if the configuration suggests a deterministic relationship between variables, will yield an r value less than 1 in absolute magnitude. Thus r measures the degree of linear relationship among variables. A value of r near 0 is not evidence of the lack of a strong relationship, but only the absence of a linear relation, so that such a value of r must be interpreted with caution. Figure 12.20 illustrates several configurations of points associated with different values of r.
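The calculation in Example 12.15 and Properties 1, 2, and 5 can be checked numerically. The sketch below uses the corn and peanut yields from the example; the function names and the particular rescaling constants (1000, 5, 0.5, 2) are hypothetical choices made only to illustrate a change of units and origin.

```python
import math


def s_sums(xs, ys):
    """Return (S_xx, S_yy, S_xy) from the computational formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


def sample_corr(xs, ys):
    """Sample correlation coefficient r = S_xy / (sqrt(S_xx) sqrt(S_yy))."""
    s_xx, s_yy, s_xy = s_sums(xs, ys)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))


def coef_determination(xs, ys):
    """1 - SSE/SST from the least squares fit of ys on xs."""
    n = len(xs)
    s_xx, s_yy, s_xy = s_sums(xs, ys)
    slope = s_xy / s_xx
    intercept = sum(ys) / n - slope * sum(xs) / n
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - sse / s_yy


# Data from Example 12.15.
corn = [2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1]
peanut = [1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63]

r = sample_corr(corn, peanut)
print(round(r, 3))  # 0.347

# Property 1: swapping the roles of x and y leaves r unchanged.
r_swapped = sample_corr(peanut, corn)

# Property 2: a change of scale (positive factors) and origin leaves r unchanged.
# The constants are hypothetical, standing in for a change of units.
corn_rescaled = [1000 * x - 5 for x in corn]
peanut_rescaled = [0.5 * y + 2 for y in peanut]
r_rescaled = sample_corr(corn_rescaled, peanut_rescaled)

# Property 5: r squared equals the coefficient of determination,
# whichever variable plays the role of the dependent variable.
r2_peanut_on_corn = coef_determination(corn, peanut)
r2_corn_on_peanut = coef_determination(peanut, corn)
print(r_swapped, r_rescaled, r2_peanut_on_corn, r2_corn_on_peanut)
```

In contrast, $S_{xy}$ itself is multiplied by 1000 × 0.5 = 500 under this rescaling, which is the defect that dividing by $\sqrt{S_{xx}}\sqrt{S_{yy}}$ removes.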
Figure 12.20 Data plots for different values of r: (a) r near +1; (b) r near −1; (c) r near 0, no apparent relationship; (d) r near 0, nonlinear relationship

A frequently asked question is, “When can it be said that there is a strong correlation between the variables, and when is the correlation weak?” Here is an informal rule of thumb for characterizing the value of r:

Weak:      $-.5 \le r \le .5$
Moderate:  either $-.8 < r < -.5$ or $.5 < r < .8$
Strong:    either $r \ge .8$ or $r \le -.8$

It may surprise you that an r as substantial as .5 or −.5 goes in the weak category. The rationale is that if $r = .5$ or $-.5$, then $r^2 = .25$ in a regression with either variable playing the role of y. A regression model that explains at most 25% of observed variation is not in fact very impressive. In Example 12.15, the correlation between corn yield and peanut yield would be described as weak.

Inferences About the Population Correlation Coefficient

The correlation coefficient r is a measure of how strongly related x and y are in the observed sample. We can think of the pairs $(x_i, y_i)$ as having been drawn from a bivariate population of pairs, with $(X_i, Y_i)$ having some joint pmf or pdf. In Chapter 5, we defined the correlation coefficient $\rho(X, Y)$ by

$$\rho = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}$$

where

$$\mathrm{Cov}(X, Y) = \begin{cases} \displaystyle\sum_x \sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y) & (X, Y) \text{ discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy & (X, Y) \text{ continuous} \end{cases}$$

If we think of $p(x, y)$ or $f(x, y)$ as describing the distribution of pairs of values within the entire population, $\rho$ becomes a measure of how strongly related x and y are in that population. Properties of $\rho$ analogous to those for r were given in Chapter 5.
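As an illustration of the discrete case of this definition and of the informal rule of thumb, here is a minimal sketch. The joint pmf `toy_pmf` is hypothetical (it is not from the text), and the function names are assumptions introduced for this example.

```python
import math


def strength(r):
    """Informal rule of thumb: weak, moderate, or strong."""
    if abs(r) >= 0.8:
        return "strong"
    if abs(r) > 0.5:
        return "moderate"
    return "weak"


def population_corr(pmf):
    """rho = Cov(X, Y) / (sigma_X * sigma_Y) for a joint pmf.

    pmf maps (x, y) pairs to probabilities that sum to 1.
    """
    mu_x = sum(x * p for (x, _), p in pmf.items())
    mu_y = sum(y * p for (_, y), p in pmf.items())
    cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in pmf.items())
    var_x = sum((x - mu_x) ** 2 * p for (x, _), p in pmf.items())
    var_y = sum((y - mu_y) ** 2 * p for (_, y), p in pmf.items())
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


# Hypothetical joint pmf on {0, 1} x {0, 1}, for illustration only.
toy_pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

rho = population_corr(toy_pmf)
print(round(rho, 3), strength(rho))  # 0.6 moderate
print(strength(0.347))               # weak, as for r in Example 12.15
```

For this toy pmf, Cov(X, Y) = .15 and $\sigma_X = \sigma_Y = .5$, so $\rho = .6$, which the rule of thumb would label moderate.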
The population correlation coefficient $\rho$ is a parameter or population characteristic, just as $\mu_X$, $\mu_Y$, $\sigma_X$, and $\sigma_Y$ are, so we can use the sample correlation coefficient to make various inferences about $\rho$. In particular, r is a point estimate for $\rho$, and the corresponding estimator is

$$\hat{\rho} = R = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}$$

Example 12.16
In some locations, there is a strong association between concentrations of two different pollutants. The article “The Carbon Component of the Los Angeles Aerosol: Source Apportionment and Contributions to the Visibility Budget” (J. of Air Pollution Control Fed., 1984: 643–650) reports the accompanying data on ozone concentration x (ppm) and secondary carbon concentration y ($\mu g/m^3$).