Chapter 19
SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS
Let X and Y be two random variables with joint probability density function f(x, y). Then the conditional density of Y given that X = x is

    f(y|x) = f(x, y) / g(x),

where

    g(x) = ∫ f(x, y) dy    (the integral taken over all y)

is the marginal density of X. The conditional mean of Y,

    E(Y | X = x) = ∫ y f(y|x) dy,

is called the regression equation of Y on X.

Example 19.1. Let X and Y be two random variables with the joint probability density function

    f(x, y) = x e^{−x(1+y)}  if x > 0, y > 0;   f(x, y) = 0 otherwise.

Find the regression equation of Y on X and then sketch the regression curve.
Answer: The marginal density of X is given by

    g(x) = ∫₀^∞ x e^{−x(1+y)} dy = x e^{−x} ∫₀^∞ e^{−xy} dy = e^{−x},  x > 0.

The conditional density of Y given X = x is

    f(y|x) = f(x, y)/g(x) = x e^{−x(1+y)} / e^{−x} = x e^{−xy},  y > 0.

The conditional mean of Y given X = x is given by

    E(Y|x) = ∫₀^∞ y f(y|x) dy = ∫₀^∞ y x e^{−xy} dy = 1/x.

Thus the regression equation of Y on X is

    E(Y|x) = 1/x,  x > 0.

The graph of this regression equation of Y on X is shown below.

[Figure: graph of the regression equation E(Y|x) = 1/x]
Probability and Mathematical Statistics
From this example it is clear that the conditional mean E(Y|x) is a function of x. If this function is of the form α + βx, then the corresponding regression equation is called a linear regression equation; otherwise it is called a nonlinear regression equation. The term linear regression refers to a specification that is linear in the parameters. Thus E(Y|x) = α + βx² is also a linear regression equation. The regression equation E(Y|x) = α x^β is an example of a nonlinear regression equation.

The main purpose of regression analysis is to predict Y_i from knowledge of x_i using a relationship like

    E(Y_i | x_i) = α + β x_i.

The Y_i is called the response or dependent variable, whereas x_i is called the predictor or independent variable. The term regression has an interesting history, dating back to Francis Galton (1822-1911). Galton studied the heights of fathers and sons, in which he observed a regression (a "turning back") from the heights of sons to the heights of their fathers. That is, tall fathers tend to have tall sons and short fathers tend to have short sons. However, he also found that very tall fathers tend to have shorter sons and very short fathers tend to have taller sons. Galton called this phenomenon regression towards the mean.
In regression analysis, that is, when investigating the relationship between a predictor and a response variable, there are two steps to the analysis. The first step is entirely data oriented; this step is always performed. The second step is the statistical one, in which we draw conclusions about the (population) regression equation E(Y_i | x_i). Normally the regression equation contains several parameters. There are two well-known methods for finding estimates of the parameters of the regression equation: (1) the least squares method and (2) the normal regression method.
19.1. The Least Squares Method

Let {(x_i, y_i) | i = 1, 2, ..., n} be a set of data. Assume that

    E(Y_i | x_i) = α + β x_i,     (1)

that is,

    y_i = α + β x_i,   i = 1, 2, ..., n.
Then the sum of the squares of the errors is given by

    E(α, β) = Σ_{i=1}^{n} (y_i − α − β x_i)².

The least squares estimates of α and β are defined to be those values which minimize E(α, β). That is,

    (α̂, β̂) = arg min_{(α, β)} E(α, β).
This least squares method is due to Adrien M. Legendre (1752-1833). Note that the least squares method also works even if the regression equation is nonlinear (that is, not of the form (1)).
Next, we give several examples to illustrate the method of least squares.

Example 19.2. Given the five pairs of points (x, y) shown in the table below, what line of the form y = x + b best fits the data by the method of least squares?
Answer: Suppose the best fit line is y = x + b. Then for each x_i, x_i + b is the estimated value of y_i. The difference between y_i and the estimated value of y_i is the error, or residual, corresponding to the i-th measurement. That is, the error corresponding to the i-th measurement is given by

    ε_i = y_i − x_i − b.

Hence the sum of the squares of the errors is

    E(b) = Σ_{i=1}^{5} ε_i² = Σ_{i=1}^{5} (y_i − x_i − b)².

Differentiating E(b) with respect to b, we get

    dE(b)/db = 2 Σ_{i=1}^{5} (y_i − x_i − b)(−1).

Setting dE(b)/db equal to 0, we get

    Σ_{i=1}^{5} (y_i − x_i − b) = 0,

which is

    5b = Σ_{i=1}^{5} y_i − Σ_{i=1}^{5} x_i.

Using the data, we see that

    5b = 14 − 6,

which yields b = 8/5. Hence the best fitted line is

    y = x + 8/5.
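The computation above is easy to check numerically. Since the data table is not reproduced in this copy, the values below are hypothetical, chosen only so that Σx = 6 and Σy = 14 as in the example; for the model y = x + b the least squares estimate of b is simply the mean of the residuals y − x.

```python
import numpy as np

# Hypothetical data consistent with the sums used in the example:
# sum(x) = 6 and sum(y) = 14, so b = (14 - 6)/5 = 8/5.
x = np.array([0.0, 1.0, 1.0, 2.0, 2.0])
y = np.array([2.0, 2.0, 3.0, 3.0, 4.0])

# Minimizing sum((y - x - b)^2) over b gives b = mean(y - x).
b = np.mean(y - x)
print(b)  # 1.6 = 8/5
```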
Example 19.3. Suppose the line y = bx + 1 is fit by the method of least squares to the 3 data points shown in the table below. What is the value of the constant b?

Answer: The error corresponding to the i-th measurement is given by

    ε_i = y_i − b x_i − 1.

Hence the sum of the squares of the errors is

    E(b) = Σ_{i=1}^{3} ε_i² = Σ_{i=1}^{3} (y_i − b x_i − 1)².

Differentiating E(b) with respect to b, we get

    dE(b)/db = 2 Σ_{i=1}^{3} (y_i − b x_i − 1)(−x_i).

Setting dE(b)/db equal to 0, we get

    Σ_{i=1}^{3} x_i (y_i − b x_i − 1) = 0,

which in turn yields

    b = Σ_{i=1}^{3} x_i (y_i − 1) / Σ_{i=1}^{3} x_i².

Using the given data (for which Σ x_i² = 21), we compute b, and the best fitted line is

    y = b x + 1.

Example 19.4. Observations y_1, y_2, ..., y_n are assumed to come from a model with

    E(Y_i | x_i) = θ + 2 ln x_i,

where θ is an unknown parameter and x_1, x_2, ..., x_n are given constants. What is the least squares estimate of the parameter θ?

Answer: The sum of the squares of the errors is
    E(θ) = Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} (y_i − θ − 2 ln x_i)².

Differentiating E(θ) with respect to θ, we get

    dE(θ)/dθ = −2 Σ_{i=1}^{n} (y_i − θ − 2 ln x_i).

Setting dE(θ)/dθ equal to 0, we get

    Σ_{i=1}^{n} (y_i − θ − 2 ln x_i) = 0,

which is

    θ = (1/n) Σ_{i=1}^{n} y_i − (2/n) Σ_{i=1}^{n} ln x_i.

Hence the least squares estimate of θ is

    θ̂ = ȳ − (2/n) Σ_{i=1}^{n} ln x_i.
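The estimate derived above can be checked numerically. In this sketch the data are made up for illustration; the assertion confirms that θ̂ = ȳ − (2/n) Σ ln x_i does at least as well as nearby values of θ in the sum of squared errors.

```python
import numpy as np

# Illustrative data for the model E(Y_i | x_i) = theta + 2 ln x_i.
x = np.array([1.0, 2.0, 4.0, 8.0])
y = np.array([1.5, 3.0, 4.1, 5.6])

# Least squares estimate derived above: theta_hat = ybar - (2/n) * sum(ln x_i).
theta_hat = y.mean() - 2.0 * np.mean(np.log(x))

# theta_hat should beat any nearby theta in sum of squared errors.
sse = lambda th: np.sum((y - th - 2.0 * np.log(x)) ** 2)
assert sse(theta_hat) <= min(sse(theta_hat + 0.1), sse(theta_hat - 0.1))
print(round(theta_hat, 4))  # 1.4706
```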
Example 19.5. Given the three pairs of points (x, y) shown in the table below, what curve of the form y = x^γ best fits the data by the method of least squares?

Answer: The sum of the squares of the errors is given by

    E(γ) = Σ_{i=1}^{3} (y_i − x_i^γ)².

Differentiating E(γ) with respect to γ, we get

    dE(γ)/dγ = −2 Σ_{i=1}^{3} (y_i − x_i^γ) x_i^γ ln x_i.

Setting this derivative dE(γ)/dγ to 0, we get

    Σ_{i=1}^{3} y_i x_i^γ ln x_i = Σ_{i=1}^{3} x_i^{2γ} ln x_i.

Using the given data we obtain

    2 · 4^γ ln 4 = 4^{2γ} ln 4 + 2^{2γ} ln 2,

which simplifies to

    4^γ = 3/2.

Taking the natural logarithm of both sides of the above expression, we get

    γ = (ln 3 − ln 2) / ln 4 = 0.2925.

Thus the least squares best fit model is y = x^0.2925.
Example 19.6. Observations y_1, y_2, ..., y_n are assumed to come from a model with

    E(Y_i | x_i) = α + β x_i,

where α and β are unknown parameters, and x_1, x_2, ..., x_n are given constants. What are the least squares estimates of the parameters α and β?

Answer: The sum of the squares of the errors is given by

    E(α, β) = Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} (y_i − α − β x_i)².

Differentiating E(α, β) with respect to α and β respectively, we get

    ∂E(α, β)/∂α = −2 Σ_{i=1}^{n} (y_i − α − β x_i),
    ∂E(α, β)/∂β = −2 Σ_{i=1}^{n} (y_i − α − β x_i) x_i.

Setting these partial derivatives ∂E(α, β)/∂α and ∂E(α, β)/∂β to 0, we get

    Σ_{i=1}^{n} (y_i − α − β x_i) = 0     (3)

and

    Σ_{i=1}^{n} (y_i − α − β x_i) x_i = 0.     (4)

From (3), we obtain

    Σ_{i=1}^{n} y_i = nα + β Σ_{i=1}^{n} x_i,

which is

    ȳ = α + β x̄.     (5)

Similarly, from (4), we have

    Σ_{i=1}^{n} x_i y_i = α Σ_{i=1}^{n} x_i + β Σ_{i=1}^{n} x_i²,     (6)

which can be rewritten as follows:

    Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) + n x̄ ȳ = n α x̄ + β [ Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄) + n x̄² ].

Defining

    S_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)   and   S_xx = Σ_{i=1}^{n} (x_i − x̄)²,

we see that (6) reduces to

    S_xy + n x̄ ȳ = n α x̄ + β (S_xx + n x̄²).     (7)

Substituting (5) into (7), we have

    S_xy + n x̄ ȳ = n (ȳ − β x̄) x̄ + β (S_xx + n x̄²).

Simplifying the last equation, we get

    S_xy = S_xx β,     (8)

which is

    β = S_xy / S_xx.

In view of (8) and (5), we get

    α = ȳ − (S_xy / S_xx) x̄.

Thus the least squares estimates of α and β are

    α̂ = ȳ − (S_xy / S_xx) x̄   and   β̂ = S_xy / S_xx.
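The estimates α̂ = ȳ − (S_xy/S_xx) x̄ and β̂ = S_xy/S_xx translate directly into code. A minimal sketch (the data are illustrative):

```python
import numpy as np

def least_squares_line(x, y):
    """Least squares estimates for the model E(Y|x) = alpha + beta*x,
    using S_xy = sum((x - xbar)(y - ybar)) and S_xx = sum((x - xbar)^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    s_xy = np.sum((x - xbar) * (y - ybar))
    s_xx = np.sum((x - xbar) ** 2)
    beta_hat = s_xy / s_xx
    alpha_hat = ybar - beta_hat * xbar
    return alpha_hat, beta_hat

# Points lying exactly on y = 2 + 3x are recovered exactly.
a, b = least_squares_line([1, 2, 3, 4], [5, 8, 11, 14])
print(a, b)  # 2.0 3.0
```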
We need some notation. The random variable Y given X = x will be denoted by Y_x. Note that this is the variable that appears in the model E(Y|x) = α + βx. When one chooses in succession values x_1, x_2, ..., x_n for x, a sequence Y_{x_1}, Y_{x_2}, ..., Y_{x_n} of random variables is obtained. For the sake of convenience, we denote the random variables Y_{x_1}, Y_{x_2}, ..., Y_{x_n} simply as Y_1, Y_2, ..., Y_n. To do some statistical analysis, we make the following three assumptions:

(1) E(Y_x) = α + βx, so that µ_i = E(Y_i) = α + β x_i;
(2) Y_1, Y_2, ..., Y_n are independent;
(3) each of the random variables Y_1, Y_2, ..., Y_n has the same variance σ².
Theorem 19.1. Under the above three assumptions, the least squares estimators α̂ and β̂ of the linear model E(Y|x) = α + βx are unbiased.

Proof: From the previous example, we know that the least squares estimators of α and β are

    α̂ = Ȳ − β̂ x̄   and   β̂ = (1/S_xx) Σ_{i=1}^{n} (x_i − x̄)(Y_i − Ȳ).

First, we show β̂ is unbiased. Consider

    E(β̂) = (1/S_xx) Σ_{i=1}^{n} (x_i − x̄) E(Y_i − Ȳ)
          = (1/S_xx) Σ_{i=1}^{n} (x_i − x̄) E(Y_i) − (1/S_xx) E(Ȳ) Σ_{i=1}^{n} (x_i − x̄)
          = (1/S_xx) Σ_{i=1}^{n} (x_i − x̄) E(Y_i)            [since Σ (x_i − x̄) = 0]
          = (1/S_xx) Σ_{i=1}^{n} (x_i − x̄)(α + β x_i)
          = α (1/S_xx) Σ_{i=1}^{n} (x_i − x̄) + β (1/S_xx) Σ_{i=1}^{n} (x_i − x̄) x_i
          = β (1/S_xx) Σ_{i=1}^{n} (x_i − x̄) x_i
          = β (1/S_xx) [ Σ_{i=1}^{n} (x_i − x̄) x_i − x̄ Σ_{i=1}^{n} (x_i − x̄) ]
          = β (1/S_xx) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)
          = β S_xx / S_xx
          = β.

Thus the estimator β̂ is an unbiased estimator of the parameter β.

Next, we show that α̂ is also an unbiased estimator of α. Consider

    E(α̂) = E(Ȳ − β̂ x̄) = E(Ȳ) − x̄ E(β̂)
          = (1/n) Σ_{i=1}^{n} E(Y_i) − x̄ β
          = (1/n) Σ_{i=1}^{n} (α + β x_i) − x̄ β
          = α + β x̄ − β x̄ = α.

This proves that α̂ is an unbiased estimator of α, and the proof of the theorem is now complete.
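Unbiasedness can also be seen by simulation: averaging β̂ over many samples drawn from a model with known β should come close to β. A sketch under assumptions (1)-(3), with arbitrary parameter values:

```python
import numpy as np

# Simulation check that E(beta_hat) = beta; the model parameters are arbitrary.
rng = np.random.default_rng(0)
alpha, beta, sigma = 2.0, 3.0, 1.0
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
s_xx = np.sum((x - x.mean()) ** 2)

beta_hats = np.empty(20000)
for k in range(beta_hats.size):
    y = alpha + beta * x + rng.normal(0.0, sigma, size=x.size)
    beta_hats[k] = np.sum((x - x.mean()) * (y - y.mean())) / s_xx

# The sample average of beta_hat should be close to beta = 3.
print(np.mean(beta_hats))
```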
19.2. The Normal Regression Analysis

In a regression analysis, we assume that the x_i's are constants while the y_i's are values of the random variables Y_i. A regression analysis is called a normal regression analysis if the conditional density of Y_i given X_i = x_i is of the form

    f(y_i | x_i) = (1/(σ √(2π))) exp( −(y_i − α − β x_i)² / (2σ²) ),

where σ² denotes the variance, and α and β are the regression coefficients. That is, Y|x_i ~ N(α + β x_i, σ²). If there is no danger of confusion, then we will write Y_i for Y|x_i. The figure below shows the regression model of Y with equal variances, and with means falling on the straight line µ_y = α + βx.

Normal regression analysis concerns the estimation of σ, α, and β. We use the maximum likelihood method to estimate these parameters. The likelihood function of the sample is given by

    L(σ, α, β) = Π_{i=1}^{n} f(y_i | x_i),
so that

    ln L(σ, α, β) = −n ln σ − (n/2) ln(2π) − (1/(2σ²)) Σ_{i=1}^{n} (y_i − α − β x_i)².

[Figure: the normal regression model; for x_1, x_2, x_3 the means µ_1, µ_2, µ_3 fall on the line y = α + βx, each with standard deviation σ]
Taking the partial derivatives of ln L(σ, α, β) with respect to α, β, and σ respectively, equating each of these partial derivatives to zero, and solving the system of three equations, we obtain the maximum likelihood estimators of β, α, and σ as

    β̂ = S_xY / S_xx,   α̂ = Ȳ − β̂ x̄,   and   σ̂ = √( (1/n) [ S_YY − (S_xY/S_xx) S_xY ] ),

where S_xY = Σ_{i=1}^{n} (x_i − x̄)(Y_i − Ȳ) and S_YY = Σ_{i=1}^{n} (Y_i − Ȳ)².
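The three maximum likelihood estimators translate into a short function. Note that σ̂² divides by n, not n − 2, so it is a biased estimator of σ² (Theorem 19.4 corrects this). A minimal sketch:

```python
import numpy as np

def normal_regression_mle(x, y):
    """Maximum likelihood estimates (beta, alpha, sigma) in the model
    Y|x ~ N(alpha + beta*x, sigma^2). sigma_hat divides by n, not n - 2,
    so it is a biased estimator of sigma."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    s_xy = np.sum((x - x.mean()) * (y - y.mean()))
    s_xx = np.sum((x - x.mean()) ** 2)
    s_yy = np.sum((y - y.mean()) ** 2)
    beta = s_xy / s_xx
    alpha = y.mean() - beta * x.mean()
    sigma = np.sqrt((s_yy - (s_xy / s_xx) * s_xy) / n)
    return beta, alpha, sigma

# Points lying exactly on y = 1 + 2x are recovered with sigma_hat = 0.
b, a, s = normal_regression_mle([1, 2, 3], [3, 5, 7])
print(b, a, s)  # 2.0 1.0 0.0
```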
Theorem 19.2. In the normal regression analysis, the likelihood estimators β̂ and α̂ are unbiased estimators of β and α, respectively.

Proof: Recall that

    β̂ = S_xY / S_xx = (1/S_xx) Σ_{i=1}^{n} (x_i − x̄)(Y_i − Ȳ),

where S_xx = Σ_{i=1}^{n} (x_i − x̄)². Thus β̂ is a linear combination of the Y_i's. Since Y_i ~ N(α + β x_i, σ²), we see that β̂ is also a normal random variable.

First we show β̂ is an unbiased estimator of β. Since

    E(β̂) = (1/S_xx) Σ_{i=1}^{n} (x_i − x̄) E(Y_i) = (1/S_xx) Σ_{i=1}^{n} (x_i − x̄)(α + β x_i) = β,

the maximum likelihood estimator of β is unbiased.

Next, we show that α̂ is also an unbiased estimator of α. Consider

    E(α̂) = E(Ȳ − β̂ x̄) = E(Ȳ) − x̄ E(β̂) = α + β x̄ − β x̄ = α.

This proves that α̂ is an unbiased estimator of α and the proof of the theorem is now complete.
Theorem 19.3. In normal regression analysis, the distributions of the estimators β̂ and α̂ are given by

    β̂ ~ N( β, σ²/S_xx )   and   α̂ ~ N( α, σ²/n + σ² x̄²/S_xx ).

Proof: Since

    β̂ = (1/S_xx) Σ_{i=1}^{n} (x_i − x̄)(Y_i − Ȳ),

β̂ is a linear combination of the Y_i's. As Y_i ~ N(α + β x_i, σ²), we see that β̂ is also a normal random variable. By Theorem 19.2, β̂ is an unbiased estimator of β.

The variance of β̂ is given by

    Var(β̂) = (1/S_xx²) Σ_{i=1}^{n} (x_i − x̄)² Var(Y_i) = σ² S_xx / S_xx² = σ² / S_xx.

Hence β̂ is a normal random variable with mean (or expected value) β and variance σ²/S_xx. That is, β̂ ~ N(β, σ²/S_xx).

Now we determine the distribution of α̂. Since each Y_i ~ N(α + β x_i, σ²), the distribution of Ȳ is given by

    Ȳ ~ N( α + β x̄, σ²/n ),

and the distribution of x̄ β̂ is given by

    x̄ β̂ ~ N( β x̄, σ² x̄²/S_xx ).

Since α̂ = Ȳ − x̄ β̂, and Ȳ and x̄ β̂ are two normal random variables, α̂ is also a normal random variable with mean equal to α + β x̄ − β x̄ = α and variance equal to σ²/n + σ² x̄²/S_xx. That is,

    α̂ ~ N( α, σ² (1/n + x̄²/S_xx) ),

and the proof of the theorem is now complete.
It should be noted that in the proof of the last theorem, we have assumed the fact that Ȳ and x̄ β̂ are statistically independent.

In the next theorem, we give an unbiased estimator of the variance σ². For this we need the distribution of the statistic U given by

    U = n σ̂² / σ².

It can be shown (we will omit the proof; for a proof see Graybill (1961)) that the distribution of this statistic is

    U = n σ̂² / σ² ~ χ²(n − 2).
Theorem 19.4. An unbiased estimator S² of σ² is given by

    S² = n σ̂² / (n − 2),

where σ̂ is the maximum likelihood estimator of σ.

Proof: Since

    E(S²) = E( n σ̂² / (n − 2) )
          = (σ²/(n − 2)) E( n σ̂²/σ² )
          = (σ²/(n − 2)) E( χ²(n − 2) )
          = (σ²/(n − 2)) (n − 2) = σ².

The proof of the theorem is now complete.

Note that the estimator S² can be written as S² = SSE/(n − 2), where SSE = Σ_{i=1}^{n} (y_i − α̂ − β̂ x_i)² denotes the error sum of squares; thus S² is an unbiased estimator of σ².
In the next theorem we give the distribution of two statistics that can be used for testing hypotheses and constructing confidence intervals for the regression parameters α and β.

Theorem 19.5. The statistics

    Q_β = ( (β̂ − β)/σ̂ ) √( (n − 2) S_xx / n )   and   Q_α = (α̂ − α) / √( S² (1/n + x̄²/S_xx) )

both have a t-distribution with n − 2 degrees of freedom.

Proof: From Theorem 19.3, we know that β̂ ~ N(β, σ²/S_xx). Hence by standardizing, we get

    Z = (β̂ − β) / √( σ²/S_xx ) ~ N(0, 1).

Further, we know that the likelihood estimator of σ is

    σ̂ = √( (1/n) [ S_YY − (S_xY/S_xx) S_xY ] ),

and the distribution of the statistic U = n σ̂²/σ² is chi-square with n − 2 degrees of freedom.
Since Z = (β̂ − β)/√(σ²/S_xx) ~ N(0, 1) and U = n σ̂²/σ² ~ χ²(n − 2), by Theorem 14.6 the statistic Z / √( U/(n − 2) ) ~ t(n − 2). Hence

    Q_β = Z / √( U/(n − 2) ) = ( (β̂ − β)/σ̂ ) √( (n − 2) S_xx / n ) ~ t(n − 2).

Similarly, it can be shown that Q_α ~ t(n − 2). This completes the proof of the theorem.
In the normal regression model, if β = 0, then E(Y_x) = α. This implies that E(Y_x) does not depend on x. Therefore if β ≠ 0, then E(Y_x) depends on x. Thus the null hypothesis H_o: β = 0 should be tested against H_a: β ≠ 0. To devise a test we need the distribution of β̂. Theorem 19.3 says that β̂ is normally distributed with mean β and variance σ²/S_xx. Therefore, we have

    Z = (β̂ − β) / √( σ²/S_xx ) ~ N(0, 1).

In practice the variance Var(Y_i | x_i), which is σ², is usually unknown. Hence the above statistic Z is not very useful. However, using the statistic Q_β, we can devise a hypothesis test of H_o: β = β_o against H_a: β ≠ β_o at significance level γ. For this one has to evaluate the quantity

    t = ( (β̂ − β_o)/σ̂ ) √( (n − 2) S_xx / n )

and compare it to the quantile t_{γ/2}(n − 2). The hypothesis test, at significance level γ, is then "Reject H_o: β = β_o if |t| > t_{γ/2}(n − 2)".
The statistic

    Q_β = ( (β̂ − β)/σ̂ ) √( (n − 2) S_xx / n )

is a pivotal quantity for the parameter β, since the distribution of this quantity Q_β is a t-distribution with n − 2 degrees of freedom. Thus it can be used for the construction of a (1 − γ)100% confidence interval for the parameter β as follows:

    1 − γ = P( β̂ − t_{γ/2}(n − 2) σ̂ √( n/((n − 2) S_xx) ) ≤ β ≤ β̂ + t_{γ/2}(n − 2) σ̂ √( n/((n − 2) S_xx) ) ).

Hence, the (1 − γ)100% confidence interval for β is given by

    [ β̂ − t_{γ/2}(n − 2) σ̂ √( n/((n − 2) S_xx) ),  β̂ + t_{γ/2}(n − 2) σ̂ √( n/((n − 2) S_xx) ) ].
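This confidence interval is a one-line computation once the summary statistics are in hand. A sketch, using the numbers that appear in Example 19.9 later in this chapter (the t quantile is read from a t-table rather than computed):

```python
import numpy as np

def beta_confidence_interval(beta_hat, sigma_hat, n, s_xx, t_quantile):
    """(1 - gamma)100% confidence interval for beta based on
    Q_beta = ((beta_hat - beta)/sigma_hat)*sqrt((n-2)*S_xx/n) ~ t(n - 2);
    t_quantile is t_{gamma/2}(n - 2), read from a t-table."""
    half = t_quantile * sigma_hat * np.sqrt(n / ((n - 2) * s_xx))
    return beta_hat - half, beta_hat + half

# Values from Example 19.9: n = 10, S_xx = 376, t_{0.025}(8) = 2.306.
lo, hi = beta_confidence_interval(4.067, 3.047, 10, 376.0, 2.306)
print(round(lo, 4), round(hi, 4))  # 3.6619 4.4721
```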
In a similar manner one can devise a hypothesis test for α and construct a confidence interval for α using the statistic Q_α. We leave these to the reader.

Now we give two examples to illustrate how to find the normal regression line and related quantities.
Example 19.7. The following data on the number of hours, x, which ten persons studied for a French test, and their scores, y, on the test are shown below. Find the normal regression line that approximates the regression of test scores on the number of hours studied. Further, test the hypothesis H_o: β = 3 versus H_a: β ≠ 3 at the significance level 0.02.

Answer: From the above data, we have

    Σ_{i=1}^{10} x_i = 100,   Σ_{i=1}^{10} x_i² = 1376,   Σ_{i=1}^{10} y_i = 564,   Σ_{i=1}^{10} x_i y_i = 6945,

and hence

    S_xx = 376,   S_xy = 1305,   S_yy = 4752.4.

Therefore

    β̂ = S_xy/S_xx = 3.471   and   α̂ = ȳ − β̂ x̄ = 21.690.

Thus the normal regression line is

    y = 21.690 + 3.471x.

[Figure: regression line y = 21.690 + 3.471x]
Now we test the hypothesis H_o: β = 3 against H_a: β ≠ 3 at the 0.02 level of significance. From the data, the maximum likelihood estimate of σ is

    σ̂ = √( (1/n) [ S_yy − (S_xy/S_xx) S_xy ] )
       = √( (1/n) [ S_yy − β̂ S_xy ] )
       = √( (1/10) [ 4752.4 − (3.471)(1305) ] )
       = √22.274 = 4.720,

and hence

    t = ( (β̂ − β_o)/σ̂ ) √( (n − 2) S_xx / n ) = ( (3.471 − 3)/4.720 ) √( (8)(376)/10 ) = 1.73.

Since

    1.73 = |t| < t_{0.01}(8) = 2.896,

we do not reject the null hypothesis H_o: β = 3 at the significance level 0.02. This means that we cannot conclude that, on average, an extra hour of study will increase the score by more than 3 points.
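The whole computation of Example 19.7 can be reproduced from the summary sums alone. A sketch:

```python
import numpy as np

# Summary statistics from Example 19.7 (ten students, hours x, scores y).
n = 10
s_xx, s_xy, s_yy = 376.0, 1305.0, 4752.4
xbar, ybar = 100.0 / n, 564.0 / n

beta_hat = s_xy / s_xx                    # about 3.471
alpha_hat = ybar - beta_hat * xbar        # about 21.69
sigma_hat = np.sqrt((s_yy - beta_hat * s_xy) / n)

# Test H0: beta = 3 against Ha: beta != 3; compare |t| with t_{0.01}(8) = 2.896.
t_stat = ((beta_hat - 3.0) / sigma_hat) * np.sqrt((n - 2) * s_xx / n)
print(round(beta_hat, 3), round(alpha_hat, 3), round(t_stat, 2))  # 3.471 21.693 1.73
```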
Example 19.8. The frequency of chirping of a cricket is thought to be related to temperature. This suggests the possibility that temperature can be estimated from the chirp frequency. The following data on the number of chirps per second, x, of the striped ground cricket and the temperature, y, in Fahrenheit are shown below. Find the normal regression line that approximates the regression of temperature on the number of chirps per second by the striped ground cricket. Further, test the hypothesis H_o: β = 4 versus H_a: β ≠ 4 at the significance level 0.1.

Answer: From the above data, we have

    Σ_{i=1}^{10} x_i = 170,   Σ_{i=1}^{10} x_i² = 2920,   Σ_{i=1}^{10} y_i² = 64270,   Σ_{i=1}^{10} x_i y_i = 13688,   S_xx = 376,

and hence

    β̂ = S_xy/S_xx = 4.067   and   α̂ = ȳ − β̂ x̄ = 9.761.

Thus the normal regression line is

    y = 9.761 + 4.067x.

[Figure: regression line y = 9.761 + 4.067x]
Now we test the hypothesis H_o: β = 4 against H_a: β ≠ 4 at the 0.1 level of significance. From the data, the maximum likelihood estimate of σ is computed as in Example 19.7, and for the test statistic we obtain

    0.528 = |t| < t_{0.05}(8) = 1.860.

Thus we do not reject the null hypothesis H_o: β = 4 at significance level 0.1.
Let µ_x = α + βx and write Ŷ_x = α̂ + β̂x for an arbitrary but fixed x. Then Ŷ_x is an estimator of µ_x. The following theorem gives various properties of this estimator.

Theorem 19.6. Let x be an arbitrary but fixed real number. Then
(i) Ŷ_x is a linear estimator of Y_1, Y_2, ..., Y_n,
(ii) Ŷ_x is an unbiased estimator of µ_x, and
(iii) Var(Ŷ_x) = σ² [ 1/n + (x − x̄)²/S_xx ].

Proof: First we show that Ŷ_x is a linear estimator of Y_1, Y_2, ..., Y_n. Since

    Ŷ_x = α̂ + β̂x = (Ȳ − β̂x̄) + β̂x = Ȳ + β̂(x − x̄)
        = (1/n) Σ_{i=1}^{n} Y_i + ((x − x̄)/S_xx) Σ_{i=1}^{n} (x_i − x̄) Y_i
        = Σ_{i=1}^{n} [ 1/n + (x − x̄)(x_i − x̄)/S_xx ] Y_i

(using Σ (x_i − x̄)(Y_i − Ȳ) = Σ (x_i − x̄) Y_i), Ŷ_x is a linear estimator of Y_1, Y_2, ..., Y_n.

Next, we show that Ŷ_x is an unbiased estimator of µ_x. Since

    E(Ŷ_x) = E(α̂ + β̂x) = E(α̂) + E(β̂) x = α + βx = µ_x,

Ŷ_x is an unbiased estimator of µ_x.

Finally, we calculate the variance of Ŷ_x using Theorem 19.3. The variance of Ŷ_x is given by

    Var(Ŷ_x) = Var(α̂ + β̂x)
             = Var(α̂) + x² Var(β̂) + 2x Cov(α̂, β̂)
             = σ² (1/n + x̄²/S_xx) + x² σ²/S_xx + 2x (−x̄ σ²/S_xx)
             = σ² [ 1/n + (x̄² − 2x x̄ + x²)/S_xx ]
             = σ² [ 1/n + (x − x̄)²/S_xx ].

In this computation we have used the fact that

    Cov(α̂, β̂) = −x̄ σ²/S_xx,

whose proof is left to the reader as an exercise. The proof of the theorem is now complete.
By Theorem 19.3, we see that α̂ and β̂ are normal random variables. Since Ŷ_x = α̂ + β̂x, the random variable Ŷ_x is also a normal random variable with mean µ_x and variance

    Var(Ŷ_x) = σ² [ 1/n + (x − x̄)²/S_xx ].

Hence, standardizing Ŷ_x, we have

    (Ŷ_x − µ_x) / √( Var(Ŷ_x) ) ~ N(0, 1).

If σ² is known, then one can take the statistic Q = (Ŷ_x − µ_x)/√(Var(Ŷ_x)) as a pivotal quantity to construct a confidence interval for µ_x. The (1 − γ)100% confidence interval for µ_x when σ² is known is given by

    [ Ŷ_x − z_{γ/2} √( Var(Ŷ_x) ),  Ŷ_x + z_{γ/2} √( Var(Ŷ_x) ) ].
Example 19.9. Consider again the data of Example 19.8 on the number of chirps per second, x, of the striped ground cricket and the temperature, y, in Fahrenheit. What is the 95% confidence interval for β? What is the 95% confidence interval for µ_x when x = 14 and σ̂ = 3.047?

Answer: From Example 19.8, we have

    n = 10,   β̂ = 4.067,   σ̂ = 3.047   and   S_xx = 376.

The (1 − γ)100% confidence interval for β is given by

    [ β̂ − t_{γ/2}(n − 2) σ̂ √( n/((n − 2) S_xx) ),  β̂ + t_{γ/2}(n − 2) σ̂ √( n/((n − 2) S_xx) ) ].

Therefore the 95% confidence interval for β is

    [ 4.067 − t_{0.025}(8)(0.1755),  4.067 + t_{0.025}(8)(0.1755) ].

Since from the t-table we have t_{0.025}(8) = 2.306, the 95% confidence interval for β becomes

    [ 4.067 − (2.306)(0.1755),  4.067 + (2.306)(0.1755) ],

which is [3.6623, 4.4717].
If the variance σ² is not known, then we can use the fact that the statistic U = n σ̂²/σ² is chi-square with n − 2 degrees of freedom to obtain a pivotal quantity for µ_x. This can be done as follows:

    Q = [ (Ŷ_x − µ_x) / √( σ² (1/n + (x − x̄)²/S_xx) ) ] / √( (n σ̂²/σ²)/(n − 2) )
      = (Ŷ_x − µ_x) / ( σ̂ √( (n/(n − 2)) (1/n + (x − x̄)²/S_xx) ) ) ~ t(n − 2).

Using this pivotal quantity one can construct a (1 − γ)100% confidence interval for the mean µ_x as

    [ Ŷ_x − t_{γ/2}(n − 2) σ̂ √( (n/(n − 2)) (1/n + (x − x̄)²/S_xx) ),  Ŷ_x + t_{γ/2}(n − 2) σ̂ √( (n/(n − 2)) (1/n + (x − x̄)²/S_xx) ) ].
Next we determine the 95% confidence interval for µ_x when x = 14 and σ = 3.047. The (1 − γ)100% confidence interval for µ_x when σ² is known is given by

    [ Ŷ_x − z_{γ/2} √( Var(Ŷ_x) ),  Ŷ_x + z_{γ/2} √( Var(Ŷ_x) ) ].

From the data, we have

    Ŷ_x = α̂ + β̂x = 9.761 + (4.067)(14) = 66.699

and

    Var(Ŷ_x) = σ² [ 1/n + (x − x̄)²/S_xx ] = (0.124)(3.047)² = 1.1512.

The 95% confidence interval for µ_x is given by

    [ 66.699 − z_{0.025} √1.1512,  66.699 + z_{0.025} √1.1512 ],

and since z_{0.025} = 1.96 (from the normal table), it becomes

    [ 66.699 − (1.96)(1.073),  66.699 + (1.96)(1.073) ],

which is [64.596, 68.802].
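A sketch of this computation (the value x̄ = 17 comes from Σx = 170 in Example 19.8; small differences in the last digit come from rounding the intermediate variance):

```python
import numpy as np

# Values from Example 19.9: x = 14, n = 10, xbar = 17, S_xx = 376,
# sigma = 3.047, and the fitted line y = 9.761 + 4.067 x.
n, xbar, s_xx, sigma = 10, 17.0, 376.0, 3.047
alpha_hat, beta_hat, x = 9.761, 4.067, 14.0

y_hat = alpha_hat + beta_hat * x
var_y_hat = sigma**2 * (1.0/n + (x - xbar)**2 / s_xx)
z = 1.96  # z_{0.025} from the normal table
lo, hi = y_hat - z * np.sqrt(var_y_hat), y_hat + z * np.sqrt(var_y_hat)
print(round(y_hat, 3), round(lo, 3), round(hi, 3))
```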
We now consider the predictions made by the normal regression equation Ŷ_x = α̂ + β̂x. The quantity Ŷ_x gives an estimate of µ_x = α + βx. Each time we compute a regression line from a random sample, we are observing one possible linear equation in a population consisting of all possible linear equations. Further, the actual value of Y_x that will be observed for a given value of x is normal with mean α + βx and variance σ². So the actual observed value will differ from µ_x. Thus, the predicted value Ŷ_x will be in error from two different sources, namely (1) α̂ and β̂ are randomly distributed about α and β, and (2) Y_x is randomly distributed about µ_x.

Let y_x denote the actual value of Y_x that will be observed for the value x, and consider the random variable

    D = Y_x − α̂ − β̂x.

Since D is a linear combination of normal random variables, D is also a normal random variable.

The mean of D is given by

    E(D) = E(Y_x) − E(α̂) − x E(β̂) = α + βx − α − βx = 0.
The variance of D is given by

    Var(D) = Var(Y_x − α̂ − β̂x)
           = Var(Y_x) + Var(α̂) + x² Var(β̂) + 2x Cov(α̂, β̂)
           = σ² + σ² (1/n + x̄²/S_xx) + x² σ²/S_xx − 2x x̄ σ²/S_xx
           = σ² [ 1 + 1/n + (x − x̄)²/S_xx ]
           = σ² [ (n+1) S_xx + n (x − x̄)² ] / (n S_xx).

We standardize D to get

    Z = (D − 0) / √( σ² [ (n+1) S_xx + n (x − x̄)² ] / (n S_xx) ) ~ N(0, 1).

Since in practice the variance of Y_x, which is σ², is unknown, we cannot use Z to construct a confidence interval for a predicted value y_x.
We know that U = n σ̂²/σ² ~ χ²(n − 2). By Theorem 14.6, the statistic Z / √( U/(n − 2) ) ~ t(n − 2). Hence

    Q = (y_x − α̂ − β̂x) √( (n − 2) S_xx ) / ( σ̂ √( (n+1) S_xx + n (x − x̄)² ) ) ~ t(n − 2).

The statistic Q is a pivotal quantity for the predicted value y_x, and one can use it to construct a (1 − γ)100% confidence interval for y_x. The (1 − γ)100% confidence interval, [a, b], for y_x is given by

    a = α̂ + β̂x − t_{γ/2}(n − 2) σ̂ √( ( (n+1) S_xx + n (x − x̄)² ) / ((n − 2) S_xx) ),
    b = α̂ + β̂x + t_{γ/2}(n − 2) σ̂ √( ( (n+1) S_xx + n (x − x̄)² ) / ((n − 2) S_xx) ).

This confidence interval for y_x is usually known as the prediction interval for the predicted value y_x based on the given x. The prediction interval represents an interval that has probability 1 − γ of containing not a parameter but a future value y_x of the random variable Y_x. In many instances the prediction interval is more relevant to a scientist or engineer than the confidence interval on the mean µ_x.
Example 19.10. Consider again the data of Example 19.8 on the number of chirps per second, x, of the striped ground cricket and the temperature, y, in Fahrenheit. What is the 95% prediction interval for y_x when x = 14?

Answer: From Example 19.8, we have

    n = 10,   β̂ = 4.067,   α̂ = 9.761,   σ̂ = 3.047   and   S_xx = 376.

Thus the normal regression line is

    y_x = 9.761 + 4.067x.

Since x = 14, the corresponding predicted value of y_x is given by

    ŷ_x = 9.761 + (4.067)(14) = 66.699.

The endpoints of the prediction interval are

    a = α̂ + β̂x − t_{0.025}(8) σ̂ √( ( (n+1) S_xx + n (x − x̄)² ) / ((n − 2) S_xx) ),
    b = α̂ + β̂x + t_{0.025}(8) σ̂ √( ( (n+1) S_xx + n (x − x̄)² ) / ((n − 2) S_xx) ).

Hence the 95% prediction interval for y_x when x = 14 is [58.4501, 74.9479].
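A sketch of the prediction interval computation, using the summary values above (with x̄ = 17 from Example 19.8). The result differs slightly from the rounded interval quoted in the example, because the intermediate values used in the text are not fully reproduced here.

```python
import numpy as np

def prediction_interval(x, alpha_hat, beta_hat, sigma_hat, n, xbar, s_xx, t_quantile):
    """Prediction interval for a future observation y_x at the value x, based on
    Q = (y_x - alpha_hat - beta_hat*x) * sqrt((n-2)*S_xx) /
        (sigma_hat * sqrt((n+1)*S_xx + n*(x - xbar)^2)) ~ t(n - 2)."""
    y_hat = alpha_hat + beta_hat * x
    half = t_quantile * sigma_hat * np.sqrt(
        ((n + 1) * s_xx + n * (x - xbar) ** 2) / ((n - 2) * s_xx))
    return y_hat - half, y_hat + half

# Values from Example 19.10 (x = 14, t_{0.025}(8) = 2.306, xbar = 17).
lo, hi = prediction_interval(14.0, 9.761, 4.067, 3.047, 10, 17.0, 376.0, 2.306)
print(round(lo, 2), round(hi, 2))  # 58.37 75.03
```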
19.3. The Correlation Analysis

In the first two sections of this chapter, we examined the regression problem with an in-depth study of the least squares method and the normal regression analysis. In the regression analysis, we assumed that the values of X are not random variables, but are fixed. However, the values of Y_x for a given value of x are randomly distributed about E(Y_x) = µ_x = α + βx. Further, letting ε be a random variable with E(ε) = 0 and Var(ε) = σ², one can model the so-called regression problem by

    Y_x = α + βx + ε.

In this section, we examine the correlation problem. Unlike the regression problem, here both X and Y are random variables, and the correlation problem can be modeled by

    E(Y) = α + β E(X).
From an experimental point of view, this means that we are observing a random vector (X, Y) drawn from some bivariate population.

Recall that if (X, Y) is a bivariate random variable, then the correlation coefficient ρ is defined as

    ρ = E( (X − µ_X)(Y − µ_Y) ) / ( σ_X σ_Y ),

where µ_X and µ_Y are the means of the random variables X and Y, respectively.

Definition 19.1. If (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) is a random sample from a bivariate population, then the sample correlation coefficient is defined as

    R = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / √( Σ_{i=1}^{n} (X_i − X̄)² Σ_{i=1}^{n} (Y_i − Ȳ)² ).

The corresponding quantity computed from the data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) will be denoted by r, and it is an estimate of the correlation coefficient ρ.
Now we give a geometrical interpretation of the sample correlation coefficient based on a paired data set {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}. We can associate this data set with two vectors ~x = (x_1, x_2, ..., x_n) and ~y = (y_1, y_2, ..., y_n) in ℝⁿ. Let L be the subset {λ~e | λ ∈ ℝ} of ℝⁿ, where ~e = (1, 1, ..., 1) ∈ ℝⁿ. Consider the linear space V given by ℝⁿ modulo L, that is, V = ℝⁿ/L. The linear space V is illustrated in the figure below for n = 2.

[Figure: illustration of the linear space V for n = 2, showing a vector ~x and its equivalence class [~x]]

We denote the equivalence class associated with the vector ~x by [~x]. In the linear space V it can be shown that the points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) are collinear if and only if the vectors [~x] and [~y] in V are proportional. We define an inner product on this linear space V by

    ⟨[~x], [~y]⟩ = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ).

Then the angle θ between the vectors [~x] and [~y] is given by

    cos(θ) = ⟨[~x], [~y]⟩ / ( √⟨[~x], [~x]⟩ √⟨[~y], [~y]⟩ ),

which is

    cos(θ) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² Σ_{i=1}^{n} (y_i − ȳ)² ) = r.

Thus the sample correlation coefficient r can be interpreted geometrically as the cosine of the angle between the vectors [~x] and [~y]. From this viewpoint, the following theorem is obvious.
Probability and Mathematical Statistics
Theorem 19.7. The sample correlation coefficient r satisfies the inequality

    −1 ≤ r ≤ 1.

The sample correlation coefficient r = ±1 if and only if the set of points {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, for n ≥ 3, is collinear.
To do some statistical analysis, we assume that the paired data are a random sample of size n from a bivariate normal population (X, Y) ~ BVN(µ_1, µ_2, σ_1², σ_2², ρ). Then the conditional distribution of the random variable Y given X = x is normal; that is,

    Y|x ~ N( µ_2 + ρ (σ_2/σ_1)(x − µ_1), σ_2² (1 − ρ²) ).

This can be viewed as a normal regression model E(Y|x) = α + βx, where

    α = µ_2 − ρ (σ_2/σ_1) µ_1,   β = ρ (σ_2/σ_1),   and   Var(Y|x) = σ_2² (1 − ρ²).

Since β = ρ (σ_2/σ_1), if ρ = 0 then β = 0. Hence the null hypothesis H_o: ρ = 0 is equivalent to H_o: β = 0. In the previous section, we devised a hypothesis test for testing H_o: β = β_o against H_a: β ≠ β_o. This hypothesis test, at significance level γ, is "Reject H_o: β = β_o if |t| ≥ t_{γ/2}(n − 2)", where

    t = ( (β̂ − β_o)/σ̂ ) √( (n − 2) S_xx / n ).
If β_o = 0, then we have

    t = ( β̂/σ̂ ) √( (n − 2) S_xx / n ).

Now we express t in terms of the sample correlation coefficient r. Recall that

    β̂ = S_xy / S_xx,     (11)

    σ̂² = (1/n) [ S_yy − (S_xy/S_xx) S_xy ],     (12)

and

    r = S_xy / √( S_xx S_yy ).     (13)
Now using (11), (12), and (13), we compute

    t = ( β̂/σ̂ ) √( (n − 2) S_xx / n )
      = (S_xy/S_xx) √( (n − 2) S_xx / ( S_yy − S_xy²/S_xx ) )
      = √(n − 2) S_xy / √( S_xx S_yy − S_xy² )
      = √(n − 2) r / √(1 − r²).

Hence to test the null hypothesis H_o: ρ = 0 against H_a: ρ ≠ 0 at significance level γ, the test is "Reject H_o: ρ = 0 if |t| ≥ t_{γ/2}(n − 2)", where

    t = r √(n − 2) / √(1 − r²).

This test does not extend to testing values of ρ other than ρ = 0.
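The statistic t = r √(n − 2)/√(1 − r²) is straightforward to compute from data. A minimal sketch (the data here are illustrative):

```python
import numpy as np

def corr_t_statistic(x, y):
    """Sample correlation coefficient r and the statistic
    t = r*sqrt(n - 2)/sqrt(1 - r^2), which is t(n - 2) under H0: rho = 0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    s_xy = np.sum((x - x.mean()) * (y - y.mean()))
    s_xx = np.sum((x - x.mean()) ** 2)
    s_yy = np.sum((y - y.mean()) ** 2)
    r = s_xy / np.sqrt(s_xx * s_yy)
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r * r)
    return r, t

# Illustrative data (n = 4); compare |t| with t_{gamma/2}(2) from a t-table.
r, t = corr_t_statistic([1, 2, 3, 4], [1, 3, 2, 5])
print(round(r, 3), round(t, 3))  # 0.832 2.117
```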
However, tests for nonzero values of ρ can be obtained from the following result.
Theorem 19.8. Let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be a random sample from a bivariate normal population (X, Y) ~ BVN(µ_1, µ_2, σ_1², σ_2², ρ). If

    V = (1/2) ln( (1 + R)/(1 − R) )   and   m = (1/2) ln( (1 + ρ)/(1 − ρ) ),

then

    √(n − 3) (V − m) → N(0, 1)   as n → ∞.

This theorem says that the statistic V is approximately normal with mean m and variance 1/(n − 3) when n is large. This statistic can be used to devise a hypothesis test for nonzero values of ρ. To test the null hypothesis H_o: ρ = ρ_o against H_a: ρ ≠ ρ_o at significance level γ, the test is "Reject H_o: ρ = ρ_o if |z| ≥ z_{γ/2}", where

    z = √(n − 3) (V − m_o)   and   m_o = (1/2) ln( (1 + ρ_o)/(1 − ρ_o) ).
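The transformation above is easy to code. A sketch with made-up values of r, ρ_o, and n (the z quantile would be compared against a normal table):

```python
import numpy as np

def fisher_z_test(r, rho0, n):
    """Approximate z statistic for H0: rho = rho0, using the transformation
    V = (1/2) ln((1+r)/(1-r)) with approximate variance 1/(n - 3)."""
    v = 0.5 * np.log((1 + r) / (1 - r))
    m0 = 0.5 * np.log((1 + rho0) / (1 - rho0))
    return np.sqrt(n - 3) * (v - m0)

# Example: with r = 0.6 and n = 28, test H0: rho = 0.3.
z = fisher_z_test(0.6, 0.3, 28)
print(round(z, 3))  # 1.918
```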
Example 19.11. The following data were obtained in a study of the relationship between the weight (x, in kg) and chest size (y, in cm) of infants at birth. Determine the sample correlation coefficient r and then test the null hypothesis H_o: ρ = 0 against the alternative hypothesis H_a: ρ ≠ 0 at a significance level of 0.01.

Answer: From the above data we compute S_xx, S_yy, and S_xy using a tabular representation, obtaining

    S_xy = 18.557,   S_xx = 8.565,   S_yy = 65.788.

Hence, the correlation coefficient r is given by

    r = S_xy / √( S_xx S_yy ) = 18.557 / √( (8.565)(65.788) ) = 0.782.

The computed t value is given by

    t = r √(n − 2) / √(1 − r²) = 0.782 √(6 − 2) / √(1 − (0.782)²) = 2.509.

From the t-table we have t_{0.005}(4) = 4.604. Since

    2.509 = |t| < t_{0.005}(4) = 4.604,

we do not reject the null hypothesis H_o: ρ = 0.
19.4. Review Exercises
1. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ N(β x_i, σ²), where both β and σ² are unknown parameters. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then find the maximum likelihood estimators β̂ and σ̂² of β and σ².

2. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ N(β x_i, σ²), where both β and σ² are unknown parameters. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then show that the maximum likelihood estimator β̂ is normally distributed. What are the mean and variance of β̂?

3. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ N(β x_i, σ²), where both β and σ² are unknown parameters. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then find an unbiased estimator σ̂² of σ², and then find a constant k such that k σ̂² ~ χ²(2n).

4. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ N(β x_i, σ²), where both β and σ² are unknown parameters. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then find a pivotal quantity for β, and using this pivotal quantity construct a (1 − γ)100% confidence interval for β.

5. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ N(β x_i, σ²), where both β and σ² are unknown parameters. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then find a pivotal quantity for σ², and using this pivotal quantity construct a (1 − γ)100% confidence interval for σ².

6. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ EXP(β x_i), where β is an unknown parameter. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then find the maximum likelihood estimator β̂ of β.

7. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ EXP(β x_i), where β is an unknown parameter. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then find the least squares estimator β̂ of β.

8. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ POI(λ x_i), where λ is an unknown parameter. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then find the maximum likelihood estimator λ̂ of λ.

9. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ POI(λ x_i), where λ is an unknown parameter. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then find the least squares estimator λ̂ of λ.

10. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ POI(λ x_i), where λ is an unknown parameter. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, show that the least squares estimator and the maximum likelihood estimator of λ are both unbiased estimators of λ.

11. Let Y_1, Y_2, ..., Y_n be n independent random variables such that each Y_i ~ POI(λ x_i), where λ is an unknown parameter. If {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is a data set where y_1, y_2, ..., y_n are the observed values based on x_1, x_2, ..., x_n, then find the variances of both the least squares estimator and the maximum likelihood estimator of λ.

12. Given the five pairs of points (x, y) shown in the table below, what curve of the form y = a + bx + cx² best fits the data by the method of least squares?

13. Given the five pairs of points (x, y) shown in the table below, what curve of the form y = a + b x best fits the data by the method of least squares?

14. The following data were obtained from the grades of six students selected at random (Mathematics grade, x; English grade, y). Find the sample correlation coefficient r and then test the null hypothesis H_o: ρ = 0 against the alternative hypothesis H_a: ρ ≠ 0 at a significance level of 0.01.

15. Given a set of data {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, what is the least squares estimate of α if y = α is fitted to this data set?

16. Given the set of data points {(2, 3), (4, 6), (5, 7)}, what curve of the form y = α + βx² best fits the data by the method of least squares?

17. Given the data set {(1, 1), (2, 1), (2, 3), (3, 2), (4, 3)} and Y_x ~ N(α + βx, σ²), find the point estimate of σ², and then construct a 90% confidence interval for σ.

18. For the data set {(1, 1), (2, 1), (2, 3), (3, 2), (4, 3)} determine the correlation coefficient r. Test the null hypothesis H_0: ρ = 0 versus H_a: ρ ≠ 0 at a significance level of 0.01.