The Simple Linear Regression Model

12.1 The Simple Linear Regression Model

The simplest deterministic mathematical relationship between two variables x and y is a linear relationship $y = \beta_0 + \beta_1 x$. The set of pairs (x, y) for which $y = \beta_0 + \beta_1 x$ determines a straight line with slope $\beta_1$ and y-intercept $\beta_0$. The objective of this section is to develop a linear probabilistic model.

If the two variables are not deterministically related, then for a fixed value of x, there is uncertainty in the value of the second variable. For example, if we are investigating the relationship between age of child and size of vocabulary and decide to select a child of age x = 5.0 years, then before the selection is made, vocabulary size is a random variable Y. After a particular 5-year-old child has been selected and tested, a vocabulary of 2000 words may result. We would then say that the observed value of Y associated with fixing x = 5.0 was y = 2000.

More generally, the variable whose value is fixed by the experimenter will be denoted by x and will be called the independent, predictor, or explanatory variable. For fixed x, the second variable will be random; we denote this random variable and its observed value by Y and y, respectively, and refer to it as the dependent or response variable.

Usually observations will be made for a number of settings of the independent variable. Let $x_1, x_2, \ldots, x_n$ denote values of the independent variable for which observations are made, and let $Y_i$ and $y_i$, respectively, denote the random variable and observed value associated with $x_i$. The available bivariate data then consists of the n pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. A picture of this data called a scatterplot gives preliminary impressions about the nature of any relationship. In such a plot, each $(x_i, y_i)$ is represented as a point plotted on a two-dimensional coordinate system.

The slope of a line is the change in y for a 1-unit increase in x. For example, if $y = -3x + 10$, then y decreases by 3 when x increases by 1, so the slope is $-3$. The y-intercept is the height at which the line crosses the vertical axis and is obtained by setting x = 0 in the equation.


Example 12.1

Visual and musculoskeletal problems associated with the use of visual display terminals (VDTs) have become rather common in recent years. Some researchers have focused on vertical gaze direction as a source of eye strain and irritation. This direction is known to be closely related to ocular surface area (OSA), so a method of measuring OSA is needed. The accompanying representative data on y = OSA (cm²) and x = width of the palpebral fissure (i.e., the horizontal width of the eye opening, in cm) is from the article “Analysis of Ocular Surface Area for Comfortable VDT Workstation Layout” (Ergonomics, 1996: 877–884). The order in which observations were obtained was not given, so for convenience they are listed in increasing order of x values.

i      1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
x_i   .40   .42   .48   .51   .57   .60   .70   .75   .75   .78   .84   .95   .99  1.03  1.12
y_i

Thus $(x_1, y_1) = (.40, 1.02)$, $(x_5, y_5) = (.57, 1.52)$, and so on. A Minitab scatterplot is shown in Figure 12.1; we used an option that produced a dotplot of both the x values and y values individually along the right and top margins of the plot, which makes it easier to visualize the distributions of the individual variables (histograms or boxplots are alternative options). Here are some things to notice about the data and plot:

● Several observations have identical x values yet different y values (e.g., $x_8 = x_9 = .75$, but $y_8 = 1.80$ and $y_9 = 1.74$). Thus the value of y is not determined solely by x but also by various other factors.

● There is a strong tendency for y to increase as x increases. That is, larger values of OSA tend to be associated with larger values of fissure width: a positive relationship between the variables.

Figure 12.1 Scatterplot from Minitab for the data from Example 12.1, along with dotplots of x and y values


● It appears that the value of y could be predicted from x by finding a line that is reasonably close to the points in the plot (the authors of the cited article superimposed such a line on their plot). In other words, there is evidence of a substantial (though not perfect) linear relationship between the two variables.
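The first observation in the list above (tied x values) can be checked mechanically. The following is a minimal sketch in Python; the helper name `tied_indices` is hypothetical, and only the x values listed in Example 12.1 are used.

```python
from collections import defaultdict

# x values (fissure width, cm) as listed in Example 12.1, in order i = 1..15.
x = [.40, .42, .48, .51, .57, .60, .70, .75, .75, .78, .84, .95, .99, 1.03, 1.12]

def tied_indices(values):
    """Map each repeated value to the 1-based indices at which it occurs."""
    groups = defaultdict(list)
    for i, v in enumerate(values, start=1):
        groups[v].append(i)
    # Keep only the values that occur more than once.
    return {v: idx for v, idx in groups.items() if len(idx) > 1}

ties = tied_indices(x)
print(ties)  # → {0.75: [8, 9]}
```

Observations 8 and 9 share x = .75, yet the text reports different y values for them (1.80 and 1.74), which is why a purely deterministic function of x cannot describe the data.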

The horizontal and vertical axes in the scatterplot of Figure 12.1 intersect at the point (0, 0). In many data sets, the values of x or y or the values of both variables differ considerably from zero relative to the range(s) of the values. For example, a study of how air conditioner efficiency is related to maximum daily outdoor temperature might involve observations for temperatures ranging from 80°F to 100°F. When this is the case, a more informative plot would show the appropriately labeled axes intersecting at some point other than (0, 0).

Example 12.2

Arsenic is found in many ground waters and some surface waters. Recent health effects research has prompted the Environmental Protection Agency to reduce allowable arsenic levels in drinking water so that many water systems are no longer compliant with standards. This has spurred interest in the development of methods to remove arsenic. The accompanying data on x = pH and y = arsenic removed (%) by a particular process was read from a scatterplot in the article “Optimizing Arsenic Removal During Iron Removal: Theoretical and Practical Considerations” (J. of Water Supply Res. and Tech., 2005: 545–560).

Figure 12.2 shows two Minitab scatterplots of this data. In Figure 12.2(a), the software selected the scale for both axes. We obtained Figure 12.2(b) by specifying scaling for the axes so that they would intersect at roughly the point (0, 0). The second plot is much more crowded than the first one; such crowding can make it difficult to ascertain the general nature of any relationship. For example, curvature can be overlooked in a crowded plot.

Figure 12.2 Minitab scatterplots of data in Example 12.2


Large values of arsenic removal tend to be associated with low pH, a negative or inverse relationship. Furthermore, the two variables appear to be at least approximately linearly related, although the points in the plot would spread out somewhat about any superimposed straight line (such a line appeared in the plot in the cited article).

A Linear Probabilistic Model

For the deterministic model $y = \beta_0 + \beta_1 x$, the actual observed value of y is a linear function of x. The appropriate generalization of this to a probabilistic model assumes that the expected value of Y is a linear function of x, but that for fixed x the variable Y differs from its expected value by a random amount.

DEFINITION

The Simple Linear Regression Model

There are parameters $\beta_0$, $\beta_1$, and $\sigma^2$, such that for any fixed value of the independent variable x, the dependent variable is a random variable related to x through the model equation

$Y = \beta_0 + \beta_1 x + \epsilon$    (12.1)

The quantity $\epsilon$ in the model equation is a random variable, assumed to be normally distributed with $E(\epsilon) = 0$ and $V(\epsilon) = \sigma^2$.

The variable $\epsilon$ is usually referred to as the random deviation or random error term in the model. Without $\epsilon$, any observed pair (x, y) would correspond to a point falling exactly on the line $y = \beta_0 + \beta_1 x$, called the true (or population) regression line. The inclusion of the random error term allows (x, y) to fall either above the true regression line (when $\epsilon > 0$) or below the line (when $\epsilon < 0$). The points $(x_1, y_1), \ldots, (x_n, y_n)$ resulting from n independent observations will then be scattered about the true regression line, as illustrated in Figure 12.3. On occasion, the appropriateness of the simple linear regression model may be suggested by theoretical considerations (e.g., there is an exact linear relationship between the two variables, with $\epsilon$ representing measurement error). Much more frequently, though, the reasonableness of the model is indicated by a scatterplot exhibiting a substantial linear pattern (as in Figures 12.1 and 12.2).

Figure 12.3 Points corresponding to observations from the simple linear regression model (points such as $(x_1, y_1)$ scattered about the true regression line $y = \beta_0 + \beta_1 x$)
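The way observations scatter about the true regression line can be imitated by simulation. The sketch below is only illustrative: the parameter values, the x settings, and the function name `simulate_slr` are assumptions made for the example and do not come from the text.

```python
import random

def simulate_slr(xs, beta0, beta1, sigma, seed=0):
    """Draw one Y for each x from Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, sigma = 10.0, 2.0, 1.5
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = simulate_slr(xs, beta0, beta1, sigma, seed=42)

for x, y in zip(xs, ys):
    eps = y - (beta0 + beta1 * x)  # realized random deviation for this point
    side = "above" if eps > 0 else "below"
    print(f"x={x}  y={y:7.3f}  eps={eps:+.3f}  ({side} the true line)")
```

Setting `sigma` to 0 makes every simulated point fall exactly on the line, which corresponds to the deterministic case described earlier.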


Implications of the model equation (12.1) can best be understood with the aid of the following notation. Let $x^*$ denote a particular value of the independent variable x and

$\mu_{Y \cdot x^*}$ = the expected (or mean) value of Y when x has value $x^*$
$\sigma^2_{Y \cdot x^*}$ = the variance of Y when x has value $x^*$

Alternative notation is $E(Y \mid x^*)$ and $V(Y \mid x^*)$. For example, if x = applied stress (kg/mm²) and y = time-to-fracture (hr), then $\mu_{Y \cdot 20}$ would denote the expected value of time-to-fracture when applied stress is 20 kg/mm². If we think of an entire population of (x, y) pairs, then $\mu_{Y \cdot x^*}$ is the mean of all y values for which $x = x^*$, and $\sigma^2_{Y \cdot x^*}$ is a measure of how much these values of y spread out about the mean value. If, for example, x = age of a child and y = vocabulary size, then $\mu_{Y \cdot 5}$ is the average vocabulary size for all 5-year-old children in the population, and $\sigma^2_{Y \cdot 5}$ describes the amount of variability in vocabulary size for this part of the population. Once x is fixed, the only randomness on the right-hand side of the model equation (12.1) is in the random error $\epsilon$, and its mean value and variance are 0 and $\sigma^2$, respectively, whatever the value of x. This implies that

$\mu_{Y \cdot x^*} = E(\beta_0 + \beta_1 x^* + \epsilon) = \beta_0 + \beta_1 x^* + E(\epsilon) = \beta_0 + \beta_1 x^*$

$\sigma^2_{Y \cdot x^*} = V(\beta_0 + \beta_1 x^* + \epsilon) = V(\beta_0 + \beta_1 x^*) + V(\epsilon) = 0 + \sigma^2 = \sigma^2$

Replacing $x^*$ in $\mu_{Y \cdot x^*}$ by x gives the relation $\mu_{Y \cdot x} = \beta_0 + \beta_1 x$, which says that the mean value of Y, rather than Y itself, is a linear function of x. The true regression line $y = \beta_0 + \beta_1 x$ is thus the line of mean values; its height above any particular x value is the expected value of Y for that value of x. The slope $\beta_1$ of the true regression line is interpreted as the expected change in Y associated with a 1-unit increase in the value of x. The second relation states that the amount of variability in the distribution of Y values is the same at each different value of x (homogeneity of variance). If the independent variable is vehicle weight and the dependent variable is fuel efficiency (mpg), then the model implies that the average fuel efficiency changes linearly with weight (presumably $\beta_1$ is negative) and that the amount of variability in efficiency for any particular weight is the same as at any other weight. Finally, for fixed x, Y is the sum of a constant $\beta_0 + \beta_1 x$ and a normally distributed rv $\epsilon$ so itself has a normal distribution. These properties are illustrated in Figure 12.4. The variance parameter $\sigma^2$ determines the extent to which each normal curve spreads out about its mean value; roughly speaking, the value of $\sigma$ is the size of a typical deviation from the true regression line. An observed point (x, y) will almost always fall quite close to the true regression line when $\sigma$ is small, whereas observations may deviate considerably from their expected values (corresponding to points far from the line) when $\sigma$ is large.

Figure 12.4 (a) Distribution of $\epsilon$ (normal, mean 0, standard deviation $\sigma$); (b) distribution of Y for different values of x about the line $y = \beta_0 + \beta_1 x$
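The two relations just derived (mean $\beta_0 + \beta_1 x$ and constant variance $\sigma^2$ at every x) can be checked empirically. The following Python sketch draws many observations at a few fixed x values; the parameter values and the helper `sample_y` are hypothetical and serve only to illustrate the idea.

```python
import random
import statistics

def sample_y(x, beta0, beta1, sigma, n, rng):
    """Draw n independent observations of Y at a fixed x under the model."""
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for _ in range(n)]

# Hypothetical parameters (not from the text).
beta0, beta1, sigma = 5.0, 2.0, 3.0
rng = random.Random(1)

summary = {}
for x in (1.0, 5.0, 10.0):
    ys = sample_y(x, beta0, beta1, sigma, n=20000, rng=rng)
    summary[x] = (statistics.fmean(ys), statistics.pvariance(ys))
    mean_y, var_y = summary[x]
    # The sample mean should track beta0 + beta1*x; the variance should stay near sigma^2.
    print(f"x={x:5.1f}  mean={mean_y:7.3f} (model {beta0 + beta1 * x:6.2f})  "
          f"variance={var_y:6.3f} (model {sigma ** 2:.2f})")
```

The sample means move along the line as x changes, while the sample variances stay roughly the same, which is the homogeneity-of-variance property.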

Example 12.3

Suppose the relationship between applied stress x and time-to-failure y is described by the simple linear regression model with true regression line $y = 65 - 1.2x$ and $\sigma = 8$. Then for any fixed value $x^*$ of stress, time-to-failure has a normal distribution with mean value $65 - 1.2x^*$ and standard deviation 8. In the population consisting of all (x, y) points, the magnitude of a typical deviation from the true regression line is about 8. For x = 20, Y has mean value $\mu_{Y \cdot 20} = 65 - 1.2(20) = 41$, so

$P(Y > 50 \text{ when } x = 20) = P\left(Z > \frac{50 - 41}{8}\right) = 1 - \Phi(1.13) = .1292$

Because $\mu_{Y \cdot 25} = 35$,

$P(Y > 50 \text{ when } x = 25) = P\left(Z > \frac{50 - 35}{8}\right) = 1 - \Phi(1.88) = .0301$

These probabilities are illustrated as the shaded areas in Figure 12.5.

Figure 12.5 Probabilities based on the simple linear regression model (shaded areas above 50 at x = 20 and x = 25, about the true regression line $y = 65 - 1.2x$)
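The two probabilities in Example 12.3 can be reproduced numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the function name `prob_y_exceeds` is a hypothetical helper. Because the text rounds z to two decimals before using the normal table, the unrounded results may differ slightly in the third decimal place.

```python
from statistics import NormalDist

def prob_y_exceeds(c, x, beta0, beta1, sigma):
    """P(Y > c) when Y ~ N(beta0 + beta1*x, sigma^2), as the model assumes."""
    mean_y = beta0 + beta1 * x
    return 1.0 - NormalDist(mu=mean_y, sigma=sigma).cdf(c)

# Values from Example 12.3: true line y = 65 - 1.2x, sigma = 8.
p20 = prob_y_exceeds(50, 20, beta0=65, beta1=-1.2, sigma=8)
p25 = prob_y_exceeds(50, 25, beta0=65, beta1=-1.2, sigma=8)

print(f"P(Y > 50 when x = 20) is about {p20:.4f}")
print(f"P(Y > 50 when x = 25) is about {p25:.4f}")
```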

Suppose that $Y_1$ denotes an observation on time-to-failure made with x = 25 and $Y_2$ denotes an independent observation made with x = 24. Then $Y_1 - Y_2$ is normally distributed with mean value $E(Y_1 - Y_2) = \beta_1 = -1.2$, variance $V(Y_1 - Y_2) = \sigma^2 + \sigma^2 = 128$, and standard deviation $\sqrt{128} = 11.314$. The probability that $Y_1$ exceeds $Y_2$ is

$P(Y_1 - Y_2 > 0) = P\left(Z > \frac{0 - (-1.2)}{11.314}\right) = P(Z > .11) = .4562$

That is, even though we expected Y to decrease when x increases by 1 unit, it is not unlikely that the observed Y at x + 1 will be larger than the observed Y at x.
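The calculation for the difference of two independent observations can be sketched the same way. The helper name `prob_first_exceeds_second` is hypothetical; the numbers are those of Example 12.3. As before, the unrounded value differs a little from the table-based .4562 because the text rounds z to .11.

```python
from math import sqrt
from statistics import NormalDist

def prob_first_exceeds_second(x1, x2, beta0, beta1, sigma):
    """P(Y1 > Y2) for independent observations at x1 and x2 under the model."""
    mean_diff = (beta0 + beta1 * x1) - (beta0 + beta1 * x2)  # equals beta1 * (x1 - x2)
    sd_diff = sqrt(2.0) * sigma                              # sqrt(sigma^2 + sigma^2)
    return 1.0 - NormalDist(mu=mean_diff, sigma=sd_diff).cdf(0.0)

# Y1 observed at x = 25, Y2 observed at x = 24, with y = 65 - 1.2x and sigma = 8.
p = prob_first_exceeds_second(25, 24, beta0=65, beta1=-1.2, sigma=8)
print(f"P(Y1 > Y2) is about {p:.4f}")
```

When the two x values are equal, the mean difference is 0 and the function returns 0.5, as symmetry requires.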


EXERCISES Section 12.1 (1–11)

1. The efficiency ratio for a steel specimen immersed in a phosphating tank is the weight of the phosphate coating divided by the metal loss (both in mg/ft²). The article “Statistical Process Control of a Phosphate Coating Line” (Wire J. Intl., May 1997: 78–81) gave the accompanying data on tank temperature (x) and efficiency ratio (y).

   Temp.    170    172    173    174    174    175    176
   Ratio    .84   1.31   1.42   1.03   1.07   1.08   1.04

   Temp.
   Ratio   1.43    .90   1.81   1.94   2.68   1.49   2.52

   Temp.    185    186    188
   Ratio   3.00   1.87   3.08

   a. Construct stem-and-leaf displays of both temperature and efficiency ratio, and comment on interesting features.
   b. Is the value of efficiency ratio completely and uniquely determined by tank temperature? Explain your reasoning.
   c. Construct a scatterplot of the data. Does it appear that efficiency ratio could be very well predicted by the value of temperature? Explain your reasoning.

2. The article “Exhaust Emissions from Four-Stroke Lawn Mower Engines” (J. of the Air and Water Mgmnt. Assoc., 1997: 945–952) reported data from a study in which both a baseline gasoline mixture and a reformulated gasoline were used. Consider the following observations on age (yr) and NOx emissions (g/kWh):

   Engine
   Age
   Baseline
   Reformulated   1.88   5.93   5.54   2.67   6.53

   Construct scatterplots of NOx emissions versus age. What appears to be the nature of the relationship between these two variables? [Note: The authors of the cited article commented on the relationship.]

3. Bivariate data often arises from the use of two different techniques to measure the same quantity. As an example, the accompanying observations on x = hydrogen concentration (ppm) using a gas chromatography method and y = concentration using a new sensor method were read from a graph in the article “A New Method to Measure the Diffusible Hydrogen Content in Steel Weldments Using a Polymer Electrolyte-Based Hydrogen Sensor” (Welding Res., July 1997: 251s–256s).

   x
   y   127   114   134   139   142   170   149   154   200   215

   Construct a scatterplot. Does there appear to be a very strong relationship between the two types of concentration measurements? Do the two methods appear to be measuring roughly the same quantity? Explain your reasoning.

4. The accompanying data on y = ammonium concentration (mg/L) and x = transpiration (ml/h) was read from a graph in the article “Response of Ammonium Removal to Growth and Transpiration of Juncus effusus During the Treatment of Artificial Sewage in Laboratory-Scale Wetlands” (Water Research, 2013: 4265–4273). The article’s abstract stated “a linear correlation between the ammonium concentration inside the rhizosphere and the transpiration of the plant stocks implies that an influence of plant physiological activity on the efficiency of N-removal exists.” (The rhizosphere is the narrow region of soil at the plant root–soil interface, and transpiration is the process of water movement through a plant and its evaporation.) The article reported summary quantities from a simple linear regression analysis. Based on a scatterplot, how would you describe the relationship between the variables, and does simple linear regression appear to be an appropriate modeling strategy?

5. The article “Objective Measurement of the Stretchability of Mozzarella Cheese” (J. of Texture Studies, 1992: 185–194) reported on an experiment to investigate how the behavior of mozzarella cheese varied with temperature. Consider the accompanying data on x = temperature and y = elongation (%) at failure of the cheese. [Note: The researchers were Italian and used real mozzarella cheese, not the poor cousin widely available in the United States.]

   y   118   182   247   208   197   135   132

   a. Construct a scatterplot in which the axes intersect at (0, 0). Mark 0, 20, 40, 60, 80, and 100 on the horizontal axis and 0, 50, 100, 150, 200, and 250 on the vertical axis.
   b. Construct a scatterplot in which the axes intersect at (55, 100), as was done in the cited article. Does this plot seem preferable to the one in part (a)? Explain your reasoning.
   c. What do the plots of parts (a) and (b) suggest about the nature of the relationship between the two variables?

6. One factor in the development of tennis elbow, a malady that strikes fear in the hearts of all serious tennis players, is the impact-induced vibration of the racket-and-arm system at ball contact. It is well known that the likelihood of getting tennis elbow depends on various properties of the racket used. Consider the scatterplot of x = racket resonance frequency (Hz) and y = sum of peak-to-peak acceleration (a characteristic of arm vibration, in m/sec/sec) for n = 23 different rackets (“Transfer of Tennis Racket Vibrations into the Human Forearm,” Medicine and Science in Sports and Exercise, 1992: 1134–1140). Discuss interesting features of the data and scatterplot.

7. The article “Some Field Experience in the Use of an Accelerated Method in Estimating 28-Day Strength of Concrete” (J. of Amer. Concrete Institute, 1969: 895) considered regressing y = 28-day standard-cured strength (psi) against x = accelerated strength (psi). Suppose the equation of the true regression line is $y = 1800 + 1.3x$.

   a. What is the expected value of 28-day strength when accelerated strength = 2500?
   b. By how much can we expect 28-day strength to change when accelerated strength increases by 1 psi?
   c. Answer part (b) for an increase of 100 psi.
   d. Answer part (b) for a decrease of 100 psi.

8. Referring to Exercise 7, suppose that the standard deviation of the random deviation $\epsilon$ is 350 psi.

   a. What is the probability that the observed value of 28-day strength will exceed 5000 psi when the value of accelerated strength is 2000?
   b. Repeat part (a) with 2500 in place of 2000.
   c. Consider making two independent observations on 28-day strength, the first for an accelerated strength of 2000 and the second for x = 2500. What is the probability that the second observation will exceed the first by more than 1000 psi?
   d. Let $Y_1$ and $Y_2$ denote observations on 28-day strength when $x = x_1$ and $x = x_2$, respectively. By how much would $x_2$ have to exceed $x_1$ in order that $P(Y_2 > Y_1) =$

9. The flow rate y (m³/min) in a device used for air-quality measurement depends on the pressure drop x (in. of water) across the device’s filter. Suppose that for x values between 5 and 20, the two variables are related according to the simple linear regression model with true regression line $y = -.12 + .095x$.

   a. What is the expected change in flow rate associated with a 1-in. increase in pressure drop? Explain.
   b. What change in flow rate can be expected when pressure drop decreases by 5 in.?
   c. What is the expected flow rate for a pressure drop of 10 in.? A drop of 15 in.?
   d. Suppose $\sigma = .025$ and consider a pressure drop of 10 in. What is the probability that the observed value of flow rate will exceed .835? That observed flow rate will exceed .840?
   e. What is the probability that an observation on flow rate when pressure drop is 10 in. will exceed an observation on flow rate made when pressure drop is

10. Suppose the expected cost of a production run is related to the size of the run by the equation $y = 4000 + 10x$. Let Y denote an observation on the cost of a run. If the variables size and cost are related according to the simple linear regression model, could it be the case that P(Y > 50 when x = 100) = .05 and P(Y > 6500 when x = 200) = .10? Explain.

11. Suppose that in a certain chemical process the reaction time y (hr) is related to the temperature (°F) in the chamber in which the reaction takes place according to the simple linear regression model with equation $y = 5.00 - .01x$ and $\sigma = .075$.

   a. What is the expected change in reaction time for a 1°F increase in temperature? For a 10°F increase in temperature?
   b. What is the expected reaction time when temperature is 200°F? When temperature is 250°F?
   c. Suppose five observations are made independently on reaction time, each one for a temperature of 250°F. What is the probability that all five times are between 2.4 and 2.6 hr?
   d. What is the probability that two independently observed reaction times for temperatures 1° apart are such that the time at the higher temperature exceeds the time at the lower temperature?
