The Simple Linear Regression Model
12.1 The Simple Linear Regression Model
The simplest deterministic mathematical relationship between two variables x and y is a linear relationship $y = \beta_0 + \beta_1 x$. The set of pairs (x, y) for which $y = \beta_0 + \beta_1 x$ determines a straight line with slope $\beta_1$ and y-intercept $\beta_0$. The objective of this section is to develop a linear probabilistic model.
If the two variables are not deterministically related, then for a fixed value of x, there is uncertainty in the value of the second variable. For example, if we are investigating the relationship between age of child and size of vocabulary and decide to select a child of age x = 5.0 years, then before the selection is made, vocabulary size is a random variable Y. After a particular 5-year-old child has been selected and tested, a vocabulary of 2000 words may result. We would then say that the observed value of Y associated with fixing x = 5.0 was y = 2000.
More generally, the variable whose value is fixed by the experimenter will be denoted by x and will be called the independent, predictor, or explanatory variable. For fixed x, the second variable will be random; we denote this random variable and its observed value by Y and y, respectively, and refer to it as the dependent or response variable.
Usually observations will be made for a number of settings of the independent variable. Let $x_1, x_2, \ldots, x_n$ denote values of the independent variable for which observations are made, and let $Y_i$ and $y_i$, respectively, denote the random variable and observed value associated with $x_i$. The available bivariate data then consists of the n pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. A picture of this data called a scatterplot gives preliminary impressions about the nature of any relationship. In such a plot, each $(x_i, y_i)$ is represented as a point plotted on a two-dimensional coordinate system.
The slope of a line is the change in y for a 1-unit increase in x. For example, if $y = -3x + 10$, then y decreases by 3 when x increases by 1, so the slope is $-3$. The y-intercept is the height at which the line crosses the vertical axis and is obtained by setting x = 0 in the equation.
Example 12.1
Visual and musculoskeletal problems associated with the use of visual display terminals (VDTs) have become rather common in recent years. Some researchers have focused on vertical gaze direction as a source of eye strain and irritation. This direction is known to be closely related to ocular surface area (OSA), so a method of measuring OSA is needed. The accompanying representative data on y = OSA (cm$^2$) and x = width of the palpebral fissure (i.e., the horizontal width of the eye opening, in cm) is from the article “Analysis of Ocular Surface Area for Comfortable VDT Workstation Layout” (Ergonomics, 1996: 877–884). The order in which observations were obtained was not given, so for convenience they are listed in increasing order of x values.
i      1    2    3    4    5    6    7    8    9   10   11   12   13    14    15
x_i   .40  .42  .48  .51  .57  .60  .70  .75  .75  .78  .84  .95  .99  1.03  1.12
y_i
Thus $(x_1, y_1) = (.40, 1.02)$, $(x_5, y_5) = (.57, 1.52)$, and so on. A Minitab scatterplot is shown in Figure 12.1; we used an option that produced a dotplot of both the x values and y values individually along the right and top margins of the plot, which makes it easier to visualize the distributions of the individual variables (histograms or boxplots are alternative options). Here are some things to notice about the data and plot:
● Several observations have identical x values yet different y values (e.g., $x_8 = x_9 = .75$, but $y_8 = 1.80$ and $y_9 = 1.74$). Thus the value of y is not determined solely by x but also by various other factors.
● There is a strong tendency for y to increase as x increases. That is, larger values of OSA tend to be associated with larger values of fissure width, a positive relationship between the variables.
Figure 12.1 Scatterplot from Minitab for the data from Example 12.1, along with dotplots of x and y values
● It appears that the value of y could be predicted from x by finding a line that is reasonably close to the points in the plot (the authors of the cited article superimposed such a line on their plot). In other words, there is evidence of a substantial (though not perfect) linear relationship between the two variables.
■
The horizontal and vertical axes in the scatterplot of Figure 12.1 intersect at the point (0, 0). In many data sets, the values of x or y or the values of both variables differ considerably from zero relative to the range(s) of the values. For example, a study of how air conditioner efficiency is related to maximum daily outdoor temperature might involve observations for temperatures ranging from 80°F to 100°F. When this is the case, a more informative plot would show the appropriately labeled axes intersecting at some point other than (0, 0).
Example 12.2
Arsenic is found in many ground waters and some surface waters. Recent health effects research has prompted the Environmental Protection Agency to reduce allowable arsenic levels in drinking water so that many water systems are no longer compliant with standards. This has spurred interest in the development of methods to remove arsenic. The accompanying data on x = pH and y = arsenic removed (%) by a particular process was read from a scatterplot in the article “Optimizing Arsenic Removal During Iron Removal: Theoretical and Practical Considerations” (J. of Water Supply Res. and Tech., 2005: 545–560).
Figure 12.2 shows two Minitab scatterplots of this data. In Figure 12.2(a), the software selected the scale for both axes. We obtained Figure 12.2(b) by specifying scaling for the axes so that they would intersect at roughly the point (0, 0). The second plot is much more crowded than the first one; such crowding can make it difficult to ascertain the general nature of any relationship. For example, curvature can be overlooked in a crowded plot.
Figure 12.2 Minitab scatterplots of data in Example 12.2
Large values of arsenic removal tend to be associated with low pH, a negative or inverse relationship. Furthermore, the two variables appear to be at least approximately linearly related, although the points in the plot would spread out somewhat about any superimposed straight line (such a line appeared in the plot in the cited article).
■
A Linear Probabilistic Model
For the deterministic model $y = \beta_0 + \beta_1 x$, the actual observed value of y is a linear function of x. The appropriate generalization of this to a probabilistic model assumes that the expected value of Y is a linear function of x, but that for fixed x the variable Y differs from its expected value by a random amount.
DEFINITION
The Simple Linear Regression Model
There are parameters $\beta_0$, $\beta_1$, and $\sigma^2$, such that for any fixed value of the independent variable x, the dependent variable is a random variable related to x through the model equation
$$Y = \beta_0 + \beta_1 x + \epsilon \qquad (12.1)$$
The quantity $\epsilon$ in the model equation is a random variable, assumed to be normally distributed with $E(\epsilon) = 0$ and $V(\epsilon) = \sigma^2$.
The variable $\epsilon$ is usually referred to as the random deviation or random error term in the model. Without $\epsilon$, any observed pair (x, y) would correspond to a point falling exactly on the line $y = \beta_0 + \beta_1 x$, called the true (or population) regression line. The inclusion of the random error term allows (x, y) to fall either above the true regression line (when $\epsilon > 0$) or below the line (when $\epsilon < 0$). The points $(x_1, y_1), \ldots, (x_n, y_n)$ resulting from n independent observations will then be scattered about the true regression line, as illustrated in Figure 12.3. On occasion, the appropriateness of the simple linear regression model may be suggested by theoretical considerations (e.g., there is an exact linear relationship between the two variables, with $\epsilon$ representing measurement error). Much more frequently, though, the reasonableness of the model is indicated by a scatterplot exhibiting a substantial linear pattern (as in Figures 12.1 and 12.2).
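The way the error term places points above or below the true regression line can be sketched with a small simulation. This is only an illustration: the parameter values (`BETA0`, `BETA1`, `SIGMA`), the x settings, and the helper name `simulate` are hypothetical choices made here, not values from the text.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, SIGMA = 10.0, 2.0, 1.5


def simulate(xs, beta0, beta1, sigma, rng):
    """Return (x, y, eps) triples with y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)."""
    triples = []
    for x in xs:
        eps = rng.gauss(0.0, sigma)  # random deviation for this observation
        triples.append((x, beta0 + beta1 * x + eps, eps))
    return triples


rng = random.Random(12)  # fixed seed so the run is reproducible
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
points = simulate(xs, BETA0, BETA1, SIGMA, rng)

for x, y, eps in points:
    # A positive deviation puts the point above the true line, a negative one below.
    side = "above" if eps > 0 else "below"
    print(f"x = {x:.1f}  y = {y:6.2f}  ({side} the true line)")
```

Each simulated y differs from the height of the true line at its x by exactly the drawn deviation, which is the content of model equation (12.1).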
Figure 12.3 Points corresponding to observations from the simple linear regression model (points such as $(x_1, y_1)$ scattered about the true regression line $y = \beta_0 + \beta_1 x$)
Implications of the model equation (12.1) can best be understood with the aid of the following notation. Let $x^*$ denote a particular value of the independent variable x and
$\mu_{Y \cdot x^*}$ = the expected (or mean) value of Y when x has value $x^*$
$\sigma^2_{Y \cdot x^*}$ = the variance of Y when x has value $x^*$
Alternative notation is $E(Y \mid x^*)$ and $V(Y \mid x^*)$. For example, if x = applied stress (kg/mm$^2$) and y = time-to-fracture (hr), then $\mu_{Y \cdot 20}$ would denote the expected value of time-to-fracture when applied stress is 20 kg/mm$^2$. If we think of an entire population of (x, y) pairs, then $\mu_{Y \cdot x^*}$ is the mean of all y values for which $x = x^*$, and $\sigma^2_{Y \cdot x^*}$ is a measure of how much these values of y spread out about the mean value. If, for example, x = age of a child and y = vocabulary size, then $\mu_{Y \cdot 5}$ is the average vocabulary size for all 5-year-old children in the population, and $\sigma^2_{Y \cdot 5}$ describes the amount of variability in vocabulary size for this part of the population. Once x is fixed, the only randomness on the right-hand side of the model equation (12.1) is in the random error $\epsilon$, and its mean value and variance are 0 and $\sigma^2$, respectively, whatever the value of x. This implies that
$$\mu_{Y \cdot x^*} = E(\beta_0 + \beta_1 x^* + \epsilon) = \beta_0 + \beta_1 x^* + E(\epsilon) = \beta_0 + \beta_1 x^*$$
$$\sigma^2_{Y \cdot x^*} = V(\beta_0 + \beta_1 x^* + \epsilon) = V(\beta_0 + \beta_1 x^*) + V(\epsilon) = 0 + \sigma^2 = \sigma^2$$
Replacing $x^*$ in $\mu_{Y \cdot x^*}$ by x gives the relation $\mu_{Y \cdot x} = \beta_0 + \beta_1 x$, which says that the mean value of Y, rather than Y itself, is a linear function of x. The true regression line $y = \beta_0 + \beta_1 x$ is thus the line of mean values; its height above any particular x value is the expected value of Y for that value of x. The slope $\beta_1$ of the true regression line is interpreted as the expected change in Y associated with a 1-unit increase in the value of x. The second relation states that the amount of variability in the distribution of Y values is the same at each different value of x (homogeneity of variance). If the independent variable is vehicle weight and the dependent variable is fuel efficiency (mpg), then the model implies that the average fuel efficiency changes linearly with weight (presumably $\beta_1$ is negative) and that the amount of variability in efficiency for any particular weight is the same as at any other weight. Finally, for fixed x, Y is the sum of a constant $\beta_0 + \beta_1 x$ and a normally distributed rv $\epsilon$ so itself has a normal
distribution. These properties are illustrated in Figure 12.4.
Figure 12.4 (a) Distribution of $\epsilon$ (normal, mean 0, standard deviation $\sigma$); (b) distribution of Y for different values of x ($x_1$, $x_2$, $x_3$) about the line $y = \beta_0 + \beta_1 x$
The variance parameter $\sigma^2$ determines the extent to which each normal curve spreads out about its mean value; roughly speaking, the value of $\sigma$ is the size of a typical deviation from the true regression line. An observed point (x, y) will almost always fall quite close to the true regression line when $\sigma$ is small, whereas observations may deviate considerably from their expected values (corresponding to points far from the line) when $\sigma$ is large.
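The two implications of the model (a mean that is linear in x and a variance that does not depend on x) can be checked empirically. The following is a rough sketch under assumed parameter values; `BETA0`, `BETA1`, `SIGMA`, the two x settings, and the helper `sample_y` are all hypothetical and serve only to illustrate the idea.

```python
import random
import statistics

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, SIGMA = 10.0, 2.0, 1.5


def sample_y(x_star, n, rng):
    """Draw n independent observations of Y at the fixed setting x = x_star."""
    return [BETA0 + BETA1 * x_star + rng.gauss(0.0, SIGMA) for _ in range(n)]


rng = random.Random(2024)  # fixed seed so the run is reproducible
summary = {}
for x_star in (1.0, 5.0):
    ys = sample_y(x_star, 20000, rng)
    # Sample mean should be near BETA0 + BETA1*x_star; variance near SIGMA**2.
    summary[x_star] = (statistics.fmean(ys), statistics.pvariance(ys))

for x_star, (mean_y, var_y) in summary.items():
    print(f"x* = {x_star}: mean of Y = {mean_y:.3f}, variance of Y = {var_y:.3f}")
```

With these assumed values the sample means should land near 12 and 20, while both sample variances should land near $\sigma^2 = 2.25$, reflecting homogeneity of variance.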
Example 12.3
Suppose the relationship between applied stress x and time-to-failure y is described by the simple linear regression model with true regression line $y = 65 - 1.2x$ and $\sigma = 8$. Then for any fixed value $x^*$ of stress, time-to-failure has a normal distribution with mean value $65 - 1.2x^*$ and standard deviation 8. In the population consisting of all (x, y) points, the magnitude of a typical deviation from the true regression line is about 8. For x = 20, Y has mean value $\mu_{Y \cdot 20} = 65 - 1.2(20) = 41$, so
$$P(Y > 50 \text{ when } x = 20) = P\left(Z > \frac{50 - 41}{8}\right) = 1 - \Phi(1.13) = .1292$$
Because $\mu_{Y \cdot 25} = 35$,
$$P(Y > 50 \text{ when } x = 25) = P\left(Z > \frac{50 - 35}{8}\right) = 1 - \Phi(1.88) = .0301$$
These probabilities are illustrated as the shaded areas in Figure 12.5.
Figure 12.5 Probabilities based on the simple linear regression model (shaded areas P(Y > 50 when x = 20) = .1292 and P(Y > 50 when x = 25) = .0301 about the true regression line $y = 65 - 1.2x$)
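The two probabilities in this example can be checked numerically. Below is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `prob_exceeds` is hypothetical and introduced here only for illustration. Because the text rounds z to two decimal places before consulting the normal table, the computed values may differ slightly from .1292 and .0301 in the third or fourth decimal place.

```python
from statistics import NormalDist


def prob_exceeds(threshold, x, beta0=65.0, beta1=-1.2, sigma=8.0):
    """P(Y > threshold) when Y ~ N(beta0 + beta1*x, sigma^2)."""
    mean = beta0 + beta1 * x  # height of the true regression line at x
    return 1.0 - NormalDist(mu=mean, sigma=sigma).cdf(threshold)


p20 = prob_exceeds(50, 20)  # mean is 41, so z = (50 - 41) / 8
p25 = prob_exceeds(50, 25)  # mean is 35, so z = (50 - 35) / 8

print(f"P(Y > 50 when x = 20) is about {p20:.4f}")
print(f"P(Y > 50 when x = 25) is about {p25:.4f}")
```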
Suppose that $Y_1$ denotes an observation on time-to-failure made with x = 25 and $Y_2$ denotes an independent observation made with x = 24. Then $Y_1 - Y_2$ is normally distributed with mean value $E(Y_1 - Y_2) = \beta_1 = -1.2$, variance $V(Y_1 - Y_2) = \sigma^2 + \sigma^2 = 128$, and standard deviation $\sqrt{128} = 11.314$. The probability that $Y_1$ exceeds $Y_2$ is
$$P(Y_1 - Y_2 > 0) = P\left(Z > \frac{0 - (-1.2)}{11.314}\right) = P(Z > .11) = .4562$$
That is, even though we expected Y to decrease when x increases by 1 unit, it is not unlikely that the observed Y at x + 1 will be larger than the observed Y at x.
■
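The calculation for the difference of two independent observations in Example 12.3 can be sketched the same way. The variable names below are illustrative choices, and the result is computed without rounding z, so it may differ from .4562 in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

beta1, sigma = -1.2, 8.0  # slope and error standard deviation from Example 12.3
x1, x2 = 25, 24           # settings for Y1 and Y2

# The intercept cancels in E(Y1 - Y2), leaving beta1 * (x1 - x2).
mean_diff = beta1 * (x1 - x2)
# Variances of independent observations add: sigma^2 + sigma^2.
sd_diff = sqrt(2.0) * sigma

p_y1_exceeds_y2 = 1.0 - NormalDist(mu=mean_diff, sigma=sd_diff).cdf(0.0)
print(f"sd of Y1 - Y2 = {sd_diff:.3f}")
print(f"P(Y1 > Y2) is about {p_y1_exceeds_y2:.4f}")
```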
Exercises Section 12.1 (1–11)

1. The efficiency ratio for a steel specimen immersed in a phosphating tank is the weight of the phosphate coating divided by the metal loss (both in mg/ft$^2$). The article “Statistical Process Control of a Phosphate Coating Line” (Wire J. Intl., May 1997: 78–81) gave the accompanying data on tank temperature (x) and efficiency ratio (y).

Temp.   170   172   173   174   174   175   176
Ratio   .84  1.31  1.42  1.03  1.07  1.08  1.04

Ratio  1.43   .90  1.81  1.94  2.68  1.49  2.52

Temp.   185   186   188
Ratio  3.00  1.87  3.08

a. Construct stem-and-leaf displays of both temperature and efficiency ratio, and comment on interesting features.
b. Is the value of efficiency ratio completely and uniquely determined by tank temperature? Explain your reasoning.
c. Construct a scatterplot of the data. Does it appear that efficiency ratio could be very well predicted by the value of temperature? Explain your reasoning.

2. The article “Exhaust Emissions from Four-Stroke Lawn Mower Engines” (J. of the Air and Water Mgmnt. Assoc., 1997: 945–952) reported data from a study in which both a baseline gasoline mixture and a reformulated gasoline were used. Consider the following observations on age (yr) and NO$_x$ emissions (g/kWh):

Engine
Age
Baseline
Reformulated  1.88  5.93  5.54  2.67  6.53

Construct scatterplots of NO$_x$ emissions versus age. What appears to be the nature of the relationship between these two variables? [Note: The authors of the cited article commented on the relationship.]

3. Bivariate data often arises from the use of two different techniques to measure the same quantity. As an example, the accompanying observations on x = hydrogen concentration (ppm) using a gas chromatography method and y = concentration using a new sensor method were read from a graph in the article “A New Method to Measure the Diffusible Hydrogen Content in Steel Weldments Using a Polymer Electrolyte-Based Hydrogen Sensor” (Welding Res., July 1997: 251s–256s).

x
y  127  114  134  139  142  170  149  154  200  215

Construct a scatterplot. Does there appear to be a very strong relationship between the two types of concentration measurements? Do the two methods appear to be measuring roughly the same quantity? Explain your reasoning.

4. The accompanying data on y = ammonium concentration (mg/L) and x = transpiration (ml/h) was read from a graph in the article “Response of Ammonium Removal to Growth and Transpiration of Juncus effusus During the Treatment of Artificial Sewage in Laboratory-Scale Wetlands” (Water Research, 2013: 4265–4273). The article’s abstract stated “a linear correlation between the ammonium concentration inside the rhizosphere and the transpiration of the plant stocks implies that an influence of plant physiological activity on the efficiency of N-removal exists.” (The rhizosphere is the narrow region of soil at the plant root–soil interface, and transpiration is the process of water movement through a plant and its evaporation.) The article reported summary quantities from a simple linear regression analysis. Based on a scatterplot, how would you describe the relationship between the variables, and does simple linear regression appear to be an appropriate modeling strategy?

5. The article “Objective Measurement of the Stretchability of Mozzarella Cheese” (J. of Texture Studies, 1992: 185–194) reported on an experiment to investigate how the behavior of mozzarella cheese varied with temperature. Consider the accompanying data on x = temperature and y = elongation (%) at failure of the cheese. [Note: The researchers were Italian and used real mozzarella cheese, not the poor cousin widely available in the United States.]

x
y  118  182  247  208  197  135  132

a. Construct a scatterplot in which the axes intersect at (0, 0). Mark 0, 20, 40, 60, 80, and 100 on the horizontal axis and 0, 50, 100, 150, 200, and 250 on the vertical axis.
b. Construct a scatterplot in which the axes intersect at (55, 100), as was done in the cited article. Does this plot seem preferable to the one in part (a)? Explain your reasoning.
c. What do the plots of parts (a) and (b) suggest about the nature of the relationship between the two variables?

6. One factor in the development of tennis elbow, a malady that strikes fear in the hearts of all serious tennis players, is the impact-induced vibration of the racket-and-arm system at ball contact. It is well known that the likelihood of getting tennis elbow depends on various properties of the racket used. Consider the scatterplot of x = racket resonance frequency (Hz) and y = sum of peak-to-peak acceleration (a characteristic of arm vibration, in m/sec/sec) for n = 23 different rackets (“Transfer of Tennis Racket Vibrations into the Human Forearm,” Medicine and Science in Sports and Exercise, 1992: 1134–1140). Discuss interesting features of the data and scatterplot.

7. The article “Some Field Experience in the Use of an Accelerated Method in Estimating 28-Day Strength of Concrete” (J. of Amer. Concrete Institute, 1969: 895) considered regressing y = 28-day standard-cured strength (psi) against x = accelerated strength (psi). Suppose the equation of the true regression line is $y = 1800 + 1.3x$.
a. What is the expected value of 28-day strength when accelerated strength = 2500?
b. By how much can we expect 28-day strength to change when accelerated strength increases by 1 psi?
c. Answer part (b) for an increase of 100 psi.
d. Answer part (b) for a decrease of 100 psi.

8. Referring to Exercise 7, suppose that the standard deviation of the random deviation $\epsilon$ is 350 psi.
a. What is the probability that the observed value of 28-day strength will exceed 5000 psi when the value of accelerated strength is 2000?
b. Repeat part (a) with 2500 in place of 2000.
c. Consider making two independent observations on 28-day strength, the first for an accelerated strength of 2000 and the second for x = 2500. What is the probability that the second observation will exceed the first by more than 1000 psi?
d. Let $Y_1$ and $Y_2$ denote observations on 28-day strength when $x = x_1$ and $x = x_2$, respectively. By how much would $x_2$ have to exceed $x_1$ in order that $P(Y_2 > Y_1) =$

9. The flow rate y (m$^3$/min) in a device used for air-quality measurement depends on the pressure drop x (in. of water) across the device’s filter. Suppose that for x values between 5 and 20, the two variables are related according to the simple linear regression model with true regression line $y = -.12 + .095x$.
a. What is the expected change in flow rate associated with a 1-in. increase in pressure drop? Explain.
b. What change in flow rate can be expected when pressure drop decreases by 5 in.?
c. What is the expected flow rate for a pressure drop of 10 in.? A drop of 15 in.?
d. Suppose $\sigma = .025$ and consider a pressure drop of 10 in. What is the probability that the observed value of flow rate will exceed .835? That observed flow rate will exceed .840?
e. What is the probability that an observation on flow rate when pressure drop is 10 in. will exceed an observation on flow rate made when pressure drop is

10. Suppose the expected cost of a production run is related to the size of the run by the equation $y = 4000 + 10x$. Let Y denote an observation on the cost of a run. If the variables size and cost are related according to the simple linear regression model, could it be the case that P(Y > 5500 when x = 100) = .05 and P(Y > 6500 when x = 200) = .10? Explain.

11. Suppose that in a certain chemical process the reaction time y (hr) is related to the temperature (°F) in the chamber in which the reaction takes place according to the simple linear regression model with equation $y = 5.00 - .01x$ and $\sigma = .075$.
a. What is the expected change in reaction time for a 1°F increase in temperature? For a 10°F increase in temperature?
b. What is the expected reaction time when temperature is 200°F? When temperature is 250°F?
c. Suppose five observations are made independently on reaction time, each one for a temperature of 250°F. What is the probability that all five times are between 2.4 and 2.6 hr?
d. What is the probability that two independently observed reaction times for temperatures 1° apart are such that the time at the higher temperature exceeds the time at the lower temperature?