The Simple Linear Regression Model

12.1 The Simple Linear Regression Model

The simplest deterministic mathematical relationship between two variables x and y is a linear relationship $y = \beta_0 + \beta_1 x$. The set of pairs (x, y) for which $y = \beta_0 + \beta_1 x$ determines a straight line with slope $\beta_1$ and y-intercept $\beta_0$. The objective of this section is to develop a linear probabilistic model.

If the two variables are not deterministically related, then for a fixed value of x, there is uncertainty in the value of the second variable. For example, if we are investigating the relationship between age of child and size of vocabulary and decide to select a child of age x = 5.0 years, then before the selection is made, vocabulary size is a random variable Y. After a particular 5-year-old child has been selected and tested, a vocabulary of 2000 words may result. We would then say that the observed value of Y associated with fixing x = 5.0 was y = 2000.

More generally, the variable whose value is fixed by the experimenter will be denoted by x and will be called the independent, predictor, or explanatory variable. For fixed x, the second variable will be random; we denote this random variable and its observed value by Y and y, respectively, and refer to it as the dependent or response variable.

Usually observations will be made for a number of settings of the independent variable. Let $x_1, x_2, \ldots, x_n$ denote values of the independent variable for which observations are made, and let $Y_i$ and $y_i$, respectively, denote the random variable and observed value associated with $x_i$. The available bivariate data then consists of the n pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. A picture of this data called a scatterplot gives preliminary impressions about the nature of any relationship. In such a plot, each $(x_i, y_i)$ is represented as a point plotted on a two-dimensional coordinate system.

The slope of a line is the change in y for a 1-unit increase in x. For example, if $y = -3x + 10$, then y decreases by 3 when x increases by 1, so the slope is $-3$. The y-intercept is the height at which the line crosses the vertical axis and is obtained by setting x = 0 in the equation.


Example 12.1

Visual and musculoskeletal problems associated with the use of visual display terminals (VDTs) have become rather common in recent years. Some researchers have focused on vertical gaze direction as a source of eye strain and irritation. This direction is known to be closely related to ocular surface area (OSA), so a method of measuring OSA is needed. The accompanying representative data on y = OSA (cm²) and x = width of the palpebral fissure (i.e., the horizontal width of the eye opening, in cm) is from the article “Analysis of Ocular Surface Area for Comfortable VDT Workstation Layout” (Ergonomics, 1996: 877–884). The order in which observations were obtained was not given, so for convenience they are listed in increasing order of x values.

i      1    2    3    4    5    6    7    8    9    10   11   12   13   14    15
x_i   .40  .42  .48  .51  .57  .60  .70  .75  .75  .78  .84  .95  .99  1.03  1.12
y_i

Thus $(x_1, y_1) = (.40, 1.02)$, $(x_5, y_5) = (.57, 1.52)$, and so on. A Minitab scatterplot is shown in Figure 12.1; we used an option that produced a dotplot of both the x values and y values individually along the right and top margins of the plot, which makes it easier to visualize the distributions of the individual variables (histograms or boxplots are alternative options). Here are some things to notice about the data and plot:

● Several observations have identical x values yet different y values (e.g., $x_8 = x_9 = .75$, but $y_8 = 1.80$ and $y_9 = 1.74$). Thus the value of y is not determined solely by x but also by various other factors.

● There is a strong tendency for y to increase as x increases. That is, larger values of OSA tend to be associated with larger values of fissure width: a positive relationship between the variables.

  Figure 12.1 Scatterplot from Minitab for the data from Example 12.1, along with dotplots of x and y values


● It appears that the value of y could be predicted from x by finding a line that is reasonably close to the points in the plot (the authors of the cited article superimposed such a line on their plot). In other words, there is evidence of a substantial (though not perfect) linear relationship between the two variables.

■

The horizontal and vertical axes in the scatterplot of Figure 12.1 intersect at the point (0, 0). In many data sets, the values of x or y or the values of both variables differ considerably from zero relative to the range(s) of the values. For example, a study of how air conditioner efficiency is related to maximum daily outdoor temperature might involve observations for temperatures ranging from 80°F to 100°F. When this is the case, a more informative plot would show the appropriately labeled axes intersecting at some point other than (0, 0).

Example 12.2

Arsenic is found in many ground waters and some surface waters. Recent health effects research has prompted the Environmental Protection Agency to reduce allowable arsenic levels in drinking water so that many water systems are no longer compliant with standards. This has spurred interest in the development of methods to remove arsenic. The accompanying data on x = pH and y = arsenic removed (%) by a particular process was read from a scatterplot in the article “Optimizing Arsenic Removal During Iron Removal: Theoretical and Practical Considerations” (J. of Water Supply Res. and Tech., 2005: 545–560).

Figure 12.2 shows two Minitab scatterplots of this data. In Figure 12.2(a), the software selected the scale for both axes. We obtained Figure 12.2(b) by specifying scaling for the axes so that they would intersect at roughly the point (0, 0). The second plot is much more crowded than the first one; such crowding can make it difficult to ascertain the general nature of any relationship. For example, curvature can be overlooked in a crowded plot.

  Figure 12.2 Minitab scatterplots of data in Example 12.2


Large values of arsenic removal tend to be associated with low pH, a negative or inverse relationship. Furthermore, the two variables appear to be at least approximately linearly related, although the points in the plot would spread out somewhat about any superimposed straight line (such a line appeared in the plot in the cited article).

■

  A Linear Probabilistic Model

For the deterministic model $y = \beta_0 + \beta_1 x$, the actual observed value of y is a linear function of x. The appropriate generalization of this to a probabilistic model assumes that the expected value of Y is a linear function of x, but that for fixed x the variable Y differs from its expected value by a random amount.

DEFINITION

The Simple Linear Regression Model

There are parameters $\beta_0$, $\beta_1$, and $\sigma^2$ such that for any fixed value of the independent variable x, the dependent variable is a random variable related to x through the model equation

$$Y = \beta_0 + \beta_1 x + \varepsilon \qquad (12.1)$$

The quantity $\varepsilon$ in the model equation is a random variable, assumed to be normally distributed with $E(\varepsilon) = 0$ and $V(\varepsilon) = \sigma^2$.

The variable $\varepsilon$ is usually referred to as the random deviation or random error term in the model. Without $\varepsilon$, any observed pair (x, y) would correspond to a point falling exactly on the line $y = \beta_0 + \beta_1 x$, called the true (or population) regression line. The inclusion of the random error term allows (x, y) to fall either above the true regression line (when $\varepsilon > 0$) or below the line (when $\varepsilon < 0$). The points $(x_1, y_1), \ldots, (x_n, y_n)$ resulting from n independent observations will then be scattered about the true regression line, as illustrated in Figure 12.3. On occasion, the appropriateness of the simple linear regression model may be suggested by theoretical considerations (e.g., there is an exact linear relationship between the two variables, with $\varepsilon$ representing measurement error). Much more frequently, though, the reasonableness of the model is indicated by a scatterplot exhibiting a substantial linear pattern (as in Figures 12.1 and 12.2).

Figure 12.3 Points corresponding to observations from the simple linear regression model
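The way observations scatter about the true regression line can be illustrated with a short simulation. The following Python sketch assumes illustrative parameter values ($\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$) and a hypothetical helper name, `simulate_observations`; none of these come from the text. It draws one Y for each x from the model equation (12.1) and counts how many points land above and below the line.

```python
import random

def simulate_observations(xs, beta0, beta1, sigma, seed=0):
    """Draw one Y for each x from Y = beta0 + beta1*x + e, with e ~ N(0, sigma^2)."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    pairs = []
    for x in xs:
        e = rng.gauss(0.0, sigma)      # random deviation: mean 0, standard deviation sigma
        pairs.append((x, beta0 + beta1 * x + e))
    return pairs

# Assumed values, chosen only for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0
xs = [float(x) for x in range(1, 21)]
pairs = simulate_observations(xs, beta0, beta1, sigma)

# A point lies above the true regression line exactly when its deviation e is positive
above = sum(1 for x, y in pairs if y > beta0 + beta1 * x)
below = len(pairs) - above
print(f"{above} points above the true line, {below} below")
```

Setting `sigma` to 0 in this sketch removes the random deviation, so every simulated point then falls exactly on the line, which corresponds to the deterministic case.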


Implications of the model equation (12.1) can best be understood with the aid of the following notation. Let $x^*$ denote a particular value of the independent variable x and

$\mu_{Y \cdot x^*}$ = the expected (or mean) value of Y when x has value $x^*$
$\sigma^2_{Y \cdot x^*}$ = the variance of Y when x has value $x^*$

Alternative notation is $E(Y \mid x^*)$ and $V(Y \mid x^*)$. For example, if x = applied stress (kg/mm²) and y = time-to-fracture (hr), then $\mu_{Y \cdot 20}$ would denote the expected value of time-to-fracture when applied stress is 20 kg/mm². If we think of an entire population of (x, y) pairs, then $\mu_{Y \cdot x^*}$ is the mean of all y values for which $x = x^*$, and $\sigma^2_{Y \cdot x^*}$ is a measure of how much these values of y spread out about the mean value. If, for example, x = age of a child and y = vocabulary size, then $\mu_{Y \cdot 5}$ is the average vocabulary size for all 5-year-old children in the population, and $\sigma^2_{Y \cdot 5}$ describes the amount of variability in vocabulary size for this part of the population. Once x is fixed, the only randomness on the right-hand side of the model equation (12.1) is in the random error $\varepsilon$, and its mean value and variance are 0 and $\sigma^2$, respectively, whatever the value of x. This implies that

$$\mu_{Y \cdot x^*} = E(\beta_0 + \beta_1 x^* + \varepsilon) = \beta_0 + \beta_1 x^* + E(\varepsilon) = \beta_0 + \beta_1 x^*$$

$$\sigma^2_{Y \cdot x^*} = V(\beta_0 + \beta_1 x^* + \varepsilon) = V(\beta_0 + \beta_1 x^*) + V(\varepsilon) = 0 + \sigma^2 = \sigma^2$$

Replacing $x^*$ in $\mu_{Y \cdot x^*}$ by x gives the relation $\mu_{Y \cdot x} = \beta_0 + \beta_1 x$, which says that the mean value of Y, rather than Y itself, is a linear function of x. The true regression line $y = \beta_0 + \beta_1 x$ is thus the line of mean values; its height above any particular x value is the expected value of Y for that value of x. The slope $\beta_1$ of the true regression line is interpreted as the expected change in Y associated with a 1-unit increase in the value of x. The second relation states that the amount of variability in the distribution of Y values is the same at each different value of x (homogeneity of variance). If the independent variable is vehicle weight and the dependent variable is fuel efficiency (mpg), then the model implies that the average fuel efficiency changes linearly with weight (presumably $\beta_1$ is negative) and that the amount of variability in efficiency for any particular weight is the same as at any other weight. Finally, for fixed x, Y is the sum of a constant $\beta_0 + \beta_1 x$ and a normally distributed rv $\varepsilon$, so it itself has a normal distribution. These properties are illustrated in Figure 12.4. The variance parameter $\sigma^2$ determines the extent to which each normal curve spreads out about its mean value; roughly speaking, the value of $\sigma$ is the size of a typical deviation from the true regression line. An observed point (x, y) will almost always fall quite close to the true regression line when $\sigma$ is small, whereas observations may deviate considerably from their expected values (corresponding to points far from the line) when $\sigma$ is large.

Figure 12.4 (a) Distribution of $\varepsilon$; (b) distribution of Y for different values of x
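The two relations for the mean and variance of Y at a fixed x can be checked empirically. The sketch below is a simulation under assumed parameter values ($\beta_0 = 10$, $\beta_1 = -2$, $\sigma = 3$, chosen only for illustration); the helper `sample_y` is a hypothetical name. At each of three x values it draws many independent Y values and compares the sample mean with $\beta_0 + \beta_1 x$ and the sample variance with $\sigma^2$.

```python
import random
import statistics

def sample_y(x, beta0, beta1, sigma, n, rng):
    """Draw n independent values of Y at a fixed x under model (12.1)."""
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for _ in range(n)]

# Assumed values, for illustration only
beta0, beta1, sigma = 10.0, -2.0, 3.0
rng = random.Random(1)                 # fixed seed for reproducibility

results = {}
for x in (1.0, 5.0, 9.0):
    ys = sample_y(x, beta0, beta1, sigma, 20000, rng)
    results[x] = (statistics.fmean(ys), statistics.variance(ys))
    mean, var = results[x]
    print(f"x = {x}: sample mean {mean:.2f} (model {beta0 + beta1 * x:.2f}), "
          f"sample variance {var:.2f} (model {sigma ** 2:.2f})")
```

The sample means should track the line of mean values as x changes, while the sample variances should stay near the same value at every x, which is the homogeneity of variance property.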

Example 12.3

Suppose the relationship between applied stress x and time-to-failure y is described by the simple linear regression model with true regression line $y = 65 - 1.2x$ and $\sigma = 8$. Then for any fixed value $x^*$ of stress, time-to-failure has a normal distribution with mean value $65 - 1.2x^*$ and standard deviation 8. In the population consisting of all (x, y) points, the magnitude of a typical deviation from the true regression line is about 8. For x = 20, Y has mean value $\mu_{Y \cdot 20} = 65 - 1.2(20) = 41$, so

$$P(Y > 50 \text{ when } x = 20) = P\left(Z > \frac{50 - 41}{8}\right) = 1 - \Phi(1.13) = .1292$$

Because $\mu_{Y \cdot 25} = 35$,

$$P(Y > 50 \text{ when } x = 25) = P\left(Z > \frac{50 - 35}{8}\right) = 1 - \Phi(1.88) = .0301$$

These probabilities are illustrated as the shaded areas in Figure 12.5.

Figure 12.5 Probabilities based on the simple linear regression model
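The two tail probabilities in this example can be reproduced numerically. The following Python sketch uses the standard library's `statistics.NormalDist`; the function name `prob_y_exceeds` is a hypothetical helper, not something from the text. Because the code does not round z to two decimal places as a normal table does, its results can differ from .1292 and .0301 in the third decimal place.

```python
from statistics import NormalDist

def prob_y_exceeds(threshold, x, beta0, beta1, sigma):
    """P(Y > threshold) when Y is normal with mean beta0 + beta1*x and sd sigma."""
    return 1.0 - NormalDist(mu=beta0 + beta1 * x, sigma=sigma).cdf(threshold)

# True regression line y = 65 - 1.2x with sigma = 8, as in Example 12.3
p20 = prob_y_exceeds(50, 20, 65, -1.2, 8)
p25 = prob_y_exceeds(50, 25, 65, -1.2, 8)
print(f"P(Y > 50 when x = 20) = {p20:.4f}")
print(f"P(Y > 50 when x = 25) = {p25:.4f}")
```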

Suppose that $Y_1$ denotes an observation on time-to-failure made with x = 25 and $Y_2$ denotes an independent observation made with x = 24. Then $Y_1 - Y_2$ is normally distributed with mean value $E(Y_1 - Y_2) = \beta_1 = -1.2$, variance $V(Y_1 - Y_2) = \sigma^2 + \sigma^2 = 128$, and standard deviation $\sqrt{128} = 11.314$. The probability that $Y_1$ exceeds $Y_2$ is

$$P(Y_1 - Y_2 > 0) = P\left(Z > \frac{0 - (-1.2)}{11.314}\right) = P(Z > .11) = .4562$$

That is, even though we expected Y to decrease when x increases by 1 unit, it is not unlikely that the observed Y at x + 1 will be larger than the observed Y at x.
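The calculation for the difference of two independent observations can be sketched the same way. The Python snippet below, a minimal illustration using only the standard library, builds the mean and standard deviation of $Y_1 - Y_2$ from the model parameters and evaluates the probability without rounding z, so its result can differ slightly from the table-based .4562.

```python
from math import sqrt
from statistics import NormalDist

# Parameters of Example 12.3
beta0, beta1, sigma = 65.0, -1.2, 8.0
x1, x2 = 25.0, 24.0

# E(Y1 - Y2) = (beta0 + beta1*x1) - (beta0 + beta1*x2) = beta1 * (x1 - x2)
mean_diff = (beta0 + beta1 * x1) - (beta0 + beta1 * x2)
# Independent observations: the variances add
sd_diff = sqrt(sigma ** 2 + sigma ** 2)

p_y1_exceeds_y2 = 1.0 - NormalDist(mean_diff, sd_diff).cdf(0.0)
print(f"mean = {mean_diff:.1f}, sd = {sd_diff:.3f}, P(Y1 > Y2) = {p_y1_exceeds_y2:.4f}")
```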

■


EXERCISES Section 12.1 (1–11)

1. The efficiency ratio for a steel specimen immersed in a phosphating tank is the weight of the phosphate coating divided by the metal loss (both in mg/ft²). The article “Statistical Process Control of a Phosphate Coating Line” (Wire J. Intl., May 1997: 78–81) gave the accompanying data on tank temperature (x) and efficiency ratio (y).

Temp.  170  172  173  174  174  175  176
Ratio  .84  1.31 1.42 1.03 1.07 1.08 1.04
Ratio  1.43 .90  1.81 1.94 2.68 1.49 2.52
Temp.  185  186  188
Ratio  3.00 1.87 3.08

a. Construct stem-and-leaf displays of both temperature and efficiency ratio, and comment on interesting features.
b. Is the value of efficiency ratio completely and uniquely determined by tank temperature? Explain your reasoning.
c. Construct a scatterplot of the data. Does it appear that efficiency ratio could be very well predicted by the value of temperature? Explain your reasoning.

2. The article “Exhaust Emissions from Four-Stroke Lawn Mower Engines” (J. of the Air and Water Mgmnt. Assoc., 1997: 945–952) reported data from a study in which both a baseline gasoline mixture and a reformulated gasoline were used. Consider the following observations on age (yr) and NOx emissions (g/kWh):

Engine
Age
Baseline
Reformulated  1.88 5.93 5.54 2.67 6.53

Construct scatterplots of NOx emissions versus age. What appears to be the nature of the relationship between these two variables? [Note: The authors of the cited article commented on the relationship.]

3. Bivariate data often arises from the use of two different techniques to measure the same quantity. As an example, the accompanying observations on x = hydrogen concentration (ppm) using a gas chromatography method and y = concentration using a new sensor method were read from a graph in the article “A New Method to Measure the Diffusible Hydrogen Content in Steel Weldments Using a Polymer Electrolyte-Based Hydrogen Sensor” (Welding Res., July 1997: 251s–256s).

x
y  127 114 134 139 142 170 149 154 200 215

Construct a scatterplot. Does there appear to be a very strong relationship between the two types of concentration measurements? Do the two methods appear to be measuring roughly the same quantity? Explain your reasoning.

4. The accompanying data on y = ammonium concentration (mg/L) and x = transpiration (ml/h) was read from a graph in the article “Response of Ammonium Removal to Growth and Transpiration of Juncus effusus During the Treatment of Artificial Sewage in Laboratory-Scale Wetlands” (Water Research, 2013: 4265–4273). The article’s abstract stated “a linear correlation between the ammonium concentration inside the rhizosphere and the transpiration of the plant stocks implies that an influence of plant physiological activity on the efficiency of N-removal exists.” (The rhizosphere is the narrow region of soil at the plant root–soil interface, and transpiration is the process of water movement through a plant and its evaporation.) The article reported summary quantities from a simple linear regression analysis. Based on a scatterplot, how would you describe the relationship between the variables, and does simple linear regression appear to be an appropriate modeling strategy?

5. The article “Objective Measurement of the Stretchability of Mozzarella Cheese” (J. of Texture Studies, 1992: 185–194) reported on an experiment to investigate how the behavior of mozzarella cheese varied with temperature. Consider the accompanying data on x = temperature and y = elongation (%) at failure of the cheese. [Note: The researchers were Italian and used real mozzarella cheese, not the poor cousin widely available in the United States.]

x
y  118 182 247 208 197 135 132

a. Construct a scatterplot in which the axes intersect at (0, 0). Mark 0, 20, 40, 60, 80, and 100 on the horizontal axis and 0, 50, 100, 150, 200, and 250 on the vertical axis.
b. Construct a scatterplot in which the axes intersect at (55, 100), as was done in the cited article. Does this plot seem preferable to the one in part (a)? Explain your reasoning.
c. What do the plots of parts (a) and (b) suggest about the nature of the relationship between the two variables?

6. One factor in the development of tennis elbow, a malady that strikes fear in the hearts of all serious tennis players, is the impact-induced vibration of the racket-and-arm system at ball contact. It is well known that the likelihood of getting tennis elbow depends on various properties of the racket used. Consider the scatterplot of x = racket resonance frequency (Hz) and y = sum of peak-to-peak acceleration (a characteristic of arm vibration, in m/sec/sec) for n = 23 different rackets (“Transfer of Tennis Racket Vibrations into the Human Forearm,” Medicine and Science in Sports and Exercise, 1992: 1134–1140). Discuss interesting features of the data and scatterplot.

7. The article “Some Field Experience in the Use of an Accelerated Method in Estimating 28-Day Strength of Concrete” (J. of Amer. Concrete Institute, 1969: 895) considered regressing y = 28-day standard-cured strength (psi) against x = accelerated strength (psi). Suppose the equation of the true regression line is y = 1800 + 1.3x.

a. What is the expected value of 28-day strength when accelerated strength = 2500?
b. By how much can we expect 28-day strength to change when accelerated strength increases by 1 psi?
c. Answer part (b) for an increase of 100 psi.
d. Answer part (b) for a decrease of 100 psi.

8. Referring to Exercise 7, suppose that the standard deviation of the random deviation $\varepsilon$ is 350 psi.

a. What is the probability that the observed value of 28-day strength will exceed 5000 psi when the value of accelerated strength is 2000?
b. Repeat part (a) with 2500 in place of 2000.
c. Consider making two independent observations on 28-day strength, the first for an accelerated strength of 2000 and the second for x = 2500. What is the probability that the second observation will exceed the first by more than 1000 psi?
d. Let $Y_1$ and $Y_2$ denote observations on 28-day strength when $x = x_1$ and $x = x_2$, respectively. By how much would $x_2$ have to exceed $x_1$ in order that $P(Y_2 > Y_1) =$ …

9. The flow rate y (m³/min) in a device used for air-quality measurement depends on the pressure drop x (in. of water) across the device’s filter. Suppose that for x values between 5 and 20, the two variables are related according to the simple linear regression model with true regression line y = −.12 + .095x.

a. What is the expected change in flow rate associated with a 1-in. increase in pressure drop? Explain.
b. What change in flow rate can be expected when pressure drop decreases by 5 in.?
c. What is the expected flow rate for a pressure drop of 10 in.? A drop of 15 in.?
d. Suppose $\sigma = .025$ and consider a pressure drop of 10 in. What is the probability that the observed value of flow rate will exceed .835? That observed flow rate will exceed .840?
e. What is the probability that an observation on flow rate when pressure drop is 10 in. will exceed an observation on flow rate made when pressure drop is …

10. Suppose the expected cost of a production run is related to the size of the run by the equation y = 4000 + 10x. Let Y denote an observation on the cost of a run. If the variables size and cost are related according to the simple linear regression model, could it be the case that P(Y > 5500 when x = 100) = .05 and P(Y > 6500 when x = 200) = .10? Explain.

11. Suppose that in a certain chemical process the reaction time y (hr) is related to the temperature (°F) in the chamber in which the reaction takes place according to the simple linear regression model with equation y = 5.00 − .01x and $\sigma = .075$.

a. What is the expected change in reaction time for a 1°F increase in temperature? For a 10°F increase in temperature?
b. What is the expected reaction time when temperature is 200°F? When temperature is 250°F?
c. Suppose five observations are made independently on reaction time, each one for a temperature of 250°F. What is the probability that all five times are between 2.4 and 2.6 hr?
d. What is the probability that two independently observed reaction times for temperatures 1° apart are such that the time at the higher temperature exceeds the time at the lower temperature?
