Statistics and Their Distributions

5.3 Statistics and Their Distributions

  The observations in a single sample were denoted in Chapter 1 by $x_1, x_2, \ldots, x_n$.

  Consider selecting two different samples of size $n$ from the same population distribution. The $x_i$'s in the second sample will virtually always differ at least a bit from those in the first sample. For example, a first sample of $n = 3$ cars of a particular type might result in fuel efficiencies $x_1 = 30.7$, $x_2 = 29.4$, $x_3 = 31.1$, whereas a second sample may give $x_1 = 28.8$, $x_2 = 30.0$, and $x_3 = 32.5$. Before we obtain data, there is uncertainty about the value of each $x_i$. Because of this uncertainty, before the data becomes available we now regard each observation as a random variable and denote the sample by $X_1, X_2, \ldots, X_n$ (uppercase letters for random variables).

  This variation in observed values in turn implies that the value of any function of the sample observations, such as the sample mean, sample standard deviation, or sample fourth spread, also varies from sample to sample. That is, prior to obtaining $x_1, \ldots, x_n$, there is uncertainty as to the value of $\bar{x}$, the value of $s$, and so on.

  Example 5.20 Suppose that material strength for a randomly selected specimen of a particular type has a Weibull distribution with parameter values $\alpha = 2$ (shape) and $\beta = 5$ (scale). The corresponding density curve is shown in Figure 5.7. Formulas from Section 4.5 give

  $$\mu = E(X) = 4.4311 \qquad \tilde{\mu} = 4.1628 \qquad \sigma^2 = V(X) = 5.365 \qquad \sigma = 2.316$$

  The mean exceeds the median because of the distribution's positive skew.

  Figure 5.7 The Weibull density curve for Example 5.20
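  The four values quoted for Example 5.20 can be checked numerically. The sketch below assumes the standard Weibull expressions for the mean, median, and variance (presumably the Section 4.5 formulas the text refers to); the variable names are illustrative only.

```python
import math

# Parameters from Example 5.20: shape alpha = 2, scale beta = 5.
alpha, beta = 2.0, 5.0

# Standard Weibull results (assumed to match the Section 4.5 formulas):
#   mean     = beta * Gamma(1 + 1/alpha)
#   median   = beta * (ln 2)^(1/alpha), from solving F(x) = 0.5
#   variance = beta^2 * [Gamma(1 + 2/alpha) - Gamma(1 + 1/alpha)^2]
mean = beta * math.gamma(1 + 1 / alpha)
median = beta * math.log(2) ** (1 / alpha)
variance = beta**2 * (math.gamma(1 + 2 / alpha) - math.gamma(1 + 1 / alpha) ** 2)
sd = math.sqrt(variance)

print(f"mean     = {mean:.4f}")
print(f"median   = {median:.4f}")
print(f"variance = {variance:.3f}")
print(f"sd       = {sd:.3f}")
```

  The mean comes out larger than the median, in line with the positive skew noted in the example.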


  We used statistical software to generate six different samples, each with $n = 10$, from this distribution (material strengths for six different groups of ten specimens each). The results appear in Table 5.1, followed by the values of the sample mean, sample median, and sample standard deviation for each sample. Notice first that the ten observations in any particular sample are all different from those in any other sample. Second, the six values of the sample mean are all different from one another, as are the six values of the sample median and the six values of the sample standard deviation. The same is true of the sample 10% trimmed means, sample fourth spreads, and so on.

  Table 5.1 Samples from the Weibull Distribution of Example 5.20

  Furthermore, the value of the sample mean from any particular sample can be regarded as a point estimate ("point" because it is a single number, corresponding to a single point on the number line) of the population mean $\mu$, whose value is known to be 4.4311. None of the estimates from these six samples is identical to what is being estimated. The estimates from the second and sixth samples are much too large, whereas the fifth sample gives a substantial underestimate. Similarly, the sample standard deviation gives a point estimate of the population standard deviation. All six of the resulting estimates are in error by at least a small amount.

  In summary, the values of the individual sample observations vary from sample to sample, so will in general the value of any quantity computed from sample data, and the value of a sample characteristic used as an estimate of the corresponding population characteristic will virtually never coincide with what is being estimated.

  ■
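  An experiment of the kind described in Example 5.20 can be imitated with a few lines of code. The following is only a sketch using Python's standard library rather than the statistical software used for Table 5.1, so the numbers it produces will not match that table; the seed and the helper name `draw_sample` are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, only to make the run repeatable

SHAPE, SCALE = 2.0, 5.0   # alpha (shape) and beta (scale) from Example 5.20
N, NUM_SAMPLES = 10, 6    # six samples of ten observations each


def draw_sample(n):
    """Return n simulated Weibull strengths (hypothetical helper)."""
    # Note: random.weibullvariate expects (scale, shape), in that order.
    return [random.weibullvariate(SCALE, SHAPE) for _ in range(n)]


samples = [draw_sample(N) for _ in range(NUM_SAMPLES)]

# One (mean, median, standard deviation) triple per sample.
summaries = [
    (statistics.mean(s), statistics.median(s), statistics.stdev(s))
    for s in samples
]

for i, (xbar, med, sd) in enumerate(summaries, start=1):
    print(f"sample {i}: mean = {xbar:.3f}  median = {med:.3f}  sd = {sd:.3f}")
```

  Each run with a different seed gives six different triples, none of which should be expected to equal the population values 4.4311, 4.1628, and 2.316 exactly.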

  DEFINITION A statistic is any quantity whose value can be calculated from sample data. Prior to obtaining data, there is uncertainty as to what value of any particular statistic will result. Therefore, a statistic is a random variable and will be denoted by an uppercase letter; a lowercase letter is used to represent the

  calculated or observed value of the statistic.

  Thus the sample mean, regarded as a statistic (before a sample has been selected or an experiment carried out), is denoted by $\bar{X}$; the calculated value of this statistic is $\bar{x}$. Similarly, $S$ represents the sample standard deviation thought of as a statistic, and its computed value is $s$. If samples of two different types of bricks are selected and the individual compressive strengths are denoted by $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$,

  222 ChapteR 5 Joint probability Distributions and Random Samples

  respectively, then the statistic $\bar{X} - \bar{Y}$, the difference between the two sample mean compressive strengths, is often of great interest.

  Any statistic, being a random variable, has a probability distribution. In particular, the sample mean $\bar{X}$ has a probability distribution. Suppose, for example, that $n = 2$ components are randomly selected and the number of breakdowns while under warranty is determined for each one. Possible values for the sample mean number of breakdowns $\bar{X}$ are 0 (if $X_1 = X_2 = 0$), .5 (if either $X_1 = 0$ and $X_2 = 1$ or $X_1 = 1$ and $X_2 = 0$), 1, 1.5, . . . . The probability distribution of $\bar{X}$ specifies $P(\bar{X} = 0)$, $P(\bar{X} = .5)$, and so on, from which other probabilities such as $P(1 \le \bar{X} \le 3)$ and $P(\bar{X} \ge 2.5)$ can be calculated. Similarly, if for a sample of size $n = 2$, the only possible values of the sample variance are 0, 12.5, and 50 (which is the case if $X_1$ and $X_2$ can each take on only the values 40, 45, or 50), then the probability distribution of $S^2$ gives $P(S^2 = 0)$, $P(S^2 = 12.5)$, and $P(S^2 = 50)$. The probability distribution of a statistic is sometimes referred to as its sampling distribution to emphasize that it describes how the statistic varies in value across all samples that might be selected.
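  The claim that the sample variance can take only the values 0, 12.5, and 50 when each observation is 40, 45, or 50 is easy to confirm by listing every possible sample of size 2. The short sketch below does this; `sample_variance` is a helper written for the illustration, not a function from the text.

```python
from itertools import product


def sample_variance(xs):
    """Sample variance with divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


values = [40, 45, 50]

# All 9 ordered pairs (x1, x2), reduced to the distinct s^2 values they give.
possible_s2 = sorted({sample_variance(pair) for pair in product(values, repeat=2)})
print(possible_s2)
```

  Only the set of possible values is determined here; the probabilities attached to them depend on the distribution of $X_1$ and $X_2$, which this passage does not specify.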

  Random Samples

  The probability distribution of any particular statistic depends not only on the population distribution (normal, uniform, etc.) and the sample size $n$ but also on the method of sampling. Consider selecting a sample of size $n = 2$ from a population consisting of just the three values 1, 5, and 10, and suppose that the statistic of interest is the sample variance. If sampling is done "with replacement," then $S^2 = 0$ will result if $X_1 = X_2$. However, $S^2$ cannot equal 0 if sampling is "without replacement." So $P(S^2 = 0) = 0$ for one sampling method, and this probability is positive for the other method. Our next definition describes a sampling method often encountered (at least approximately) in practice.
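  The contrast between the two sampling methods can be made concrete by enumeration. The sketch below assumes that every ordered pair permitted by the sampling method is equally likely (the text does not state the selection probabilities, so this is an assumption); `prob_s2_zero` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import permutations, product

population = [1, 5, 10]


def prob_s2_zero(outcomes):
    """P(S^2 = 0) when all listed ordered pairs are equally likely."""
    outcomes = list(outcomes)
    # For n = 2, s^2 = 0 exactly when the two observations coincide.
    hits = sum(1 for x1, x2 in outcomes if x1 == x2)
    return Fraction(hits, len(outcomes))


# With replacement: all 9 ordered pairs, including (1, 1), (5, 5), (10, 10).
p_with = prob_s2_zero(product(population, repeat=2))

# Without replacement: the 6 ordered pairs of distinct values.
p_without = prob_s2_zero(permutations(population, 2))

print("with replacement:   ", p_with)
print("without replacement:", p_without)
```

  Under this equal-likelihood assumption the probability is 1/3 with replacement and 0 without, so the same statistic has two different sampling distributions.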

  DEFINITION The rv's $X_1, X_2, \ldots, X_n$ are said to form a (simple) random sample of size $n$ if

  1. The $X_i$'s are independent rv's.

  2. Every $X_i$ has the same probability distribution.

  Conditions 1 and 2 can be paraphrased by saying that the $X_i$'s are independent and identically distributed (iid). If sampling is either with replacement or from an infinite (conceptual) population, Conditions 1 and 2 are satisfied exactly. These conditions will be approximately satisfied if sampling is without replacement, yet the sample size $n$ is much smaller than the population size $N$. In practice, if $n/N \le .05$ (at most 5% of the population is sampled), we can proceed as if the $X_i$'s form a random sample. The virtue of such random sampling is that the probability distribution of any statistic can be more easily obtained than for any other sampling procedure.

  There are two general methods for obtaining information about a statistic’s sampling distribution. One method involves calculations based on probability rules, and the other involves carrying out a simulation experiment.

  Deriving a Sampling Distribution

  Probability rules can be used to obtain the distribution of a statistic provided that it is a "fairly simple" function of the $X_i$'s and either there are relatively few different $X$ values in the population or else the population distribution has a "nice" form. Our next two examples illustrate such situations.


  Example 5.21 A certain brand of MP3 player comes in three configurations: a model with 2 GB of memory, costing \$80, a 4 GB model priced at \$100, and an 8 GB version with a price tag of \$120. If 20% of all purchasers choose the 2 GB model, 30% choose the 4 GB model, and 50% choose the 8 GB model, then the probability distribution of the cost $X$ of a single randomly selected MP3 player purchase is given by

      x       80     100    120
      p(x)    .2     .3     .5          (5.2)

  with $\mu = 106$, $\sigma^2 = 244$.

  Suppose on a particular day only two MP3 players are sold. Let $X_1$ = the revenue from the first sale and $X_2$ = the revenue from the second. Suppose that $X_1$ and $X_2$ are independent, each with the probability distribution shown in (5.2) [so that $X_1$ and $X_2$ constitute a random sample from the distribution (5.2)]. Table 5.2 lists possible $(x_1, x_2)$ pairs, the probability of each [computed using (5.2) and also the assumption of independence], and the resulting $\bar{x}$ and $s^2$ values. [Note that when $n = 2$, $s^2 = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2$.] Now to obtain the probability distribution of $\bar{X}$, the sample average revenue per sale, we must consider each possible value $\bar{x}$ and compute its probability. For example, $\bar{x} = 100$ occurs three times in the table with probabilities .10, .09, and .10, so

  $$p_{\bar{X}}(100) = P(\bar{X} = 100) = .10 + .09 + .10 = .29$$

  Similarly,

  $$p_{S^2}(800) = P(S^2 = 800) = P(X_1 = 80, X_2 = 120 \text{ or } X_1 = 120, X_2 = 80) = .10 + .10 = .20$$

  Table 5.2 Outcomes, Probabilities, and Values of $\bar{x}$ and $s^2$ for Example 5.21
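  The rows of a table such as Table 5.2 follow mechanically from distribution (5.2) and the independence assumption. The sketch below enumerates the nine $(x_1, x_2)$ pairs and computes the probability, $\bar{x}$, and $s^2$ for each; the layout of the printed output is an illustrative choice and is not claimed to reproduce the book's table exactly.

```python
from itertools import product

# Distribution (5.2): price of a single purchase and its probability.
pmf = {80: 0.2, 100: 0.3, 120: 0.5}

rows = []
for x1, x2 in product(pmf, repeat=2):
    p = pmf[x1] * pmf[x2]                        # independence of X1 and X2
    xbar = (x1 + x2) / 2
    s2 = (x1 - xbar) ** 2 + (x2 - xbar) ** 2     # n = 2, so the divisor n - 1 is 1
    rows.append((x1, x2, p, xbar, s2))

print("  x1   x2   p(x1,x2)   xbar    s^2")
for x1, x2, p, xbar, s2 in rows:
    print(f"{x1:4d} {x2:4d}   {p:8.2f}  {xbar:5.0f}  {s2:5.0f}")
```

  Summing the probabilities of the rows that share a value of $\bar{x}$ (or of $s^2$) gives the probability of that value, which is exactly the calculation carried out for $\bar{x} = 100$ and $s^2 = 800$ in the example.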

  The complete sampling distributions of $\bar{X}$ and $S^2$ appear in (5.3) and (5.4).

      $\bar{x}$                  80     90     100    110    120
      $p_{\bar{X}}(\bar{x})$     .04    .12    .29    .30    .25          (5.3)

      $s^2$                0      200    800
      $p_{S^2}(s^2)$       .38    .42    .20          (5.4)

  Figure 5.8 pictures a probability histogram for both the original distribution (5.2) and the $\bar{X}$ distribution (5.3). The figure suggests first that the mean (expected value) of the $\bar{X}$ distribution is equal to the mean 106 of the original distribution, since both histograms appear to be centered at the same place.


  Figure 5.8 Probability histograms for the underlying distribution and $\bar{X}$ distribution in Example 5.21

  From (5.3),

  $$\mu_{\bar{X}} = E(\bar{X}) = \sum \bar{x}\, p_{\bar{X}}(\bar{x}) = (80)(.04) + \cdots + (120)(.25) = 106 = \mu$$

  Second, it appears that the $\bar{X}$ distribution has smaller spread (variability) than the original distribution, since probability mass has moved in toward the mean. Again from (5.3),

  $$\sigma^2_{\bar{X}} = V(\bar{X}) = \sum \bar{x}^2\, p_{\bar{X}}(\bar{x}) - \mu^2_{\bar{X}} = (80^2)(.04) + \cdots + (120^2)(.25) - (106)^2 = 122 = 244/2 = \sigma^2/2$$

  The variance of $\bar{X}$ is precisely half that of the original variance (because $n = 2$).

  Using (5.4), the mean value of $S^2$ is

  $$\mu_{S^2} = E(S^2) = \sum s^2\, p_{S^2}(s^2) = (0)(.38) + (200)(.42) + (800)(.20) = 244 = \sigma^2$$

  That is, the $\bar{X}$ sampling distribution is centered at the population mean $\mu$, and the $S^2$ sampling distribution is centered at the population variance $\sigma^2$.
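  The three moment calculations above can be verified exactly. The sketch below rebuilds the sampling distributions of $\bar{X}$ and $S^2$ from (5.2) using exact fractions and then evaluates $E(\bar{X})$, $V(\bar{X})$, and $E(S^2)$; the helper names `expectation` and `variance` are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Distribution (5.2), written with exact fractions to avoid rounding error.
pmf = {80: Fraction(2, 10), 100: Fraction(3, 10), 120: Fraction(5, 10)}

pmf_xbar = defaultdict(Fraction)   # sampling distribution of the sample mean
pmf_s2 = defaultdict(Fraction)     # sampling distribution of the sample variance
for x1, x2 in product(pmf, repeat=2):
    p = pmf[x1] * pmf[x2]
    xbar = Fraction(x1 + x2, 2)
    s2 = (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    pmf_xbar[xbar] += p
    pmf_s2[s2] += p


def expectation(dist):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(v * p for v, p in dist.items())


def variance(dist):
    """Variance via the shortcut E(V^2) - [E(V)]^2."""
    return sum(v**2 * p for v, p in dist.items()) - expectation(dist) ** 2


print("E(Xbar) =", expectation(pmf_xbar), " population mean     =", expectation(pmf))
print("V(Xbar) =", variance(pmf_xbar), " population variance =", variance(pmf))
print("E(S^2)  =", expectation(pmf_s2))
```

  Working with `Fraction` rather than floating-point numbers means the equalities $E(\bar{X}) = \mu$, $V(\bar{X}) = \sigma^2/2$, and $E(S^2) = \sigma^2$ hold exactly in the output rather than only to rounding accuracy.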

  If there had been four purchases on the day of interest, the sample average revenue $\bar{X}$ would be based on a random sample of four $X_i$'s, each having the distribution (5.2). Mildly tedious calculations yield the pmf of $\bar{X}$ for $n = 4$ as

      $\bar{x}$                  80      85      90      95      100     105     110     115     120
      $p_{\bar{X}}(\bar{x})$     .0016   .0096   .0376   .0936   .1761   .2340   .2350   .1500   .0625

  From this, $\mu_{\bar{X}} = 106 = \mu$ and $\sigma^2_{\bar{X}} = 61 = \sigma^2/4$. Figure 5.9 is a probability histogram of this pmf.
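  The "mildly tedious calculations" for $n = 4$ are well suited to a computer. One way to organize them, sketched below, is to build the distribution of the total revenue by repeated convolution of (5.2) with itself and then divide each total by $n$; this is an illustrative approach, not necessarily how the pmf above was originally obtained, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Distribution (5.2) with exact probabilities.
pmf = {80: Fraction(1, 5), 100: Fraction(3, 10), 120: Fraction(1, 2)}


def convolve(dist_a, dist_b):
    """Distribution of A + B for independent A and B."""
    out = defaultdict(Fraction)
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[a + b] += pa * pb
    return dict(out)


def pmf_of_sample_mean(dist, n):
    """pmf of the mean of a random sample of size n from dist."""
    total = dist
    for _ in range(n - 1):
        total = convolve(total, dist)
    return {Fraction(t, n): p for t, p in total.items()}


pmf_xbar4 = pmf_of_sample_mean(pmf, 4)
mean4 = sum(x * p for x, p in pmf_xbar4.items())
var4 = sum(x**2 * p for x, p in pmf_xbar4.items()) - mean4**2

for xbar in sorted(pmf_xbar4):
    print(f"{float(xbar):6.1f}  {float(pmf_xbar4[xbar]):.4f}")
print("mean =", mean4, " variance =", var4)
```

  The same function with `n = 2` gives distribution (5.3), and larger values of `n` require no additional hand calculation.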

  Figure 5.9 Probability histogram for $\bar{X}$ based on $n = 4$ in Example 5.21

  ■

  Example 5.21 should suggest first of all that the computation of $p_{\bar{X}}(\bar{x})$ and $p_{S^2}(s^2)$ can be tedious. If the original distribution (5.2) had allowed for more than three possible values, then even for $n = 2$ the computations would have been more involved. The example should also suggest, however, that there are some general relationships between $E(\bar{X})$, $V(\bar{X})$, $E(S^2)$, and the mean $\mu$ and variance $\sigma^2$ of the original distribution. These are stated in the next section. Now consider an example in which the random sample is drawn from a continuous distribution.


  Example 5.22 Service time for a certain type of bank transaction is a random variable having an exponential distribution with parameter $\lambda$. Suppose $X_1$ and $X_2$ are service times for two different customers, assumed independent of each other. Consider the total service time $T_o = X_1 + X_2$ for the two customers, also a statistic. The cdf of $T_o$ is, for $t \ge 0$,

  $$
  \begin{aligned}
  F_{T_o}(t) = P(X_1 + X_2 \le t) &= \iint_{\{(x_1, x_2):\, x_1 + x_2 \le t\}} f(x_1, x_2)\, dx_1\, dx_2 \\
  &= \int_0^t \int_0^{t - x_1} \lambda e^{-\lambda x_1} \cdot \lambda e^{-\lambda x_2}\, dx_2\, dx_1 = \int_0^t \left[\lambda e^{-\lambda x_1} - \lambda e^{-\lambda t}\right] dx_1 \\
  &= 1 - e^{-\lambda t} - \lambda t e^{-\lambda t}
  \end{aligned}
  $$

  The region of integration is pictured in Figure 5.10.

  Figure 5.10 Region of integration to obtain cdf of $T_o$ in Example 5.22 (axes $x_1$ and $x_2$; the region is bounded by the line $x_1 + x_2 = t$, which passes through the point $(x_1, t - x_1)$)

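  The pdf of $T_o$ is obtained by differentiating $F_{T_o}(t)$:

  $$f_{T_o}(t) = \begin{cases} \lambda^2 t e^{-\lambda t} & t \ge 0 \\ 0 & t < 0 \end{cases} \qquad (5.5)$$

  This is a gamma pdf ($\alpha = 2$ and $\beta = 1/\lambda$). The pdf of $\bar{X} = T_o/2$ is obtained from the relation $\{\bar{X} \le \bar{x}\}$ iff $\{T_o \le 2\bar{x}\}$ as

  $$f_{\bar{X}}(\bar{x}) = \begin{cases} 4\lambda^2 \bar{x} e^{-2\lambda \bar{x}} & \bar{x} \ge 0 \\ 0 & \bar{x} < 0 \end{cases} \qquad (5.6)$$

  The mean and variance of the underlying exponential distribution are $\mu = 1/\lambda$ and $\sigma^2 = 1/\lambda^2$. From Expressions (5.5) and (5.6), it can be verified that $E(\bar{X}) = 1/\lambda$, $V(\bar{X}) = 1/(2\lambda^2)$, $E(T_o) = 2/\lambda$, and $V(T_o) = 2/\lambda^2$. These results again suggest some general relationships between means and variances of $\bar{X}$, $T_o$, and the underlying distribution.

  ■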
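  The derived cdf and the stated mean and variance of $T_o$ can be checked against simulated service times. In the sketch below the rate $\lambda = 0.5$, the evaluation point $t = 3$, the seed, and the number of simulated pairs are all arbitrary choices made for illustration, since Example 5.22 leaves $\lambda$ unspecified.

```python
import math
import random

random.seed(2)   # arbitrary seed for repeatability
lam = 0.5        # hypothetical rate; the example does not give a value
k = 20000        # number of simulated customer pairs (arbitrary)

# Each total is the sum of two independent exponential service times.
totals = [random.expovariate(lam) + random.expovariate(lam) for _ in range(k)]


def cdf_total(t, lam):
    """cdf of T_o = X1 + X2 derived in Example 5.22."""
    if t < 0:
        return 0.0
    return 1 - math.exp(-lam * t) - lam * t * math.exp(-lam * t)


t = 3.0
empirical = sum(1 for v in totals if v <= t) / k
theoretical = cdf_total(t, lam)

mean_total = sum(totals) / k
var_total = sum((v - mean_total) ** 2 for v in totals) / (k - 1)

print(f"P(T_o <= {t}): simulated {empirical:.4f}, formula {theoretical:.4f}")
print(f"E(T_o): simulated {mean_total:.3f}, formula {2 / lam:.3f}")
print(f"V(T_o): simulated {var_total:.3f}, formula {2 / lam**2:.3f}")
```

  The simulated and derived values should agree to within simulation error, which shrinks as the number of simulated pairs grows.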

  Simulation Experiments

  The second method for obtaining information about a statistic's sampling distribution is to perform a simulation experiment. This method is usually used when a derivation via probability rules is very difficult or even impossible. Such an experiment is virtually always done with the aid of a computer. The following characteristics of an experiment must be specified:

  1. The statistic of interest ($\bar{X}$, $S$, a particular trimmed mean, etc.)

  2. The population distribution (normal with $\mu = 100$ and $\sigma = 15$, uniform with lower limit $A = 5$ and upper limit $B = 10$, etc.)

  3. The sample size $n$ (e.g., $n = 10$ or $n = 50$)

  4. The number of replications $k$ (number of samples to be obtained)


  Then use appropriate software to obtain $k$ different random samples, each of size $n$, from the designated population distribution. For each sample, calculate the value of the statistic and construct a histogram of the $k$ values. This histogram gives the approximate sampling distribution of the statistic. The larger the value of $k$, the better the approximation will tend to be (the actual sampling distribution emerges as $k \to \infty$). In practice, $k = 500$ or 1000 is usually sufficient if the statistic is "fairly simple."
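  The four specifications and the procedure just described map directly onto a small program. The sketch below is one possible arrangement in Python's standard library; the function names, the text-based histogram, and the particular choices (sample mean, uniform population on [5, 10], $n = 10$, $k = 500$) are illustrative assumptions rather than a prescribed implementation.

```python
import random
import statistics


def simulate_statistic(statistic, draw, n, k, seed=None):
    """Return k values of `statistic`, each computed from a fresh sample of size n.

    `draw` is a function that takes a random.Random object and returns one
    observation from the population distribution.
    """
    rng = random.Random(seed)
    return [statistic([draw(rng) for _ in range(n)]) for _ in range(k)]


def histogram_counts(values, num_bins=10):
    """Counts for num_bins equal-width classes spanning the observed values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)  # the maximum goes in the last class
        counts[idx] += 1
    return lo, width, counts


# 1. statistic: sample mean   2. population: uniform on [5, 10]
# 3. sample size n = 10       4. replications k = 500
xbars = simulate_statistic(
    statistics.mean, lambda rng: rng.uniform(5, 10), n=10, k=500, seed=0
)

lo, width, counts = histogram_counts(xbars)
for i, c in enumerate(counts):
    left = lo + i * width
    print(f"[{left:5.2f}, {left + width:5.2f})  {c / len(xbars):.3f}  {'*' * (c // 5)}")
```

  Passing the statistic and the population sampler as arguments keeps the four specifications separate, so changing any one of them (for example, replacing `statistics.mean` with `statistics.stdev`) does not require touching the rest of the code.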

  Example 5.23 The population distribution for our first simulation study is normal with $\mu = 8.25$ and $\sigma = .75$, as pictured in Figure 5.11. [The article "Platelet Size in Myocardial Infarction" (British Med. J., 1983: 449–451) suggests this distribution for platelet volume in individuals with no history of serious heart problems.]

  Figure 5.11 Normal distribution, with $\mu = 8.25$ and $\sigma = .75$ (horizontal axis marked at 6.00, 6.75, 7.50, 8.25, 9.00, 9.75, and 10.50)

  We actually performed four different experiments, with 500 replications for each one. In the first experiment, 500 samples of $n = 5$ observations each were generated using Minitab, and the sample sizes for the other three were $n = 10$, $n = 20$, and $n = 30$, respectively. The sample mean was calculated for each sample, and the resulting histograms of $\bar{x}$ values appear in Figure 5.12.

  Figure 5.12 Sample histograms for $\bar{x}$ based on 500 samples, each consisting of $n$ observations: (a) $n = 5$; (b) $n = 10$; (c) $n = 20$; (d) $n = 30$ (vertical axis: relative frequency)

  Copyright 2016 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).


  The first thing to notice about the histograms is their shape. To a reasonable approximation, each of the four looks like a normal curve. The resemblance would be even more striking if each histogram had been based on many more than 500 $\bar{x}$ values. Second, each histogram is centered approximately at 8.25, the mean of the population being sampled. Had the histograms been based on an unending sequence of $\bar{x}$ values, their centers would have been exactly the population mean, 8.25.

  The final aspect of the histograms to note is their spread relative to one another. The larger the value of $n$, the more concentrated is the sampling distribution about the mean value. This is why the histograms for $n = 20$ and $n = 30$ are based on narrower class intervals than those for the two smaller sample sizes. For the larger sample sizes, most of the $\bar{x}$ values are quite close to 8.25. This is the effect of averaging. When $n$ is small, a single unusual $x$ value can result in an $\bar{x}$ value far from the center. With a larger sample size, any unusual $x$ values, when averaged in with the other sample values, still tend to yield an $\bar{x}$ value close to $\mu$. Combining these insights yields a result that should appeal to your intuition: $\bar{X}$ based on a large $n$ tends to be closer to $\mu$ than does $\bar{X}$ based on a small $n$.

  ■
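  The center and spread observations in Example 5.23 can be examined numerically without drawing histograms. The sketch below repeats the four experiments using Python's standard library instead of Minitab, so its output will differ from Figure 5.12 in detail; the seed is arbitrary. The comparison column $\sigma/\sqrt{n}$ assumes that the pattern $V(\bar{X}) = \sigma^2/n$ seen for $n = 2$ and $n = 4$ in Example 5.21 holds in general.

```python
import math
import random
import statistics

random.seed(3)                 # arbitrary seed for repeatability
MU, SIGMA, K = 8.25, 0.75, 500  # population parameters and replications from Example 5.23

results = {}
for n in (5, 10, 20, 30):
    # K sample means, each from a fresh sample of n normal observations.
    xbars = [
        statistics.mean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(K)
    ]
    results[n] = (
        statistics.mean(xbars),    # center of the simulated xbar values
        statistics.stdev(xbars),   # spread of the simulated xbar values
        SIGMA / math.sqrt(n),      # assumed theoretical spread, sigma / sqrt(n)
    )

print("   n   mean of xbars   sd of xbars   sigma/sqrt(n)")
for n, (center, spread, theory) in results.items():
    print(f"{n:4d}   {center:13.4f}   {spread:11.4f}   {theory:13.4f}")
```

  The centers should all sit near 8.25 while the spread shrinks steadily as $n$ grows, which is the numerical counterpart of the narrowing histograms.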

  Example 5.24 Consider a simulation experiment in which the population distribution is quite skewed. Figure 5.13 shows the density curve for lifetimes of a certain type of electronic control [this is actually a lognormal distribution with $E(\ln(X)) = 3$ and $V(\ln(X)) = .16$]. Again the statistic of interest is the sample mean $\bar{X}$. The experiment utilized 500 replications and considered the same four sample sizes as in Example 5.23. The resulting histograms along with a normal probability plot from Minitab for the 500 $\bar{x}$ values based on $n = 30$ are shown in Figure 5.14.

  Figure 5.13 Density curve for the simulation experiment of Example 5.24 [$E(X) = 21.7584$, $V(X) = 82.1449$]

  Figure 5.14 Results of the simulation experiment of Example 5.24: (a) $\bar{x}$ histogram for $n = 5$; (b) $\bar{x}$ histogram for $n = 10$; (c) $\bar{x}$ histogram for $n = 20$; (d) $\bar{x}$ histogram for $n = 30$; (e) normal probability plot for $n = 30$ (from Minitab)


  Unlike the normal case, these histograms all differ in shape. In particular, they become progressively less skewed as the sample size $n$ increases. The averages of the 500 $\bar{x}$ values for the four different sample sizes are all quite close to the mean value of the population distribution. If each histogram had been based on an unending sequence of $\bar{x}$ values rather than just 500, all four would have been centered at exactly 21.7584. Thus different values of $n$ change the shape but not the center of the sampling distribution of $\bar{X}$. Comparison of the four histograms in Figure 5.14 also shows that as $n$ increases, the spread of the histograms decreases. Increasing $n$ results in a greater degree of concentration about the population mean value and makes the histogram look more like a normal curve. The histogram of Figure 5.14(d) and the normal probability plot in Figure 5.14(e) provide convincing evidence that a sample size of $n = 30$ is sufficient to overcome the skewness of the population distribution and give an approximately normal $\bar{X}$ sampling distribution.

  ■
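  A rough numerical version of the Example 5.24 experiment is sketched below. It uses Python's standard library in place of Minitab, an arbitrary seed, and a moment-based skewness coefficient (a helper written for this illustration) as a stand-in for judging histogram shape by eye; with only 500 replications the skewness values are themselves fairly noisy.

```python
import math
import random
import statistics

random.seed(4)                     # arbitrary seed for repeatability
MU_LOG = 3.0                       # E(ln X) from Example 5.24
SIGMA_LOG = math.sqrt(0.16)        # sqrt of V(ln X) = .16
K = 500                            # replications, as in the example

# Mean of a lognormal distribution: exp(mu + sigma^2 / 2).
pop_mean = math.exp(MU_LOG + SIGMA_LOG**2 / 2)


def skewness(values):
    """Moment-based skewness coefficient (hypothetical helper)."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s**3)


summary = {}
for n in (5, 10, 20, 30):
    xbars = [
        statistics.mean([random.lognormvariate(MU_LOG, SIGMA_LOG) for _ in range(n)])
        for _ in range(K)
    ]
    summary[n] = (statistics.mean(xbars), statistics.stdev(xbars), skewness(xbars))

print(f"population mean = {pop_mean:.4f}")
print("   n   mean of xbars   sd of xbars   skewness")
for n, (center, spread, skew) in summary.items():
    print(f"{n:4d}   {center:13.4f}   {spread:11.4f}   {skew:8.3f}")
```

  The centers should stay near 21.7584 for every $n$ while the spread falls, and the skewness coefficient should tend toward 0 as $n$ increases, although individual runs can fluctuate.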

  EXERCISES Section 5.3 (37–45)

  37. A particular brand of dishwasher soap is sold in three sizes: 25 oz, 40 oz, and 65 oz. Twenty percent of all purchasers select a 25-oz box, 50% select a 40-oz box, and the remaining 30% choose a 65-oz box. Let $X_1$ and $X_2$ denote the package sizes selected by two independently selected purchasers.

    a. Determine the sampling distribution of $\bar{X}$, calculate $E(\bar{X})$, and compare to $\mu$.

    b. Determine the sampling distribution of the sample variance $S^2$, calculate $E(S^2)$, and compare to $\sigma^2$.

  38. There are two traffic lights on a commuter's route to and from work. Let $X_1$ be the number of lights at which the commuter must stop on his way to work, and $X_2$ be the number of lights at which he must stop when returning from work. Suppose these two variables are independent, each with pmf given in the accompanying table (so $X_1, X_2$ is a random sample of size $n = 2$).

      $x_1$       0     1     2
      $p(x_1)$    .2    .5    .3          $\mu = 1.1$, $\sigma^2 = .49$

    a. Determine the pmf of $T_o = X_1 + X_2$.

    b. Calculate $\mu_{T_o}$. How does it relate to $\mu$, the population mean?

    c. Calculate $\sigma^2_{T_o}$. How does it relate to $\sigma^2$, the population variance?

    d. Let $X_3$ and $X_4$ be the number of lights at which a stop is required when driving to and from work on a second day assumed independent of the first day. With $T_o$ = the sum of all four $X_i$'s, what now are the values of $E(T_o)$ and $V(T_o)$?

    e. Referring back to (d), what are the values of $P(T_o = 8)$ and $P(T_o \ge 7)$? [Hint: Don't even think of listing all possible outcomes!]

  39. It is known that 80% of all brand A external hard drives work in a satisfactory manner throughout the warranty period (are "successes"). Suppose that $n = 15$ drives are randomly selected. Let $X$ = the number of successes in the sample. The statistic $X/n$ is the sample proportion (fraction) of successes. Obtain the sampling distribution of this statistic. [Hint: One possible value of $X/n$ is .2, corresponding to $X = 3$. What is the probability of this value (what kind of rv is $X$)?]

  40. A box contains ten sealed envelopes numbered 1, . . . , 10. The first five contain no money, the next three each contains \$5, and there is a \$10 bill in each of the last two. A sample of size 3 is selected with replacement (so we have a random sample), and you get the largest amount in any of the envelopes selected. If $X_1$, $X_2$, and $X_3$ denote the amounts in the selected envelopes, the statistic of interest is $M$ = the maximum of $X_1$, $X_2$, and $X_3$.

    a. Obtain the probability distribution of this statistic.

    b. Describe how you would carry out a simulation experiment to compare the distributions of $M$ for various sample sizes. How would you guess the distribution would change as $n$ increases?

  41. Let $X$ be the number of packages being mailed by a randomly selected customer at a certain shipping facility. Suppose the distribution of $X$ is as follows:

      x       1     2     3     4
      p(x)    .4

    a. Consider a random sample of size $n = 2$ (two customers), and let $\bar{X}$ be the sample mean number of packages shipped. Obtain the probability distribution of $\bar{X}$.

    b. Refer to part (a) and calculate $P(\bar{X} \le 2.5)$.

    c. Again consider a random sample of size $n = 2$, but now focus on the statistic $R$ = the sample range (difference between the largest and smallest values in the sample). Obtain the distribution of $R$. [Hint: Calculate the value of $R$ for each outcome and use the probabilities from part (a).]

    d. If a random sample of size $n = 4$ is selected, what is $P(\bar{X} \le 1.5)$? [Hint: You should not have to list all possible outcomes, only those for which $\bar{x} \le 1.5$.]

  42. A company maintains three offices in a certain region, each staffed by two employees. Information concerning yearly salaries (1000s of dollars) is as follows:

      Office      1      1      2      2      3      3
      Employee    1      2      3      4      5      6
      Salary      29.7   33.6   30.2   33.6   25.8   29.7

    a. Suppose two of these employees are randomly selected from among the six (without replacement). Determine the sampling distribution of the sample mean salary $\bar{X}$.

    b. Suppose one of the three offices is randomly selected. Let $X_1$ and $X_2$ denote the salaries of the two employees. Determine the sampling distribution of $\bar{X}$.

    c. How does $E(\bar{X})$ from parts (a) and (b) compare to the population mean salary $\mu$?

  43. Suppose the amount of liquid dispensed by a certain machine is uniformly distributed with lower limit $A = 8$ oz and upper limit $B = 10$ oz. Describe how you would carry out simulation experiments to compare the sampling distribution of the (sample) fourth spread for sample sizes $n = 5$, 10, 20, and 30.

  44. Carry out a simulation experiment using a statistical computer package or other software to study the sampling distribution of $\bar{X}$ when the population distribution is Weibull with $\alpha = 2$ and $\beta = 5$, as in Example 5.20. Consider the four sample sizes $n = 5$, 10, 20, and 30, and in each case use 1000 replications. For which of these sample sizes does the $\bar{X}$ sampling distribution appear to be approximately normal?

  45. Carry out a simulation experiment using a statistical computer package or other software to study the sampling distribution of $\bar{X}$ when the population distribution is lognormal with $E(\ln(X)) = 3$ and $V(\ln(X)) = 1$. Consider the four sample sizes $n = 10$, 20, 30, and 50, and in each case use 1000 replications. For which of these sample sizes does the $\bar{X}$ sampling distribution appear to be approximately normal?
