
3.2 Estimating a Mean

We now estimate the mean of a random variable X using a confidence interval around the sample mean, instead of a single measurement as in the previous section. Let x = [x₁, x₂, …, xₙ]’ be a random sample from a population described by the random variable X with mean µ and standard deviation σ. Let x̄ be the arithmetic mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .$$

Therefore, x̄ is a function tₙ(x) as in the general formulation of the previous section. The sampling distribution of X̄ (whose values are x̄), taking into account the properties of a sum of i.i.d. random variables (see section A.8.4), has the same mean as X and a standard deviation given by:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} .$$


Figure 3.3. Normal distribution of the arithmetic mean for several values of n and with µ = 0 (σ = 1 for n = 1).

Assuming that X is normally distributed, i.e., X ~ N(µ, σ), then X̄ is also normally distributed with mean µ and standard deviation σ_X̄. The confidence interval, following the procedure explained in the previous section, is now computed as:

$$\bar{x} - z_{1-\alpha/2}\,\sigma/\sqrt{n} \;<\; \mu \;<\; \bar{x} + z_{1-\alpha/2}\,\sigma/\sqrt{n} . \tag{3.9}$$

As shown in Figure 3.3, with increasing n the distribution of X̄ gets more peaked; therefore, the confidence intervals decrease with n (the precision of our estimates of the mean increases). This is precisely why computing averages is so popular!
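A minimal sketch of formula 3.9 in R (a hypothetical helper, here called zCImean, assuming σ is known and the sample is stored in a numeric vector x):

zCImean <- function(x, sigma, alpha = 0.05) {
  n <- length(x)                             # sample size
  z <- qnorm(1 - alpha/2)                    # normal percentile z_{1-alpha/2}
  mean(x) + c(-1, 1) * z * sigma / sqrt(n)   # lower and upper limits of 3.9
}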

In normal practice one does not know the exact value of σ, using instead the previously mentioned (section 2.3.2) point estimate s. In this case, the sampling distribution is no longer the normal distribution. However, taking into account Property 3 described in section B.2.8, the following random variable:

$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

has a Student’s t distribution with df = n – 1 degrees of freedom. The sample standard deviation of X̄, s/√n, is known as the standard error of the statistic x̄ and denoted SE.

We now compute the 1 − α/2 percentile of the Student’s t distribution with df = n – 1 degrees of freedom:

$$T_{n-1}(t) = 1 - \alpha/2 \;\Rightarrow\; t_{df,\,1-\alpha/2} , \tag{3.10}$$

and use this percentile in order to establish the two-sided confidence interval:

$$-t_{df,\,1-\alpha/2} \;<\; \frac{\bar{x} - \mu}{SE} \;<\; t_{df,\,1-\alpha/2} , \tag{3.11}$$

or, equivalently:

$$\bar{x} - t_{df,\,1-\alpha/2}\,SE \;<\; \mu \;<\; \bar{x} + t_{df,\,1-\alpha/2}\,SE . \tag{3.12}$$

Since the Student’s t distribution is less peaked than the normal distribution, one obtains larger intervals when using formula 3.12 than when using formula 3.9, reflecting the added uncertainty about the true value of the standard deviation.
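A minimal R sketch of formula 3.12 (a hypothetical helper, here called tCImean, assuming the sample is stored in a numeric vector x); it reproduces the interval returned by t.test(x)$conf.int (see Commands 3.1 below):

tCImean <- function(x, alpha = 0.05) {
  n  <- length(x)
  SE <- sd(x) / sqrt(n)                # standard error of the mean
  t  <- qt(1 - alpha/2, df = n - 1)    # Student's t percentile t_{df,1-alpha/2}
  mean(x) + c(-1, 1) * t * SE          # lower and upper limits of 3.12
}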

When applying these results one must note that:

– For large n, the Central Limit theorem (see sections A.8.4 and A.8.5) legitimises the assumption of a normal distribution of X̄ even when X is not normally distributed (under very general conditions).

– For large n, the Student’s t distribution does not deviate significantly from the normal distribution, and one can then use, for unknown σ, the same percentiles derived from the normal distribution as one would use in the case of known σ.


There are several values of n in the literature that are considered “large”, ranging from 20 to 30. As regards the normality assumption of X̄, the value n = 20 is usually enough. As to the deviation between z_{1−α/2} and t_{df,1−α/2}, it is about 5% for n = 25 and α = 0.05. In the sequel, we will use the threshold n = 25 to distinguish small samples from large samples. Therefore, when estimating a mean we adopt the following procedure:

1. Large sample (n ≥ 25): Use formula 3.9 (substituting σ by s) or formula 3.12 (if improved accuracy is needed). No normality assumption of X is needed.

2. Small sample (n < 25) and population distribution can be assumed to be normal: Use formula 3.12.

For simplicity, most software products use formula 3.12 irrespective of the value of n (for small n the normality assumption has to be checked using the goodness-of-fit tests described in section 5.1).
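The roughly 5% deviation between the z and t percentiles mentioned above can be checked with the quantile functions listed later in Commands 3.3; a quick check in R:

> qt(0.975, 24) / qnorm(0.975)   # about 1.05, i.e. a 5% deviation for n = 25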

Example 3.1

Q: Consider the data relative to the variable PRT for the first class (CLASS=1) of the Cork Stoppers’ dataset. Compute the 95% confidence interval of its mean.

A: There are n = 50 cases. The sample mean and sample standard deviation are x̄ = 365 and s = 110, respectively. The standard error is SE = s/√n = 15.6. We apply formula 3.12, obtaining the confidence interval:

x̄ ± t_{49, 0.975} × SE = x̄ ± 2.01 × 15.6 = 365 ± 31.

Notice that this confidence interval corresponds to a tolerance of 31/365 ≈ 8%. If, in this large sample situation, we used the normal approximation formula 3.9 instead, we would obtain a very close result.

Given the interpretation of a confidence interval (sections 3.1 and 1.5), we expect that in a large number of repetitions of 50 PRT measurements, under the same conditions used for the presented dataset, confidence intervals such as the one we have derived will cover the true PRT mean 95% of the time. In other words, when presenting [334, 396] as a confidence interval for the PRT mean, we incur only a 5% risk of being wrong by basing our estimate on an atypical dataset.
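This interval is easily reproduced in R using only the rounded summary statistics quoted above (the raw PRT values are not listed here):

> n <- 50; xbar <- 365; s <- 110            # summary statistics of Example 3.1
> SE <- s / sqrt(n)                          # standard error, about 15.6
> xbar + c(-1, 1) * qt(0.975, n - 1) * SE    # about [334, 396]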

Example 3.2

Q: Consider the subset of the previous PRT data constituted by the first n = 20 cases. Compute the 95% confidence interval of its mean.

A: The sample mean and sample standard deviation are now x̄ = 351 and s = 83, respectively. The standard error is SE = s/√n = 18.56. Since n = 20, we apply the small sample estimate formula 3.12, assuming that the PRT distribution can be well approximated by the normal distribution. (This assumption would have to be checked with the methods described in section 5.1.) Under these conditions the confidence interval is:

x̄ ± t_{19,0.975} × SE = x̄ ± 2.09 × SE ⇒ [312, 390].

If the 95% confidence interval were computed with the z percentile, one would wrongly obtain a narrower interval: [315, 387].
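With the rounded summary statistics of this example, both intervals can be checked in R:

> n <- 20; xbar <- 351; s <- 83              # summary statistics of Example 3.2
> SE <- s / sqrt(n)                           # about 18.56
> xbar + c(-1, 1) * qt(0.975, n - 1) * SE     # t interval, about [312, 390]
> xbar + c(-1, 1) * qnorm(0.975) * SE         # z interval, about [315, 387] (too narrow)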

Example 3.3

Q: How many cases of the PRT data should one have in order to establish a 95% confidence interval for its mean with a tolerance of 3%?

A: Since the tolerance is smaller than the one previously obtained in Example 3.1, we are clearly in a large sample situation. We have:

$$z_{1-\alpha/2}\,\frac{s}{\sqrt{n}} \le \varepsilon\,\bar{x} \;\Rightarrow\; n \ge \left(\frac{z_{1-\alpha/2}\,s}{\varepsilon\,\bar{x}}\right)^{2}.$$

Using the previous sample mean and sample standard deviation and with z_{0.975} = 1.96, one obtains:

n ≥ 558.

Note the growth of n with the square of 1/ε.

The solutions of all the previous examples can be easily computed using Tools.xls (see Appendix F).

An often used tool in Statistical Quality Control is the control chart for the sample mean, the so-called x-bar chart. The x-bar chart displays means, e.g. of measurements performed on equal-sized samples of manufactured items randomly drawn over time. The chart also shows the centre line (CL), corresponding to the nominal value or to the grand mean of a large sequence of samples, together with the upper control limit (UCL) and lower control limit (LCL), computed as a ±ks deviation from the centre line, usually with k = 3 and s the sample standard deviation. Samples falling above the UCL or below the LCL are said to be out of control. Sometimes, lines corresponding to a smaller deviation from the grand mean, e.g. with k = 2, are also drawn, corresponding to the so-called upper warning line (UWL) and lower warning line (LWL).
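A rough sketch of these limits in R, assuming the successive samples are stored as the rows of a numeric matrix y (statistical packages typically estimate the dispersion from within-sample variation, so their limits may differ from this simplified version):

xbar <- rowMeans(y)                     # one mean per sample (per row of y)
CL   <- mean(xbar)                      # centre line: grand mean
s    <- sd(xbar)                        # dispersion used for the k*s limits
UCL  <- CL + 3 * s; LCL <- CL - 3 * s   # 3-sigma control limits
UWL  <- CL + 2 * s; LWL <- CL - 2 * s   # 2-sigma warning lines
which(xbar > UCL | xbar < LCL)          # indices of out-of-control samples, if any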

Example 3.4

Q: Consider the first 48 measurements of total area of defects, for the first class of the Cork Stoppers dataset, as constituting 16 samples of 3 cork stoppers randomly drawn at successive times. Draw the respective x-bar chart with 3-sigma control lines and 2-sigma warning lines.


A: Using the MATLAB command xbarplot (see Commands 3.1), the x-bar chart shown in Figure 3.4 is obtained. We see that a warning should be issued for sample #1 and sample #12. No sample is out of control.


Figure 3.4. Control chart of the sample mean obtained with MATLAB for variable ART of the first cork stopper class.

Commands 3.1. SPSS, STATISTICA, MATLAB and R commands used to obtain confidence intervals of the mean.

SPSS         Analyze; Descriptive Statistics; Explore; Statistics;
             Confidence interval for mean

STATISTICA   Statistics; Descriptive Statistics; Conf. limits for means

MATLAB       [m s mi si]=normfit(x,delta) ; xbarplot(data,conf,specs)

R            t.test(x) ; cimean(x,alpha)

SPSS, STATISTICA, MATLAB and R compute confidence intervals for the mean using Student’s t distribution, even in the case of large samples.

The MATLAB normfit command computes the mean, m, standard deviation, s, and the respective confidence intervals, mi and si, of a data vector x, at a confidence level of 100(1 − delta)% (delta = 0.05 by default, i.e., 95% confidence intervals). For instance, assuming that the PRT data was stored in vector prt, Example 3.2 would be solved as:

» prt20 = prt(1:20);
» [m s mi si] = normfit(prt20)

m  = 350.6000
s  = 82.7071
mi = 311.8919  389.3081
si = 62.8979  120.7996

The MATLAB xbarplot command plots a control chart of the sample mean for the successive rows of data. Parameter conf specifies the percentile for the control limits (0.9973 for 3-sigma); parameter specs is a vector containing the values of extra specification lines. Figure 3.4 was obtained with:

» y = [ART(1:3:48) ART(2:3:48) ART(3:3:48)];
» xbarplot(y,0.9973,[89 185])

Confidence intervals for the mean are computed in R when using t.test (to be described in the following chapter). A specific function for computing the confidence interval of the mean, cimean(x,alpha), is included in Tools (see Appendix F).
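For instance, assuming the 20 PRT values of Example 3.2 are stored in a vector prt20, the confidence interval can be extracted from the t.test result:

> t.test(prt20)$conf.int                      # 95% confidence interval (default)
> t.test(prt20, conf.level = 0.99)$conf.int   # e.g. a 99% interval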

Commands 3.2. SPSS, STATISTICA, MATLAB and R commands for case selection.

SPSS         Data; Select cases

STATISTICA   Tools; Selection Conditions; Edit

MATLAB       x(x(:,i) == a,:)

R            x[col == a,]

In order to solve Examples 3.1 and 3.2 one needs to select the values of PRT for CLASS=1 and, inside this class, to select the first 20 cases. Selection of cases is an often-needed operation in statistical analysis. STATISTICA and SPSS provide specific windows where the user can fill in the conditions needed for case selection (see e.g. Figure 3.5a, corresponding to Example 3.2). Selection can be accomplished by means of logical conditions applied to the variables and/or the cases, as well as through the use of specially defined filter variables.

There is also the possibility of selecting random subsets of cases, as shown in Figures 3.5a (Subset/Random Sampling tab) and 3.5b (Random sample of cases option).


Figure 3.5. Selection of cases: a) Partial view of STATISTICA “Case Selection Conditions” window; b) Partial view of SPSS “Select Cases” window.

In MATLAB one may select a submatrix of matrix x, based on a particular value a of a column i, using the construction x(x(:,i)==a,:). For instance, assuming the first column of cork contains the classifications of the cork stoppers, c = cork(cork(:,1)==1,:) will retrieve the submatrix of cork corresponding to the first 50 cases of class 1. Other relational operators can be used instead of the equality operator “==”. (Attention: “=” is an assignment operator, “==” an equality operator.) For instance, c = cork(cork(:,1)<2,:) will have the same effect.

The selection of cases in R is usually based on the construction x[col == a,], which selects the submatrix whose column col is equal to a certain value a. For instance, cork[CL == 1,] selects the first 50 cases of class 1 of the data frame cork. As in MATLAB, other relational operators can be used instead of the equality operator “==”.

Selection of random subsets in MATLAB and R can be performed through the generation of filter variables using random number generators. An example is shown in Table 3.1. First, a filter variable with 150 random 0s and 1s is created by rounding random numbers with uniform distribution in [0,1]. Next, the filter variable is used to select a subset of the 150 cases of the cork data.

Table 3.1. Selecting a random subset of the cork stoppers’ dataset.

MATLAB   >> filter = round(unifrnd(0,1,150,1));
         >> fcork = cork(filter==1,:);

R        > filter <- round(runif(150,0,1))
         > fcork <- cork[filter==1,]


In parameter estimation one often needs to use percentiles of random distributions. We have already seen this for percentiles of the normal and the Student’s t distribution. Later on we will need to apply percentiles of the chi-square and F distributions. Statistical software usually provides a large panoply of probabilistic functions (density and cumulative distribution functions, quantile functions and random number generators with particular distributions). In Commands 3.3 we present some of the possibilities. Appendix D also provides tables of the most usual distributions.

Commands 3.3. SPSS, STATISTICA, MATLAB and R commands for obtaining quantiles of distributions.

SPSS         Compute Variable

STATISTICA   Statistics; Probability Calculator

MATLAB       norminv(p,mu,sigma) ; tinv(p,df) ; chi2inv(p,df) ; finv(p,df1,df2)

R            qnorm(p,mean,sd) ; qt(p,df) ; qchisq(p,df) ; qf(p,df1,df2)

The Compute Variable window of SPSS allows the use of functions to compute percentiles of distributions, namely the functions Idf.IGauss, Idf.T, Idf.Chisq and Idf.F for the normal, Student’s t, chi-square and F distributions, respectively.

STATISTICA provides a versatile Probability Calculator allowing among other things the computation of percentiles of many common distributions.

The MATLAB and R functions listed above compute quantiles of the normal, Student’s t, chi-square and F distributions, respectively.
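For instance, the percentiles used in the previous examples can be obtained in R as follows:

> qnorm(0.975)      # z_{0.975}, about 1.96
> qt(0.975, 49)     # t_{49,0.975}, about 2.01 (Example 3.1)
> qt(0.975, 19)     # t_{19,0.975}, about 2.09 (Example 3.2)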