Estimating a Proportion

3.3 Estimating a Proportion

Suppose one wishes to estimate the probability of occurrence, p, of a “success” event in a series of n Bernoulli trials. A Bernoulli trial is an experiment with a dichotomous outcome (see B.1.1). Let k be the number of occurrences of the success event. Then the unbiased and consistent point estimate of p is (see Appendix C):

$$\hat{p} = \frac{k}{n}.$$

For instance, if there are k = 5 successes in n = 15 trials, the point estimate of p (estimation of a proportion) is $\hat{p} = 0.33$. Let us now construct an interval estimation for p. Remember that the sampling distribution of the number of “successes” is the binomial distribution (see B.1.5). Given the discreteness of the binomial distribution, it may be impossible to find an interval which has exactly the desired confidence level. It is possible, however, to choose an interval which covers p with probability at least 1 – α.

Table 3.2. Cumulative binomial probabilities B(k) for n = 15, p = 0.33.

k      0      1      2      3      4      5      6      7      8      9      10
B(k)   0.002  0.021  0.083  0.217  0.415  0.629  0.805  0.916  0.971  0.992  0.998

Consider the cumulative binomial probabilities for n = 15, p = 0.33, as shown in Table 3.2. Using the values of this table, we can compute the following probabilities for intervals centred at k = 5:

P(4 ≤ k ≤ 6) = B(6) – B(3) = 0.59
P(3 ≤ k ≤ 7) = B(7) – B(2) = 0.83
P(2 ≤ k ≤ 8) = B(8) – B(1) = 0.95
P(1 ≤ k ≤ 9) = B(9) – B(0) = 0.99

Therefore, a 95% confidence interval corresponds to:

$$2 \le k \le 8 \;\Rightarrow\; \frac{2}{15} \le p \le \frac{8}{15} \;\Rightarrow\; 0.13 \le p \le 0.53.$$
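The table lookup above can be checked numerically. The following is a minimal Python sketch, not the book's own procedure: the helper names `binom_cdf` and `symmetric_count_interval` are hypothetical, and the search simply widens an interval centred at k = 5 until its binomial probability reaches the requested level.

```python
from math import comb

def binom_cdf(k, n, p):
    """Cumulative binomial probability B(k) = P(X <= k)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def symmetric_count_interval(n, p, centre, level):
    """Narrowest interval [centre - d, centre + d] of counts whose
    binomial probability is at least `level` (hypothetical helper)."""
    for d in range(n + 1):
        lo, hi = max(centre - d, 0), min(centre + d, n)
        below = binom_cdf(lo - 1, n, p) if lo > 0 else 0.0
        coverage = binom_cdf(hi, n, p) - below
        if coverage >= level:
            return lo, hi, coverage
    return 0, n, 1.0  # the full range always covers every outcome

n, p = 15, 0.33  # values used in Table 3.2
lo, hi, coverage = symmetric_count_interval(n, p, centre=5, level=0.95)
print(f"{lo} <= k <= {hi}, probability {coverage:.3f}")
print(f"{lo / n:.2f} <= p <= {hi / n:.2f}")
```

The search stops at the first interval that reaches the level, which mirrors the "probability at least 1 – α" requirement imposed by the discreteness of the binomial distribution.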

This is too large an interval to be useful. This example shows the inherently high degree of uncertainty when performing an interval estimation of a proportion with small n. For large n (say n > 50), we use the normal approximation to the binomial distribution as described in section A.7.3. Therefore, the sampling distribution of $\hat{p}$ is modelled as $N_{\mu,\sigma}$ with:

$$\mu = p; \quad \sigma = \sqrt{\frac{pq}{n}} \quad (q = 1 - p;\ \text{see A.7.3}). \tag{3.14}$$

Thus, the large sample confidence interval of a proportion is:

$$\hat{p} - z_{1-\alpha/2}\sqrt{pq/n} < p < \hat{p} + z_{1-\alpha/2}\sqrt{pq/n}. \tag{3.15}$$

This is the formula already alluded to in Chapter 1, when describing the “uncertainties” about the estimation of a proportion. Note that when applying formula 3.15, one usually replaces the true standard deviation with its point estimate, i.e., one computes:

$$\hat{p} - z_{1-\alpha/2}\sqrt{\hat{p}\hat{q}/n} < p < \hat{p} + z_{1-\alpha/2}\sqrt{\hat{p}\hat{q}/n}. \tag{3.16}$$


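The deviation of this formula from the exact formula is negligible for large n (see e.g. Spiegel MR, Schiller J, Srinivasan RA, 2000, for details). One can also assume a worst case situation for σ, corresponding to $p = q = 1/2 \Rightarrow \sigma = (2\sqrt{n})^{-1}$. Since $z_{0.975} = 1.96 \approx 2$, the half-width $z_{1-\alpha/2}\,\sigma$ is then approximately $1/\sqrt{n}$, and the approximate 95% confidence interval is now easy to remember:

$$\hat{p} \pm \frac{1}{\sqrt{n}}.$$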

Also, note that if we decrease the tolerance while maintaining n, the confidence level decreases as already mentioned in Chapter 1 and shown in Figure 1.6.
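The trade-off between tolerance and confidence level can be made concrete with a small Python sketch. It assumes the normal approximation and the worst case σ = 1/(2√n) discussed above; the function name `worst_case_confidence`, the sample size n = 100 and the three tolerances are illustrative choices, not values from the text.

```python
from math import sqrt
from statistics import NormalDist

def worst_case_confidence(n, tol):
    """Approximate P(|p_hat - p| < tol) under the normal approximation,
    using the worst-case standard deviation sigma = 1 / (2 * sqrt(n))."""
    sigma = 1 / (2 * sqrt(n))
    return 2 * NormalDist().cdf(tol / sigma) - 1

n = 100  # illustrative sample size (assumption)
levels = {tol: worst_case_confidence(n, tol) for tol in (0.10, 0.05, 0.02)}
for tol, level in levels.items():
    print(f"tolerance {tol:.2f}: confidence level about {level:.1%}")
```

With n fixed, a tolerance of 1/√n = 0.10 corresponds to a level of roughly 95%, and each smaller tolerance yields a lower confidence level.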

Example 3.5

Q: Consider, for the Freshmen dataset, the estimation of the proportion of freshmen that are displaced from their home (variable DISPL). Compute the 95% confidence interval of this proportion.

A: There are n = 132 cases, 37 of which are displaced, i.e., $\hat{p} = 0.28$. Applying formula 3.16, we have:

$$\hat{p} - 1.96\sqrt{\hat{p}\hat{q}/n} < p < \hat{p} + 1.96\sqrt{\hat{p}\hat{q}/n} \;\Rightarrow\; 0.20 < p < 0.36.$$
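The same computation can be reproduced in a few lines of Python. This is a sketch of the large sample interval with the standard deviation replaced by its point estimate; the function name `proportion_ci` is hypothetical and is not the `ciprop` function supplied with the book.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(k, n, alpha=0.05):
    """Large-sample confidence interval for a proportion, with the
    standard deviation estimated as sqrt(p_hat * q_hat / n)."""
    p_hat = k / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}, about 1.96 for alpha = 0.05
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Counts from Example 3.5: 37 displaced freshmen out of 132 cases.
low, high = proportion_ci(37, 132)
print(f"{low:.2f} < p < {high:.2f}")  # prints "0.20 < p < 0.36"
```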

Note that this confidence interval is quite large. The following example will give some hint as to when we start obtaining reasonably useful confidence intervals.

Example 3.6

Q: Consider the interval estimation of a proportion in the same conditions as the previous example, i.e., with estimated proportion $\hat{p} = 0.28$ and α = 5%. How large should the sample size be for the confidence interval endpoints to deviate less than ε = 2% from the estimated proportion?

A: In general, we must apply the following condition:

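$$z_{1-\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le \varepsilon \;\Rightarrow\; n \ge \left(\frac{z_{1-\alpha/2}}{\varepsilon}\right)^{2}\hat{p}\hat{q}. \tag{3.17}$$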

In the present case, $(1.96/0.02)^2 \times 0.28 \times 0.72 \approx 1936.2$, so we must have n ≥ 1937. As with the estimation of a mean, n grows with the square of 1/ε. As a matter of fact, assuming the worst case situation for σ, as we did above, the following approximate formula for the 95% confidence level holds: $n \gtrsim (1/\varepsilon)^2$.
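The sample size condition can be evaluated directly. The Python sketch below assumes the bound n ≥ (z/ε)² p̂q̂ implied by the condition in Example 3.6; the function name `min_sample_size` is hypothetical, and the three results compare the estimated proportion, the worst case p = q = ½, and the rule of thumb (1/ε)².

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(p_hat, eps, alpha=0.05):
    """Smallest n for which z_{1 - alpha/2} * sqrt(p_hat * q_hat / n) <= eps."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z / eps) ** 2 * p_hat * (1 - p_hat))

eps = 0.02
n_est = min_sample_size(0.28, eps)   # estimated proportion of Example 3.6
n_worst = min_sample_size(0.5, eps)  # worst case p = q = 1/2
n_rule = round((1 / eps) ** 2)       # rule of thumb (1 / eps)^2
print(n_est, n_worst, n_rule)
```

The rule of thumb rounds 1.96 up to 2, so it slightly overstates the worst-case requirement; both act as upper bounds on the size needed for any particular estimated proportion.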

Confidence intervals for proportions, and lower bounds on n achieving a desired deviation in proportion estimation, can be computed with Tools.xls. Interval estimation of a proportion can be carried out with SPSS, STATISTICA, MATLAB and R in the same way as we did with means. The only preliminary step is to convert the variable being analysed into a Bernoulli type variable, i.e., a binary variable with 1 coding the “success” event, and 0 the “failure” event. As a matter of fact, a dataset $x_1, \ldots, x_n$ with k successes, represented as a sequence of values of Bernoulli random variables (therefore, with k ones and n – k zeros), has the following sample mean and sample variance:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{k}{n} = \hat{p}; \qquad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\hat{p}^2}{n-1} = \frac{n}{n-1}\,\hat{p}\hat{q} \approx \hat{p}\hat{q}.$$

In Example 3.5, variable DISPL with values 1 for “Yes” and 2 for “No” is converted into a Bernoulli type variable, DISPLB, e.g. by using the formula DISPLB = 2 – DISPL. Now, the “success” event (“Yes”) is coded 1, and the complement is coded 0. In SPSS and STATISTICA we can also use “if” constructs to build the Bernoulli variables. This is especially useful if one wants to create Bernoulli variables from continuous type variables. SPSS and STATISTICA also have a Rank command that can be useful for the purpose of creating Bernoulli variables.
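The recoding and its effect on the sample statistics can be sketched in Python. The list `displ` below is an assumption: it is rebuilt from the counts of Example 3.5 (37 “Yes”, 95 “No”) and does not reproduce the actual order of records in the Freshmen dataset.

```python
from statistics import mean, variance

# Stand-in for DISPL: 1 = "Yes", 2 = "No" (reconstructed from counts only).
displ = [1] * 37 + [2] * 95

# Bernoulli coding, DISPLB = 2 - DISPL: "Yes" becomes 1, "No" becomes 0.
displb = [2 - x for x in displ]

n, k = len(displb), sum(displb)
p_hat = k / n
x_bar = mean(displb)   # sample mean of the 0/1 values
s2 = variance(displb)  # sample variance with divisor n - 1

print(f"p_hat = {p_hat:.4f}, mean = {x_bar:.4f}")
print(f"s^2 = {s2:.4f}, n/(n-1) * p_hat * q_hat = {n / (n - 1) * p_hat * (1 - p_hat):.4f}")
```

Because the sample mean of the recoded variable equals the proportion estimate, any routine that builds a confidence interval for a mean can be applied to the 0/1 variable.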

Commands 3.4. MATLAB and R commands for obtaining confidence intervals of proportions.

MATLAB   ciprop(n0,n1,alpha)
R        ciprop(n0,n1,alpha)

There are no specific functions to compute confidence intervals of proportions in MATLAB and R. However, we provide for MATLAB and R the function ciprop(n0,n1,alpha) for that purpose (see Appendix F). For Example 3.5 we obtain in R:

> ciprop(95,37,0.05)