
3.6 Bootstrap Estimation

In the previous sections we made use of some assumptions regarding the sampling distributions of data parameters. For instance, we assumed the sampling distribution of the variance to be a chi-square distribution whenever the normality assumption of the original data holds; likewise for the F sampling distribution of the variance ratio. The exception is the distribution of the arithmetic mean, which is always well approximated by the normal distribution, independently of the distribution law of the original data, whenever the data size is large enough. This is a consequence of the Central Limit theorem. However, no Central Limit theorem exists for parameters such as the variance, the median or the trimmed mean.

100 3 Estimating Data Parameters

The bootstrap idea (Efron, 1979) is to mimic the sampling distribution of the statistic of interest through the use of many resamples with replacement of the original sample. In the present chapter we restrict ourselves to illustrating the idea when applied to the computation of confidence intervals (bootstrap techniques cover a much wider area than mere confidence interval computation). Let us then illustrate the bootstrap computation of confidence intervals for the mean of the n = 50 PRT measurements for Class = 1 of the cork stoppers' dataset (as in Example 3.1). The histogram of these data is shown in Figure 3.7a.

Denoting by X the associated random variable, we compute the sample mean of the data as $\bar{x} = 365.0$. The standard deviation of the sample mean, the standard error, is estimated as $SE = s/\sqrt{n} = 15.6$. Since the dataset size n is not that large, one may have some suspicion concerning the bias of this estimate and the accuracy of the confidence interval based on the normality assumption.

Let us now consider extracting at random and with replacement m = 1000 samples of size n = 50 from the original dataset. These resamples are called bootstrap samples. Let us further consider that for each bootstrap sample we compute its mean. Figure 3.7b shows the histogram of the bootstrap distribution of the means. We see that this histogram looks similar to the normal distribution. As a matter of fact, the bootstrap distribution of a statistic usually mimics the sampling distribution of that statistic, which in this case happens to be normal.
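The resampling step just described can be sketched in Python (used here purely as an illustration, since the chapter's tools are MATLAB and R); the data vector below is a hypothetical stand-in for the 50 PRT measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the n = 50 PRT measurements of Class = 1.
data = rng.normal(365.0, 110.0, size=50)

m, n = 1000, len(data)
# Each bootstrap sample draws n values from the data *with replacement*;
# its mean is one point of the bootstrap distribution of the mean.
boot_means = np.array([rng.choice(data, size=n, replace=True).mean()
                       for _ in range(m)])
```

A histogram of boot_means then plays the role of Figure 3.7b.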

Let us denote each bootstrap mean by $\bar{x}^*$. The mean and standard deviation of the 1000 bootstrap means are computed as:

$\bar{x}_{boot} = \frac{1}{m} \sum \bar{x}^* = 365.1,$

$s_{boot} = \sqrt{\frac{1}{m-1} \sum \left(\bar{x}^* - \bar{x}_{boot}\right)^2} = 15.47,$

where the summations extend over the m = 1000 bootstrap samples.

We see that the mean of the bootstrap distribution is quite close to the original sample mean, with a bias of only $\bar{x}_{boot} - \bar{x} = 0.1$. It can be shown that this is usually the size of the bias that can be expected between $\bar{x}$ and the true population mean, µ. This property is not exclusive to the bootstrap distribution of the mean; it applies to other statistics as well.

The sample standard deviation of the bootstrap distribution, called the bootstrap standard error and denoted $SE_{boot}$, is also quite close to the theory-based estimate $SE = s/\sqrt{n}$. We could now use $SE_{boot}$ to compute a confidence interval for the mean. In the case of the mean there is not much advantage in doing so (we should get practically the same result as in Example 3.1), since we have the Central Limit theorem on which to base our confidence interval computations. The good thing about the bootstrap technique is that it often also works for other statistics for which no theory on the sampling distribution is available. As a matter of fact, the bootstrap distribution usually (for a not too small original sample size, say n > 50) has the same shape and spread as the original sampling distribution, but is centred at the original statistic value rather than at the true parameter value.

(Footnote: we should more rigorously say "one possible histogram", since different histograms are possible depending on the resampling process. For n and m sufficiently large they are, however, close to each other.)

Figure 3.7. a) Histogram of the PRT data; b) Histogram of the bootstrap means.

Suppose that the bootstrap distribution of a statistic, w, is approximately normal and that the bootstrap estimate of bias is small. We then compute a two-sided bootstrap confidence interval at α risk, for the parameter that corresponds to the statistic, by the following formula:

$w \pm t_{n-1,\,1-\alpha/2}\, SE_{boot}$

We may use the percentiles of the normal distribution, instead of the Student's t distribution, whenever m is very large. The question naturally arises: how large must the number of bootstrap samples be in order to obtain a reliable bootstrap distribution with reliable values of $SE_{boot}$? A good rule of thumb for m, based on theoretical and practical evidence, is to choose m ≥ 200.
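Putting the formula together with the resampling loop, a generic t-based bootstrap interval can be sketched in Python (an illustrative helper of our own, not part of the chapter's software; the name boot_t_interval is hypothetical):

```python
import numpy as np
from scipy import stats

def boot_t_interval(sample, statistic, m=1000, alpha=0.05, seed=None):
    """Two-sided bootstrap CI  w +/- t_{n-1, 1-alpha/2} * SE_boot.
    Valid when the bootstrap distribution is roughly normal and the
    bootstrap estimate of bias is small."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    # m resamples with replacement; statistic evaluated on each.
    reps = np.array([statistic(rng.choice(sample, size=n, replace=True))
                     for _ in range(m)])
    se_boot = reps.std(ddof=1)            # bootstrap standard error
    w = statistic(sample)                 # statistic on the original sample
    half = stats.t.ppf(1 - alpha / 2, df=n - 1) * se_boot
    return w - half, w + half
```

For instance, boot_t_interval(x, np.mean) reproduces, up to resampling noise, the normal-theory interval for the mean.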

The following examples illustrate the computation of confidence intervals using the bootstrap technique.

Example 3.9

Q: Consider the percentage of lime, CaO, in the composition of clays, a sample of which constitutes the Clays' dataset. Compute the confidence interval at 95% level of the two-tail 5% trimmed mean and discuss the results. (The two-tail 5% trimmed mean disregards 10% of the cases, 5% at each of the tails.)


A: The histogram and box plot of the CaO data (n = 94 cases) are shown in Figure 3.8. Denoting the associated random variable by X, we compute $\bar{x} = 0.28$. We observe in the box plot a considerable number of "outliers", which leads us to mistrust the sample mean as a location measure and to use instead the two-tail 5% trimmed mean, computed as (see Commands 2.7): $\bar{x}_{0.05} \equiv w = 0.2755$.


Figure 3.8. Histogram (a) and box plot (b) of the CaO data.

Figure 3.9. Histogram of the bootstrap distribution of the two-tail 5% trimmed mean of the CaO data (1000 resamples).

We now proceed to compute the bootstrap distribution with m = 1000 resamples. Figure 3.9 shows the histogram of the bootstrap distribution. It is clearly well approximated by the normal distribution (methods not relying on visual inspection are described in section 5.1). From the bootstrap distribution we compute:

$w_{boot} = 0.2764$, $SE_{boot} = 0.0093$.


The bias $w_{boot} - w = 0.2764 - 0.2755 = 0.0009$ is quite small (less than 10% of the standard deviation). We therefore compute the bootstrap confidence interval of the trimmed mean as:

$w \pm t_{93,\,0.975}\, SE_{boot} = 0.2755 \pm 1.9858 \times 0.0093 = 0.276 \pm 0.018$. □
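Readers without the book's software can reproduce the steps of Example 3.9 in Python; the CaO values below are simulated stand-ins (the real ones come from the Clays' dataset), so the numbers will differ from those above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cao = rng.normal(0.28, 0.09, size=94)   # hypothetical stand-in for CaO

# Two-tail 5% trimmed mean of the original sample.
w = stats.trim_mean(cao, 0.05)

# Bootstrap distribution of the trimmed mean (m = 1000 resamples).
reps = np.array([stats.trim_mean(rng.choice(cao, size=94, replace=True), 0.05)
                 for _ in range(1000)])
se_boot = reps.std(ddof=1)

half = stats.t.ppf(0.975, df=93) * se_boot
print(f"{w:.4f} +/- {half:.4f}")
```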

Example 3.10

Q: Compute the confidence interval at 95% level of the standard deviation for the data of the previous example.

A: The standard deviation of the original sample is s ≡ w = 0.086. The histogram of the bootstrap distribution of the standard deviation with m = 1000 resamples is shown in Figure 3.10. This empirical distribution is well approximated by the normal distribution. We compute:

$w_{boot} = 0.0854$, $SE_{boot} = 0.0070$.

The bias $w_{boot} - w = 0.0854 - 0.086 = -0.0006$ is quite small (less than 10% of the standard deviation). We therefore compute the bootstrap confidence interval of the standard deviation as:

$w \pm t_{93,\,0.975}\, SE_{boot} = 0.086 \pm 1.9858 \times 0.007 = 0.086 \pm 0.014$. □

Figure 3.10. Histogram of the bootstrap distribution of the standard deviation of the CaO data (1000 resamples).

Example 3.11

Q: Consider the variable ART (total area of defects) of the cork stoppers’ dataset. Using the bootstrap method compute the confidence interval at 95% level of its median.


A: The histogram and box plot of the ART data (n = 150 cases) are shown in Figure 3.11. The sample median and sample mean of ART are med ≡ w = 263 and $\bar{x}$ = 324, respectively. The distribution of ART is clearly right skewed; hence, the mean is substantially larger than the median (by almost one and a half times the standard deviation). The histogram of the bootstrap distribution of the median with m = 1000 resamples is shown in Figure 3.12. We compute:

$w_{boot} = 266.1210$, $SE_{boot} = 20.4335$.

The bias $w_{boot} - w = 266 - 263 = 3$ is quite small (less than 7% of the standard deviation). We therefore compute the bootstrap confidence interval of the median as:

$w \pm t_{149,\,0.975}\, SE_{boot} = 263 \pm 1.976 \times 20.4335 = 263 \pm 40$. □

Figure 3.11. Histogram (a) and box plot (b) of the ART data.

Figure 3.12. Histogram of the bootstrap distribution of the median of the ART data (1000 resamples).


In the above Example 3.11 we observe in Figure 3.12 a histogram that does not look well approximated by the normal distribution. As a matter of fact, any goodness of fit test described in section 5.1 will reject the normality hypothesis. This is a common difficulty when estimating bootstrap confidence intervals for the median: since the sample median is an order statistic, its bootstrap replicates can only take a limited set of values drawn from the original sample, making the bootstrap distribution discrete and lumpy. An explanation of the causes of this difficulty can be found, e.g., in (Hesterberg et al., 2003). The difficulty is even more severe when the data size n is small (see Exercise 3.20). Nevertheless, for data sizes larger than 100 cases, say, and for a large number of resamples, one can still rely on bootstrap estimates of the median as in Example 3.11.
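The lumpiness is easy to demonstrate with simulated data: for a small sample with odd n, the bootstrap median is always one of the original observations, so only a handful of distinct values appear among the 1000 replicates. A small Python illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.normal(size=15)     # a small hypothetical sample, n = 15 (odd)

# For odd n the median of a resample is always one of the original values,
# so the bootstrap distribution of the median is discrete, not normal-looking.
medians = [float(np.median(rng.choice(small, size=15, replace=True)))
           for _ in range(1000)]
print(len(set(medians)))        # at most 15 distinct values
```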

Example 3.12

Q: Consider the variables Al2O3 and K2O of the Clays' dataset (n = 94 cases). Using the bootstrap method compute the confidence interval at 95% level of their Pearson correlation.

A: The sample Pearson correlation of Al2O3 and K2O is r ≡ w = 0.6922. The histogram of the bootstrap distribution of the Pearson correlation with m = 1000 resamples is shown in Figure 3.13. It is well approximated by the normal distribution. From the bootstrap distribution we compute:

$w_{boot} = 0.6950$, $SE_{boot} = 0.0719$.

The bias $w_{boot} - w = 0.6950 - 0.6922 = 0.0028$ is quite small (about 0.4% of the correlation value). We therefore compute the bootstrap confidence interval of the Pearson correlation as:

$w \pm t_{93,\,0.975}\, SE_{boot} = 0.6922 \pm 1.9858 \times 0.0719 = 0.69 \pm 0.14$. □

Figure 3.13. Histogram of the bootstrap distribution of the Pearson correlation between the variables Al2O3 and K2O of the Clays’ dataset (1000 resamples).


We draw the reader's attention to the fact that when generating bootstrap samples of associated variables, as in the above Example 3.12, these have to be generated by drawing whole cases at random with replacement (and not by resampling the variables individually), thereby preserving the association between the variables involved.
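A Python sketch of this case-resampling for the correlation (with simulated paired data standing in for Al2O3 and K2O):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical paired data playing the role of Al2O3 and K2O (n = 94 cases).
x = rng.normal(size=94)
y = 0.7 * x + rng.normal(scale=0.7, size=94)

n, m = len(x), 1000
reps = np.empty(m)
for k in range(m):
    idx = rng.integers(0, n, size=n)   # resample case *indices* with replacement
    reps[k] = np.corrcoef(x[idx], y[idx])[0, 1]   # both variables share idx

se_boot = reps.std(ddof=1)
```

Resampling x and y with independent index vectors would destroy their association and produce bootstrap correlations centred near zero.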

Commands 3.7. MATLAB and R commands for obtaining bootstrap distributions.

MATLAB: bootstrp(m, 'statistic', arg1, arg2, ...)
R:      boot(x, statistic, m, stype="i", ...)

SPSS and STATISTICA don’t have menu options for obtaining bootstrap distributions (although SPSS has a bootstrap macro to be used in its Output Management System and STATISTICA has a bootstrapping facility built into its Structural Equation Modelling module).

The bootstrp function of MATLAB can be used directly with one of MATLAB's statistical functions, followed by its arguments. For instance, the bootstrap distribution of Example 3.9 can be obtained with:

>> b = bootstrp(1000,'trimmean',cao,10);

Notice the name of the statistical function written as a string (the trimmean function is indicated in Commands 2.7). The function call returns the vector b with the 1000 bootstrap replicates of the trimmed mean, from which one can obtain the histogram and other statistics.

Let us now consider Example 3.12. Assuming that columns 7 and 13 of the clays’ matrix represent the variables Al2O3 and K2O, respectively, one obtains the bootstrap distribution with:

>> b = bootstrp(1000,'corrcoef',clays(:,7),clays(:,13))

The corrcoef function (mentioned in Commands 2.9) generates a correlation matrix. Specifically, corrcoef(clays(:,7), clays(:,13)) produces:

ans =
    1.0000    0.6922
    0.6922    1.0000

As a consequence, each row of the b matrix contains in this case the four correlation matrix values of one bootstrap sample. For instance:

b =
    1.0000    0.6956    0.6956    1.0000
    1.0000    0.7019    0.7019    1.0000

Hence, one may obtain the histogram and the bootstrap statistics using b(:,2) or b(:,3).


In order to obtain bootstrap distributions with R one must first load the boot package with library(boot). One can check whether the package is loaded with the search() function (see section 1.7.2.2).

The boot function of the boot package will generate m bootstrap replicates of a statistical function, denoted statistic, passed as an argument. However, this function should have as second argument a vector of indices, frequencies or weights. In our applications we will use a vector of indices, which corresponds to setting the stype argument to its default value, stype="i". Since it is the default value we do not really need to mention it when calling boot. Nevertheless, the need for this second argument obliges one to write the code of the statistical function. Let us consider Example 3.10. Supposing the clays data frame has been created and attached, it would be solved in R in the following way:

> sdboot <- function(x,i) sd(x[i])
> b <- boot(CaO,sdboot,1000)

The first line defines the function sdboot with two arguments. The first argument is the data. The second argument is the vector of indices which will be used to store the index information of the bootstrap samples. The function itself computes the standard deviation of those data elements whose indices are in the index vector i (see the last paragraph of section 2.1.2.4).
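The statistic(data, indices) convention used by boot can be mimicked in other languages; here is a Python sketch of the same sdboot idea (the data are simulated, as we do not have the clays data frame at hand):

```python
import numpy as np

def sdboot(x, i):
    """R-boot-style statistic: standard deviation of the elements of x
    selected by the index vector i (one bootstrap sample)."""
    return np.std(x[i], ddof=1)

rng = np.random.default_rng(0)
cao = rng.normal(0.28, 0.09, size=94)   # hypothetical stand-in for CaO

n = len(cao)
# Each index vector selects one bootstrap sample; b collects the replicates.
b = np.array([sdboot(cao, rng.integers(0, n, size=n)) for _ in range(1000)])
```

Here b plays the role of the t attribute of the R bootstrap object.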

The boot function returns a so-called bootstrap object, denoted above as b. By listing b one may obtain:

Bootstrap Statistics :
        original         bias    std. error
t1*   0.08601075  -0.00082119   0.007099508

which agrees fairly well with the values computed with MATLAB in Example 3.10. One of the attributes of the bootstrap object is the vector with the bootstrap replicates, denoted t. The histogram of the bootstrap distribution can therefore be obtained with:

> hist(b$t)

Exercises

3.1 Consider the $1-\alpha_1$ and $1-\alpha_2$ confidence intervals of a given statistic, with $1-\alpha_1 > 1-\alpha_2$. Why is the confidence interval for $1-\alpha_1$ always larger than or equal to the interval for $1-\alpha_2$?

3.2 Consider the measurements of bottle bottoms of the Moulds dataset. Determine the 95% confidence interval of the mean and the x-charts of the three variables RC, CG and EG. Taking into account the x-chart, discuss whether the 95% confidence interval of the RC mean can be considered a reliable estimate.


3.3 Compute the 95% confidence interval of the mean and of the standard deviation of the RC variable of the previous exercise, for the samples constituted by the first 50 cases and by the last 50 cases. Comment on the results.

3.4 Consider the ASTV and ALTV variables of the CTG dataset. Assume that only a 15-case random sample is available for these variables. Can one expect to obtain reliable estimates of the 95% confidence interval of the mean of these variables using the Student’s t distribution applied to those samples? Why? (Inspect the variable histograms.)

3.5 Obtain a 15-case random sample of the ALTV variable of the previous exercise (see Commands 3.2). Compute the respective 95% confidence interval assuming a normal and an exponential fit to the data and compare the results. The exponential fit can be performed in MATLAB with the function expfit.

3.6 Compute the 90% confidence interval of the ASTV and ALTV variables of the previous Exercise 3.4 for 10 random samples of 20 cases and determine how many times the confidence interval contains the mean value determined for the whole 2126-case set. In a long run of these 20-case experiments, which variable is expected to yield a higher percentage of intervals containing the whole-set mean?

3.7 Compute the mean with the 95% confidence interval of variable ART of the Cork Stoppers dataset. Perform the same calculations on variable LOGART = ln(ART). Apply the Gauss’ approximation formula of A.6.1 in order to compare the results. Which point estimates and confidence intervals are more reliable? Why?

3.8 Consider the PERIM variable of the Breast Tissue dataset. What is the tolerance of the PERIM mean with 95% confidence for the carcinoma class? How many cases of the carcinoma class should one have available in order to reduce that tolerance to 2%?

3.9 Imagine that when analysing the TW="Team Work" variable of the Metal Firms dataset, someone stated that the team-work is at least good (score 4) for 3/8 = 37.5% of the metallurgic firms. Does this statement deserve any credit? (Compute the 95% confidence interval of this estimate.)

3.10 Consider the Culture dataset. Determine the 95% confidence interval of the proportion of boroughs spending more than 20% of the budget for musical activities.

3.11 Using the CTG dataset, determine the percentage of foetal heart rate cases that have abnormal short term variability of the heart rate more than 50% of the time, during calm sleep (CLASS A). Also, determine the 95% confidence interval of that percentage and how many cases should be available in order to obtain an interval estimate with 1% tolerance.

3.12 A proportion $\hat{p}$ was estimated in 225 cases. What are the approximate worst-case 95% confidence interval limits of the proportion?

3.13 Redo Exercises 3.2 and 3.3 for the 99% confidence interval of the standard deviation.


3.14 Consider the CTG dataset. Compute the 95% and 99% confidence intervals of the standard deviation of the ASTV variable. Are the confidence interval limits equally distant from the point estimate? Why?

3.15 Consider the computation of the confidence interval for the standard deviation performed in Example 3.6. How many cases should one have available in order to obtain confidence interval limits deviating less than 5% of the point estimate?

3.16 In order to represent the area values of the cork defects in a convenient measurement unit, the ART values of the Cork Stoppers dataset have been multiplied by 5 and stored into variable ART5. Using the point estimates and 95% confidence intervals of the mean and the standard deviation of ART, determine the respective statistics for ART5.

3.17 Consider the ART, ARM and N variables of the Cork Stoppers' dataset. Since ARM = ART/N, why isn't the point estimate of the ARM mean equal to the ratio of the point estimates of the ART and N means? (See properties of the mean in A.6.1.)

3.18 Redo Example 3.8 for the classes C = “calm vigilance” and D = “active vigilance” of the CTG dataset.

3.19 Using the bootstrap technique compute confidence intervals at 95% level of the mean and standard deviation for the ART data of Example 3.11.

3.20 Determine histograms of the bootstrap distribution of the median of the river Cávado flow rate (see Flow Rate dataset). Explain why it is unreasonable to set confidence intervals based on these histograms.

3.21 Using the bootstrap technique compute confidence intervals at 95% level of the mean and the two-tail 5% trimmed mean for the BRISA data of the Stock Exchange dataset. Compare both results.

3.22 Using the bootstrap technique compute confidence intervals at 95% level of the Pearson correlation between variables CaO and MgO of the Clays’ dataset.