Frequencies and Histograms

2.2.2 Frequencies and Histograms

Consider now a continuous variable. Instead of a tally sheet/bar graph, representing an estimate of a discrete probability function, we now want a tabular and graphical representation of an estimate of a probability density function. For this purpose, we establish a certain number of equal-length intervals of the random variable (also known as bins) and compute the frequency of occurrence in each of these intervals; unequal-length intervals are seldom used. In practice, one determines the lowest, $x_l$, and highest, $x_h$, sample values and divides the range, $x_h - x_l$, into r equal-length bins, $h_k$, k = 1, 2, …, r. The computed frequencies are now:

48 2 Presenting and Summarising the Data

$f_k = n_k / n$, where $n_k$ is the number of sample values (observations) in bin $h_k$.

The tabular form of the $f_k$ is called a frequency table; the graphical form is known as a histogram. They are representations of estimates of the probability density function of the associated random variable. Usually the histogram range is chosen somewhat larger than $x_h - x_l$, and adjusted so that convenient limits for the bins are obtained.

Let $d = (x_h - x_l)/r$ denote the bin length. Then the probability density estimate for each of the intervals $h_k$ is:

$\hat{f}(x) = f_k / d$, for $x$ in $h_k$.

The areas of the $h_k$ intervals are therefore $f_k$ and they sum up to 1 as they should.
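The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `histogram_density` and the sample values are hypothetical, the bins are taken as closed on the left (the last bin also includes the highest value), and the density estimate is taken as $f_k / d$, which is what makes the bar areas equal to $f_k$.

```python
def histogram_density(sample, r):
    """Return (edges, frequencies, densities) for r equal-length bins.

    Assumes the sample contains at least two distinct values.
    """
    x_l, x_h = min(sample), max(sample)
    d = (x_h - x_l) / r                      # bin length
    counts = [0] * r
    for x in sample:
        k = min(int((x - x_l) / d), r - 1)   # the highest value goes in the last bin
        counts[k] += 1
    n = len(sample)
    freqs = [n_k / n for n_k in counts]      # f_k = n_k / n
    dens = [f_k / d for f_k in freqs]        # density estimate f_k / d
    edges = [x_l + k * d for k in range(r + 1)]
    return edges, freqs, dens

# Hypothetical sample, for illustration only (not the cork stopper data).
sample = [2.1, 3.4, 3.9, 4.4, 5.0, 5.2, 6.3, 7.7, 8.0, 9.6]
edges, freqs, dens = histogram_density(sample, r=3)
d = edges[1] - edges[0]
print(freqs)                        # relative frequencies per bin
print(sum(f * d for f in dens))     # total area of the histogram bars
```

The total area printed in the last line should equal 1 up to floating-point rounding, whatever the sample and the number of bins.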

Table 2.2. Frequency table of the cork stopper PRT variable using 10 bins (table obtained with STATISTICA).

Interval                     Count   Cumulative   Percent    Cumulative
                                     Count                   Percent
20.22222<x<=187.7778           3
...
690.4444<x<=858.0000          22        104       14.66667    69.3333
858.0000<x<=1025.556          15        119       10.00000    79.3333
1025.556<x<=1193.111          11        130        7.33333    86.6667
1193.111<x<=1360.667          11        141        7.33333    94.0000
1360.667<x<=1528.222           8        149        5.33333    99.3333
1528.222<x<=1695.778           1        150        0.66667   100.0000
Missing                        0        150        0.00000   100.0000

Example 2.2

Q: Consider the variable PRT of the Cork Stoppers’ dataset (see Appendix E). This variable measures the total perimeter of cork defects, and can be considered a continuous (ratio type) variable. Determine the frequency table and the histogram of this variable, using 10 and 6 bins, respectively.

A: The frequency table and histogram can be obtained with the commands listed in Commands 2.1 and Commands 2.3, respectively. Table 2.2 shows the frequency table of PRT using 10 bins. Figure 2.17 shows the histogram of PRT, using 6 bins.
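For readers without access to the packages named above, the columns of a frequency table such as Table 2.2 (count, cumulative count, percent, cumulative percent) can be computed as in the following Python sketch. It is not the STATISTICA procedure: the function name `frequency_table`, its parameters and the sample are hypothetical, and the intervals are assumed to be of the form lower < x <= upper, as in the table.

```python
def frequency_table(sample, lo, hi, r):
    """Frequency table rows for r equal-length bins lower < x <= upper.

    Assumes lo < min(sample) and hi >= max(sample); values above hi are ignored.
    """
    d = (hi - lo) / r
    uppers = [lo + (k + 1) * d for k in range(r - 1)] + [hi]
    counts = [0] * r
    for x in sample:
        for k, upper in enumerate(uppers):
            if x <= upper:          # first bin whose upper limit covers x
                counts[k] += 1
                break
    n = len(sample)
    rows, cum = [], 0
    for k, n_k in enumerate(counts):
        cum += n_k                  # running (cumulative) count
        lower = lo if k == 0 else uppers[k - 1]
        rows.append((lower, uppers[k], n_k, cum, 100 * n_k / n, 100 * cum / n))
    return rows

# Hypothetical sample, for illustration only (not the PRT data).
sample = [12, 25, 31, 38, 44, 47, 52, 60, 75, 98]
rows = frequency_table(sample, 0, 100, 4)
for lower, upper, n_k, cum, pct, cum_pct in rows:
    print(f"{lower:g}<x<={upper:g}  {n_k:3d} {cum:4d} {pct:9.5f} {cum_pct:9.4f}")
```

Choosing `lo` and `hi` slightly outside the sample range mirrors the practice, mentioned earlier, of enlarging the histogram range to obtain convenient bin limits.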

2.2 Presenting the Data 49

Let X denote the random variable associated to PRT. Then, the histogram of the frequency values represents an estimate, $\hat{f}_X(x)$, of the unknown probability density function $f_X(x)$.

The number of bins to use in a histogram (or in a frequency table) depends on its goodness of fit to the true density function $f_X(x)$, in terms of bias and variance. To clarify this issue, consider the histograms of PRT using r = 3 and r = 50 bins, shown in Figure 2.18. In both cases, take as the $\hat{f}_X(x)$ estimate the polygonal line passing through the mid-point values of the histogram bars. In the first case (r = 3) the $\hat{f}_X(x)$ estimate is quite smooth and lacks detail, which corresponds to a large bias, i.e., a large expected value of $\hat{f}_X(x) - f_X(x)$: on average (for an ensemble of similar histograms associated to X) the histogram gives a point estimate of the density that can be quite far from the true density. In the second case (r = 50) the $\hat{f}_X(x)$ estimate is too rough: averaged over a large number of such histograms, the polygonal line may pass quite near the true density values, but the individual $\hat{f}_X(x)$ values vary widely (large variance) around the $f_X(x)$ curve.
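The bias and variance trade-off can be made concrete with a small simulation. The sketch below is not based on the PRT data: it assumes a standard normal density, samples of n = 150 values, a fixed histogram range of [-4, 4] and an arbitrary evaluation point x0 = 0.05, all chosen only for illustration, and it uses the bar height rather than the polygonal line as the estimate. The function names are hypothetical.

```python
import random
import statistics
from math import exp, pi, sqrt

def density_at(sample, r, lo, hi, x0):
    """Histogram density estimate at x0, using r equal-length bins on [lo, hi)."""
    d = (hi - lo) / r
    k0 = int((x0 - lo) / d)                       # index of the bin containing x0
    n_k = sum(1 for x in sample
              if lo <= x < hi and int((x - lo) / d) == k0)
    return n_k / (len(sample) * d)                # f_k / d

def bias_and_variance(r, n=150, repeats=500, x0=0.05, seed=1):
    """Estimate bias and variance of the histogram estimate over many samples."""
    rng = random.Random(seed)
    true_f = exp(-x0 ** 2 / 2) / sqrt(2 * pi)     # standard normal density at x0
    est = [density_at([rng.gauss(0, 1) for _ in range(n)], r, -4.0, 4.0, x0)
           for _ in range(repeats)]
    return statistics.mean(est) - true_f, statistics.variance(est)

bias3, var3 = bias_and_variance(r=3)
bias50, var50 = bias_and_variance(r=50)
print(f"r = 3:  bias = {bias3:+.4f}, variance = {var3:.6f}")
print(f"r = 50: bias = {bias50:+.4f}, variance = {var50:.6f}")
```

Under these assumptions one should expect the r = 3 estimate to show the larger bias in absolute value and the r = 50 estimate the larger variance, in line with the discussion above.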

Figure 2.17. Histogram of variable PRT (cork stopper dataset) obtained with STATISTICA using r = 6 bins.

Some formulas for selecting a “reasonable” number of bins, r, achieving a trade-off between large bias and large variance, have been proposed in the literature, namely (with log denoting the decimal logarithm):

$r = 1 + 3.3 \log(n)$ (Sturges, 1926); 2.1

$r = 1 + 2.2 \log(n)$ (Larson, 1975). 2.2
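As a quick check of these two rules, the following Python sketch evaluates them for n = 150, the number of cases of the PRT data. The function names are hypothetical, the logarithm is taken as base 10, and rounding to the nearest integer is an assumption (the formulas themselves do not say how to round).

```python
from math import log10

def sturges_bins(n):
    """Number of bins by the Sturges rule, before rounding."""
    return 1 + 3.3 * log10(n)

def larson_bins(n):
    """Number of bins by the Larson rule, before rounding."""
    return 1 + 2.2 * log10(n)

n = 150   # number of cases in the PRT data
print(round(sturges_bins(n)), round(larson_bins(n)))   # prints: 8 6
```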


The choice of an optimal value for r was studied by Scott (Scott DW, 1979), using as optimality criterion the minimisation of the global mean square error:

$\mathrm{MSE} = \int_D E\left[ \left( \hat{f}_X(x) - f_X(x) \right)^2 \right] dx$,

where D is the domain of the random variable. The MSE minimisation leads to a formula for the optimal choice of a bin width, h(n), which for the Gaussian density case is:

$h(n) = 3.49 \, s \, n^{-1/3}$, 2.3

where s is the sample standard deviation of the data. Although the h(n) formula was derived for the Gaussian density case, it was experimentally verified to work well for other densities too. With this h(n) one can compute the optimal number of bins using the data range:

$r = (x_h - x_l) / h(n)$. 2.4
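Formulas 2.3 and 2.4 are easy to evaluate directly. The sketch below uses the PRT figures quoted in the caption of Table 2.3 (n = 150 cases, s = 361, range = 1508); the function names are hypothetical and rounding to the nearest integer is an assumption.

```python
def scott_bin_width(s, n):
    """Bin width h(n) = 3.49 * s * n^(-1/3) (formula 2.3)."""
    return 3.49 * s * n ** (-1 / 3)

def scott_bins(data_range, s, n):
    """Number of bins r = range / h(n) (formula 2.4), before rounding."""
    return data_range / scott_bin_width(s, n)

h = scott_bin_width(s=361, n=150)
r = scott_bins(data_range=1508, s=361, n=150)
print(f"h = {h:.1f}, r = {r:.2f}")
print(round(r))   # prints: 6
```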

Figure 2.18. Histogram of variable PRT, obtained with STATISTICA, using: a) r = 3 bins (large bias); b) r = 50 bins (large variance).

The Bins worksheet of the EXCEL Tools.xls file (included in the book CD) allows the computation of the number of bins according to the three formulas 2.1, 2.2 and 2.4. In the case of the PRT variable, we obtain the results of Table 2.3, legitimising the use of 6 bins as in Figure 2.17.

Table 2.3. Recommended number of bins for the PRT data (n =150 cases, s = 361, range = 1508).

Formula    Number of Bins
Sturges    8
Larson     6
Scott      6


Commands 2.3. SPSS, STATISTICA, MATLAB and R commands used to obtain histograms.

SPSS          Graphs; Histogram | Interactive; Histogram
STATISTICA    Graphs; Histograms
MATLAB        hist(y,x)
R             hist(x)

The commands used to obtain histograms of continuous type data are similar to the ones already described in Commands 2.2.

In order to obtain a histogram with SPSS, one can use the Histogram option of Graphs, or preferably, use the sequence of commands Graphs; Interactive; Histogram. One can then select the appropriate number of bins, or alternatively, set the bin width. It is also possible to choose the starting point of the bins.

With STATISTICA, one simply defines the bins in appropriate windows as previously mentioned. Besides setting the desired number of bins, there is also the possibility of defining the bin width (Step size) and the starting point of the bins.

With MATLAB one obtains both the frequencies and the histogram with the hist command. Consider the following commands applied to the cork stopper data stored in the MATLAB cork matrix:

» prt = cork(:,4)
» [f,x] = hist(prt,6);

In this case the hist command generates an f vector containing the frequencies counted in 6 bins and an x vector containing the bin locations. Listing the values of f one gets:

» f
f =

which are precisely the values shown in Figure 2.17. One can also use the hist command with specifications of bins stored in a vector b, as hist(prt, b).

With R one can use the hist function either for obtaining a histogram or for obtaining a frequency list. The frequency list is obtained by assigning the outcome of the function to a variable identifier, which then becomes a “histogram” object. Assuming that a data frame has been created (and attached) for cork stoppers we get a “histogram” object for PRT issuing the following command:

> h <- hist(PRT)

By listing the contents of h one gets, among other things, the break points of the histogram bins, the counts and the densities. The densities represent the probability density estimate for a given bin. We can list the densities of PRT as follows:

> h$density
[1] 1.333333e-04 1.033333e-03 1.166667e-03
[4] 9.666667e-04 5.666667e-04 4.666667e-04
[7] 4.333333e-04 2.000000e-04 3.333333e-05

Thus, using the formula previously mentioned for the probability density estimates, we compute the relative frequencies using the bin length (200 in our case) as follows:

> h$density*200
[1] 0.026666667 0.206666667 0.233333333 0.193333333
[5] 0.113333333 0.093333333 0.086666667 0.040000000
[9] 0.006666667
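The same arithmetic can be checked outside R. The Python sketch below copies the nine density values listed above, multiplies them by the bin length of 200 and, taking n = 150 cases for PRT, recovers the bin counts; the variable names are illustrative only.

```python
# Density values as listed by h$density above (nine bins of length 200).
density = [1.333333e-04, 1.033333e-03, 1.166667e-03,
           9.666667e-04, 5.666667e-04, 4.666667e-04,
           4.333333e-04, 2.000000e-04, 3.333333e-05]
d = 200   # bin length
n = 150   # number of PRT cases

freqs = [f * d for f in density]          # relative frequencies f_k = density * d
counts = [round(f * n) for f in freqs]    # bin counts n_k = f_k * n

print(counts)                  # prints: [4, 31, 35, 29, 17, 14, 13, 6, 1]
print(round(sum(freqs), 6))    # prints: 1.0
```

The counts add up to 150 and the relative frequencies to 1, as required of a density estimate whose bar areas are the $f_k$.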