Categorised Plots

2.2.4 Categorised Plots

Statistical studies often address the problem of comparing random distributions of the same variables for different values of an extra grouping variable. For instance, in the case of the cork stopper dataset, one might be interested in comparing numbers of defects for the three different groups (or classes) of the cork stoppers. The cork stopper dataset, described in Appendix E, is an example of a grouped (or classified) dataset. When dealing with grouped data one needs to compare the data across the groups. For that purpose there is a multitude of graphic tools, known as categorised plots. For instance, with the cork stopper data, one may wish to compare the histograms of the first two classes of cork stoppers. This comparison is shown as a categorised histogram plot in Figure 2.22, for the variable ART. Instead of displaying the individual histograms, it is also possible to display all histograms overlaid in only one plot.


Figure 2.22. Categorised histogram plot obtained with STATISTICA for variable ART and the first two classes of cork stoppers.
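A categorised histogram of this kind can also be sketched in R. The example below is only an illustration under stated assumptions: the data frame cork and its values are synthetic stand-ins generated for the example (they are not the cork stopper data of Appendix E), and only the variable names ART and CL follow the text.

```r
# Synthetic stand-in for two classes of the cork stopper data
# (values are invented for illustration only).
set.seed(1)
cork <- data.frame(
  CL  = factor(rep(1:2, each = 50)),
  ART = c(rnorm(50, mean = 140, sd = 45), rnorm(50, mean = 370, sd = 110))
)

# Common breaks, so that both panels use the same bins and are comparable.
breaks <- pretty(range(cork$ART), n = 10)

# One histogram panel per class, side by side.
groups <- split(cork$ART, cork$CL)
op <- par(mfrow = c(1, length(groups)))
counts <- list()
for (g in names(groups)) {
  h <- hist(groups[[g]], breaks = breaks, xlab = "ART",
            main = paste("Class", g))
  counts[[g]] <- h$counts   # bin counts, kept for later inspection
}
par(op)                     # restore the previous plot layout
```

Using the same breaks in every panel is a deliberate choice: with separately chosen bins the shapes of the histograms would not be directly comparable.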

When the number of groups is high, the visual comparison of the histograms may be rather difficult. The situation usually worsens if one uses overlaid histograms. A better alternative for comparing data distributions across several groups is the so-called box plot (or box-and-whiskers plot). As illustrated in Figure 2.23, a box plot uses a distinct rectangular box for each group, where each box corresponds to the central 50% of the cases, the so-called inter-quartile range (IQR). A central mark or line inside the box indicates the median, i.e., the value below which 50% of the cases lie. The boxes are prolonged with lines (whiskers) covering the range of the non-outlier cases, i.e., cases that do not lie above the top or below the bottom of the box by more than a certain factor of the IQR. A usual IQR factor for outliers is 1.5. Sometimes box plots also indicate, with an appropriate mark, the extreme cases, defined in the same way as the outliers but using a larger IQR factor, usually 3. As an alternative to using the central 50% range of the cases around the median, one can also use the mean ± standard deviation.
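The quantities that make up a box plot can be computed by hand, as in the following minimal sketch. The sample x is hypothetical, and the quartiles are taken from R's quantile function with its default settings; this is an assumption of the example, since plotting software may compute the box limits slightly differently (R's own boxplot, for instance, uses Tukey's hinges).

```r
# Hypothetical sample; the last value is deliberately far from the rest.
x <- c(12, 15, 17, 18, 20, 21, 23, 24, 26, 60)

q   <- quantile(x, c(0.25, 0.50, 0.75))  # bottom of box, median, top of box
iqr <- q[[3]] - q[[1]]                   # inter-quartile range

# Limits beyond which a case counts as an outlier (factor 1.5)
# or as an extreme case (factor 3).
lower_out <- q[[1]] - 1.5 * iqr
upper_out <- q[[3]] + 1.5 * iqr
lower_ext <- q[[1]] - 3 * iqr
upper_ext <- q[[3]] + 3 * iqr

outliers <- x[x < lower_out | x > upper_out]
extremes <- x[x < lower_ext | x > upper_ext]

# The whiskers span the range of the non-outlier cases.
whiskers <- range(x[x >= lower_out & x <= upper_out])
```

With this sample only the value 60 falls outside the limits, and it does so for both factors. Note that, as coded here, every extreme case is also an outlier; some packages report the two sets as disjoint.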

There is also the possibility of obtaining categorised scatter plots or categorised 3D plots. Their real usefulness is however questionable.

Figure 2.23. Box plot of variable ART, obtained with R, for the three classes of the cork stoppers data. The "o" sign for Class 1 indicates an outlier, i.e., a case exceeding the top of the box by more than 1.5 × IQR.

Commands 2.6. SPSS, STATISTICA, MATLAB and R commands used to obtain box plots.

SPSS          Graphs; Boxplot
STATISTICA    Graphs; 2D Graphs; Boxplots
MATLAB        boxplot(x)
R             boxplot(x~y); legend(x,y,label)


The R boxplot function uses the so-called x~y “formula” to create a box plot of x grouped by y. The legend function places label as a legend at the (x,y) position of the plot. The graph of Figure 2.23 (CL is the Class variable) was obtained with:

> boxplot(ART~CL)
> legend(3.2,100,legend="CL")
> legend(0.5,900,legend="ART")
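For readers without the cork stopper data at hand, the same calls can be tried on a synthetic stand-in. In the sketch below the data frame cork and its values are invented for illustration (only the names ART and CL follow the text), and the legend coordinates are those of the commands above, so they may need adjusting to the plotted range.

```r
# Synthetic stand-in for the three classes (values invented for illustration).
set.seed(2)
cork <- data.frame(
  CL  = factor(rep(1:3, each = 50)),
  ART = c(rnorm(50, 140, 45), rnorm(50, 370, 110), rnorm(50, 710, 170))
)

# Box plot of ART grouped by CL. The value returned (invisibly) by boxplot
# is a list with the numbers behind the drawing.
b <- boxplot(ART ~ CL, data = cork)

# Labels placed at given (x, y) positions, in plot coordinates.
legend(3.2, 100, legend = "CL")
legend(0.5, 900, legend = "ART")

b$stats   # one column per class: whisker end, hinge, median, hinge, whisker end
b$n       # number of cases per class
b$out     # values drawn as outliers, if any
b$group   # class index of each value in b$out
```

Axis labels can alternatively be set directly in the call, e.g. boxplot(ART ~ CL, data = cork, xlab = "CL", ylab = "ART"), which avoids choosing legend positions by hand.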