6.3.3 Dimensionality Ratio and Error Estimation

The Mahalanobis and the Bhattacharyya distances can only increase when adding more features, since for every added feature a non-negative distance contribution is also added. This would certainly be the case if we had the true values of the means and the covariances available, which, in practical applications, we do not.
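
This monotonicity is easy to check numerically. The following minimal sketch, with made-up means and covariance (all values hypothetical), computes the squared Mahalanobis distance between two class means using nested subsets of features; the printed sequence never decreases as features are appended:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters for a 5-feature problem.
mu1 = rng.normal(size=5)
mu2 = rng.normal(size=5)
A = rng.normal(size=(5, 5))
cov = A @ A.T + 5 * np.eye(5)          # symmetric, positive definite

def sq_mahalanobis(m1, m2, c):
    """Squared Mahalanobis distance between two mean vectors."""
    diff = m1 - m2
    return diff @ np.linalg.solve(c, diff)

# Using only the first k features: each added feature contributes
# a non-negative term, so the sequence is non-decreasing.
for k in range(1, 6):
    print(k, sq_mahalanobis(mu1[:k], mu2[:k], cov[:k, :k]))
```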

When using a large number of features, we run into numerical difficulties in obtaining a good estimate of Σ⁻¹, given the finiteness of the training set. Surprising results can then be expected; for instance, the performance of the classifier can degrade when more features are added, instead of improving.
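
A quick way to see the difficulty is to compare the conditioning of the sample covariance matrix for several training set sizes. In the sketch below (dimension and sizes are illustrative only), the condition number of the estimate deteriorates badly as n approaches d, making the computed Σ⁻¹ numerically unreliable:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10                                   # number of features
true_cov = np.eye(d)                     # well-conditioned true covariance

for n in (12, 50, 1000):                 # training set sizes to compare
    X = rng.multivariate_normal(np.zeros(d), true_cov, size=n)
    S = np.cov(X, rowvar=False)          # sample covariance estimate
    # A large condition number means inverting S amplifies estimation noise.
    print(f"n = {n:4d}  cond(S) = {np.linalg.cond(S):.3g}")
```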

Figure 6.14 shows the classification matrix for the two-class cork-stopper problem, using the whole ten-feature set and equal prevalences. The training set performance did not increase significantly compared with the two-feature solution presented previously, and is actually worse than that of the solution using the four-feature vector [ART PRM NG RAAR]', shown in Figure 6.14b.

There are, however, further compelling reasons for not using a large number of features. In fact, when using estimates of means and covariances derived from a training set, we are designing a biased classifier, fitted to the training set. Therefore, we should expect our training set error estimates to be, on average, optimistic. On the other hand, error estimates obtained on independent test sets are expected to be, on average, pessimistic. It is only when the number of cases, n, is sufficiently larger than the number of features, d, that we can expect our classifier to generalise, that is, to perform equally well when presented with new cases. The n/d ratio is called the dimensionality ratio.
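
This bias can be illustrated with a small Monte Carlo experiment. In the sketch below, the class separation, dimension and sample sizes are all assumed values, and scikit-learn's linear discriminant stands in for the classifier; averaging the design set and independent test set error estimates over many repetitions shows the former systematically below the latter:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
d, n_per_class, trials = 7, 14, 500         # hypothetical dimension and sizes
delta = 1.2 * np.ones(d) / np.sqrt(d)       # assumed class-mean separation

def sample(n):
    """Draw n cases per class from two Gaussian classes."""
    X = np.vstack([rng.normal(size=(n, d)),
                   rng.normal(size=(n, d)) + delta])
    return X, np.repeat([0, 1], n)

design_err, test_err = [], []
for _ in range(trials):
    Xtr, ytr = sample(n_per_class)
    Xte, yte = sample(n_per_class)
    clf = LinearDiscriminantAnalysis().fit(Xtr, ytr)
    design_err.append(1 - clf.score(Xtr, ytr))   # same-set (design) estimate
    test_err.append(1 - clf.score(Xte, yte))     # independent-set estimate

print("average design set error:", np.mean(design_err))  # optimistic
print("average test set error:  ", np.mean(test_err))    # pessimistic
```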

The choice of an adequate dimensionality ratio has been studied by several authors (see References). Here, we present some important results as an aid for the designer to choose sensible values for the n/d ratio. Later, when we discuss the topic of classifier evaluation, we will come back to this issue from another perspective.

Figure 6.14. Classification results, obtained with STATISTICA, for two classes of cork stoppers using: (a) ten features; (b) four features.

Let us denote:

Pe – probability of error of a given classifier;
Pe* – probability of error of the optimum Bayesian classifier;
Pe_d(n) – training (design) set estimate of Pe based on a classifier designed on n cases;
Pe_t(n) – test set estimate of Pe based on a set of n test cases.

The quantity Pe_d(n) represents an estimate of Pe influenced only by the finite size of the design set, i.e., the classifier error is measured exactly and its deviation from Pe is due solely to the finiteness of the design set. The quantity Pe_t(n) represents an estimate of Pe influenced only by the finite size of the test set, i.e., it is the expected error of the classifier when evaluated using n-sized test sets. These quantities verify Pe_d(∞) = Pe and Pe_t(∞) = Pe, i.e., they converge to the theoretical value Pe with increasing values of n. If the classifier happens to be designed as an optimum Bayesian classifier, Pe_d and Pe_t converge to Pe*.

In normal practice, these error probabilities are not known exactly. Instead, we compute estimates of these probabilities, P̂e_d and P̂e_t, as percentages of misclassified cases, in exactly the same way as we have done in the classification matrices presented so far. The probability of obtaining k misclassified cases out of n, for a classifier with a theoretical error Pe, is given by the binomial law:

 n  P ( k ) =  Pe k  ( 1 − Pe ) n − k . 6.26

The maximum likelihood estimate of Pe under this binomial law is precisely (see Appendix C):

\hat{Pe} = k / n ,        6.27

with standard deviation:

\sigma = \sqrt{\frac{Pe (1 - Pe)}{n}} .        6.28

Formula 6.28 allows the computation of confidence interval estimates for P̂e, by substituting P̂e in place of Pe and using the normal distribution approximation for sufficiently large n (say, n ≥ 25). Note that formula 6.28 yields zero for the extreme cases of Pe = 0 or Pe = 1.
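
As a worked example of formulas 6.27 and 6.28 (the counts k and n below are purely illustrative), the point estimate, its standard deviation and an approximate 95% confidence interval can be computed as follows:

```python
import math

k, n = 13, 100                       # hypothetical misclassified cases out of n
pe_hat = k / n                       # formula 6.27: maximum likelihood estimate
# Formula 6.28, substituting the estimate for the unknown Pe.
sigma = math.sqrt(pe_hat * (1 - pe_hat) / n)
z = 1.96                             # 95% quantile of the normal approximation
print(f"Pe ≈ {pe_hat:.3f} ± {z * sigma:.3f}")   # prints: Pe ≈ 0.130 ± 0.066
```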

In normal practice, we first compute P̂e_d by designing and evaluating the classifier in the same set with n cases, P̂e_d(n). This is what we have done so far. As for P̂e_t, we may compute it using an independent set of n cases, P̂e_t(n).

In order to have some guidance on how to choose an appropriate dimensionality ratio, we would like to know the deviation of the expected values of these estimates from the Bayes error. Here the expectation is computed over a population of classifiers of the same type, trained in the same conditions. Formulas for these expectations, E[P̂e_d(n)] and E[P̂e_t(n)], are quite intricate and can only be computed numerically. Like formula 6.25, they depend on the Bhattacharyya distance. A software tool, SC Size, computing these formulas for two classes with normally distributed features and equal covariance matrices, separated by a linear discriminant, is included on the book CD. SC Size also allows the computation of confidence intervals of these estimates, using formula 6.28.

Figure 6.15. Two-class linear discriminant E[P̂e_d(n)] and E[P̂e_t(n)] curves, for d = 7 and δ² = 3, below and above the dotted line, respectively. The dotted line represents the Bayes error (0.193).

Figure 6.15 is obtained with SC Size and illustrates how the expected values of the error estimates evolve with the n/d ratio, where n is assumed to be the number of cases in each class. The feature set dimension is d = 7. Both curves exhibit asymptotic behaviour as n → ∞, with the average design set error estimate converging to the Bayes error from below and the average test set error estimate converging from above.

Numerical approximations in the computation of the average test set error may sometimes result in a slight deviation from the asymptotic behaviour, for large n.

Both standard deviations, which can be inspected in text boxes for a selected value of n/d, are initially high for low values of n and converge slowly to zero as n → ∞. For the situation shown in Figure 6.15, the standard deviation of P̂e_d(n) changes from 0.089 for n = d (14 cases, 7 per class) to 0.033 for n = 10d (140 cases, 70 per class).

Based on the behaviour of the E[P̂e_d(n)] and E[P̂e_t(n)] curves, some criteria can be established for the dimensionality ratio. As a general rule of thumb, using dimensionality ratios well above 3 is recommended.

If the cases are not equally distributed among the classes, it is advisable to use the smaller number of cases per class as the value of n. Notice also that a multi-class problem can be seen as a generalisation of a two-class problem if every class is well separated from all the others. The total number of training samples needed for a given deviation of the expected error estimates from the Bayes error can then be estimated as cn*, where c is the number of classes and n* is the particular value of n that achieves such a deviation in the most unfavourable two-class dichotomy of the multi-class problem. For instance, with c = 4 classes and n* = 70 cases, about 280 training cases would be needed.