6.3.2 Normal Bayesian Classification
Up to now, we have assumed no particular distribution model for the likelihoods. Frequently, however, the normal distribution model is a reasonable assumption. SPSS and STATISTICA make this assumption when computing posterior probabilities.
A normal likelihood for class ω_i is expressed by the following pdf (see Appendix A):

p(x | ω_i) = (2π)^(−d/2) |Σ_i|^(−1/2) exp( −½ (x − μ_i)' Σ_i^(−1) (x − μ_i) ),   6.24

with:

μ_i = E_i[x],   mean vector for class ω_i;   6.24a

Σ_i = E_i[(x − μ_i)(x − μ_i)'],   covariance matrix for class ω_i.   6.24b
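As a minimal numerical sketch of formula 6.24 (not part of the original text), the following Python function evaluates the normal likelihood from a class mean vector and covariance matrix; the function name and the two-dimensional parameter values are illustrative assumptions only, not the cork-stoppers data.

```python
import numpy as np

def normal_likelihood(x, mu, cov):
    """Class-conditional pdf p(x | w_i) of equation 6.24 for a
    d-dimensional feature vector x, class mean mu and covariance cov."""
    d = len(mu)
    diff = x - mu
    # Squared Mahalanobis distance of x to the class mean
    maha2 = diff @ np.linalg.inv(cov) @ diff
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * maha2) / norm_const

# Illustrative (made-up) class parameters and feature vector
mu_1 = np.array([0.0, 0.0])
cov_1 = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(normal_likelihood(np.array([1.0, -0.5]), mu_1, cov_1))
```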
Since the likelihood 6.24 depends on the Mahalanobis distance of a feature vector to the respective class mean, we obtain the same types of classifiers shown in Table 6.5.
Note that even when the data distributions are not normal, as long as they are symmetric and correspond to ellipsoid-shaped clusters of points, we obtain the same decision surfaces as for a normal classifier, although with different error rates and posterior probabilities.
As previously mentioned, SPSS and STATISTICA use a pooled covariance matrix when performing linear discriminant analysis. The influence of this practice on the obtained error, compared with the theoretical optimal Bayesian error corresponding to a quadratic classifier, is discussed in detail in (Fukunaga, 1990). Experimental results show that when the covariance matrices exhibit only mild deviations from the pooled covariance matrix, the designed classifier has a performance similar to the optimal performance with equal covariances. This makes sense, since for covariance matrices that are not very distinct the difference between the optimum quadratic solution and the sub-optimum linear solution should only be noticeable for cases that are far away from the prototypes, as illustrated in Figure 6.12.
As already mentioned in section 6.2.3, using decision functions based on the individual covariance matrices, instead of a pooled covariance matrix, will produce quadratic decision boundaries. SPSS affords the possibility of computing such quadratic discriminants, using the Separate-groups option of the Classify tab. However, a quadratic classifier is less robust (more sensitive to parameter deviations) than a linear one, especially in high dimensional spaces, and needs a much larger training set for adequate design (see e.g. Fukunaga and Hayes, 1989).
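To make the pooled versus separate-groups distinction concrete, here is a Python sketch (an illustration, not the packages' own computation); the function names are hypothetical and the unweighted averaging of the class covariances is a simplifying assumption, since in practice the pooled matrix is a sample-size-weighted average.

```python
import numpy as np

def quadratic_score(x, mu, cov, prior):
    """Decision function g_i(x) = ln p(x|w_i) + ln P(w_i), dropping the
    constant -(d/2)ln(2*pi) term, with the class's own covariance matrix."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

def classify(x, means, covs, priors, pooled=False):
    """pooled=True mimics the linear-discriminant setting (one shared
    covariance matrix); pooled=False uses separate-groups covariances,
    producing quadratic decision boundaries."""
    if pooled:
        # Simplifying assumption: unweighted average of the class covariances
        shared = sum(covs) / len(covs)
        covs = [shared] * len(covs)
    scores = [quadratic_score(x, m, c, p)
              for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))
```

With a shared covariance matrix, the x'Σ⁻¹x and ln|Σ| terms are common to all classes, so only terms linear in x distinguish the decision functions; this is why pooling yields linear boundaries, whereas separate covariances keep them quadratic.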
SPSS and STATISTICA provide complete listings of the posterior probabilities 6.18 for the normal Bayesian classifier, i.e., using the likelihoods 6.24.
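As a small self-contained sketch (an assumption for illustration, not the packages' own code), the posterior probabilities of formula 6.18 can be obtained by normalising prior × likelihood, with normal likelihoods as in 6.24; the one-feature, two-class parameters below are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_probabilities(x, means, covs, priors):
    """P(w_i | x) by Bayes' formula: normalise prior * likelihood,
    with normal likelihoods as in equation 6.24."""
    liks = np.array([multivariate_normal.pdf(x, mean=m, cov=c)
                     for m, c in zip(means, covs)])
    joint = liks * np.asarray(priors)
    return joint / joint.sum()

# Illustrative two-class, one-feature example (made-up parameters)
means = [np.array([20.0]), np.array([28.0])]
covs = [np.array([[9.0]]), np.array([[9.0]])]
priors = [0.5, 0.5]
print(posterior_probabilities(np.array([25.0]), means, covs, priors))
```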
Figure 6.12. Discrimination of two classes with optimum quadratic classifier (solid line) and sub-optimum linear classifier (dotted line).
Example 6.8
Q: Determine the posterior probabilities corresponding to the classification of two classes of cork stoppers with equal prevalences, as in Example 6.4, and comment on the results.
A: Table 6.7 shows a partial listing of the computed posterior probabilities, obtained with SPSS. Notice that case #55 is marked with **, indicating a misclassified case, with a posterior probability that is higher for class 1 (0.782)
than for class 2 (0.218). Case #61 is also misclassified, but with a small difference of posterior probabilities. Borderline cases such as case #61 could be re-analysed, e.g. using more features.
Table 6.7. Partial listing of the posterior probabilities, obtained with SPSS, for the classification of two classes of cork stoppers with equal prevalences. The columns headed by “P(G=g | D=d)” are posterior probabilities.
Table 6.7 columns: Case Number; Actual Group; Highest Group (Predicted Group, P(G=g | D=d)); Second Highest Group (Group, P(G=g | D=d)). ** Misclassified case.
For a two-class discrimination with normal distributions and equal prevalences and covariance, there is a simple formula for the probability of error of the classifier (see e.g. Fukunaga, 1990):
Pe = 1 − N_{0,1}(δ/2),   6.25

with:

δ² = (μ_1 − μ_2)' Σ^(−1) (μ_1 − μ_2),   6.25a
the square of the so-called Bhattacharyya distance, a Mahalanobis distance of the means, reflecting the class separability.
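A short numerical check of formula 6.25 (an illustrative sketch added here, not part of the original text), using only the Python standard library; the sample δ² values are arbitrary.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """N_{0,1}(z): standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def error_probability(delta_squared):
    """Pe = 1 - N_{0,1}(delta/2), equation 6.25, from the squared distance."""
    return 1.0 - std_normal_cdf(sqrt(delta_squared) / 2.0)

# Pe drops quickly at first and then levels off (cf. Figure 6.13)
for d2 in (1, 4, 9, 16):
    print(d2, round(error_probability(d2), 4))
```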
Figure 6.13 shows the behaviour of Pe with increasing squared Bhattacharyya distance. After an initial quick, exponential-like decay, Pe converges asymptotically to zero. It is, therefore, increasingly difficult to lower a classifier error when it is already small.
Figure 6.13. Error probability of a Bayesian two-class discrimination with normal distributions and equal prevalences and covariance, plotted as a function of the squared Bhattacharyya distance δ².