
6.2.2 Minimum Mahalanobis Distance Discriminant

In the previous section, we used the Euclidian distance in order to derive the minimum distance classifier rule. Since the features are random variables, it seems reasonable to require that the distance of a feature vector from the class prototype (the class sample mean) reflect the multivariate distribution of the features. Many multivariate distributions have probability functions that depend on the joint covariance matrix. This is the case with the multivariate normal distribution, as described in section A.8.3 (see formula A.53). Let us assume that all classes have an identical covariance matrix Σ, reflecting a similar hyperellipsoidal shape of the corresponding feature vector distributions. The “surfaces” of equal probability density of the feature vectors relative to a sample mean vector m_k correspond to a constant value of the following squared Mahalanobis distance:

d_k^2(x) = (x − m_k)’ Σ^{-1} (x − m_k) ,   6.9

When the covariance matrix is the unit matrix, Σ = I, we obtain:

d_k^2(x) = (x − m_k)’ (x − m_k) ,

which is the squared Euclidian distance of formula 6.7.


Figure 6.6. 3D plots of 1000 points with normal distribution: a) Uncorrelated variables with equal variance; b) Correlated variables with unequal variance.

Let us now interpret these results. When all the features are uncorrelated and have equal variance, the covariance matrix is the unit matrix multiplied by the equal variance factor. In the three-dimensional space, the clouds of points are distributed as spheres, illustrated in Figure 6.6a, and the usual Euclidian distance to the mean is used in order to estimate the probability density at any point. The Mahalanobis distance is a generalisation of the Euclidian distance applicable to the general case of correlated features with unequal variance. In this case, the points of equal probability density lie on an ellipsoid and the data points cluster in the shape of an ellipsoid, as illustrated in Figure 6.6b. The orientations of the ellipsoid axes correspond to the correlations among the features. The lengths of straight lines passing through the centre and intersecting the ellipsoid correspond to the variances along the lines. The probability density is now estimated using the squared Mahalanobis distance 6.9.
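To make formula 6.9 concrete, the following short R sketch computes it with the built-in mahalanobis function (which returns the squared distance); the mean vector and covariance matrix used here are made-up illustrative values, not estimates from the cork-stopper data.

# Formula 6.9 with R's mahalanobis(), which returns the SQUARED distance.
# The mean vector m and covariance matrix S are made-up illustrative values.
m <- c(55, 40)                                   # hypothetical class mean
S <- matrix(c(290, 60,
               60, 45), nrow = 2, byrow = TRUE)  # correlated, unequal variances
x <- c(65, 52)                                   # feature vector to evaluate

mahalanobis(x, center = m, cov = S)              # (x - m)' S^-1 (x - m)
t(x - m) %*% solve(S) %*% (x - m)                # the same, written out
mahalanobis(x, center = m, cov = diag(2))        # S = I: squared Euclidian distance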

Formula 6.9 can also be written as:

d_k^2(x) = x’ Σ^{-1} x − m_k’ Σ^{-1} x − x’ Σ^{-1} m_k + m_k’ Σ^{-1} m_k .   6.10a

Grouping the terms dependent on m_k, as we have done before, we obtain:

d_k^2(x) = −2 [ (Σ^{-1} m_k)’ x − 0.5 m_k’ Σ^{-1} m_k ] + x’ Σ^{-1} x .   6.10b

Since x’ Σ^{-1} x is independent of k, minimising d_k(x) is equivalent to maximising the following decision functions:

g_k(x) = w_k’ x + w_{k,0} ,   6.10c

with w_k = Σ^{-1} m_k and w_{k,0} = −0.5 m_k’ Σ^{-1} m_k .   6.10d

Using these decision functions, we again obtain linear discriminant functions in the form of hyperplanes passing through the middle point of the line segment linking the means. The only difference from the results of the previous section is that the hyperplanes separating class ω_i from class ω_j are now orthogonal to the vector Σ^{-1}(m_i − m_j).
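As an illustration of formulas 6.10c and 6.10d, the R sketch below computes the coefficient vectors w_k and thresholds w_{k,0} from given class means and a common covariance matrix, and assigns a feature vector to the class of maximum g_k(x); the means and covariance matrix are again made-up values.

# Linear Mahalanobis classifier of formulas 6.10c-6.10d (illustrative values).
m <- list(c(55, 40), c(80, 52))             # hypothetical class means m_1, m_2
S <- matrix(c(290, 60, 60, 45), nrow = 2)   # hypothetical common covariance
Sinv <- solve(S)

w  <- lapply(m, function(mk) Sinv %*% mk)                    # w_k   = S^-1 m_k
w0 <- mapply(function(mk, wk) -0.5 * sum(mk * wk), m, w)     # w_k,0 = -0.5 m_k' S^-1 m_k

g <- function(x) mapply(function(wk, wk0) sum(wk * x) + wk0, w, w0)  # g_k(x)
x <- c(65, 52)
g(x)             # decision function values g_1(x), g_2(x)
which.max(g(x))  # index of the assigned class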

In practice, it is impossible to guarantee that all class covariance matrices are equal. Fortunately, the decision surfaces are usually not very sensitive to mild deviations from this condition; therefore, in normal practice, one uses an estimate of a pooled covariance matrix, computed as an average of the sample covariance matrices. This is the practice followed by SPSS and STATISTICA.
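A pooled covariance estimate of this kind can be computed directly in R; the sketch below assumes a numeric feature matrix X and a class-label vector cl (hypothetical names) and weights each sample covariance matrix by its degrees of freedom.

# Pooled covariance matrix: degrees-of-freedom-weighted average of the
# per-class sample covariance matrices. X and cl are assumed inputs.
pooled.cov <- function(X, cl) {
  X     <- as.data.frame(X)
  covs  <- lapply(split(X, cl), cov)     # per-class covariance matrices
  sizes <- table(cl)                     # per-class sample sizes
  Reduce(`+`, mapply(function(Sk, nk) (nk - 1) * Sk, covs, sizes,
                     SIMPLIFY = FALSE)) / (nrow(X) - length(covs))
}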

Example 6.3

Q: Redo Example 6.1, using a minimum Mahalanobis distance classifier. Check the computation of the discriminant parameters and determine to which class a cork with 65 defects is assigned.

A: Given the similarity of both distributions, the Mahalanobis classifier produces the same classification results as the Euclidian classifier. Table 6.1 shows the classification matrix (obtained with SPSS) with the predicted classifications along the columns and the true (observed) classifications along the rows. We see that for this simple classifier, the overall percentage of correct classification in the data sample (training set) is 77%, or equivalently, the overall training set error is 23% (18% for ω_1 and 28% for ω_2). For the moment, we will not assess how the classifier performs with independent cases, i.e., we will not assess its test set error.

The decision function coefficients (also known as Fisher’s coefficients), as computed by SPSS, are shown in Table 6.2.

Table 6.1. Classification matrix obtained with SPSS of two classes of cork stoppers using only one feature, N.

                         Predicted Group Membership
            Class              1          2      Total
Original    Count    1        41          9         50
                     2        14         36         50

77.0% of original grouped cases correctly classified.

Table 6.2. Decision function coefficients obtained with SPSS for two classes of cork stoppers and one feature, N.

                Class 1     Class 2
N                 0.192       0.277
(Constant)       −6.005     −11.746


Let us check these results. The class means are m_1 = [55.28] and m_2 = [79.74]. The average variance is s^2 = 287.63. Applying formula 6.10d, we obtain:

w_1 = m_1/s^2 = [0.192] ;   w_{1,0} = −0.5 m_1^2/s^2 = −6.005 .   6.11a

w_2 = m_2/s^2 = [0.277] ;   w_{2,0} = −0.5 m_2^2/s^2 = −11.746 .   6.11b

These results confirm the ones shown in Table 6.2. Let us determine the class assignment of a cork stopper with 65 defects. As g_1([65]) = 0.192 × 65 − 6.005 = 6.48 is greater than g_2([65]) = 0.277 × 65 − 11.746 = 6.26, the cork stopper is assigned to class ω_1.
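These hand computations are easy to reproduce, e.g. in R, using the coefficients of Table 6.2:

# Checking Example 6.3 with the Table 6.2 coefficients (single feature, N).
g1 <- function(N) 0.192 * N - 6.005    # decision function of class omega_1
g2 <- function(N) 0.277 * N - 11.746   # decision function of class omega_2
g1(65)                                 # about 6.48
g2(65)                                 # about 6.26
if (g1(65) > g2(65)) "class 1" else "class 2"   # assigned to class omega_1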

Example 6.4

Q: Redo Example 6.2, using a minimum Mahalanobis distance classifier. Check the computation of the discriminant parameters and determine to which class a cork with 65 defects and with a total perimeter of 520 pixels (PRT10 = 52) is assigned.

A: The training set classification matrix is shown in Table 6.3. A significant improvement was obtained in comparison with the Euclidian classifier results mentioned in section 6.2.1: an overall training set error of 10% instead of 18%. Not surprisingly, the Mahalanobis distance, which takes into account the shape of the data clusters, performed better. The decision function coefficients are shown in Table 6.4. Using these coefficients, we write the decision functions as:

g_1(x) = w_1’ x + w_{1,0} = [0.262  −0.09783] x − 6.138 .   6.12a

g_2(x) = w_2’ x + w_{2,0} = [0.0803  0.2776] x − 12.817 .   6.12b

The point estimate of the pooled covariance matrix of the data is given by formula 6.13. Substituting S^{-1} into formula 6.10d, the results shown in Table 6.4 are obtained.

Table 6.3. Classification matrix obtained with SPSS for two classes of cork stoppers with two features, N and PRT10.

                         Predicted Group Membership
            Class              1          2      Total
Original    Count

90.0% of original grouped cases correctly classified.

It is also straightforward to compute S^{-1}(m_1 − m_2) = [0.18  −0.376]’. The line orthogonal to this vector, with slope 0.4787 and passing through the middle point between the means, is shown as a solid line in Figure 6.7. As expected, the “hyperplane” leans along the regression direction of the features (see Figure 6.5 for comparison).

As to the classification of x = [65 52]’, since g_1([65 52]’) = 5.80 is smaller than g_2([65 52]’) = 6.86, it is assigned to class ω_2. This cork stopper has a total perimeter of the defects that is too big for it to be assigned to class ω_1.
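This classification can also be checked in R with the coefficients of formulas 6.12a and 6.12b; the same snippet recovers the slope of the separating line from S^{-1}(m_1 − m_2).

# Checking Example 6.4 with the coefficients of formulas 6.12a and 6.12b.
x  <- c(65, 52)                              # N = 65, PRT10 = 52
g1 <- sum(c(0.262, -0.09783) * x) - 6.138    # about 5.80
g2 <- sum(c(0.0803, 0.2776) * x) - 12.817    # about 6.84 (the text reports 6.86,
                                             # presumably from unrounded coefficients)
if (g1 > g2) "class 1" else "class 2"        # assigned to class omega_2

v <- c(0.18, -0.376)   # S^-1 (m_1 - m_2), as computed in the text
-v[1] / v[2]           # slope of the orthogonal boundary line: about 0.479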

Table 6.4. Decision function coefficients, obtained with SPSS, for the two classes of cork stoppers with features N and PRT10.

                Class 1     Class 2
N                 0.262       0.0803
PRT10            −0.09783     0.278
(Constant)       −6.138     −12.817

Figure 6.7. Mahalanobis linear discriminant (solid line) for the two classes of cork stoppers. Scatter plot obtained with STATISTICA.

Notice that if the distributions of the feature vectors in the classes correspond to different hyperellipsoidal shapes, they will be characterised by unequal covariance matrices. The distance formula 6.10 will then be influenced by these different shapes in such a way that we obtain quadratic decision boundaries. Table 6.5 summarises the different types of minimum distance classifiers, depending on the covariance matrix.


Table 6.5. Summary of minimum distance classifier types.

Covariance     Classifier                Equal-density surfaces   Discriminants
Σ_i = s^2 I    Linear, Euclidian         Hyperspheres             Hyperplanes orthogonal to the segment linking the means
Σ_i = Σ        Linear, Mahalanobis       Hyperellipsoids          Hyperplanes leaning along the regression lines
Σ_i            Quadratic, Mahalanobis    Hyperellipsoids          Quadratic surfaces
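The last row of Table 6.5 corresponds to keeping the individual covariance matrices Σ_i instead of pooling them; the minimum-distance rule then compares per-class Mahalanobis distances and the decision boundary becomes quadratic. A minimal R sketch, with made-up means and covariance matrices, is:

# Minimum Mahalanobis distance with unequal class covariances (quadratic rule).
# All means and covariance matrices below are made-up illustrative values.
m1 <- c(55, 40);  S1 <- matrix(c(290, 60, 60, 45), 2, 2)
m2 <- c(80, 52);  S2 <- matrix(c(150, 20, 20, 80), 2, 2)
x  <- c(65, 52)

d2 <- c(mahalanobis(x, m1, S1),    # squared distance to class 1, using S_1
        mahalanobis(x, m2, S2))    # squared distance to class 2, using S_2
which.min(d2)                      # class of minimum squared Mahalanobis distance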

Commands 6.1. SPSS, STATISTICA, MATLAB and R commands used to perform discriminant analysis.

SPSS          Analyze; Classify; Discriminant
STATISTICA    Statistics; Multivariate Exploratory Techniques; Discriminant Analysis
MATLAB        classify(sample,training,group); classmatrix(x,y)
R             classify(sample,training,group); classmatrix(x,y)

A large number of statistical analyses are available with SPSS and STATISTICA discriminant analysis commands. For instance, the pooled covariance matrix exemplified in 6.13 can be obtained with SPSS by checking the Pooled Within-Groups Matrices of the Statistics tab. There is also the possibility of obtaining several types of results, such as listings of decision function coefficients, classification matrices, graphical plots illustrating the separability of the classes, etc. The discriminant classifier can also be configured and evaluated in several ways. Many of these possibilities are described in the following sections.

The R stats package does not include discriminant analysis functions. However, it includes a function for computing Mahalanobis distances. We provide in the book CD two functions for performing discriminant analysis. The first function, classify(sample,training,group), returns a vector containing the integer classification labels of a sample matrix based on a training data matrix with a corresponding group vector of supervised classifications (integers starting from 1). The returned classification labels correspond to the minimum Mahalanobis distance using the pooled covariance matrix. The second function, classmatrix(x,y), generates a classification matrix based on two vectors, x and y, of integer classification labels. The classification matrix of Table 6.3 can be obtained as follows, assuming the cork data frame has been attached with columns ND, PRT and CL corresponding to variables N, PRT and CLASS, respectively:

> y <- cbind(ND[1:100],PRT[1:100]/10)
> co <- classify(y,y,CL[1:100])
> classmatrix(CL[1:100],co)
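The listings of classify and classmatrix are not reproduced here (they are in the book CD), but a minimal sketch of functions with the same calling convention, based on the pooled covariance matrix and the stats functions mahalanobis and table, could look as follows (an illustrative reimplementation, not the distributed code):

# Illustrative reimplementation of classify and classmatrix (not the CD code).
classify <- function(sample, training, group) {
  sample <- as.matrix(sample)
  grp    <- split(as.data.frame(training), group)      # per-class training data
  means  <- lapply(grp, colMeans)                       # class prototypes m_k
  resid  <- do.call(rbind, mapply(function(Xk, mk) sweep(as.matrix(Xk), 2, mk),
                                  grp, means, SIMPLIFY = FALSE))
  Sp <- crossprod(resid) / (nrow(resid) - length(grp))  # pooled covariance matrix
  d2 <- sapply(means, function(mk) mahalanobis(sample, mk, Sp))
  d2 <- matrix(d2, nrow = nrow(sample))                 # one column per class
  as.integer(names(grp))[max.col(-d2)]                  # minimum-distance labels
}

classmatrix <- function(x, y) table(Observed = x, Predicted = y)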

The meanings of MATLAB’s classify arguments are the same as in R. MATLAB does not provide a function for obtaining the classification matrix. We include in the book CD the classmatrix function for this purpose, working in the same way as in R.

We did not obtain the same values in MATLAB as we did with the other software products. The reason may be that MATLAB apparently does not use pooled covariances (and is therefore not providing linear discriminants).