
6.5 Feature Selection

As already discussed in section 6.3.3, great care must be exercised in reducing the number of features used by a classifier, in order to maintain a high dimensionality ratio and, therefore, reproducible performance, with error estimates sufficiently near the theoretical value. For this purpose, one may use the hypothesis test methods described in chapters 4 and 5 with the aim of discarding features that are clearly non-useful at an initial stage of the classifier design. This feature assessment task, while assuring that an information-carrying feature set is indeed used in the classifier, does not guarantee that the whole set is needed. Consider, for instance, a classification problem described by four features, x1, x2, x3 and x4, with x1 and x2 perfectly discriminating the classes, and x3 and x4 linearly dependent on x1 and x2. The hypothesis tests will then find that all four features contribute to class discrimination. However, this discrimination could be performed equally well using the alternative sets {x1, x2} or {x3, x4}. In short, discarding features with no aptitude for class discrimination is no guarantee against redundant features.

There is abundant literature on the topic of feature selection (see References). Feature selection uses a search procedure to find a feature subset (model) that obeys a stipulated merit criterion. A possible choice for this criterion is minimising Pe, with the disadvantage that the search process then depends on the classifier type. More often, a class separability criterion such as the Bhattacharyya distance or the ANOVA F statistic is used. The Wilks' lambda, defined as the ratio of the determinant of the pooled covariance over the determinant of the total covariance, is also a popular criterion. Physically, it can be interpreted as the ratio between the average class volume and the total volume of all cases. Its value ranges from 0 (complete class separation) to 1 (complete class fusion).
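As a simple illustration of the Wilks' lambda criterion, the following Python sketch computes it from a data matrix X (cases in rows, features in columns) and a vector y of class labels, using the scatter (SSCP) matrices. The function and variable names are merely illustrative; this is not the routine used by STATISTICA or SPSS.

    import numpy as np

    def wilks_lambda(X, y):
        # Wilks' lambda: det(pooled within-class scatter) / det(total scatter).
        # Values near 0 indicate well-separated classes; values near 1, fused classes.
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        T = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0))    # total scatter
        W = np.zeros_like(T)                                  # pooled within-class scatter
        for c in np.unique(y):
            Xc = X[y == c]
            W += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
        return np.linalg.det(W) / np.linalg.det(T)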

As for the search method, the following popular approaches are available in STATISTICA and SPSS:

1. Sequential search (direct)

The direct sequential search corresponds to performing successive feature additions to, or eliminations from, the target set, based on a separability criterion.

In a forward search, one starts with the feature of most merit and, at each step, all the features not yet included in the subset are revised; the one that contributes the most to class discrimination, as measured by the merit criterion, is included in the subset and the procedure advances to the next search step. The process goes on until no candidate feature has a merit criterion above a specified threshold (a minimal sketch of this procedure is given after this list).


In a backward search, the process starts with the whole feature set and, at each step, the feature that contributes the least to class discrimination is removed. The process goes on until every remaining feature has a merit criterion above a specified threshold.

2. Sequential search (dynamic)

The problem with the previous search methods is the possible existence of "nested" feature subsets that are not detected by direct sequential search. This problem is tackled in a dynamic search by performing a combination of forward and backward searches at each level, known as "plus l-take away r" selection.
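The direct forward search of item 1 can be sketched in a few lines of Python. Here merit(X_sub, y) stands for any class-separability criterion that grows with better discrimination (for instance 1 minus the Wilks' lambda of the sketch above); the names and the stopping rule based on a minimum merit gain are illustrative assumptions, not the STATISTICA or SPSS implementation.

    import numpy as np

    def forward_search(X, y, merit, min_gain=0.0):
        # Greedy forward selection: at each step add the candidate feature that
        # most improves the merit criterion; stop when no candidate improves it
        # by more than min_gain (the threshold mentioned in the text).
        remaining = list(range(X.shape[1]))
        selected, current = [], 0.0        # assumes the merit of the empty set is 0
        while remaining:
            best_score, best_j = max((merit(X[:, selected + [j]], y), j)
                                     for j in remaining)
            if best_score - current <= min_gain:
                break
            selected.append(best_j)
            remaining.remove(best_j)
            current = best_score
        return selected

A backward search works symmetrically: it starts from the full feature set and, at each step, removes the feature whose elimination degrades the merit criterion the least, as long as that degradation stays below the threshold.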

Direct sequential search methods can be applied using STATISTICA and SPSS, the latter affording a dynamic search procedure that is in fact a “plus 1-take away 1” selection. As merit criterion, STATISTICA uses the ANOVA F (for all selected features at a given step) with default value of one. SPSS allows the use of other merit criteria such as the squared Bhattacharyya distance (i.e., the squared Mahalanobis distance of the means).
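The squared Mahalanobis distance of the means, used by SPSS as the "Min. D Squared" criterion, can be sketched as follows; the function name is illustrative and the pooled within-class covariance matrix is assumed as the metric.

    import numpy as np

    def mahalanobis_sq(X, y, class_a, class_b):
        # Squared Mahalanobis distance between the means of two classes,
        # using the pooled (within-class) covariance matrix.
        Xa, Xb = X[y == class_a], X[y == class_b]
        diff = Xa.mean(axis=0) - Xb.mean(axis=0)
        pooled = ((len(Xa) - 1) * np.cov(Xa, rowvar=False) +
                  (len(Xb) - 1) * np.cov(Xb, rowvar=False)) / (len(Xa) + len(Xb) - 2)
        return float(diff @ np.linalg.solve(pooled, diff))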

It is also common to set a lower limit to the so-called tolerance level, T = 1 – r², which must be satisfied by all features, where r is the multiple correlation coefficient of one candidate feature with all the others. Highly correlated features are therefore removed. One must be quite conservative, however, in the specification of the tolerance. A value at least as low as 1% is common practice.
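The tolerance of a candidate feature can be computed by regressing that feature on all the others: r² is simply the R² of that regression. The following Python lines sketch this (illustrative code, not the packages' internal formula).

    import numpy as np

    def tolerance(X, j):
        # Tolerance T = 1 - r^2 of feature j, with r the multiple correlation of
        # feature j with the remaining features; r^2 is obtained as the R^2 of a
        # least-squares regression of column j on the other columns (plus intercept).
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        residuals = X[:, j] - A @ coef
        r2 = 1.0 - residuals.var() / X[:, j].var()
        return 1.0 - r2

Features whose tolerance falls below the stipulated lower limit (e.g. 0.01) would then be excluded from the candidate set.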

Example 6.12

Q: Consider the first two classes of the Cork Stoppers’ dataset. Perform forward and backward searches on the available 10-feature set, using default values for the tolerance (0.01) and the ANOVA F (1.0). Evaluate the training set errors of both solutions.

A: Figure 6.21 shows the summary listing of a forward search for the first two classes of the cork-stopper data obtained with STATISTICA. Equal priors are assumed. Note that variable ART, with the highest F, entered the model in "Step 1". The Wilks' lambda, initially 1, decreased to 0.42 due to the contribution of ART. Next, in "Step 2", the variable with the highest F contribution for the model already containing ART enters the model, decreasing the Wilks' lambda to 0.4. The process continues until there is no variable with an F contribution higher than 1. The listing also indicates an approximate F for the whole model, based on the Wilks' lambda. Figure 6.21 shows that the selection process stopped with a highly significant (p ≈ 0) Wilks' lambda. The four-feature solution {ART, PRM, NG, RAAR} corresponds to the classification matrix shown before in Figure 6.14b.

Using a backward search, a solution with only two features (N and PRT) is obtained. It has the performance presented in Example 6.2. Notice that the backward search usually needs to start with a very low tolerance value (in the present case T = 0.002 is sufficient). The dimensionality ratio of this solution is comfortably high: n/d = 25. One can therefore be confident that this classifier performs in a nearly optimal way.

Example 6.13

Q: Redo the previous Example 6.12 for a three-class classifier, using dynamic search.

A: Figure 6.22 shows the listing produced by SPSS in a dynamic search performed on the cork-stopper data (three classes), using the squared Bhattacharyya distance (D squared) of the two closest classes as the merit criterion. Furthermore, features were only entered into or removed from the selected set if they contributed significantly to the ANOVA F. The solution corresponding to Figure 6.22 used a 5% significance level for a candidate feature to enter the model, and a 10% level to remove it. Notice that PRT, which had entered at step 1, was later removed, at step 5. The nested solution {PRM, N, ARTG, RAAR} would not have been found by a direct forward search.
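The "plus 1-take away 1" behaviour just described can be illustrated by a small extension of the forward search sketch given earlier: after each inclusion, the already selected features are re-examined and one of them may be dropped. This is only a simplified illustration under assumed merit-gain thresholds; SPSS bases the enter/remove decisions on the significance levels of partial F statistics (here 5% to enter and 10% to remove), not on raw merit gains.

    import numpy as np

    def stepwise_search(X, y, merit, enter_gain=0.0, remove_loss=0.0):
        # "Plus 1-take away 1" selection: a forward step followed by a check of
        # whether dropping any already selected feature costs at most remove_loss.
        remaining = list(range(X.shape[1]))
        selected, current = [], 0.0
        for _ in range(2 * X.shape[1]):      # cap the steps to avoid add/remove cycling
            if not remaining:
                break
            best_score, best_j = max((merit(X[:, selected + [j]], y), j)
                                     for j in remaining)
            if best_score - current <= enter_gain:
                break
            selected.append(best_j)
            remaining.remove(best_j)
            current = best_score
            if len(selected) > 1:            # backward step: may recover nested subsets
                score_wo, j_out = max((merit(X[:, [k for k in selected if k != j]], y), j)
                                      for j in selected if j != best_j)
                if current - score_wo <= remove_loss:
                    selected.remove(j_out)
                    remaining.append(j_out)
                    current = score_wo
        return selected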

Figure 6.21. Feature selection listing, obtained with STATISTICA, using a forward search for two classes of the cork-stopper data.


(The Figure 6.22 listing reports, for each step, the entered and removed variables, the minimum D squared statistic and the pair of groups between which it occurs, and the corresponding exact F statistic with df1, df2 and significance.)

Figure 6.22. Feature selection listing, obtained with SPSS (Stepwise Method; Mahalanobis), using a dynamic search on the cork stopper data (three classes).