6.7 Tree Classifiers
In multi-group classification, one is often confronted with the problem that reasonable performance can only be achieved using a large number of features. This requires a very large design set for proper training, probably much larger than the one available. Moreover, the feature subset that best discriminates some classes may perform rather poorly for other classes. In an attempt to overcome these difficulties, a “divide and conquer” principle using multistage classification can be employed. This is the approach of decision tree classifiers, also known as hierarchical classifiers, in which an unknown case is classified into a class using decision functions applied in successive stages.
At each stage of the tree classifier, a simpler problem with a smaller number of features is solved. This brings an additional benefit: in practical multi-class problems it is rather difficult to guarantee normal or even symmetric distributions with similar covariance matrices for all classes, but with the multistage approach those conditions may be approximately met at each stage, thus affording optimal classifiers.
Example 6.16
Q: Consider the Breast Tissue dataset (electric impedance measurements of freshly excised breast tissue) with 6 classes denoted CAR (carcinoma), FAD (fibro-adenoma), GLA (glandular), MAS (mastopathy), CON (connective) and ADI (adipose). Derive a decision tree solution for this classification problem.
A: Performing a Kruskal-Wallis analysis, it is readily seen that all the features have discriminative capabilities, notably I0 and PA500, and that it is practically impossible to discriminate between classes GLA, FAD and MAS. The low dimensionality ratio of this dataset for the individual classes (e.g. only 14 cases for class CON) strongly recommends a decision tree approach, with the use of merged classes and a greatly reduced number of features at each node.
As I0 and PA500 are promising features, it is worthwhile to look at the respective scatter diagram shown in Figure 6.23. Two case clusters are visually identified: one corresponding to {CON, ADI}, the other to {MAS, GLA, FAD, CAR}. At the first stage of the tree we then use I0 alone, with a threshold of I0 = 600, achieving zero errors.
At stage two, we attempt the most useful discrimination from the medical point of view: class CAR (carcinoma) vs. {FAD, MAS, GLA}. Using discriminant analysis, this can be performed with an overall training set error of about 8%, using features AREA_DA and IPMAX, whose distributions are well modelled by the normal distribution.
Figure 6.23. Scatter plot of the six classes of breast tissue (car, fad, mas, gla, con, adi) using features I0 and PA500.
Figure 6.24 shows the corresponding linear discriminant. Performing two randomised runs using the partition method in halves (i.e., 2-fold cross-validation with half of the samples for design and the other half for testing), an average test set error of 8.6% was obtained, quite near the design set error. At stage two, the discrimination CON vs. ADI can also be performed with feature I0 (threshold I0 = 1550), with zero errors for ADI and 14% errors for CON.
With these results, we can establish the decision tree shown in Figure 6.25. At each level of the decision tree, a decision function is used, shown in Figure 6.25 as a decision rule to be satisfied. The left descendant tree branch corresponds to compliance with a rule, i.e., to a “Yes” answer; the right descendant tree branch corresponds to a “No” answer.
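As an illustration, the following Python sketch mimics the structure of this tree. It assumes the Breast Tissue data is available in a CSV file with hypothetical column names (CLASS, I0, AREA_DA, IPMAX), re-fits the stage-two CAR vs. {FAD, MAS, GLA} discriminant with scikit-learn instead of copying the one of Figure 6.24, and infers the threshold directions from the scatter plot of Figure 6.23.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical file and column names.
df = pd.read_csv("breast_tissue.csv")
mask = df["CLASS"].isin(["car", "fad", "mas", "gla"])

# Stage-two discriminant (left branch): CAR vs {FAD, MAS, GLA} on AREA_DA, IPMAX.
lda_car = LinearDiscriminantAnalysis()
lda_car.fit(df.loc[mask, ["AREA_DA", "IPMAX"]],
            (df.loc[mask, "CLASS"] == "car").astype(int))

def classify(case):
    """Classify one case (dict with keys I0, AREA_DA, IPMAX) as in Figure 6.25."""
    if case["I0"] < 600:      # stage one: {CAR, FAD, MAS, GLA} vs {CON, ADI}
        x = pd.DataFrame([[case["AREA_DA"], case["IPMAX"]]],
                         columns=["AREA_DA", "IPMAX"])
        return "car" if lda_car.predict(x)[0] == 1 else "fad+mas+gla"
    return "con" if case["I0"] < 1550 else "adi"   # stage two, right branch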
Since a small number of features is used at each level (one at the first level and two at the second), we maintain a reasonably high dimensionality ratio at both levels; therefore, we obtain reliable error estimates with narrow 95% confidence intervals (less than 2% for the first level and about 3% for the CAR vs. {FAD, MAS, GLA} level).
Figure 6.24. Scatter plot of breast tissue classes CAR and {MAS, GLA, FAD} (denoted not car) using features AREA_DA and IPMAX, showing the linear discriminant separating the two classes.
For comparison purposes, the same four-class discrimination was carried out with only one linear classifier using the same three features I0, AREA_DA and IPMAX as in the hierarchical approach. Figure 6.26 shows the classification matrix. Given that the distributions are roughly symmetric, although with some deviations in the covariance matrices, the optimal error achieved with linear discriminants should be close to what is shown in the classification matrix. The degraded performance compared with the decision tree approach is evident.
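For comparison, this single linear classifier can be approximated with scikit-learn's linear discriminant analysis, as in the sketch below; the file and column names are assumptions, and the resulting classification matrix will only approximate Figure 6.26 (which was obtained with STATISTICA).

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

# Hypothetical file and column names; merge FAD, MAS, GLA into one class "fad+".
df = pd.read_csv("breast_tissue.csv")
y = df["CLASS"].replace({"fad": "fad+", "mas": "fad+", "gla": "fad+"})
X = df[["I0", "AREA_DA", "IPMAX"]]

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(confusion_matrix(y, lda.predict(X), labels=["car", "fad+", "con", "adi"]))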
On the other hand, if our only interest is to discriminate class car from all other ones, a linear classifier with only one feature can achieve this discrimination with a
performance of about 86% (see Exercise 6.5). This is a comparable result to the one obtained with the tree classifier.
Figure 6.25. Hierarchical tree classifier for the breast tissue data with percentages of correct classifications and decision functions used at each node. Left branch = “Yes”; right branch = “No”.
Figure 6.26. Classification matrix obtained with STATISTICA, of four classes of breast tissue using three features and linear discriminants. Class fad+ is actually the class set {FAD, MAS, GLA}.
The decision tree used for the Breast Tissue dataset is an example of a binary tree: at each node, a dichotomic decision is made. Binary trees are the most popular type of tree, especially when a single feature is used at each node, resulting in linear discriminants that are parallel to the feature axes and easily interpreted by human experts. Binary trees also allow categorical features to be easily incorporated, with node splits based on a “yes/no” answer to the question of whether
or not a given case belongs to a set of categories. For instance, this type of tree is frequently used in medical applications, and it is often built as a result of statistical studies of the influence of individual health factors in a given population.
The design of decision trees can be automated in many ways, depending on the split criterion used at each node, and the type of search used for best group discrimination. A split criterion has the form:
d(x) ≥ ∆,
where d(x) is a decision function of the feature vector x and ∆ is a threshold. Usually, linear decision functions are used. In many applications, the split criteria are expressed in terms of the individual features alone (the so-called univariate splits).
An important concept regarding split criteria is that of node impurity. The node impurity is a function of the fractions of cases belonging to each class at that node.
Consider the two-class situation shown in Figure 6.27. Initially, we have a node with equal proportions of cases belonging to the two classes (white and black circles). We say that its impurity is maximal. The right split results in nodes with zero impurity, since they contain cases from only one of the classes. The left split, on the contrary, increases the proportion of cases from one of the classes, therefore decreasing the impurity, although some impurity remains present.
Figure 6.27. Splitting a node with maximum impurity. The left split (x₁ ≥ ∆) decreases the impurity, which is still non-zero; the right split (w₁x₁ + w₂x₂ ≥ ∆) achieves pure nodes.
A popular measure of impurity, expressed in the [0, 1] interval, is the Gini index of diversity:
i(t) = Σ_{j≠k} P(j|t) P(k|t),

where the sum runs over all pairs of distinct classes j ≠ k and P(j|t) is the fraction of cases of class j present at node t.
For the situation shown in Figure 6.27, the initial node, with equal proportions of the two classes, has maximal impurity, i(t) = 2 × 0.5 × 0.5 = 0.5; the two pure nodes produced by the right split have zero impurity, whereas the nodes produced by the left split retain some intermediate impurity.
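The following Python sketch (with assumed, illustrative class counts) computes the Gini index for the two-class node of Figure 6.27 and for the pure nodes obtained after the right split:

from collections import Counter

def gini(labels):
    """Gini index i(t) = sum_{j != k} P(j|t) P(k|t) = 1 - sum_j P(j|t)**2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

root = ["white"] * 8 + ["black"] * 8                # equal proportions: maximum impurity
print(gini(root))                                   # 0.5
print(gini(["white"] * 8), gini(["black"] * 8))     # pure nodes: 0.0 0.0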
The automatic generation of a binary tree starts at the root node, which corresponds to the whole training set. The tree then grows by searching, for each variable, the threshold level that achieves the maximum decrease of impurity at each node. The generation of splits stops when no significant decrease of the impurity is achieved. It is common practice to use the individual feature values of the training set cases as candidate threshold values. After a tree has been generated automatically, some sort of tree pruning is sometimes needed in order to remove branches of no interest.
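This threshold search can be sketched as follows; it is a simplified illustration of an exhaustive univariate-split search using the Gini index, not a faithful reproduction of the CRT algorithm discussed next.

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_univariate_split(X, y):
    """Exhaustive search for the split x_f >= threshold with the largest
    impurity decrease; candidate thresholds are the observed feature values."""
    n, n_features = X.shape
    parent = gini(y)
    best = (None, None, 0.0)                    # (feature index, threshold, decrease)
    for f in range(n_features):
        for t in np.unique(X[:, f]):
            right = X[:, f] >= t
            if right.all() or not right.any():  # degenerate split, skip
                continue
            children = (right.sum() * gini(y[right]) +
                        (~right).sum() * gini(y[~right])) / n
            if parent - children > best[2]:
                best = (f, t, parent - children)
    return best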
SPSS and STATISTICA have specific commands for designing tree classifiers, based on univariate splits. The method of exhaustive search for the best univariate splits is usually called the CRT (also CART or C&RT) method, pioneered by Breiman, Friedman, Olshen and Stone (see Breiman et al., 1993).
Example 6.17
Q: Use the CRT approach with univariate splits and the Gini index as splitting criterion in order to derive a decision tree for the Breast Tissue dataset. Assume equal priors of the classes.
A: Applying the commands for CRT univariate split with the Gini index, described in Commands 6.3, the tree presented in Figure 6.28 was found with SPSS (same solution with STATISTICA). The tree shows the split thresholds at each node as well as the improvement achieved in the Gini index. For instance, the first split variable PERIM was selected with a threshold level of 1563.84.
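For readers without SPSS or STATISTICA, a comparable tree can be grown with scikit-learn's DecisionTreeClassifier, which also performs exhaustive univariate splits with the Gini criterion. In the sketch below the file and column names are assumptions, equal priors are approximated with balanced class weights, the stopping parameter is an arbitrary choice, and the resulting tree will not necessarily coincide with Figure 6.28.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical file and column names; CLASS holds the six tissue labels.
df = pd.read_csv("breast_tissue.csv")
X, y = df.drop(columns=["CLASS"]), df["CLASS"]

tree = DecisionTreeClassifier(criterion="gini", class_weight="balanced",
                              min_samples_leaf=5, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))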
Table 6.13. Training set classification matrix, obtained with SPSS, corresponding to the tree shown in Figure 6.28.
Observed   car   fad   mas   gla   con   adi   Percent Correct
car         20     0     1     0     0     0        95.2%
fad          0     0    12     3     0     0         0.0%
mas          2     0    15     1     0     0        83.3%
gla          1     0     4    11     0     0        68.8%
con          0     0     0     0    14     0       100.0%
adi           –     –     –     –     –     –            –
The classification matrix corresponding to this classification tree is shown in Table 6.13. The overall percent correct is 76.4% (overall error of 23.6%). Note the good classification results for the classes CAR, CON and ADI and the difficult splitting of {FAD,MAS,GLA} that we had already observed. Also note the gradual error increase as one progresses through the tree. Node splitting stops when no significant improvement is found.
Figure 6.28. CRT tree using the Gini index as impurity criterion, designed with SPSS.
The CRT algorithm based on exhaustive search tends to be biased towards selecting variables that afford more splits. It is also quite time consuming. Other
approaches have been proposed in order to remedy these shortcomings, notably the algorithm known as QUEST (“Quick, Unbiased, Efficient Statistical Trees”), proposed by Loh and Shih (1997), which employs a form of recursive quadratic discriminant analysis to improve the reliability and efficiency of the classification trees it computes.
It is often interesting to compare the CRT and QUEST solutions, since they tend to exhibit complementary characteristics. CRT, despite its shortcomings, is guaranteed to find the splits producing the best classification (in the training set, but not necessarily in test sets), because it employs an exhaustive search. QUEST is fast and unbiased. The speed advantage of QUEST over CRT is particularly dramatic when the predictor variables have dozens of levels (Loh and Shih, 1997). QUEST’s lack of bias in variable selection for splits is also an advantage when some independent variables have few levels and other variables have many levels.
Example 6.18
Q: Redo Example 6.17 using the QUEST approach. Assume equal priors of the classes.
A: Applying the commands for the QUEST algorithm, described in Commands
6.3, the tree presented in Figure 6.29 was found with STATISTICA (same solution with SPSS).
Figure 6.29. Tree plot obtained with STATISTICA for the breast tissue data, using the QUEST approach.
The classification matrix corresponding to this classification tree is shown in Table 6.14. The overall percent correct is 63.2% (overall error of 36.8%). Note the good classification results for the classes CON and ADI and the splitting off of {FAD,MAS,GLA} as a whole. This solution is similar to the solution we had derived “manually” and represented in Figure 6.25.
Table 6.14. Training set classification matrix corresponding to the tree shown in Figure 6.29.
The tree solutions should be validated as with any other classifier type. SPSS and STATISTICA afford the possibility of cross-validating the designed trees using the partition method described in section 6.6. In the present case, since the dimensionality ratios are small, one has to perform the cross-validation with very small test samples. Using a 14-fold cross-validation for the CRT and QUEST solutions of Examples 6.17 and 6.18, we obtained the results shown in Table 6.15. We see that although CRT yielded a lower training set error compared with QUEST, the latter method provided a solution with better generalization capability (smaller difference between training set and test set errors). Note that 14-fold cross-validation is equivalent to the leave-one-out method for the smallest sized class of this dataset.
Table 6.15. Overall errors and respective standard deviations (obtained with STATISTICA) in 14-fold cross-validation of the tree solutions found in Examples 6.17 and 6.18.

Method   Overall Error   Stand. Deviation
CRT      0.406           0.043
QUEST    0.349           0.040
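A cross-validation of this kind can also be reproduced outside SPSS/STATISTICA; the sketch below assumes the same hypothetical data file and scikit-learn tree as in the earlier sketches, so the fold-by-fold figures will not match Table 6.15 exactly.

import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical file and column names, as in the earlier sketches.
df = pd.read_csv("breast_tissue.csv")
X, y = df.drop(columns=["CLASS"]), df["CLASS"]

tree = DecisionTreeClassifier(criterion="gini", class_weight="balanced",
                              min_samples_leaf=5, random_state=0)
cv = StratifiedKFold(n_splits=14, shuffle=True, random_state=0)
scores = cross_val_score(tree, X, y, cv=cv)
print("overall error: %.3f +/- %.3f" % (1 - scores.mean(), scores.std()))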
Commands 6.3. SPSS and STATISTICA commands used to design tree classifiers.
SPSS: Analyze; Classify; Tree...
STATISTICA: Statistics; Multivariate Exploratory Techniques; Classification Trees
When performing tree classification with SPSS it is advisable to first assign appropriate labels to the categorical variable. This can be done in a “Define Variable Properties...” window. The Tree window allows one to specify the dependent (categorical) and independent variables and the type of Output one wishes to obtain (usually, Chart − a display as in Figure 6.28 − and Classification Table from Statistics). One then proceeds to choosing a growing method (CRT, QUEST), the maximum number of cases per node at input and output (in Criteria), the priors (in Options) and the cross- validation method (in Validation).
In STATISTICA the independent variables are called “predictors”. Real-valued variables such as the ones used in the previous examples are called “ordered predictors”. One must not forget to set the codes for the dependent variable. The CRT and QUEST methods appear in the Methods window denominated as “C&RT-style exhaustive search for univariate splits” and “Discriminant-based univariate splits for categ. and ordered predictors”, respectively.
The classification matrices in STATISTICA have a different configuration from the ones shown in Tables 6.13 and 6.14: the observations are along the columns and the predictions along the rows. Cross-validation in STATISTICA provides the average misclassification matrix, which can be useful for analysing class behaviour individually.
Exercises
6.1 Consider the first two classes of the Cork Stoppers’ dataset described by features ART and PRT.
a) Determine the Euclidean and Mahalanobis classifiers using feature ART alone, then using both ART and PRT.
b) Compute the Bayes error using a pooled covariance estimate as the true covariance for both classes.
c) Determine whether the Mahalanobis classifiers are expected to be near the optimal Bayesian classifier.
d) Using SC Size, determine the average deviation of the training set error estimate from the Bayes error, and the 95% confidence interval of the error estimate.
6.2 Repeat the previous exercise for the three classes of the Cork Stoppers’ dataset, using features N, PRM and ARTG.
6.3 Consider the problem of classifying cardiotocograms (CTG dataset) into three classes: N (normal), S (suspect) and P (pathological).
a) Determine which features are most discriminative and appropriate for a Mahalanobis classifier approach for this problem.
b) Design the classifier and estimate its performance using a partition method for the test set error estimation.
6.4 Repeat the previous exercise using the Rocks’ dataset and two classes: {granites} vs. {limestones, marbles}.
6.5 A physician would like to have a very simple rule available for screening out carcinoma situations from all other situations using the same diagnostic means and measurements as in the Breast Tissue dataset.
a) Using the Breast Tissue dataset, find a linear Bayesian classifier with only one feature for the discrimination of carcinoma versus all other cases (relax the normality and equal variance requirements). Use forward and backward search and estimate the priors from the training set sizes of the classes.
b) Obtain training set and test set error estimates of this classifier, and 95% confidence intervals.
c) Using the SC Size program, assess the deviation of the error estimate from the true Bayesian error, assuming that the normality and equal variance requirements were satisfied.
d) Suppose that the risk of missing a carcinoma is three times higher than the risk of misclassifying a non-carcinoma. How should the classifying rule be reformulated in order to reflect these risks, and what is the performance of the new rule?
6.6 Design a linear discriminant classifier for the three classes of the Clays’ dataset and evaluate its performance.
6.7 Explain why all ROC curves start at (0,0) and finish at (1,1) by analysing what kind of situations these points correspond to.
6.8 Consider the Breast Tissue dataset. Use the ROC curve approach to determine single features that will discriminate carcinoma cases from all other cases. Compare the alternative methods using the ROC curve areas.
6.9 Repeat the ROC curve experiments illustrated in Figure 6.20 for the FHR Apgar dataset, using combinations of features.
6.10 Increase the amplitude of the signal impulses by 20% in the Signal & Noise dataset. Consider the following impulse detection rule:
An impulse is detected at time n when s(n) is bigger than
Determine the ROC curve corresponding to several α values, and determine the best α for the impulse/noise discrimination. How does this method compare with the amplitude threshold method described in section 6.4?
6.11 Consider the Infarct dataset, containing four continuous-type measurements of physiological variables of the heart (EF, CK, IAD, GRD), and one ordinal-type variable (SCR: 0 through 5) assessing the severity of left ventricle necrosis. Use ROC curves of the four continuous-type measurements in order to determine the best threshold discriminating “low” necrosis (SCR < 2) from “medium-high” necrosis (SCR ≥ 2), as well as the best discriminating measurement.
6.12 Repeat Exercises 6.3 and 6.4 performing sequential feature selection (direct and dynamic).
6.13 Perform a resubstitution and leave-one-out estimation of the classification errors for the three classes of cork stoppers, using the features obtained by dynamic selection (Example 6.13). Comment on the reliability of these estimates.
6.14 Compute the 95% confidence interval of the error for the classifier designed in Exercise 6.3 using the standard formula. Perform a partition method evaluation of the classifier, with 10 partitions, obtaining another estimate of the 95% confidence interval of the error.
6.15 Compute the decrease of impurity in the trees shown in Figure 6.25 and Figure 6.29, using the Gini index.
6.16 Compute the classification matrix CAR vs. {MAS, GLA, FAD} for the Breast Tissue dataset in the tree shown in Figure 6.25. Observe its dependence on the prevalences. Compute the linear discriminant shown in the same figure.
6.17 Using the CRT and QUEST approaches, find decision trees that discriminate the three classes of the CTG dataset, N, S and P, using several initial feature sets that contain the four variability indexes ASTV, ALTV, MSTV, MLTV. Compare the classification performances for the several initial feature sets.
6.18 Consider the four variability indexes of foetal heart rate (MLTV, MSTV, ALTV, ASTV) included in the CTG dataset. Using the CRT approach, find a decision tree that discriminates the pathological foetal state responsible for a “flat-sinusoidal” (FS) tracing from all the other classes.