
6.4 The ROC Curve

The classifiers presented in the previous sections assumed a certain model of the feature vector distributions in the feature space. Model-free techniques for designing classifiers, by contrast, make no assumptions about the underlying data distributions; they are called non-parametric methods. One of these methods is based on the choice of appropriate feature thresholds by means of the ROC curve method (where ROC stands for Receiver Operating Characteristic).

The ROC curve method (available with SPSS; see Commands 6.2) appeared in the fifties as a means of selecting the best voltage threshold discriminating pure noise from signal plus noise, in signal detection applications such as radar. Since the seventies, the concept has been used in the areas of medicine and psychology, namely for test assessment purposes.

The ROC curve is an interesting analysis tool for two-class problems, especially in situations where one wants to detect rarely occurring events such as a special signal, a disease, etc., based on the choice of feature thresholds. Let us call the absence of the event the normal situation (N) and the occurrence of the rare event the abnormal situation (A). Figure 6.16 shows the classification matrix for this situation, based on a given decision rule, with true classes along the rows and decided (predicted) classifications along the columns.

The reader may notice the similarity of the canonical two-class classification matrix with the hypothesis decision matrix in chapter 4 (Figure 4.2).


                  Decision
                  A        N
Reality     A     a        b
            N     c        d

Figure 6.16. The canonical classification matrix for two-class discrimination of an abnormal event (A) from the normal event (N).

From the classification matrix of Figure 6.16, the following parameters are defined:

− True Positive Ratio ≡ TPR = a/(a+b). Also known as sensitivity, this parameter tells us how sensitive our decision method is in the detection of the abnormal event. A classification method with high sensitivity will rarely miss the abnormal event when it occurs.

− True Negative Ratio ≡ TNR = d/(c+d). Also known as specificity, this parameter tells us how specific our decision method is in the detection of the abnormal event. A classification method with a high specificity will have a very low rate of false alarms, caused by classifying a normal event as abnormal.

− False Positive Ratio ≡ FPR = c/(c+d) = 1 − specificity.

− False Negative Ratio ≡ FNR = b/(a+b) = 1 − sensitivity.

Both the sensitivity and specificity are usually given in percentages. A decision method is considered good if it simultaneously has a high sensitivity (rarely misses the abnormal event when it occurs) and a high specificity (has a low false alarm rate). The ROC curve depicts the sensitivity versus the FPR (complement of the specificity) for every possible decision threshold.
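As an illustration, the four ratios can be computed directly from the cell counts a, b, c, d of Figure 6.16. The following Python sketch is ours (the function name is not part of any package):

```python
def confusion_ratios(a, b, c, d):
    """Ratios from the canonical classification matrix of Figure 6.16:
    a = abnormal decided A, b = abnormal decided N,
    c = normal decided A,   d = normal decided N."""
    return {"TPR": a / (a + b),   # sensitivity
            "TNR": d / (c + d),   # specificity
            "FPR": c / (c + d),   # = 1 - specificity
            "FNR": b / (a + b)}   # = 1 - sensitivity

ratios = confusion_ratios(90, 10, 20, 80)    # illustrative counts
print(ratios["TPR"], ratios["FPR"])          # -> 0.9 0.2
```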

Example 6.9

Q: Consider the Programming dataset (see Appendix E). Determine whether a threshold-based decision rule using attribute AB, “previous learning of Boolean Algebra”, has a significant influence on deciding whether a student passes (SCORE ≥ 10) or flunks (SCORE < 10) the Programming course, by visual inspection of the respective ROC curve.

A: Using the Programming dataset we first establish the following Table 6.8. Next, we set the following decision rule for the attribute (feature) AB:

Decide “Pass the Programming examination” if AB ≥ ∆.

We then proceed to determine, for every possible threshold value ∆, the sensitivity and specificity of the decision rule in the classification of the students. These computations are summarised in Table 6.9.

Note that when ∆ = 0 the decision rule assigns all students to the “Pass” group (all students have AB ≥ 0). For 0 < ∆ ≤ 1 the decision rule assigns to the “Pass” group 135 students that have indeed “passed” and 60 students that have “flunked” (these 195 students have AB ≥ 1). Likewise for other values of ∆, up to ∆ > 2, where the decision rule assigns all students to the “Flunk” group, since no students have AB > 2. Based on the classification matrices for each value of ∆, the sensitivities and specificities are computed as shown in Table 6.9.

The ROC curve can be directly drawn using these computations, or using SPSS as shown in Figure 6.17c. Figures 6.17a and 6.17b show how the data must be specified. From visual inspection, we see that the ROC curve is only moderately off the diagonal, which corresponds to a non-informative decision rule (more details later).
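The threshold sweep of this example can be sketched in Python. The per-student counts below are purely illustrative (they are not the actual Programming data); they serve only to show the mechanics of the rule “decide Pass if AB ≥ ∆”:

```python
# Hypothetical per-student data: (AB value, passed?) - illustrative only.
students = [(0, True)] * 4 + [(0, False)] * 4 + \
           [(1, True)] * 9 + [(1, False)] * 5 + \
           [(2, True)] * 5 + [(2, False)] * 1

def roc_point(delta):
    """TPR and FPR of the rule: decide 'Pass' if AB >= delta."""
    tp = sum(1 for ab, passed in students if passed and ab >= delta)
    fp = sum(1 for ab, passed in students if not passed and ab >= delta)
    pos = sum(1 for _, passed in students if passed)
    neg = len(students) - pos
    return tp / pos, fp / neg

for delta in (0, 1, 2, 3):
    tpr, fpr = roc_point(delta)
    print(f"delta = {delta}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Each distinct threshold yields one (FPR, TPR) point; plotting them produces the stepwise ROC curve.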

Table 6.8. Number of students passing and flunking the “Programming” examination for three categories of AB (see the Programming dataset).

(Columns: AB = previous learning of Boolean Algebra; outcome coded 1 = Pass.)

Table 6.9. Computation of the sensitivity (TPR) and 1 −specificity (FPR) for Example 6.9.

Pass/Flunk decision based on AB ≥ ∆

Reality (cases)                     ∆ = 0   0 < ∆ ≤ 1   1 < ∆ ≤ 2   ∆ > 2
Flunk (97), decided “Pass”            97        60          14         0
Flunk (97), decided “Flunk”            0        37          83        97
TPR (sensitivity)                      1       0.78        0.28        0
FPR (1 − specificity)                  1       0.62        0.14        0


Figure 6.17. ROC curve for Example 6.9, solved with SPSS: a) Datasheet with column “n” used as weight variable; b) ROC curve specification window; c) ROC curve.

Figure 6.18. One hundred samples of a signal consisting of noise plus signal impulses (bold lines) occurring at random times.

Example 6.10

Q: Consider the Signal & Noise dataset (see Appendix E). This set presents 100 signal plus noise values s(n) (Signal+Noise variable), consisting of random noise plus signal impulses with random amplitude, occurring at random times according to the Poisson law. The Signal & Noise data is shown in Figure 6.18. Determine the ROC curve corresponding to the detection of signal impulses using several threshold values to separate signal from noise.

A: The signal plus noise amplitude shown in Figure 6.18 is often greater than the average noise amplitude, therefore revealing the presence of the signal impulses (e.g. at time instants 53 and 85). The discrimination between signal and noise is made by setting an amplitude threshold, ∆, such that we decide “impulse” (our rare event) if s(n) > ∆, and “noise” (the normal event) otherwise. For each threshold value, it is then possible to establish the signal vs. noise classification matrix and compute the sensitivity and specificity values. By varying the threshold (easily done in the Signal & Noise.xls file), the corresponding sensitivity and specificity values can be obtained, as shown in Table 6.10.
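This sweep can be sketched in Python on a synthetic stand-in for the Signal & Noise data (the impulse instants and amplitude ranges below are assumptions, not the actual dataset):

```python
import random

# Synthetic stand-in for the Signal & Noise data: 100 noise samples with
# impulses superimposed at a few assumed time instants.
random.seed(1)
impulse_times = {5, 23, 41, 53, 67, 85}          # assumed impulse instants
s = [abs(random.gauss(0.0, 1.0)) for _ in range(100)]
for t in impulse_times:
    s[t] += random.uniform(2.0, 5.0)             # impulse of random amplitude

def sens_spec(threshold):
    """Sensitivity and specificity of the rule: decide 'impulse' if s(n) > threshold."""
    tp = sum(1 for n, x in enumerate(s) if n in impulse_times and x > threshold)
    tn = sum(1 for n, x in enumerate(s) if n not in impulse_times and x <= threshold)
    return tp / len(impulse_times), tn / (len(s) - len(impulse_times))

for delta in range(8):                           # eight threshold values
    sens, spec = sens_spec(float(delta))
    print(f"delta = {delta}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

As the threshold grows, sensitivity can only decrease while specificity can only increase, which is the compromise discussed next.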


There is a compromise to be made between sensitivity and specificity. This compromise is made more patent in the ROC curve, which was obtained with SPSS, and corresponds to eight different threshold values, as shown in Figure 6.19a (using the Data worksheet of Signal & Noise.xls). Notice that given the limited number of threshold values, the ROC curve has a stepwise aspect, with different values of the FPR corresponding to the same sensitivity, as also appearing in Table 6.10 for the sensitivity value of 0.7. With a large number of signal samples and threshold values, one would obtain a smooth ROC curve, as represented in Figure 6.19b.

Looking at the ROC curves shown in Figure 6.19 the following characteristic aspects are clearly visible:

− The ROC curve graphically depicts the compromise between sensitivity and specificity. If the sensitivity increases, the specificity decreases, and vice versa.

− All ROC curves start at (0,0) and end at (1,1) (see Exercise 6.7).

− A perfectly discriminating method corresponds to the point (0,1). The ROC curve is then a horizontal line at a sensitivity of 1.

A non-informative ROC curve corresponds to the diagonal line of Figure 6.19, with sensitivity = 1 − specificity. In this case, the true detection rate of the abnormal situation is the same as the false detection rate. The best compromise decision of sensitivity = specificity = 0.5 is then just as good as flipping a coin.

Table 6.10. Sensitivity and specificity in impulse detection (100 signal values).

Threshold Sensitivity Specificity

One of the uses of the ROC curve is related to the issue of choosing the best decision threshold that can differentiate both situations; in the case of Example 6.10, the presence of the impulses from the presence of the noise alone. Let us address this discriminating issue as a cost decision issue, as we have done in section 6.3.1. Representing the sensitivity and the complement of the specificity (the FPR) of the method for a threshold ∆ by s(∆) and f(∆), respectively, and using the same notation as in formula 6.20, we can write the total risk as:

R = λaa P(A) s(∆) + λan P(A)(1 − s(∆)) + λna P(N) f(∆) + λnn P(N)(1 − f(∆)),

or, equivalently, R = s(∆)(λaa − λan)P(A) + f(∆)(λna − λnn)P(N) + constant.


In order to obtain the best threshold, we minimise the risk R by differentiating with respect to ∆ and equating to zero, obtaining:

ds/df = ((λna − λnn) P(N)) / ((λan − λaa) P(A)).   6.29

The point of the ROC curve where the slope has the value given by formula 6.29 represents the optimum operating point or, in other words, corresponds to the best threshold for the two-class problem. Notice that this is a model-free technique of choosing a feature threshold for discriminating two classes, with no assumptions concerning the specific distributions of the cases.
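For a finite set of ROC points, the point where the curve touches a line of slope m given by formula 6.29 is the point maximising s − m·f (since minimising R is, up to positive factors and a constant, maximising s(∆)(λan − λaa)P(A) − f(∆)(λna − λnn)P(N)). A Python sketch (the function name is ours):

```python
def best_operating_point(fpr, tpr, target_slope):
    """Return the (FPR, TPR) point where the ROC curve touches a line of the
    given slope, i.e. the point maximising tpr - target_slope * fpr."""
    return max(zip(fpr, tpr), key=lambda p: p[1] - target_slope * p[0])

# Example: a coarse four-point ROC curve, with unit target slope.
fpr = [0.0, 0.14, 0.62, 1.0]
tpr = [0.0, 0.28, 0.78, 1.0]
print(best_operating_point(fpr, tpr, 1.0))   # -> (0.62, 0.78)
```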

Figure 6.19. ROC curve (bold line), obtained with SPSS, for the signal + noise data: a) Eight threshold values (the values for ∆ = 2 and ∆ = 3 are indicated); b) A large number of threshold values (expected curve) with the 45° slope point.

Let us now assume that, in a given situation, we assign zero cost to correct decisions and, to each wrong decision, a cost inversely proportional to the prevalence of the respective true class. Then, the slope of the optimum operating point is at 45°, as shown in Figure 6.19b. For the impulse detection example, the best threshold would be somewhere between 2 and 3.
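This 45° claim can be checked numerically against formula 6.29 (the prevalences below are illustrative):

```python
# Zero cost for correct decisions; wrong-decision costs inversely
# proportional to the prevalences (c is an arbitrary positive constant).
P_A, P_N, c = 0.2, 0.8, 1.0
lam_aa, lam_nn = 0.0, 0.0
lam_an, lam_na = c / P_A, c / P_N

# Slope of the optimum operating point, formula 6.29:
slope = ((lam_na - lam_nn) * P_N) / ((lam_an - lam_aa) * P_A)
print(slope)   # -> 1.0, i.e. a 45-degree tangent
```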

Another application of the ROC curve is in the comparison of classification performance, namely for feature selection purposes. We have already seen in 6.3.1 how prevalences influence classification decisions. As illustrated in Figure 6.9, for a two-class situation, the decision threshold is displaced towards the class with the smaller prevalence. Consider that the classifier is applied to a population where the prevalence of the abnormal situation is low. Then, for the previously mentioned reason, the decision maker should operate in the lower left part of the ROC curve in order to keep the FPR as small as possible. Otherwise, given the high prevalence of the normal situation, a high rate of false alarms would be obtained. Conversely, if the classifier is applied to a population with a high prevalence of the abnormal situation, the decision maker should adjust the decision threshold to operate on the high-FPR part of the curve.

Briefly, in order for our classification method to perform optimally for a large range of prevalence situations, we would like to have an ROC curve very near the perfect curve, i.e., with an underlying area of 1. It seems, therefore, reasonable to select from among the candidate classification methods (or features) the one that has an ROC curve with the highest underlying area.

The area under the ROC curve is computed by SPSS together with a 95% confidence interval.
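Outside SPSS, the area itself (without the confidence interval) can be approximated by the trapezoidal rule over the computed (FPR, TPR) points; a Python sketch (the function name is ours):

```python
def auc_trapezoid(fpr, tpr):
    """Area under a ROC curve given points sorted by increasing FPR."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(fpr)))

# The diagonal (non-informative) curve has area 0.5:
print(auc_trapezoid([0.0, 1.0], [0.0, 1.0]))   # -> 0.5
```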

Despite some shortcomings, the ROC curve area method is a popular method of assessing classifier or feature performance. This and an alternative method based on information theory are described in Metz et al. (1973).

Commands 6.2. SPSS command used to perform ROC curve analysis.

SPSS Graphs; ROC Curve

Example 6.11

Q: Consider the FHR-Apgar dataset, containing several parameters computed from foetal heart rate (FHR) tracings obtained prior to birth, as well as the so-called Apgar index. This is a ranking index, measured on a one-to-ten scale, and evaluated by obstetricians taking into account clinical observations of a newborn baby. Consider the two FHR features ALTV and ASTV, representing the percentages of abnormal long-term and abnormal short-term heart rate variability, respectively. Use the ROC curve to elucidate which of these parameters is better in clinical practice for discriminating an Apgar > 6 (normal situation) from an Apgar ≤ 6 (abnormal or suspect situation).

Figure 6.20. ROC curves for the FHR Apgar dataset, obtained with SPSS, corresponding to features ALTV and ASTV.


A: The ROC curves for ALTV and ASTV are shown in Figure 6.20. The areas under the ROC curve, computed by SPSS with a 95% confidence interval, are 0.709 ± 0.11 and 0.781 ± 0.10 for ALTV and ASTV, respectively. We therefore select the ASTV parameter as the better diagnostic feature.