
6.3.1 Bayes Rule for Minimum Risk

Let us again consider the cork stopper problem and imagine that factory production was restricted to the two classes we have been considering, denoted as: ω1 = Super and ω2 = Average. Let us assume further that the factory had a record of production stocks for a reasonably long period, summarised as:

Number of produced cork stoppers of class ω1: n1 = 901 420
Number of produced cork stoppers of class ω2: n2 = 1 352 130
Total number of produced cork stoppers: n = 2 253 550

With this information, we can readily obtain good estimates of the probabilities of producing a cork stopper from either of the two classes, the so-called prior probabilities or prevalences:

P(ω1) = n1/n = 0.4;  P(ω2) = n2/n = 0.6.   6.14
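As a quick check, the prior estimates can be reproduced from the production counts. The following is a minimal Python sketch (the variable names are ours):

```python
# Estimating the prior probabilities (prevalences) from the production record.
n1 = 901_420    # cork stoppers of class ω1 (Super)
n2 = 1_352_130  # cork stoppers of class ω2 (Average)
n = n1 + n2     # total production: 2 253 550

P_w1 = n1 / n   # prior of class ω1
P_w2 = n2 / n   # prior of class ω2
print(P_w1, P_w2)  # 0.4 0.6
```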

6.3 Bayesian Classification

Note that the prevalences are not entirely controlled by the factory; they depend mainly on the quality of the raw material, just as a cardiologist cannot control how prevalent myocardial infarction is in a given population. Prevalences can, therefore, be regarded as “states of nature”.

Suppose we are asked to make a blind decision as to which class a cork stopper belongs, without looking at it. If the only available information is the prevalences, the sensible choice is class ω2; this way, we expect to be wrong only 40% of the time. Assume now that we are allowed to measure the feature vector x of the presented cork stopper. Let P(ωi | x) be the conditional probability that the cork stopper represented by x belongs to class ωi. If we are able to determine the probabilities P(ω1 | x) and P(ω2 | x), the sensible decision is now:

If P(ω1 | x) > P(ω2 | x), we decide x ∈ ω1;
If P(ω1 | x) < P(ω2 | x), we decide x ∈ ω2;   6.15
If P(ω1 | x) = P(ω2 | x), the decision is arbitrary.

We can condense 6.15 as:

If P(ω1 | x) > P(ω2 | x) then x ∈ ω1 else x ∈ ω2.   6.15a

The posterior probabilities P(ωi | x) can be computed if we know the pdfs of the distributions of the feature vectors in both classes, p(x | ωi), the so-called likelihood of x. As a matter of fact, the Bayes law (see Appendix A) states that:

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}, \qquad 6.16$$

with $p(\mathbf{x}) = \sum_{i=1}^{c} p(\mathbf{x} \mid \omega_i) P(\omega_i)$, the total probability of x. Note that P(ωi) and P(ωi | x) are discrete probabilities (symbolised by a capital letter), whereas p(x | ωi) and p(x) are values of pdf functions. Note also that p(x) is a common term in the comparison expressed by 6.15a; therefore, for two classes we may rewrite it as:

If p(x | ω1) P(ω1) > p(x | ω2) P(ω2) then x ∈ ω1 else x ∈ ω2.   6.17
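The posterior computation 6.16 and the two-class comparison 6.17 can be sketched in Python as follows (the function names are ours; any likelihood values would do):

```python
def posteriors(likelihoods, priors):
    # Bayes law 6.16: P(ωi|x) = p(x|ωi) P(ωi) / p(x),
    # with p(x) = Σj p(x|ωj) P(ωj), the total probability of x.
    joints = [lik * pri for lik, pri in zip(likelihoods, priors)]
    p_x = sum(joints)
    return [joint / p_x for joint in joints]

def decide_two_class(likelihoods, priors):
    # Rule 6.17: compare p(x|ω1)P(ω1) with p(x|ω2)P(ω2); p(x) cancels out.
    return 1 if likelihoods[0] * priors[0] > likelihoods[1] * priors[1] else 2
```

Since p(x) merely normalises both sides of the comparison, rule 6.17 and the maximum-posterior rule always yield the same decision.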

Example 6.5

Q: Consider the classification of cork stoppers based on the number of defects, N, and restricted to the first two classes, “Super” and “Average”. Estimate the posterior probabilities and classification of a cork stopper with 65 defects, using prevalences 6.14.

A: The feature vector is x = [N], and we seek the classification of x = [65]. Figure 6.8 shows the histograms of both classes with a superimposed normal curve.

6 Statistical Classification

Figure 6.8. Histograms of feature N for two classes of cork stoppers, obtained with STATISTICA. The threshold value N = 65 is marked with a vertical line.

From this graphic display, we can estimate the likelihoods and the posterior probabilities:

p(x | ω1) = 20/24 = 0.833 ⇒ P(ω1) p(x | ω1) = 0.4 × 0.833 = 0.333;   6.18a
p(x | ω2) = 16/23 = 0.696 ⇒ P(ω2) p(x | ω2) = 0.6 × 0.696 = 0.418.   6.18b

We then decide class ω2, although the likelihood of ω1 is greater than that of ω2. Notice how the prevalences in the statistical model changed the conclusion derived by the minimum distance classification (see Example 6.3).
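The computations 6.18a-b can be reproduced directly; in this sketch the likelihoods are the histogram proportions read off Figure 6.8, rounded to three decimals as in the text:

```python
priors = [0.4, 0.6]    # prevalences 6.14
lik = [0.833, 0.696]   # p(x|ω1) = 20/24, p(x|ω2) = 16/23, rounded as in the text

# Compare P(ωi) p(x|ωi) for the two classes (rule 6.17).
scores = [p * l for p, l in zip(priors, lik)]
decision = 1 if scores[0] > scores[1] else 2
print([round(s, 3) for s in scores], decision)  # [0.333, 0.418] 2
```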

Figure 6.9 illustrates the effect of the prevalences on the decision threshold, assuming equal and normal pdfs:

• Equal prevalences. With equal pdfs, the decision threshold is at half the distance between the means. The number of cases incorrectly classified, proportional to the shaded areas, is equal for both classes. This situation is identical to the minimum distance classifier.

• Prevalence of ω1 bigger than that of ω2. The decision threshold is displaced towards the class with smaller prevalence, thereby decreasing the number of wrongly classified cases of the class with higher prevalence, as seems convenient.

The normal curve fitted by STATISTICA is multiplied by the factor “number of cases” × “histogram interval width”, which is 1000 in the present case. This constant factor is of no importance and is neglected in the computations of 6.18.


Figure 6.9. Influence of the prevalence threshold on the classification errors, represented by the shaded areas (dark grey represents the errors for class ω 1 ). (a) Equal prevalences; (b) Unequal prevalences.

Figure 6.10. Classification results, obtained with STATISTICA, of the cork stoppers with unequal prevalences: 0.4 for class ω 1 and 0.6 for class ω 2 .

Example 6.6

Q: Compute the classification matrix for all the cork stoppers of Example 6.5 and comment on the results.

A: Figure 6.10 shows the classification matrix obtained with the prevalences computed in 6.14, which are indicated in the Group row. We see that indeed the decision threshold deviation led to a better performance for class ω 2 than for class ω 1 . This seems reasonable since class ω 2 now occurs more often. Since the overall error has increased, one may wonder if this influence of the prevalences was beneficial after all. The answer to this question is related to the topic of classification risks, presented below.

Let us assume that the cost of a ω 1 (“super”) cork stopper is 0.025 € and the cost of a ω 2 (“average”) cork stopper is 0.015 €. Suppose that the ω 1 cork stoppers are to be used in special bottles whereas the ω 2 cork stoppers are to be used in normal bottles.

Let us further consider that the wrong classification of an average cork stopper leads to its rejection with a loss of 0.015 € and the wrong classification of a super quality cork stopper amounts to a loss of 0.025 − 0.015 = 0.01 € (see Figure 6.11).


Figure 6.11. Loss diagram for two classes of cork stoppers (ω1 → special bottles; ω2 → normal bottles). Correct decisions have zero loss.

Denote:

SB – Action of using a cork stopper in special bottles.
NB – Action of using a cork stopper in normal bottles.
ω1 = S (class Super); ω2 = A (class Average).

Define: λij = λ(αi | ωj) as the loss associated with action αi when the correct class is ωj. In the present case, αi ∈ {SB, NB}.

We can arrange the λij in a loss matrix Λ, which in the present case is:

$$\Lambda = \begin{bmatrix} \lambda(SB \mid S) & \lambda(SB \mid A) \\ \lambda(NB \mid S) & \lambda(NB \mid A) \end{bmatrix} = \begin{bmatrix} 0 & 0.015 \\ 0.01 & 0 \end{bmatrix}. \qquad 6.19$$

Therefore, the risk (expected value of the loss) associated with the action of using a cork, characterised by feature vector x, in special bottles, can be expressed as:

R(SB | x) = λ(SB | S) P(S | x) + λ(SB | A) P(A | x) = 0.015 × P(A | x);   6.20a

And likewise for normal bottles:

R(NB | x) = λ(NB | S) P(S | x) + λ(NB | A) P(A | x) = 0.01 × P(S | x).   6.20b

We are assuming that in the risk evaluation, the only influence is from wrong decisions. Therefore, correct decisions have zero loss, λ ii = 0, as in 6.19. If instead of two classes, we have c classes, the risk associated with a certain action α i is expressed as follows:

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}). \qquad 6.21$$
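Expression 6.21 and the minimum-risk choice it supports can be sketched as follows; the function names are ours, and the example loss matrix holds the values used in this section:

```python
def conditional_risks(loss, post):
    # 6.21: R(αi|x) = Σj λ(αi|ωj) P(ωj|x); loss[i][j] = λij.
    return [sum(lam * p for lam, p in zip(row, post)) for row in loss]

def min_risk_action(loss, post):
    # Bayes rule for minimum risk: choose the action with smallest R(αi|x).
    risks = conditional_risks(loss, post)
    return min(range(len(risks)), key=risks.__getitem__)

# Two actions (rows: SB, NB) and two classes (columns: S, A), losses as in 6.19:
loss = [[0.0,  0.015],   # λ(SB|S) = 0,    λ(SB|A) = 0.015
        [0.01, 0.0]]     # λ(NB|S) = 0.01, λ(NB|A) = 0
```

For instance, with posteriors [0.7, 0.3] the rule picks SB, since R(SB | x) = 0.0045 is smaller than R(NB | x) = 0.007.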

We are obviously interested in minimising an average risk computed for an arbitrarily large number of cork stoppers. The Bayes rule for minimum risk achieves this through the minimisation of the individual conditional risks R( α i | x).


Let us assume, first, that wrong decisions imply the same loss, which can be scaled to a unitary loss:

$$\lambda_{ij} = \lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & \text{if } i = j, \\ 1 & \text{if } i \neq j. \end{cases} \qquad 6.22a$$

In this situation, since all the posterior probabilities add up to one, we have to minimise:

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j \neq i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x}). \qquad 6.22b$$

This corresponds to maximising P( ω i | x), i.e., the Bayes decision rule for minimum risk corresponds to the generalised version of 6.15a:

Decide ω i if P ( ω i | x ) > P ( ω j | x ), ∀ j ≠ i . 6.22c
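Under unitary losses, rule 6.22c therefore reduces to an argmax over the posteriors, as this small sketch shows (the function name is ours):

```python
def decide_max_posterior(post):
    # 6.22c: decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.
    return max(range(len(post)), key=post.__getitem__)

print(decide_max_posterior([0.2, 0.5, 0.3]))  # 1 (classes indexed from 0)
```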

Thus, the decision function for class ωi is the posterior probability, gi(x) = P(ωi | x), and the classification rule amounts to selecting the class with maximum posterior probability.

Let us now consider the situation of different losses for wrong decisions, assuming, for the sake of simplicity, that c = 2. Taking into account expressions 6.20a and 6.20b, it is readily concluded that we will decide ω1 if:

λ21 P(ω1 | x) > λ12 P(ω2 | x)  ⇒  p(x | ω1) λ21 P(ω1) > p(x | ω2) λ12 P(ω2).   6.23

This is equivalent to formula 6.17 using the following adjusted prevalences:

$$P^*(\omega_1) = \frac{\lambda_{21} P(\omega_1)}{\lambda_{21} P(\omega_1) + \lambda_{12} P(\omega_2)}; \qquad P^*(\omega_2) = \frac{\lambda_{12} P(\omega_2)}{\lambda_{21} P(\omega_1) + \lambda_{12} P(\omega_2)}.$$

STATISTICA and SPSS allow specifying the priors either as estimates of the sample composition (as in 6.14) or by user assignment of specific values. In the latter case, the user can adjust the priors in order to cope with specific classification risks.
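The adjusted prevalences P*(ωi) used in Example 6.7 below follow from the off-diagonal losses and the original priors by re-weighting and renormalising; a Python sketch with the values of this section:

```python
lam12, lam21 = 0.015, 0.01   # off-diagonal losses from matrix 6.19
P1, P2 = 0.4, 0.6            # prevalences 6.14

norm = lam21 * P1 + lam12 * P2
P1_adj = lam21 * P1 / norm   # adjusted prior P*(ω1)
P2_adj = lam12 * P2 / norm   # adjusted prior P*(ω2)
print(round(P1_adj, 3), round(P2_adj, 3))  # 0.308 0.692
```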

Example 6.7

Q: Redo Example 6.6 using adjusted prevalences that take into account 6.14 and the loss matrix 6.19. Compare the classification risks with and without prevalence adjustment.

A: The losses are λ12 = 0.015 and λ21 = 0.01. Using the prevalences 6.14, one obtains P*(ω1) = 0.308 and P*(ω2) = 0.692. The higher loss associated with the wrong classification of an ω2 cork stopper leads to an increase of P*(ω2) compared with P*(ω1). The consequence of this adjustment is a decrease of the number of ω2 cork stoppers wrongly classified as ω1, as shown in the classification matrix of Table 6.6. We can now compute the average risk for this two-class situation as follows:

R = λ 12 Pe 12 + λ 21 Pe 21 ,

where Peij is the error probability of deciding class ωi when the true class is ωj.

Using the training set estimates of these errors, Pe12 = 0.1 and Pe21 = 0.46 (see Table 6.6), the estimated average risk per cork stopper is R = 0.015 × 0.1 + 0.01 × 0.46 = 0.0061 €. If we had not used the adjusted prevalences, we would have obtained the higher risk estimate of 0.0063 € (using the Peij estimates from Figure 6.10).
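The average risk of Example 6.7 can be checked numerically; in this sketch the Peij values are the training-set estimates quoted above for the adjusted-prevalence classifier:

```python
lam12, lam21 = 0.015, 0.01   # losses from matrix 6.19
Pe12, Pe21 = 0.1, 0.46       # error estimates from Table 6.6

# Average risk R = λ12·Pe12 + λ21·Pe21 for the two-class situation.
R_adjusted = lam12 * Pe12 + lam21 * Pe21
print(round(R_adjusted, 4))  # 0.0061
```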

Table 6.6. Classification matrix, obtained with STATISTICA, of the two classes of cork stoppers with adjusted prevalences (Class 1 ≡ ω1; Class 2 ≡ ω2). The column values are the predicted classifications.