
9.14 CLASSIFICATION TECHNIQUES

A major task after feature extraction is to classify the object into one of several categories. Figure 9.2 lists various classification techniques applicable in image analysis. Although an in-depth discussion of classification techniques can be found in the pattern-recognition literature (see, for example, [1]), we briefly review them here to establish their relevance in image analysis.

It should be mentioned that classification and segmentation processes have closely related objectives. Classification can lead to segmentation, and vice versa. Classification of pixels in an image is another form of component labeling that can result in segmentation of various objects in the image. For example, in remote sensing, classification of multispectral data at each pixel location results in segmentation of various regions of wheat, barley, rice, and the like. Similarly, image segmentation by template matching, as in character recognition, leads to classification or identification of each object.

There are two basic approaches to classification, supervised and nonsupervised, depending on whether or not a set of prototypes is available.

Supervised Learning

Supervised learning, also called supervised classification, can be distribution free or statistical. Distribution-free methods do not require knowledge of any a priori probability distribution functions and are based on reasoning and heuristics. Statistical techniques are based on probability distribution models, which may be parametric (such as Gaussian distributions) or nonparametric.

Distribution-free classification. Suppose there are K different objects or pattern classes S1, S2, ..., Sk, ..., SK. Each class is characterized by Mk prototypes, which have N x 1 feature vectors y_m^(k), m = 1, ..., Mk. Let x denote an N x 1 feature vector obtained from the observed image. A fundamental function in pattern recognition is called the discriminant function. It is defined such that the kth discriminant function gk(x) takes the maximum value if x belongs to class k; that is, the decision rule is

    gk(x) > gj(x),    for every j ≠ k    ⟹    x ∈ Sk        (9.138)

For a K-class problem, we need K - 1 discriminant functions. These functions divide the N-dimensional feature space into K different regions with a maximum of K(K - 1)/2 hypersurfaces. The partitions become hyperplanes if the discriminant function is linear, that is, if it has the form

    gk(x) = ak^T x + bk        (9.139)


Such a function arises, for example, when x is classified to the class whose centroid is nearest in Euclidean distance to it (Problem 9.17). The associated classifier is called the minimum mean (Euclidean) distance classifier.

An alternative decision rule is to classify x into Sj if, among a total of k nearest prototype neighbors of x, the maximum number of neighbors belong to class Sj. This is the k-nearest neighbor classifier, which for k = 1 becomes a minimum-distance classifier.
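As a concrete sketch of these two decision rules, the following NumPy fragment implements the minimum (Euclidean) distance classifier and the k-nearest neighbor vote. The array and function names are illustrative only, not taken from the text.

    import numpy as np

    def min_distance_classify(x, class_means):
        # class_means: K x N array of class centroids (mean prototype per class).
        # Assign x to the class whose centroid is nearest in Euclidean distance.
        d = np.linalg.norm(class_means - x, axis=1)
        return int(np.argmin(d))

    def knn_classify(x, prototypes, labels, k=3):
        # prototypes: M x N array of labeled prototype feature vectors
        # labels:     length-M array of class indices
        # Take the k nearest prototypes and vote; k = 1 gives the
        # minimum-distance (nearest-prototype) classifier.
        d = np.linalg.norm(prototypes - x, axis=1)
        nearest = labels[np.argsort(d)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        return int(values[np.argmax(counts)])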

When the prototypes can be classified correctly by some set of linear discriminant functions, the classes are said to be linearly separable. In that case, the weights ak and bk can be determined via a successive linear training algorithm. Other discriminants can be piecewise linear, quadratic, or polynomial functions. The k-nearest neighbor classification can be shown to be equivalent to using piecewise linear discriminants.

Decision tree classification [60, 61]. Another distribution-free classifier, called a decision tree classifier, splits the N-dimensional feature space into unique regions by a sequential method. The algorithm is such that every class need not be tested to arrive at a decision. This becomes advantageous when the number of classes is very large. Moreover, unlike many other training algorithms, this algorithm is guaranteed to converge whether or not the feature space is linearly separable.

Let µk(i) and σk(i) denote the mean and standard deviation, respectively, measured from repeated independent observations of the kth prototype vector elements y_m^(k)(i), m = 1, ..., Mk. Define the normalized average prototype features

    zk(i) ≜ µk(i)/σk(i)

and an N x K matrix

         [ z1(1)  z2(1)  ...  zK(1) ]
    Z ≜  [ z1(2)  z2(2)  ...  zK(2) ]        (9.140)
         [   :      :            :  ]
         [ z1(N)  z2(N)  ...  zK(N) ]

The row number of Z is the feature number and the column number is the object or class number. Further, let Z' denote the matrix obtained by arranging the elements of each row of Z in increasing order, with the smallest element on the left and the largest on the right. Now, the algorithm is as follows.

Decision Tree Algorithm

Step 1. Convert Z to Z'. Find the maximum distance between adjacent row elements in each row of Z'. Find r, the row number with the largest maximum distance. The row r represents a feature. Set a threshold at the midpoint of the maximum-distance boundaries and split row r into two parts.

Step 2. Convert Z' to Z such that row r is the same in both matrices. The elements of the other rows of Z' are rearranged such that each column of Z represents a prototype vector. This means, simply, that the elements of each row of Z are in the same order as the elements of row r. Split Z into two matrices Z1 and Z2 by splitting each row in a manner similar to row r.

Step 3. Repeat Steps 1 and 2 for the split matrices that have more than one column. Terminate the process when all the split matrices have only one column.


The preceding process produces a series of thresholds that induce questions of the form, Is feature j > threshold? The questions and the two possible decisions for each question generate a series of nodes and branches of a decision tree. The terminal branches of the tree give the classification decision.

Example 9.11

The accompanying table contains the normalized average areas and perimeter lengths of five different object classes for which a vision system is to be trained.

[Table: normalized average features z(1) = area and z(2) = perimeter for classes 1 through 5, arranged as the 2 x 5 matrix Z1.]

The largest adjacent difference in the first row is 8; in the second row it is 7. Hence the first row is chosen, and z(1) is the feature to be thresholded. This gives the threshold η1 = 16, set at the midpoint of the largest gap, and splits Z1 into Z2 and Z3, as shown. Proceeding similarly with these matrices, we obtain the remaining thresholds η2 = 42, η3 = 38.5, and η4 = 23.5. The thresholds partition the feature space and induce the decision tree, as shown in Fig. 9.58.
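The splitting procedure itself can be sketched in a few lines of NumPy, assuming Z is stored as an N x K array with one column per class and that the prototype features are distinct enough that every chosen gap is positive. The function and variable names below are illustrative only.

    import numpy as np

    def build_tree(Z, classes):
        # Z: N x K array whose columns are normalized average prototype vectors.
        # classes: list of class labels, one per column of Z.
        # Returns a nested (feature, threshold, left, right) tuple, or a leaf label.
        if len(classes) == 1:
            return classes[0]                        # terminal node: one class left
        # Step 1: sort each row and find the largest gap between adjacent elements.
        Zs = np.sort(Z, axis=1)
        gaps = np.diff(Zs, axis=1)
        r = int(np.argmax(gaps.max(axis=1)))         # row (feature) with largest maximum gap
        j = int(np.argmax(gaps[r]))                  # position of that gap in sorted row r
        threshold = (Zs[r, j] + Zs[r, j + 1]) / 2.0  # midpoint of the gap
        # Step 2: split the columns (classes) on feature r at the threshold.
        left = Z[r] <= threshold
        Zl, Zr = Z[:, left], Z[:, ~left]
        cl = [c for c, m in zip(classes, left) if m]
        cr = [c for c, m in zip(classes, left) if not m]
        # Step 3: recurse on each split until every sub-matrix has one column.
        return (r, threshold, build_tree(Zl, cl), build_tree(Zr, cr))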

Statistical classification. In statistical classification techniques it is assumed that the different object classes and the feature vector have an underlying joint probability density. Let P(Sk) be the a priori probability of occurrence of class Sk and p(x) be the probability density function of the random feature vector observed as x.

Bayes' minimum-risk classifier. The Bayes minimum-risk classifier minimizes the average loss or risk in assigning x to a wrong class. Define


Figure 9.58 Decision tree classifier.

    Risk,    ℛ ≜ Σ_{k=1}^{K} ∫_{Rk} c(x|Sk) p(x) dx,        c(x|Sk) ≜ Σ_{i=1}^{K} Ci,k p(Si|x)        (9.141)

where Ci,k is the cost of assigning x to Sk when, in fact, x ∈ Si, and Rk represents the region of the feature space where p(x|Sk) > p(x|Si) for every i ≠ k. The quantity c(x|Sk) represents the total cost of assigning x to Sk. It is well known that the decision rule that minimizes ℛ is given by

    Σ_{i=1}^{K} Ci,k P(Si) p(x|Si) < Σ_{i=1}^{K} Ci,j P(Si) p(x|Si),    ∀ j ≠ k    ⟹    x ∈ Sk        (9.142)

If Ci,k = 1 for i ≠ k and Ci,k = 0 for i = k, then the decision rule simplifies to

    P(Sk) p(x|Sk) > P(Sj) p(x|Sj),    ∀ j ≠ k    ⟹    x ∈ Sk        (9.143)

In this case the probability of error in classification is also minimized, and the minimum error classifier discriminant becomes

    gk(x) = log p(x|Sk) + log P(Sk)        (9.144)

In practice the p(x|Sk) are estimated from the prototype data by either parametric or nonparametric techniques, which can yield simplified expressions for the discriminant function.
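As an illustration of the minimum-error rule, the sketch below assumes, purely for concreteness, that each p(x|Sk) has been modeled as a Gaussian whose mean and covariance were estimated from the prototypes; it evaluates the discriminant of (9.144) up to a constant common to all classes. Names and the Gaussian assumption are illustrative, not prescribed by the text.

    import numpy as np

    def min_error_classify(x, priors, means, covs):
        # priors: length-K list of P(S_k); means[k], covs[k]: Gaussian parameters
        # estimated from the prototypes of class k (a parametric assumption).
        # Discriminant of (9.144), g_k(x) = log p(x|S_k) + log P(S_k), with the
        # (2*pi)^(N/2) term common to all classes dropped.
        g = []
        for p, m, c in zip(priors, means, covs):
            diff = x - m
            g.append(-0.5 * diff @ np.linalg.solve(c, diff)
                     - 0.5 * np.log(np.linalg.det(c)) + np.log(p))
        return int(np.argmax(g))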

There also exist sequential classification techniques, such as the sequential probability ratio test (SPRT) and the generalized SPRT, where decisions can be made initially using fewer than N features and refined as more features are acquired sequentially [62]. The advantage lies in situations where N is large, so that acquiring and processing all N features at once would be costly.


Nonsupervised Learning or Clustering

In nonsupervised learning, we attempt to identify clusters or natural groupings in the feature space. A cluster is a set of points in the feature space whose local density is large (a relative maximum) compared to the density of feature points in the surrounding region. Clustering techniques are useful for image segmentation and for classification of raw data to establish classes and prototypes. Clustering is also a useful vector quantization technique for compression of images.

Example 9.12

The visual and IR images u1(m, n) and u2(m, n), respectively (Fig. 9.59a), are transformed pixel by pixel to give the features v1(m, n) = (u1(m, n) + u2(m, n))/√2, v2(m, n) = (u1(m, n) - u2(m, n))/√2. This is simply the 2 x 2 Hadamard transform of the 2 x 1 vector [u1 u2]^T. Figure 9.59b shows the feature images. The images v1(m, n) and v2(m, n) are found to contain mainly the cloud and land features, respectively. Thresholding these images yields the left-side images in Fig. 9.59c and d. Notice that the clouds contain some land features, and vice versa. A scatter diagram, which plots each vector [v1 v2]^T as a point in the v1 versus v2 space, is seen to have two main clusters (Fig. 9.60). Using the cluster boundaries for segmentation, we can remove the land features from the clouds, and vice versa, as shown in Fig. 9.59c and d (right-side images).

Figure 9.59 Segmentation by clustering. (a) Input images u1(m, n) and u2(m, n); (b) feature images v1(m, n) and v2(m, n); (c) segmentation of clouds by thresholding (left) and by clustering (right); (d) segmentation of land by thresholding v2 (left) and by clustering (right).

Figure 9.60 Scatter diagram in feature space.
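The pixelwise feature transform of this example amounts to a 2 x 2 Hadamard (sum and difference) transform, which might be coded as below; u1 and u2 stand for the visual and IR images as NumPy arrays, and the function name is illustrative.

    import numpy as np

    def hadamard_features(u1, u2):
        # Pixelwise 2 x 2 Hadamard transform of the vector [u1, u2]^T:
        # v1 = (u1 + u2)/sqrt(2) is the sum channel, v2 = (u1 - u2)/sqrt(2)
        # the difference channel.
        v1 = (u1 + u2) / np.sqrt(2.0)
        v2 = (u1 - u2) / np.sqrt(2.0)
        return v1, v2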

Similarity measure approach. The success of clustering techniques rests on the partitioning of the feature space into cluster subsets. A general clustering algorithm is based on split and merge ideas (Fig. 9.61). Using a similarity measure, the input vectors are partitioned into subsets. Each partition is tested to check whether or not the subsets are sufficiently distinct. Subsets that are not sufficiently distinct are merged. The procedure is repeated on each of the subsets until no further subdivisions result or some other convergence criterion is satisfied. Thus, a similarity measure, a distinctiveness test, and a stopping rule are required to define a clustering algorithm. For any two feature vectors xi and xj, some of the commonly used similarity measures are:

    Dot product:                    (xi, xj) ≜ xj^T xi

    Similarity rule:                s(xi, xj) ≜ (xi, xj) / [(xi, xi) + (xj, xj) - (xi, xj)]

    Weighted Euclidean distance:    d(xi, xj) ≜ Σk wk [xi(k) - xj(k)]^2

    Normalized correlation:         ρ(xi, xj) ≜ (xi, xj) / √[(xi, xi)(xj, xj)]
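Three of these measures translate directly into NumPy, as in the sketch below; the weight vector w for the Euclidean distance is assumed given, and the function names are illustrative.

    import numpy as np

    def dot_product(xi, xj):
        # (x_i, x_j) = x_j^T x_i
        return xj @ xi

    def weighted_euclidean(xi, xj, w):
        # d(x_i, x_j) = sum_k w_k [x_i(k) - x_j(k)]^2
        return np.sum(w * (xi - xj) ** 2)

    def normalized_correlation(xi, xj):
        # rho(x_i, x_j) = (x_i, x_j) / sqrt((x_i, x_i)(x_j, x_j))
        return (xj @ xi) / np.sqrt((xi @ xi) * (xj @ xj))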

Several different algorithms exist for clustering based on the similarity approach. Examples are given next.


Figure 9.61 A clustering approach.

Chain method [63]. The first data sample is designated as the representative of the first cluster, and the similarity or distance of the next sample is measured from the first cluster representative. If this distance is less than a threshold, say η, then it is placed in the first cluster; otherwise it becomes the representative of the second cluster. The process is continued for each new data sample until all the data has been exhausted. Note that this is a one-pass method.
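One reading of the chain method, with each new sample compared against the nearest existing representative, is sketched below; the threshold eta plays the role of η, and all names are illustrative.

    import numpy as np

    def chain_cluster(samples, eta):
        # One-pass chain method: the first sample starts cluster 0; each new
        # sample joins the nearest existing representative if it lies closer
        # than eta, otherwise it starts (and represents) a new cluster.
        reps, labels = [], []
        for x in samples:
            if not reps:
                reps.append(x)
                labels.append(0)
                continue
            d = [np.linalg.norm(x - r) for r in reps]
            j = int(np.argmin(d))
            if d[j] < eta:
                labels.append(j)
            else:
                reps.append(x)
                labels.append(len(reps) - 1)
        return labels, reps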

An iterative method (Isodata) [64]. Assume the number of clusters, K, is known. The partitioning of the data is done such that the average spread or variance of the partition is minimized. Let µk(n) denote the kth cluster center at the nth iteration and Rk denote the region of the kth cluster at a given iteration. Initially, we assign arbitrary values to µk(0). At the nth iteration take one of the data points xi and assign it to the cluster whose center is closest to it, that is,

    d(xi, µk(n)) = min_{j=1,...,K} [d(xi, µj(n))]    ⟹    xi ∈ Rk        (9.145)

where d(x, y) is the distance measure used. Recompute the cluster centers by finding the point that minimizes the distance for elements within each cluster. Thus µk(n + 1) is the solution of

    Σ_{xi ∈ Rk} d(xi, µk(n + 1)) = min_y Σ_{xi ∈ Rk} d(xi, y),    k = 1, ..., K        (9.146)

The procedure is repeated for each xi, one at a time, until the clusters and their centers remain unchanged.

If d(x, y) is the Euclidean distance, then a cluster center is simply the mean location of its elements. If K is not known, we can start with a large value of K and then merge clusters that turn out to be very close to one another.
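A compact sketch of the Isodata iteration with Euclidean distance is given below. For brevity it updates all samples in a batch rather than one at a time as described above; the initialization and names are illustrative.

    import numpy as np

    def isodata(samples, K, iters=100, seed=0):
        # samples: M x N array. With Euclidean distance, the center update of
        # (9.146) reduces to the mean of each cluster's members.
        rng = np.random.default_rng(seed)
        centers = samples[rng.choice(len(samples), K, replace=False)]
        for _ in range(iters):
            # Assignment step, (9.145): each sample goes to the nearest center.
            d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
            labels = np.argmin(d, axis=1)
            # Update step, (9.146): recompute each center from its members.
            new_centers = np.array([samples[labels == k].mean(axis=0)
                                    if np.any(labels == k) else centers[k]
                                    for k in range(K)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers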

Figure 9.62 Image understanding systems.


Other Methods

Clusters can also be viewed as being located at the modes of the joint Nth-order histogram of the feature vector. Other clustering methods are based on statistical nonsupervised learning techniques, ranking, intrinsic dimensionality determination, graph theory, and so on [65, 66]. Discussion of those techniques is beyond the goals of this text.

Finally, it should be noted that the success of clustering techniques is closely tied to feature selection. Clusters not detected in a given feature space may be easier to detect in rotated, scaled, or transformed coordinates. For images the feature-vector elements could represent gray level, gradient magnitude, gradient phase, color, and/or other attributes. It may also be useful to decorrelate the elements of the feature vector.
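One simple way to decorrelate the feature components is to rotate them into the eigenvector basis of their sample covariance (a Karhunen-Loeve style transform). A minimal sketch, with illustrative names, follows.

    import numpy as np

    def decorrelate(features):
        # features: M x N array of feature vectors (one per row).
        # Rotate into the eigenvector basis of the sample covariance so that
        # the transformed feature components are uncorrelated.
        x = features - features.mean(axis=0)
        cov = np.cov(x, rowvar=False)
        _, vecs = np.linalg.eigh(cov)
        return x @ vecs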