A Fast Fixed-Point Algorithm for Two-Class Discriminative Feature Extraction

  


  Zhirong Yang and Jorma Laaksonen

  

Laboratory of Computer and Information Science

Helsinki University of Technology

P.O. Box 5400, FI-02015 HUT, Espoo, Finland

{zhirong.yang, jorma.laaksonen}@hut.fi

Abstract. We propose a fast fixed-point algorithm to improve Relevant Component Analysis (RCA) in two-class cases. Using an objective function that maximizes the predictive information, our method is able to extract more than one discriminative component of the data for two-class problems, which cannot be accomplished by classical Fisher’s discriminant analysis. After prewhitening the data, we apply Newton’s optimization method, which automatically chooses the learning rate in the iterative training of each component. The convergence of the iterative learning is quadratic, i.e. much faster than the linear convergence of gradient methods. Empirical tests presented in the paper show that feature extraction with the new method resembles RCA for low-dimensional ionosphere data and significantly outperforms the latter in efficiency for high-dimensional facial image data.

1 Introduction

Supervised linear dimension reduction, or discriminative feature extraction, is a common technique in pattern recognition. Such a preprocessing step not only reduces the computational complexity but also reveals the relevant information in the data.

Fisher’s Linear Discriminant Analysis (LDA) [3] is a classical method for this task. Modeling each class by a single Gaussian distribution and assuming that all classes share the same covariance, LDA maximizes the Fisher criterion of between-class scatter over within-class scatter and can be solved by Singular Value Decomposition (SVD). LDA is attractive for its simplicity. Nevertheless, it yields only one discriminative component for two-class problems because the between-class scatter matrix is of rank one. That is, the discriminative information can only be coded with a single number, and a lot of relevant information may be lost during the dimensionality reduction.

  Loog and Duin [2] extended LDA to the heteroscedastic case based on the simplified Chernoff distance between two classes. They derived an alternative

  


criterion which uses the individual scatter matrices of both classes. The generalized objective can still be optimized by SVD, and their method can possibly output more than one projecting direction.

LDA and the above Chernoff extension, as well as many other variants such as [4, 7], only utilize up to second-order statistics of the class distribution. Peltonen and Kaski [5] recently proposed an alternative approach, Relevant Component Analysis (RCA), to find a subspace that is as informative as possible about the classes. They model the prediction by a generative procedure of classes given the projected values, and the objective is to maximize the log-likelihood of the supervised data. In their method, the predictive probability density is approximated by Parzen estimators. The training procedure requires the user to specify a proper starting learning rate, which, however, lacks theoretical guidance and may be difficult in some cases. Moreover, the slow convergence of the stochastic gradient algorithm leads to time-consuming learning.

In this paper, we propose an improved method to speed up the RCA training for two-class problems. We employ three strategies for this goal: prewhitening the data, learning the uncorrelated components individually, and optimizing the objective by Newton’s method. The result is a fast Fixed-Point Relevant Component Analysis (FPRCA) algorithm whose optimization convergence is quadratic. The new method inherits the essential advantages of RCA: it can handle distributions more complicated than single Gaussians and extract more than one discriminative component of the data. Furthermore, the user does not need to specify the learning rates because they are optimized by the algorithm.

  We start with a brief review of RCA in Section 2. Next, we discuss the data preprocessing and the fast optimization algorithm of RCA in Section 3. Section 4 gives the experiments and comparisons on ionosphere and facial image data. Section 5 concludes the paper.

  2 Relevant Component Analysis

Consider a supervised data set which consists of pairs (x_j, c_j), j = 1, …, n, where x_j ∈ R^m is the primary data and the auxiliary data c_j takes binary categorical values. Relevant Component Analysis (RCA) [5] seeks a linear m × r orthonormal projection W that maximizes the predictive power of the primary data. This is done by constructing a generative probabilistic model of c_j given the projected value y_j = W^T x_j ∈ R^r and maximizing the total estimated log-likelihood over W:

\[ \max_{W} \; J_{RCA} = \sum_{j=1}^{n} \log \hat p(c_j \mid y_j). \qquad (1) \]

In RCA, the estimated probability p̂(c_j | y_j) is computed by the definition of the conditional probability density function as

\[ \hat p(c_j \mid y_j) = \frac{\Omega(y_j, c_j)}{\sum_{c} \Omega(y_j, c)}. \qquad (2) \]

Here

\[ \Omega(y_j, c) = \frac{1}{n} \sum_{i=1}^{n} \psi(i, c)\, \omega(y_i, y_j) \qquad (3) \]

is the Parzen estimate of p̂(y_j, c), and the membership function ψ(i, c) = 1 if c_i = c and 0 otherwise. A Gaussian kernel is used in [5] as the Parzen window function,

\[ \omega(y_i, y_j) = \frac{1}{(2\pi\sigma^2)^{r/2}} \exp\!\left( -\frac{\| y_i - y_j \|^2}{2\sigma^2} \right), \qquad (4) \]

where σ controls the smoothness of the density estimate. Peltonen and Kaski [5] derived the gradient of J_RCA with respect to W, based on which one can compute the gradients for the Givens rotation angles and then update W for the next iteration. The RCA algorithm applies a stochastic gradient optimization method, and the iterations converge to a local optimum with a properly decreasing learning rate.
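For concreteness, the following NumPy sketch evaluates the estimates (2)–(4) and the objective (1) for a given projection. It is only an illustration of the formulas above, not the authors' implementation; the function name and the looping choices are our own assumptions.

```python
import numpy as np

def rca_objective(W, X, c, sigma=0.1):
    """Estimate J_RCA = sum_j log p_hat(c_j | y_j) with Parzen windows.

    W : (m, r) projection matrix, X : (n, m) primary data,
    c : (n,) binary class labels.  A sketch of eqs (1)-(4) only.
    """
    c = np.asarray(c)
    Y = X @ W                                   # projected values y_j = W^T x_j
    n, r = Y.shape
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)        # ||y_i - y_j||^2
    omega = np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (r / 2)  # eq (4)
    p_hat = np.empty(n)
    for j in range(n):
        num = omega[c == c[j], j].sum() / n     # Omega(y_j, c_j), eq (3)
        den = omega[:, j].sum() / n             # sum over classes of Omega(y_j, c)
        p_hat[j] = num / den                    # eq (2)
    return np.log(p_hat).sum()                  # eq (1)
```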

  

3 Two-Class Discriminant Analysis by RCA with a Fast Fixed-Point Algorithm

3.1 Preprocessing the Data

Suppose the data has been centered to zero mean. Our algorithm requires prewhitening the primary data, i.e. to find an m × m symmetric matrix V and to transform z = Vx such that E{zz^T} = I. The matrix V can be obtained for example by

\[ V = E D^{-1/2} E^T, \qquad (5) \]

where [E, D, E^T] = svd(E{xx^T}) is the singular value decomposition of the scatter matrix of the primary data.

Prewhitening the primary data greatly simplifies the algorithm described in the following section. We can acquire a diagonal approximation of the Hessian matrix and then easily invert it. Another utility of whitening resides in the fact that for two projecting vectors w_p and w_q,

\[ E\{ (w_p^T z)(w_q^T z) \} = w_p^T E\{ z z^T \}\, w_q = w_p^T w_q, \qquad (6) \]

and therefore uncorrelatedness is equivalent to orthogonality. This allows us to individually extract uncorrelated features by orthogonalizing the projecting directions. In addition, selecting the σ parameter in the Gaussian kernel function becomes easier because the whitened data has unit variance on all axes, so σ can be chosen on this common unit scale.
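A minimal sketch of this whitening step follows (our own code; it uses an eigendecomposition of the symmetric scatter matrix, which coincides with the SVD form (5), and the small eps guard against near-singular scatter is an extra assumption):

```python
import numpy as np

def prewhiten(X, eps=1e-12):
    """Center X (n x m) and whiten it so that E{z z^T} = I, cf. eq (5)."""
    Xc = X - X.mean(axis=0)                       # zero-mean primary data
    S = Xc.T @ Xc / Xc.shape[0]                   # scatter matrix E{x x^T}
    d, E = np.linalg.eigh(S)                      # S = E diag(d) E^T (symmetric)
    V = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T # V = E D^{-1/2} E^T
    Z = Xc @ V                                    # z = V x (V is symmetric)
    return Z, V
```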


3.2 Optimization Algorithm

Let us first consider the case of a single discriminative component, where y_j = w^T z_j ∈ R. Our fixed-point algorithm for finding an extreme point of J_RCA iteratively applies a Newton update followed by a normalization:

\[ w^{\dagger} = w - \left( \frac{\partial^2 J_{RCA}}{\partial w^2} \right)^{-1} \frac{\partial J_{RCA}}{\partial w}, \qquad (7) \]

\[ w_{\mathrm{new}} = \frac{w^{\dagger}}{\| w^{\dagger} \|}. \qquad (8) \]

Denote J_j = log p̂(c_j | y_j). The gradient of J_RCA in (1) with respect to w can then be expressed as

\[ \frac{\partial J_{RCA}}{\partial w} = \sum_{j=1}^{n} \frac{\partial J_j}{\partial w} = \sum_{j=1}^{n} \sum_{i=1}^{n} \frac{d J_j}{d(y_i - y_j)} \cdot \frac{\partial (y_i - y_j)}{\partial w}. \qquad (9) \]

Notice that the chain rule in the last step applies to the subscript i, i.e. y_i is treated as an intermediate variable and y_j as a constant. We write g_{ij} = dJ_j / d(y_i − y_j) for brevity. If the estimated predictive probability density p̂(c_j | y_j) is obtained by the Parzen window technique as in (2) and (3), we can then (see Appendix) write out g_{ij} as

\[ g_{ij} = \frac{\psi(i, c_j)\, \omega'(y_i, y_j)}{\sum_{k=1}^{n} \psi(k, c_j)\, \omega(y_k, y_j)} - \frac{\omega'(y_i, y_j)}{\sum_{k=1}^{n} \omega(y_k, y_j)}, \qquad (10) \]

where ω′ (and below ω″) denotes the derivative of the Parzen window function with respect to the difference y_i − y_j.

For notational simplicity, denote

\[ \Delta_{ij} = \frac{\partial (y_i - y_j)}{\partial w} = z_i - z_j \qquad (11) \]

and the average of all objects (vectors or scalars) a_{ij}, (i, j) ∈ [1, …, n] × [1, …, n], as

\[ E\{ a \} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}. \qquad (12) \]

We can then write

\[ \frac{\partial J_{RCA}}{\partial w} = \sum_{i=1}^{n} \sum_{j=1}^{n} g_{ij}\, \Delta_{ij} = n^2 E\{ g \circ \Delta \}, \qquad (13) \]

where ◦ stands for the element-wise product and Δ consists of n × n vectors of size m.
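As a small illustration (our own sketch, assuming the g_{ij} values have already been computed as an n × n array), the pair average (12) and the gradient (13) can be evaluated as follows:

```python
import numpy as np

def gradient_JRCA(g, Z):
    """Evaluate (13): dJ_RCA/dw = n^2 E{g o Delta} for whitened data Z (n x m)
    and a precomputed (n x n) matrix g of the g_ij values.  A sketch only."""
    n = Z.shape[0]
    Delta = Z[:, None, :] - Z[None, :, :]                  # Delta[i, j] = z_i - z_j, eq (11)
    E_gDelta = (g[:, :, None] * Delta).mean(axis=(0, 1))   # pair average E{.}, eq (12)
    return n**2 * E_gDelta
```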

By taking the derivative of g_{ij} with respect to y_i − y_j, we obtain

\[ g'_{ij} = \frac{\partial g_{ij}}{\partial (y_i - y_j)}
 = \frac{\psi(i, c_j)\, \omega''(y_i, y_j)}{\sum_{k=1}^{n} \psi(k, c_j)\, \omega(y_k, y_j)}
 - \frac{\bigl( \psi(i, c_j)\, \omega'(y_i, y_j) \bigr)^2}{\bigl( \sum_{k=1}^{n} \psi(k, c_j)\, \omega(y_k, y_j) \bigr)^2}
 - \frac{\omega''(y_i, y_j)}{\sum_{k=1}^{n} \omega(y_k, y_j)}
 + \frac{\bigl( \omega'(y_i, y_j) \bigr)^2}{\bigl( \sum_{k=1}^{n} \omega(y_k, y_j) \bigr)^2}. \qquad (14) \]

Based on g'_{ij} one can compute

\[ \frac{\partial^2 J_{RCA}}{\partial w^2} = n^2 E\{ g' \circ \Delta \Delta^T \}. \qquad (15) \]

Notice that E{ΔΔ^T} = 2E{zz^T} = 2I if the data is centered and prewhitened (see Appendix for a proof). Furthermore, if we approximate E{g′ ◦ ΔΔ^T} ≈ E{g′}E{ΔΔ^T}, assuming that g′ and ΔΔ^T are pairwise uncorrelated, the Hessian (15) can be approximated by

\[ \frac{\partial^2 J_{RCA}}{\partial w^2} = 2 n^2 E\{ g' \}\, I \qquad (16) \]

with E{g′} ∈ R. Inserting (13) and (16) into (7), we obtain

\[ w^{\dagger} = w - \frac{n^2 E\{ g \circ \Delta \}}{2 n^2 E\{ g' \}} = \frac{1}{2 E\{ g' \}} \bigl( 2 E\{ g' \}\, w - E\{ g \circ \Delta \} \bigr). \qquad (17) \]

Because the normalization step (8) is invariant to scaling and the sign of the projection does not affect the subspace predictiveness, we can drop the scalar factor in front and change the order of the terms in the parentheses. The update rule (7) then simplifies to

\[ w^{\dagger} = E\{ g \circ \Delta \} - 2 E\{ g' \}\, w. \qquad (18) \]

In this work we employ a deflationary method to extract multiple discriminative components. Precisely, the Fixed-Point Relevant Component Analysis (FPRCA) algorithm comprises the following steps (a code sketch of the whole procedure is given after the list):

1. Center the data to make its mean zero and whiten it to make its scatter matrix an identity matrix.
2. Compute Δ, the matrix of pairwise sample difference vectors as in (11).
3. Choose r, the number of discriminative components to estimate, and σ if the Gaussian kernel (4) is used. Set p ← 1.
4. Initialize w_p (e.g. randomly).
5. Compute g and g′ and then update w_p ← E{g ◦ Δ} − 2E{g′} w_p.
6. Do the following orthogonalization:
\[ w_p \leftarrow w_p - \sum_{q=1}^{p-1} (w_p^T w_q)\, w_q. \qquad (19) \]
7. Normalize w_p ← w_p / ‖w_p‖.
8. If not converged, go back to step 5.
9. Set p ← p + 1. If p ≤ r, go back to step 4.
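Putting steps 1–9 together, the following NumPy sketch trains the components with the one-dimensional Gaussian window (4), as in this paper's experiments. It is our own illustrative implementation, not the authors' code: the derivatives ω′ and ω″ are taken with respect to y_i − y_j, and the convergence test, iteration cap, and all function and variable names are assumptions. The input Z is assumed to be already centered and whitened as in Section 3.1.

```python
import numpy as np

def fprca(Z, c, r=2, sigma=0.1, max_iter=100, tol=1e-6, rng=None):
    """Fixed-Point RCA sketch on whitened, centered data Z (n x m), labels c (n,)."""
    rng = np.random.default_rng(rng)
    c = np.asarray(c)
    n, m = Z.shape
    same = (c[:, None] == c[None, :]).astype(float)   # same[i, j] = psi(i, c_j)
    Delta = Z[:, None, :] - Z[None, :, :]             # Delta[i, j] = z_i - z_j, eq (11)
    W = np.zeros((m, r))
    for p in range(r):                                # deflationary extraction
        w = rng.standard_normal(m)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = Z @ w
            D = y[:, None] - y[None, :]               # D[i, j] = y_i - y_j
            om = np.exp(-D**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)  # eq (4), r = 1
            om1 = -D / sigma**2 * om                  # omega'  w.r.t. (y_i - y_j)
            om2 = (D**2 / sigma**4 - 1.0 / sigma**2) * om   # omega''
            A = (same * om).sum(axis=0)               # sum_k psi(k, c_j) omega(y_k, y_j)
            B = om.sum(axis=0)                        # sum_k omega(y_k, y_j)
            g = same * om1 / A - om1 / B              # eq (10)
            g1 = (same * om2 / A - (same * om1)**2 / A**2
                  - om2 / B + om1**2 / B**2)          # eq (14)
            EgD = (g[:, :, None] * Delta).mean(axis=(0, 1))  # E{g o Delta}
            w_new = EgD - 2.0 * g1.mean() * w         # update (18)
            w_new -= W[:, :p] @ (W[:, :p].T @ w_new)  # orthogonalize, eq (19)
            w_new /= np.linalg.norm(w_new)            # normalize, step 7
            converged = abs(abs(w_new @ w) - 1.0) < tol
            w = w_new
            if converged:
                break
        W[:, p] = w
    return W
```

For example, W = fprca(Z, c, r=2) would return two orthogonal (and hence, for whitened data, uncorrelated) projection directions; note that the n × n × m difference array makes this sketch practical only for moderate n.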


  4 Experiments

We have tested the FPRCA algorithm on facial images collected under the FERET program [6] and on the ionosphere data available at [1]. The ionosphere data consists of 351 instances, each of which has 34 real-valued attributes; 225 samples are labeled as good and the other 126 as bad. For the FERET data, 2409 frontal facial images (poses “fa” and “fb”) of 867 subjects were stored in the database after face segmentation. In this work we obtained the coordinates of the eyes from the ground truth data of the collection, with which we calibrated the head rotation so that all faces are upright. Afterwards, all face boxes were normalized to the size of 32×32, with fixed locations for the left eye (26,9) and the right eye (7,9). Two classes, mustache (256 images, 81 subjects) and no mustache (2153 images, 786 subjects), have been used in the following experiments.

4.1 Visualizing Discriminative Features

First we demonstrate the existence of multiple discriminative components in two-class problems. For illustrative purposes, we use two-dimensional projections. The first dimension w_1 is obtained from LDA, and the second dimension w_2 is trained by FPRCA and made orthogonal to w_1 as in (19). All the experiments of FPRCA in this paper use the one-dimensional Gaussian kernel (4) with σ = 0.1 as the Parzen window function.

Figure 1 (a) shows the projected values of the two classes of facial images. The plot illustrates that the vertical w_2 axis provides extra discriminative information in addition to the horizontal w_1 axis computed by the LDA method. It can also be seen that along the vertical axis the mustache class comprises two separate clusters. Such a projecting direction can by no means be found by LDA and its variants because they restrict the projected classes to single Gaussians.

For comparison, the result of HLDR using the Chernoff criterion [2] (CHERNOFF) is shown in Figure 1 (b). The first (horizontal) dimension resembles the LDA result, while the second (vertical) provides little discriminative information. It can be seen that the classes overlap more in the latter plot.

4.2 Discriminative Features for Classification

Next we compared the classification results on the ionosphere data using the discriminative features extracted by FPRCA and by three other methods: LDA, CHERNOFF [2], and RCA. Two kinds of FPRCA features were used. For r = 1, the one-dimensional projection was initialized by LDA and then trained by FPRCA. For r = 2, the first component was the training result of r = 1 and the additional component was initialized by a random orthogonal vector and then trained by FPRCA.

Fig. 1. Projected values of the classes mustache and no mustache. (a) The horizontal axis is obtained by LDA and the vertical by FPRCA. (b) Both dimensions are learned by HLDR with the Chernoff criterion [2].

The supervised learning and testing were carried out in three modes: ALL – the training set equals the testing set; LOO – leave one instance out for testing and use the rest for training; HALF – half of the samples are used for training and the other half for testing. Both LOO and HALF measure the generalization ability. The latter mode is stochastic and tests the performance with a much smaller training set. For it, we repeated the experiment ten times with different random seeds and calculated the mean accuracy.

Figure 2 (a) illustrates the Nearest-Neighbor (NN) classification accuracies for the compared methods in the above three testing modes with the ionosphere data. The left two bars in each group show that FPRCA outperforms LDA in all three modes when a single discriminative component is used. This verifies that the higher-order statistics involved in the information-theoretic objective can enhance the discrimination. The right three bars demonstrate the performance of two-dimensional discriminative features. It can be seen that the additional component learned by CHERNOFF even deteriorates the classification compared to LDA alone. Furthermore, CHERNOFF shows poor generalization when the amount of training data becomes small. In contrast, RCA and FPRCA (r = 2) exceed LDA and FPRCA (r = 1) when the second component is added. The accuracies of FPRCA (r = 2) are comparable to those of RCA, as the differences are within 3.5 percentage points. RCA performs slightly better than FPRCA (r = 2) in the LOO and HALF modes because we applied a grid search for the optimal parameters of RCA, while we did not for FPRCA.

We also performed the classification experiments on the FERET data. We employed 5-fold cross-validation (CV) instead of LOO because the latter would be very time-consuming for such a large dataset. The data division for the HALF and 5-fold CV modes is based on subject identities. That is, all the images of one subject belong either to the training set or to the testing set, never to both.

Figure 2 (b) shows the NN classification accuracies. In the one-dimensional case, FPRCA is again superior to LDA in all modes. The CHERNOFF method ranks better in the ALL mode, but behaves badly and becomes the worst one in the other two modes.

Fig. 2. Nearest-Neighbor (NN) classification accuracies for the compared methods in three testing modes: (a) the ionosphere data and (b) the FERET data.

RCA and FPRCA attain not only the lowest training errors but also better generalization. The additional dimension introduced by FPRCA is more advantageous when the training data become scarce. The accuracies of FPRCA (r = 2) exceed those of the other compared methods, and the difference is especially significant in the HALF mode. The result for the RCA method may still be suboptimal due to its computational difficulty, which will be addressed in the next section.

4.3 RCA vs. FPRCA in Learning Time

We have recorded the running times of RCA and FPRCA on a Linux machine with 12 GB of RAM and two 64-bit 2.2 GHz AMD Opteron processors. For the 34-dimensional ionosphere data, both RCA and FPRCA converged within one minute; the exact running times were 38 and 45 seconds, respectively. However, the calculation becomes very time-demanding for RCA when it is applied to the 1024-dimensional facial image data. Ten iterations of RCA learning on the FERET database required 598 seconds. A 5,000-iteration RCA training, which utilizes each image only roughly twice on average, took about 83 hours, i.e. more than three days. In contrast, the FPRCA algorithm converges within 20 iterations for the mustache classification problem, and the training time was 4,400 seconds.

On the other hand, RCA is problematic when a wrong σ or learning rate parameter is selected, because one then has to re-run the time-consuming procedure. Meanwhile, the user of FPRCA does not need to exhaustively try different parameters: the learning rate is automatically chosen by the algorithm, and the range of suitable σ values is easy to determine for the whitened data.


5 Conclusions

The objective of maximizing predictive information is known to yield better discriminative power than methods based on only second-order statistics. We presented a fast fixed-point algorithm that efficiently learns the discriminative components of data based on an information-theoretic criterion. Prewhitening the primary data facilitates the parameter selection for the Parzen windows and enables approximating the inverse Hessian matrix. The learning rate of each iteration is automatically optimized by Newton’s method, which eases the use of the algorithm. Our method converges quadratically, and the extracted discriminative features are advantageous for both visualization and classification.

Like other linear dimensionality reduction methods, FPRCA is readily extended to its kernel version. The nonlinear discriminative components can be obtained by mapping the primary data to a higher-dimensional space with appropriate kernels.

  A Appendix

First we derive g_{ij} in (10):

\[ g_{ij} = \frac{d J_j}{d(y_i - y_j)} = \frac{ \dfrac{d \hat p(c_j \mid y_j)}{d(y_i - y_j)} }{ \hat p(c_j \mid y_j) }, \qquad (20) \]

where we have

\[ \frac{d \hat p(c_j \mid y_j)}{d(y_i - y_j)} = \frac{ \dfrac{d}{d(y_i - y_j)} \Omega(y_j, c_j) }{ \sum_{c} \Omega(y_j, c) } - \hat p(c_j \mid y_j)\, \frac{ \sum_{c} \dfrac{d}{d(y_i - y_j)} \Omega(y_j, c) }{ \sum_{c} \Omega(y_j, c) }. \qquad (21) \]

Inserting (21) and (2) into (20), we obtain

\[ g_{ij} = \frac{ \dfrac{d}{d(y_i - y_j)} \Omega(y_j, c_j) }{ \Omega(y_j, c_j) } - \frac{ \sum_{c} \dfrac{d}{d(y_i - y_j)} \Omega(y_j, c) }{ \sum_{c} \Omega(y_j, c) }
 = \frac{ \psi(i, c_j)\, \omega'(y_i, y_j) }{ \sum_{k=1}^{n} \psi(k, c_j)\, \omega(y_k, y_j) } - \frac{ \bigl( \sum_{c} \psi(i, c) \bigr)\, \omega'(y_i, y_j) }{ \sum_{k=1}^{n} \bigl( \sum_{c} \psi(k, c) \bigr)\, \omega(y_k, y_j) }. \qquad (22) \]

Since each sample is assigned to exactly one class, Σ_c ψ(i, c) = 1 and Σ_c ψ(k, c) = 1, and we finally obtain (10).

Next we show that E{ΔΔ^T} = 2E{zz^T}:

\[ E\{ \Delta \Delta^T \} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (z_i - z_j)(z_i - z_j)^T
 = \frac{1}{n} \sum_{i=1}^{n} z_i z_i^T - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} z_i z_j^T - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} z_j z_i^T + \frac{1}{n} \sum_{j=1}^{n} z_j z_j^T \]

\[ = \frac{2}{n} \sum_{i=1}^{n} z_i z_i^T \qquad (23) \]

\[ = 2 E\{ z z^T \}, \qquad (24) \]

where step (23) is obtained because the primary data is centered, i.e. of zero mean.
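As a quick numerical sanity check of this identity (our own sketch with random data, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 5))
Z -= Z.mean(axis=0)                               # center: zero-mean data

Delta = Z[:, None, :] - Z[None, :, :]             # Delta[i, j] = z_i - z_j
lhs = np.einsum('ija,ijb->ab', Delta, Delta) / Z.shape[0]**2   # E{Delta Delta^T}
rhs = 2 * Z.T @ Z / Z.shape[0]                    # 2 E{z z^T}
print(np.allclose(lhs, rhs))                      # prints True
```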

References

1. C.L. Blake, D.J. Newman, S. Hettich, and C.J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
2. M. Loog and R.P.W. Duin. Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):732–739, 2004.
3. R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.
4. Peg Howland and Haesun Park. Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):995–1006, 2004.
5. Jaakko Peltonen and Samuel Kaski. Discriminative components of data. IEEE Transactions on Neural Networks, 16(1):68–83, 2005.
6. P.J. Phillips, H. Moon, S.A. Rizvi, and P.J. Rauss. The FERET evaluation methodology for face recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1090–1104, 2000.
7. Y. Xu, J.Y. Yang, and Z. Jin. Theory analysis on FSLDA and ULDA. Pattern Recognition, 36(12):3031–3033, 2003.