Face Recognition Using Parzenfaces

  Zhirong Yang and Jorma Laaksonen

  

Laboratory of Computer and Information Science

Helsinki University of Technology

P.O. Box 5400, FI-02015 TKK, Espoo, Finland

{zhirong.yang, jorma.laaksonen}@tkk.fi

  

Abstract. A novel discriminant analysis method is presented for the face recognition problem. It has been recently shown that predictive objectives based on Parzen estimation are advantageous for learning discriminative projections if the class distributions are complicated in the projected space. However, the existing algorithms based on Parzen estimators require expensive computation to obtain the gradient for optimization. We propose here an accelerating technique by reformulating the gradient and implementing its computation by matrix products. Furthermore, we point out that regularization is necessary for high-dimensional face recognition problems. The discriminative objective is therefore extended by a smoothness constraint on facial images. Our Parzen Discriminant Analysis method can be trained much faster and achieves higher recognition accuracies than the compared algorithms in experiments on two popularly used face databases.

1 Introduction

Face Recognition (FR) is an increasingly active research topic. The challenge of FR arises first from the high dimensionality of facial images. The problem becomes even more challenging in the presence of structured variations such as poses and expressions, which are difficult to model and cause the data to lie on complicated manifolds. Therefore, research in this field is not only useful for classifying faces, but also conducive to other high-dimensional pattern recognition problems.

A substantial amount of effort has been devoted to the FR problem, among which Fisher's Linear Discriminant Analysis (LDA) is widely used. Modeling each class by a single Gaussian distribution with a shared covariance, LDA maximizes the Fisher criterion of between-class scatter over within-class scatter and can be solved by Singular Value Decomposition (SVD). The facial feature extraction by LDA is called Fisherfaces [1]. The Fisherface method is attractive for its simplicity, but the assumption of Gaussians with common covariance heavily restricts its performance. Moreover, Fisherface requires preprocessing by Principal Component Analysis (PCA), in which discriminative information may be lost during the unsupervised dimensionality reduction. Later many variants of Fisherface such as [2] have been proposed. However, the Fisherface method and its variants make use of only the first- and second-order statistics of the class distributions while discarding the higher-order statistics.

Supported by the Academy of Finland in the projects Neural methods in information retrieval based on automatic content analysis and relevance feedback and Finnish

Recently Goldberger et al. [3] proposed Neighborhood Component Analysis (NCA), which learns a linear transformation matrix by maximizing the summed likelihood of the labeled data. The probability density at each data point is estimated by using the neighbors in the transformed space, which turns out to be the Parzen estimate of the posterior of the class label. Peltonen and Kaski later proposed a very similar method called Informative Discriminant Analysis (IDA) [4], in which they instead employ the log-likelihood, i.e. the information of the predictive probability density. The likelihood formulation allows NCA and IDA to model very complicated class distributions. It was reported that these two methods outperform traditional discriminant analysis approaches in a number of low-dimensional supervised learning problems. However, the optimization of NCA or IDA requires the gradient of the Parzen-based objective, the computation of which is too expensive for most applications. To obtain an orthonormal transformation matrix, IDA employs a reparameterization based on Givens rotations, which further aggravates the computation and prevents its application to high-dimensional data. Peltonen et al. later proposed a modified version [5] to speed up the computation by using a small number of Gaussian mixture components instead of the Parzen method. This nevertheless loses the advantage of nonparametric estimation. One has to insert additional EM iterations before computing the gradient, and how to select an appropriate number of Gaussians is unclear.

In this paper we point out that the computational burden of calculating the gradient in NCA and IDA can be significantly reduced by using matrix multiplication. Next, the Givens reparameterization in IDA can be replaced by geodesic updates on the Stiefel manifold, which further simplifies the optimization. Furthermore, we propose to regularize the projection matrix by employing a smoothness constraint. This is done by introducing an additional penalization term on local pixel variance. We name the new method Parzenface when applying our discriminant analysis to the face recognition problem. The experiments on two public facial image databases, FERET [6] and ORL [7], demonstrate that our learning algorithm can achieve higher accuracy and run much faster than NCA and IDA.

2 Parzen Discriminant Analysis

2.1 Unregularized objective

Consider a supervised data set which consists of pairs (x_j, c_j), j = 1, ..., n, where x_j ∈ R^m is the primary data and the auxiliary data c_j takes categorical values. We seek an m × r orthonormal matrix W by which the primary data are projected, and maximize the predictive log-likelihood

J(W) = \sum_{i=1}^{n} \log p(c_i \mid y_i) = \sum_{i=1}^{n} \log \frac{p(y_i \mid c_i)}{p(y_i)} + \sum_{i=1}^{n} \log p(c_i)    (1)

in the projected space, where y_i = W^T x_i.

If we estimate p(y_i | c_i) and p(y_i) by the Parzen window method, the objective becomes

J(W) = \sum_{i=1}^{n} \log \frac{\sum_{j: c_j = c_i} e_{ij}}{\sum_{j=1}^{n} e_{ij}} + \text{const} = \sum_{i=1}^{n} J_i + \text{const},    (2)

where J_i is the shorthand notation for \log\left( \sum_{j: c_j = c_i} e_{ij} / \sum_{j=1}^{n} e_{ij} \right) and

e_{ij} = \begin{cases} \exp\left( -\frac{\|y_i - y_j\|^2}{2\sigma^2} \right) & \text{if } i \neq j \\ 0 & \text{if } i = j, \end{cases}    (3)

with σ a positive parameter which controls the Gaussian window width.
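To make the Parzen objective concrete, here is a minimal NumPy sketch that evaluates (2)-(3) for a given projection matrix. The function and variable names (parzen_objective, W, X, labels, sigma) are our own illustrative choices, not from the paper.

```python
import numpy as np

def parzen_objective(W, X, labels, sigma):
    """Evaluate the unregularized Parzen objective (2), up to the constant term.

    W      : (m, r) projection matrix
    X      : (m, n) data matrix, one column per sample
    labels : (n,) class labels c_i
    sigma  : Gaussian window width
    """
    Y = W.T @ X                                          # y_i = W^T x_i, shape (r, n)
    sq = np.sum(Y**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)     # pairwise ||y_i - y_j||^2
    E = np.exp(-d2 / (2.0 * sigma**2))                   # e_ij from (3)
    np.fill_diagonal(E, 0.0)                             # e_ii = 0
    same = labels[:, None] == labels[None, :]            # indicator of c_j = c_i
    num = np.sum(E * same, axis=1)                       # sum over j with c_j = c_i
    den = np.sum(E, axis=1)                              # sum over all j
    return np.sum(np.log(num / den))                     # sum_i J_i
```

A random orthonormal W, e.g. the Q factor from np.linalg.qr applied to a random m × r matrix, can be used to sanity-check the function on a small labeled sample.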

2.2 Computing the gradient

Our optimization algorithm is based on the gradient of J(W) with respect to W:

\nabla \equiv \nabla_W J = \sum_{i=1}^{n} \frac{\partial J_i}{\partial W} = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial J_i}{\partial \|y_j - y_i\|^2} \cdot \frac{\partial \|y_j - y_i\|^2}{\partial W}.    (4)

Notice that the chain rule in the inner summation applies to the subscript j, i.e. treating y_j as an intermediate variable and y_i as a constant. Denote

G_{ij} \equiv \frac{\partial J_i}{\partial \|y_j - y_i\|^2}    (5)

for notational simplicity. The gradient then becomes

\nabla = \sum_{i=1}^{n} \sum_{j=1}^{n} G_{ij} (x_i - x_j)(x_i - x_j)^T W.    (6)

Direct computation of \nabla by going through all the vector pairs is too expensive. Notice, however, that

\sum_{i=1}^{n} \sum_{j=1}^{n} G_{ij} (x_i - x_j)(x_i - x_j)^T    (7)
= 2 \left( \sum_{i=1}^{n} \sum_{j=1}^{n} x_i G_{ij} x_i^T - \sum_{i=1}^{n} \sum_{j=1}^{n} x_i G_{ij} x_j^T \right)    (8)
= 2 \left( \sum_{i=1}^{n} x_i D_{ii} x_i^T - \sum_{i=1}^{n} \sum_{j=1}^{n} x_i G_{ij} x_j^T \right)    (9)
= 2 \left( X D X^T - X G X^T \right)    (10)
= 2 X (D - G) X^T,    (11)

where X = [x_1, x_2, ..., x_n] and D is a diagonal matrix with D_{ii} = \sum_{j=1}^{n} G_{ij}. That is, the gradient can be computed by matrix operations as

\nabla = 2 X (D - G) X^T W.    (12)

It is known that there exist fast algorithms that implement matrix multiplication in O(\tau^q) time, where \tau = \min(m, n) and q is a positive scalar less than 3 and approaching 2 [8]. Many researchers believe that an optimal algorithm will run in essentially O(\tau^2) time [9]. In practice, if the matrix multiplication is accelerated via the Fast Fourier Transform (FFT), the computation of the gradient (12) can be accomplished in O(\tau^2 \log \tau) time [8], which is already acceptable for most applications.
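The matrix form (12) translates directly into a few lines of linear algebra. The sketch below is again illustrative: the explicit expression for G_ij is obtained by differentiating (2)-(3) with respect to ||y_i - y_j||^2, which the text leaves implicit, and G is symmetrized before forming D so that the matrix product reproduces the pairwise sum (7) exactly.

```python
import numpy as np

def pda_gradient(W, X, labels, sigma):
    """Gradient (12) of the unregularized objective, assembled with matrix products.

    The explicit form of G_ij below follows from differentiating (2)-(3);
    it is our own derivation, not spelled out in the text. G is symmetrized
    so that 2 X (D - G) X^T equals the pairwise sum (7)-(11) exactly; any
    constant scale factor is absorbed by the step size t of the update (14).
    """
    Y = W.T @ X
    sq = np.sum(Y**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)
    E = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(E, 0.0)
    same = (labels[:, None] == labels[None, :]).astype(float)
    A = np.sum(E * same, axis=1)                       # sum_{j: c_j = c_i} e_ij
    B = np.sum(E, axis=1)                              # sum_j e_ij
    G = -(E / (2.0 * sigma**2)) * (same / A[:, None] - 1.0 / B[:, None])
    G = 0.5 * (G + G.T)                                # symmetrize
    D = np.diag(G.sum(axis=1))                         # D_ii = sum_j G_ij
    return 2.0 * X @ (D - G) @ X.T @ W                 # eq. (12)
```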

2.3 Geodesic flows on the Stiefel manifold

Orthonormality of the transformation matrix is preferred in feature extraction because it forces the matrix to encode the intrinsic subspace in the most economical way. The orthonormality constraint also prevents the learning algorithm from falling into some trivial local minima. In addition, an orthonormal matrix as the learning result makes it convenient to compare the new method with many existing projective methods used in face recognition.

The set of m × r real orthonormal matrices forms a Stiefel manifold St(m, r). Given the gradient \nabla at W, it has been shown [10] that the natural gradient in such a manifold is given by

\mathrm{grad}^{St(m,r)}_{W} J = \nabla - W \nabla^T W,    (13)

and an approximated geodesic learning flow with the starting point W by

W_{\text{new}} = \mathrm{expm}\left( t \left( \nabla W^T - W \nabla^T \right) \right) W,    (14)

where expm represents the matrix exponential and t is a usually small positive step size.
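A possible implementation of the update (14), assuming SciPy's matrix exponential; the function name and signature are ours.

```python
import numpy as np
from scipy.linalg import expm

def geodesic_step(W, grad, t):
    """One approximated geodesic step (14) on the Stiefel manifold.

    W    : (m, r) orthonormal matrix
    grad : (m, r) Euclidean gradient of the objective at W
    t    : small positive step size
    """
    A = grad @ W.T - W @ grad.T      # skew-symmetric generator in (14)
    return expm(t * A) @ W           # W_new = expm(t (grad W^T - W grad^T)) W
```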

2.4 Regularization

An orthonormal matrix has (m − r)r + r(r − 1)/2 free parameters [11]. If this number is comparable to or larger than the number of samples n, the discriminant analysis problem probably becomes ill-posed. Unfortunately this is the case in face recognition, especially when the facial images are sampled at high resolutions. For example, with m = 1024 and r = 10, as in our FERET experiments, W has 10,185 free parameters, far more than the 1,208 training samples. The learning objective must therefore be regularized.

However, the simple L_2-norm used in e.g. Support Vector Machines is not suitable for penalization here, because summing the squared entries of an m × r orthonormal matrix results in the constant r. One thus has to use some other regularization technique.

Notice that each column of W acts as a linear filter and can be displayed as a filter image. A crucial observation we have made is that many overfitted projection matrices have highly rough filter images. That is, local contrastive pixel groups dominate the filters, but they are too small to represent any relevant patterns for face recognition. This motivates us to adopt a penalization term Tr(W^T Ω W) [12] to emphasize the smoothness prior of images, where the constant matrix Ω is constructed by

\Omega_{st} = \mathcal{N}(d(s, t); \rho).    (15)

Here d(s, t) is the 2-D Euclidean distance between the pixel locations s and t, and \mathcal{N} is the zero-mean normal distribution. The variance parameter ρ controls the neighborhood size and its value depends on the resolution of the facial images used. We find that ρ ∈ (0.3, 0.8) works fine in our experiments with 32 × 32- and 23 × 28-sized images. It is not difficult to see that Tr(W^T Ω W) is an approximated version of the Laplacian penalty used in [12].

By attaching the regularization term, we define the objective of Parzen Discriminant Analysis (PDA) to be the maximum of

J_{\text{PDA}}(W) = \frac{1}{2} \sum_{i=1}^{n} \log \frac{\sum_{j: c_j = c_i} e_{ij}}{\sum_{j=1}^{n} e_{ij}} - \frac{\lambda}{2} \mathrm{Tr}\left( W^T \Omega W \right),    (16)

where λ is a positive parameter that controls the balance between discrimination and smoothness. The optimization of PDA is based on the gradient

\tilde{\nabla} = X (D - G) X^T W - \lambda \Omega W.    (17)

In the following experiments, we use the approximated geodesic update

W_{\text{new}} = \mathrm{expm}\left( t \left( \tilde{\nabla} W^T - W \tilde{\nabla}^T \right) \right) W.    (18)
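The sketch below builds Ω as in (15) and assembles the regularized gradient (17), reusing pda_gradient and geodesic_step from the earlier sketches. The row-major pixel ordering and the omission of the Gaussian normalization constant are our assumptions; the latter only rescales λ.

```python
import numpy as np

def smoothness_matrix(h, w, rho):
    """Build Omega from (15) for an h x w image laid out in row-major order.

    Omega_st = N(d(s, t); rho) with d the 2-D Euclidean distance between
    pixel locations s and t. The normalization constant of the Gaussian is
    dropped; it only rescales lambda.
    """
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    diff = coords[:, None, :] - coords[None, :, :]
    d2 = np.sum(diff**2, axis=2)              # squared pixel distances
    return np.exp(-d2 / (2.0 * rho))          # rho acts as the variance parameter

def regularized_gradient(W, X, labels, sigma, Omega, lam):
    """Gradient (17) of the PDA objective (16)."""
    grad = 0.5 * pda_gradient(W, X, labels, sigma)   # X (D - G) X^T W
    return grad - lam * Omega @ W

# One training iteration of (18), combining the pieces above:
#   Omega = smoothness_matrix(32, 32, rho=0.5)
#   W = geodesic_step(W, regularized_gradient(W, X, labels, sigma, Omega, lam), t)
```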

We name our new method Parzenface when the Parzen Discriminant Analysis is applied to the face recognition problem. The term Parzenface also refers to the filter images formed by the columns of the learned projection matrix, in analogy to Eigenfaces and Fisherfaces.

3 Related Work

Fisherface is a combined method which applies Fisher's Linear Discriminant Analysis (LDA) to the results of Principal Component Analysis (PCA). Fisherface and its variants are attractive because they have closed-form solutions which can be obtained by (generalized) singular value decomposition. However, these methods model each subject class by a single Gaussian, which heavily restricts their generalization in the presence of different facial expressions, face poses and illumination conditions. In fact, these structural variabilities cause a subject class to stretch into a curved, non-Gaussian manifold.

Recently some unsupervised methods such as Laplacianfaces [13] have been proposed to unfold the structures of the face manifolds. Although it was reported that they have better recognition accuracy in some cases, the discriminative performance of these methods is naturally limited because they omit the supervised information.

There exist two gradient-based approaches that are closely related to our PDA method. Neighborhood Component Analysis (NCA) [3] learns a transformation matrix (not necessarily orthonormal) to maximize the leave-one-out (loo) performance of nearest neighbor classification. NCA measures the performance based on “soft” neighbor assignments in the transformed space, which is similar to the PDA objective except that the logarithm function is dropped. However, without the logarithm function, NCA lacks the connection to information theory. By contrast, PDA conforms to the general assumption that the samples are independently and identically distributed (i.i.d.). Here the “independence” refers to the predictive version

p(\{c_i\}_{i=1}^{n} \mid \{y_i\}_{i=1}^{n}) = \prod_{i=1}^{n} p(c_i \mid y_i),    (19)

of which the maximization is equivalent to that of the unregularized PDA objective (1). In addition, the loss of orthogonality may cause NCA to fall into trivial local optima, for example, all columns of the transformation matrix converging to the same vector.

Informative Discriminant Analysis (IDA) has the same objective as the unregularized one in Section 2.1. Peltonen and Kaski have shown that such a predictive likelihood has an asymptotic connection to the mutual information criterion and the learning metrics [4]. It has however been shown that the unregularized objective is prone to overfitting in high-dimensional cases [14], as will also be illustrated by the experimental results in the next section. A significant drawback of IDA and NCA is their slow implementation of the gradient computation ((4) in [4] and (5) in [3]). The O(n^2 m^2) running time restricts IDA and NCA to small-scale databases, which makes them infeasible for face recognition. Another expensive computation in IDA is induced by the reparameterization by Givens rotations, which involves trigonometric functions and further complicates the gradient computation. In contrast, Parzenface computes the gradient fast by matrix multiplications and performs the updates within the Stiefel manifold.

  

Fig. 1. Sample images from the FERET (top) and ORL (bottom) databases.

4 Experiments

4.1 Data

We have compared PDA and five other methods on two databases of facial images. The first data set contains facial images collected under the FERET program [6]. 2409 frontal facial images (poses “fa” and “fb”) of 867 subjects were stored in the database after face segmentation. In this work we obtained the coordinates of the eyes from the ground truth data of the collection, with which we calibrated the head rotation so that all faces are upright. Afterwards, all face boxes were normalized to the size of 32×32, with fixed locations for the left eye (26,9) and the right eye (7,9). The second data set comes from the ORL database [7], which includes 400 facial images of 40 subjects. There are 10 images taken in various poses and expressions for each subject. We resize the ORL images to the size of 23×28 without further normalization. Example images from FERET and ORL are displayed in Figure 1. We divide the images of each subject into two parts of equal size, the first half for training and the rest for testing. The whole training set is the union of the training parts of all subjects, and likewise for the whole testing set.

4.2 Training time

A major advantage of PDA (18) over its close cousins NCA [3] and IDA [4] is that PDA requires much less training time. We demonstrate this by running the compared algorithms on a Linux machine with 12GB RAM and two 64-bit 2.2GHz AMD Opteron processors.

We set the number of iterations for PDA and NCA to 10, and to 10 × n for IDA, since IDA employs online learning based on stochastic gradients. In this way all algorithms go through the same number of training samples. We repeated such training ten times and recorded the total time used in Table 1. It is easy to see that PDA significantly outperforms NCA and IDA in efficiency. PDA requires about 1/22 of the training time of NCA and 1/25 of that of IDA. The advantage is more obvious for the larger-scale FERET database, where PDA is almost 84 times and 100 times faster than NCA and IDA, respectively.

Table 1. Training time of PDA, IDA and NCA on the facial image databases (in seconds).

database                               PDA      IDA      NCA
FERET (n = 1208, m = 1024, r = 10)   3,733  362,412  313,811
ORL   (n = 200,  m = 644,  r = 10)   1,502   40,136   32,914

4.3 Visualizing the filter images

It is intuitive to inspect the elements of the trained projection matrix W before applying it to the FR problem; we expect to find some semantic connections between the filter images and our common prior knowledge about facial images.

Figure 2 shows the first ten filter images of the compared methods, where the top two methods are unsupervised and the rest supervised. We only plot the ORL results due to the space limit. The filter images of IDA contain almost random pixels and show no pattern related to faces. This is probably because IDA starts from the LDA projection matrix, but the latter suffers from data scarcity in face recognition. Fisherface and Laplacianface are better than IDA since one can slightly perceive some contrastive parts around or within the head-like boundary. These parts are however too small and scattered all over every filter image, which might cause overfitting of the projection matrix, e.g. sensitivity to small shifts and variations. The contrastive parts of the NCA basis mainly lie around the head, but these filters differ only in some tiny regions. This is probably caused by the removal of the orthogonality constraint in NCA. By contrast, Parzenface yields filter images that contain clearer facial semantics and are hence easier to interpret. For example, the fourth filter image is likely related to the beard feature and the fifth may control the head shape. The filter images of Eigenface also comprise some facial parts like eyes and chins, although these are more blurred and may lead to underfitting for face recognition.
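Since each column of W is a linear filter over pixels, it can be reshaped back to the image grid for display. A small matplotlib sketch, assuming the same row-major pixel ordering as in the earlier sketches; names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_filter_images(W, h, w, n_filters=10):
    """Display the first n_filters columns of W as h x w filter images."""
    fig, axes = plt.subplots(1, n_filters, figsize=(2 * n_filters, 2))
    for k, ax in enumerate(axes):
        ax.imshow(W[:, k].reshape(h, w), cmap="gray")  # column k as an image
        ax.set_axis_off()
    plt.show()
```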

4.4 Face recognition accuracies

Classification of the testing faces is performed in the projected space using the nearest neighbor classifier. The face recognition accuracies with r ranging from 10 to 70 are shown in Figure 3. Since the maximum output dimensionality of Fisherface is the number of classes minus one, i.e. 39 for the ORL database, we set a tick at 39 on the x-axis of the right plot for better comparison.
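For reference, a minimal sketch of this classification step: project the training and testing samples with the learned W and assign each test face the label of its nearest training neighbor. Names are ours.

```python
import numpy as np

def nn_classify(W, X_train, y_train, X_test):
    """1-nearest-neighbor classification in the projected space W^T x."""
    Ytr = W.T @ X_train                       # (r, n_train)
    Yte = W.T @ X_test                        # (r, n_test)
    d2 = (np.sum(Yte**2, axis=0)[:, None]
          + np.sum(Ytr**2, axis=0)[None, :]
          - 2.0 * Yte.T @ Ytr)                # pairwise squared distances
    return y_train[np.argmin(d2, axis=1)]     # label of the nearest training face
```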

We found that NCA heavily suffers from the overfitting problem. Although it can achieve excellent classification accuracies on the training set, the NCA transformation matrix generalizes poorly to the testing data. Laplacianface performs the second worst. This is probably because it requires a large amount of data to build a reliable graph of locality, which is not available in our experiments. This drawback is more severe for the ORL database, which contains facial images of different poses. The performance order of Eigenface, Fisherface and IDA depends on the database used. Fisherface is the best among these three for FERET while Eigenface is the best one for ORL.

Fig. 2. The first ten filter images of the ORL database using (from top to bottom) Eigenface, Laplacianface, Fisherface, IDA and Parzenface.

The Parzenface learning can start from any orthonormal matrix. We set the initial matrix to be the one produced by its best opponent, i.e. Fisherface. The parameters were obtained by cross-validation using the training set. The face recognition accuracies were then calculated by applying the trained Parzenface model to the testing set. From Figure 3 we can see that the face recognition accuracies using Parzenfaces are superior to those of all the other compared methods.

5 Conclusions

We have presented a new discriminant analysis method and applied it to the face recognition problem. The proposed Parzenface method overcomes two major drawbacks of existing gradient-based discriminant analysis methods that use information theory. Firstly, the computation of the gradient can be greatly accelerated by using matrix multiplication instead of going through all the pairwise differences. Secondly, we have proposed to employ a smoothness constraint on the images to regularize the face recognition problem. The empirical study on two popular facial image databases shows that Parzenface requires much less training time than the IDA and NCA methods while achieving higher face recognition accuracies than the other compared methods.

In this paper we focused on feature extraction by linear projections, but our method can readily be generalized to a nonlinear version by using the kernel extension. Moreover, the optimization is not restricted to gradient ascent flows. The convergence could be further improved by employing more advanced optimization techniques.

Fig. 3. Face recognition accuracies (%) as a function of the number of dimensions for FERET (left) and ORL (right), comparing Eigenface, Laplacianface, Fisherface, IDA, NCA and Parzenface.

References

1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7) (1997) 711–720
2. Howland, P., Wang, J., Park, H.: Solving the small sample size problem in face recognition using generalized discriminant analysis. Pattern Recognition 39(2) (2006) 277–287
3. Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighbourhood components analysis. Advances in Neural Information Processing Systems 17 (2005) 513–520
4. Peltonen, J., Kaski, S.: Discriminative components of data. IEEE Transactions on Neural Networks 16(1) (2005) 68–83
5. Peltonen, J., Goldberger, J., Kaski, S.: Fast discriminative component analysis for comparing examples. In: NIPS. (2006)
6. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(10) (2000) 1090–1104
7. Samaria, F., Harter, A.: Parameterisation of a stochastic model for human face identification. In: Proceedings of the Second IEEE Workshop on Applications of Computer Vision. (1994) 138–142
8. Horn, R., Johnson, C.: Topics in Matrix Analysis. Cambridge University Press (1994)
9. Robinson, S.: Toward an optimal algorithm for matrix multiplication. SIAM News 38(9) (2005)
10. Nishimori, Y., Akaho, S.: Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing 67 (2005) 106–135
11. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2) (1998) 303–353
12. Hastie, T., Buja, A., Tibshirani, R.: Penalized discriminant analysis. The Annals of Statistics 23(1) (1995) 73–102
13. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.J.: Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(3) (2005) 328–340
14. Yang, Z., Laaksonen, J.: Regularized neighborhood component analysis. In: Proceedings of the 15th Scandinavian Conference on Image Analysis (SCIA), Aalborg, Denmark (2007) 253–262