Adaptive Multiplicative Updates for Projective Nonnegative Matrix Factorization

  


  He Zhang, Zhirong Yang, and Erkki Oja

  

Department of Information and Computer Science

Aalto University School of Science, Espoo, Finland

{he.zhang,zhirong.yang,erkki.oja}@aalto.fi

Abstract.

Projective Nonnegative Matrix Factorization (PNMF) is able to extract sparse features and provides a good approximation for discrete problems such as clustering. However, the original PNMF optimization algorithm cannot guarantee theoretical convergence during the iterative learning. We propose here an adaptive multiplicative algorithm for PNMF which is not only theoretically convergent but also significantly faster than the previous implementation. Instead of the old unitary exponent, an adaptive exponent scheme is adopted, which both ensures theoretical convergence and accelerates the optimization. We provide new multiplicative update rules for PNMF based on the squared Euclidean distance and the I-divergence. For the empirical contributions, we first give a counterexample to the monotonicity of the original PNMF algorithm, and then verify our proposed method by experiments on a variety of real-world data sets.

Keywords: Adaptive, multiplicative updates, PNMF, NMF

1 Introduction

Recently, Nonnegative Matrix Factorization (NMF) has attracted much research effort and has been applied to many different fields such as face recognition, document clustering, gene expression studies, and music analysis [7, 1, 3]. The research stream originates from the work by Lee and Seung [6], who showed that the nonnegativity constraint and the related multiplicative update rules can generate part-based representations of the data. However, the sparseness achieved by NMF is only mediocre. Many NMF variants (e.g. [4, 5]) have addressed this problem, but their solutions often require extra user-specified parameters to achieve sparser results, which is inconvenient in practice.

Projective Nonnegative Matrix Factorization (PNMF) [10, 9], a recent variant of NMF, has shown advantages over NMF in learning a sparse or orthogonal factorizing matrix, which is desirable in both feature extraction and clustering.

  

(Footnote: Supported by the Academy of Finland in the project Finnish Center of Excellence.)

Typically, PNMF follows the NMF optimization approach by using multiplicative updates. However, the original PNMF algorithm does not guarantee a monotonic decrease of the dissimilarity between the input matrix and its approximation after each learning iteration.

We propose new multiplicative algorithms for PNMF in this paper. The convergence problem of the original PNMF update rules is caused by the restriction that the exponent in the update rule must be unitary (i.e., one). Dropping this restriction, we obtain theoretically convergent update rules without extra normalization steps. The multiplicative updates are further relaxed by allowing variable exponents in different iterations, which turns out to be an effective strategy for accelerating the optimization. The failure of the original PNMF algorithm is demonstrated by a counterexample in which the monotonicity of the objective evolution is violated. By contrast, our new method steadily minimizes the objective and converges significantly faster than the old algorithm.

In the remainder of the paper, Section 2 recapitulates the PNMF objectives and their previous optimization methods. Section 3 presents the new convergent multiplicative update rules and the fast PNMF algorithm using adaptive exponents. In Section 4, we empirically compare the proposed methods on a variety of data sets, and Section 5 concludes the paper.

2 Projective Nonnegative Matrix Factorization

Given a nonnegative data matrix X ∈ R_+^{m×n}, Projective Nonnegative Matrix Factorization (PNMF) seeks a decomposition of X of the form X ≈ WW^T X, where W ∈ R_+^{m×r} with rank r < min(m, n). Compared with the NMF approximating scheme X ≈ WH, PNMF replaces H with W^T X. This replacement has been shown to have positive consequences in terms of sparseness of the approximation, orthogonality of the factorizing matrix, close equivalence to clustering, generalization of the approximation to new data without heavy re-computations, and easy extension to a nonlinear kernel method [9].
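As an informal illustration (not part of the original paper), the following NumPy sketch shows the projective approximation X ≈ WW^T X and how a previously unseen sample is approximated with the same W, without re-estimating a second factor; all variable names and sizes are ours.

  import numpy as np

  # Illustrative sizes and data; W stands in for a basis learned by PNMF.
  rng = np.random.default_rng(0)
  m, n, r = 100, 60, 5
  X = rng.random((m, n))          # nonnegative data matrix, m x n
  W = rng.random((m, r))          # nonnegative basis, m x r

  X_hat = W @ (W.T @ X)           # projective approximation X ~ W W^T X

  # Unlike X ~ W H, there is no second factor to re-estimate: a new sample
  # x (length m) is approximated with the already-learned W alone.
  x_new = rng.random(m)
  x_new_hat = W @ (W.T @ x_new)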

Let X̂ = WW^T X denote the approximating matrix. The approximation can be achieved by minimizing two widely used objectives: (i) the squared Euclidean distance (Frobenius norm), defined as

  D_{EU}(X \| \widehat{X}) = \sum_{ij} \big( X_{ij} - \widehat{X}_{ij} \big)^2,

and (ii) the non-normalized Kullback-Leibler divergence (I-divergence), defined as

  D_{I}(X \| \widehat{X}) = \sum_{ij} \Big( X_{ij} \log \frac{X_{ij}}{\widehat{X}_{ij}} - X_{ij} + \widehat{X}_{ij} \Big).

Note that PNMF is also called Clustering NMF, which was later proposed by Ding et al. in [2].

Denote Z_{ij} = X_{ij} / X̂_{ij}, and let 1_m be a column vector of length m filled with ones. To minimize the above objectives, the authors in [10, 9] employed the following multiplicative update algorithm:

  W'_{ik} = W_{ik} \frac{(AW)_{ik}}{(BW)_{ik}},   (1)

where for the Euclidean case

  A = 2 X X^T  \quad\text{and}\quad  B = W W^T X X^T + X X^T W W^T,   (2)

and for the I-divergence case

  A = Z X^T + X Z^T  \quad\text{and}\quad  B = 1_m 1_n^T X^T + X 1_n 1_m^T.   (3)

Note that the update rule (1) by itself does not necessarily decrease the objective in each iteration and must therefore be accompanied by a normalization or stabilization step, i.e.,

  W^{new} = W' / \| W' \|,   (4)

where \|W'\| equals the square root of the maximal eigenvalue of W'^T W'. Although the algorithm using the update rules (1) and (4) usually works in practice, a theoretical proof of its convergence is still lacking. In Section 4.1 we even provide a counterexample to these rules for the I-divergence.
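For concreteness, a minimal NumPy sketch of the original updates (1)-(4) is given below. The function and variable names are ours, and a small eps is added to the denominators to avoid division by zero, which the equations above leave implicit.

  import numpy as np

  def pnmf_original_step(X, W, cost="eu", eps=1e-12):
      """One iteration of the original PNMF update, Eqs. (1)-(4)."""
      if cost == "eu":                              # squared Euclidean distance
          XXt = X @ X.T
          A = 2.0 * XXt                             # Eq. (2)
          B = W @ (W.T @ XXt) + XXt @ (W @ W.T)
      else:                                         # I-divergence
          Z = X / (W @ (W.T @ X) + eps)             # Z_ij = X_ij / X_hat_ij
          A = Z @ X.T + X @ Z.T                     # Eq. (3)
          s = X.sum(axis=1)                         # X 1_n
          B = np.outer(np.ones(len(s)), s) + np.outer(s, np.ones(len(s)))
      W_new = W * (A @ W) / (B @ W + eps)           # Eq. (1)
      return W_new / np.linalg.norm(W_new, 2)       # Eq. (4): spectral norm of W'

  def d_eu(X, W):
      X_hat = W @ (W.T @ X)
      return np.sum((X - X_hat) ** 2)

  def d_i(X, W, eps=1e-12):
      X_hat = W @ (W.T @ X)
      return np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat)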

3 Adaptive PNMF

The derivation of the update rule (1) follows a heuristic principle that puts the unsigned negative terms of the gradient in the numerator and the remaining terms in the denominator of the multiplying factor of W. Update rules obtained by this principle may not decrease the objective at each iteration [9], because the exponent of the multiplying factor is restricted to one. Discarding this restriction, we can obtain theoretically convergent multiplicative update rules in the relaxed form

  W^{new}_{ik} = W_{ik} \left( \frac{(AW)_{ik}}{(BW)_{ik}} \right)^{\eta},   (5)

where η ∈ R_+, and the convergence is guaranteed by the following theorem.

Theorem 1. The multiplicative update (5) monotonically decreases D_{EU}(X ‖ X̂) with η = 1/3, and monotonically decreases D_{I}(X ‖ X̂) with η = 1/2.

The proof follows as special cases of the Majorization-Minimization development procedure in [8]. To keep the paper self-contained, we include a proof sketch in the Appendix.

The multiplicative algorithm using the update rule (5) avoids unwanted increases of the objective and thus assures theoretical convergence of the iterative learning. However, keeping the exponent η constant throughout the iterations is often conservative in practice. Here we propose to accelerate the learning by using a more aggressive choice of the exponent, which adaptively changes during the iterations.

A simple strategy is to increase the exponent steadily as long as the new objective is smaller than the old one, and otherwise to shrink back to the safe choice η. The pseudo-code of this implementation is given in Algorithm 1, where D(X ‖ X̂), A, B, and η are defined according to the type of cost function (Euclidean distance or I-divergence). We have empirically used µ = 0.1 in all related experiments in this work. Although more comprehensive adaptation approaches could be applied, we find that such a simple strategy can already significantly speed up the optimization.

Algorithm 1 Multiplicative Updates with Adaptive Exponent for PNMF
Usage: W ← FastPNMF(X, η, µ).
  Initialize W; ρ ← η.
  repeat
    U_{ik} ← W_{ik} ( (AW)_{ik} / (BW)_{ik} )^{ρ}
    if D(X ‖ UU^T X) < D(X ‖ WW^T X) then
      W ← U
      ρ ← ρ + µ
    else
      ρ ← η
    end if
  until the convergence conditions are satisfied
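A possible NumPy implementation of Algorithm 1 is sketched below. It reuses the definitions of A, B and the objectives from Section 2; the rank r, the random initialization, the fixed iteration budget, and the eps guard are our additions and not part of the pseudo-code above.

  import numpy as np

  def fast_pnmf(X, r, cost="eu", eta=None, mu=0.1, n_iter=200, eps=1e-12, seed=0):
      """Adaptive multiplicative PNMF, a sketch of Algorithm 1."""
      if eta is None:
          eta = 1.0 / 3.0 if cost == "eu" else 0.5  # safe exponents from Theorem 1
      rng = np.random.default_rng(seed)
      m = X.shape[0]
      W = rng.random((m, r)) + eps                  # random nonnegative start
      rho = eta
      XXt = X @ X.T if cost == "eu" else None
      s = X.sum(axis=1)                             # X 1_n, used for B (I-divergence)

      def objective(V):
          X_hat = V @ (V.T @ X)
          if cost == "eu":
              return np.sum((X - X_hat) ** 2)
          return np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat)

      def aw_bw(V):                                 # numerator (AW) and denominator (BW)
          if cost == "eu":
              A = 2.0 * XXt
              B = V @ (V.T @ XXt) + XXt @ (V @ V.T)
          else:
              Z = X / (V @ (V.T @ X) + eps)
              A = Z @ X.T + X @ Z.T
              B = np.outer(np.ones(m), s) + np.outer(s, np.ones(m))
          return A @ V, B @ V

      for _ in range(n_iter):
          num, den = aw_bw(W)
          U = W * (num / (den + eps)) ** rho        # relaxed update, Eq. (5)
          if objective(U) < objective(W):
              W, rho = U, rho + mu                  # accept and grow the exponent
          else:
              rho = eta                             # reject and fall back to eta
      return W

For instance, fast_pnmf(X, r=10, cost="i") would run the I-divergence variant with the safe fallback exponent η = 1/2 from Theorem 1.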

4 Experiments

We have selected a variety of data sets that are commonly used in machine learning for our experiments. These data sets were obtained from the UCI repository (http://archive.ics.uci.edu/ml/), the University of Florida Sparse Matrix Collection (http://www.cise.ufl.edu/research/sparse/matrices/index.html), and the LSI text corpora, as well as other publicly available websites. The statistics of the data sets are summarized in Table 1.

Table 1. The data sets used in the experiments (m = dimensions, n = # of samples).

  data set   GD95_b  wine  sonar  mfeat  orl    feret  worldcities  swimmer  cisi  cran  med
  m          40      13    60     292    400    1024   313          256      1460  1398  1033
  n          69      178   208    2000   10304  2409   100          1024     5609  4612  5831

For the empirical comparison, we consider three methods: (i) PNMFn, i.e., the original PNMF algorithm using the multiplicative update rule (1) and the normalization step (4); (ii) PNMFc, i.e., the convergent multiplicative PNMF algorithm (5) using a constant exponent according to Theorem 1; and (iii) PNMFa, i.e., the fast adaptive PNMF algorithm using adaptive exponents (Algorithm 1).

4.1 A Counterexample of Using Extra Normalization

Figure 1 shows a counterexample for the original PNMF algorithm for the I-divergence using Eqs. (1), (3), and (4). We have used the GD95_b data set in this experiment. It can be seen that the monotonicity of the objective evolution is violated in every other loop from the 19th iteration onwards, and the optimization then gets stuck in an endless fluctuation without a decreasing trend.

[Figure 1 omitted: plot of the I-divergence (y-axis) versus iterations (x-axis) for the original PNMF algorithm on GD95_b.]

Fig. 1. A counterexample showing that the original PNMF algorithm with normalization does not monotonically decrease the I-divergence for the GD95_b data set.

4.2 Training Time Comparison

Figure 2 shows the objective evolution curves of the compared methods. One can see that the objectives of the proposed methods, PNMFc and PNMFa, decrease monotonically throughout the iterative learning process, without any unexpected increases. Furthermore, PNMFa generally converges the fastest, as its curves lie below the other two in all plots.

In addition to this qualitative analysis, we have also compared the converged times of the three methods. Table 2 summarizes the means and standard deviations of the resulting converged times.

Table 2. The mean (µ) and standard deviation (σ) of the converged time (seconds), reported as µ±σ.

(a) Criterion: the squared Euclidean distance

  method   wine        sonar       mfeat        orl           feret
  PNMFn    0.97±0.03   0.97±0.01   26.14±1.54   40.37±1.03    30.80±7.34
  PNMFc    0.22±0.11   0.22±0.11   68.57±1.75   117.26±1.74   107.58±24.43
  PNMFa    0.06±0.03   0.06±0.03   19.10±0.70   29.89±1.48    19.97±5.60

(b) Criterion: the I-divergence

  method   worldcities   swimmer        cisi           cran           med
  PNMFn    8.35±3.71     309.78±8.78    478.43±43.51   438.98±41.71   321.94±34.90
  PNMFc    14.07±2.98    613.04±20.63   863.89±69.23   809.61±62.64   566.99±64.44
  PNMFa    4.68±1.44     193.47±5.43    193.23±18.70   189.41±18.50   132.67±13.86

The converged time is calculated as follows. We first find the earliest iteration of PNMFn at which the objective D_n is sufficiently close to its minimum D^*, i.e., |D_n − D^*| / D^* < 0.001. The corresponding time is recorded as the converged time of PNMFn. For the PNMFc evolution, the converged time is that of the first iteration at which the objective D_c fulfills |D_c − D^*| / D^* < 0.001. If no such iteration exists, the converged time of PNMFc is set to the largest learning time of the three methods. The same procedure is applied to PNMFa. Each algorithm has been run on each data set 100 times with different random seeds for initialization.
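Assuming each run records a trace of (time, objective) pairs, the criterion above can be computed as in the following sketch; the function name and signature are ours.

  import numpy as np

  def converged_time(times, objectives, d_star, tol=1e-3, fallback=None):
      """Time of the earliest iteration whose objective is within tol of d_star."""
      times = np.asarray(times, dtype=float)
      objectives = np.asarray(objectives, dtype=float)
      close = np.abs(objectives - d_star) / d_star < tol
      if close.any():
          return times[np.argmax(close)]   # argmax returns the first True index
      return fallback                      # e.g. the largest learning time of the three methods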

[Figure 2 omitted: objective versus time (seconds) for PNMFn, PNMFc, and PNMFa on wine, mfeat, orl, and feret (squared Frobenius norm, left column) and on swimmer, cisi, cran, and med (I-divergence, right column).]

Fig. 2. Evolutions of the objectives of the compared methods based on (left) the squared Euclidean distance and (right) the I-divergence.

From Table 2, PNMFa is the fastest among the three compared methods: it is 1.5 to 2 times faster than PNMFn and 3 to 5 times faster than PNMFc. The advantage over PNMFn is even more significant for the two small data sets wine and sonar.

5 Conclusions

We have proposed a fast multiplicative algorithm for Projective Nonnegative Matrix Factorization (PNMF). It is based on two relaxations of the conventional multiplicative update rules. Firstly, relaxing the exponent of the multiplying factor from one to any positive number leads to theoretically convergent update rules without extra normalization. Secondly, a further relaxation that allows a variable exponent accelerates the iterative learning. Empirical results show that the proposed algorithm not only monotonically decreases the dissimilarity objective but also converges significantly faster than the previous implementation.

The accelerated algorithm facilitates applications of the PNMF method. More large-scale data sets will be tested in the future. Moreover, the proposed adaptive exponent technique readily extends to other fixed-point algorithms that use multiplicative updates.

  A Appendix: Proof of Theorem 1

A.1 The Euclidean distance case

We rewrite the squared Euclidean distance as

  D_{EU}(X \| \widetilde{W}\widetilde{W}^T X) = -2\,\mathrm{Tr}\big( \widetilde{W}^T X X^T \widetilde{W} \big) + \sum_{ij} \big( \widetilde{W}\widetilde{W}^T X \big)_{ij}^2 + \text{constant}.   (6)

The first term on the right is upper-bounded by its linear expansion at the current estimate W,

  -2\,\mathrm{Tr}\big( \widetilde{W}^T X X^T \widetilde{W} \big) \le -4 \sum_{ik} \big( X X^T W \big)_{ik} \widetilde{W}_{ik} + \text{constant},   (7)

because it is concave with respect to \widetilde{W}. Next, let \lambda_{ijak} = W_{ik} W_{ak} X_{aj} / (W W^T X)_{ij}. The second term can be upper-bounded by using Jensen's inequality as follows:

  \sum_{ij} \big( \widetilde{W}\widetilde{W}^T X \big)_{ij}^2 \le \sum_{ik} \big( W W^T X X^T W + X X^T W W^T W \big)_{ik} \frac{\widetilde{W}_{ik}^4}{2 W_{ik}^3}.   (8)

We can then construct the auxiliary function

  G(\widetilde{W}, W) = -2\,\mathrm{Tr}\big( \widetilde{W}^T A W \big) + \sum_{ik} (B W)_{ik} \frac{\widetilde{W}_{ik}^4}{2 W_{ik}^3} + \text{constant},   (9)

which upper-bounds D_{EU}(X \| \widetilde{W}\widetilde{W}^T X), with A and B defined in Eq. (2). Minimizing G(\widetilde{W}, W) is implemented by setting its gradient with respect to \widetilde{W} to zero, which yields the update rule (5) with η = 1/3.

A.2 The I-divergence case

We rewrite the I-divergence as

  D_{I}(X \| \widetilde{W}\widetilde{W}^T X) = \sum_{ij} \Big( - X_{ij} \log \big( \widetilde{W}\widetilde{W}^T X \big)_{ij} + \big( \widetilde{W}\widetilde{W}^T X \big)_{ij} \Big) + \text{constant}.   (10)

The first term is upper-bounded using Jensen's inequality:

  -\sum_{ij} X_{ij} \log \big( \widetilde{W}\widetilde{W}^T X \big)_{ij} \le -\frac{1}{2} \sum_{aik} A_{ai} W_{ik} W_{ak} \log \big( \widetilde{W}_{ik} \widetilde{W}_{ak} \big) + \text{constant},   (11)

where A is defined in Eq. (3). The second term can be rewritten with B defined in Eq. (3), and its upper bound is obtained from the quadratic bound of [8] with \widetilde{U}_{ik} = \widetilde{W}_{ik} and U_{ik} = W_{ik}:

  \sum_{ij} \big( \widetilde{W}\widetilde{W}^T X \big)_{ij} = \frac{1}{2} \mathrm{Tr}\big( \widetilde{W}^T B \widetilde{W} \big) \le \sum_{ik} (B W)_{ik} \frac{\widetilde{W}_{ik}^2}{2 W_{ik}}.   (12)

We can then construct the auxiliary function

  G(\widetilde{W}, W) = -\frac{1}{2} \sum_{aik} A_{ai} W_{ik} W_{ak} \log \big( \widetilde{W}_{ik} \widetilde{W}_{ak} \big) + \sum_{ik} (B W)_{ik} \frac{\widetilde{W}_{ik}^2}{2 W_{ik}} + \text{constant},   (13)

which upper-bounds D_{I}(X \| \widetilde{W}\widetilde{W}^T X). Setting the gradient of G(\widetilde{W}, W) with respect to \widetilde{W} to zero, we obtain the update rule (5) with η = 1/2.
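As a quick numerical sanity check of Theorem 1 (ours, not part of the paper), the following sketch applies the fixed-exponent update (5) to random nonnegative data and asserts, up to a small floating-point tolerance, that neither objective ever increases.

  import numpy as np

  def pnmf_step(X, W, cost, eta, eps=1e-12):
      """One fixed-exponent update, Eq. (5)."""
      if cost == "eu":
          XXt = X @ X.T
          A, B = 2.0 * XXt, W @ (W.T @ XXt) + XXt @ (W @ W.T)
      else:
          Z = X / (W @ (W.T @ X) + eps)
          s = X.sum(axis=1)
          A = Z @ X.T + X @ Z.T
          B = np.outer(np.ones(len(s)), s) + np.outer(s, np.ones(len(s)))
      return W * ((A @ W) / (B @ W + eps)) ** eta

  def objective(X, W, cost, eps=1e-12):
      X_hat = W @ (W.T @ X)
      if cost == "eu":
          return np.sum((X - X_hat) ** 2)
      return np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat)

  rng = np.random.default_rng(0)
  X = rng.random((50, 40))
  for cost, eta in (("eu", 1.0 / 3.0), ("i", 0.5)):
      W = rng.random((50, 5))
      prev = objective(X, W, cost)
      for _ in range(100):
          W = pnmf_step(X, W, cost, eta)
          curr = objective(X, W, cost)
          assert curr <= prev + 1e-8       # Theorem 1: the objective never increases
          prev = curr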

  References

  

1. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley (2009)
2. Ding, C., Li, T., Jordan, M.: Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), 45–55 (2010)
3. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation 21(3), 793–830 (2009)
4. Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5, 1457–1469 (2004)
5. Kim, H., Park, H.: Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12), 1495–1502 (2007)
6. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
7. Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 267–273 (2003)
8. Yang, Z., Oja, E.: Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Transactions on Neural Networks 22(12), 1878–1891 (2011)
9. Yang, Z., Oja, E.: Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks 21(5), 734–749 (2010)
10. Yuan, Z., Oja, E.: Projective nonnegative matrix factorization for image compression and feature extraction. In: Proc. of the 14th Scandinavian Conference on Image Analysis (SCIA 2005). pp. 333–342. Joensuu, Finland (June 2005)