Adaptive Multiplicative Updates for Projective Nonnegative Matrix Factorization

  


  He Zhang, Zhirong Yang, and Erkki Oja

  

Department of Information and Computer Science

Aalto University School of Science, Espoo, Finland

{he.zhang,zhirong.yang,erkki.oja}@aalto.fi

Abstract.

Projective Nonnegative Matrix Factorization (PNMF) is able to extract sparse features and provides a good approximation for discrete problems such as clustering. However, the original PNMF optimization algorithm cannot guarantee theoretical convergence during the iterative learning. We propose here an adaptive multiplicative algorithm for PNMF which is not only theoretically convergent but also significantly faster than the previous implementation. Instead of the old unitary exponent, an adaptive exponent scheme is adopted, which both ensures theoretical convergence and accelerates the optimization. We provide new multiplicative update rules for PNMF based on the squared Euclidean distance and the I-divergence. For the empirical contributions, we first give a counterexample to the monotonicity of the original PNMF algorithm, and then verify our proposed method by experiments on a variety of real-world data sets.

Keywords: Adaptive, multiplicative updates, PNMF, NMF

1 Introduction

Recently, Nonnegative Matrix Factorization (NMF) has attracted much research effort and has been applied to many different fields such as face recognition, document clustering, gene expression studies, and music analysis [7, 1, 3]. The research stream originates from the work by Lee and Seung [6], who showed that the nonnegativity constraint and the related multiplicative update rules can generate part-based representations of the data. However, the sparseness achieved by NMF is only mediocre. Many NMF variants (e.g. [4, 5]) have addressed this problem, but their solutions often require extra user-specified parameters to achieve sparser results, which is inconvenient in practice.

Projective Nonnegative Matrix Factorization (PNMF) [10, 9], a recent variant of NMF, has shown advantages over NMF in learning a sparse or orthogonal factorizing matrix, which is desirable in both feature extraction and clustering.

  

(Footnote: Supported by the Academy of Finland in the project Finnish Center of Excellence.)

Typically, PNMF follows the NMF optimization approach by using multiplicative updates. However, the original PNMF algorithm does not guarantee a monotonic decrease of the dissimilarity between the input matrix and its approximation after each learning iteration.

We propose new multiplicative algorithms for PNMF in this paper. The convergence problem of the original PNMF update rules is caused by the restriction that the exponent in the update rule must be unitary (i.e., one). Dropping this restriction, we obtain theoretically convergent update rules without extra normalization steps. The multiplicative updates are further relaxed by allowing variable exponents in different iterations, which turns out to be an effective strategy for accelerating the optimization. The failure of the original PNMF algorithm is demonstrated by a counterexample in which the monotonicity of the objective evolution is violated. By contrast, our new method steadily minimizes the objective and converges significantly faster than the old algorithm.

In the remainder of the paper, Section 2 recapitulates the PNMF objectives and their previous optimization methods. Section 3 presents the new convergent multiplicative update rules and the fast PNMF algorithm using adaptive exponents. In Section 4, we empirically compare the proposed methods on a variety of data sets, and Section 5 concludes the paper.

2 Projective Nonnegative Matrix Factorization

Given a nonnegative data matrix X ∈ R_+^{m×n}, Projective Nonnegative Matrix Factorization (PNMF) seeks a decomposition of X of the form X ≈ WW^T X, where W ∈ R_+^{m×r} with rank r < min(m, n). Compared with the NMF approximating scheme X ≈ WH, PNMF replaces H with W^T X. This replacement has been shown to have positive consequences in terms of sparseness of the approximation, orthogonality of the factorizing matrix, close equivalence to clustering, generalization of the approximation to new data without heavy re-computations, and easy extension to a nonlinear kernel method [9].
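As an informal illustration (not part of the original paper), the following NumPy sketch shows the projective approximation X ≈ WW^T X and how a previously unseen sample is approximated with the same W, without re-estimating a second factor; all variable names and sizes are ours.

  import numpy as np

  # Illustrative sizes and data; W stands in for a basis learned by PNMF.
  rng = np.random.default_rng(0)
  m, n, r = 100, 60, 5
  X = rng.random((m, n))          # nonnegative data matrix, m x n
  W = rng.random((m, r))          # nonnegative basis, m x r

  X_hat = W @ (W.T @ X)           # projective approximation X ~ W W^T X

  # Unlike X ~ W H, there is no second factor to re-estimate: a new sample
  # x (length m) is approximated with the already-learned W alone.
  x_new = rng.random(m)
  x_new_hat = W @ (W.T @ x_new)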

Let X̂ = WW^T X denote the approximating matrix. The approximation can be achieved by minimizing two widely used objectives: (i) the squared Euclidean distance (Frobenius norm), defined as

  D_{EU}(X \| \widehat{X}) = \sum_{ij} \big( X_{ij} - \widehat{X}_{ij} \big)^2,

and (ii) the non-normalized Kullback-Leibler divergence (I-divergence), defined as

  D_{I}(X \| \widehat{X}) = \sum_{ij} \Big( X_{ij} \log \frac{X_{ij}}{\widehat{X}_{ij}} - X_{ij} + \widehat{X}_{ij} \Big).

Note that PNMF is also called Clustering NMF, which was later proposed by Ding et al. in [2].

Denote Z_{ij} = X_{ij} / X̂_{ij}, and let 1_m be a column vector of length m filled with ones. To minimize the above objectives, the authors in [10, 9] employed the following multiplicative update algorithm:

  W'_{ik} = W_{ik} \frac{(AW)_{ik}}{(BW)_{ik}},   (1)

where for the Euclidean case

  A = 2 X X^T  \quad\text{and}\quad  B = W W^T X X^T + X X^T W W^T,   (2)

and for the I-divergence case

  A = Z X^T + X Z^T  \quad\text{and}\quad  B = 1_m 1_n^T X^T + X 1_n 1_m^T.   (3)

Note that the update rule (1) by itself does not necessarily decrease the objective in each iteration and must therefore be accompanied by a normalization or stabilization step, i.e.,

  W^{new} = W' / \| W' \|,   (4)

where \|W'\| equals the square root of the maximal eigenvalue of W'^T W'. Although the algorithm using the update rules (1) and (4) usually works in practice, a theoretical proof of its convergence is still lacking. In Section 4.1 we even provide a counterexample to these rules for the I-divergence.
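For concreteness, a minimal NumPy sketch of the original updates (1)-(4) is given below. The function and variable names are ours, and a small eps is added to the denominators to avoid division by zero, which the equations above leave implicit.

  import numpy as np

  def pnmf_original_step(X, W, cost="eu", eps=1e-12):
      """One iteration of the original PNMF update, Eqs. (1)-(4)."""
      if cost == "eu":                              # squared Euclidean distance
          XXt = X @ X.T
          A = 2.0 * XXt                             # Eq. (2)
          B = W @ (W.T @ XXt) + XXt @ (W @ W.T)
      else:                                         # I-divergence
          Z = X / (W @ (W.T @ X) + eps)             # Z_ij = X_ij / X_hat_ij
          A = Z @ X.T + X @ Z.T                     # Eq. (3)
          s = X.sum(axis=1)                         # X 1_n
          B = np.outer(np.ones(len(s)), s) + np.outer(s, np.ones(len(s)))
      W_new = W * (A @ W) / (B @ W + eps)           # Eq. (1)
      return W_new / np.linalg.norm(W_new, 2)       # Eq. (4): spectral norm of W'

  def d_eu(X, W):
      X_hat = W @ (W.T @ X)
      return np.sum((X - X_hat) ** 2)

  def d_i(X, W, eps=1e-12):
      X_hat = W @ (W.T @ X)
      return np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat)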

3 Adaptive PNMF

The derivation of the update rule (1) follows a heuristic principle that puts the unsigned negative terms of the gradient in the numerator and the remaining terms in the denominator of the multiplying factor of W. Update rules obtained by this principle may not decrease the objective at each iteration [9], because the exponent of the multiplying factor is restricted to one. Discarding this restriction, we can obtain theoretically convergent multiplicative update rules in the relaxed form

  W^{new}_{ik} = W_{ik} \left( \frac{(AW)_{ik}}{(BW)_{ik}} \right)^{\eta},   (5)

where η ∈ R_+, and the convergence is guaranteed by the following theorem.

Theorem 1. The multiplicative update (5) monotonically decreases D_{EU}(X ‖ X̂) with η = 1/3, and monotonically decreases D_{I}(X ‖ X̂) with η = 1/2.

The proof follows as special cases of the Majorization-Minimization development procedure in [8]. To keep the paper self-contained, we include a proof sketch in the Appendix.

The multiplicative algorithm using the update rule (5) avoids unwanted increases of the objective and thus assures theoretical convergence of the iterative learning. However, keeping the exponent η constant throughout the iterations is often conservative in practice. Here we propose to accelerate the learning by using a more aggressive choice of the exponent, which adaptively changes during the iterations.

A simple strategy is to increase the exponent steadily as long as the new objective is smaller than the old one, and otherwise to shrink back to the safe choice η. The pseudo-code of this implementation is given in Algorithm 1, where D(X ‖ X̂), A, B, and η are defined according to the type of cost function (Euclidean distance or I-divergence). We have empirically used µ = 0.1 in all related experiments in this work. Although more comprehensive adaptation approaches could be applied, we find that such a simple strategy can already significantly speed up the optimization.

Algorithm 1 Multiplicative Updates with Adaptive Exponent for PNMF
Usage: W ← FastPNMF(X, η, µ).
  Initialize W; ρ ← η.
  repeat
    U_{ik} ← W_{ik} ( (AW)_{ik} / (BW)_{ik} )^{ρ}
    if D(X ‖ UU^T X) < D(X ‖ WW^T X) then
      W ← U
      ρ ← ρ + µ
    else
      ρ ← η
    end if
  until the convergence conditions are satisfied
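A possible NumPy implementation of Algorithm 1 is sketched below. It reuses the definitions of A, B and the objectives from Section 2; the rank r, the random initialization, the fixed iteration budget, and the eps guard are our additions and not part of the pseudo-code above.

  import numpy as np

  def fast_pnmf(X, r, cost="eu", eta=None, mu=0.1, n_iter=200, eps=1e-12, seed=0):
      """Adaptive multiplicative PNMF, a sketch of Algorithm 1."""
      if eta is None:
          eta = 1.0 / 3.0 if cost == "eu" else 0.5  # safe exponents from Theorem 1
      rng = np.random.default_rng(seed)
      m = X.shape[0]
      W = rng.random((m, r)) + eps                  # random nonnegative start
      rho = eta
      XXt = X @ X.T if cost == "eu" else None
      s = X.sum(axis=1)                             # X 1_n, used for B (I-divergence)

      def objective(V):
          X_hat = V @ (V.T @ X)
          if cost == "eu":
              return np.sum((X - X_hat) ** 2)
          return np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat)

      def aw_bw(V):                                 # numerator (AW) and denominator (BW)
          if cost == "eu":
              A = 2.0 * XXt
              B = V @ (V.T @ XXt) + XXt @ (V @ V.T)
          else:
              Z = X / (V @ (V.T @ X) + eps)
              A = Z @ X.T + X @ Z.T
              B = np.outer(np.ones(m), s) + np.outer(s, np.ones(m))
          return A @ V, B @ V

      for _ in range(n_iter):
          num, den = aw_bw(W)
          U = W * (num / (den + eps)) ** rho        # relaxed update, Eq. (5)
          if objective(U) < objective(W):
              W, rho = U, rho + mu                  # accept and grow the exponent
          else:
              rho = eta                             # reject and fall back to eta
      return W

For instance, fast_pnmf(X, r=10, cost="i") would run the I-divergence variant with the safe fallback exponent η = 1/2 from Theorem 1.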

4 Experiments

We have selected a variety of data sets that are commonly used in machine learning for our experiments. These data sets were obtained from the UCI repository (http://archive.ics.uci.edu/ml/), the University of Florida Sparse Matrix Collection (http://www.cise.ufl.edu/research/sparse/matrices/index.html), and the LSI text corpora, as well as other publicly available websites. The statistics of the data sets are summarized in Table 1.

Table 1. The data sets used in the experiments (m = dimensions, n = # of samples).

  data set   GD95_b  wine  sonar  mfeat  orl    feret  worldcities  swimmer  cisi  cran  med
  m          40      13    60     292    400    1024   313          256      1460  1398  1033
  n          69      178   208    2000   10304  2409   100          1024     5609  4612  5831

For the empirical comparison, we consider three methods: (i) PNMFn, i.e., the original PNMF algorithm using the multiplicative update rule (1) and the normalization step (4); (ii) PNMFc, i.e., the convergent multiplicative PNMF algorithm (5) using a constant exponent according to Theorem 1; and (iii) PNMFa, i.e., the fast adaptive PNMF algorithm using adaptive exponents (Algorithm 1).

4.1 A Counterexample of Using Extra Normalization

Figure 1 shows a counterexample for the original PNMF algorithm for the I-divergence using Eqs. (1), (3), and (4). We have used the GD95_b data set in this experiment. It can be seen that the monotonicity of the objective evolution is violated in every other loop from the 19th iteration onwards, and the optimization then gets stuck in an endless fluctuation without a decreasing trend.

[Figure 1 omitted: plot of the I-divergence (y-axis) versus iterations (x-axis) for the original PNMF algorithm on GD95_b.]

Fig. 1. A counterexample showing that the original PNMF algorithm with normalization does not monotonically decrease the I-divergence for the GD95_b data set.

4.2 Training Time Comparison

Figure 2 shows the objective evolution curves of the compared methods. One can see that the objectives of the proposed methods, PNMFc and PNMFa, decrease monotonically throughout the iterative learning process, without any unexpected increases. Furthermore, PNMFa generally converges the fastest, as its curves lie below the other two in all plots.

In addition to this qualitative analysis, we have also compared the converged times of the three methods. Table 2 summarizes the means and standard deviations of the resulting converged times.

Table 2. The mean (µ) and standard deviation (σ) of the converged time (seconds), reported as µ±σ.

(a) Criterion: the squared Euclidean distance

  method   wine        sonar       mfeat        orl           feret
  PNMFn    0.97±0.03   0.97±0.01   26.14±1.54   40.37±1.03    30.80±7.34
  PNMFc    0.22±0.11   0.22±0.11   68.57±1.75   117.26±1.74   107.58±24.43
  PNMFa    0.06±0.03   0.06±0.03   19.10±0.70   29.89±1.48    19.97±5.60

(b) Criterion: the I-divergence

  method   worldcities   swimmer        cisi           cran           med
  PNMFn    8.35±3.71     309.78±8.78    478.43±43.51   438.98±41.71   321.94±34.90
  PNMFc    14.07±2.98    613.04±20.63   863.89±69.23   809.61±62.64   566.99±64.44
  PNMFa    4.68±1.44     193.47±5.43    193.23±18.70   189.41±18.50   132.67±13.86

The converged time is calculated as follows. We first find the earliest iteration of PNMFn at which the objective D_n is sufficiently close to its minimum D^*, i.e., |D_n − D^*| / D^* < 0.001. The corresponding time is recorded as the converged time of PNMFn. For the PNMFc evolution, the converged time is that of the first iteration at which the objective D_c fulfills |D_c − D^*| / D^* < 0.001. If no such iteration exists, the converged time of PNMFc is set to the largest learning time of the three methods. The same procedure is applied to PNMFa. Each algorithm has been run on each data set 100 times with different random seeds for initialization.
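Assuming each run records a trace of (time, objective) pairs, the criterion above can be computed as in the following sketch; the function name and signature are ours.

  import numpy as np

  def converged_time(times, objectives, d_star, tol=1e-3, fallback=None):
      """Time of the earliest iteration whose objective is within tol of d_star."""
      times = np.asarray(times, dtype=float)
      objectives = np.asarray(objectives, dtype=float)
      close = np.abs(objectives - d_star) / d_star < tol
      if close.any():
          return times[np.argmax(close)]   # argmax returns the first True index
      return fallback                      # e.g. the largest learning time of the three methods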

[Figure 2 omitted: objective versus time (seconds) for PNMFn, PNMFc, and PNMFa on wine, mfeat, orl, and feret (squared Frobenius norm, left column) and on swimmer, cisi, cran, and med (I-divergence, right column).]

Fig. 2. Evolutions of the objectives of the compared methods based on (left) the squared Euclidean distance and (right) the I-divergence.

From Table 2, PNMFa is the fastest among the three compared methods: it is 1.5 to 2 times faster than PNMFn and 3 to 5 times faster than PNMFc. The advantage over PNMFn is even more significant for the two small data sets wine and sonar.

5 Conclusions

We have proposed a fast multiplicative algorithm for Projective Nonnegative Matrix Factorization (PNMF). It is based on two relaxations of the conventional multiplicative update rules. Firstly, relaxing the exponent of the multiplying factor from one to any positive number leads to theoretically convergent update rules without extra normalization. Secondly, a further relaxation that allows a variable exponent accelerates the iterative learning. Empirical results show that the proposed algorithm not only monotonically decreases the dissimilarity objective but also converges significantly faster than the previous implementation.

The accelerated algorithm facilitates applications of the PNMF method. More large-scale data sets will be tested in the future. Moreover, the proposed adaptive exponent technique readily extends to other fixed-point algorithms that use multiplicative updates.

  A Appendix: Proof of Theorem 1

A.1 The Euclidean distance case

We rewrite the squared Euclidean distance as

  D_{EU}(X \| \widetilde{W}\widetilde{W}^T X) = -2\,\mathrm{Tr}\big( \widetilde{W}^T X X^T \widetilde{W} \big) + \sum_{ij} \big( \widetilde{W}\widetilde{W}^T X \big)_{ij}^2 + \text{constant}.   (6)

The first term on the right is upper-bounded by its linear expansion at the current estimate W,

  -2\,\mathrm{Tr}\big( \widetilde{W}^T X X^T \widetilde{W} \big) \le -4 \sum_{ik} \big( X X^T W \big)_{ik} \widetilde{W}_{ik} + \text{constant},   (7)

because it is concave with respect to \widetilde{W}. Next, let \lambda_{ijak} = W_{ik} W_{ak} X_{aj} / (W W^T X)_{ij}. The second term can be upper-bounded by using Jensen's inequality as follows:

  \sum_{ij} \big( \widetilde{W}\widetilde{W}^T X \big)_{ij}^2 \le \sum_{ik} \big( W W^T X X^T W + X X^T W W^T W \big)_{ik} \frac{\widetilde{W}_{ik}^4}{2 W_{ik}^3}.   (8)

We can then construct the auxiliary function

  G(\widetilde{W}, W) = -2\,\mathrm{Tr}\big( \widetilde{W}^T A W \big) + \sum_{ik} (B W)_{ik} \frac{\widetilde{W}_{ik}^4}{2 W_{ik}^3} + \text{constant},   (9)

which upper-bounds D_{EU}(X \| \widetilde{W}\widetilde{W}^T X), with A and B defined in Eq. (2). Minimizing G(\widetilde{W}, W) is implemented by setting its gradient with respect to \widetilde{W} to zero, which yields the update rule (5) with η = 1/3.

A.2 The I-divergence case

We rewrite the I-divergence as

  D_{I}(X \| \widetilde{W}\widetilde{W}^T X) = \sum_{ij} \Big( - X_{ij} \log \big( \widetilde{W}\widetilde{W}^T X \big)_{ij} + \big( \widetilde{W}\widetilde{W}^T X \big)_{ij} \Big) + \text{constant}.   (10)

The first term is upper-bounded using Jensen's inequality:

  -\sum_{ij} X_{ij} \log \big( \widetilde{W}\widetilde{W}^T X \big)_{ij} \le -\frac{1}{2} \sum_{aik} A_{ai} W_{ik} W_{ak} \log \big( \widetilde{W}_{ik} \widetilde{W}_{ak} \big) + \text{constant},   (11)

where A is defined in Eq. (3). The second term can be rewritten with B defined in Eq. (3), and its upper bound is obtained from the quadratic bound of [8] with \widetilde{U}_{ik} = \widetilde{W}_{ik} and U_{ik} = W_{ik}:

  \sum_{ij} \big( \widetilde{W}\widetilde{W}^T X \big)_{ij} = \frac{1}{2} \mathrm{Tr}\big( \widetilde{W}^T B \widetilde{W} \big) \le \sum_{ik} (B W)_{ik} \frac{\widetilde{W}_{ik}^2}{2 W_{ik}}.   (12)

We can then construct the auxiliary function

  G(\widetilde{W}, W) = -\frac{1}{2} \sum_{aik} A_{ai} W_{ik} W_{ak} \log \big( \widetilde{W}_{ik} \widetilde{W}_{ak} \big) + \sum_{ik} (B W)_{ik} \frac{\widetilde{W}_{ik}^2}{2 W_{ik}} + \text{constant},   (13)

which upper-bounds D_{I}(X \| \widetilde{W}\widetilde{W}^T X). Setting the gradient of G(\widetilde{W}, W) with respect to \widetilde{W} to zero, we obtain the update rule (5) with η = 1/2.
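As a quick numerical sanity check of Theorem 1 (ours, not part of the paper), the following sketch applies the fixed-exponent update (5) to random nonnegative data and asserts, up to a small floating-point tolerance, that neither objective ever increases.

  import numpy as np

  def pnmf_step(X, W, cost, eta, eps=1e-12):
      """One fixed-exponent update, Eq. (5)."""
      if cost == "eu":
          XXt = X @ X.T
          A, B = 2.0 * XXt, W @ (W.T @ XXt) + XXt @ (W @ W.T)
      else:
          Z = X / (W @ (W.T @ X) + eps)
          s = X.sum(axis=1)
          A = Z @ X.T + X @ Z.T
          B = np.outer(np.ones(len(s)), s) + np.outer(s, np.ones(len(s)))
      return W * ((A @ W) / (B @ W + eps)) ** eta

  def objective(X, W, cost, eps=1e-12):
      X_hat = W @ (W.T @ X)
      if cost == "eu":
          return np.sum((X - X_hat) ** 2)
      return np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat)

  rng = np.random.default_rng(0)
  X = rng.random((50, 40))
  for cost, eta in (("eu", 1.0 / 3.0), ("i", 0.5)):
      W = rng.random((50, 5))
      prev = objective(X, W, cost)
      for _ in range(100):
          W = pnmf_step(X, W, cost, eta)
          curr = objective(X, W, cost)
          assert curr <= prev + 1e-8       # Theorem 1: the objective never increases
          prev = curr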

  References

  

1. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley (2009)
2. Ding, C., Li, T., Jordan, M.: Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), 45–55 (2010)
3. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation 21(3), 793–830 (2009)
4. Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5, 1457–1469 (2004)
5. Kim, H., Park, H.: Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12), 1495–1502 (2007)
6. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
7. Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 267–273 (2003)
8. Yang, Z., Oja, E.: Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Transactions on Neural Networks 22(12), 1878–1891 (2011)
9. Yang, Z., Oja, E.: Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks 21(5), 734–749 (2010)
10. Yuan, Z., Oja, E.: Projective nonnegative matrix factorization for image compression and feature extraction. In: Proc. of the 14th Scandinavian Conference on Image Analysis (SCIA 2005). pp. 333–342. Joensuu, Finland (June 2005)