Pairwise Clustering with t-PLSI

  He Zhang, Tele Hao, Zhirong Yang, and Erkki Oja

  

Department of Information and Computer Science

Aalto University School of Science, Espoo, Finland

{he.zhang,tele.hao,zhirong.yang,erkki.oja}@aalto.fi

Abstract. In the past decade, Probabilistic Latent Semantic Indexing (PLSI) has become an important modeling technique, widely used in clustering or graph partitioning analysis. However, the original PLSI is designed for multinomial data and may not handle other data types. To overcome this restriction, we generalize PLSI to the t-exponential family based on a recently proposed information criterion called the t-divergence. The t-divergence enjoys more flexibility than the KL-divergence in PLSI such that it can accommodate more types of noise in data. To optimize the generalized learning objective, we propose a Majorization-Minimization algorithm which multiplicatively updates the factorizing matrices. The new method is verified in pairwise clustering tasks. Experimental results on real-world datasets show that PLSI with t-divergence can improve clustering performance in purity for certain datasets.

Keywords: clustering, divergence, approximation, multiplicative update

1 Introduction

Probabilistic clustering has been obtaining promising results in many applications (e.g. [1, 3]). Especially, Probabilistic Latent Semantic Indexing (PLSI) [10], which provides a nice factorizing structure for optimization and statistical interpretation, has attracted much research effort in the past decade. PLSI was originally used for topic modeling and later also found a good application in clustering (e.g. [5]).

  Despite its success in many tasks, PLSI is restricted to multinomial data - originally, word counts in documents. That is, it assumes that data is generated from a multinomial distribution. Besides the nonnegative integer limitation, the multinomial assumption may not hold for other data with different types of noise.

  In this paper we generalize PLSI with a more flexible formulation based on nonnegative low-rank approximation. Maximizing the PLSI likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence between the input matrix and its approximation. The KL-divergence was recently generalized to a family called the t-divergence for measuring the approximation error. The t-divergence is more flexible than the KL-divergence in the sense that more types of noise model, e.g. data with a heavy-tailed distribution, can be accommodated [8]. Here we integrate the t-divergence in our new PLSI formulation and name the generalized method t-PLSI.

⋆ This work is supported by the Academy of Finland in the project Finnish Center of Excellence in Computational Inference Research (COIN).

  As the algorithmic contribution, we propose a Majorization-Minimization algorithm to solve the t-PLSI optimization problem. The t-divergence is constructed through the Fenchel dual of the log-partition function of t-exponential family distributions [8]. The resulting convexity facilitates developing convenient multiplicative variational algorithms for t-PLSI.

  We apply t-PLSI to pairwise clustering analysis. Fourteen real-world datasets are selected for comparing PLSI and the generalized method. Experimental results show that PLSI based on the KL-divergence (t = 1) is not always the best. For many selected datasets, t-PLSI can achieve better purities with other t values.

  The rest of the paper is organized as follows. Section 2 reviews the PLSI model and gives its formulation for the pairwise clustering framework. In Section 3, we present the generalized PLSI based on t-divergence, including its learning objective and optimization algorithm. Section 4 gives the experimental settings and results. Conclusions and some future work are given in Section 5.

2 Probabilistic Latent Semantic Indexing (PLSI)

  PLSI [10] was originally developed for text document analysis. Let $C$ be an $m \times n$ word-document matrix, where $C_{ij}$ is the number of times the $i$th word appears in the $j$th document. PLSI assumes that the data is generated from a multinomial distribution and maximizes the likelihood
$$\prod_{i=1}^{m}\prod_{j=1}^{n} P(w=i, d=j)^{C_{ij}}. \qquad (1)$$
The multinomial parameters $P(w=i, d=j)$ are factorized by
$$P(w=i, d=j) = \sum_{k=1}^{r} P(w=i \mid z=k)\,P(d=j \mid z=k)\,P(z=k), \qquad (2)$$
with conditional independence between the word variable $w$ and the document variable $d$ given the latent class variable $z$. In the following, we rewrite the probabilities in matrix notation for convenience: $\widehat{X} = WSH^{T}$, where $\widehat{X}_{ij} = P(w=i, d=j)$, $W_{ik} = P(w=i \mid z=k)$, $H_{jk} = P(d=j \mid z=k)$, and $S$ is a diagonal matrix with $S_{kk} = P(z=k)$.

  Given $X$ as the normalized version of $C$, i.e. $X_{ij} = C_{ij} / \sum_{ab} C_{ab}$, maximizing the likelihood in Eq. (1) is equivalent to minimizing the Kullback-Leibler divergence
$$D_{\mathrm{KL}}(X \,\|\, \widehat{X}) = \sum_{ij} X_{ij} \log \frac{X_{ij}}{(WSH^{T})_{ij}} \qquad (3)$$
subject to $W \ge 0$, $H \ge 0$, $\sum_{i=1}^{m} W_{ik} = 1$, $\sum_{k=1}^{r} S_{kk} = 1$, $\sum_{j=1}^{n} H_{jk} = 1$ (for details, see [5]).

  In this paper we focus on the symmetric form of PLSI for pairwise clustering, where $X$ is taken as the affinity matrix of a weighted undirected graph that represents pairwise similarities between data samples. No distinction between “words” and “documents” is made as in the original PLSI, and the data can be of any type. In this case, $H = W$ and thus $\widehat{X} = WSW^{T}$.
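To make the pairwise setting concrete, the following minimal Python sketch (ours, not from the paper) evaluates the KL objective of Eq. (3) for the symmetric case $H = W$; the function name and the small constant `eps` used to avoid $\log 0$ are our own choices.

```python
import numpy as np

def kl_plsi_objective(X, W, s, eps=1e-12):
    """D_KL(X || W S W^T) of Eq. (3) with H = W (pairwise clustering case).

    X : symmetric affinity matrix normalized to sum to one
    W : nonnegative factor with columns summing to one
    s : diagonal of S as a vector summing to one
    """
    X_hat = W @ np.diag(s) @ W.T
    return np.sum(X * np.log((X + eps) / (X_hat + eps)))
```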

3 PLSI with t-Divergence

  Data in real-world applications is not necessarily multinomially distributed. Different types of data may contain different types of noise (e.g. [9]). We therefore argue that, for certain data, the approximation error in $X \approx \widehat{X}$ should be measured by divergences more suitable than the Kullback-Leibler.

  To capture different statistical properties of data, e.g. heavy-tailed distributions, the exponential family can be generalized to the t-exponential family by replacing the exponential function with the t-exponential function (see e.g. [7]). Correspondingly, there are several ways to generalize the KL-divergence for approximate inference. Here we adopt the t-divergence proposed by Ding et al. [8] because it maintains the Fenchel duality in the t-exponential family.

  The t-divergence for two densities $p$ and $\tilde{p}$ is defined as
$$D_{t}(p \,\|\, \tilde{p}) = \int q(x)\log_{t} p(x) - q(x)\log_{t}\tilde{p}(x)\,dx, \qquad (4)$$
where $q(x) = \frac{p(x)^{t}}{\int p(x)^{t}\,dx}$ is a normalization term called the escort distribution of $p(x)$, and $\log_{t}$ is the inverse of the t-exponential function (see e.g. [7]):
$$\log_{t}(x) = \frac{x^{1-t}-1}{1-t} \qquad (5)$$
for $t \in (0,2)\setminus\{1\}$. When $t = 0$, $\log_{t}(x) = x - 1$; when $t = 2$, $\log_{t}(x) = 1 - \frac{1}{x}$; and when $t \to 1$, $\log_{t}(x) = \log(x)$, so the t-divergence (4) reduces to the KL-divergence. It is worth noticing that $\log_{t}$ decays towards 0 more slowly than the usual $\log$ function for $1 < t < 2$, which leads to the heavy-tailed nature of the t-exponential family and is desirable for robustness.
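As a concrete illustration of these definitions (our addition, not part of the original paper), the following Python sketch evaluates $\log_t$ and the discrete t-divergence of Eq. (4) between two normalized nonnegative vectors; the function names are ours.

```python
import numpy as np

def log_t(x, t):
    """t-logarithm (x^(1-t) - 1) / (1 - t); reduces to log(x) as t -> 1."""
    if np.isclose(t, 1.0):
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def t_divergence(p, p_tilde, t):
    """Discrete t-divergence D_t(p || p_tilde) using the escort distribution of p."""
    p = np.asarray(p, dtype=float)
    p_tilde = np.asarray(p_tilde, dtype=float)
    q = p ** t / np.sum(p ** t)          # escort distribution of p
    return np.sum(q * (log_t(p, t) - log_t(p_tilde, t)))

p = np.array([0.5, 0.3, 0.2])
p_tilde = np.array([0.4, 0.4, 0.2])
print(t_divergence(p, p_tilde, 0.999))   # close to the KL-divergence
print(t_divergence(p, p_tilde, 1.5))     # a heavier-tailed member of the family
```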

3.1 Learning Objective

  We now generalize the PLSI model by using the t-divergence. The discrete version of the t-divergence between a normalized nonnegative matrix $X$ and its approximation $\widehat{X}$ is given by
$$D_{t}(X \,\|\, \widehat{X}) = \sum_{ij}\left[q_{ij}\log_{t}X_{ij} - q_{ij}\log_{t}\widehat{X}_{ij}\right], \qquad (6)$$
where $q_{ij} = \frac{X_{ij}^{t}}{\sum_{ab}X_{ab}^{t}}$. Recall that we use $\widehat{X} = WSW^{T}$ for symmetric inputs. The resulting t-PLSI optimization problem for pairwise clustering is
$$\underset{W,S}{\text{minimize}}\quad D_{t}(X \,\|\, \widehat{X}) = \frac{1 - \sum_{ij}X_{ij}^{t}\,\widehat{X}_{ij}^{1-t}}{(1-t)\sum_{ab}X_{ab}^{t}} \qquad (7)$$
$$\text{subject to}\quad \sum_{i}W_{ik} = 1 \text{ for all } k, \quad\text{and}\quad \sum_{k}S_{kk} = 1. \qquad (8)$$
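A direct transcription of Eq. (7) into code may help to fix ideas. The sketch below is ours and assumes a symmetric affinity matrix `X` normalized to sum to one, a column-stochastic `W`, and the diagonal of $S$ stored as a vector `s`; it is only defined for $t \neq 1$.

```python
import numpy as np

def t_plsi_objective(X, W, s, t):
    """t-PLSI objective D_t(X || W S W^T) of Eq. (7), for t in (0,2) with t != 1."""
    X_hat = W @ np.diag(s) @ W.T
    denom = (1.0 - t) * np.sum(X ** t)
    return (1.0 - np.sum(X ** t * X_hat ** (1.0 - t))) / denom
```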

3.2 Optimization

  The t-divergence in Eq. (7) is convex in $\widehat{X}_{ij}$ for $t \in [0,2]$. We can therefore apply Jensen's inequality to develop a Majorization-Minimization algorithm which iterates the following update rules:
$$W_{ik} \propto \begin{cases} W_{ik}\left[\left(A \oslash (WSW^{T})^{t}\right)W\right]_{ik}^{\frac{1}{2t-1}}, & 0 < t < 1,\\[4pt] W_{ik}\left[\left(A \oslash (WSW^{T})^{t}\right)W\right]_{ik}, & 1 < t < 2, \end{cases} \qquad (9)$$
$$S_{kk} \propto S_{kk}\left[W^{T}\left(A \oslash (WSW^{T})^{t}\right)W\right]_{kk}^{\frac{1}{t}}, \qquad (10)$$
where $A_{ij} = \frac{X_{ij}^{t}}{\sum_{ab}X_{ab}^{t}}$ and $\oslash$ denotes the element-wise division between two matrices of equal size.
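The update rules translate directly into array operations. The sketch below is our illustration; the variable names, the small constant `eps`, and the choice of updating $W$ and $S$ from the same current estimates within one sweep are our assumptions.

```python
import numpy as np

def t_plsi_update(X, W, s, t, eps=1e-12):
    """One multiplicative MM sweep of W and of the diagonal s, following Eqs. (9)-(10).

    Assumes t in (0,1) or (1,2), X normalized and symmetric.
    """
    A = X ** t / np.sum(X ** t)                     # A_ij = X_ij^t / sum_ab X_ab^t
    X_hat = W @ np.diag(s) @ W.T                    # current approximation W S W^T
    Z = A / (X_hat ** t + eps)                      # element-wise division A ./ (W S W^T)^t
    B = Z @ W + eps
    W_new = W * (B ** (1.0 / (2.0 * t - 1.0)) if t < 1.0 else B)
    W_new /= W_new.sum(axis=0, keepdims=True)       # enforce sum_i W_ik = 1
    s_new = s * np.einsum('ik,ij,jk->k', W, Z, W) ** (1.0 / t)
    s_new /= s_new.sum()                            # enforce sum_k S_kk = 1
    return W_new, s_new
```

In practice this sweep would be iterated, e.g. until the objective of Eq. (7) stops decreasing, which Theorem 1 below guarantees to happen monotonically.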

  Theorem 1. The t-PLSI objective in Eq. (7) monotonically decreases under the multiplicative update rules (9) and (10).

  The proof is given in the appendix. It is a basic fact that a lower-bounded, monotonically decreasing sequence is convergent. Because the t-divergence is lower bounded by zero, the monotonicity shown in the above theorem guarantees that the objective values in Eq. (7) converge.

4 Experiments

  We have compared the clustering performances of the original PLSI and the proposed PLSI based on the t-divergence on a number of undirected graphs with ground-truth classes. The evaluation criterion that we adopt is the clustering purity
$$\text{purity} = \frac{1}{n}\sum_{k=1}^{r}\max_{1 \le l \le q} n_{k}^{l},$$
where $n$ is the total number of vertices, $q$ the number of ground-truth classes, and $n_{k}^{l}$ the number of vertices in partition $k$ that belong to ground-truth class $l$. A larger purity in general corresponds to a better clustering result.
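Computing this purity from a vector of cluster assignments and a vector of ground-truth labels is straightforward; the following sketch (ours) assumes the labels are encoded as nonnegative integers.

```python
import numpy as np

def purity(clusters, labels):
    """purity = (1/n) * sum over clusters k of max_l |cluster k intersected with class l|."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)           # nonnegative integer class labels
    total = 0
    for k in np.unique(clusters):
        members = labels[clusters == k]
        total += np.bincount(members).max()   # size of the dominant class in cluster k
    return total / len(labels)
```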

  We have used fourteen datasets to evaluate the two compared methods. These datasets can be retrieved from the Pajek database¹, Newman's collection², or the UCI machine learning repository³.

¹ http://vlado.fmf.uni-lj.si/pub/networks/data/
² http://www-personal.umich.edu/~mejn/netdata/
³ http://archive.ics.uci.edu/ml/index.html

  Table 1: Clustering performances measured by purity for the two methods.

(a) t = 0.2 in the case 0 < t < 1

Dataset      type          # instances  # classes  PLSI   t-PLSI
strike       graph         24           3          0.96   1.00
sawmill      graph         36           3          0.50   0.53
cities       graph         46           4          0.57   0.67
dolphins     graph         62           2          0.98   0.98
parkinsons   multivariate  195          2          0.52   0.54
adjnoun      graph         112          2          0.75   0.75
balance      multivariate  625          3          0.68   0.63

(b) t = 1.5 in the case 1 < t < 2

Dataset      type          # instances  # classes  PLSI   t-PLSI
highschool   graph         63           6          0.87   0.89
football     graph         115          12         0.39   0.42
scotland     graph         108          8          0.93   0.93
glass        multivariate  214          6          0.44   0.49
journals     graph         124          6          0.88   0.89
ecoli        multivariate  336          8          0.80   0.80
vowel        multivariate  990          11         0.34   0.35

  Nine datasets are sparse graphs and the remaining five are multivariate data. We preprocessed the latter type into sparse similarity matrices by symmetrizing their K-Nearest-Neighbor graphs (K = 15).
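The paper only states that the K-nearest-neighbor graphs (K = 15) were symmetrized; the sketch below is one plausible reading of that preprocessing, with a binary (connectivity) K-NN graph and max-symmetrization as our assumptions.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_affinity(data, k=15):
    """Binary K-NN graph, symmetrized and normalized to sum to one."""
    G = kneighbors_graph(data, n_neighbors=k, mode='connectivity').toarray()
    A = np.maximum(G, G.T)          # symmetrize: i~j if either is a neighbor of the other
    return A / A.sum()              # normalized affinity matrix X
```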

  We set the number of clusters as the true number of classes for the two clustering algorithms on all datasets. The parameters W and S were initialized by following the Fresh Start procedure proposed by Ding et al. [6]. Table 1 shows the clustering performances of the compared algorithms.

  From the results in Table 1, we can see that the original PLSI based on the KL-divergence, i.e. t = 1, is not always the best. For certain datasets, the clustering purity can be improved by using the t-divergence with other t values. For example, when t = 0.2 in the case 0 < t < 1 (Table 1a), t-PLSI achieves a perfect clustering result on the “strike” dataset with purity 1.00. As for the “cities” dataset, the clustering purity is improved by 10 percentage points over that of the original PLSI. A similar analysis can be made for the case 1 < t < 2 (Table 1b). Compared with the KL-divergence, the t-divergence has an extra degree of flexibility given by the free parameter t. Especially, as t → 2, the tail of the $\log_t$ function becomes heavier than that of the usual log function, which makes the new method more robust against outliers.

5 Conclusions

  We have studied the generalization of Probabilistic Latent Semantic Indexing with the t-divergence family. The generalized PLSI was formulated as a nonnegative low-rank approximation problem. The formulation is more flexible and can accommodate more types of noise in data. We have proposed a Majorization-Minimization algorithm for optimizing the constrained objective. Empirical comparison shows that clustering performance in purity can be improved by using the generalized method with suitable t-divergences other than the KL-divergence.

  The proposed generalization is not restricted to PLSI. The t-divergence could be used in other nonnegative approximation problems, for example with other matrix factorizations or decompositions, or with other constraints, where the flexibility might also help to improve the performance.

  An interesting question raised with the t-exponential family is whether we can find its conjugate prior. The multinomial distribution that underlies PLSI has the Dirichlet as its conjugate prior. This conjugacy largely accounts for the success of generative topic modeling in recent years. Inference based on the t-exponential family might also benefit from a conjugate prior if one exists. That is, if we could find a conjugate prior for t-PLSI, we might also apply a nonparametric Bayesian treatment similar to Latent Dirichlet Allocation and thus avoid overfitting.

  The t-divergence is related to two other divergence families, the α-divergence and the Rényi divergence (see e.g. [3, 4, 13]). One of the major differences is normalization of the input: the α-divergence involves no normalization and could be problematic when combined with prior information; the Rényi divergence normalizes the input before the power operation, while the t-divergence employs the t-power before normalization. The latter has the ability to smooth nonzero entries (for 0 < t < 1) or exclude outliers (for 1 < t < 2). A more thorough comparison among these divergence families should be carried out in the future.

  Another important and still open problem is how to select among the various t-divergences. This strongly depends on the nature of the data to be analyzed. Usually a larger t leads to a more exclusive approximation, while a smaller t leads to a more inclusive one. Automatic selection within a parameterized divergence family generally requires extra information or criteria, for example ground-truth data [2] or cross-validation with a fixed reference parameter [12].

  Appendix: Proof of Theorem 1

  The development follows the Majorization-Minimization steps (see [11]) as developed for multiplicative updates by Yang and Oja [13]. In the derivation, we use $W$ and $S$ for the current estimates and $\widetilde{W}$ and $\widetilde{S}$ for the new variables.

  Proof. Introducing Lagrangian multipliers $\{\lambda_k\}_{k=1}^{r}$,
$$\widetilde{J}(\widetilde{W},\widetilde{S}) \equiv J(\widetilde{W},\widetilde{S}) + \sum_{k}\lambda_{k}\left(\sum_{i}\widetilde{W}_{ik}-1\right). \qquad (11)$$

  (Majorization) By Jensen's inequality,
$$\widetilde{J}(\widetilde{W},\widetilde{S}) \le \frac{1}{t-1}\sum_{ij}\sum_{k}A_{ij}\,\phi_{ijk}^{t}\left(\widetilde{W}_{ik}\widetilde{S}_{kk}\widetilde{W}_{jk}\right)^{1-t} + \sum_{k}\lambda_{k}\left(\sum_{i}\widetilde{W}_{ik}-1\right) + \text{const},$$
where $A_{ij} = \frac{X_{ij}^{t}}{\sum_{ab}X_{ab}^{t}}$, $\phi_{ijk} = \frac{W_{ik}S_{kk}W_{jk}}{\sum_{l}W_{il}S_{ll}W_{jl}}$, and the constant collects terms that do not depend on $\widetilde{W}$ or $\widetilde{S}$. Denote the right-hand side by $G(\widetilde{W},W)$.

  Case 1: when $0 < t < 1$, we have $G(\widetilde{W},W) \le G_{1}(\widetilde{W},W)$, where
$$G_{1}(\widetilde{W},W) = \frac{1}{t-1}\sum_{ij}\sum_{k}A_{ij}\phi_{ijk}^{t}\,S_{kk}^{1-t}W_{jk}^{1-t}\,\frac{\widetilde{W}_{ik}^{2(1-t)}}{W_{ik}^{1-t}} + \sum_{k}\lambda_{k}\left(\sum_{i}\widetilde{W}_{ik}-1\right).$$

  Case 2: when $1 < t < 2$, we have $G(\widetilde{W},W) \le G_{2}(\widetilde{W},W)$, where
$$G_{2}(\widetilde{W},W) = \frac{1}{t-1}\sum_{ij}\sum_{k}A_{ij}\phi_{ijk}^{t}\,S_{kk}^{1-t}W_{ik}^{1-t}W_{jk}^{1-t}\left(1+\log\frac{\widetilde{W}_{ik}^{1-t}\widetilde{W}_{jk}^{1-t}}{W_{ik}^{1-t}W_{jk}^{1-t}}\right) + \sum_{k}\lambda_{k}\left(\sum_{i}\widetilde{W}_{ik}-1\right).$$

  (Minimization) For Case 1, when $0 < t < 1$:
$$\frac{\partial G_{1}}{\partial\widetilde{W}_{pq}} = -2\sum_{j}A_{pj}\phi_{pjq}^{t}S_{qq}^{1-t}W_{jq}^{1-t}\,\frac{\widetilde{W}_{pq}^{2(1-t)-1}}{W_{pq}^{1-t}} + \lambda_{q} = 0.$$
This gives $\widetilde{W}_{pq} = W_{pq}^{\frac{t-1}{2t-1}}\,\lambda_{q}^{-\frac{1}{2t-1}}\left(2S_{qq}^{1-t}\sum_{j}A_{pj}\phi_{pjq}^{t}W_{jq}^{1-t}\right)^{\frac{1}{2t-1}}$. By using $\sum_{p}\widetilde{W}_{pq} = 1$, we obtain
$$\lambda_{q}^{\frac{1}{2t-1}} = \sum_{a}W_{aq}^{\frac{t-1}{2t-1}}\left(2S_{qq}^{1-t}\sum_{j}A_{aj}\phi_{ajq}^{t}W_{jq}^{1-t}\right)^{\frac{1}{2t-1}}. \qquad (12)$$

  For Case 2, when $1 < t < 2$:
$$\frac{\partial G_{2}}{\partial\widetilde{W}_{pq}} = -\sum_{j}A_{pj}\phi_{pjq}^{t}S_{qq}^{1-t}W_{pq}^{1-t}W_{jq}^{1-t}\,\frac{1}{\widetilde{W}_{pq}} + \lambda_{q} = 0.$$
This gives $\widetilde{W}_{pq} = \lambda_{q}^{-1}\,S_{qq}^{1-t}\sum_{j}A_{pj}\phi_{pjq}^{t}W_{pq}^{1-t}W_{jq}^{1-t}$. By using $\sum_{p}\widetilde{W}_{pq} = 1$, we obtain
$$\lambda_{q} = S_{qq}^{1-t}\sum_{a}\sum_{j}A_{aj}\phi_{ajq}^{t}W_{aq}^{1-t}W_{jq}^{1-t}. \qquad (13)$$

  Similarly we can prove the monotonicity over $S$. Let $\widetilde{J}(W,\widetilde{S}) \equiv J(W,\widetilde{S}) + \lambda\left(\sum_{k}\widetilde{S}_{kk}-1\right)$, which is tightly upper bounded by
$$G(\widetilde{S},S) \equiv \frac{1}{t-1}\sum_{ij}\sum_{k}A_{ij}\phi_{ijk}^{t}W_{ik}^{1-t}W_{jk}^{1-t}\widetilde{S}_{kk}^{1-t} + \lambda\left(\sum_{k}\widetilde{S}_{kk}-1\right).$$
Zeroing the derivative of $G(\widetilde{S},S)$ gives $\widetilde{S}_{qq} = \lambda^{-\frac{1}{t}}\left(\sum_{ij}A_{ij}\phi_{ijq}^{t}W_{iq}^{1-t}W_{jq}^{1-t}\right)^{\frac{1}{t}}$. Using $\sum_{q}\widetilde{S}_{qq} = 1$, we obtain $\lambda^{\frac{1}{t}} = \sum_{a}\left(\sum_{ij}A_{ij}\phi_{ija}^{t}W_{ia}^{1-t}W_{ja}^{1-t}\right)^{\frac{1}{t}}$. Inserting $\lambda$ back into $\widetilde{S}_{qq}$ gives the update rule (10).

References

1. Arora, R., Gupta, M., Kapila, A., Fazel, M.: Clustering by left-stochastic matrix factorization. In: International Conference on Machine Learning (ICML). pp. 761–768 (2011)
2. Choi, H., Choi, S., Katake, A., Choe, Y.: Learning alpha-integration with partially-labeled data. In: Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing. pp. 14–19 (2010)
3. Cichocki, A., Lee, H., Kim, Y.D., Choi, S.: Non-negative matrix factorization with α-divergence. Pattern Recognition Letters 29, 1433–1440 (2008)
4. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley (2009)
5. Ding, C., Li, T., Peng, W.: On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis 52(8), 3913–3927 (2008)
6. Ding, C., Li, T., Jordan, M.: Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), 45–55 (2010)
7. Ding, N., Vishwanathan, S.: t-logistic regression. In: Advances in Neural Information Processing Systems 23, pp. 514–522 (2010)
8. Ding, N., Vishwanathan, S., Qi, Y.A.: t-divergence based approximate inference. In: Advances in Neural Information Processing Systems 24, pp. 1494–1502 (2011)
9. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation 21(3), 793–830 (2009)
10. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International Conference on Research and Development in Information Retrieval (SIGIR). pp. 50–57. ACM (1999)
11. Hunter, D.R., Lange, K.: A tutorial on MM algorithms. The American Statistician 58(1), 30–37 (2004)
12. Mollah, M., Sultana, N., Minami, M.: Robust extraction of local structures by the minimum of beta-divergence method. Neural Networks 23, 226–238 (2010)
13. Yang, Z., Oja, E.: Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Transactions on Neural Networks 22(12), 1878–1891 (2011)