Pairwise Clustering with t-PLSI

  He Zhang, Tele Hao, Zhirong Yang, and Erkki Oja

  

Department of Information and Computer Science

Aalto University School of Science, Espoo, Finland

{he.zhang,tele.hao,zhirong.yang,erkki.oja}@aalto.fi

Abstract. In the past decade, Probabilistic Latent Semantic Indexing (PLSI) has become an important modeling technique, widely used in clustering or graph partitioning analysis. However, the original PLSI is designed for multinomial data and may not handle other data types. To overcome this restriction, we generalize PLSI to the t-exponential family based on a recently proposed information criterion called the t-divergence. The t-divergence enjoys more flexibility than the KL-divergence in PLSI such that it can accommodate more types of noise in data. To optimize the generalized learning objective, we propose a Majorization-Minimization algorithm which multiplicatively updates the factorizing matrices. The new method is verified in pairwise clustering tasks. Experimental results on real-world datasets show that PLSI with t-divergence can improve clustering performance in purity for certain datasets.

Keywords: clustering, divergence, approximation, multiplicative update

1 Introduction

Probabilistic clustering has been obtaining promising results in many applications (e.g. [1, 3]). Especially, Probabilistic Latent Semantic Indexing (PLSI) [10], which provides a nice factorizing structure for optimization and statistical interpretation, has attracted much research effort in the past decade. PLSI was originally used for topic modeling and later also found a good application in clustering (e.g. [5]).

  Despite its success in many tasks, PLSI is restricted to multinomial data - originally, word counts in documents. That is, it assumes that data is generated from a multinomial distribution. Besides the nonnegative integer limitation, the multinomial assumption may not hold for other data with different types of noise.

  In this paper we generalize PLSI with a more flexible formulation based on nonnegative low-rank approximation. Maximizing the PLSI likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence between the input matrix and its approximation. The KL-divergence was recently generalized to a family called the t-divergence for measuring the approximation error. The t-divergence is more flexible than the KL-divergence in the sense that more types of noise model, e.g. data with a heavy-tailed distribution, can be accommodated [8]. Here we integrate the t-divergence in our new PLSI formulation and name the generalized method t-PLSI.

⋆ This work is supported by the Academy of Finland in the project Finnish Center of Excellence in Computational Inference Research (COIN).

  As the algorithmic contribution, we propose a Majorization-Minimization algorithm to solve the t-PLSI optimization problem. The t-divergence is constructed through the Fenchel dual of the log-partition function of t-exponential family distributions [8]. The resulting convexity facilitates developing convenient multiplicative variational algorithms for t-PLSI.

  We apply t-PLSI to pairwise clustering analysis. Fourteen real-world datasets are selected for comparing PLSI and the generalized method. Experimental results show that PLSI based on the KL-divergence (t = 1) is not always the best. For many selected datasets, t-PLSI can achieve better purities with other t values.

  The rest of the paper is organized as follows. Section 2 reviews the PLSI model and gives its formulation for the pairwise clustering framework. In Section 3, we present the generalized PLSI based on t-divergence, including its learning objective and optimization algorithm. Section 4 gives the experimental settings and results. Conclusions and some future work are given in Section 5.

2 Probabilistic Latent Semantic Indexing (PLSI)

  PLSI [10] was originally developed for text document analysis. Let $C$ be an $m \times n$ word-document matrix, where $C_{ij}$ is the number of times the $i$th word appears in the $j$th document. PLSI assumes that the data is generated from a multinomial distribution and maximizes the likelihood
$$\prod_{i=1}^{m}\prod_{j=1}^{n} P(w=i, d=j)^{C_{ij}}. \qquad (1)$$
The multinomial parameters $P(w=i, d=j)$ are factorized by
$$P(w=i, d=j) = \sum_{k=1}^{r} P(w=i \mid z=k)\,P(d=j \mid z=k)\,P(z=k), \qquad (2)$$
with conditional independence between the word variable $w$ and the document variable $d$ given the latent class variable $z$. In the following, we rewrite the probabilities in matrix notation for convenience: $\widehat{X} = WSH^{T}$, where $\widehat{X}_{ij} = P(w=i, d=j)$, $W_{ik} = P(w=i \mid z=k)$, $H_{jk} = P(d=j \mid z=k)$, and $S$ is a diagonal matrix with $S_{kk} = P(z=k)$.

  Given $X$ as the normalized version of $C$, i.e. $X_{ij} = C_{ij} / \sum_{ab} C_{ab}$, maximizing the likelihood in Eq. (1) is equivalent to minimizing the Kullback-Leibler divergence
$$D_{\mathrm{KL}}(X \,\|\, \widehat{X}) = \sum_{ij} X_{ij} \log \frac{X_{ij}}{(WSH^{T})_{ij}} \qquad (3)$$
subject to $W \ge 0$, $H \ge 0$, $\sum_{i=1}^{m} W_{ik} = 1$, $\sum_{k=1}^{r} S_{kk} = 1$, $\sum_{j=1}^{n} H_{jk} = 1$ (for details, see [5]).

  In this paper we focus on the symmetric form of PLSI for pairwise clustering, where $X$ is taken as the affinity matrix of a weighted undirected graph that represents pairwise similarities between data samples. No distinction between “words” and “documents” is made as in the original PLSI, and the data can be of any type. In this case, $H = W$ and thus $\widehat{X} = WSW^{T}$.
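To make the pairwise setting concrete, the following minimal Python sketch (ours, not from the paper) evaluates the KL objective of Eq. (3) for the symmetric case $H = W$; the function name and the small constant `eps` used to avoid $\log 0$ are our own choices.

```python
import numpy as np

def kl_plsi_objective(X, W, s, eps=1e-12):
    """D_KL(X || W S W^T) of Eq. (3) with H = W (pairwise clustering case).

    X : symmetric affinity matrix normalized to sum to one
    W : nonnegative factor with columns summing to one
    s : diagonal of S as a vector summing to one
    """
    X_hat = W @ np.diag(s) @ W.T
    return np.sum(X * np.log((X + eps) / (X_hat + eps)))
```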

3 PLSI with t-Divergence

  Data in real-world applications is not necessarily multinomially distributed. Different types of data may contain different types of noise (e.g. [9]). We therefore argue that, for certain data, the approximation error in $X \approx \widehat{X}$ should be measured by divergences more suitable than the Kullback-Leibler.

  To capture different statistical properties of data, e.g. heavy-tailed distributions, the exponential family can be generalized to the t-exponential family by replacing the exponential function with the t-exponential function (see e.g. [7]). Correspondingly, there are several ways to generalize the KL-divergence for approximate inference. Here we adopt the t-divergence proposed by Ding et al. [8] because it maintains the Fenchel duality in the t-exponential family.

  The t-divergence for two densities $p$ and $\tilde{p}$ is defined as
$$D_{t}(p \,\|\, \tilde{p}) = \int q(x)\log_{t} p(x) - q(x)\log_{t}\tilde{p}(x)\,dx, \qquad (4)$$
where $q(x) = \frac{p(x)^{t}}{\int p(x)^{t}\,dx}$ is a normalization term called the escort distribution of $p(x)$, and $\log_{t}$ is the inverse of the t-exponential function (see e.g. [7]):
$$\log_{t}(x) = \frac{x^{1-t}-1}{1-t} \qquad (5)$$
for $t \in (0,2)\setminus\{1\}$. When $t = 0$, $\log_{t}(x) = x - 1$; when $t = 2$, $\log_{t}(x) = 1 - \frac{1}{x}$; and when $t \to 1$, $\log_{t}(x) = \log(x)$, so the t-divergence (4) reduces to the KL-divergence. It is worth noticing that $\log_{t}$ decays towards 0 more slowly than the usual $\log$ function for $1 < t < 2$, which leads to the heavy-tailed nature of the t-exponential family and is desirable for robustness.
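As a concrete illustration of these definitions (our addition, not part of the original paper), the following Python sketch evaluates $\log_t$ and the discrete t-divergence of Eq. (4) between two normalized nonnegative vectors; the function names are ours.

```python
import numpy as np

def log_t(x, t):
    """t-logarithm (x^(1-t) - 1) / (1 - t); reduces to log(x) as t -> 1."""
    if np.isclose(t, 1.0):
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def t_divergence(p, p_tilde, t):
    """Discrete t-divergence D_t(p || p_tilde) using the escort distribution of p."""
    p = np.asarray(p, dtype=float)
    p_tilde = np.asarray(p_tilde, dtype=float)
    q = p ** t / np.sum(p ** t)          # escort distribution of p
    return np.sum(q * (log_t(p, t) - log_t(p_tilde, t)))

p = np.array([0.5, 0.3, 0.2])
p_tilde = np.array([0.4, 0.4, 0.2])
print(t_divergence(p, p_tilde, 0.999))   # close to the KL-divergence
print(t_divergence(p, p_tilde, 1.5))     # a heavier-tailed member of the family
```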

3.1 Learning Objective

  We now generalize the PLSI model by using the t-divergence. The discrete version of the t-divergence between a normalized nonnegative matrix $X$ and its approximation $\widehat{X}$ is given by
$$D_{t}(X \,\|\, \widehat{X}) = \sum_{ij}\left[q_{ij}\log_{t}X_{ij} - q_{ij}\log_{t}\widehat{X}_{ij}\right], \qquad (6)$$
where $q_{ij} = \frac{X_{ij}^{t}}{\sum_{ab}X_{ab}^{t}}$. Recall that we use $\widehat{X} = WSW^{T}$ for symmetric inputs. The resulting t-PLSI optimization problem for pairwise clustering is
$$\underset{W,S}{\text{minimize}}\quad D_{t}(X \,\|\, \widehat{X}) = \frac{1 - \sum_{ij}X_{ij}^{t}\,\widehat{X}_{ij}^{1-t}}{(1-t)\sum_{ab}X_{ab}^{t}} \qquad (7)$$
$$\text{subject to}\quad \sum_{i}W_{ik} = 1 \text{ for all } k, \quad\text{and}\quad \sum_{k}S_{kk} = 1. \qquad (8)$$
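A direct transcription of Eq. (7) into code may help to fix ideas. The sketch below is ours and assumes a symmetric affinity matrix `X` normalized to sum to one, a column-stochastic `W`, and the diagonal of $S$ stored as a vector `s`; it is only defined for $t \neq 1$.

```python
import numpy as np

def t_plsi_objective(X, W, s, t):
    """t-PLSI objective D_t(X || W S W^T) of Eq. (7), for t in (0,2) with t != 1."""
    X_hat = W @ np.diag(s) @ W.T
    denom = (1.0 - t) * np.sum(X ** t)
    return (1.0 - np.sum(X ** t * X_hat ** (1.0 - t))) / denom
```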

3.2 Optimization

  The t-divergence in Eq. (7) is convex in $\widehat{X}_{ij}$ for $t \in [0,2]$. We can therefore apply Jensen's inequality to develop a Majorization-Minimization algorithm which iterates the following update rules:
$$W_{ik} \propto \begin{cases} W_{ik}\left[\left(A \oslash (WSW^{T})^{t}\right)W\right]_{ik}^{\frac{1}{2t-1}}, & 0 < t < 1,\\[4pt] W_{ik}\left[\left(A \oslash (WSW^{T})^{t}\right)W\right]_{ik}, & 1 < t < 2, \end{cases} \qquad (9)$$
$$S_{kk} \propto S_{kk}\left[W^{T}\left(A \oslash (WSW^{T})^{t}\right)W\right]_{kk}^{\frac{1}{t}}, \qquad (10)$$
where $A_{ij} = \frac{X_{ij}^{t}}{\sum_{ab}X_{ab}^{t}}$ and $\oslash$ denotes the element-wise division between two matrices of equal size.
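The update rules translate directly into array operations. The sketch below is our illustration; the variable names, the small constant `eps`, and the choice of updating $W$ and $S$ from the same current estimates within one sweep are our assumptions.

```python
import numpy as np

def t_plsi_update(X, W, s, t, eps=1e-12):
    """One multiplicative MM sweep of W and of the diagonal s, following Eqs. (9)-(10).

    Assumes t in (0,1) or (1,2), X normalized and symmetric.
    """
    A = X ** t / np.sum(X ** t)                     # A_ij = X_ij^t / sum_ab X_ab^t
    X_hat = W @ np.diag(s) @ W.T                    # current approximation W S W^T
    Z = A / (X_hat ** t + eps)                      # element-wise division A ./ (W S W^T)^t
    B = Z @ W + eps
    W_new = W * (B ** (1.0 / (2.0 * t - 1.0)) if t < 1.0 else B)
    W_new /= W_new.sum(axis=0, keepdims=True)       # enforce sum_i W_ik = 1
    s_new = s * np.einsum('ik,ij,jk->k', W, Z, W) ** (1.0 / t)
    s_new /= s_new.sum()                            # enforce sum_k S_kk = 1
    return W_new, s_new
```

In practice this sweep would be iterated, e.g. until the objective of Eq. (7) stops decreasing, which Theorem 1 below guarantees to happen monotonically.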

  Theorem 1. The t-PLSI objective in Eq. (7) monotonically decreases under the multiplicative update rules (9) and (10).

  The proof is given in the appendix. It is a basic fact that a lower-bounded, monotonically decreasing sequence is convergent. Because the t-divergence is lower bounded by zero, the monotonicity shown in the above theorem guarantees that the objective values in Eq. (7) converge.

4 Experiments

  We have compared the clustering performances of the original PLSI and the proposed PLSI based on the t-divergence on a number of undirected graphs with ground-truth classes. The evaluation criterion that we adopt is the clustering purity
$$\text{purity} = \frac{1}{n}\sum_{k=1}^{r}\max_{1 \le l \le q} n_{k}^{l},$$
where $n$ is the total number of vertices, $q$ the number of ground-truth classes, and $n_{k}^{l}$ the number of vertices in partition $k$ that belong to ground-truth class $l$. A larger purity in general corresponds to a better clustering result.
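Computing this purity from a vector of cluster assignments and a vector of ground-truth labels is straightforward; the following sketch (ours) assumes the labels are encoded as nonnegative integers.

```python
import numpy as np

def purity(clusters, labels):
    """purity = (1/n) * sum over clusters k of max_l |cluster k intersected with class l|."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)           # nonnegative integer class labels
    total = 0
    for k in np.unique(clusters):
        members = labels[clusters == k]
        total += np.bincount(members).max()   # size of the dominant class in cluster k
    return total / len(labels)
```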

  We have used fourteen datasets to evaluate the two compared methods. These datasets can be retrieved from the Pajek database¹, Newman's collection², or the UCI machine learning repository³.

¹ http://vlado.fmf.uni-lj.si/pub/networks/data/
² http://www-personal.umich.edu/~mejn/netdata/
³ http://archive.ics.uci.edu/ml/index.html

  Table 1: Clustering performances measured by purity for the two methods.

(a) t = 0.2 in the case 0 < t < 1

Dataset      type          # instances  # classes  PLSI   t-PLSI
strike       graph         24           3          0.96   1.00
sawmill      graph         36           3          0.50   0.53
cities       graph         46           4          0.57   0.67
dolphins     graph         62           2          0.98   0.98
parkinsons   multivariate  195          2          0.52   0.54
adjnoun      graph         112          2          0.75   0.75
balance      multivariate  625          3          0.68   0.63

(b) t = 1.5 in the case 1 < t < 2

Dataset      type          # instances  # classes  PLSI   t-PLSI
highschool   graph         63           6          0.87   0.89
football     graph         115          12         0.39   0.42
scotland     graph         108          8          0.93   0.93
glass        multivariate  214          6          0.44   0.49
journals     graph         124          6          0.88   0.89
ecoli        multivariate  336          8          0.80   0.80
vowel        multivariate  990          11         0.34   0.35

  Nine datasets are sparse graphs and the remaining five are multivariate data. We preprocessed the latter type into sparse similarity matrices by symmetrizing their K-Nearest-Neighbor graphs (K = 15).
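The paper only states that the K-nearest-neighbor graphs (K = 15) were symmetrized; the sketch below is one plausible reading of that preprocessing, with a binary (connectivity) K-NN graph and max-symmetrization as our assumptions.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_affinity(data, k=15):
    """Binary K-NN graph, symmetrized and normalized to sum to one."""
    G = kneighbors_graph(data, n_neighbors=k, mode='connectivity').toarray()
    A = np.maximum(G, G.T)          # symmetrize: i~j if either is a neighbor of the other
    return A / A.sum()              # normalized affinity matrix X
```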

  We set the number of clusters as the true number of classes for the two clustering algorithms on all datasets. The parameters W and S were initialized by following the Fresh Start procedure proposed by Ding et al. [6]. Table 1 shows the clustering performances of the compared algorithms.

  From the results in Table 1, we can see that the original PLSI based on the KL-divergence, i.e. t = 1, is not always the best. For certain datasets, the clustering purity can be improved by using the t-divergence with other t values. For example, when t = 0.2 in the case 0 < t < 1 (Table 1a), t-PLSI achieves a perfect clustering result on the “strike” dataset with purity 1.00. As for the “cities” dataset, the clustering purity is improved by 10 percentage points over that of the original PLSI. A similar analysis can be made for the case 1 < t < 2 (Table 1b). Compared with the KL-divergence, the t-divergence has an extra degree of flexibility given by the free parameter t. Especially, as t → 2, the tail of the $\log_t$ function becomes heavier than that of the usual log function, which makes the new method more robust against outliers.

5 Conclusions

  We have studied the generalization of Probabilistic Latent Semantic Indexing with the t-divergence family. The generalized PLSI was formulated as a nonnegative low-rank approximation problem. The formulation is more flexible and can accommodate more types of noise in data. We have proposed a Majorization-Minimization algorithm for optimizing the constrained objective. Empirical comparison shows that clustering performance in purity can be improved by using the generalized method with suitable t-divergences other than the KL-divergence.

  The proposed generalization is not restricted to PLSI. The t-divergence could be used in other nonnegative approximation problems, for example with other matrix factorizations or decompositions, or with other constraints, where the flexibility might also help to improve the performance.

  An interesting question raised with the t-exponential family is whether we can find its conjugate prior. The multinomial distribution that underlies PLSI has the Dirichlet as its conjugate prior. This conjugacy largely accounts for the success of generative topic modeling in recent years. Inference based on the t-exponential family might also benefit from a conjugate prior if one exists. That is, if we could find a conjugate prior for t-PLSI, we might also apply a nonparametric Bayesian treatment similar to Latent Dirichlet Allocation and thus avoid overfitting.

  The t-divergence is related to two other divergence families, the α-divergence and the Rényi divergence (see e.g. [3, 4, 13]). One of the major differences is normalization of the input: the α-divergence involves no normalization and could be problematic when combined with prior information; the Rényi divergence normalizes the input before the power operation, while the t-divergence employs the t-power before normalization. The latter has the ability to smooth nonzero entries (for 0 < t < 1) or exclude outliers (for 1 < t < 2). A more thorough comparison among these divergence families should be carried out in the future.

  Another important and still open problem is how to select among the various t-divergences. This strongly depends on the nature of the data to be analyzed. Usually a larger t leads to a more exclusive approximation, while a smaller t leads to a more inclusive one. Automatic selection within a parameterized divergence family generally requires extra information or criteria, for example ground-truth data [2] or cross-validation with a fixed reference parameter [12].

  Appendix: Proof of Theorem 1

  The development follows the Majorization-Minimization steps (see [11]) as developed for multiplicative updates by Yang and Oja [13]. In the derivation, we use $W$ and $S$ for the current estimates and $\widetilde{W}$ and $\widetilde{S}$ for the new variables.

  Proof. Introducing Lagrangian multipliers $\{\lambda_k\}_{k=1}^{r}$,
$$\widetilde{J}(\widetilde{W},\widetilde{S}) \equiv J(\widetilde{W},\widetilde{S}) + \sum_{k}\lambda_{k}\left(\sum_{i}\widetilde{W}_{ik}-1\right). \qquad (11)$$

  (Majorization) By Jensen's inequality,
$$\widetilde{J}(\widetilde{W},\widetilde{S}) \le \frac{1}{t-1}\sum_{ij}\sum_{k}A_{ij}\,\phi_{ijk}^{t}\left(\widetilde{W}_{ik}\widetilde{S}_{kk}\widetilde{W}_{jk}\right)^{1-t} + \sum_{k}\lambda_{k}\left(\sum_{i}\widetilde{W}_{ik}-1\right) + \text{const},$$
where $A_{ij} = \frac{X_{ij}^{t}}{\sum_{ab}X_{ab}^{t}}$, $\phi_{ijk} = \frac{W_{ik}S_{kk}W_{jk}}{\sum_{l}W_{il}S_{ll}W_{jl}}$, and the constant collects terms that do not depend on $\widetilde{W}$ or $\widetilde{S}$. Denote the right-hand side by $G(\widetilde{W},W)$.

  Case 1: when $0 < t < 1$, we have $G(\widetilde{W},W) \le G_{1}(\widetilde{W},W)$, where
$$G_{1}(\widetilde{W},W) = \frac{1}{t-1}\sum_{ij}\sum_{k}A_{ij}\phi_{ijk}^{t}\,S_{kk}^{1-t}W_{jk}^{1-t}\,\frac{\widetilde{W}_{ik}^{2(1-t)}}{W_{ik}^{1-t}} + \sum_{k}\lambda_{k}\left(\sum_{i}\widetilde{W}_{ik}-1\right).$$

  Case 2: when $1 < t < 2$, we have $G(\widetilde{W},W) \le G_{2}(\widetilde{W},W)$, where
$$G_{2}(\widetilde{W},W) = \frac{1}{t-1}\sum_{ij}\sum_{k}A_{ij}\phi_{ijk}^{t}\,S_{kk}^{1-t}W_{ik}^{1-t}W_{jk}^{1-t}\left(1+\log\frac{\widetilde{W}_{ik}^{1-t}\widetilde{W}_{jk}^{1-t}}{W_{ik}^{1-t}W_{jk}^{1-t}}\right) + \sum_{k}\lambda_{k}\left(\sum_{i}\widetilde{W}_{ik}-1\right).$$

  (Minimization) For Case 1, when $0 < t < 1$:
$$\frac{\partial G_{1}}{\partial\widetilde{W}_{pq}} = -2\sum_{j}A_{pj}\phi_{pjq}^{t}S_{qq}^{1-t}W_{jq}^{1-t}\,\frac{\widetilde{W}_{pq}^{2(1-t)-1}}{W_{pq}^{1-t}} + \lambda_{q} = 0.$$
This gives $\widetilde{W}_{pq} = W_{pq}^{\frac{t-1}{2t-1}}\,\lambda_{q}^{-\frac{1}{2t-1}}\left(2S_{qq}^{1-t}\sum_{j}A_{pj}\phi_{pjq}^{t}W_{jq}^{1-t}\right)^{\frac{1}{2t-1}}$. By using $\sum_{p}\widetilde{W}_{pq} = 1$, we obtain
$$\lambda_{q}^{\frac{1}{2t-1}} = \sum_{a}W_{aq}^{\frac{t-1}{2t-1}}\left(2S_{qq}^{1-t}\sum_{j}A_{aj}\phi_{ajq}^{t}W_{jq}^{1-t}\right)^{\frac{1}{2t-1}}. \qquad (12)$$

  For Case 2, when $1 < t < 2$:
$$\frac{\partial G_{2}}{\partial\widetilde{W}_{pq}} = -\sum_{j}A_{pj}\phi_{pjq}^{t}S_{qq}^{1-t}W_{pq}^{1-t}W_{jq}^{1-t}\,\frac{1}{\widetilde{W}_{pq}} + \lambda_{q} = 0.$$
This gives $\widetilde{W}_{pq} = \lambda_{q}^{-1}\,S_{qq}^{1-t}\sum_{j}A_{pj}\phi_{pjq}^{t}W_{pq}^{1-t}W_{jq}^{1-t}$. By using $\sum_{p}\widetilde{W}_{pq} = 1$, we obtain
$$\lambda_{q} = S_{qq}^{1-t}\sum_{a}\sum_{j}A_{aj}\phi_{ajq}^{t}W_{aq}^{1-t}W_{jq}^{1-t}. \qquad (13)$$

  Similarly we can prove the monotonicity over $S$. Let $\widetilde{J}(W,\widetilde{S}) \equiv J(W,\widetilde{S}) + \lambda\left(\sum_{k}\widetilde{S}_{kk}-1\right)$, which is tightly upper bounded by
$$G(\widetilde{S},S) \equiv \frac{1}{t-1}\sum_{ij}\sum_{k}A_{ij}\phi_{ijk}^{t}W_{ik}^{1-t}W_{jk}^{1-t}\widetilde{S}_{kk}^{1-t} + \lambda\left(\sum_{k}\widetilde{S}_{kk}-1\right).$$
Zeroing the derivative of $G(\widetilde{S},S)$ gives $\widetilde{S}_{qq} = \lambda^{-\frac{1}{t}}\left(\sum_{ij}A_{ij}\phi_{ijq}^{t}W_{iq}^{1-t}W_{jq}^{1-t}\right)^{\frac{1}{t}}$. Using $\sum_{q}\widetilde{S}_{qq} = 1$, we obtain $\lambda^{\frac{1}{t}} = \sum_{a}\left(\sum_{ij}A_{ij}\phi_{ija}^{t}W_{ia}^{1-t}W_{ja}^{1-t}\right)^{\frac{1}{t}}$. Inserting $\lambda$ back into $\widetilde{S}_{qq}$ gives the update rule (10).

References

1. Arora, R., Gupta, M., Kapila, A., Fazel, M.: Clustering by left-stochastic matrix factorization. In: International Conference on Machine Learning (ICML). pp. 761–768 (2011)
2. Choi, H., Choi, S., Katake, A., Choe, Y.: Learning alpha-integration with partially-labeled data. In: Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing. pp. 14–19 (2010)
3. Cichocki, A., Lee, H., Kim, Y.D., Choi, S.: Non-negative matrix factorization with α-divergence. Pattern Recognition Letters 29, 1433–1440 (2008)
4. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley (2009)
5. Ding, C., Li, T., Peng, W.: On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis 52(8), 3913–3927 (2008)
6. Ding, C., Li, T., Jordan, M.: Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), 45–55 (2010)
7. Ding, N., Vishwanathan, S.: t-logistic regression. In: Advances in Neural Information Processing Systems 23, pp. 514–522 (2010)
8. Ding, N., Vishwanathan, S., Qi, Y.A.: t-divergence based approximate inference. In: Advances in Neural Information Processing Systems 24, pp. 1494–1502 (2011)
9. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation 21(3), 793–830 (2009)
10. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International Conference on Research and Development in Information Retrieval (SIGIR). pp. 50–57. ACM (1999)
11. Hunter, D.R., Lange, K.: A tutorial on MM algorithms. The American Statistician 58(1), 30–37 (2004)
12. Mollah, M., Sultana, N., Minami, M.: Robust extraction of local structures by the minimum of beta-divergence method. Neural Networks 23, 226–238 (2010)
13. Yang, Z., Oja, E.: Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Transactions on Neural Networks 22(12), 1878–1891 (2011)