Clustering with Gaussian Mixtures

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University

  www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599

Copyright © 2001, 2004, Andrew W. Moore

Unsupervised Learning

• You walk into a bar. A stranger approaches and tells you: "I've got data from k classes. Each class produces observations with a normal distribution and variance σ² I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(w_i)'s."

• So far, looks straightforward. "I need a maximum likelihood estimate of the µ_i's."

• No problem. "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)"

• Uh oh!!

Gaussian Bayes Classifier Reminder

P(y = i \mid x_k) = \frac{p(x_k \mid y = i)\, P(y = i)}{p(x_k)}

              = \frac{\dfrac{1}{(2\pi)^{m/2} \|\Sigma_i\|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x_k - \mu_i)^T \Sigma_i^{-1} (x_k - \mu_i)\right) P(y = i)}{p(x_k)}

  How do we deal with that?
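For concreteness, here is a minimal numerical sketch of the posterior formula above using SciPy; the priors, means, and covariances below are invented for illustration, not taken from the slides. The denominator p(x_k) is simply the numerator summed over classes.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-class, 2-d example; priors, means, and covariances are invented.
priors = np.array([0.4, 0.6])                                  # P(y = i)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]           # mu_i
covs = [np.eye(2), 2.0 * np.eye(2)]                            # Sigma_i

x_k = np.array([1.0, 1.5])

# Numerator of Bayes' rule for each class: p(x_k | y = i) P(y = i)
numer = np.array([multivariate_normal.pdf(x_k, mean=m, cov=c) * p
                  for m, c, p in zip(means, covs, priors)])

# The denominator p(x_k) is just the numerator summed over classes.
posterior = numer / numer.sum()
print(posterior)                                               # P(y = i | x_k)
```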

Predicting wealth from age
[figure]

Predicting wealth from age
[figure]

Learning modelyear, mpg ---> maker
[figure]

General: O(m²) parameters

\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}

Aligned: O(m) parameters

\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 & 0 \\
0 & \sigma_2^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\
0 & 0 & \cdots & 0 & \sigma_m^2
\end{pmatrix}

Spherical: O(1) cov parameters

\Sigma = \sigma^2 I = \begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix}
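As a quick illustration of the three parameterizations (my own sketch, not from the slides), here is how each covariance type looks in NumPy, with its free-parameter count for a hypothetical m = 5:

```python
import numpy as np

m = 5
rng = np.random.default_rng(0)

# General: any symmetric positive-definite matrix -> m(m+1)/2 free parameters, O(m^2).
A = rng.standard_normal((m, m))
sigma_general = A @ A.T

# Aligned (axis-aligned, i.e. diagonal): m free parameters, O(m).
sigma_aligned = np.diag(rng.uniform(0.5, 2.0, size=m))

# Spherical: a single shared variance sigma^2, O(1).
sigma_spherical = 1.3 * np.eye(m)

print("general  :", sigma_general.shape, "->", m * (m + 1) // 2, "free parameters")
print("aligned  :", sigma_aligned.shape, "->", m, "free parameters")
print("spherical:", sigma_spherical.shape, "->", 1, "free parameter")
```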

Making a Classifier from a Density Estimator

|                                            | Categorical inputs only | Real-valued inputs only | Mixed Real / Cat okay |
| Inputs -> Classifier (Predict category)    | Joint BC, Naïve BC      | Gauss BC                | Dec Tree              |
| Inputs -> Density Estimator (Probability)  | Joint DE, Naïve DE      | Gauss DE                |                       |
| Inputs -> Regressor (Predict real no.)     |                         |                         |                       |

Next… back to Density Estimation

What if we want to do density estimation with multimodal or clumpy data?

The GMM assumption

• There are k components. The i'th component is called ω_i.
• Component ω_i has an associated mean vector µ_i.
• Each component generates data x from a Gaussian with mean µ_i and covariance matrix σ² I.

[figure: three components with means µ_1, µ_2, µ_3]

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(µ_i, σ² I)

The General GMM assumption

• There are k components. The i'th component is called ω_i.
• Component ω_i has an associated mean vector µ_i.
• Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i.

[figure: three components with means µ_1, µ_2, µ_3]

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(µ_i, Σ_i)
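The two-step recipe above maps directly onto code. A minimal sketch, with a hypothetical 3-component mixture whose weights, means, and covariances are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 3-component GMM in 2-d; weights, means, and covariances are invented.
weights = np.array([0.5, 0.3, 0.2])                                   # P(omega_i)
means = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])               # mu_i
covs = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])])    # Sigma_i

def sample_gmm(n):
    """Draw n points by following the two-step recipe."""
    # 1. Pick a component at random, with probability P(omega_i).
    comps = rng.choice(len(weights), size=n, p=weights)
    # 2. Datapoint ~ N(mu_i, Sigma_i) for the chosen component.
    return np.array([rng.multivariate_normal(means[c], covs[c]) for c in comps])

X = sample_gmm(500)
print(X.shape)    # (500, 2)
```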

Unsupervised Learning: not as hard as it looks

• Sometimes easy
• Sometimes impossible
• and sometimes in between

(In case you're wondering what these diagrams are, they show 2-d vectors distributed in 2-d space. The top one has three very clear Gaussian centers.)
[figures]

Computing likelihoods in unsupervised case

We have x_1, x_2, … x_n.
We know P(w_1), P(w_2), … P(w_k).
We know σ.

P(x | w_i, µ_1, … µ_k) = Prob that an observation from class w_i would have value x given class means µ_1 … µ_k.

Can we write an expression for that?

Likelihoods in unsupervised case

We have x_1, x_2, … x_n.
We have P(w_1) … P(w_k). We have σ.
We can define, for any x, P(x | w_i, µ_1, µ_2, … µ_k).

Can we define P(x | µ_1, µ_2, … µ_k)?

Can we define P(x_1, x_2, … x_n | µ_1, µ_2, … µ_k)?
[YES, IF WE ASSUME THE x_i'S WERE DRAWN INDEPENDENTLY]

Unsupervised Learning: Mediumly Good News

We now have a procedure such that if you give me a guess at µ_1, µ_2, … µ_k, I can tell you the prob of the unlabeled data given those µ's.

Suppose the x's are 1-dimensional. There are two classes, w_1 and w_2, with P(w_1) = 1/3, P(w_2) = 2/3, and σ = 1.

There are 25 unlabeled datapoints:
x_1 = 0.608
x_2 = -1.590
x_3 = 0.235
x_4 = 3.949
  :
x_25 = -0.712
(From Duda and Hart)

Duda & Hart's Example

Graph of log P(x_1, x_2, … x_25 | µ_1, µ_2) against µ_1 (→) and µ_2 (↑):
[figure]

Max likelihood = (µ_1 = -2.13, µ_2 = 1.668).
Local minimum, but very close to global, at (µ_1 = 2.085, µ_2 = -1.257)*
* corresponds to switching w_1 and w_2.

Duda & Hart's Example

• We can graph the prob. dist. function of data given our µ_1 and µ_2 estimates.
• We can also graph the true function from which the data was randomly generated.
[figure]
• They are close. Good.
• The 2nd solution tries to put the "2/3" hump where the "1/3" hump should go, and vice versa.
• In this example unsupervised is almost as good as supervised. If the x_1 … x_25 are given the class which was used to learn them, then the results are (µ_1 = -2.176, µ_2 = 1.684). Unsupervised got (µ_1 = -2.13, µ_2 = 1.668).
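To make the likelihood surface above concrete: the slides list only a few of the 25 points, so the sketch below simulates its own 25 points from the stated model (P(w_1) = 1/3, P(w_2) = 2/3, σ = 1, with hypothetical true means) and grid-searches log P(x_1 … x_25 | µ_1, µ_2). The optima it finds will therefore differ somewhat from the numbers quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, p1, p2 = 1.0, 1/3, 2/3

# The slides list only a few of the 25 points, so simulate a comparable dataset
# from hypothetical true means (chosen here as -2 and +2 purely for illustration).
true_mu = np.array([-2.0, 2.0])
labels = rng.choice(2, size=25, p=[p1, p2])
x = rng.normal(true_mu[labels], sigma)

def log_lik(mu1, mu2):
    # log P(x_1 .. x_25 | mu1, mu2) for the two-class, sigma = 1 mixture
    comp1 = p1 * np.exp(-0.5 * (x - mu1) ** 2) / np.sqrt(2 * np.pi)
    comp2 = p2 * np.exp(-0.5 * (x - mu2) ** 2) / np.sqrt(2 * np.pi)
    return np.log(comp1 + comp2).sum()

grid = np.linspace(-4, 4, 161)
surface = np.array([[log_lik(m1, m2) for m2 in grid] for m1 in grid])
i, j = np.unravel_index(surface.argmax(), surface.shape)
print("grid-search max-likelihood estimate:", grid[i], grid[j])
```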

Finding the max likelihood µ_1, µ_2, … µ_k

We can compute P(data | µ_1, µ_2, … µ_k).
How do we find the µ_i's which give max likelihood?

• The normal max likelihood trick: set ∂/∂µ_i log Prob(…) = 0 and solve for the µ_i's. Here you get non-linear, non-analytically-solvable equations.
• Use gradient descent: slow but doable.
• Use a much faster, cuter, and recently very popular method… Expectation Maximization.

The E.M. Algorithm (DETOUR)

• We'll get back to unsupervised learning soon.
• But now we'll look at an even simpler case with hidden information.
• The EM algorithm
  - Can do trivial things, such as the contents of the next few slides.
  - Is an excellent way of doing our unsupervised learning problem, as we'll see.
  - Has many, many other uses, including inference of Hidden Markov Models (future lecture).

Silly Example

Let events be "grades in a class":
w_1 = Gets an A   P(A) = ½
w_2 = Gets a B    P(B) = µ
w_3 = Gets a C    P(C) = 2µ
w_4 = Gets a D    P(D) = ½ - 3µ
(Note 0 ≤ µ ≤ 1/6)

Assume we want to estimate µ from data. In a given class there were
a A's, b B's, c C's, d D's.

What's the maximum likelihood estimate of µ given a, b, c, d?

Trivial Statistics

P(A) = ½   P(B) = µ   P(C) = 2µ   P(D) = ½ - 3µ

P(a, b, c, d | µ) = K (½)^a (µ)^b (2µ)^c (½ - 3µ)^d
log P(a, b, c, d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½ - 3µ)

For max likelihood µ, set ∂ log P / ∂µ = 0:

\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu} = 0

Gives max like \mu = \frac{b + c}{6(b + c + d)}

So if the class got 14 A's, 6 B's, 9 C's, 10 D's:

Max like µ = 1/10

Boring, but true!
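A quick numerical sanity check of the closed-form answer, using the grade counts from the slide:

```python
import numpy as np

a, b, c, d = 14, 6, 9, 10          # grade counts from the slide

def log_lik(mu):
    # log P(a, b, c, d | mu), dropping the constant log K
    return a * np.log(0.5) + b * np.log(mu) + c * np.log(2 * mu) + d * np.log(0.5 - 3 * mu)

closed_form = (b + c) / (6 * (b + c + d))                  # = 1/10
grid = np.linspace(1e-4, 1 / 6 - 1e-4, 100_000)
numeric = grid[np.argmax(log_lik(grid))]
print(closed_form, numeric)                                # both ~ 0.1
```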

Same Problem with Hidden Information

REMEMBER: P(A) = ½   P(B) = µ   P(C) = 2µ   P(D) = ½ - 3µ

Someone tells us that
Number of High grades (A's + B's) = h
Number of C's = c
Number of D's = d

What is the max. like estimate of µ now?

We can answer this question circularly:

EXPECTATION: If we knew the value of µ we could compute the expected values of a and b. Since the ratio a : b should be the same as the ratio ½ : µ,

a = \frac{1/2}{1/2 + \mu}\, h \qquad b = \frac{\mu}{1/2 + \mu}\, h

MAXIMIZATION: If we knew the expected values of a and b we could compute the maximum likelihood value of µ:

\mu = \frac{b + c}{6(b + c + d)}

E.M. for our Trivial Problem

REMEMBER: P(A) = ½   P(B) = µ   P(C) = 2µ   P(D) = ½ - 3µ

We begin with a guess for µ, then iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b.

Define µ(t) = the estimate of µ on the t'th iteration, b(t) = the estimate of b on the t'th iteration.

µ(0) = initial guess

E-step:   b(t) = \frac{\mu(t)\, h}{1/2 + \mu(t)} = E[b \mid \mu(t)]

M-step:   \mu(t+1) = \frac{b(t) + c}{6\,(b(t) + c + d)} = max like estimate of µ given b(t)

Continue iterating until converged.
Good news: convergence to a local optimum is assured.
Bad news: I said "local" optimum.

E.M. Convergence

• Convergence proof based on the fact that Prob(data | µ) must increase or remain the same between each iteration [NOT OBVIOUS]
• But it can never exceed 1 [OBVIOUS]
• So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, µ(0) = 0:

| t | µ(t)   | b(t)  |
| 1 | 0.0833 | 2.857 |
| 2 | 0.0937 | 3.158 |
| 3 | 0.0947 | 3.185 |
| 4 | 0.0948 | 3.187 |
| 5 | 0.0948 | 3.187 |
| 6 | 0.0948 | 3.187 |

Convergence is generally linear: error decreases by a constant factor each time step.
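The E-step/M-step above is only a few lines of code. A minimal sketch that, run with h = 20, c = 10, d = 10 and µ(0) = 0, reproduces the table (up to rounding in the second iterate):

```python
def em_grades(h, c, d, mu=0.0, iters=6):
    """EM for the grades problem. Prints (t, mu(t), b(t)) as in the table above."""
    b = mu * h / (0.5 + mu)                  # b(0) = E[b | mu(0)]
    for t in range(1, iters + 1):
        mu = (b + c) / (6 * (b + c + d))     # M-step: mu(t) from b(t-1)
        b = mu * h / (0.5 + mu)              # E-step: b(t) = E[b | mu(t)]
        print(t, f"{mu:.4f}", f"{b:.3f}")    # the slide's table truncates, so t=2 shows 0.0938 here

em_grades(h=20, c=10, d=10)
```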

Back to Unsupervised Learning of GMMs

Remember: We have unlabeled data x_1, x_2, … x_R.
We know there are k classes.
We know P(w_1), P(w_2), P(w_3), … P(w_k).
We don't know µ_1, µ_2, … µ_k.

We can write P(data | µ_1 … µ_k)

= p(x_1 \ldots x_R \mid \mu_1 \ldots \mu_k)
= \prod_{i=1}^{R} p(x_i \mid \mu_1 \ldots \mu_k)
= \prod_{i=1}^{R} \sum_{j=1}^{k} p(x_i \mid w_j, \mu_1 \ldots \mu_k)\, P(w_j)
= \prod_{i=1}^{R} \sum_{j=1}^{k} K \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\right) P(w_j)

E.M. for GMMs

For max likelihood we know \frac{\partial}{\partial \mu_i} \log \Pr(\text{data} \mid \mu_1 \ldots \mu_k) = 0.

Some wild 'n' crazy algebra turns this into: "For max likelihood, for each j,

\mu_j = \frac{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)\, x_i}{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)}

This is a set of k nonlinear equations in the µ_j's."
(See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf)

If, for each x_i, we knew the probability P(w_j | x_i, µ_1 … µ_k) that x_i was in class w_j, then… we could easily compute each µ_j.

If we knew each µ_j, then we could easily compute P(w_j | x_i, µ_1 … µ_k) for each w_j and x_i.

…I feel an EM experience coming on!!

E.M. for GMMs

Iterate. On the t'th iteration let our estimates be λ_t = { µ_1(t), µ_2(t), … µ_c(t) }.

E-step: Compute "expected" classes of all datapoints for each class:

P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)}
= \frac{p(x_k \mid w_i, \mu_i(t), \sigma^2 I)\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid w_j, \mu_j(t), \sigma^2 I)\, p_j(t)}

(Just evaluate a Gaussian at x_k.)

M-step: Compute max. like µ given our data's class membership distributions:

\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}

E.M. Convergence

• Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works.
• As with all EM procedures, convergence to a local optimum is guaranteed.
• This algorithm is REALLY USED. And in high dimensional state spaces, too. E.g. Vector Quantization for Speech Data.
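Under the assumptions of the update above (known class priors, shared spherical covariance σ²I, only the means unknown), one EM iteration is exactly the two displayed formulas. A minimal NumPy sketch of a single iteration:

```python
import numpy as np

def em_iteration_means(X, mus, priors, sigma):
    """One EM iteration for a GMM with known priors and shared covariance sigma^2 I.

    X: (R, m) data;  mus: (c, m) current means;  priors: (c,) class priors P(w_i).
    Returns the updated means mu_i(t+1)."""
    # E-step: P(w_i | x_k, lambda_t) proportional to exp(-||x_k - mu_i||^2 / (2 sigma^2)) P(w_i).
    # The Gaussian normalising constant is shared by all components, so it cancels.
    sq_dist = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (R, c)
    resp = priors * np.exp(-0.5 * sq_dist / sigma**2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: mu_i(t+1) = sum_k P(w_i | x_k) x_k / sum_k P(w_i | x_k).
    return (resp.T @ X) / resp.sum(axis=0)[:, None]
```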

E.M. for General GMMs

Iterate. On the t'th iteration let our estimates be
λ_t = { µ_1(t), µ_2(t), … µ_c(t),  Σ_1(t), Σ_2(t), … Σ_c(t),  p_1(t), p_2(t), … p_c(t) }
where p_i(t) is shorthand for the estimate of P(ω_i) on the t'th iteration.

E-step: Compute "expected" classes of all datapoints for each class:

P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \mu_i(t), \Sigma_i(t))\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid w_j, \mu_j(t), \Sigma_j(t))\, p_j(t)}

(Just evaluate a Gaussian at x_k.)

M-step: Compute max. like µ given our data's class membership distributions:

\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}

\Sigma_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, [x_k - \mu_i(t+1)]\,[x_k - \mu_i(t+1)]^T}{\sum_k P(w_i \mid x_k, \lambda_t)}

p_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)}{R}     where R = number of datapoints
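A compact sketch of the full update (means, covariances, and class priors), written directly from the three M-step formulas above. It is an illustration of the update equations, not the code that produced the figures that follow.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_general_gmm(X, mus, covs, priors, iters=20):
    """EM for a general GMM.  X: (R, m); mus: (c, m); covs: (c, m, m); priors: (c,)."""
    X = np.asarray(X, dtype=float)
    mus = np.asarray(mus, dtype=float)
    covs = np.array(covs, dtype=float)          # copied so the caller's arrays are untouched
    priors = np.asarray(priors, dtype=float)
    R, _ = X.shape
    c = len(priors)
    for _ in range(iters):
        # E-step: responsibilities P(w_i | x_k, lambda_t).
        resp = np.column_stack([
            priors[i] * multivariate_normal.pdf(X, mean=mus[i], cov=covs[i])
            for i in range(c)])
        resp /= resp.sum(axis=1, keepdims=True)
        Ni = resp.sum(axis=0)                   # effective count per component
        # M-step: the three update formulas above.
        mus = (resp.T @ X) / Ni[:, None]
        for i in range(c):
            diff = X - mus[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / Ni[i]
        priors = Ni / R
    return mus, covs, priors
```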

Gaussian Mixture Example: Start
(Advance apologies: in black and white this example will be incomprehensible.)
[figure]

After first iteration
[figure]

After 2nd iteration
[figure]

After 3rd iteration
[figure]

After 4th iteration
[figure]

After 5th iteration
[figure]

After 6th iteration
[figure]

After 20th iteration
[figure]

Some Bio Assay data
[figure]

GMM clustering of the assay data
[figure]

Resulting Density Estimator
[figure]

Where are we now?

Inputs -> Inference Engine (Learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
Inputs -> Classifier (Predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation
Inputs -> Density Estimator (Probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Inputs -> Regressor (Predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

The old trick…

Inputs -> Inference Engine (Learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
Inputs -> Classifier (Predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation, GMM-BC
Inputs -> Density Estimator (Probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Inputs -> Regressor (Predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

Three classes of assay (each learned with its own mixture model)
(Sorry, this will again be semi-useless in black and white.)
[figure]

Resulting Bayes Classifier
[figure]

Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness
[figure]  Yellow means anomalous. Cyan means ambiguous.

Unsupervised learning with symbolic attributes

[figure: example with attributes #KIDS, NATION, MARRIED, some values missing]

It's just a "learning a Bayes net with known structure but hidden values" problem. Can use Gradient Descent.
EASY, fun exercise to do an EM formulation for this case too.

Final Comments

• Remember, E.M. can get stuck in local minima, and empirically it DOES.
• Our unsupervised learning example assumed the P(w_i)'s known, and variances fixed and known. Easy to relax this.
• It's possible to do Bayesian unsupervised learning instead of max. likelihood.
• There are other algorithms for unsupervised learning. We'll visit K-means soon. Hierarchical clustering is also interesting.
• Neural-net algorithms called "competitive learning" turn out to have interesting parallels with the EM method we saw.

    What you should know

    • How to “learn” maximum likelihood parameters (locally max. like.) in the case of unlabeled data.
    • Be happy with this kind of probabilistic analysis.
    • Understand the two examples of E.M. given in these notes. For more info, see Duda + Hart. It’s a great book. There’s much more in the book than in your handout.


Other unsupervised learning methods

• K-means (see next lecture)
• Hierarchical clustering (e.g. Minimum spanning trees) (see next lecture)
• Principal Component Analysis (simple, useful tool)
• Non-linear PCA: Neural Auto-Associators, Locally weighted PCA, Others…