Clustering with Gaussian Mixtures

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University

  www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599

Copyright © 2001, 2004, Andrew W. Moore

Unsupervised Learning

• You walk into a bar. A stranger approaches and tells you: "I've got data from k classes. Each class produces observations with a normal distribution and variance σ² I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(w_i)'s."

• So far, looks straightforward. "I need a maximum likelihood estimate of the µ_i's."

• No problem. "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)"

• Uh oh!!

Gaussian Bayes Classifier Reminder

P(y = i \mid x_k) = \frac{p(x_k \mid y = i)\, P(y = i)}{p(x_k)}

              = \frac{\dfrac{1}{(2\pi)^{m/2} \|\Sigma_i\|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x_k - \mu_i)^T \Sigma_i^{-1} (x_k - \mu_i)\right) P(y = i)}{p(x_k)}

  How do we deal with that?
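For concreteness, here is a minimal numerical sketch of the posterior formula above using SciPy; the priors, means, and covariances below are invented for illustration, not taken from the slides. The denominator p(x_k) is simply the numerator summed over classes.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-class, 2-d example; priors, means, and covariances are invented.
priors = np.array([0.4, 0.6])                                  # P(y = i)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]           # mu_i
covs = [np.eye(2), 2.0 * np.eye(2)]                            # Sigma_i

x_k = np.array([1.0, 1.5])

# Numerator of Bayes' rule for each class: p(x_k | y = i) P(y = i)
numer = np.array([multivariate_normal.pdf(x_k, mean=m, cov=c) * p
                  for m, c, p in zip(means, covs, priors)])

# The denominator p(x_k) is just the numerator summed over classes.
posterior = numer / numer.sum()
print(posterior)                                               # P(y = i | x_k)
```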

Predicting wealth from age
[figure]

Predicting wealth from age
[figure]

Learning modelyear, mpg ---> maker
[figure]

General: O(m²) parameters

\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}

Aligned: O(m) parameters

\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 & 0 \\
0 & \sigma_2^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\
0 & 0 & \cdots & 0 & \sigma_m^2
\end{pmatrix}

Spherical: O(1) cov parameters

\Sigma = \sigma^2 I = \begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix}
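As a quick illustration of the three parameterizations (my own sketch, not from the slides), here is how each covariance type looks in NumPy, with its free-parameter count for a hypothetical m = 5:

```python
import numpy as np

m = 5
rng = np.random.default_rng(0)

# General: any symmetric positive-definite matrix -> m(m+1)/2 free parameters, O(m^2).
A = rng.standard_normal((m, m))
sigma_general = A @ A.T

# Aligned (axis-aligned, i.e. diagonal): m free parameters, O(m).
sigma_aligned = np.diag(rng.uniform(0.5, 2.0, size=m))

# Spherical: a single shared variance sigma^2, O(1).
sigma_spherical = 1.3 * np.eye(m)

print("general  :", sigma_general.shape, "->", m * (m + 1) // 2, "free parameters")
print("aligned  :", sigma_aligned.shape, "->", m, "free parameters")
print("spherical:", sigma_spherical.shape, "->", 1, "free parameter")
```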

Making a Classifier from a Density Estimator

|                                            | Categorical inputs only | Real-valued inputs only | Mixed Real / Cat okay |
| Inputs -> Classifier (Predict category)    | Joint BC, Naïve BC      | Gauss BC                | Dec Tree              |
| Inputs -> Density Estimator (Probability)  | Joint DE, Naïve DE      | Gauss DE                |                       |
| Inputs -> Regressor (Predict real no.)     |                         |                         |                       |

Next… back to Density Estimation

What if we want to do density estimation with multimodal or clumpy data?

The GMM assumption

• There are k components. The i'th component is called ω_i.
• Component ω_i has an associated mean vector µ_i.
• Each component generates data x from a Gaussian with mean µ_i and covariance matrix σ² I.

[figure: three components with means µ_1, µ_2, µ_3]

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(µ_i, σ² I)

The General GMM assumption

• There are k components. The i'th component is called ω_i.
• Component ω_i has an associated mean vector µ_i.
• Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i.

[figure: three components with means µ_1, µ_2, µ_3]

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(µ_i, Σ_i)
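The two-step recipe above maps directly onto code. A minimal sketch, with a hypothetical 3-component mixture whose weights, means, and covariances are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 3-component GMM in 2-d; weights, means, and covariances are invented.
weights = np.array([0.5, 0.3, 0.2])                                   # P(omega_i)
means = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])               # mu_i
covs = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])])    # Sigma_i

def sample_gmm(n):
    """Draw n points by following the two-step recipe."""
    # 1. Pick a component at random, with probability P(omega_i).
    comps = rng.choice(len(weights), size=n, p=weights)
    # 2. Datapoint ~ N(mu_i, Sigma_i) for the chosen component.
    return np.array([rng.multivariate_normal(means[c], covs[c]) for c in comps])

X = sample_gmm(500)
print(X.shape)    # (500, 2)
```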

Unsupervised Learning: not as hard as it looks

• Sometimes easy
• Sometimes impossible
• and sometimes in between

(In case you're wondering what these diagrams are, they show 2-d vectors distributed in 2-d space. The top one has three very clear Gaussian centers.)
[figures]

Computing likelihoods in unsupervised case

We have x_1, x_2, … x_n.
We know P(w_1), P(w_2), … P(w_k).
We know σ.

P(x | w_i, µ_1, … µ_k) = Prob that an observation from class w_i would have value x given class means µ_1 … µ_k.

Can we write an expression for that?

Likelihoods in unsupervised case

We have x_1, x_2, … x_n.
We have P(w_1) … P(w_k). We have σ.
We can define, for any x, P(x | w_i, µ_1, µ_2, … µ_k).

Can we define P(x | µ_1, µ_2, … µ_k)?

Can we define P(x_1, x_2, … x_n | µ_1, µ_2, … µ_k)?
[YES, IF WE ASSUME THE x_i'S WERE DRAWN INDEPENDENTLY]

Unsupervised Learning: Mediumly Good News

We now have a procedure such that if you give me a guess at µ_1, µ_2, … µ_k, I can tell you the prob of the unlabeled data given those µ's.

Suppose the x's are 1-dimensional. There are two classes, w_1 and w_2, with P(w_1) = 1/3, P(w_2) = 2/3, and σ = 1.

There are 25 unlabeled datapoints:
x_1 = 0.608
x_2 = -1.590
x_3 = 0.235
x_4 = 3.949
  :
x_25 = -0.712
(From Duda and Hart)

Duda & Hart's Example

Graph of log P(x_1, x_2, … x_25 | µ_1, µ_2) against µ_1 (→) and µ_2 (↑):
[figure]

Max likelihood = (µ_1 = -2.13, µ_2 = 1.668).
Local minimum, but very close to global, at (µ_1 = 2.085, µ_2 = -1.257)*
* corresponds to switching w_1 and w_2.

Duda & Hart's Example

• We can graph the prob. dist. function of data given our µ_1 and µ_2 estimates.
• We can also graph the true function from which the data was randomly generated.
[figure]
• They are close. Good.
• The 2nd solution tries to put the "2/3" hump where the "1/3" hump should go, and vice versa.
• In this example unsupervised is almost as good as supervised. If the x_1 … x_25 are given the class which was used to learn them, then the results are (µ_1 = -2.176, µ_2 = 1.684). Unsupervised got (µ_1 = -2.13, µ_2 = 1.668).
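To make the likelihood surface above concrete: the slides list only a few of the 25 points, so the sketch below simulates its own 25 points from the stated model (P(w_1) = 1/3, P(w_2) = 2/3, σ = 1, with hypothetical true means) and grid-searches log P(x_1 … x_25 | µ_1, µ_2). The optima it finds will therefore differ somewhat from the numbers quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, p1, p2 = 1.0, 1/3, 2/3

# The slides list only a few of the 25 points, so simulate a comparable dataset
# from hypothetical true means (chosen here as -2 and +2 purely for illustration).
true_mu = np.array([-2.0, 2.0])
labels = rng.choice(2, size=25, p=[p1, p2])
x = rng.normal(true_mu[labels], sigma)

def log_lik(mu1, mu2):
    # log P(x_1 .. x_25 | mu1, mu2) for the two-class, sigma = 1 mixture
    comp1 = p1 * np.exp(-0.5 * (x - mu1) ** 2) / np.sqrt(2 * np.pi)
    comp2 = p2 * np.exp(-0.5 * (x - mu2) ** 2) / np.sqrt(2 * np.pi)
    return np.log(comp1 + comp2).sum()

grid = np.linspace(-4, 4, 161)
surface = np.array([[log_lik(m1, m2) for m2 in grid] for m1 in grid])
i, j = np.unravel_index(surface.argmax(), surface.shape)
print("grid-search max-likelihood estimate:", grid[i], grid[j])
```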

Finding the max likelihood µ_1, µ_2, … µ_k

We can compute P(data | µ_1, µ_2, … µ_k).
How do we find the µ_i's which give max likelihood?

• The normal max likelihood trick: set ∂/∂µ_i log Prob(…) = 0 and solve for the µ_i's. Here you get non-linear, non-analytically-solvable equations.
• Use gradient descent: slow but doable.
• Use a much faster, cuter, and recently very popular method… Expectation Maximization.

The E.M. Algorithm (DETOUR)

• We'll get back to unsupervised learning soon.
• But now we'll look at an even simpler case with hidden information.
• The EM algorithm
  - Can do trivial things, such as the contents of the next few slides.
  - Is an excellent way of doing our unsupervised learning problem, as we'll see.
  - Has many, many other uses, including inference of Hidden Markov Models (future lecture).

Silly Example

Let events be "grades in a class":
w_1 = Gets an A   P(A) = ½
w_2 = Gets a B    P(B) = µ
w_3 = Gets a C    P(C) = 2µ
w_4 = Gets a D    P(D) = ½ - 3µ
(Note 0 ≤ µ ≤ 1/6)

Assume we want to estimate µ from data. In a given class there were
a A's, b B's, c C's, d D's.

What's the maximum likelihood estimate of µ given a, b, c, d?

Trivial Statistics

P(A) = ½   P(B) = µ   P(C) = 2µ   P(D) = ½ - 3µ

P(a, b, c, d | µ) = K (½)^a (µ)^b (2µ)^c (½ - 3µ)^d
log P(a, b, c, d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½ - 3µ)

For max likelihood µ, set ∂ log P / ∂µ = 0:

\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu} = 0

Gives max like \mu = \frac{b + c}{6(b + c + d)}

So if the class got 14 A's, 6 B's, 9 C's, 10 D's:

Max like µ = 1/10

Boring, but true!
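A quick numerical sanity check of the closed-form answer, using the grade counts from the slide:

```python
import numpy as np

a, b, c, d = 14, 6, 9, 10          # grade counts from the slide

def log_lik(mu):
    # log P(a, b, c, d | mu), dropping the constant log K
    return a * np.log(0.5) + b * np.log(mu) + c * np.log(2 * mu) + d * np.log(0.5 - 3 * mu)

closed_form = (b + c) / (6 * (b + c + d))                  # = 1/10
grid = np.linspace(1e-4, 1 / 6 - 1e-4, 100_000)
numeric = grid[np.argmax(log_lik(grid))]
print(closed_form, numeric)                                # both ~ 0.1
```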

Same Problem with Hidden Information

REMEMBER: P(A) = ½   P(B) = µ   P(C) = 2µ   P(D) = ½ - 3µ

Someone tells us that
Number of High grades (A's + B's) = h
Number of C's = c
Number of D's = d

What is the max. like estimate of µ now?

We can answer this question circularly:

EXPECTATION: If we knew the value of µ we could compute the expected values of a and b. Since the ratio a : b should be the same as the ratio ½ : µ,

a = \frac{1/2}{1/2 + \mu}\, h \qquad b = \frac{\mu}{1/2 + \mu}\, h

MAXIMIZATION: If we knew the expected values of a and b we could compute the maximum likelihood value of µ:

\mu = \frac{b + c}{6(b + c + d)}

E.M. for our Trivial Problem

REMEMBER: P(A) = ½   P(B) = µ   P(C) = 2µ   P(D) = ½ - 3µ

We begin with a guess for µ, then iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b.

Define µ(t) = the estimate of µ on the t'th iteration, b(t) = the estimate of b on the t'th iteration.

µ(0) = initial guess

E-step:   b(t) = \frac{\mu(t)\, h}{1/2 + \mu(t)} = E[b \mid \mu(t)]

M-step:   \mu(t+1) = \frac{b(t) + c}{6\,(b(t) + c + d)} = max like estimate of µ given b(t)

Continue iterating until converged.
Good news: convergence to a local optimum is assured.
Bad news: I said "local" optimum.

E.M. Convergence

• Convergence proof based on the fact that Prob(data | µ) must increase or remain the same between each iteration [NOT OBVIOUS]
• But it can never exceed 1 [OBVIOUS]
• So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, µ(0) = 0:

| t | µ(t)   | b(t)  |
| 1 | 0.0833 | 2.857 |
| 2 | 0.0937 | 3.158 |
| 3 | 0.0947 | 3.185 |
| 4 | 0.0948 | 3.187 |
| 5 | 0.0948 | 3.187 |
| 6 | 0.0948 | 3.187 |

Convergence is generally linear: error decreases by a constant factor each time step.
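The E-step/M-step above is only a few lines of code. A minimal sketch that, run with h = 20, c = 10, d = 10 and µ(0) = 0, reproduces the table (up to rounding in the second iterate):

```python
def em_grades(h, c, d, mu=0.0, iters=6):
    """EM for the grades problem. Prints (t, mu(t), b(t)) as in the table above."""
    b = mu * h / (0.5 + mu)                  # b(0) = E[b | mu(0)]
    for t in range(1, iters + 1):
        mu = (b + c) / (6 * (b + c + d))     # M-step: mu(t) from b(t-1)
        b = mu * h / (0.5 + mu)              # E-step: b(t) = E[b | mu(t)]
        print(t, f"{mu:.4f}", f"{b:.3f}")    # the slide's table truncates, so t=2 shows 0.0938 here

em_grades(h=20, c=10, d=10)
```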

Back to Unsupervised Learning of GMMs

Remember: We have unlabeled data x_1, x_2, … x_R.
We know there are k classes.
We know P(w_1), P(w_2), P(w_3), … P(w_k).
We don't know µ_1, µ_2, … µ_k.

We can write P(data | µ_1 … µ_k)

= p(x_1 \ldots x_R \mid \mu_1 \ldots \mu_k)
= \prod_{i=1}^{R} p(x_i \mid \mu_1 \ldots \mu_k)
= \prod_{i=1}^{R} \sum_{j=1}^{k} p(x_i \mid w_j, \mu_1 \ldots \mu_k)\, P(w_j)
= \prod_{i=1}^{R} \sum_{j=1}^{k} K \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\right) P(w_j)

E.M. for GMMs

For max likelihood we know \frac{\partial}{\partial \mu_i} \log \Pr(\text{data} \mid \mu_1 \ldots \mu_k) = 0.

Some wild 'n' crazy algebra turns this into: "For max likelihood, for each j,

\mu_j = \frac{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)\, x_i}{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)}

This is a set of k nonlinear equations in the µ_j's."
(See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf)

If, for each x_i, we knew the probability P(w_j | x_i, µ_1 … µ_k) that x_i was in class w_j, then… we could easily compute each µ_j.

If we knew each µ_j, then we could easily compute P(w_j | x_i, µ_1 … µ_k) for each w_j and x_i.

…I feel an EM experience coming on!!

E.M. for GMMs

Iterate. On the t'th iteration let our estimates be λ_t = { µ_1(t), µ_2(t), … µ_c(t) }.

E-step: Compute "expected" classes of all datapoints for each class:

P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)}
= \frac{p(x_k \mid w_i, \mu_i(t), \sigma^2 I)\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid w_j, \mu_j(t), \sigma^2 I)\, p_j(t)}

(Just evaluate a Gaussian at x_k.)

M-step: Compute max. like µ given our data's class membership distributions:

\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}

E.M. Convergence

• Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works.
• As with all EM procedures, convergence to a local optimum is guaranteed.
• This algorithm is REALLY USED. And in high dimensional state spaces, too. E.g. Vector Quantization for Speech Data.
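Under the assumptions of the update above (known class priors, shared spherical covariance σ²I, only the means unknown), one EM iteration is exactly the two displayed formulas. A minimal NumPy sketch of a single iteration:

```python
import numpy as np

def em_iteration_means(X, mus, priors, sigma):
    """One EM iteration for a GMM with known priors and shared covariance sigma^2 I.

    X: (R, m) data;  mus: (c, m) current means;  priors: (c,) class priors P(w_i).
    Returns the updated means mu_i(t+1)."""
    # E-step: P(w_i | x_k, lambda_t) proportional to exp(-||x_k - mu_i||^2 / (2 sigma^2)) P(w_i).
    # The Gaussian normalising constant is shared by all components, so it cancels.
    sq_dist = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (R, c)
    resp = priors * np.exp(-0.5 * sq_dist / sigma**2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: mu_i(t+1) = sum_k P(w_i | x_k) x_k / sum_k P(w_i | x_k).
    return (resp.T @ X) / resp.sum(axis=0)[:, None]
```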

E.M. for General GMMs

Iterate. On the t'th iteration let our estimates be
λ_t = { µ_1(t), µ_2(t), … µ_c(t),  Σ_1(t), Σ_2(t), … Σ_c(t),  p_1(t), p_2(t), … p_c(t) }
where p_i(t) is shorthand for the estimate of P(ω_i) on the t'th iteration.

E-step: Compute "expected" classes of all datapoints for each class:

P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \mu_i(t), \Sigma_i(t))\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid w_j, \mu_j(t), \Sigma_j(t))\, p_j(t)}

(Just evaluate a Gaussian at x_k.)

M-step: Compute max. like µ given our data's class membership distributions:

\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}

\Sigma_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, [x_k - \mu_i(t+1)]\,[x_k - \mu_i(t+1)]^T}{\sum_k P(w_i \mid x_k, \lambda_t)}

p_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)}{R}     where R = number of datapoints
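A compact sketch of the full update (means, covariances, and class priors), written directly from the three M-step formulas above. It is an illustration of the update equations, not the code that produced the figures that follow.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_general_gmm(X, mus, covs, priors, iters=20):
    """EM for a general GMM.  X: (R, m); mus: (c, m); covs: (c, m, m); priors: (c,)."""
    X = np.asarray(X, dtype=float)
    mus = np.asarray(mus, dtype=float)
    covs = np.array(covs, dtype=float)          # copied so the caller's arrays are untouched
    priors = np.asarray(priors, dtype=float)
    R, _ = X.shape
    c = len(priors)
    for _ in range(iters):
        # E-step: responsibilities P(w_i | x_k, lambda_t).
        resp = np.column_stack([
            priors[i] * multivariate_normal.pdf(X, mean=mus[i], cov=covs[i])
            for i in range(c)])
        resp /= resp.sum(axis=1, keepdims=True)
        Ni = resp.sum(axis=0)                   # effective count per component
        # M-step: the three update formulas above.
        mus = (resp.T @ X) / Ni[:, None]
        for i in range(c):
            diff = X - mus[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / Ni[i]
        priors = Ni / R
    return mus, covs, priors
```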

Gaussian Mixture Example: Start
(Advance apologies: in black and white this example will be incomprehensible.)
[figure]

After first iteration
[figure]

After 2nd iteration
[figure]

After 3rd iteration
[figure]

After 4th iteration
[figure]

After 5th iteration
[figure]

After 6th iteration
[figure]

After 20th iteration
[figure]

Some Bio Assay data
[figure]

GMM clustering of the assay data
[figure]

Resulting Density Estimator
[figure]

Where are we now?

Inputs -> Inference Engine (Learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
Inputs -> Classifier (Predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation
Inputs -> Density Estimator (Probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Inputs -> Regressor (Predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

The old trick…

Inputs -> Inference Engine (Learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
Inputs -> Classifier (Predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation, GMM-BC
Inputs -> Density Estimator (Probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Inputs -> Regressor (Predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

Three classes of assay (each learned with its own mixture model)
(Sorry, this will again be semi-useless in black and white.)
[figure]

Resulting Bayes Classifier
[figure]

Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness
[figure]  Yellow means anomalous. Cyan means ambiguous.

Unsupervised learning with symbolic attributes

[figure: example with attributes #KIDS, NATION, MARRIED, some values missing]

It's just a "learning a Bayes net with known structure but hidden values" problem. Can use Gradient Descent.
EASY, fun exercise to do an EM formulation for this case too.

Final Comments

• Remember, E.M. can get stuck in local minima, and empirically it DOES.
• Our unsupervised learning example assumed the P(w_i)'s known, and variances fixed and known. Easy to relax this.
• It's possible to do Bayesian unsupervised learning instead of max. likelihood.
• There are other algorithms for unsupervised learning. We'll visit K-means soon. Hierarchical clustering is also interesting.
• Neural-net algorithms called "competitive learning" turn out to have interesting parallels with the EM method we saw.

    What you should know

    • How to “learn” maximum likelihood parameters (locally max. like.) in the case of unlabeled data.
    • Be happy with this kind of probabilistic analysis.
    • Understand the two examples of E.M. given in these notes. For more info, see Duda + Hart. It’s a great book. There’s much more in the book than in your handout.


Other unsupervised learning methods

• K-means (see next lecture)
• Hierarchical clustering (e.g. Minimum spanning trees) (see next lecture)
• Principal Component Analysis (simple, useful tool)
• Non-linear PCA: Neural Auto-Associators, Locally weighted PCA, Others…