Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.




Clustering with Gaussian Mixtures

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599


         Copyright © 2001, 2004, Andrew W. Moore




Unsupervised Learning
• You walk into a bar. A stranger approaches and tells you:
  "I've got data from k classes. Each class produces observations with a normal distribution and variance σ²I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(wi)'s."
• So far, looks straightforward.
  "I need a maximum likelihood estimate of the µi's."
• No problem:
  "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)"
• Uh oh!!

Copyright © 2001, 2004, Andrew W. Moore                                          Clustering with Gaussian Mixtures: Slide 2




Gaussian Bayes Classifier Reminder

$$P(y=i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=i)\, P(y=i)}{p(\mathbf{x})}$$

$$P(y=i \mid \mathbf{x}) = \frac{\dfrac{1}{(2\pi)^{m/2}\,\lVert \Sigma_i \rVert^{1/2}} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T} \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right] p_i}{p(\mathbf{x})}$$

                                   How do we deal with that?




  Copyright © 2001, 2004, Andrew W. Moore                        Clustering with Gaussian Mixtures: Slide 3
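As a concrete sketch of the posterior formula above (not part of the original slides), the following assumes NumPy and SciPy are available; the class means, covariances, and priors are illustrative numbers only:

```python
# Sketch: the Gaussian Bayes posterior P(y=i|x) ∝ N(x; mu_i, Sigma_i) * p_i,
# normalized over classes (the division by p(x)).
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_bayes_posterior(x, means, covs, priors):
    """Return P(y=i|x) for each class i, given per-class Gaussians and priors."""
    likelihoods = np.array([
        multivariate_normal.pdf(x, mean=m, cov=S) for m, S in zip(means, covs)
    ])
    joint = likelihoods * np.asarray(priors)   # numerator of Bayes rule
    return joint / joint.sum()                 # divide by p(x)

# Toy usage with two 2-d classes (hypothetical numbers).
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]
print(gaussian_bayes_posterior(np.array([1.0, 1.0]), means, covs, priors))
```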




             Predicting wealth from age




  Copyright © 2001, 2004, Andrew W. Moore                        Clustering with Gaussian Mixtures: Slide 4




Predicting wealth from age




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 5




Learning modelyear, mpg ---> maker

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 6




General: O(m²) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$




Copyright © 2001, 2004, Andrew W. Moore        Clustering with Gaussian Mixtures: Slide 7




Aligned: O(m) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma_m^2 \end{pmatrix}$$




Copyright © 2001, 2004, Andrew W. Moore        Clustering with Gaussian Mixtures: Slide 8




Aligned: O(m) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma_m^2 \end{pmatrix}$$




Copyright © 2001, 2004, Andrew W. Moore        Clustering with Gaussian Mixtures: Slide 9




Spherical: O(1) cov parameters

$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma^2 \end{pmatrix}$$




Copyright © 2001, 2004, Andrew W. Moore       Clustering with Gaussian Mixtures: Slide 10




Spherical: O(1) cov parameters

$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma^2 \end{pmatrix}$$




Copyright © 2001, 2004, Andrew W. Moore                   Clustering with Gaussian Mixtures: Slide 11
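As a small sketch (not from the slides) of the three covariance families above and their parameter counts, assuming NumPy; the dimensions and variance values are arbitrary:

```python
# Sketch: the three covariance structures, built explicitly in numpy.
import numpy as np

m = 4
rng = np.random.default_rng(0)

# General: any symmetric positive-definite matrix -> O(m^2) parameters (m(m+1)/2 free).
A = rng.normal(size=(m, m))
sigma_general = A @ A.T + m * np.eye(m)

# Aligned (axis-aligned / diagonal): m free variances -> O(m) parameters.
sigma_aligned = np.diag(rng.uniform(0.5, 2.0, size=m))

# Spherical: a single variance sigma^2 on the diagonal -> O(1) parameters.
sigma2 = 1.3
sigma_spherical = sigma2 * np.eye(m)

free_params = {"general": m * (m + 1) // 2, "aligned": m, "spherical": 1}
for name, S in [("general", sigma_general),
                ("aligned", sigma_aligned),
                ("spherical", sigma_spherical)]:
    print(name, S.shape, "free params:", free_params[name])
```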




                  Making a Classifier from a
                     Density Estimator

                        Categorical        Real-valued       Mixed Real /
                        inputs only        inputs only       Cat okay

Classifier              Joint BC,          Gauss BC          Dec Tree
(predict category)      Naïve BC

Density Estimator       Joint DE,          Gauss DE
(probability)           Naïve DE

Regressor
(predict real no.)


Copyright © 2001, 2004, Andrew W. Moore                   Clustering with Gaussian Mixtures: Slide 12




Next… back to Density Estimation

What if we want to do density estimation with
         multimodal or clumpy data?




Copyright © 2001, 2004, Andrew W. Moore        Clustering with Gaussian Mixtures: Slide 13




The GMM assumption
• There are k components. The i'th component is called ωi
• Component ωi has an associated mean vector µi

[Figure: the component means µ1, µ2, µ3 plotted in 2-d space]




Copyright © 2001, 2004, Andrew W. Moore        Clustering with Gaussian Mixtures: Slide 14




The GMM assumption
• There are k components. The i'th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I
Assume that each datapoint is generated according to the following recipe:




Copyright © 2001, 2004, Andrew W. Moore        Clustering with Gaussian Mixtures: Slide 15




The GMM assumption
• There are k components. The i'th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ωi).


Copyright © 2001, 2004, Andrew W. Moore        Clustering with Gaussian Mixtures: Slide 16




The GMM assumption
• There are k components. The i'th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ωi).
2. Datapoint ~ N(µi, σ²I)
Copyright © 2001, 2004, Andrew W. Moore        Clustering with Gaussian Mixtures: Slide 17




The General GMM assumption
• There are k components. The i'th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix Σi
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ωi).
2. Datapoint ~ N(µi, Σi)
Copyright © 2001, 2004, Andrew W. Moore        Clustering with Gaussian Mixtures: Slide 18




Unsupervised Learning: not as hard as it looks

Sometimes easy
Sometimes impossible
and sometimes in between

(In case you're wondering what these diagrams are: they show 2-d unlabeled data (x vectors) distributed in 2-d space. The top one has three very clear Gaussian centers.)

Copyright © 2001, 2004, Andrew W. Moore                   Clustering with Gaussian Mixtures: Slide 19




                   Computing likelihoods in
                     unsupervised case
We have x1 , x2 , … xN
We know P(w1) P(w2) .. P(wk)
We know σ

P(x|wi, µ1, … µk) = Prob that an observation from class wi would have value x given class means µ1… µk

Can we write an expression for that?


Copyright © 2001, 2004, Andrew W. Moore                   Clustering with Gaussian Mixtures: Slide 20




Likelihoods in unsupervised case
We have x1 x2 … xn
We have P(w1) .. P(wk). We have σ.
We can define, for any x , P(x|wi , µ1, µ2 .. µk)

Can we define P(x | µ1, µ2 .. µk) ?

Can we define P(x1, x2, .. xn | µ1, µ2 .. µk) ?
        [Yes, if we assume the xi's were drawn independently]

Copyright © 2001, 2004, Andrew W. Moore             Clustering with Gaussian Mixtures: Slide 21
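The two questions on this slide have a direct computational reading. Below is a minimal sketch (not from the slides), assuming NumPy, known priors, and a shared spherical covariance σ²I as in the setup above; function names are my own:

```python
# Sketch: P(x | mu_1..mu_k) = sum_j P(w_j) N(x; mu_j, sigma^2 I), and the
# dataset likelihood under the independence assumption (sum of logs in practice).
import numpy as np

def mixture_density(x, mus, priors, sigma):
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = x.shape[0]
    norm = (2.0 * np.pi * sigma**2) ** (-d / 2.0)   # Gaussian normalizing constant
    total = 0.0
    for mu, p in zip(mus, priors):
        sq = np.sum((x - np.asarray(mu, dtype=float))**2)
        total += p * norm * np.exp(-sq / (2.0 * sigma**2))
    return total

def log_likelihood(X, mus, priors, sigma):
    # log P(x_1..x_n | mu_1..mu_k), assuming the x_i were drawn independently
    return sum(np.log(mixture_density(x, mus, priors, sigma)) for x in X)
```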




                    Unsupervised Learning:
                     Mediumly Good News
    We now have a procedure s.t. if you give me a guess at µ1, µ2 .. µk,
    I can tell you the prob of the unlabeled data given those µ‘s.



  Suppose x‘s are 1-dimensional.
                                                    (From Duda and Hart)
  There are two classes; w1 and w2
  P(w1) = 1/3           P(w2) = 2/3       σ=1.
  There are 25 unlabeled datapoints
      x1 =  0.608
      x2 = -1.590
      x3 = 0.235
      x4 = 3.949
           :
      x25 = -0.712

Copyright © 2001, 2004, Andrew W. Moore             Clustering with Gaussian Mixtures: Slide 22




Duda & Hart's Example

Graph of log P(x1, x2 .. x25 | µ1, µ2 ) against µ1 (→) and µ2 (↑)

Max likelihood = (µ1 = -2.13, µ2 = 1.668)
A local maximum, very close to the global one, sits at (µ1 = 2.085, µ2 = -1.257)*
     * corresponds to switching w1 and w2.

  Copyright © 2001, 2004, Andrew W. Moore         Clustering with Gaussian Mixtures: Slide 23
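A minimal sketch (not from the slides) of producing this kind of surface: it evaluates log P(x1..x25 | µ1, µ2) over a grid, using the Slide 22 setup (P(w1)=1/3, P(w2)=2/3, σ=1). The 25 datapoints are regenerated synthetically here, since the slides list only a few of them:

```python
# Sketch: log-likelihood surface over (mu_1, mu_2) for a 1-d, 2-component mixture.
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([-2.0, 2.0])                 # hypothetical "true" means
labels = rng.choice(2, size=25, p=[1/3, 2/3])   # component assignments (hidden)
x = rng.normal(loc=true_mu[labels], scale=1.0)  # 25 unlabeled datapoints

def loglik(mu1, mu2):
    dens = (1/3) * np.exp(-0.5 * (x - mu1)**2) + (2/3) * np.exp(-0.5 * (x - mu2)**2)
    return np.sum(np.log(dens / np.sqrt(2 * np.pi)))

grid = np.linspace(-4, 4, 81)
surface = np.array([[loglik(m1, m2) for m1 in grid] for m2 in grid])
row, col = np.unravel_index(surface.argmax(), surface.shape)
print("max-likelihood estimate: mu1 =", grid[col], " mu2 =", grid[row])
```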




                   Duda & Hart’s Example
 We can graph the
   prob. dist. function
   of data given our
   µ1 and µ2
   estimates.

We can also graph the
  true function from
  which the data was
  randomly generated.



• They are close. Good.
• The 2nd solution tries to put the “2/3” hump where the “1/3” hump should
  go, and vice versa.
• In this example unsupervised is almost as good as supervised. If the x1 ..
  x25 are given the class which was used to learn them, then the results are
  (µ1=-2.176, µ2=1.684). Unsupervised got (µ1=-2.13, µ2=1.668).
  Copyright © 2001, 2004, Andrew W. Moore         Clustering with Gaussian Mixtures: Slide 24




Finding the max likelihood µ1,µ2..µk
We can compute P( data | µ1,µ2..µk)
How do we find the µi's which give max. likelihood?

• The normal max likelihood trick:
  Set ∂/∂µi log Prob(….) = 0
  and solve for the µi's.
  Here you get non-linear, non-analytically-solvable equations.
• Use gradient descent:
  Slow but doable.
• Use a much faster, cuter, and recently very popular
  method…

Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 25




Expectation
Maximization


Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 26




The E.M. Algorithm

DETOUR
• We’ll get back to unsupervised learning soon.
• But now we’ll look at an even simpler case with
  hidden information.
• The EM algorithm
            Can do trivial things, such as the contents of the next
            few slides.
            An excellent way of doing our unsupervised learning
            problem, as we’ll see.
            Many, many other uses, including inference of Hidden
            Markov Models (future lecture).

Copyright © 2001, 2004, Andrew W. Moore       Clustering with Gaussian Mixtures: Slide 27




                                 Silly Example
Let events be “grades in a class”
  w1 = Gets an A           P(A) = ½
  w2 = Gets a B            P(B) = µ
  w3 = Gets a C            P(C) = 2µ
  w4 = Gets a D            P(D) = ½-3µ
                    (Note 0 ≤ µ ≤1/6)
Assume we want to estimate µ from data. In a given class
  there were
                    a A’s
                    b B’s
                    c C’s
                    d D’s
What’s the maximum likelihood estimate of µ given a,b,c,d ?
Copyright © 2001, 2004, Andrew W. Moore       Clustering with Gaussian Mixtures: Slide 28




Silly Example
Let events be “grades in a class”
    w1 = Gets an A              P(A) = ½
    w2 = Gets a B               P(B) = µ
    w3 = Gets a C               P(C) = 2µ
    w4 = Gets a D               P(D) = ½-3µ
                                (Note 0 ≤ µ ≤1/6)
Assume we want to estimate µ from data. In a given class there were
                                a A’s
                                b B’s
                                c C’s
                                d D’s
What’s the maximum likelihood estimate of µ given a,b,c,d ?




Copyright © 2001, 2004, Andrew W. Moore                               Clustering with Gaussian Mixtures: Slide 29




                              Trivial Statistics
P(A) = ½           P(B) = µ         P(C) = 2µ        P(D) = ½-3µ
P( a,b,c,d | µ) = K (½)^a (µ)^b (2µ)^c (½-3µ)^d
log P( a,b,c,d | µ) = log K + a log ½ + b log µ + c log 2µ + d log (½-3µ)

For max like µ, set ∂ log P / ∂µ = 0:

$$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu} = 0$$

$$\text{Gives max like } \mu = \frac{b+c}{6\,(b+c+d)}$$

So if class got: A = 14, B = 6, C = 9, D = 10

Max like µ = 1/10     (Boring, but true!)
Copyright © 2001, 2004, Andrew W. Moore                               Clustering with Gaussian Mixtures: Slide 30
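A quick sanity check (not from the slides) of the closed-form estimate against the grade counts above:

```python
# Sketch: checking mu = (b + c) / (6 (b + c + d)) for a=14, b=6, c=9, d=10.
a, b, c, d = 14, 6, 9, 10
mu = (b + c) / (6 * (b + c + d))
print(mu)  # 0.1, i.e. the "max like mu = 1/10" on the slide
```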




Same Problem with Hidden Information

Someone tells us that
  Number of High grades (A's + B's) = h
  Number of C's = c
  Number of D's = d

What is the max. like estimate of µ now?

(REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½-3µ)



Copyright © 2001, 2004, Andrew W. Moore            Clustering with Gaussian Mixtures: Slide 31




Same Problem with Hidden Information

Someone tells us that
  Number of High grades (A's + B's) = h
  Number of C's = c
  Number of D's = d

What is the max. like estimate of µ now?
We can answer this question circularly:

EXPECTATION
If we know the value of µ we could compute the expected values of a and b.
Since the ratio a:b should be the same as the ratio ½ : µ,

$$a = \frac{\tfrac{1}{2}}{\tfrac{1}{2}+\mu}\,h \qquad\qquad b = \frac{\mu}{\tfrac{1}{2}+\mu}\,h$$

MAXIMIZATION
If we know the expected values of a and b we could compute the maximum likelihood value of µ:

$$\mu = \frac{b+c}{6\,(b+c+d)}$$

(REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½-3µ)
Copyright © 2001, 2004, Andrew W. Moore            Clustering with Gaussian Mixtures: Slide 32




E.M. for our Trivial Problem

(REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½-3µ)

We begin with a guess for µ.
We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b.

Define µ(t) = the estimate of µ on the t'th iteration
       b(t) = the estimate of b on the t'th iteration

µ(0) = initial guess

$$b(t) = \frac{\mu(t)\,h}{\tfrac{1}{2} + \mu(t)} = \mathrm{E}[\,b \mid \mu(t)\,] \qquad \text{(E-step)}$$

$$\mu(t+1) = \frac{b(t) + c}{6\,(b(t) + c + d)} = \text{max like est. of } \mu \text{ given } b(t) \qquad \text{(M-step)}$$
                 Continue iterating until converged.
                 Good news: Converging to local optimum is assured.
                 Bad news: I said “local” optimum.
Copyright © 2001, 2004, Andrew W. Moore                        Clustering with Gaussian Mixtures: Slide 33




                         E.M. Convergence
• Convergence proof based on fact that Prob(data | µ) must increase or
   remain same between each iteration [NOT OBVIOUS]
• But it can never exceed 1 [OBVIOUS]
So it must therefore converge [OBVIOUS]


In our example, suppose we had h = 20, c = 10, d = 10, and µ(0) = 0.

  t     µ(t)      b(t)
  0     0         0
  1     0.0833    2.857
  2     0.0937    3.158
  3     0.0947    3.185
  4     0.0948    3.187
  5     0.0948    3.187
  6     0.0948    3.187

Convergence is generally linear: error decreases by a constant factor each time step.
Copyright © 2001, 2004, Andrew W. Moore                        Clustering with Gaussian Mixtures: Slide 34
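A minimal sketch (not from the slides) of this E/M loop, using the slide's numbers (h = 20, c = 10, d = 10, µ(0) = 0); it should reproduce the table above, converging to µ ≈ 0.0948, b ≈ 3.187:

```python
# Sketch of the E-step / M-step iteration on Slides 32-33.
h, c, d = 20, 10, 10
b = 0.0                                  # b(0) = E[b | mu(0)=0] = 0
for t in range(1, 7):
    mu = (b + c) / (6 * (b + c + d))     # M-step: max-likelihood mu given b(t-1)
    b = mu * h / (0.5 + mu)              # E-step: b(t) = E[b | mu(t)]
    print(t, round(mu, 4), round(b, 4))
```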




Back to Unsupervised Learning of GMMs

Remember:
  We have unlabeled data x1 x2 … xR
  We know there are k classes
  We know P(w1) P(w2) P(w3) … P(wk)
  We don't know µ1 µ2 .. µk

We can write P( data | µ1…. µk)

$$= p(x_1 \ldots x_R \mid \mu_1 \ldots \mu_k)$$

$$= \prod_{i=1}^{R} p(x_i \mid \mu_1 \ldots \mu_k)$$

$$= \prod_{i=1}^{R} \sum_{j=1}^{k} p\!\left(x_i \mid w_j, \mu_1 \ldots \mu_k\right) P(w_j)$$

$$= \prod_{i=1}^{R} \sum_{j=1}^{k} K \exp\!\left(-\frac{1}{2\sigma^2}\,(x_i - \mu_j)^2\right) P(w_j)$$


Copyright © 2001, 2004, Andrew W. Moore                            Clustering with Gaussian Mixtures: Slide 35




E.M. for GMMs

For Max likelihood we know

$$\frac{\partial}{\partial \mu_i} \log \operatorname{Prob}(\text{data} \mid \mu_1 \ldots \mu_k) = 0$$

Some wild'n'crazy algebra turns this into: "For Max likelihood, for each j,

$$\mu_j = \frac{\sum_{i=1}^{R} P\!\left(w_j \mid x_i, \mu_1 \ldots \mu_k\right) x_i}{\sum_{i=1}^{R} P\!\left(w_j \mid x_i, \mu_1 \ldots \mu_k\right)}$$

This is n nonlinear equations in the µj's."
(See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf)

If, for each xi, we knew that for each wj the prob that xi was in class wj is P(wj|xi,µ1…µk), then… we would easily compute µj.

If we knew each µj then we could easily compute P(wj|xi,µ1…µk) for each wj and xi.

…I feel an EM experience coming on!!
Copyright © 2001, 2004, Andrew W. Moore                            Clustering with Gaussian Mixtures: Slide 36




E.M. for GMMs

Iterate. On the t'th iteration let our estimates be
  λt = { µ1(t), µ2(t) … µc(t) }

E-step: Compute "expected" classes of all datapoints for each class

$$P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p\!\left(x_k \mid w_i, \mu_i(t), \sigma^2 I\right) p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid w_j, \mu_j(t), \sigma^2 I\right) p_j(t)}$$

(Just evaluate a Gaussian at xk.)

M-step: Compute Max. like µ given our data's class membership distributions

$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}$$
  Copyright © 2001, 2004, Andrew W. Moore                                             Clustering with Gaussian Mixtures: Slide 37
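A minimal sketch (not from the slides) of this iteration for the setting on the slide: spherical Gaussians N(µi, σ²I), with the mixing weights pi and σ assumed known and only the means re-estimated. The function name and shapes are my own choices:

```python
# Sketch of the E-step / M-step above for a spherical GMM with fixed priors and sigma.
import numpy as np

def em_means(X, mus, priors, sigma, n_iters=50):
    X = np.asarray(X, dtype=float)             # data, shape (R, d)
    mus = np.asarray(mus, dtype=float).copy()  # initial mean guesses, shape (k, d)
    priors = np.asarray(priors, dtype=float)   # known mixing weights p_i, shape (k,)
    for _ in range(n_iters):
        # E-step: responsibilities P(w_i | x_k, lambda_t), proportional to
        # p_i * exp(-||x_k - mu_i||^2 / (2 sigma^2)); constants cancel on normalizing.
        sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (R, k)
        resp = priors * np.exp(-sq / (2.0 * sigma**2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mu_i(t+1) = sum_k P(w_i|x_k) x_k / sum_k P(w_i|x_k)
        weights = resp.sum(axis=0)                                  # (k,)
        mus = (resp.T @ X) / weights[:, None]
    return mus
```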




   E.M.
Convergence
 • Your lecturer will
   (unless out of
   time) give you a
   nice intuitive
   explanation of
   why this rule
   works.
 • As with all EM
   procedures,                                          • This algorithm is REALLY USED. And
   convergence to a                                       in high dimensional state spaces, too.
   local optimum                                          E.G. Vector Quantization for Speech
   guaranteed.                                            Data
  Copyright © 2001, 2004, Andrew W. Moore                                             Clustering with Gaussian Mixtures: Slide 38




E.M. for General GMMs

(pi(t) is shorthand for the estimate of P(ωi) on the t'th iteration.)

Iterate. On the t'th iteration let our estimates be
  λt = { µ1(t), µ2(t) … µc(t), Σ1(t), Σ2(t) … Σc(t), p1(t), p2(t) … pc(t) }

E-step: Compute "expected" classes of all datapoints for each class

$$P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p\!\left(x_k \mid w_i, \mu_i(t), \Sigma_i(t)\right) p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid w_j, \mu_j(t), \Sigma_j(t)\right) p_j(t)}$$

(Just evaluate a Gaussian at xk.)

M-step: Compute Max. like µ given our data's class membership distributions

$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)} \qquad
\Sigma_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\,\left[x_k - \mu_i(t+1)\right]\left[x_k - \mu_i(t+1)\right]^{T}}{\sum_k P(w_i \mid x_k, \lambda_t)}$$

$$p_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)}{R} \qquad R = \#\text{records}$$
  Copyright © 2001, 2004, Andrew W. Moore                                                         Clustering with Gaussian Mixtures: Slide 39
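As a sketch (not from the slides) of the extra M-step updates on this slide, the function below re-estimates means, full covariances, and mixing weights given responsibilities from an E-step; for the general case those responsibilities would come from full-covariance Gaussians (e.g. scipy.stats.multivariate_normal) rather than the spherical E-step sketched earlier:

```python
# Sketch of the General-GMM M-step: given resp[k, i] = P(w_i | x_k, lambda_t),
# update mu_i(t+1), Sigma_i(t+1), and p_i(t+1) as in the formulas above.
import numpy as np

def general_gmm_m_step(X, resp):
    X = np.asarray(X, dtype=float)        # (R, d) data
    resp = np.asarray(resp, dtype=float)  # (R, k) responsibilities from the E-step
    R, d = X.shape
    Nk = resp.sum(axis=0)                             # sum_k P(w_i | x_k, lambda_t)
    mus = (resp.T @ X) / Nk[:, None]                  # mu_i(t+1)
    covs = np.empty((resp.shape[1], d, d))
    for i in range(resp.shape[1]):
        diff = X - mus[i]                             # x_k - mu_i(t+1)
        covs[i] = (resp[:, i, None] * diff).T @ diff / Nk[i]
    priors = Nk / R                                   # p_i(t+1), R = #records
    return mus, covs, priors
```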




  Gaussian
   Mixture
  Example:
    Start



 Advance apologies: in Black
and White this example will be
     incomprehensible
  Copyright © 2001, 2004, Andrew W. Moore                                                         Clustering with Gaussian Mixtures: Slide 40




After first
iteration




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 41




After 2nd
iteration




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 42




After 3rd
 iteration




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 43




 After 4th
 iteration




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 44




After 5th
 iteration




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 45




 After 6th
 iteration




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 46




After 20th
  iteration




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 47




  Some Bio
   Assay
    data




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 48




GMM
clustering
  of the
assay data




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 49




  Resulting
   Density
  Estimator




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 50




Where are we now?

Inference Engine (learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
Classifier (predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation
Density Estimator (probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Regressor (predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

Copyright © 2001, 2004, Andrew W. Moore                      Clustering with Gaussian Mixtures: Slide 51




The old trick…

Inference Engine (learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
Classifier (predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation, GMM-BC
Density Estimator (probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Regressor (predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

Copyright © 2001, 2004, Andrew W. Moore                      Clustering with Gaussian Mixtures: Slide 52




Three classes of assay
(each learned with its own mixture model)
(Sorry, this will again be semi-useless in black and white)




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 53




Resulting
Bayes
Classifier




Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 54




Resulting Bayes
Classifier, using
posterior
probabilities to
alert about
ambiguity and
anomalousness

     Yellow means
      anomalous

      Cyan means
      ambiguous

Copyright © 2001, 2004, Andrew W. Moore             Clustering with Gaussian Mixtures: Slide 55
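A small sketch (not from the slides) of how such posterior-based alerts might be computed; the slide does not give thresholds, so the cut-offs and function name here are purely illustrative:

```python
# Sketch (illustrative thresholds): flag a point as "anomalous" when its mixture
# density is very low, and "ambiguous" when no single class posterior dominates.
import numpy as np

def flag_point(class_densities, priors, density_floor=1e-4, posterior_floor=0.6):
    """class_densities[i] = p(x | class i); priors[i] = P(class i)."""
    joint = np.asarray(class_densities) * np.asarray(priors)
    p_x = joint.sum()
    if p_x < density_floor:
        return "anomalous"                 # unlikely under every class
    posterior = joint / p_x
    if posterior.max() < posterior_floor:
        return "ambiguous"                 # several classes explain it about equally
    return f"class {int(posterior.argmax())}"
```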




Unsupervised learning with symbolic attributes

[Figure: a small Bayes net over NATION, MARRIED, and # KIDS, with some values missing]

It's just a "learning Bayes net with known structure but hidden values" problem.
Can use Gradient Descent.

EASY, fun exercise to do an EM formulation for this case too.

Copyright © 2001, 2004, Andrew W. Moore             Clustering with Gaussian Mixtures: Slide 56




Final Comments
• Remember, E.M. can get stuck in local optima, and
  empirically it DOES.
• Our unsupervised learning example assumed P(wi)’s
  known, and variances fixed and known. Easy to
  relax this.
• It’s possible to do Bayesian unsupervised learning
  instead of max. likelihood.
• There are other algorithms for unsupervised
  learning. We’ll visit K-means soon. Hierarchical
  clustering is also interesting.
• Neural-net algorithms called “competitive learning”
  turn out to have interesting parallels with the EM
  method we saw.
Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 57




                  What you should know
• How to “learn” maximum likelihood parameters
  (locally max. like.) in the case of unlabeled data.
• Be happy with this kind of probabilistic analysis.
• Understand the two examples of E.M. given in these
  notes.


   For more info, see Duda + Hart. It’s a great book.
   There’s much more in the book than in your
   handout.



Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 58




Other unsupervised learning
                      methods
 • K-means (see next lecture)
 • Hierarchical clustering (e.g. Minimum spanning
   trees) (see next lecture)
 • Principal Component Analysis
          simple, useful tool

 • Non-linear PCA
          Neural Auto-Associators
          Locally weighted PCA
          Others…

Copyright © 2001, 2004, Andrew W. Moore   Clustering with Gaussian Mixtures: Slide 59




 
Gaussian Bayes Classifiers
Gaussian Bayes ClassifiersGaussian Bayes Classifiers
Gaussian Bayes Classifiers
 
Maximum Likelihood Estimation
Maximum Likelihood EstimationMaximum Likelihood Estimation
Maximum Likelihood Estimation
 
Gaussians
GaussiansGaussians
Gaussians
 
Probability for Data Miners
Probability for Data MinersProbability for Data Miners
Probability for Data Miners
 

Recently uploaded

Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxPoojaSen20
 

Recently uploaded (20)

Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 

Gaussian Mixture Models

  • 1. Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received. Clustering with Gaussian Mixtures Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599 Copyright © 2001, 2004, Andrew W. Moore Unsupervised Learning • You walk into a bar. A stranger approaches and tells you: “I’ve got data from k classes. Each class produces observations with a normal distribution and variance σ2I . Standard simple multivariate gaussian assumptions. I can tell you all the P(wi)’s .” • So far, looks straightforward. “I need a maximum likelihood estimate of the µi’s .“ • No problem: “There’s just one thing. None of the data are labeled. I have datapoints, but I don’t know what class they’re from (any of them!) • Uh oh!! Copyright © 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 2 1
• 2. Gaussian Bayes Classifier Reminder: P(y = i | x) = p(x | y = i) P(y = i) / p(x), which expands to P(y = i | x) = [ (2π)^(−m/2) ‖Σi‖^(−1/2) exp( −½ (xk − µi)ᵀ Σi⁻¹ (xk − µi) ) pi ] / p(x). How do we deal with that? Predicting wealth from age. (Slides 3–4)
• 3. Predicting wealth from age. Learning modelyear, mpg → maker: the general covariance matrix is Σ = [ σ1² σ12 … σ1m ; σ12 σ2² … σ2m ; ⋮ ⋮ ⋱ ⋮ ; σ1m σ2m … σm² ]. (Slides 5–6)
• 4. General: the full covariance matrix Σ = [ σ1² σ12 … σ1m ; σ12 σ2² … σ2m ; ⋮ ⋮ ⋱ ⋮ ; σ1m σ2m … σm² ] has O(m²) parameters. Aligned: the diagonal covariance Σ = diag(σ1², σ2², …, σm²) has O(m) parameters. (Slides 7–8)
• 5. Aligned: Σ = diag(σ1², σ2², …, σm²), O(m) parameters. Spherical: Σ = σ²I — every diagonal entry equals σ² and all off-diagonal entries are 0 — so there are O(1) covariance parameters. (Slides 9–10)
• 6. Spherical: Σ = σ²I, O(1) covariance parameters. Making a Classifier from a Density Estimator — a table of learners by input type (categorical inputs only / real-valued inputs only / mixed real–categorical okay) and output: Classifier (predict category): Joint BC, Naïve BC | Gauss BC | Dec Tree; Density Estimator (probability): Joint DE, Naïve DE | Gauss DE | —; Regressor (predict real number). (Slides 11–12)
• 7. Next… back to Density Estimation: what if we want to do density estimation with multimodal or clumpy data? The GMM assumption: there are k components; the i'th component is called ωi, and component ωi has an associated mean vector µi (the figure shows means µ1, µ2, µ3). (Slides 13–14)
• 8. The GMM assumption (continued): each component generates data from a Gaussian with mean µi and covariance matrix σ²I. Assume each datapoint is generated according to the following recipe: 1. Pick a component at random, choosing component i with probability P(ωi). (Slides 15–16)
• 9. The GMM recipe, step 2: draw the datapoint x ~ N(µi, σ²I). The General GMM assumption is the same except each component has its own covariance matrix Σi, so step 2 becomes x ~ N(µi, Σi). (A small sampling sketch of this recipe follows below.) (Slides 17–18)
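A minimal sketch of this two-step generative recipe, assuming numpy; the function name sample_gmm and its arguments are my own choices for illustration, not code from the slides:

    import numpy as np

    def sample_gmm(n, priors, means, sigma, seed=0):
        """Draw n points from the slides' GMM: priors (k,) holds P(w_i),
        means (k, m) holds the mu_i, sigma is the shared spherical std dev."""
        rng = np.random.default_rng(seed)
        comps = rng.choice(len(priors), size=n, p=priors)           # step 1: pick component i with prob P(w_i)
        noise = rng.normal(scale=sigma, size=(n, means.shape[1]))   # spherical Gaussian noise
        return means[comps] + noise, comps                          # step 2: x ~ N(mu_i, sigma^2 I)

For the general GMM, step 2 would instead draw from N(µi, Σi), e.g. with rng.multivariate_normal.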
• 10. Unsupervised Learning: not as hard as it looks. Sometimes easy, sometimes impossible, and sometimes in between. (In case you're wondering what these diagrams are, they show 2-d unlabeled data (x vectors) distributed in 2-d space; the top one has three very clear Gaussian centers.) Computing likelihoods in the unsupervised case: we have x1, x2, …, xN; we know P(w1), P(w2), …, P(wk); we know σ. P(x | wi, µ1, …, µk) is the probability that an observation from class wi would have value x given class means µ1 … µk. Can we write an expression for that? (Slides 19–20)
• 11. Likelihoods in the unsupervised case: we have x1, x2, …, xn; we have P(w1) … P(wk); we have σ. We can define, for any x, P(x | wi, µ1, µ2, …, µk). Can we define P(x | µ1, µ2, …, µk)? Can we define P(x1, x2, …, xn | µ1, µ2, …, µk)? [Yes, if we assume the xi's were drawn independently.] Unsupervised Learning: Mediumly Good News — we now have a procedure such that if you give me a guess at µ1, µ2, …, µk, I can tell you the probability of the unlabeled data given those µ's. Suppose the x's are 1-dimensional (from Duda and Hart): there are two classes, w1 and w2, with P(w1) = 1/3, P(w2) = 2/3, σ = 1, and 25 unlabeled datapoints: x1 = 0.608, x2 = −1.590, x3 = 0.235, x4 = 3.949, …, x25 = −0.712. A small sketch of this likelihood computation follows below. (Slides 21–22)
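A small numpy sketch of the quantity the next bullet graphs: the log-likelihood log P(x1 … xn | µ1, µ2) for the 1-dimensional two-class setup with known P(w1) = 1/3, P(w2) = 2/3 and σ = 1. This is my own illustrative code; the slide lists only a few of the 25 datapoints, so the example call uses just those:

    import numpy as np

    def log_lik(x, mu1, mu2, p1=1/3, p2=2/3, sigma=1.0):
        # log P(x_1..x_n | mu1, mu2), assuming the x_i were drawn independently
        c = 1.0 / (np.sqrt(2 * np.pi) * sigma)
        f1 = c * np.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))   # p(x | w1, mu1)
        f2 = c * np.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))   # p(x | w2, mu2)
        return np.sum(np.log(p1 * f1 + p2 * f2))              # mixture, then product over i

    x_shown = np.array([0.608, -1.590, 0.235, 3.949, -0.712])  # only the datapoints printed on the slide
    print(log_lik(x_shown, -2.13, 1.668))

Evaluating log_lik over a grid of (µ1, µ2) values produces the kind of likelihood surface described in the next bullet.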
• 12. Duda & Hart's Example: graph of log P(x1, x2, …, x25 | µ1, µ2) against µ1 (→) and µ2 (↑). Maximum likelihood is at (µ1 = −2.13, µ2 = 1.668); there is also a local optimum, very close to the global one, at (µ1 = 2.085, µ2 = −1.257), which corresponds to swapping the roles of w1 and w2. We can graph the probability density function of the data given our µ1 and µ2 estimates, and also graph the true function from which the data was randomly generated. They are close — good. The 2nd solution tries to put the "2/3" hump where the "1/3" hump should go, and vice versa. In this example unsupervised is almost as good as supervised: if the x1 … x25 are given the class which was used to learn them, the results are (µ1 = −2.176, µ2 = 1.684); unsupervised got (µ1 = −2.13, µ2 = 1.668). (Slides 23–24)
• 13. Finding the max-likelihood µ1, µ2, …, µk: we can compute P(data | µ1, µ2, …, µk); how do we find the µi's which give maximum likelihood? The normal max-likelihood trick — set ∂ log Prob(…) / ∂µi = 0 and solve for the µi's — here yields non-linear, non-analytically-solvable equations. Gradient descent is slow but doable. Or use a much faster, cuter, and recently very popular method… Expectation Maximization. (Slides 25–26)
• 14. The E.M. Algorithm (a detour): we'll get back to unsupervised learning soon, but now we'll look at an even simpler case with hidden information. The EM algorithm can do trivial things, such as the contents of the next few slides; it is an excellent way of doing our unsupervised learning problem, as we'll see; and it has many, many other uses, including inference of Hidden Markov Models (future lecture). Silly Example: let events be "grades in a class" — w1 = gets an A, P(A) = ½; w2 = gets a B, P(B) = µ; w3 = gets a C, P(C) = 2µ; w4 = gets a D, P(D) = ½ − 3µ (note 0 ≤ µ ≤ 1/6). Assume we want to estimate µ from data. In a given class there were a A's, b B's, c C's, d D's. What's the maximum likelihood estimate of µ given a, b, c, d? (Slides 27–28)
• 15. Silly Example (continued) and Trivial Statistics: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ. P(a, b, c, d | µ) = K (½)^a (µ)^b (2µ)^c (½ − 3µ)^d, so log P(a, b, c, d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½ − 3µ). For the max-likelihood µ, set ∂ log P / ∂µ = 0: ∂ log P / ∂µ = b/µ + 2c/(2µ) − 3d/(½ − 3µ) = 0, which gives the max-likelihood estimate µ = (b + c) / (6(b + c + d)). So if the class got 14 A's, 6 B's, 9 C's and 10 D's, the max-likelihood µ = 1/10. Boring, but true! A quick numerical check of this formula follows below. (Slides 29–30)
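A quick numerical check of the closed-form estimate µ = (b + c) / (6(b + c + d)) against a brute-force search over the log-likelihood, using the slide's counts; this is my own check, assuming numpy:

    import numpy as np

    a, b, c, d = 14, 6, 9, 10              # the grade counts from the slide

    def log_p(mu):
        # log P(a,b,c,d | mu) up to the constant log K
        return a * np.log(0.5) + b * np.log(mu) + c * np.log(2 * mu) + d * np.log(0.5 - 3 * mu)

    mu_closed = (b + c) / (6.0 * (b + c + d))            # closed-form MLE: 0.1
    grid = np.linspace(1e-6, 1 / 6 - 1e-6, 200001)       # search over 0 < mu < 1/6
    mu_grid = grid[np.argmax(log_p(grid))]
    print(mu_closed, mu_grid)                            # both ~0.1, as the slide says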
• 16. Same Problem with Hidden Information: someone tells us that the number of high grades (A's + B's) = h, the number of C's = c, and the number of D's = d (remember P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ). What is the max-likelihood estimate of µ now? We can answer this question circularly. EXPECTATION: if we knew the value of µ we could compute the expected values of a and b — since the ratio a:b should be the same as the ratio ½ : µ, a = h·(½)/(½ + µ) and b = h·µ/(½ + µ). MAXIMIZATION: if we knew the expected values of a and b we could compute the maximum-likelihood value of µ, namely µ = (b + c) / (6(b + c + d)). (Slides 31–32)
• 17. E.M. for our Trivial Problem: we begin with a guess for µ and iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b. Define µ(t) as the estimate of µ on the t'th iteration and b(t) as the estimate of b on the t'th iteration; µ(0) = initial guess. E-step: b(t) = µ(t)·h / (½ + µ(t)) = E[b | µ(t)]. M-step: µ(t+1) = (b(t) + c) / (6(b(t) + c + d)) = the max-likelihood estimate of µ given b(t). Continue iterating until converged. Good news: converging to a local optimum is assured. Bad news: I said "local" optimum. E.M. Convergence: the convergence proof is based on the fact that Prob(data | µ) must increase or remain the same between each iteration [not obvious], but it can never exceed 1 [obvious], so it must therefore converge [obvious]. In our example, suppose we had h = 20, c = 10, d = 10, µ(0) = 0; the iterates are t = 0: µ = 0, b = 0; t = 1: µ = 0.0833, b = 2.857; t = 2: µ = 0.0937, b = 3.158; t = 3: µ = 0.0947, b = 3.185; t = 4: µ = 0.0948, b = 3.187; t = 5: µ = 0.0948, b = 3.187; t = 6: µ = 0.0948, b = 3.187. Convergence is generally linear: the error decreases by a constant factor each time step. A short sketch that reproduces this iteration follows below. (Slides 33–34)
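A short sketch reproducing the iteration above with the slide's numbers h = 20, c = 10, d = 10 and µ(0) = 0 (my own code; the first E-step from µ(0) = 0 gives b(0) = 0, which is folded into the initialisation):

    h, c, d = 20.0, 10.0, 10.0
    mu, b = 0.0, 0.0                           # the t = 0 row of the table
    for t in range(1, 7):
        mu = (b + c) / (6.0 * (b + c + d))     # M-step: mu(t) from b(t-1)
        b = mu * h / (0.5 + mu)                # E-step: b(t) = E[b | mu(t)]
        print(t, round(mu, 4), round(b, 3))    # converges to mu ~ 0.0948, b ~ 3.187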
• 18. Back to Unsupervised Learning of GMMs. Remember: we have unlabeled data x1, x2, …, xR; we know there are k classes; we know P(w1), P(w2), …, P(wk); we don't know µ1, µ2, …, µk. We can write P(data | µ1 … µk) = p(x1 … xR | µ1 … µk) = ∏_{i=1..R} p(xi | µ1 … µk) = ∏_{i=1..R} Σ_{j=1..k} p(xi | wj, µ1 … µk) P(wj) = ∏_{i=1..R} Σ_{j=1..k} K exp( −‖xi − µj‖² / (2σ²) ) P(wj). E.M. for GMMs: for maximum likelihood we know ∂ log Prob(data | µ1 … µk) / ∂µi = 0. Some wild'n'crazy algebra (see http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf) turns this into: for max likelihood, for each j, µj = Σ_{i=1..R} P(wj | xi, µ1 … µk) xi / Σ_{i=1..R} P(wj | xi, µ1 … µk). This is a set of nonlinear equations in the µj's. If, for each xi, we knew the probability that xi was in class wj, P(wj | xi, µ1 … µk), then we could easily compute each µj; and if we knew each µj, we could easily compute P(wj | xi, µ1 … µk) for each wj and xi. …I feel an EM experience coming on!! (Slides 35–36)
• 19. E.M. for GMMs: iterate. On the t'th iteration let our estimates be λt = { µ1(t), µ2(t), …, µc(t) }. E-step — compute the "expected" classes of all datapoints for each class (just evaluate a Gaussian at xk): P(wi | xk, λt) = p(xk | wi, λt) P(wi | λt) / p(xk | λt) = p(xk | wi, µi(t), σ²I) pi(t) / Σ_{j=1..c} p(xk | wj, µj(t), σ²I) pj(t). M-step — compute the max-likelihood µ given our data's class-membership distributions: µi(t+1) = Σ_k P(wi | xk, λt) xk / Σ_k P(wi | xk, λt). A code sketch of these two steps follows below. E.M. Convergence: your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works. As with all EM procedures, convergence to a local optimum is guaranteed. This algorithm is REALLY USED — and in high-dimensional state spaces, too, e.g. Vector Quantization for Speech Data. (Slides 37–38)
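A minimal numpy sketch of these two steps for the setting on this slide: known mixing weights pi, known spherical covariance σ²I, and only the means being re-estimated. The function name and arguments are my own choices, not the slides' code:

    import numpy as np

    def em_means(X, priors, sigma, mu, n_iter=50):
        """X: (R, m) data; priors: (c,) known P(w_i); sigma: known scalar;
        mu: (c, m) initial mean guesses. Returns the updated means."""
        for _ in range(n_iter):
            # E-step: responsibilities P(w_i | x_k, lambda_t); the shared Gaussian
            # normalising constant cancels in the ratio, so only squared distances matter.
            sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (R, c)
            log_w = np.log(priors)[None, :] - sq_dist / (2 * sigma ** 2)
            log_w -= log_w.max(axis=1, keepdims=True)                       # numerical stability
            resp = np.exp(log_w)
            resp /= resp.sum(axis=1, keepdims=True)                         # rows sum to 1
            # M-step: mu_i(t+1) = sum_k resp_ik x_k / sum_k resp_ik
            mu = (resp.T @ X) / resp.sum(axis=0)[:, None]
        return mu

    # e.g. for the 1-d Duda & Hart setup:
    # em_means(x.reshape(-1, 1), np.array([1/3, 2/3]), 1.0, np.array([[0.0], [1.0]]))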
• 20. E.M. for General GMMs (pi(t) is shorthand for the estimate of P(ωi) on the t'th iteration). Iterate. On the t'th iteration let our estimates be λt = { µ1(t), …, µc(t), Σ1(t), …, Σc(t), p1(t), …, pc(t) }. E-step — compute the "expected" classes of all datapoints for each class (just evaluate a Gaussian at xk): P(wi | xk, λt) = p(xk | wi, µi(t), Σi(t)) pi(t) / Σ_{j=1..c} p(xk | wj, µj(t), Σj(t)) pj(t). M-step — compute the max-likelihood parameters given our data's class-membership distributions: µi(t+1) = Σ_k P(wi | xk, λt) xk / Σ_k P(wi | xk, λt); Σi(t+1) = Σ_k P(wi | xk, λt) [xk − µi(t+1)][xk − µi(t+1)]ᵀ / Σ_k P(wi | xk, λt); pi(t+1) = Σ_k P(wi | xk, λt) / R, where R = #records (a full-parameter code sketch follows after this bullet). Gaussian Mixture Example: Start (advance apologies: in black and white this example will be incomprehensible). (Slides 39–40)
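A sketch of the full update above, where means, covariances and mixing weights are all re-estimated. Again this is my own illustrative code; the small ridge term reg is an assumption added for numerical safety, not something from the slides:

    import numpy as np

    def em_general(X, mu, Sigma, p, n_iter=50, reg=1e-6):
        """X: (R, m); mu: (c, m); Sigma: (c, m, m); p: (c,) initial guesses."""
        R, m = X.shape
        c = len(p)
        for _ in range(n_iter):
            # E-step: log p(x_k | w_i, mu_i(t), Sigma_i(t)) + log p_i(t) for every (k, i)
            log_resp = np.empty((R, c))
            for i in range(c):
                diff = X - mu[i]
                cov = Sigma[i] + reg * np.eye(m)
                _, logdet = np.linalg.slogdet(cov)
                maha = np.einsum('km,mn,kn->k', diff, np.linalg.inv(cov), diff)
                log_resp[:, i] = np.log(p[i]) - 0.5 * (m * np.log(2 * np.pi) + logdet + maha)
            log_resp -= log_resp.max(axis=1, keepdims=True)
            resp = np.exp(log_resp)
            resp /= resp.sum(axis=1, keepdims=True)            # P(w_i | x_k, lambda_t)
            # M-step: weighted means, weighted covariances, and class fractions
            Nk = resp.sum(axis=0)                              # sum_k P(w_i | x_k, lambda_t)
            mu = (resp.T @ X) / Nk[:, None]
            for i in range(c):
                diff = X - mu[i]
                Sigma[i] = (resp[:, i, None] * diff).T @ diff / Nk[i]
            p = Nk / R
        return mu, Sigma, p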
• 21. After first iteration. After 2nd iteration. (Slides 41–42)
• 22. After 3rd iteration. After 4th iteration. (Slides 43–44)
• 23. After 5th iteration. After 6th iteration. (Slides 45–46)
• 24. After 20th iteration. Some Bio Assay data. (Slides 47–48)
• 25. GMM clustering of the assay data. Resulting Density Estimator. (Slides 49–50)
• 26. Where are we now? Inference Engine (inputs → P(E1|E2)): Joint DE, Bayes Net Structure Learning. Classifier (inputs → predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation. Density Estimator (inputs → probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs. Regressor (inputs → predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS. The old trick…: the same table, with GMM-BC added to the Classifier row. (Slides 51–52)
• 27. Three classes of assay (each learned with its own mixture model). (Sorry, this will again be semi-useless in black and white.) Resulting Bayes Classifier. (Slides 53–54)
• 28. Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness: yellow means anomalous, cyan means ambiguous (a sketch of how such flags can be computed follows below). Unsupervised learning with symbolic attributes (the figure shows a small network over # KIDS, NATION, MARRIED with missing values): it's just a "learning a Bayes net with known structure but hidden values" problem. Can use gradient descent; it is an easy, fun exercise to do an EM formulation for this case too. (Slides 55–56)
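One way to realise the ambiguity / anomalousness flags described above is to look at the class posteriors and the overall density from the per-class mixture models. The following is my own sketch of that idea; the thresholds are illustrative assumptions, not values from the slides:

    import numpy as np

    def classify_with_flags(log_dens, log_priors, ambiguous_margin=0.1, anomaly_log_density=-20.0):
        """log_dens: (R, c) log p(x_k | class i) from each class's mixture model;
        log_priors: (c,) log P(class i)."""
        log_joint = log_dens + log_priors[None, :]                   # log p(x_k, class i)
        log_px = np.logaddexp.reduce(log_joint, axis=1)              # log p(x_k)
        posteriors = np.exp(log_joint - log_px[:, None])             # P(class i | x_k)
        labels = posteriors.argmax(axis=1)
        top2 = np.sort(posteriors, axis=1)[:, -2:]
        ambiguous = (top2[:, 1] - top2[:, 0]) < ambiguous_margin     # "cyan" points
        anomalous = log_px < anomaly_log_density                     # "yellow" points
        return labels, ambiguous, anomalous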
• 29. Final Comments: remember, E.M. can get stuck in local optima, and empirically it DOES. Our unsupervised learning example assumed the P(wi)'s known and the variances fixed and known; it is easy to relax this. It's possible to do Bayesian unsupervised learning instead of maximum likelihood. There are other algorithms for unsupervised learning: we'll visit K-means soon, and hierarchical clustering is also interesting. Neural-net algorithms called "competitive learning" turn out to have interesting parallels with the EM method we saw. What you should know: how to "learn" maximum-likelihood parameters (locally maximum-likelihood) in the case of unlabeled data; be happy with this kind of probabilistic analysis; understand the two examples of E.M. given in these notes. For more info, see Duda + Hart — it's a great book, and there's much more in the book than in your handout. (Slides 57–58)
• 30. Other unsupervised learning methods: K-means (see next lecture); hierarchical clustering (e.g. minimum spanning trees) (see next lecture); Principal Component Analysis — a simple, useful tool; non-linear PCA — neural auto-associators, locally weighted PCA, others… (Slide 59)