Entropy and Information Gain

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, 2003, Andrew W. Moore
Bits

You are watching a set of independent random samples of X. You see that X has four possible values:

P(X=A) = 1/4   P(X=B) = 1/4   P(X=C) = 1/4   P(X=D) = 1/4

So you might see: BAACBADCDADDDA…

You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11):

0100001001001110110011111100…
Fewer Bits

Someone tells you that the probabilities are not equal:

P(X=A) = 1/2   P(X=B) = 1/4   P(X=C) = 1/8   P(X=D) = 1/8

It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? One way:

    A   0
    B   10
    C   110
    D   111

(This is just one of several ways.)
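As a quick sanity check (not on the original slide), the expected code length under this prefix code matches the claimed 1.75 bits per symbol:

$$E[\text{bits per symbol}] = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3 = 1.75$$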
Fewer Bits

Suppose there are three equally likely values:

P(X=A) = 1/3   P(X=B) = 1/3   P(X=C) = 1/3

Here's a naïve coding, costing 2 bits per symbol:

    A   00
    B   01
    C   10

Can you think of a coding that would need only 1.6 bits per symbol on average?

In theory, it can in fact be done with 1.58496 bits per symbol.
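One way to hit 1.6 bits per symbol (an illustration, not from the slides): code the symbols in blocks of five. There are $3^5 = 243 \le 256 = 2^8$ possible blocks, so 8 bits suffice for every block of 5 symbols, i.e. 8/5 = 1.6 bits per symbol. The theoretical limit is

$$\log_2 3 \approx 1.58496 \text{ bits per symbol.}$$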
General Case

Suppose X can have one of m values… V1, V2, …, Vm

P(X=V1) = p1   P(X=V2) = p2   …   P(X=Vm) = pm

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's

$$H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_m \log_2 p_m = -\sum_{j=1}^{m} p_j \log_2 p_j$$

H(X) = the entropy of X (Shannon, 1948)

• "High entropy" means X is from a uniform (boring) distribution
• "Low entropy" means X is from a varied (peaks and valleys) distribution
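A minimal sketch of this formula in Python (not part of the original slides), checked against the three distributions used earlier in the deck:

    from math import log2

    def entropy(probs):
        """H(X) = -sum p_j log2 p_j, in bits; terms with p_j = 0 contribute 0."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([1/4, 1/4, 1/4, 1/4]))   # 2.0     bits (slide 2)
    print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75    bits (slide 3)
    print(entropy([1/3, 1/3, 1/3]))        # 1.58496 bits (slide 5)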
General Case

The same slide, annotated:

• High entropy: a histogram of the frequency distribution of values of X would be flat… and so the values sampled from it would be all over the place.
• Low entropy: a histogram of the frequency distribution of values of X would have many lows and one or two highs… and so the values sampled from it would be more predictable.
Entropy in a nutshell

Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.

High entropy: the values (locations of soup) are unpredictable… almost uniformly sampled throughout our dining room.
Entropy of a PDF

$$\text{Entropy of } X = H[X] = -\int_{x=-\infty}^{\infty} p(x) \log p(x)\, dx$$

(Natural log, ln or log_e.)

The larger the entropy of a distribution…
…the harder it is to predict
…the harder it is to compress it
…the less spiky the distribution
1
           The “box”                                           w if
                                                      p( x)  
                                                                          | x |
                                                                                 w
                                                                                 2
          distribution                                         0 if
                                                              
                                                                          | x |
                                                                                 w
                                                                                 2




       1/w




                          -w/2                    0        w/2
                                          w/ 2                        w/ 2
                                              1    1      1    1
H [ X ]    p( x) log p( x)dx              log dx   log     wdx  log w
           x                    x  w / 2
                                              w    w      w    w x / 2
 Copyright © 2001, 2003, Andrew W. Moore                         Information Gain: Slide 12
1
   Unit variance                                       w if
                                              p( x)  
                                                                  | x |
                                                                         w
                                                                         2
  box distribution                                     0 if
                                                      
                                                                  | x |
                                                                         w
                                                                         2

                                                       E[ X ]  0
        1
                                                                w2
      2 3                                            Var[ X ] 
                                                                12


                       3                 0         3
if w  2 3 then Var[ X ]  1 and H [ X ]  1.242
Copyright © 2001, 2003, Andrew W. Moore                  Information Gain: Slide 13
The hat distribution

$$p(x) = \begin{cases} \dfrac{w - |x|}{w^2} & \text{if } |x| \le w \\ 0 & \text{if } |x| > w \end{cases}
\qquad E[X] = 0, \qquad \mathrm{Var}[X] = \frac{w^2}{6}$$

(The density is a triangle of height 1/w running from −w to w.)
Unit variance hat distribution

$$p(x) = \begin{cases} \dfrac{w - |x|}{w^2} & \text{if } |x| \le w \\ 0 & \text{if } |x| > w \end{cases}
\qquad E[X] = 0, \qquad \mathrm{Var}[X] = \frac{w^2}{6}$$

(The density has height $1/\sqrt{6}$ and runs from $-\sqrt{6}$ to $\sqrt{6}$.)

If $w = \sqrt{6}$ then $\mathrm{Var}[X] = 1$ and $H[X] = 1.396$.
The "2 spikes" distribution

$$p(x) = \frac{\delta(x-1) + \delta(x+1)}{2}
\qquad E[X] = 0, \qquad \mathrm{Var}[X] = 1$$

(Two Dirac delta spikes of weight 1/2 each, at x = −1 and x = +1.)

$$H[X] = -\int_{x=-\infty}^{\infty} p(x)\log p(x)\,dx = -\infty$$
Entropies of unit-variance distributions

    Distribution    Entropy
    Box             1.242
    Hat             1.396
    2 spikes        −∞
    ???             1.4189   ← the largest possible entropy of any unit-variance distribution
Unit variance Gaussian

$$p(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right)
\qquad E[X] = 0, \qquad \mathrm{Var}[X] = 1$$

$$H[X] = -\int_{x=-\infty}^{\infty} p(x)\log p(x)\,dx = 1.4189$$
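A small numerical check of this table (not from the slides), integrating $-\int p(x)\ln p(x)\,dx$ with a simple Riemann sum for the three finite-entropy unit-variance densities:

    import numpy as np

    def diff_entropy(pdf, lo, hi, n=200000):
        """Approximate -∫ p(x) ln p(x) dx (in nats) by a Riemann sum."""
        x, dx = np.linspace(lo, hi, n, retstep=True)
        p = pdf(x)
        terms = np.where(p > 0, -p * np.log(np.where(p > 0, p, 1.0)), 0.0)
        return terms.sum() * dx

    w_box = 2 * np.sqrt(3)   # width of the unit-variance box
    w_hat = np.sqrt(6)       # half-width of the unit-variance hat
    box   = lambda x: np.where(np.abs(x) <= w_box / 2, 1.0 / w_box, 0.0)
    hat   = lambda x: np.where(np.abs(x) <= w_hat, (w_hat - np.abs(x)) / w_hat**2, 0.0)
    gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    print(diff_entropy(box, -3, 3))      # ≈ 1.242  (= ln 2√3)
    print(diff_entropy(hat, -3, 3))      # ≈ 1.396  (= 1/2 + ln √6)
    print(diff_entropy(gauss, -10, 10))  # ≈ 1.4189 (= ½ ln 2πe)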
Specific Conditional Entropy H(Y|X=v)

Suppose I'm trying to predict output Y and I have input X.

X = College Major
Y = Likes "Gladiator"

      X          Y
   Math        Yes
   History     No
   CS          Yes
   Math        No
   Math        No
   CS          Yes
   History     No
   Math        Yes

Let's assume this reflects the true probabilities. E.g., from this data we estimate
   • P(LikeG = Yes) = 0.5
   • P(Major = Math & LikeG = No) = 0.25
   • P(Major = Math) = 0.5
   • P(LikeG = Yes | Major = History) = 0

Note:
   • H(X) = 1.5
   • H(Y) = 1
Specific Conditional Entropy H(Y|X=v)

X = College Major
Y = Likes "Gladiator"

Definition of Specific Conditional Entropy:

H(Y|X=v) = the entropy of Y among only those records in which X has value v

Example (from the table above):
   • H(Y|X=Math) = 1
   • H(Y|X=History) = 0
   • H(Y|X=CS) = 0
Conditional Entropy H(Y|X)

X = College Major
Y = Likes "Gladiator"

Definition of Conditional Entropy:

H(Y|X) = the average specific conditional entropy of Y

= if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X

= expected number of bits to transmit Y if both sides will know the value of X

= Σj Prob(X=vj) H(Y|X=vj)
Conditional Entropy

X = College Major
Y = Likes "Gladiator"

H(Y|X) = the average conditional entropy of Y = Σj Prob(X=vj) H(Y|X=vj)

Example (from the table above):

   vj        Prob(X=vj)   H(Y|X=vj)
   Math      0.5          1
   History   0.25         0
   CS        0.25         0

H(Y|X) = 0.5 × 1 + 0.25 × 0 + 0.25 × 0 = 0.5
Information Gain

X = College Major
Y = Likes "Gladiator"

Definition of Information Gain:

IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?

IG(Y|X) = H(Y) − H(Y|X)

Example:
   • H(Y) = 1
   • H(Y|X) = 0.5
   • Thus IG(Y|X) = 1 − 0.5 = 0.5
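A short Python sketch (not from the slides) that reproduces these numbers from the eight-row College Major / Gladiator table:

    from collections import Counter
    from math import log2

    def H(values):
        """Empirical entropy (bits) of a list of observed values."""
        n = len(values)
        return -sum((c / n) * log2(c / n) for c in Counter(values).values())

    def H_cond(xs, ys):
        """H(Y|X) = sum over v of Prob(X=v) * H(Y | X=v)."""
        n = len(xs)
        return sum(
            (xs.count(v) / n) * H([y for x, y in zip(xs, ys) if x == v])
            for v in set(xs)
        )

    major = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
    likes = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

    print(H(major))                           # 1.5
    print(H(likes))                           # 1.0
    print(H_cond(major, likes))               # 0.5
    print(H(likes) - H_cond(major, likes))    # IG(Y|X) = 0.5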
Relative Entropy: Kullback-Leibler Distance

$$D(p, q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}$$
Mutual Information

A quantity that measures the mutual dependence of two random variables (here q(y) denotes the marginal distribution of Y):

$$I(X, Y) = \sum_{x}\sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,q(y)}$$

$$I(X, Y) = \int_{Y}\int_{X} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,q(y)}\, dx\, dy$$

$$I(X, Y \mid C) = \sum p(x, y \mid c) \log_2 \frac{p(x, y \mid c)}{p(x \mid c)\,q(y \mid c)}$$
Mutual Information

I(X,Y) = H(Y) − H(Y|X):

$$I(X, Y) = \sum_{x}\sum_{y} p(x, y) \log_2 \frac{p(y \mid x)}{q(y)}$$

$$I(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log_2 q(y) + \sum_{x}\sum_{y} p(x, y) \log_2 p(y \mid x)$$

$$I(X, Y) = -\sum_{y} q(y) \log_2 q(y) + \sum_{x} p(x) \sum_{y} p(y \mid x) \log_2 p(y \mid x) = H(Y) - H(Y \mid X)$$
Mutual information

•   I(X,Y) = H(Y) − H(Y|X)
•   I(X,Y) = H(X) − H(X|Y)
•   I(X,Y) = H(X) + H(Y) − H(X,Y)
•   I(X,Y) = I(Y,X)
•   I(X,X) = H(X)
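A brief check of the first identity (my own illustration, reusing the College Major / Gladiator table): computing I(X,Y) directly from the joint distribution gives the same 0.5 bits as H(Y) − H(Y|X).

    from collections import Counter
    from math import log2

    major = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
    likes = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
    n = len(major)

    p_xy = {k: c / n for k, c in Counter(zip(major, likes)).items()}
    p_x  = {k: c / n for k, c in Counter(major).items()}
    p_y  = {k: c / n for k, c in Counter(likes).items()}

    I = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
    print(I)   # 0.5 bits, matching IG(Y|X) = H(Y) - H(Y|X) from the previous slides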
Information Gain Example

(figure omitted)

Another example

(figure omitted)
Relative Information Gain

X = College Major
Y = Likes "Gladiator"

Definition of Relative Information Gain:

RIG(Y|X) = I must transmit Y. What fraction of the bits on average would it save me if both ends of the line knew X?

RIG(Y|X) = [H(Y) − H(Y|X)] / H(Y)

Example:
   • H(Y|X) = 0.5
   • H(Y) = 1
   • Thus RIG(Y|X) = (1 − 0.5)/1 = 0.5
What is Information Gain used for?

Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find…

   • IG(LongLife | HairColor) = 0.01
   • IG(LongLife | Smoker) = 0.2
   • IG(LongLife | Gender) = 0.25
   • IG(LongLife | LastDigitOfSSN) = 0.00001

IG tells you how interesting a 2-d contingency table is going to be.
Cross Entropy

Let X be a random variable with known distribution p(x) and estimated distribution q(x). The cross entropy measures the difference between the two distributions and is defined by

$$H_C(X) = E[-\log q(X)] = H(X) + KL(p, q)$$

where H(X) is the entropy of X with respect to the distribution p and KL is the Kullback-Leibler distance between p and q.

If p and q are discrete this reduces to:

$$H_C(X) = -\sum_{x} p(x) \log_2 q(x)$$

and for continuous p and q we have

$$H_C(X) = -\int p(x) \log q(x)\, dx$$
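A quick numerical illustration (my own, not on the slides), tying cross entropy and KL back to the four-symbol coding example at the start of the deck: if the true distribution is p = (1/2, 1/4, 1/8, 1/8) but you build your code for a uniform q, you pay 2 bits per symbol instead of 1.75, and the 0.25-bit overhead is exactly KL(p, q).

    from math import log2

    def entropy(p):
        return -sum(pi * log2(pi) for pi in p if pi > 0)

    def cross_entropy(p, q):
        return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def kl(p, q):
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [1/2, 1/4, 1/8, 1/8]    # true symbol probabilities
    q = [1/4, 1/4, 1/4, 1/4]    # the model behind the naive 2-bit code

    print(entropy(p))           # 1.75
    print(cross_entropy(p, q))  # 2.0  = H(p) + KL(p, q)
    print(kl(p, q))             # 0.25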
Bivariate Gaussians

Write the r.v. $\mathbf{X} = \begin{pmatrix} X \\ Y \end{pmatrix}$. Then define $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$ to mean

$$p(\mathbf{x}) = \frac{1}{2\pi \|\Sigma\|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

where the Gaussian's parameters are

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \qquad
\Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$$

and where we insist that Σ is symmetric non-negative definite.

It turns out that E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition.)
Evaluating p(x)

$$p(\mathbf{x}) = \frac{1}{2\pi \|\Sigma\|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

1.  Begin with vector x.
2.  Define δ = x − μ.
3.  Count the number of contours crossed of the ellipsoids formed by Σ⁻¹ (contours defined by $\sqrt{\delta^T \Sigma^{-1} \delta} = \text{constant}$).
    D = this count = $\sqrt{\delta^T \Sigma^{-1} \delta}$ = the Mahalanobis distance between x and μ.
4.  Define w = exp(−D²/2).
    An x close to μ in squared Mahalanobis distance gets a large weight; far away gets a tiny weight, falling off as exp(−D²/2).
5.  Multiply w by $\dfrac{1}{2\pi \|\Sigma\|^{1/2}}$ to ensure $\int p(\mathbf{x})\, d\mathbf{x} = 1$.
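A compact Python sketch of these five steps (an illustration, not from the slides), written for any dimension m:

    import numpy as np

    def gaussian_pdf(x, mu, Sigma):
        """Evaluate a multivariate Gaussian density via the slide's steps 1-5."""
        x, mu, Sigma = np.asarray(x, float), np.asarray(mu, float), np.asarray(Sigma, float)
        delta = x - mu                                 # step 2
        D2 = delta @ np.linalg.solve(Sigma, delta)     # step 3: squared Mahalanobis distance
        w = np.exp(-D2 / 2.0)                          # step 4: unnormalized weight
        m = mu.size
        norm = (2.0 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
        return w / norm                                # step 5: normalize so the density integrates to 1

    mu = np.array([0.0, 0.0])
    Sigma = np.eye(2)
    print(gaussian_pdf([0.0, 0.0], mu, Sigma))   # 1/(2π) ≈ 0.1592 for the standard bivariate normal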
Bivariate normal N(0, 0, 1, 1, 0)

(Perspective plot of the density, produced in R with:)
    persp(x,y,a,theta=30,phi=10,zlab="f(x,y)",box=FALSE,col=4)

(Filled contour plot of the same density, produced with:)
    filled.contour(x,y,a,nlevels=4,col=2:5)
Multivariate Gaussians

Write the r.v. $\mathbf{X} = (X_1, X_2, \ldots, X_m)^T$. Then define $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$ to mean

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{m/2} \|\Sigma\|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

where the Gaussian's parameters are

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} \qquad
\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}$$

and where we insist that Σ is symmetric non-negative definite.

Again, E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition.)
General Gaussians

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} \qquad
\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}$$

(The figure, omitted here, shows elliptical contours in the (x1, x2) plane, in general not aligned with the axes.)
Axis-Aligned Gaussians

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} \qquad
\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\
0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\
0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\
0 & 0 & 0 & \cdots & 0 & \sigma_m^2
\end{pmatrix}$$

$X_i \perp X_j$ for $i \ne j$

(The figure, omitted here, shows elliptical contours whose axes are aligned with the coordinate axes.)
Spherical Gaussians

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} \qquad
\Sigma = \begin{pmatrix}
\sigma^2 & 0 & 0 & \cdots & 0 & 0 \\
0 & \sigma^2 & 0 & \cdots & 0 & 0 \\
0 & 0 & \sigma^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \sigma^2 & 0 \\
0 & 0 & 0 & \cdots & 0 & \sigma^2
\end{pmatrix}$$

$X_i \perp X_j$ for $i \ne j$

(The figure, omitted here, shows circular contours.)
Subsets of variables

Write $\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}$ as $\mathbf{X} = \begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix}$ where
$\mathbf{U} = \begin{pmatrix} X_1 \\ \vdots \\ X_{m(u)} \end{pmatrix}$ and $\mathbf{V} = \begin{pmatrix} X_{m(u)+1} \\ \vdots \\ X_m \end{pmatrix}$.

This will be our standard notation for breaking an m-dimensional distribution into subsets of variables.
Gaussian Marginals are Gaussian

(U, V) → marginalize → U

Write $\mathbf{X} = \begin{pmatrix} X_1 \\ \vdots \\ X_m \end{pmatrix}$ as $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix}$ where
$\mathbf{U} = \begin{pmatrix} X_1 \\ \vdots \\ X_{m(u)} \end{pmatrix}$ and $\mathbf{V} = \begin{pmatrix} X_{m(u)+1} \\ \vdots \\ X_m \end{pmatrix}$.

IF $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N\!\left( \begin{pmatrix} \boldsymbol{\mu}_u \\ \boldsymbol{\mu}_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$

THEN U is also distributed as a Gaussian: $\mathbf{U} \sim N(\boldsymbol{\mu}_u, \Sigma_{uu})$.

This fact is not immediately obvious, but it becomes obvious once we know the marginal is a Gaussian (why?). How would you prove it? $p(\mathbf{u}) = \int_{\mathbf{v}} p(\mathbf{u}, \mathbf{v})\, d\mathbf{v}$ (snore…).
Linear Transforms remain Gaussian

X → multiply by matrix A → AX

Assume X is an m-dimensional Gaussian r.v.: $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$.

Define Y to be a p-dimensional r.v. thusly (note p ≤ m):

$$\mathbf{Y} = A\mathbf{X}$$

…where A is a p × m matrix. Then…

$$\mathbf{Y} \sim N(A\boldsymbol{\mu},\, A \Sigma A^T)$$
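A quick empirical check (my own example numbers, not from the slides): sample X, apply A, and compare the sample mean and variance of Y = AX with Aμ and AΣAᵀ.

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    A = np.array([[1.0, 1.0]])            # p = 1, m = 2: Y = X1 + X2

    X = rng.multivariate_normal(mu, Sigma, size=200_000)
    Y = X @ A.T

    print(Y.mean(), Y.var())              # ≈ -1.0 and ≈ 4.2 (sampling noise aside)
    print(A @ mu, A @ Sigma @ A.T)        # [-1.]  [[4.2]]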
Adding samples of 2 independent Gaussians is Gaussian

(X, Y) → + → X + Y

If $\mathbf{X} \sim N(\boldsymbol{\mu}_x, \Sigma_x)$ and $\mathbf{Y} \sim N(\boldsymbol{\mu}_y, \Sigma_y)$ and $\mathbf{X} \perp \mathbf{Y}$,
then $\mathbf{X} + \mathbf{Y} \sim N(\boldsymbol{\mu}_x + \boldsymbol{\mu}_y,\, \Sigma_x + \Sigma_y)$.

Why doesn't this hold if X and Y are dependent? Which of the statements below is true?

• If X and Y are dependent, then X+Y is Gaussian but possibly with some other covariance.
• If X and Y are dependent, then X+Y might be non-Gaussian.
Conditional of Gaussian is Gaussian

(U, V) → conditionalize → U | V

IF $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N\!\left( \begin{pmatrix} \boldsymbol{\mu}_u \\ \boldsymbol{\mu}_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$

THEN $\mathbf{U} \mid \mathbf{V} \sim N(\boldsymbol{\mu}_{u|v}, \Sigma_{u|v})$ where

$$\boldsymbol{\mu}_{u|v} = \boldsymbol{\mu}_u + \Sigma_{uv} \Sigma_{vv}^{-1} (\mathbf{V} - \boldsymbol{\mu}_v)$$

$$\Sigma_{u|v} = \Sigma_{uu} - \Sigma_{uv} \Sigma_{vv}^{-1} \Sigma_{uv}^T$$
Example:

IF $\begin{pmatrix} w \\ y \end{pmatrix} \sim N\!\left( \begin{pmatrix} 2977 \\ 76 \end{pmatrix}, \begin{pmatrix} 849^2 & 967 \\ 967 & 3.68^2 \end{pmatrix} \right)$

THEN $w \mid y \sim N(\mu_{w|y}, \Sigma_{w|y})$ where

$$\mu_{w|y} = 2977 + \frac{967\,(y - 76)}{3.68^2}$$

$$\Sigma_{w|y} = 849^2 - \frac{967^2}{3.68^2} \approx 808^2$$

(The accompanying figure, omitted here, overlays the broad marginal P(w) with the much narrower conditionals P(w|m=76) and P(w|m=82).)

Note: when the given value of v equals μv, the conditional mean of u is μu.
Note: the conditional mean is a linear function of v.
Note: the conditional variance is independent of the given value of v.
Note: the conditional variance can only be equal to or smaller than the marginal variance.
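A small Python sketch of the conditioning formulas (my own illustration; the query value y = 82 is an assumption chosen to match the P(w|m=82) curve in the figure):

    import numpy as np

    def condition_gaussian(mu_u, mu_v, S_uu, S_uv, S_vv, v):
        """Mean and covariance of U | V=v for a jointly Gaussian (U, V)."""
        S_vv_inv = np.linalg.inv(np.atleast_2d(S_vv))
        mu = mu_u + S_uv @ S_vv_inv @ (np.atleast_1d(v) - mu_v)
        S = S_uu - S_uv @ S_vv_inv @ S_uv.T
        return mu, S

    mu, S = condition_gaussian(
        mu_u=np.array([2977.0]), mu_v=np.array([76.0]),
        S_uu=np.array([[849.0**2]]), S_uv=np.array([[967.0]]), S_vv=np.array([[3.68**2]]),
        v=np.array([82.0]),
    )
    print(mu)            # ≈ [3405.4]  = 2977 + 967*(82-76)/3.68²
    print(np.sqrt(S))    # ≈ [[807.3]] ≈ 808, independent of the observed y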
Gaussians and the chain rule

(U|V, V) → chain rule → (U, V)

Let A be a constant matrix.

IF $\mathbf{U} \mid \mathbf{V} \sim N(A\mathbf{V}, \Sigma_{u|v})$ and $\mathbf{V} \sim N(\boldsymbol{\mu}_v, \Sigma_{vv})$

THEN $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N(\boldsymbol{\mu}, \Sigma)$, with

$$\boldsymbol{\mu} = \begin{pmatrix} A\boldsymbol{\mu}_v \\ \boldsymbol{\mu}_v \end{pmatrix} \qquad
\Sigma = \begin{pmatrix} A \Sigma_{vv} A^T + \Sigma_{u|v} & A\Sigma_{vv} \\ (A\Sigma_{vv})^T & \Sigma_{vv} \end{pmatrix}$$
Available Gaussian tools

• Marginalize: (U, V) → U.
  IF $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N\!\left( \begin{pmatrix} \boldsymbol{\mu}_u \\ \boldsymbol{\mu}_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$ THEN $\mathbf{U} \sim N(\boldsymbol{\mu}_u, \Sigma_{uu})$.

• Matrix multiply: X → AX.
  IF $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$ AND $\mathbf{Y} = A\mathbf{X}$ THEN $\mathbf{Y} \sim N(A\boldsymbol{\mu}, A\Sigma A^T)$.

• Add independent Gaussians: (X, Y) → X + Y.
  If $\mathbf{X} \sim N(\boldsymbol{\mu}_x, \Sigma_x)$ and $\mathbf{Y} \sim N(\boldsymbol{\mu}_y, \Sigma_y)$ and $\mathbf{X} \perp \mathbf{Y}$, then $\mathbf{X} + \mathbf{Y} \sim N(\boldsymbol{\mu}_x + \boldsymbol{\mu}_y, \Sigma_x + \Sigma_y)$.

• Conditionalize: (U, V) → U | V.
  IF $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N\!\left( \begin{pmatrix} \boldsymbol{\mu}_u \\ \boldsymbol{\mu}_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$ THEN $\mathbf{U} \mid \mathbf{V} \sim N(\boldsymbol{\mu}_{u|v}, \Sigma_{u|v})$ where $\boldsymbol{\mu}_{u|v} = \boldsymbol{\mu}_u + \Sigma_{uv}\Sigma_{vv}^{-1}(\mathbf{V} - \boldsymbol{\mu}_v)$ and $\Sigma_{u|v} = \Sigma_{uu} - \Sigma_{uv}\Sigma_{vv}^{-1}\Sigma_{uv}^T$.

• Chain rule: (U|V, V) → (U, V).
  IF $\mathbf{U} \mid \mathbf{V} \sim N(A\mathbf{V}, \Sigma_{u|v})$ and $\mathbf{V} \sim N(\boldsymbol{\mu}_v, \Sigma_{vv})$ THEN $\begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix} \sim N(\boldsymbol{\mu}, \Sigma)$ with $\boldsymbol{\mu} = \begin{pmatrix} A\boldsymbol{\mu}_v \\ \boldsymbol{\mu}_v \end{pmatrix}$, $\Sigma = \begin{pmatrix} A\Sigma_{vv}A^T + \Sigma_{u|v} & A\Sigma_{vv} \\ (A\Sigma_{vv})^T & \Sigma_{vv} \end{pmatrix}$.
Assume…
• You are an intellectual snob
• You have a child




Intellectual snobs with children

• …are obsessed with IQ
• In the world as a whole, IQs are drawn from a Gaussian N(100, 15²)
IQ tests

• If you take an IQ test you'll get a score that, on average (over many tests), will be your IQ.
• But because of noise on any one test, the score will often be a few points lower or higher than your true IQ.

SCORE | IQ ~ N(IQ, 10²)
Assume…

• You drag your kid off to get tested
• She gets a score of 130
• "Yippee," you screech, and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter.

P(X < 130 | μ=100, σ²=15²) = P(Z < 2 | μ=0, σ²=1) = Φ(2) ≈ 0.977
Assume…

• You drag your kid off to get tested
• She gets a score of 130
• "Yippee," you screech, and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter.

You are thinking: "Well sure, the test isn't accurate, so she might have an IQ of 120 or she might have an IQ of 140, but the most likely IQ given the evidence 'score = 130' is, of course, 130."

P(X < 130 | μ=100, σ²=15²) = P(Z < 2 | μ=0, σ²=1) = Φ(2) ≈ 0.977

Can we trust this reasoning?
What we really want:

• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130

• Question: What is IQ | (S=130)?

This is called the posterior distribution of IQ.
Which tool or tools?
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130

• Question: What is IQ | (S=130)?

  Available tools:
    Marginalize:      (U, V)   →  U
    Matrix Multiply:  X        →  A X
    Plus:             X, Y     →  X + Y
    Conditionalize:   (U, V)   →  U | V
    Chain Rule:       U | V, V →  (U, V)
Copyright © 2001, 2003, Andrew W. Moore                Information Gain: Slide 65
Plan
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130

• Question: What is IQ | (S=130)?

  ( S | IQ,  IQ )  —Chain Rule→  (S, IQ)  —Swap→  (IQ, S)  —Conditionalize→  IQ | S
Copyright © 2001, 2003, Andrew W. Moore                           Information Gain: Slide 66
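
To make the first step of the plan concrete (a worked step added here, not on the original slide): apply the chain rule with A = 1, Σ_{u|v} = 10², μ_v = 100, Σ_{vv} = 15², which gives the joint distribution of score and IQ:

```latex
\begin{pmatrix} S \\ \mathrm{IQ} \end{pmatrix}
\sim N\!\left(
  \begin{pmatrix} 100 \\ 100 \end{pmatrix},
  \begin{pmatrix} 15^2 + 10^2 & 15^2 \\ 15^2 & 15^2 \end{pmatrix}
\right)
= N\!\left(
  \begin{pmatrix} 100 \\ 100 \end{pmatrix},
  \begin{pmatrix} 325 & 225 \\ 225 & 225 \end{pmatrix}
\right)
```

Swapping the two coordinates gives (IQ, S) ~ N((100, 100), [[225, 225], [225, 325]]), which is the form the conditionalize rule on the next slide needs.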
Working…

  IF (U; V) ~ N( (μ_u; μ_v),  [ Σ_{uu}     Σ_{uv}
                                Σ_{uv}^T   Σ_{vv} ] )
  THEN  μ_{u|v} = μ_u + Σ_{uv}^T Σ_{vv}^{-1} (V − μ_v)

IQ ~ N(100, 15²)
S | IQ ~ N(IQ, 10²)
S = 130

Question: What is IQ | (S=130)?

  IF U | V ~ N(A V, Σ_{u|v}) and V ~ N(μ_v, Σ_{vv})

  THEN (U; V) ~ N(μ, Σ), with  Σ = [ A Σ_{vv} A^T + Σ_{u|v}    A Σ_{vv}
                                     (A Σ_{vv})^T              Σ_{vv}   ]

Copyright © 2001, 2003, Andrew W. Moore                                        Information Gain: Slide 67
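
The slide stops at the formulas. Here is a minimal sketch finishing the arithmetic (my own addition; the variable names are hypothetical), following the plan of chain rule then conditionalize on S = 130:

```python
import math

# Model from the slides
mu_iq, var_iq = 100.0, 15.0 ** 2     # prior: IQ ~ N(100, 15^2)
var_noise = 10.0 ** 2                # test noise: S | IQ ~ N(IQ, 10^2)
s_obs = 130.0

# Chain rule with A = 1: joint of (IQ, S)
cov_iq_s = var_iq                    # Cov(IQ, S) = 225
var_s = var_iq + var_noise           # Var(S)     = 325

# Conditionalize on S = 130
post_mean = mu_iq + cov_iq_s / var_s * (s_obs - mu_iq)   # ~120.77
post_var = var_iq - cov_iq_s ** 2 / var_s                # ~69.23
print(post_mean, math.sqrt(post_var))                    # 120.77..., 8.32...
```

So the posterior is roughly IQ | (S=130) ~ N(120.8, 8.3²): the most likely IQ given the evidence is well below 130, because the N(100, 15²) prior pulls the noisy score back toward the population mean. That is why the reasoning on the earlier slide cannot be trusted.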
