2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy
- 1. Entropy and Information Gain
Note to other teachers and users of these slides.
Andrew would be delighted if you found this source
material useful in giving your own lectures. Feel free
to use these slides verbatim, or to modify them to fit
your own needs. PowerPoint originals are available. If
you make use of a significant portion of these slides in
your own lecture, please include this message, or the
following link to the source repository of Andrew’s
tutorials: http://www.cs.cmu.edu/~awm/tutorials .
Comments and corrections gratefully received.
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Copyright © 2001, 2003, Andrew W. Moore
- 2. Bits
You are watching a set of independent random samples of X
You see that X has four possible values
P(X=A) = 1/4 P(X=B) = 1/4 P(X=C) = 1/4 P(X=D) = 1/4
So you might see: BAACBADCDADDDA…
You transmit data over a binary serial link. You can encode
each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11)
0100001001001110110011111100…
- 4. Fewer Bits
Someone tells you that the probabilities are not equal
P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8
It’s possible…
…to invent a coding for your transmission that only uses
1.75 bits on average per symbol. How?
A 0
B 10
C 110
D 111
(This is just one of several ways)
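The 1.75-bit figure can be checked directly. The following is a minimal sketch, assuming the code table and probabilities from this slide; the helper names (`expected_length`, `encode`, `decode`) are illustrative and not part of the original deck.

```python
# Prefix code from the slide: no codeword is a prefix of another codeword.
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
PROBS = {"A": 1 / 2, "B": 1 / 4, "C": 1 / 8, "D": 1 / 8}


def expected_length(code, probs):
    """Average number of bits per symbol under the given distribution."""
    return sum(probs[s] * len(code[s]) for s in code)


def encode(symbols, code):
    """Concatenate the codewords of the symbols."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Greedy left-to-right decoding; valid because the code is prefix-free."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)


avg_bits = expected_length(CODE, PROBS)  # 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3
message = "BAACBADC"
round_trip = decode(encode(message, CODE), CODE)
print(avg_bits, round_trip == message)
```

The frequent symbol A gets the shortest codeword, which is what pulls the average below 2 bits.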
- 5. Fewer Bits
Suppose there are three equally likely values…
P(X=A) = 1/3 P(X=B) = 1/3 P(X=C) = 1/3
Here’s a naïve coding, costing 2 bits per symbol
A 00
B 01
C 10
Can you think of a coding that would need only 1.6 bits
per symbol on average?
In theory, it can in fact be done with 1.58496 bits per
symbol.
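One possible answer to the 1.6-bit question is to code symbols in blocks rather than one at a time. The sketch below assumes blocks of five symbols, which is a choice made here for illustration: there are 3^5 = 243 possible blocks, and 243 fits into the 256 codewords available with 8 bits.

```python
import math
from itertools import product

SYMBOLS = "ABC"
BLOCK = 5  # symbols per block (illustrative choice)
BITS = 8   # bits per block, since 3**5 = 243 <= 2**8 = 256

# Give each of the 243 possible 5-symbol blocks its own 8-bit codeword.
table = {
    "".join(block): format(i, "08b")
    for i, block in enumerate(product(SYMBOLS, repeat=BLOCK))
}

bits_per_symbol = BITS / BLOCK  # 8 bits spread over 5 symbols
lower_bound = math.log2(3)      # the theoretical limit quoted on the slide
print(len(table), bits_per_symbol, round(lower_bound, 5))
```

Longer blocks can get closer to log2(3) bits per symbol, at the cost of a larger table.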
- 6. General Case
Suppose X can have one of m values… V1, V2, … Vm
P(X=V1) = p1 P(X=V2) = p2 …. P(X=Vm) = pm
What’s the smallest possible number of bits, on average, per
symbol, needed to transmit a stream of symbols drawn from
X’s distribution? It’s
$$H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_m \log_2 p_m = -\sum_{j=1}^{m} p_j \log_2 p_j$$
H(X) = The entropy of X (Shannon, 1948)
• “High Entropy” means X is from a uniform (boring) distribution
• “Low Entropy” means X is from a varied (peaks and valleys) distribution
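The entropy formula can be sketched in a few lines. This is a minimal illustration, assuming the distribution is supplied as a plain list of probabilities; the function name `entropy` is chosen here and does not come from the slides.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


h_uniform = entropy([1 / 4, 1 / 4, 1 / 4, 1 / 4])  # four equally likely values
h_skewed = entropy([1 / 2, 1 / 4, 1 / 8, 1 / 8])   # the "Fewer Bits" distribution
h_three = entropy([1 / 3, 1 / 3, 1 / 3])           # three equally likely values
print(h_uniform, h_skewed, h_three)
```

The three results correspond to the 2-bit, 1.75-bit and 1.58496-bit figures quoted on the earlier slides.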
- 8. General Case
Suppose X can have one of m values… V1, V2, … Vm
P(X=V1) = p1 P(X=V2) = p2 …. P(X=Vm) = pm
What’s the smallest possible number of bits, on average, per
symbol, needed to transmit a stream of symbols drawn from
X’s distribution? It’s
$$H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_m \log_2 p_m = -\sum_{j=1}^{m} p_j \log_2 p_j$$
H(X) = The entropy of X
• “High Entropy” means X is from a uniform (boring) distribution. A histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place
• “Low Entropy” means X is from a varied (peaks and valleys) distribution. A histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable
- 9. Entropy in a nut-shell
Low Entropy High Entropy
- 10. Entropy in a nut-shell
Low Entropy: ..the values (locations of soup) sampled entirely from within the soup bowl
High Entropy: ..the values (locations of soup) unpredictable... almost uniformly sampled throughout our dining room
- 11. Entropy of a PDF
Entropy of X: $H[X] = -\int_x p(x) \log p(x)\,dx$
(log here is the natural log, ln or log_e)
The larger the entropy of a distribution…
…the harder it is to predict
…the harder it is to compress it
…the less spiky the distribution
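- 12. The “box” distribution
$$p(x) = \begin{cases} \dfrac{1}{w} & \text{if } |x| \le \dfrac{w}{2} \\ 0 & \text{if } |x| > \dfrac{w}{2} \end{cases}$$
[Plot: p(x) is flat at height 1/w between x = -w/2 and x = w/2]
$$H[X] = -\int_x p(x) \log p(x)\,dx = -\int_{x=-w/2}^{w/2} \frac{1}{w} \log \frac{1}{w}\,dx = \frac{1}{w} \log w \int_{x=-w/2}^{w/2} dx = \log w$$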
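- 13. Unit variance box distribution
$$p(x) = \begin{cases} \dfrac{1}{w} & \text{if } |x| \le \dfrac{w}{2} \\ 0 & \text{if } |x| > \dfrac{w}{2} \end{cases}$$
$$E[X] = 0, \qquad \mathrm{Var}[X] = \frac{w^2}{12}$$
[Plot: p(x) is flat at height 1/(2√3) between x = -√3 and x = √3]
if $w = 2\sqrt{3}$ then $\mathrm{Var}[X] = 1$ and $H[X] = 1.242$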
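- 14. The Hat distribution
$$p(x) = \begin{cases} \dfrac{w - |x|}{w^2} & \text{if } |x| \le w \\ 0 & \text{if } |x| > w \end{cases}$$
$$E[X] = 0, \qquad \mathrm{Var}[X] = \frac{w^2}{6}$$
[Plot: a triangle with peak height 1/w at x = 0, falling to 0 at x = -w and x = w]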
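- 15. Unit variance hat distribution
$$p(x) = \begin{cases} \dfrac{w - |x|}{w^2} & \text{if } |x| \le w \\ 0 & \text{if } |x| > w \end{cases}$$
$$E[X] = 0, \qquad \mathrm{Var}[X] = \frac{w^2}{6}$$
[Plot: a triangle with peak height 1/√6 at x = 0, falling to 0 at x = -√6 and x = √6]
if $w = \sqrt{6}$ then $\mathrm{Var}[X] = 1$ and $H[X] = 1.396$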
- 16. The “2 spikes” distribution
$$p(x) = \frac{\delta(x+1) + \delta(x-1)}{2}$$
where δ is the Dirac delta
$$E[X] = 0, \qquad \mathrm{Var}[X] = 1$$
[Plot: two spikes, $\tfrac{1}{2}\delta(x+1)$ at x = -1 and $\tfrac{1}{2}\delta(x-1)$ at x = 1]
$$H[X] = -\int_x p(x) \log p(x)\,dx = -\infty$$
- 17. Entropies of unit-variance distributions
Distribution   Entropy
Box            1.242
Hat            1.396
2 spikes       -infinity
???            1.4189   (largest possible entropy of any unit-variance distribution)
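The finite entries in this table can be reproduced numerically. The sketch below uses a plain midpoint rule, which is an assumption made here for simplicity (any quadrature routine would do), and the helper name `differential_entropy` is illustrative. The last density is the unit-variance Gaussian, which attains the 1.4189 value.

```python
import math


def differential_entropy(pdf, lo, hi, n=200_000):
    """Midpoint-rule estimate of -integral of p(x) ln p(x) dx over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * dx)
        if p > 0:
            total -= p * math.log(p) * dx
    return total


w_box = 2 * math.sqrt(3)  # box width that gives unit variance
w_hat = math.sqrt(6)      # hat half-width that gives unit variance


def box(x):
    return 1 / w_box if abs(x) <= w_box / 2 else 0.0


def hat(x):
    return (w_hat - abs(x)) / w_hat**2 if abs(x) <= w_hat else 0.0


def gauss(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


h_box = differential_entropy(box, -w_box / 2, w_box / 2)
h_hat = differential_entropy(hat, -w_hat, w_hat)
h_gauss = differential_entropy(gauss, -12, 12)  # tails beyond 12 are negligible
print(round(h_box, 3), round(h_hat, 3), round(h_gauss, 4))
```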
- 18. Unit variance Gaussian
$$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$
$$E[X] = 0, \qquad \mathrm{Var}[X] = 1$$
$$H[X] = -\int_x p(x) \log p(x)\,dx = 1.4189$$
- 19. Specific Conditional Entropy H(Y|X=v)
Suppose I’m trying to predict output Y and I have input X
X = College Major
Y = Likes “Gladiator”
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Let’s assume this reflects the true probabilities
E.G. From this data we estimate
• P(LikeG = Yes) = 0.5
• P(Major = Math & LikeG = No) = 0.25
• P(Major = Math) = 0.5
• P(LikeG = Yes | Major = History) = 0
Note:
• H(X) = 1.5
• H(Y) = 1
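The estimates on this slide follow from simple counting over the eight records. A minimal sketch, with the records typed in from the table and helper names (`entropy_of`, `p_like_yes`, and so on) chosen here for illustration:

```python
import math
from collections import Counter

# The eight (Major, LikesGladiator) records from the slide.
records = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
           ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]


def entropy_of(values):
    """Entropy in bits of the empirical distribution of a list of values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


n = len(records)
p_like_yes = sum(y == "Yes" for _, y in records) / n
p_math_and_no = sum(x == "Math" and y == "No" for x, y in records) / n
p_math = sum(x == "Math" for x, _ in records) / n
history = [y for x, y in records if x == "History"]
p_yes_given_history = history.count("Yes") / len(history)
h_x = entropy_of([x for x, _ in records])
h_y = entropy_of([y for _, y in records])
print(p_like_yes, p_math_and_no, p_math, p_yes_given_history, h_x, h_y)
```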
- 21. Specific Conditional Entropy H(Y|X=v)
X = College Major
Y = Likes “Gladiator”
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Definition of Specific Conditional Entropy:
H(Y|X=v) = The entropy of Y among only those records in which X has value v
Example:
• H(Y|X=Math) = 1
• H(Y|X=History) = 0
• H(Y|X=CS) = 0
- 22. Conditional Entropy H(Y|X)
X = College Major
Y = Likes “Gladiator”
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Definition of Conditional Entropy:
H(Y|X) = The average specific conditional entropy of Y
= if you choose a record at random what will be the conditional entropy of Y, conditioned on that row’s value of X
= Expected number of bits to transmit Y if both sides will know the value of X
= Σj Prob(X=vj) H(Y | X = vj)
- 23. Conditional Entropy
X = College Major
Y = Likes “Gladiator”
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Definition of Conditional Entropy:
H(Y|X) = The average conditional entropy of Y
= Σj Prob(X=vj) H(Y | X = vj)
Example:
vj       Prob(X=vj)  H(Y | X = vj)
Math     0.5         1
History  0.25        0
CS       0.25        0
H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
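The weighted average in this example can be sketched as follows. The function name `conditional_entropy` and its return shape are choices made here for illustration; the records are the eight rows from the slide.

```python
import math
from collections import Counter, defaultdict

records = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
           ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]


def entropy_of(values):
    """Entropy in bits of the empirical distribution of a list of values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(pairs):
    """Return ({v: H(Y|X=v)}, H(Y|X)) for a list of (x, y) records."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    n = len(pairs)
    # Specific conditional entropy: entropy of Y within each group of X.
    specific = {v: entropy_of(ys) for v, ys in groups.items()}
    # Conditional entropy: average of the above, weighted by Prob(X=v).
    average = sum(len(ys) / n * specific[v] for v, ys in groups.items())
    return specific, average


specific, h_y_given_x = conditional_entropy(records)
print(specific, h_y_given_x)
```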
- 24. Information Gain
X = College Major
Y = Likes “Gladiator”
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Definition of Information Gain:
IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
IG(Y|X) = H(Y) - H(Y | X)
Example:
• H(Y) = 1
• H(Y|X) = 0.5
• Thus IG(Y|X) = 1 – 0.5 = 0.5
- 26. Mutual Information
A quantity that measures the mutual dependence of
the two random variables.
$$I(X,Y) = \sum_{x} \sum_{y} p(x,y) \log_2\left(\frac{p(x,y)}{p(x)\,q(y)}\right)$$
$$I(X,Y) = \int_Y \int_X p(x,y) \log_2\left(\frac{p(x,y)}{p(x)\,q(y)}\right) dx\,dy$$
$$I(X,Y|C) = \sum_{x} \sum_{y} p(x,y|c) \log_2\left(\frac{p(x,y|c)}{p(x|c)\,q(y|c)}\right)$$
- 27. Mutual Information
I(X,Y) = H(Y) - H(Y|X)
$$I(X,Y) = \sum_{x} \sum_{y} p(x,y) \log_2\left(\frac{p(y|x)}{q(y)}\right)$$
$$I(X,Y) = -\sum_{x} \sum_{y} p(x,y) \log_2(q(y)) + \sum_{x} \sum_{y} p(x,y) \log_2(p(y|x))$$
$$I(X,Y) = -\sum_{y} q(y) \log_2(q(y)) + \sum_{x} p(x) \sum_{y} p(y|x) \log_2(p(y|x))$$
- 28. Mutual information
• I(X,Y) = H(Y) - H(Y|X)
• I(X,Y) = H(X) - H(X|Y)
• I(X,Y) = H(X) + H(Y) - H(X,Y)
• I(X,Y) = I(Y,X)
• I(X,X) = H(X)
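These identities can be checked numerically on the College Major / “Gladiator” records used earlier in the deck. The sketch below is illustrative only: the variable names are chosen here, and the probabilities are the empirical frequencies of the eight records.

```python
import math
from collections import Counter

records = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
           ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]
n = len(records)

# Empirical joint and marginal distributions.
p_xy = {k: c / n for k, c in Counter(records).items()}
p_x = {k: c / n for k, c in Counter(x for x, _ in records).items()}
p_y = {k: c / n for k, c in Counter(y for _, y in records).items()}


def H(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Definition: sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
mi_direct = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
# Identity: I(X,Y) = H(X) + H(Y) - H(X,Y).
mi_from_entropies = H(p_x) + H(p_y) - H(p_xy)
print(mi_direct, mi_from_entropies)
```

Both values agree with the information gain IG(Y|X) = 0.5 computed for the same data.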
- 31. Relative Information Gain
X = College Major
Y = Likes “Gladiator”
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Definition of Relative Information Gain:
RIG(Y|X) = I must transmit Y, what fraction of the bits on average would it save me if both ends of the line knew X?
RIG(Y|X) = [H(Y) - H(Y | X)] / H(Y)
Example:
• H(Y|X) = 0.5
• H(Y) = 1
• Thus RIG(Y|X) = (1 – 0.5)/1 = 0.5
- 32. What is Information Gain used for?
Suppose you are trying to predict whether someone
is going to live past 80 years. From historical data you
might find…
•IG(LongLife | HairColor) = 0.01
•IG(LongLife | Smoker) = 0.2
•IG(LongLife | Gender) = 0.25
•IG(LongLife | LastDigitOfSSN) = 0.00001
IG tells you how interesting a 2-d contingency table is
going to be.
- 33. Cross Entropy
Let X be a random variable with known distribution p(x) and estimated
distribution q(x). The cross entropy measures the difference between the two
distributions and is defined by
$$H_C(X) = E[-\log q(X)] = H(X) + KL(p, q)$$
where the expectation is taken under p, H(X) is the entropy of X with respect
to the distribution p, and KL is the Kullback-Leibler distance between p and q.
If p and q are discrete, this reduces to:
$$H_C(X) = -\sum_{x} p(x) \log_2 q(x)$$
and for continuous p and q the sum becomes an integral:
$$H_C(X) = -\int p(x) \log q(x)\,dx$$
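The decomposition of cross entropy into entropy plus Kullback-Leibler distance can be verified on a small discrete example. The two distributions below are made up for illustration and do not come from the slides; the function names are likewise chosen here.

```python
import math


def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H_C = -sum_x p(x) log2 q(x), with p the true and q the estimated distribution."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    """Kullback-Leibler distance KL(p, q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.125, 0.125]  # illustrative "known" distribution
q = [0.25, 0.25, 0.25, 0.25]   # illustrative estimate

h_c = cross_entropy(p, q)
print(h_c, entropy(p) + kl(p, q))
```

When q equals p the KL term vanishes and the cross entropy reduces to the entropy of p.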
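- 35. Bivariate Gaussians
Write r.v. $\mathbf{X} = \begin{pmatrix} X \\ Y \end{pmatrix}$. Then define $\mathbf{X} \sim N(\mu, \Sigma)$ to mean
$$p(\mathbf{x}) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)\right)$$
Where the Gaussian’s parameters are…
$$\mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$$
Where we insist that Σ is symmetric non-negative definite
It turns out that E[X] = μ and Cov[X] = Σ. (Note that this is a
resulting property of Gaussians, not a definition)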
- 36. Evaluating p(x): Step 1
$$p(\mathbf{x}) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)\right)$$
1. Begin with vector x
- 37. Evaluating p(x): Step 2
$$p(\mathbf{x}) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)\right)$$
1. Begin with vector x
2. Define Δ = x - μ
- 38. Evaluating p(x): Step 3
$$p(\mathbf{x}) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)\right)$$
1. Begin with vector x
2. Define Δ = x - μ
3. Count the number of contours crossed of the ellipsoids formed by $\Sigma^{-1}$
D = this count = $\sqrt{\Delta^T \Sigma^{-1} \Delta}$ = Mahalanobis Distance between x and μ
Contours defined by $\sqrt{\Delta^T \Sigma^{-1} \Delta}$ = constant
- 39. Evaluating p(x): Step 4
$$p(\mathbf{x}) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)\right)$$
1. Begin with vector x
2. Define Δ = x - μ
3. Count the number of contours crossed of the ellipsoids formed by $\Sigma^{-1}$
D = this count = $\sqrt{\Delta^T \Sigma^{-1} \Delta}$ = Mahalanobis Distance between x and μ
4. Define w = exp(-D²/2)
[Plot: exp(-D²/2) against D²]
x close to μ in squared Mahalanobis space gets a large weight. Far away gets
a tiny weight
- 40. Evaluating p(x): Step 5
$$p(\mathbf{x}) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)\right)$$
1. Begin with vector x
2. Define Δ = x - μ
3. Count the number of contours crossed of the ellipsoids formed by $\Sigma^{-1}$
D = this count = $\sqrt{\Delta^T \Sigma^{-1} \Delta}$ = Mahalanobis Distance between x and μ
4. Define w = exp(-D²/2)
5. Multiply w by $\dfrac{1}{2\pi\,|\Sigma|^{1/2}}$ to ensure $\int p(\mathbf{x})\,d\mathbf{x} = 1$
[Plot: exp(-D²/2) against D²]
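The five steps translate almost line for line into code. The following is a minimal sketch for the bivariate case only, written in plain Python with a hand-coded 2x2 inverse; the function name `gaussian_pdf_2d` and the test inputs are assumptions made here, not part of the slides.

```python
import math


def gaussian_pdf_2d(x, mu, sigma):
    """Return (p(x), D) for a bivariate Gaussian, following steps 1 to 5."""
    (a, b), (c, d) = sigma
    det = a * d - b * c                                # determinant of Sigma
    inv = [[d / det, -b / det], [-c / det, a / det]]   # inverse of the 2x2 Sigma
    delta = [x[0] - mu[0], x[1] - mu[1]]               # step 2: Delta = x - mu
    d2 = sum(delta[i] * inv[i][j] * delta[j]           # step 3: D^2 = Delta^T Sigma^-1 Delta
             for i in range(2) for j in range(2))
    w = math.exp(-d2 / 2)                              # step 4: unnormalized weight
    density = w / (2 * math.pi * math.sqrt(det))       # step 5: normalize
    return density, math.sqrt(d2)


density_at_mean, dist_at_mean = gaussian_pdf_2d([0, 0], [0, 0], [[1, 0], [0, 1]])
density_far, dist_far = gaussian_pdf_2d([1, 1], [0, 0], [[2, 1], [1, 2]])
print(density_at_mean, dist_at_mean, density_far, dist_far)
```

At the mean, D is zero and the weight is 1, so the density equals the normalizing constant alone.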
- 41. Bivariate Normal NB(0,0,1,1,0)
persp(x,y,a,theta=30,phi=10,zlab="f(x,y)",box=FALSE,col=4)
- 42. Bivariate Normal NB(0,0,1,1,0)
[Filled contour plot: both axes run from -3 to 3; the color scale runs from 0.00 to 0.20]
filled.contour(x,y,a,nlevels=4,col=2:5)
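- 43. Multivariate Gaussians
Write r.v. $\mathbf{X} = (X_1, X_2, \dots, X_m)^T$. Then define $\mathbf{X} \sim N(\mu, \Sigma)$ to mean
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)\right)$$
Where the Gaussian’s parameters have…
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$
Where we insist that Σ is symmetric non-negative definite
Again, E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition)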
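- 44. General Gaussians
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$
[Plot over axes x1 and x2]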
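- 45. Axis-Aligned Gaussians
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma_m^2 \end{pmatrix}$$
$X_i \perp X_j$ for $i \ne j$
[Plot over axes x1 and x2]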
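- 46. Spherical Gaussians
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma^2 \end{pmatrix}$$
$X_i \perp X_j$ for $i \ne j$
[Plot over axes x1 and x2]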
- 47. Subsets of variables
Write $\mathbf{X} = (X_1, X_2, \dots, X_m)^T$ as $\mathbf{X} = \begin{pmatrix} U \\ V \end{pmatrix}$ where
$$U = \begin{pmatrix} X_1 \\ \vdots \\ X_{m(u)} \end{pmatrix}, \qquad V = \begin{pmatrix} X_{m(u)+1} \\ \vdots \\ X_m \end{pmatrix}$$
This will be our standard notation for breaking an m-dimensional
distribution into subsets of variables
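- 49. Gaussian Marginals are Gaussian
[Diagram: (U, V) → Marginalize → U]
Write $\mathbf{X} = (X_1, X_2, \dots, X_m)^T$ as $\mathbf{X} = \begin{pmatrix} U \\ V \end{pmatrix}$ where $U = (X_1, \dots, X_{m(u)})^T$ and $V = (X_{m(u)+1}, \dots, X_m)^T$
IF $\begin{pmatrix} U \\ V \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$
THEN U is also distributed as a Gaussian (this fact is not immediately obvious)
$U \sim N(\mu_u, \Sigma_{uu})$ (obvious, once we know it’s a Gaussian (why?))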
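- 50. Gaussian Marginals are Gaussian
[Diagram: (U, V) → Marginalize → U]
Write $\mathbf{X} = (X_1, X_2, \dots, X_m)^T$ as $\mathbf{X} = \begin{pmatrix} U \\ V \end{pmatrix}$ where $U = (X_1, \dots, X_{m(u)})^T$ and $V = (X_{m(u)+1}, \dots, X_m)^T$
IF $\begin{pmatrix} U \\ V \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$
THEN U is also distributed as a Gaussian
$U \sim N(\mu_u, \Sigma_{uu})$
How would you prove this?
$$p(u) = \int_v p(u, v)\,dv = \text{(snore...)}$$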
- 51. Linear Transforms remain Gaussian
[Diagram: X → Matrix Multiply by A → AX]
Assume X is an m-dimensional Gaussian r.v.
$X \sim N(\mu, \Sigma)$
Define Y to be a p-dimensional r. v. thusly (note p ≤ m):
Y = AX
…where A is a p x m matrix. Then…
$Y \sim N(A\mu, A \Sigma A^T)$
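The transformed mean and covariance are plain matrix products. Below is a small sketch in pure Python with made-up numbers (m = 2, p = 1, and A chosen so that Y = X1 + X2); the helpers `matmul` and `transpose` are written here for self-containment and are not from the slides.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Illustrative parameters (assumed for this example only).
mu = [[1.0], [2.0]]                 # m x 1 mean vector
sigma = [[2.0, 0.5], [0.5, 1.0]]    # m x m covariance matrix
A = [[1.0, 1.0]]                    # p x m matrix, here Y = X1 + X2

mean_y = matmul(A, mu)                           # A mu
cov_y = matmul(matmul(A, sigma), transpose(A))   # A Sigma A^T
print(mean_y, cov_y)
```

The resulting variance matches the familiar scalar rule Var[X1 + X2] = Var[X1] + Var[X2] + 2 Cov[X1, X2].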
- 52. Adding samples of 2 independent Gaussians is Gaussian
[Diagram: X, Y → + → X + Y]
if $X \sim N(\mu_x, \Sigma_x)$ and $Y \sim N(\mu_y, \Sigma_y)$ and $X \perp Y$
then $X + Y \sim N(\mu_x + \mu_y, \Sigma_x + \Sigma_y)$
Why doesn’t this hold if X and Y are dependent?
Which of the below statements is true?
• If X and Y are dependent, then X+Y is Gaussian but possibly with some other covariance
• If X and Y are dependent, then X+Y might be non-Gaussian
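- 53. Conditional of Gaussian is Gaussian
[Diagram: (U, V) → Conditionalize → U | V]
IF $\begin{pmatrix} U \\ V \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$
THEN $U \mid V \sim N(\mu_{u|v}, \Sigma_{u|v})$ where
$$\mu_{u|v} = \mu_u + \Sigma_{uv} \Sigma_{vv}^{-1} (V - \mu_v)$$
$$\Sigma_{u|v} = \Sigma_{uu} - \Sigma_{uv} \Sigma_{vv}^{-1} \Sigma_{uv}^T$$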
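- 56. Conditional of Gaussian is Gaussian: example
IF $\begin{pmatrix} U \\ V \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$
THEN $U \mid V \sim N(\mu_{u|v}, \Sigma_{u|v})$ where
$$\mu_{u|v} = \mu_u + \Sigma_{uv} \Sigma_{vv}^{-1} (V - \mu_v), \qquad \Sigma_{u|v} = \Sigma_{uu} - \Sigma_{uv} \Sigma_{vv}^{-1} \Sigma_{uv}^T$$
IF $\begin{pmatrix} w \\ y \end{pmatrix} \sim N\left( \begin{pmatrix} 2977 \\ 76 \end{pmatrix}, \begin{pmatrix} 849^2 & 967 \\ 967 & 3.68^2 \end{pmatrix} \right)$
THEN $w \mid y \sim N(\mu_{w|y}, \Sigma_{w|y})$ where
$$\mu_{w|y} = 2977 + \frac{967\,(y - 76)}{3.68^2}, \qquad \Sigma_{w|y} = 849^2 - \frac{967^2}{3.68^2} \approx 808^2$$
[Plot: the densities P(w|m=82), P(w|m=76) and P(w)]
Note: when the given value of v is $\mu_v$, the conditional mean of u is $\mu_u$
Note: conditional mean is a linear function of v
Note: conditional variance can only be equal to or smaller than marginal variance
Note: conditional variance is independent of the given value of v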
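The notes on this slide can be checked with the scalar numbers from the example. The following sketch assumes the parameters shown above (means 2977 and 76, variances 849² and 3.68², covariance 967); the function name `conditional_w_given_y` is chosen here for illustration.

```python
import math

mu_w, mu_y = 2977.0, 76.0
var_w, cov_wy, var_y = 849.0 ** 2, 967.0, 3.68 ** 2


def conditional_w_given_y(y):
    """Mean and variance of w | y for the bivariate Gaussian above (scalar case)."""
    mean = mu_w + cov_wy / var_y * (y - mu_y)   # linear in y
    var = var_w - cov_wy ** 2 / var_y           # does not depend on y
    return mean, var


mean_76, var_76 = conditional_w_given_y(76.0)
mean_82, var_82 = conditional_w_given_y(82.0)
print(mean_76, mean_82, math.sqrt(var_82))
```

Conditioning on y = 76 leaves the mean at 2977, a larger y shifts the mean upward, and the conditional standard deviation is the same in both cases and smaller than 849.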
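- 57. Gaussians and the chain rule
[Diagram: U | V and V → Chain Rule → (U, V)]
Let A be a constant matrix
IF $U \mid V \sim N(AV, \Sigma_{u|v})$ and $V \sim N(\mu_v, \Sigma_{vv})$
THEN $\begin{pmatrix} U \\ V \end{pmatrix} \sim N(\mu, \Sigma)$, with
$$\mu = \begin{pmatrix} A\mu_v \\ \mu_v \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} A \Sigma_{vv} A^T + \Sigma_{u|v} & A \Sigma_{vv} \\ (A \Sigma_{vv})^T & \Sigma_{vv} \end{pmatrix}$$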
- 58. Available Gaussian tools
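• Marginalize, (U, V) → U: IF $\begin{pmatrix} U \\ V \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$ THEN $U \sim N(\mu_u, \Sigma_{uu})$
• Matrix Multiply, X → AX: IF $X \sim N(\mu, \Sigma)$ AND $Y = AX$ THEN $Y \sim N(A\mu, A \Sigma A^T)$
• Add, X and Y → X + Y: if $X \sim N(\mu_x, \Sigma_x)$ and $Y \sim N(\mu_y, \Sigma_y)$ and $X \perp Y$ then $X + Y \sim N(\mu_x + \mu_y, \Sigma_x + \Sigma_y)$
• Conditionalize, (U, V) → U | V: IF $\begin{pmatrix} U \\ V \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$ THEN $U \mid V \sim N(\mu_{u|v}, \Sigma_{u|v})$ where $\mu_{u|v} = \mu_u + \Sigma_{uv} \Sigma_{vv}^{-1} (V - \mu_v)$ and $\Sigma_{u|v} = \Sigma_{uu} - \Sigma_{uv} \Sigma_{vv}^{-1} \Sigma_{uv}^T$
• Chain Rule, U | V and V → (U, V): IF $U \mid V \sim N(AV, \Sigma_{u|v})$ and $V \sim N(\mu_v, \Sigma_{vv})$ THEN $\begin{pmatrix} U \\ V \end{pmatrix} \sim N(\mu, \Sigma)$, with $\Sigma = \begin{pmatrix} A \Sigma_{vv} A^T + \Sigma_{u|v} & A \Sigma_{vv} \\ (A \Sigma_{vv})^T & \Sigma_{vv} \end{pmatrix}$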
- 59. Assume…
• You are an intellectual snob
• You have a child
- 60. Intellectual snobs with children
• …are obsessed with IQ
• In the world as a whole, IQs are drawn from
a Gaussian N(100, 15²)
- 61. IQ tests
• If you take an IQ test you’ll get a score that,
on average (over many tests) will be your
IQ
• But because of noise on any one test the
score will often be a few points lower or
higher than your true IQ.
SCORE | IQ ~ N(IQ, 10²)
- 62. Assume…
• You drag your kid off to get tested
• She gets a score of 130
• “Yippee” you screech and start deciding how
to casually refer to her membership of the
top 2% of IQs in your Christmas newsletter.
P(X<130 | μ=100, σ²=15²) =
P(X<2 | μ=0, σ²=1) =
Φ(2) = 0.977, where Φ is the standard normal CDF
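The 0.977 figure is the standard normal CDF evaluated at 2. A short sketch of how it could be computed with Python's `math.erf`, with the function name `normal_cdf` chosen here for illustration:

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian CDF expressed through the standard error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


p_below_130 = normal_cdf(130, mu=100, sigma=15)  # a score two standard deviations above the mean
p_standard = normal_cdf(2)                       # same probability in standardized form
print(round(p_below_130, 3), round(p_standard, 3))
# Note: math.erf(2) on its own is about 0.995; the CDF needs the rescaling above.
```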
- 63. Assume…
• You drag your kid off to get tested
• She gets a score of 130
• “Yippee” you screech and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter.
P(X<130 | μ=100, σ²=15²) = P(X<2 | μ=0, σ²=1) = Φ(2) = 0.977
You are thinking: Well sure the test isn’t accurate, so she might have an IQ of 120 or she might have an IQ of 140, but the most likely IQ given the evidence “score=130” is, of course, 130.
Can we trust this reasoning?
- 64. What we really want:
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130
• Question: What is IQ | (S=130)?
Called the Posterior Distribution of IQ
- 65. Which tool or tools?
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130
• Question: What is IQ | (S=130)?
Available tools:
• Marginalize: (U, V) → U
• Matrix Multiply: X → AX
• Add: X and Y → X + Y
• Conditionalize: (U, V) → U | V
• Chain Rule: U | V and V → (U, V)
- 66. Plan
• IQ ~ N(100, 15²)
• S | IQ ~ N(IQ, 10²)
• S = 130
• Question: What is IQ | (S=130)?
Plan: S | IQ and IQ → Chain Rule → (S, IQ) → Swap → (IQ, S) → Conditionalize → IQ | S
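- 67. Working…
IQ ~ N(100, 15²)
S | IQ ~ N(IQ, 10²)
S = 130
Question: What is IQ | (S=130)?
Conditionalize: IF $\begin{pmatrix} U \\ V \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{uv}^T & \Sigma_{vv} \end{pmatrix} \right)$ THEN $\mu_{u|v} = \mu_u + \Sigma_{uv} \Sigma_{vv}^{-1} (V - \mu_v)$
Chain rule: IF $U \mid V \sim N(AV, \Sigma_{u|v})$ and $V \sim N(\mu_v, \Sigma_{vv})$ THEN $\begin{pmatrix} U \\ V \end{pmatrix} \sim N(\mu, \Sigma)$, with $\Sigma = \begin{pmatrix} A \Sigma_{vv} A^T + \Sigma_{u|v} & A \Sigma_{vv} \\ (A \Sigma_{vv})^T & \Sigma_{vv} \end{pmatrix}$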
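The plan (chain rule, swap, conditionalize) can be carried out numerically. The sketch below is an illustration computed here, not a result quoted from the slides: it assumes the scalar case with A = 1, U = S and V = IQ, and the variable names are hypothetical.

```python
import math

mu_iq, var_iq = 100.0, 15.0 ** 2   # prior: IQ ~ N(100, 15^2)
var_noise = 10.0 ** 2              # test noise: S | IQ ~ N(IQ, 10^2)
a = 1.0                            # S | IQ has mean a * IQ with a = 1
score = 130.0

# Chain rule: joint distribution of (S, IQ).
mu_s = a * mu_iq
var_s = a * var_iq * a + var_noise   # A Sigma_vv A^T + Sigma_u|v
cov_s_iq = a * var_iq                # A Sigma_vv

# Swap to (IQ, S), then conditionalize on S = score.
post_mean = mu_iq + cov_s_iq / var_s * (score - mu_s)
post_var = var_iq - cov_s_iq ** 2 / var_s
print(round(post_mean, 2), round(math.sqrt(post_var), 2))
```

Under these assumptions the posterior mean lies between the population mean of 100 and the observed score of 130, and the posterior variance is smaller than the prior variance.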