C.M. Bishop’s PRML
Chapter 10: Approximate Inference
Tran Quoc Hoan
@k09hthaduonght.wordpress.com/
13 December 2015, PRML Reading, Hasegawa lab., Tokyo
The University of Tokyo
Excuse me…
Variational Inference 2
(Figure: concentration ability of speaker and audience across the sections of the talk.) Should we take a break? (or next time)
Outline
Variational Inference 3
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.3. Variational Linear
Regression
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
Part I:
Probabilistic modeling and 

the variational principle
Part II:
Design the 

variational algorithms
10.7. Expectation
Propagation
Progress…
Variational Inference 4
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
10.7. Expectation
Propagation
10.3. Variational Linear
Regression
Probabilistic Inference
10.1 Variational Inference 5
Any mechanism by which we deduce the probabilities in our model based on data.
Statistical Inference
In probabilistic models, we need to reason about the probabilities of events.
Inference links the observed data with our statistical assumptions and allows us
to ask questions of our data: predictions, visualization, model selection.
Modeling and Inference
10.1 Variational Inference 6
Probabilistic modeling will involve:
• Deciding on a priori beliefs.
• Positing an explanation of how the observed data are generated, i.e. providing a probabilistic description.
Bayes’ rule appears in many inference problems:
p(z|x) = p(x|z) p(z) / ∫ p(x, z) dz
(posterior = likelihood × prior / marginal likelihood (model evidence)),
where x is the observed data and z the hidden variable.
Modeling and Inference
10.1 Variational Inference 7
Most inference problems will be one of: marginalization, expectation, prediction.
Posterior = likelihood × prior / marginal likelihood (model evidence):
p(z|x) = p(x|z) p(z) / ∫ p(x, z) dz
The posterior typically has a complex form for which the required expectations are not tractable.
Importance Sampling
10.1 Variational Inference 8
Basic idea: transform the integral into an expectation over a simple, known distribution q(z) (the proposal):
E[f] = ∫ f(z) p(z) dz = ∫ f(z) [p(z)/q(z)] q(z) dz
(Notice: conditioning on x is abbreviated in the formulas.)
Conditions:
• q(z) > 0 wherever f(z)p(z) ≠ 0
• q(z) is easy to sample from
Monte Carlo estimate, with samples z^(s) ~ q(z) and importance weights w^(s) = p(z^(s)) / q(z^(s)):
E[f] ≈ (1/S) Σ_s w^(s) f(z^(s))
A small numerical sketch follows below.
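A minimal numerical sketch of this estimator (not from the slides; the target, proposal, f, and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(z) = N(0, 1), proposal q(z) = N(0, 2^2), f(z) = z^2 (so E[f] = 1).
def p(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def q(z):
    return np.exp(-0.5 * (z / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

def f(z):
    return z**2

S = 100_000
z = rng.normal(0.0, 2.0, size=S)      # z^(s) ~ q(z)
w = p(z) / q(z)                       # importance weights w^(s)
estimate = np.mean(w * f(z))          # (1/S) sum_s w^(s) f(z^(s))
print(estimate)                       # close to the true value E[f] = 1
```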
Importance Sampling
10.1 Variational Inference 9
E[f] ≈ (1/S) Σ_s w^(s) f(z^(s)),   w^(s) = p(z^(s)) / q(z^(s)),   z^(s) ~ q(z)
Properties:
• Unbiased estimate of the expectation
• No independent samples from the posterior distribution
• Many draws from the proposal are needed, especially in high dimensions
Stochastic approximation: see Chapter 11.
Importance Sampling
10.1 Variational Inference 10
Take inspiration from importance sampling, but instead:
• Obtain a deterministic algorithm
• Scale up to high-dimensional and large-data problems
• Assess convergence easily
This is the motivation for Variational Inference.
What is a Variational Method?
10.1 Variational Inference 11
Variational Principle: a general family of methods for approximating complicated
densities by a simpler class of densities.
(Figure: the approximation class and the true posterior; the variational parameters are fitted to bring the approximation close to the posterior.)
These are deterministic approximation procedures with bounds on probabilities of interest.
Variational Calculus
10.1 Variational Inference 12
Functions:
• Take variables as input, output a value
• Full and partial derivatives df/dx
• Ex. Maximize the likelihood p(x|µ) w.r.t. the parameters µ
Functionals:
• Take functions as input, output a value
• Functional derivatives δF/δf(x)
• Ex. Maximize the entropy H[p(x)] w.r.t. the distribution p(x)
Both types of derivatives are exploited in variational inference.
The variational method derives from the calculus of variations.
From IS to Variational Inference
10.1 Variational Inference 13
ln p(X) = ln ∫ p(X|Z) p(Z) dZ = ln ∫ p(X|Z) [p(Z)/q(Z)] q(Z) dZ   (importance-weight form)
Jensen’s inequality: ln ∫ p(x) g(x) dx ≥ ∫ p(x) ln g(x) dx
Therefore
ln p(X) ≥ ∫ q(Z) ln( p(X|Z) p(Z)/q(Z) ) dZ
        = ∫ q(Z) ln p(X|Z) dZ − ∫ q(Z) ln[ q(Z)/p(Z) ] dZ
        = E_q(Z)[ln p(X|Z)] − KL[q(Z)||p(Z)]
This is the variational (evidence) lower bound.
Variational Lower Bound
10.1 Variational Inference 14
F(X, q) = E_q(Z)[ln p(X|Z)] − KL[q(Z)||p(Z)]
          (reconstruction)      (penalty)
Interpreting the bound:
• Reconstruction cost: the expected log-likelihood measures how well samples from q(Z) are able to explain the data X.
• Penalty: ensures that the explanation of the data, q(Z), does not deviate too far from our beliefs p(Z).
• Approximate posterior distribution q(Z): the best match to the true posterior p(Z|X), one of the unknown inferential quantities of interest to us.
How tight is the variational lower bound (ELBO)?
10.1 Variational Inference 15
Some comments on q:
• Variational parameters: the parameters of q(Z) (e.g. if q is a Gaussian, they are its mean and variance).
• Integration is switched to optimization: optimize q(Z) directly (my thought: actually it is q(Z|X)) to minimize the gap
ln p(X) − F(X, q) = ∫ q(Z) ln p(X) dZ − F(X, q)
                  = ∫ q(Z) ln p(X) dZ − ∫ q(Z) ln p(X|Z) dZ + ∫ q(Z) ln[ q(Z)/p(Z) ] dZ
                  = ∫ q(Z) ln[ p(X) q(Z) / (p(X|Z) p(Z)) ] dZ
                  = ∫ q(Z) ln[ q(Z) / p(Z|X) ] dZ
                  = KL[q(Z)||p(Z|X)]
From the book
10.1 Variational Inference 16
KL(q‖p) = −∫ q(Z) ln{ p(Z|X) / q(Z) } dZ    (10.4)
ln p(X) = L(q) + KL(q‖p)    (10.2)
L(q) = F(X, q) = ∫ q(Z) ln{ p(X, Z) / q(Z) } dZ    (10.3)
The maximum of L(q) occurs when KL(q‖p) = 0, i.e. when q(Z) = p(Z|X).
The variational method approximates this maximum. A small numerical check of the decomposition follows below.
Remaining questions:
• What exactly is q(Z)?
• How do we find the variational parameters?
• How do we compute the gradients?
• How do we optimize the model parameters?
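A small numerical check of the decomposition (10.2)–(10.4) on a toy discrete model (my own example, not from the slides): with a discrete latent variable the sums are exact, so ln p(X) = L(q) + KL(q||p) can be verified directly for any q.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 4
prior = np.array([0.1, 0.2, 0.3, 0.4])          # p(Z)
lik = np.array([0.05, 0.6, 0.2, 0.1])           # p(X = observed | Z), one value per Z

joint = lik * prior                              # p(X, Z)
log_px = np.log(joint.sum())                     # ln p(X)
posterior = joint / joint.sum()                  # p(Z | X)

q = rng.dirichlet(np.ones(K))                    # an arbitrary q(Z)

L_q = np.sum(q * (np.log(joint) - np.log(q)))    # L(q), eq. (10.3)
kl = np.sum(q * (np.log(q) - np.log(posterior))) # KL(q || p(Z|X)), eq. (10.4)

print(log_px, L_q + kl)                          # equal up to rounding, eq. (10.2)
```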
Free-form and Fixed-form
10.1 Variational Inference 17
The free-form variational method solves for the exact optimal distribution by setting the functional derivative to zero:
δL(q)/δq(Z) = 0   subject to   ∫ q(Z) dZ = 1
The optimal solution is the true posterior distribution, q(Z) ∝ p(Z) p(X|Z, θ), but solving for the normalization is exactly the original intractable problem.
The fixed-form variational method specifies an explicit form for the q-distribution,
q_λ(Z) = f(Z; λ)
with variational parameters λ, taken from an ideally rich class of distributions.
10.1.1 Factorized distributions (I)
10.1 Variational Inference 18
• Mean-field methods assume that the distribution is factorized: a restricted class of approximations in which every dimension (or subset of dimensions) of the posterior is independent.
• Let Z be partitioned into disjoint groups Z_i (i = 1, ..., M):
q(Z) = ∏_{i=1}^{M} q_i(Z_i)
There is no restriction on the functional form of the individual factors q_i(Z_i).
Factorized distributions (II)
10.1 Variational Inference 19
L(q) = ∫ q(Z) ln{ p(X, Z) / q(Z) } dZ
     = ∫ q(Z) { ln p(X, Z) − ln q(Z) } dZ
     = ∫ ( ∏_i q_i(Z_i) ) { ln p(X, Z) − Σ_i ln q_i(Z_i) } dZ
     = ∫ ( ∏_i q_i ) ln p(X, Z) dZ − ∫ ( ∏_i q_i ) ( Σ_i ln q_i ) dZ
Factorized distributions (III)
10.1 Variational Inference 20
• Consider the dependence on a single factor q_j(Z_j):
L(q) = ∫ ( ∏_i q_i ) ln p(X, Z) dZ − ∫ ( ∏_i q_i ) ( Σ_i ln q_i ) dZ
     = ∫ q_j { ∫ ln p(X, Z) ∏_{i≠j} q_i dZ_i } dZ_j − ∫ ( ∏_i q_i ) { ln q_j + Σ_{i≠j} ln q_i } dZ
     = ∫ q_j { ∫ ln p(X, Z) ∏_{i≠j} q_i dZ_i } dZ_j − ∫ ( ∏_i q_i ) ln q_j dZ
       − ∫ ( ∏_{i≠j} q_i )( Σ_{i≠j} ln q_i ) ( ∫ q_j dZ_j ) dZ_{i≠j}
     = ∫ q_j ( ln p̃(X, Z_j) + const ) dZ_j − ∫ q_j ln q_j dZ_j − ∫ ( ∏_{i≠j} q_i )( Σ_{i≠j} ln q_i ) dZ_{i≠j}
using ∫ q_j dZ_j = 1; the last term does not depend on q_j.
Factorized distributions (IV)
10.1 Variational Inference 21
L(q) = ∫ q_j ln p̃(X, Z_j) dZ_j − ∫ q_j ln q_j dZ_j + const    (10.6)
which is a negative KL divergence between q_j(Z_j) and p̃(X, Z_j), where
ln p̃(X, Z_j) = E_{i≠j}[ln p(X, Z)] + const    (10.7)
E_{i≠j}[ln p(X, Z)] = ∫ ln p(X, Z) ∏_{i≠j} q_i dZ_i    (10.8)
• Maximize L(q) with respect to q_j(Z_j), keeping {q_{i≠j}} fixed.
• This is the same as minimizing the KL divergence between q_j(Z_j) and p̃(X, Z_j).
Optimal Solution
10.1 Variational Inference 22
ln q*_j(Z_j) = E_{i≠j}[ln p(X, Z)] + const    (10.9)
or, equivalently,
q*_j(Z_j) = exp( E_{i≠j}[ln p(X, Z)] ) / ∫ exp( E_{i≠j}[ln p(X, Z)] ) dZ_j
Today’s memo — the coordinate-wise update scheme:
(1) Initialize all q_j appropriately.
(2) Repeat until convergence:
    • for each q_i: fix all q_j (j ≠ i), find the optimal q_i, and update q_i.
The next slides give detailed examples.
10.1.2. Properties of factorized approximations
10.1 Variational Inference 23
Approximate a correlated Gaussian distribution with a factorized Gaussian.
Consider p(z) = N(z | µ, Λ⁻¹), where z = (z1, z2),
µ = (µ1, µ2)ᵀ,   Λ = [[Λ11, Λ12], [Λ21, Λ22]]
Approximate using q(z) = q1(z1) q2(z2).
Optimal solution from (10.9):
ln q*_1(z1) = E_{z2}[ln p(z)] + const
            = E_{z2}[ −(1/2)(z1 − µ1)² Λ11 − (z1 − µ1) Λ12 (z2 − µ2) ] + const
considering only the terms that involve z1.
10.1 Variational Inference 24
ln q*_1(z1) = E_{z2}[ −(1/2)(z1 − µ1)² Λ11 − (z1 − µ1) Λ12 (z2 − µ2) ] + const
            = ∫ q2(z2) { −(1/2)(z1 − µ1)² Λ11 − (z1 − µ1) Λ12 (z2 − µ2) } dz2 + const
            = −(1/2)(z1 − µ1)² Λ11 − z1 Λ12 (E_{z2}[z2] − µ2) + const
This is a quadratic form in z1. Then we have (10.11)
q*_1(z1) = N(z1 | m1, Λ11⁻¹),   m1 = µ1 − Λ11⁻¹ Λ12 (E_{z2}[z2] − µ2)
10.1.2. Properties of factorized approximations
10.1 Variational Inference 25
Optimal solution of q(z) = q1(z1) q2(z2) (eqs. 10.12–10.15):
q*_1(z1) = N(z1 | m1, Λ11⁻¹),   m1 = µ1 − Λ11⁻¹ Λ12 (E_{z2}[z2] − µ2)
q*_2(z2) = N(z2 | m2, Λ22⁻¹),   m2 = µ2 − Λ22⁻¹ Λ21 (E_{z1}[z1] − µ1)
Mutual dependency:
• q*_1(z1) depends on E_{z2}[z2], calculated from q*_2(z2)
• q*_2(z2) depends on E_{z1}[z1], calculated from q*_1(z1)
Update alternately until convergence (see the sketch below).
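A minimal sketch of these alternating updates for the two-variable example (the values of µ and Λ are illustrative choices, not from the slides):

```python
import numpy as np

# Target: correlated 2D Gaussian p(z) = N(z | mu, inv(Lam))
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])           # precision matrix (illustrative values)

m1, m2 = 0.0, 0.0                      # initial means of q1(z1), q2(z2)
for _ in range(50):
    # m1 = mu1 - Lam11^{-1} Lam12 (E[z2] - mu2), with E[z2] = m2
    m1 = mu[0] - Lam[0, 1] / Lam[0, 0] * (m2 - mu[1])
    # m2 = mu2 - Lam22^{-1} Lam21 (E[z1] - mu1), with E[z1] = m1
    m2 = mu[1] - Lam[1, 0] / Lam[1, 1] * (m1 - mu[0])

print(m1, m2)                          # converges to the true mean (1, -1)
print(1 / Lam[0, 0], 1 / Lam[1, 1])    # factorized variances; note they
                                       # underestimate the marginal variances of p
```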
10.1.2. Properties of factorized approximations
10.1 Variational Inference 26
Fig 10.2: The green contours correspond to 1, 2 and 3 standard deviations of a correlated Gaussian distribution p(z) over two variables z1 and z2. The red contours represent the corresponding levels for an approximating distribution q(z) over the same variables, given by the product of two independent univariate Gaussians. Left: minimize KL(q||p); right: minimize KL(p||q).
• Minimizing KL(q||p): the mean is captured correctly, but the variance of q is significantly underestimated along the orthogonal (largest-variance) direction of p.
• Considering instead the reverse KL divergence (10.17)
KL(p||q) = −∫ p(Z) [ Σ_{i=1}^{M} ln q_i(Z_i) ] dZ + const
the optimal solution is q*_j(Z_j) = p(Z_j), the corresponding marginal distribution of p(Z).
10.1.2. Properties of factorized approximations
10.1 Variational Inference 27
Minimize KL(q||p) vs. minimize KL(p||q).
Reverse KL divergence: KL(p||q) = −∫ p(Z) [ Σ_{i=1}^{M} ln q_i(Z_i) ] dZ + const
Forward KL (used in variational inference): KL(q||p) = −∫ q(Z) ln{ p(Z)/q(Z) } dZ
• For KL(q||p): where p(Z) is near zero, q(Z) also tends to be close to zero, otherwise the divergence blows up (zero-forcing / mode-seeking).
• For KL(p||q): regions where p(Z) is near zero contribute little, so the value of q(Z) there is not important; the divergence is minimized by distributions q(Z) that are nonzero wherever p(Z) is nonzero (mass-covering).
More about divergence
10.1 Variational Inference 28
Fig 10.3: Blue contours = bimodal distribution p(Z). Red contours = single Gaussian distribution q(Z) that best approximates p(Z). (a) Minimizing KL(p||q); (b), (c) minimizing KL(q||p), which locks onto one of the two modes.
• KL(p||q) and KL(q||p) belong to the alpha family of divergences D_α(p||q), where α → −1 recovers KL(q||p) and α → 1 recovers KL(p||q).
• For α ≤ −1 the approximation q will underestimate the support of p(x); for α ≥ 1 it will overestimate it.
• For α = 0 the divergence is symmetric and related to the Hellinger distance.
A small numerical comparison of the two divergences follows below.
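A small numerical illustration of the two behaviours (my own example, not from the slides): fit a single Gaussian q to a bimodal 1D mixture p, once by moment matching (which minimizes KL(p||q) within the Gaussian family) and once by a grid search minimizing KL(q||p).

```python
import numpy as np

# Bimodal target p(z): mixture of two well-separated Gaussians.
def gauss(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def p(z):
    return 0.5 * gauss(z, -3.0, 0.7) + 0.5 * gauss(z, 3.0, 0.7)

z = np.linspace(-10, 10, 4001)
dz = z[1] - z[0]
pz = p(z)

# KL(p||q): moment matching -> broad Gaussian covering both modes.
mean_p = np.sum(z * pz) * dz
var_p = np.sum((z - mean_p) ** 2 * pz) * dz
print("KL(p||q) fit:", mean_p, np.sqrt(var_p))   # mean ~ 0, std ~ 3 (mass-covering)

# KL(q||p): grid search over (m, s) -> locks onto one of the modes.
best = None
for m in np.linspace(-5, 5, 101):
    for s in np.linspace(0.3, 4.0, 75):
        qz = gauss(z, m, s)
        kl_qp = np.sum(qz * (np.log(qz + 1e-300) - np.log(pz + 1e-300))) * dz
        if best is None or kl_qp < best[0]:
            best = (kl_qp, m, s)
print("KL(q||p) fit:", best[1], best[2])         # mean ~ +/-3, small std (mode-seeking)
```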
10.1.3 The univariate Gaussian (I)
10.1 Variational Inference 29
• Goal: infer the posterior distribution of the mean µ and precision τ given a data set D = {x1, ..., xN}.
• Likelihood function (10.21): p(D|µ, τ) = (τ/2π)^{N/2} exp{ −(τ/2) Σ_{n=1}^N (x_n − µ)² }
• Priors (10.22)–(10.23): p(µ|τ) = N(µ | µ0, (λ0 τ)⁻¹),   p(τ) = Gam(τ | a0, b0)
• Approximate with the factorized distribution (10.24): q(µ, τ) = q_µ(µ) q_τ(τ)
10.1.3 The univariate Gaussian (II)
10.1 Variational Inference 30
• Optimal solution (from formula 10.9):
ln q*_µ(µ) = E_τ[ln p(D, µ, τ)] + const
           = E_τ[ln p(D|µ, τ) + ln p(µ|τ) + ln p(τ)] + const
           = E_τ[ln p(D|µ, τ) + ln p(µ|τ)] + const
           = E_τ[ −(τ/2) Σ_{n=1}^N (x_n − µ)² − (λ0 τ/2)(µ − µ0)² ] + const
           = −(E_τ[τ]/2) { Σ_{n=1}^N (x_n − µ)² + λ0 (µ − µ0)² } + const    (10.25)
which is a quadratic form in µ.
10.1.3 The univariate Gaussian (II)
10.1 Variational Inference 31
• Optimal solution for the mean (10.26)–(10.27):
q*_µ(µ) = N(µ | µ_N, λ_N⁻¹),   µ_N = (λ0 µ0 + N x̄) / (λ0 + N),   λ_N = (λ0 + N) E_τ[τ]
• Similarly, for the optimal solution of q_τ(τ):
ln q*_τ(τ) = E_µ[ln p(D|µ, τ) + ln p(µ|τ) + ln p(τ)] + const
           = −(τ/2) E_µ[ Σ_{n=1}^N (x_n − µ)² + λ0 (µ − µ0)² ] + (N/2) ln τ + (1/2) ln τ + (a0 − 1) ln τ − b0 τ + const
           = ( a0 + (N + 1)/2 − 1 ) ln τ − ( b0 + (1/2) E_µ[...] ) τ + const    (10.28)
10.1.3 The univariate Gaussian (III)
10.1 Variational Inference 32
• Optimal solution of q_τ(τ):
q*_τ(τ) ∝ τ^{ a0 + (N + 1)/2 − 1 } exp{ −( b0 + (1/2) E_µ[...] ) τ }
or (10.29)–(10.30)
q*_τ(τ) = Gam(τ | a_N, b_N),   a_N = a0 + (N + 1)/2,
b_N = b0 + (1/2) E_µ[ Σ_{n=1}^N (x_n − µ)² + λ0 (µ − µ0)² ]
• Use (10.26)–(10.27) and (10.29)–(10.30) alternately to compute an approximate variational posterior for p(µ, τ|D). A small implementation follows below.
10.1.3 The univariate Gaussian (IV)
10.1 Variational Inference 33
(Figure: convergence of the factorized approximation q_µ(µ) q_τ(τ) to p(µ, τ|D), alternately re-estimating q_µ(µ) and q_τ(τ).)
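A minimal implementation of the alternating updates (10.26)–(10.27) and (10.29)–(10.30) (synthetic data and prior values are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)     # synthetic data
N, xbar = len(x), x.mean()

# Prior hyperparameters (illustrative choices)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = a0 / b0                                   # initial guess for E[tau]
for _ in range(100):
    # q_mu(mu) = N(mu | muN, 1/lamN), eqs (10.26)-(10.27)
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    E_mu, E_mu2 = muN, muN**2 + 1.0 / lamN        # first two moments of q_mu

    # q_tau(tau) = Gam(tau | aN, bN), eqs (10.29)-(10.30)
    aN = a0 + (N + 1) / 2.0
    E_sq = np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2      # E[sum_n (x_n - mu)^2]
    bN = b0 + 0.5 * (E_sq + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = aN / bN

print(muN, 1.0 / np.sqrt(E_tau))   # approximate posterior mean and noise std (~2.0, ~1.5)
```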
10.1.4 Model Comparison
10.1 Variational Inference 34
• Let the prior probabilities over the models be p(m).
• Goal: determine p(m|X), where X is the observed data.
• Approximate with q(Z, m) = q(Z|m) q(m).
• Decomposition:
ln p(X) = L − Σ_m Σ_Z q(Z|m) q(m) ln{ p(Z, m|X) / (q(Z|m) q(m)) }
where the lower bound (10.34) is
L = Σ_m Σ_Z q(Z|m) q(m) ln{ p(Z, X, m) / (q(Z|m) q(m)) }
• Maximizing L with respect to q(m) gives (10.35)
q(m) ∝ p(m) exp(L_m),   with   L_m = Σ_Z q(Z|m) ln{ p(Z, X|m) / q(Z|m) }
• Maximizing with respect to q(Z|m): the solutions for different m are coupled due to the conditioning, so in practice each q(Z|m) is optimized individually and q(m) is subsequently found.
Progress…
Variational Inference 35
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
10.7. Expectation
Propagation
10.3. Variational Linear
Regression
10.2 Variational Mixture of Gaussians
10.2 Variational Mixture of Gaussians 36
• Goal: apply variational inference to the Gaussian mixture model.
• Problem formulation:
  Observed data X = {x1, ..., xN}; hidden variables Z = {z1, ..., zN}.
• Latent variable z_n = {z_nk} is a 1-of-K binary vector associated with each observation x_n.
• Conditional distribution of Z given the mixing coefficients π (10.37):
p(Z|π) = ∏_{n=1}^N ∏_{k=1}^K π_k^{z_nk}
• Conditional distribution of the observed data vectors, given the latent variables and the component parameters (10.38):
p(X|Z, µ, Λ) = ∏_{n=1}^N ∏_{k=1}^K N(x_n | µ_k, Λ_k⁻¹)^{z_nk}
10.2 Variational Mixture of Gaussians
10.2 Variational Mixture of Gaussians 37
• Conjugate prior distributions:
  • Dirichlet distribution over the mixing coefficients π (10.39):
p(π) = Dir(π|α0) = C(α0) ∏_{k=1}^K π_k^{α0 − 1}
  • Independent Gaussian-Wishart prior governing the mean and precision of each Gaussian component (10.40), with m0 = 0 chosen by symmetry.
Fig 10.5: Directed acyclic graph representing the Bayesian mixture of Gaussians model.
10.2.1. Variational Distribution (I)
10.2 Variational Mixture of Gaussians 38
• Joint distribution: p(X, Z, π, µ, Λ) = p(X|Z, µ, Λ) p(Z|π) p(π) p(µ|Λ) p(Λ)
• Approximation (10.42): q(Z, π, µ, Λ) = q(Z) q(π, µ, Λ)
• Optimal solution (from formula 10.9), (10.43)–(10.44):
ln q*(Z) = E_{π,µ,Λ}[ln p(X, Z, π, µ, Λ)] + const
         = E_π[ln p(Z|π)] + E_{µ,Λ}[ln p(X|Z, µ, Λ)] + const
         = Σ_{n=1}^N Σ_{k=1}^K z_nk E_π[ln π_k]
           + Σ_{n=1}^N Σ_{k=1}^K z_nk E_{µ,Λ}[ (1/2) ln|Λ_k| − (D/2) ln(2π) − (1/2)(x_n − µ_k)ᵀ Λ_k (x_n − µ_k) ] + const
where D is the dimensionality of the data variable x.
10.2.1 Variational Distribution (II)
10.2 Variational Mixture of Gaussians 39
• Optimal solution for q(Z) (10.45):
ln q*(Z) = Σ_{n=1}^N Σ_{k=1}^K z_nk ln ρ_nk + const
where ln ρ_nk collects the terms above (10.46). After normalization (10.47)–(10.48),
q*(Z) = ∏_{n=1}^N ∏_{k=1}^K r_nk^{z_nk},   r_nk = ρ_nk / Σ_{j=1}^K ρ_nj
The quantities r_nk can be seen as responsibilities, as in the EM algorithm.
10.2.1 Variational Distribution (III)
10.2 Variational Mixture of Gaussians 40
• Optimal solution for q(π, µ, Λ), with the statistics N_k, x̄_k, S_k defined in (10.51)–(10.53):
ln q*(π, µ, Λ) = E_Z[ln p(X, Z, π, µ, Λ)] + const
               = E_Z[ln p(X|Z, µ, Λ)] + E_Z[ln p(Z|π)] + ln p(π) + ln p(µ, Λ) + const
               = Σ_{n=1}^N Σ_{k=1}^K E_Z[z_nk] ln N(x_n | µ_k, Λ_k⁻¹) + E_Z[ln p(Z|π)]
                 + ln p(π) + Σ_{k=1}^K ln p(µ_k, Λ_k) + const    (10.54)
10.2.1 Variational Distribution (IV)
10.2 Variational Mixture of Gaussians 41
The right-hand side of (10.54) decomposes into a term involving only π plus terms involving only the (µ_k, Λ_k), so the variational posterior factorizes further (10.55):
q(π, µ, Λ) = q(π) ∏_{k=1}^K q(µ_k, Λ_k),   and hence   q*(π, µ, Λ) = q*(π) ∏_{k=1}^K q*(µ_k, Λ_k)
From (10.54) and (10.55) we have
ln q*(π) = E_Z[ln p(Z|π)] + ln p(π) + const
ln q*(µ_k, Λ_k) = ln p(µ_k, Λ_k) + Σ_{n=1}^N E_Z[z_nk] ln N(x_n | µ_k, Λ_k⁻¹) + const
10.2.1 Variational Distribution (VI)
10.2 Variational Mixture of Gaussians 42
ln q*(π) = E_Z[ln p(Z|π)] + ln p(π) + const
         = E_Z[ ln( ∏_{n=1}^N ∏_{k=1}^K π_k^{z_nk} ) ] + ln( C(α0) ∏_{k=1}^K π_k^{α0 − 1} ) + const
         = Σ_{n=1}^N Σ_{k=1}^K E_Z[z_nk] ln π_k + (α0 − 1) Σ_{k=1}^K ln π_k + const
         = Σ_{k=1}^K (N_k + α0 − 1) ln π_k + const
         = ln( ∏_{k=1}^K π_k^{N_k + α0 − 1} ) + const
so q*(π) is recognized as a Dirichlet distribution (10.57)–(10.58):
q*(π) = Dir(π|α),   α_k = N_k + α0
10.2.1 Variational Distribution (VII)
10.2 Variational Mixture of Gaussians 43
ln q*(µ_k, Λ_k) = ln p(µ_k, Λ_k) + Σ_{n=1}^N E_Z[z_nk] ln N(x_n | µ_k, Λ_k⁻¹) + const    (10.59)
This yields a Gaussian-Wishart distribution (exercise 10.13), with parameters given by (10.60)–(10.63).
This step is analogous to the M-step of the EM algorithm.
10.2.1 Variational Distribution (VIII)
10.2 Variational Mixture of Gaussians 44
• Optimizing the variational posterior for the Gaussian mixture (a sketch using a standard library follows below):
(1) Initialize the responsibilities r_nk.
(2) Update N_k, x̄_k, S_k using (10.51)–(10.53).
(3) [M step] Fix the responsibilities and use them to recompute the variational distribution over parameters:
    • use (10.57) to find q*(π)
    • use (10.59) to find q*(µ_k, Λ_k), k = 1, ..., K
(4) [E step] Use the current distribution over parameters to evaluate the responsibilities:
    • use (10.64)–(10.66) and (10.46)–(10.49) to update the responsibilities r_nk.
(5) Return to (2) until convergence.
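A sketch using an off-the-shelf implementation (assuming scikit-learn is available; its BayesianGaussianMixture class implements essentially this variational loop; the data and prior values below are illustrative):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2D data with two true clusters
X = np.vstack([rng.normal([-2, 0], 0.5, size=(150, 2)),
               rng.normal([2, 1], 0.8, size=(150, 2))])

# Start with a deliberately large K; superfluous components are suppressed.
vbgmm = BayesianGaussianMixture(
    n_components=6,
    weight_concentration_prior_type="dirichlet_distribution",  # Dirichlet prior over pi, cf. (10.39)
    weight_concentration_prior=1e-3,     # small alpha_0 encourages pruning
    max_iter=500,
    random_state=0,
).fit(X)

print(np.round(vbgmm.weights_, 3))       # most mixing coefficients driven near zero
```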
10.2.1 Variational Distribution (IX)
10.2 Variational Mixture of Gaussians 45
Figure 10.6: Variational Bayesian mixture of K = 6 Gaussians applied to the Old Faithful data set, shown after increasing numbers of iterations. The ellipses denote the one-standard-deviation density contours for each of the components, and the density of red ink inside each ellipse corresponds to the mean value of the mixing coefficient for each component.
The mixing coefficients of superfluous components are driven toward zero, so those components effectively disappear.
Compare EM with Variational Bayes
10.2 Variational Mixture of Gaussians 46
• The computational complexity is essentially the same as for EM.
• As the number of data points N → ∞, the Bayesian treatment converges to the maximum-likelihood EM algorithm.
• Advantages of variational Bayes:
  A. The singularities that arise in maximum likelihood are absent in the Bayesian treatment; they are removed by the introduction of the prior.
  B. No over-fitting: the method can be used for determining the number of components.
10.2.2 Variational lower bound
10.2 Variational Mixture of Gaussians 47
• At each step of the iterative re-estimation procedure the value of this bound should not decrease.
• This is useful for testing convergence,
• and for checking the correctness of both the mathematical expressions and the implementation.
• For the variational mixture of Gaussians, the lower bound can be evaluated in closed form.
10.2.3 Predictive density
10.2 Variational Mixture of Gaussians 48
• Predictive density p(x̂|X) for a new value x̂ with corresponding latent variable ẑ (10.78).
• It depends on the posterior distribution of the parameters.
• As this posterior distribution is intractable, the variational approximation q(π) q(µ, Λ) can be used to obtain an approximate predictive density.
10.2.4 Number of components
10.2 Variational Mixture of Gaussians 49
• For a given mixture model with K components, each parameter setting is a member of a family of K! equivalent settings.
Figure 10.7: Plot of the variational lower bound L versus the number K of components in the Gaussian mixture model, for the Old Faithful data, showing a distinct peak at K = 2 components. For each value of K, the model is trained from 100 different random starts, and the results are shown as ‘+’ symbols plotted with small random horizontal perturbations so that they can be distinguished. Note that some solutions find suboptimal local maxima, but this happens infrequently.
• Alternatively, start with a relatively large value of K; components with insufficient contribution are pruned out as their mixing coefficients are driven to zero.
10.2.5 Induced factorizations
10.2 Variational Mixture of Gaussians 50
Induced factorizations arise from an interaction between the factorization assumption in the variational posterior and the conditional independence properties of the true posterior.
• For example, let A, B, C be disjoint groups of latent variables.
• Factorization assumption: q(A, B, C) = q(A, B) q(C)
• The optimal solution: ln q*(A, B) = E_C[ln p(A, B|X, C)] + const
• We need to determine whether q*(A, B) factorizes as q(A) q(B). This holds iff A and B are conditionally independent given X and C.
• This can also be determined from the directed graphical model, using d-separation.
Progress…
Variational Inference 51
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
10.7. Expectation
Propagation
10.3. Variational Linear
Regression
10.3 Variational Linear Regression
10.3 Variational Linear Regression 52
• Return to the Bayesian linear regression model (section 3.3).
• There, the integration over the hyperparameters α, β was approximated by point estimates obtained by maximizing the log marginal likelihood (the evidence approximation):
α̂, β̂ = argmax_{α,β} ln p(t | α, β)
(input variables x are omitted from the notation; t denotes the training target values; with flat hyperpriors these point estimates are maximum-likelihood estimates of the hyperparameters, which are then used for prediction).
10.3 Variational Linear Regression
10.3 Variational Linear Regression 53
• A fully Bayesian approach would integrate over the hyperparameters as well as over the parameters (this section). Suppose that the noise precision parameter β is known.
• Likelihood (10.87): p(t|w) = ∏_{n=1}^N N(t_n | wᵀφ_n, β⁻¹), where φ_n = φ(x_n).
• Prior over w (10.88):
p(w|α) = N(w | 0, α⁻¹ I) = (2π)^{−M/2} |Σ|^{−1/2} exp{ −(1/2) wᵀ Σ⁻¹ w },   Σ = α⁻¹ I
• Prior over α (10.89): p(α) = Gam(α | a0, b0) ∝ α^{a0 − 1} exp(−b0 α)
• The joint distribution of all the variables (10.90): p(t, w, α) = p(t|w) p(w|α) p(α)
10.3.1 Variational Distribution
10.3 Variational Linear Regression 54
• Goal: find an approximation to the posterior distribution p(w, α|t).
• Factorized approximation (10.91): q(w, α) = q(w) q(α)
• Optimal solution (from 10.9): ln q*_j(Z_j) = E_{i≠j}[ln p(X, Z)] + const, giving (10.92)
ln q*(α) = E_w[ln( p(t|w) p(w|α) p(α) )] + const
         = ln p(α) + E_w[ln p(w|α)] + const
         = (a0 − 1) ln α − b0 α + (M/2) ln α − (α/2) E_w[wᵀw] + const
then (10.93)–(10.95)
q*(α) = Gam(α | a_N, b_N),   a_N = a0 + M/2,   b_N = b0 + (1/2) E_w[wᵀw]
where M is the number of fitting parameters w_i (the dimensionality of the feature vector).
10.3.1 Variational Distribution
10.3 Variational Linear Regression 55
• Similarly, the optimal solution for q(w) follows from (10.96)–(10.98); the result is a quadratic form in w, so (10.99)
q*(w) = N(w | m_N, S_N)
with m_N and S_N given on the next slide.
10.3.1 Variational Distribution
10.3 Variational Linear Regression 56
• Optimal solutions:
q*(α) = Gam(α | a_N, b_N),   a_N = a0 + M/2,   b_N = b0 + (1/2) E_w[wᵀw]    (10.93)–(10.95)
q*(w) = N(w | m_N, S_N),   m_N = β S_N Φᵀt,   S_N = ( E[α] I + β ΦᵀΦ )⁻¹    (10.99)–(10.101)
• Required moments (10.102)–(10.103):
E[α] = a_N / b_N,   E[wᵀw] = m_Nᵀ m_N + Tr(S_N)
A sketch of the resulting alternating updates follows below.
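A minimal sketch of these alternating updates (synthetic polynomial data; the design matrix, noise level, and prior values are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, beta = 30, 4, 1.0 / 0.09                   # beta = known noise precision
x = rng.uniform(-1, 1, size=N)
t = 0.5 - 1.0 * x + 2.0 * x**3 + rng.normal(0, 0.3, size=N)
Phi = np.vander(x, M, increasing=True)           # polynomial features, N x M

a0, b0 = 1e-3, 1e-3                              # broad Gamma prior on alpha
E_alpha = a0 / b0
for _ in range(100):
    # q(w) = N(w | mN, SN), eqs (10.100)-(10.101)
    SN = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t
    # q(alpha) = Gam(alpha | aN, bN), eqs (10.94)-(10.95), using moment (10.103)
    aN = a0 + M / 2.0
    bN = b0 + 0.5 * (mN @ mN + np.trace(SN))
    E_alpha = aN / bN                            # moment (10.102)

print(mN)            # approximate posterior mean of the weights
print(E_alpha)       # E[alpha] under q(alpha)
```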
More about variational linear regression
10.3 Variational Linear Regression 57
• Predictive distribution over t, given a new input x (10.105)
• Lower bound (10.107)
More about variational linear regression
10.3 Variational Linear Regression 58
(Figure axes: lower bound vs. order M of the polynomial model.)
Figure 10.9: The lower bound versus the order M of the polynomial model. A set of 10 data points is generated from a polynomial with M = 3, sampled over (−5, 5), with additive Gaussian noise of variance 0.09. The value of the bound gives an approximation to the log probability of the model.
The bound peaks at M = 3, corresponding to the true model.
Progress…
Variational Inference 59
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.3. Variational Linear
Regression
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
Part I:
Probabilistic modeling and 

the variational principle
Now:
Design the 

variational algorithms
10.7. Expectation
Propagation
Progress…
Variational Inference 60
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
10.7. Expectation
Propagation
10.3. Variational Linear
Regression
10.4 Exponential Family Distributions
10.4 Exponential Family Distributions 61
• For many of the models in this book, the complete data likelihood is drawn
from the exponential family
• In general, this will not be the case for the marginal likelihood function for
the observed data. Ex: in a mixture of Gaussians.
10.4 Exponential Family Distributions
10.4 Exponential Family Distributions 62
• Observed data X = {x1, ..., xN}; latent variables Z = {z1, ..., zN}.
• Suppose that the joint distribution is a member of the exponential family (10.113):
p(X, Z|η) = ∏_{n=1}^N h(x_n, z_n) g(η) exp{ ηᵀ u(x_n, z_n) }
with the conjugate prior for η (10.114):
p(η|ν0, χ0) = f(ν0, χ0) g(η)^{ν0} exp{ ν0 ηᵀ χ0 }
(interpreted as ν0 prior observations all having the value χ0 for the u vector).
• Variational distributions: q(Z, η) = q(Z) q(η)
10.4 Exponential Family Distributions
10.4 Exponential Family Distributions 63
• Optimal solution (from 10.9): ln q*_j(Z_j) = E_{i≠j}[ln p(X, Z)] + const. Hence
ln q*(Z) = E_η[ln p(X, Z, η)] + const = E_η[ln p(X, Z|η) + ln p(η|ν0, χ0)] + const
         = E_η[ln p(X, Z|η)] + const
         = Σ_{n=1}^N { ln h(x_n, z_n) + E_η[ηᵀ] u(x_n, z_n) } + const
This is a sum of independent terms, one per n, so a factorization over n is induced (10.115):
q*(Z) = ∏_n q*(z_n)
where (10.116)
q*(z_n) = h(x_n, z_n) g(E_η[η]) exp{ E_η[ηᵀ] u(x_n, z_n) }
10.4 Exponential Family Distributions
10.4 Exponential Family Distributions 64
• Optimal solution (from 10.9), (10.118):
ln q*(η) = E_Z[ln p(X, Z, η)] + const = ln p(η|ν0, χ0) + E_Z[ln p(X, Z|η)] + const
         = ν0 ln g(η) + ν0 ηᵀ χ0 + Σ_{n=1}^N { ln g(η) + ηᵀ E_{z_n}[u(x_n, z_n)] } + const
then (10.119)–(10.121)
q*(η) = f(ν_N, χ_N) g(η)^{ν_N} exp{ ν_N ηᵀ χ_N }
where
ν_N = ν0 + N,   ν_N χ_N = ν0 χ0 + Σ_{n=1}^N E_{z_n}[u(x_n, z_n)]
Variational message passing
10.4 Exponential Family Distributions 65
• The joint distribution corresponding to a directed graph is
p(x) = ∏_i p(x_i | pa_i)
where x_i denotes the variable(s) associated with node i (latent or observed) and pa_i is the parent set corresponding to node i.
• Variational approximation: q(x) = ∏_i q_i(x_i)
• The optimal solution for a factor q_j(x_j) depends only on the nodes in the Markov blanket of node j; thus the update of each factor in the variational posterior distribution represents a local calculation on the graph.
Variational message passing
10.4 Exponential Family Distributions 66
• If all the conditional distributions have a conjugate-exponential structure, the variational updates can be expressed as local messages passed between neighbouring nodes.
• The distribution associated with a particular node can be updated once that node has received messages from all of its parents and all of its children.
• This requires that the children have already received messages from their co-parents.
Progress…
Variational Inference 67
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
10.7. Expectation
Propagation
10.3. Variational Linear
Regression
10.5 Local Variational Methods
10.5 Local Variational Methods 68
• Global methods: approximation to the full posterior
• Local methods: approximation to individual or groups of variables
• Replace the likelihood with a simpler form - lower bound
that makes the expectation easy to compute
Convex duality
10.5 Local Variational Methods 69
For a convex function f(x), any line ηx − g(η) lying below it is a lower bound, but not the best one; the line is moved vertically until it becomes a tangent, giving the tightest bound.
Convex function f(x) (10.130)–(10.131):
g(η) = max_x { ηx − f(x) },   f(x) = max_η { ηx − g(η) }
Concave function f(x) (10.132)–(10.133):
g(η) = min_x { ηx − f(x) },   f(x) = min_η { ηx − g(η) }
10.5 Local Variational Methods
10.5 Local Variational Methods 70
Original problem: the logistic sigmoid
p(y = 1|x) = 1 / (1 + exp(−x)) = σ(x)
f(x) = ln σ(x) is a concave function, so consider (10.135)–(10.137):
g(η) = min_x { ηx − f(x) } = −η ln η − (1 − η) ln(1 − η)
ln σ(x) ≤ ηx − g(η)   ⇒   σ(x) ≤ exp(ηx − g(η))
This gives an upper bound on the logistic sigmoid.
10.5 Local Variational Methods
10.5 Local Variational Methods 71
• Gaussian lower bound on the logistic sigmoid (Jaakkola and Jordan, 2000):
f(x) = −ln(e^{x/2} + e^{−x/2}) is a convex function of the variable x², so consider
g(η) = max_{x²} { ηx² − f(√(x²)) }
The stationarity condition (10.139):
0 = η − (d/dx²) f(x) = η + (1/(4x)) tanh(x/2)
Writing the stationary point as x = ξ and defining
λ(ξ) = −η = (1/(4ξ)) tanh(ξ/2) = (1/(2ξ)) [ σ(ξ) − 1/2 ]
gives
g(−λ(ξ)) = −λ(ξ) ξ² − f(ξ) = −λ(ξ) ξ² + ln(e^{ξ/2} + e^{−ξ/2})
The bound on f(x):
f(x) ≥ −λ(ξ) x² − g(−λ(ξ)) = −λ(ξ) x² + λ(ξ) ξ² − ln(e^{ξ/2} + e^{−ξ/2})
The bound on the sigmoid (10.144):
σ(x) ≥ σ(ξ) exp{ (x − ξ)/2 − λ(ξ)(x² − ξ²) }
This is the Gaussian-form lower bound on the logistic sigmoid.
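A quick numerical check of the bound (10.144) (not from the slides): for any x and ξ, the right-hand side should never exceed σ(x), with equality at x = ±ξ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # lambda(xi) = (1 / (2 xi)) * (sigma(xi) - 1/2), as defined above
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def lower_bound(x, xi):
    # sigma(x) >= sigma(xi) * exp{ (x - xi)/2 - lambda(xi) (x^2 - xi^2) }, eq. (10.144)
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x**2 - xi**2))

x = np.linspace(-6, 6, 241)
xi = 2.5                                        # an arbitrary variational parameter
assert np.all(lower_bound(x, xi) <= sigmoid(x) + 1e-12)
print(lower_bound(xi, xi), sigmoid(xi))         # equal: the bound is tight at x = xi
```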
How the bounds can be used
10.5 Local Variational Methods 72
• Evaluate the intractable integral
I = ∫ σ(a) p(a) da
• Apply the local variational bound σ(a) ≥ f(a, ξ), giving the variational bound
I ≥ ∫ f(a, ξ) p(a) da = F(ξ)
where ξ is an additional variational parameter (whose optimal value depends on a).
• A single ξ cannot be optimal for every a, so choose the compromise ξ* that maximizes F(ξ).
In Reviews…
10.5 Local Variational Methods 73
Original problem: p(y = 1|x) = 1 / (1 + exp(−x)) = σ(x)
Local bound: σ(x) ≥ σ(ξ) exp{ (x − ξ)/2 − λ(ξ)(x² − ξ²) }, where ξ is an additional variational parameter.
The bound contains only linear and quadratic terms in x, so expectations, especially against a Gaussian, are easy to compute.
Progress…
Variational Inference 74
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
10.7. Expectation
Propagation
10.3. Variational Linear
Regression
10.6 Variational Logistic Regression
10.6 Variational Logistic Regression 75
Return to the Bayesian logistic regression model (section 4.5).
• The posterior distribution is p(w|t) ∝ p(w) p(t|w), where the prior distribution is p(w) = N(w | m0, S0) and the likelihood function is
p(t|w) = ∏_{n=1}^N y_n^{t_n} {1 − y_n}^{1 − t_n},   where y_n = σ(wᵀφ_n)
• The Laplace (Gaussian) approximation of section 4.5: maximize the posterior to give w_MAP and then approximate
p(w|t) ≈ N(w | w_MAP, S_N)
10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 76
A practical example of the local variational method.
• Recap of the variational framework: maximize a lower bound on the marginal likelihood.
For the Bayesian logistic regression model the marginal likelihood is (10.147)
p(t) = ∫ p(t|w) p(w) dw = ∫ [ ∏_{n=1}^N p(t_n|w) ] p(w) dw
where the conditional distribution for each t_n is (10.148)
p(t_n|w) = σ(a_n)^{t_n} {1 − σ(a_n)}^{1 − t_n} = e^{a_n t_n} σ(−a_n),   a_n = wᵀφ_n
10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 77
• Lower bound on the logistic sigmoid (10.149): σ(z) ≥ σ(ξ) exp{ (z − ξ)/2 − λ(ξ)(z² − ξ²) }, where λ(ξ) = (1/(2ξ)) [ σ(ξ) − 1/2 ].
• We can therefore bound each likelihood factor p(t_n|w) from below (10.151).
• This gives a bound on the joint distribution of t and w (10.152): p(t|w) p(w) ≥ h(w, ξ) p(w), where h(w, ξ) is defined in (10.153) and each training-set observation (φ_n, t_n) has its own variational parameter ξ_n.
10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 78
• Taking logs gives a lower bound on the log of the joint distribution of t and w (10.154).
• Hypothesis for the prior p(w): Gaussian with parameters m0 and S0, considered fixed.
Then the right-hand side of (10.154) becomes a (quadratic) function of w (10.155).
10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 79
• Quantity of interest: the exact posterior distribution; this requires normalizing the left-hand side of (10.152), which is usually intractable.
• Work instead with the right-hand side of (10.155): a quadratic function of w which is a lower bound on p(w, t).
• This yields a Gaussian variational posterior of the form (10.156)
q(w) = N(w | m_N, S_N)
with m_N and S_N given by (10.157)–(10.158).
10.6.2 Optimizing the variational parameters
10.6 Variational Logistic Regression 80
• Determine the variational parameters ξ = {ξ_n} by maximizing the lower bound on the marginal likelihood (10.159), obtained by substituting (10.152) back into the marginal likelihood.
Two approaches:
• (1) View w as a latent variable and use the EM algorithm.
• (2) Compute and maximize the bound directly, using the fact that p(w) is Gaussian and ln h(w, ξ) is a quadratic function of w.
Both lead to the same re-estimation equations (10.163)–(10.164).
10.6.2 Optimizing the variational parameters
10.6 Variational Logistic Regression 81
• Invoke the EM algorithm (a code sketch follows below):
1. Initialize the values of ξ^old.
2. E step: use ξ^old to calculate the posterior distribution q(w) over w, (10.156)–(10.158).
3. M step: maximize the expected complete-data log likelihood (10.160)
Q(ξ, ξ^old) = E_q(w)[ ln h(w, ξ) p(w) ]
Solving the stationarity condition (10.162) gives the re-estimation equation (10.163).
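A compact sketch of this variational EM loop (synthetic data; the prior m0 = 0, S0 = I and the update formulas for q(w) are stated under the usual PRML form of (10.157)–(10.158), used here as an assumption):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

rng = np.random.default_rng(0)
N, M = 200, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])   # bias + 2 features
w_true = np.array([-0.5, 2.0, -1.0])
t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)

m0, S0_inv = np.zeros(M), np.eye(M)            # prior N(w | 0, I), illustrative choice
xi = np.ones(N)                                # initialize variational parameters
for _ in range(50):
    # E step: Gaussian q(w) = N(w | mN, SN)
    SN_inv = S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi            # S0^{-1} + 2 sum_n lam(xi_n) phi_n phi_n^T
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
    # M step: re-estimate xi_n, eq (10.163): xi_n^2 = phi_n^T (SN + mN mN^T) phi_n
    xi = np.sqrt(np.einsum("ni,ij,nj->n", Phi, SN + np.outer(mN, mN), Phi))

print(mN)        # approximate posterior mean, close to w_true
```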
Illustration
10.6 Variational Logistic Regression 82
10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 83
• Allow the hyperparameters to be inferred from the data set.
• Consider a simple Gaussian prior (10.165): p(w|α) = N(w | 0, α⁻¹ I)
• Consider a conjugate hyperprior given by a gamma distribution (10.166): p(α) = Gam(α | a0, b0)
• The marginal likelihood (10.167): p(t) = ∫∫ p(w, α, t) dw dα
where the joint distribution is (10.168): p(w, α, t) = p(t|w) p(w|α) p(α)
10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 84
Combine the global and local approaches.
• (1) Global approach: introduce a variational distribution q(w, α) and apply the standard decomposition (10.169) to obtain a lower bound L(q) ≤ ln p(t).
• (2) The lower bound L(q) is still intractable, so apply the local approach as before to get a further lower bound on L(q), and hence on ln p(t).
• (3) Assume that q factorizes as q(w, α) = q(w) q(α)    (10.172)
10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 85
• From the local bound (10.153) and the Gaussian prior (10.165), ln q(w) is a quadratic function of w, so it follows that (10.174)
q(w) = N(w | µ_N, Σ_N)
with µ_N and Σ_N given by (10.175)–(10.176).
10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 86
• Similarly, for q(α): from (10.165) and the gamma hyperprior (10.166), p(α) = Gam(α | a0, b0) ∝ α^{a0 − 1} exp(−b0 α), we have (10.177)–(10.179)
q(α) = Gam(α | a_N, b_N) = (1/Γ(a_N)) b_N^{a_N} α^{a_N − 1} e^{−b_N α}
with a_N and b_N given by (10.178)–(10.179).
10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 87
• The variational parameters ξ are obtained by maximizing the lower bound (10.180)–(10.182), giving the re-estimation equations (10.183), derived as before from
Q(ξ, ξ^old) = E_q(w)[ ln h(w, ξ) p(w) ]    (10.160)
Progress…
Variational Inference 88
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
10.7. Expectation
Propagation
10.3. Variational Linear
Regression
Expectation Propagation (Minka, 2001)
10.7 Expectation Propagation 89
• An alternative form of deterministic approximate inference, based on the reverse KL divergence KL(p||q) (instead of KL(q||p)), where p is the complex distribution:
KL(q||p) = ∫ q(z) ln[ q(z) / p(z) ] dz,   KL(p||q) = ∫ p(z) ln[ p(z) / q(z) ] dz
• Consider a fixed distribution p(z) and a member of the exponential family (10.184)
q(z) = h(z) g(η) exp{ ηᵀ u(z) }
• The Kullback-Leibler divergence as a function of η (10.185):
KL(p||q) = −ln g(η) − ηᵀ E_p(z)[u(z)] + const
Setting the gradient with respect to η to zero gives (10.186):
−∇ ln g(η) = E_p(z)[u(z)]
Expectation Propagation
10.7 Expectation Propagation 90
• For a member of the exponential family q(z) = h(z) g(η) exp{ ηᵀ u(z) } (10.184), normalization gives (10.187)
∫ h(z) g(η) exp{ ηᵀ u(z) } dz = 1
Taking the gradient of both sides with respect to η:
∇g(η) ∫ h(z) exp{ ηᵀ u(z) } dz + g(η) ∫ h(z) exp{ ηᵀ u(z) } u(z) dz = 0
so
−(1/g(η)) ∇g(η) = g(η) ∫ h(z) exp{ ηᵀ u(z) } u(z) dz = E_q(z)[u(z)],   i.e.   −∇ ln g(η) = E_q(z)[u(z)]
Combining this with (10.186): E_q(z)[u(z)] = E_p(z)[u(z)] — moment matching, setting the sufficient statistics (e.g. mean and covariance) of q(z) equal to those of p(z).
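A minimal numerical sketch of moment matching for a Gaussian q (my own example, not from the slides): the Gaussian that minimizes KL(p||q) simply matches the mean and covariance of p, here estimated from samples of a bimodal p.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a bimodal p(z): mixture of two 2D Gaussians.
z = np.vstack([rng.normal([-3.0, 0.0], 1.0, size=(5000, 2)),
               rng.normal([3.0, 0.0], 1.0, size=(5000, 2))])

# Minimizing KL(p || q) over Gaussians q(z) = N(z | m, S) amounts to
# matching the sufficient statistics (mean and covariance) of p.
m = z.mean(axis=0)
S = np.cov(z, rowvar=False)

print(m)   # ~ (0, 0): a single broad Gaussian straddling both modes
print(S)   # large variance along the first axis (mass-covering behaviour)
```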
Expectation Propagation
10.7 Expectation Propagation 91
• Assume the joint distribution of data D and hidden variables and parameters θ is a product of factors (10.188):
p(D, θ) = ∏_i f_i(θ)
• Quantities of interest: the posterior distribution (10.189)
p(θ|D) = (1/p(D)) ∏_i f_i(θ)
and the model evidence (10.190)
p(D) = ∫ ∏_i f_i(θ) dθ
Expectation Propagation
10.7 Expectation Propagation 92
• Expectation propagation is based on an approximation to the posterior distribution which is also given by a product of factors, q(θ) ∝ ∏_i f̃_i(θ), where each approximating factor f̃_i comes from the exponential family.
• Ideally we would like to determine the f̃_i by minimizing the KL divergence between the true posterior and the approximation, but this is intractable.
• Minimizing the KL divergence between each pair of factors f_i and f̃_i independently is tractable, but the resulting product is usually a poor approximation.
• EP: optimize each factor in turn using the current values of all the remaining factors (this works well for logistic-type models but poorly for mixture-type models, due to multi-modality).
Expectation Propagation
10.7 Expectation Propagation 93
Expectation Propagation
10.7 Expectation Propagation 94
The clutter problem
10.7 Expectation Propagation 95
The clutter problem
10.7 Expectation Propagation 96
• The observations come from a mixture of Gaussians of the form (10.209)
p(x|θ) = (1 − w) N(x|θ, I) + w N(x|0, aI)
where w is the proportion of the background clutter and is assumed to be known. The prior over θ is taken to be Gaussian (10.210).
• The joint distribution of N observations D and θ is given by (10.211).
• To apply EP, first identify the factors f_n(θ) (10.212), then choose the approximating exponential family.
• The factor approximations take the form of exponentials of quadratic functions, i.e. unnormalized Gaussians (10.213).
The clutter problem
10.7 Expectation Propagation 97
The clutter problem
10.7 Expectation Propagation 98
• Evaluate the approximation to the model evidence (10.223), with the required quantities given by (10.224).
Expectation Propagation on graphs
10.7 Expectation Propagation 99
• The factors need not be functions of all the variables. If the approximating distribution is fully factorized, EP reduces to loopy belief propagation.
• Write the joint distribution as a product of factors over subsets of variables (10.225), and seek an approximation q(x) that has the same factorization (10.226).
Expectation Propagation on graphs
10.7 Expectation Propagation 100
• Restrict attention to approximations in which the factors themselves factorize with respect to the individual variables (10.227).
Expectation Propagation on graphs
10.7 Expectation Propagation 101
• Suppose all the approximating factors are initialized, and we choose to refine one particular factor.
• When q factorizes, minimizing the reverse KL divergence leads to an optimal q whose factors are the corresponding marginals of p.
Expectation Propagation on graphs
10.7 Expectation Propagation 102
Standard belief propagation
10.7 Expectation Propagation 103
Standard belief propagation
10.7 Expectation Propagation 104
Standard belief propagation
10.7 Expectation Propagation 105
Standard belief propagation
10.7 Expectation Propagation 106
These are the same message-passing equations as in Chapter 8.
Expectation propagation
10.7 Expectation Propagation 107
• Sum-product belief propagation arises as a special case of EP when a fully factorized approximating distribution is used.
• EP can be seen as a way to generalize this: group factors and update them together, or use a partially disconnected graph.
• Open question: how do we choose the best grouping and disconnection?
• Summary: EP and variational message passing correspond to the optimization of two different KL divergences.
• Minka (2005) gives a more general point of view using the family of alpha-divergences, which includes both KL and reverse KL, as well as other divergences such as the Hellinger distance and the χ²-distance.
• He shows that by choosing to optimize one or another of these divergences, one can derive a broad range of message-passing algorithms, including variational message passing, loopy BP, EP, tree-reweighted BP, fractional BP, and power EP.
Summary
Variational Inference 108
10.1. Variational
Inference
10.2. Variational
Mixture of Gaussians
10.3. Variational Linear
Regression
10.4. Exponential Family
Distributions
10.5. Local Variational
Methods
10.6. Variational Logistic
Regression
Part I:
Probabilistic modeling and 

the variational principle
Part II:
Design the 

variational algorithms
10.7. Expectation
Propagation

MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 

Recently uploaded (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 

Approximate Inference (Chapter 10, PRML Reading)

  • 1. VC.M. Bishop’s PRML Chapter 10: Approximate Inference Tran Quoc Hoan @k09hthaduonght.wordpress.com/ 13 December 2015, PRML Reading, Hasegawa lab., Tokyo The University of Tokyo
  • 2. Excuse me… Variational Inference 2 Section Concentrate ability Speaker Audiences Should we take a break? (or next time)
  • 3. Outline Variational Inference 3 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.3. Variational Linear Regression 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression Part I: Probabilistic modeling and 
 the variational principle Part II: Design the 
 variational algorithms 10.7. Expectation Propagation
  • 4. Progress… Variational Inference 4 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression 10.7. Expectation Propagation 10.3. Variational Linear Regression
  • 5. Probabilistic Inference 10.1 Variational Inference 5 Any mechanism by which we deduce the probabilities in our model based data. Statistical Inference In probabilistic models, we need reason over the probability of events Inference links the observed data with our statistical assumptions and allows us to ask questions of our data: predictions, visualization, model selection.
  • 6. Modeling and Inference 10.1 Variational Inference 6 Posterior Bayes’ rule in many of inferential problems Probabilistic modeling will involve: • Decide on a priori beliefs. • Posit an explanation of how the observed data is generated, i.e. provide a probabilistic description. = Likelihood Prior Marginal likelihood (Model evidence) p(z|x) p(x|z) p(z) Z p(x, z)dz Observed data Hidden variable
  • 7. Modeling and Inference 10.1 Variational Inference 7 Most inference problems will be one of: Marginalization Expectation Prediction Posterior = Likelihood Prior Marginal likelihood (Model evidence) Z p(x, z)dz p(z|x) p(x|z) p(z) Complex form for which the expectations are not tractable
  • 8. Importance Sampling 10.1 Variational Inference 8 IntegralBasic idea: Transform the integral into an expectation over a simple, known distribution p(z) f(z) z q(z) Conditions: • q(z) > 0 when f(z)p(z) ≠ 0 • Easy to sample from q(z) E[f] = Z f(z)p(z)dz E[f] = Z f(z)p(z) q(z) q(z) dz Notice: x is abbreviated in formula E[f] = Z f(z) p(z) q(z) q(z)dz w(s) = p(z(s) ) q(z(s)) z(s) ⇠ q(z) E[f] = 1 S X s w(s) f(z(s) ) Proposal Importance weight Monte Carlo
  • 9. Importance Sampling 10.1 Variational Inference 9 p(z) f(z) z q(z) Properties: • Unbiased estimate of the expectation • No independent samples from posterior distribution • Many draws from proposal needed, especially in high dimensions E[f] = 1 S X s w(s) f(z(s) ) w(s) = p(z(s) ) q(z(s)) z(s) ⇠ q(z) Stochastic Approximation Chapter 11
  • 10. Importance Sampling 10.1 Variational Inference 10 p(z) f(z) z q(z) Take inspiration from importance sampling, but instead: • Obtain a deterministic algorithm • Scaled up to high-dimensional and large data problems • Easy convergence assessment E[f] = 1 S X s w(s) f(z(s) ) w(s) = p(z(s) ) q(z(s)) z(s) ⇠ q(z) Variational Inference
  • 11. What is a Variational Method? 10.1 Variational Inference 11 Variational Principle General family of methods for approximating complicated densities by a simpler class of densities Approximation class True posterior Deterministic approximation procedures with bounds on probabilities of interest Fit the variational parameters
  • 12. Variational Calculus 10.1 Variational Inference 12 Functions: • Variables as input, output is a value • Full and partial derivatives df/dx • Ex. Maximize likelihood p(x|µ) w.r.t. parameters µ Both types of derivatives are exploited in variational inference Functionals: • Functions as input, output is a value • Functional derivatives ∂f/∂x • Ex. Maximize the entropy H[p(x)] w.r.t. p(x) Variational method derives from the Calculus of Variations
  • 13. From IS to Variational Inference 10.1 Variational Inference 13 Start from the integral for the evidence, introduce an importance weight, and apply Jensen's inequality $\ln \int p(x)g(x)\,dx \ge \int p(x)\ln g(x)\,dx$: $\ln p(X) = \ln \int p(X|Z)\,p(Z)\,dZ = \ln \int p(X|Z)\,\frac{p(Z)}{q(Z)}\,q(Z)\,dZ \ge \int q(Z)\ln\!\left(p(X|Z)\frac{p(Z)}{q(Z)}\right) dZ = \int q(Z)\ln p(X|Z)\,dZ - \int q(Z)\ln\frac{q(Z)}{p(Z)}\,dZ = \mathbb{E}_{q(Z)}[\ln p(X|Z)] - \mathrm{KL}[q(Z)\,\|\,p(Z)]$, the variational (evidence) lower bound.
  • 14. Variational Lower Bound 10.1 Variational Inference 14 $F(X, q) = \mathbb{E}_{q(Z)}[\ln p(X|Z)] - \mathrm{KL}[q(Z)\,\|\,p(Z)]$: a reconstruction term minus a penalty term, evaluated under the approximate posterior. Interpreting the bound: • Penalty: ensures that the explanation of the data, q(Z), does not deviate too far from your prior beliefs p(Z). • Reconstruction cost: the expected log-likelihood measures how well samples from q(Z) are able to explain the data X. • Approximate posterior distribution q(Z): the best match to the true posterior p(Z|X), one of the unknown inferential quantities of interest to us.
  • 15. How low is the VLB (ELBO)? 10.1 Variational Inference 15 Some comments on q: • Variational parameters: the parameters of q(Z) (e.g. if it is a Gaussian, they are its mean and variance). • Integration switched to optimization: optimize q(Z) directly (my thought: actually it is q(Z|X)) so as to minimize the gap $\ln p(X) - F(X, q) = \int q(Z)\ln p(X)\,dZ - \int q(Z)\ln p(X|Z)\,dZ + \int q(Z)\ln\frac{q(Z)}{p(Z)}\,dZ = \int q(Z)\ln\frac{p(X)\,q(Z)}{p(X|Z)\,p(Z)}\,dZ = \int q(Z)\ln\frac{q(Z)}{p(Z|X)}\,dZ = \mathrm{KL}[q(Z)\,\|\,p(Z|X)]$.
  • 16. From the book 10.1 Variational Inference 16 $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)$, where $\mathcal{L}(q) = F(X, q) = \int q(Z)\ln\!\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ$ and $\mathrm{KL}(q\,\|\,p) = -\int q(Z)\ln\!\left\{\frac{p(Z|X)}{q(Z)}\right\} dZ$ (10.2)-(10.4). The maximum of $\mathcal{L}(q)$ occurs when q(Z) = p(Z|X); the variational method approximates this maximum. Remaining questions: • What exactly is q(Z)? • How do we find the variational parameters? • How do we compute the gradients? • How do we optimize the model parameters?
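A tiny numerical check can make this decomposition concrete. The sketch below is not part of the slides; it uses a made-up discrete model with three latent states, and the joint probabilities and the variational distribution q are arbitrary illustrative values.

```python
import numpy as np

# Numerical check of ln p(X) = L(q) + KL(q || p(Z|X)) on a toy discrete model.
p_joint = np.array([0.10, 0.25, 0.05])   # p(X=x, Z=k) for one observed x (assumed values)
p_X = p_joint.sum()                      # marginal likelihood p(X=x)
p_post = p_joint / p_X                   # true posterior p(Z|X=x)

q = np.array([0.5, 0.3, 0.2])            # any normalized variational distribution

L  = np.sum(q * np.log(p_joint / q))     # lower bound L(q)
KL = np.sum(q * np.log(q / p_post))      # KL(q || p(Z|X))
print(L + KL, np.log(p_X))               # the two numbers agree
```

Whatever q is chosen, L(q) + KL(q||p) reproduces ln p(X), so L(q) alone is always a lower bound on the log evidence.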
  • 17. Free-form and Fixed-form 10.1 Variational Inference 17 A free-form variational method solves for the exact optimal distribution by setting the functional derivative to zero, $\frac{\delta \mathcal{L}(q)}{\delta q(Z)} = 0$ subject to $\int q(Z)\,dZ = 1$. The optimal solution is the true posterior, $q(Z) \propto p(Z)\,p(X|Z, \theta)$, but computing its normalization constant is exactly the original intractable problem. A fixed-form variational method instead specifies an explicit form of the q-distribution, $q_\lambda(Z) = f(Z; \lambda)$ with variational parameter $\lambda$, ideally taken from a rich class of distributions.
  • 18. 10.1.1 Factorized distributions (I) 10.1 Variational Inference 18 • Mean-field methods assume that the distribution is factorized: a restricted class of approximations in which every dimension (or subset of dimensions) of the posterior is independent. • Let Z be partitioned into disjoint groups $Z_i$ ($i = 1, \dots, M$): $q(Z) = \prod_{i=1}^{M} q_i(Z_i)$, with no restriction on the functional form of the factors $q_i(Z_i)$.
  • 19. Factorized distributions (II) 10.1 Variational Inference 19 $\mathcal{L}(q) = \int q(Z)\ln\!\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ = \int q(Z)\{\ln p(X, Z) - \ln q(Z)\}\, dZ = \int \Big(\prod_i q_i(Z_i)\Big)\Big\{\ln p(X, Z) - \sum_i \ln q_i(Z_i)\Big\}\, dZ = \int \Big(\prod_i q_i\Big)\ln p(X, Z)\, dZ - \int \Big(\prod_i q_i\Big)\Big(\sum_i \ln q_i\Big) dZ$.
  • 20. Factorized distributions (III) 10.1 Variational Inference 20 Isolate the dependence on one factor $q_j(Z_j)$, using $\int q_j\,dZ_j = 1$: $\mathcal{L}(q) = \int q_j \Big\{\int \ln p(X, Z) \prod_{i \neq j} q_i\, dZ_i\Big\}\, dZ_j - \int q_j \ln q_j\, dZ_j - \int \Big(\prod_{i \neq j} q_i\Big)\Big(\sum_{i \neq j} \ln q_i\Big)\, dZ_{i \neq j} = \int q_j\,\big(\ln \tilde{p}(X, Z_j) + \text{const}\big)\, dZ_j - \int q_j \ln q_j\, dZ_j + \text{const}$.
  • 21. Factorized distributions (IV) 10.1 Variational Inference 21 $\mathcal{L}(q) = \int q_j \ln \tilde{p}(X, Z_j)\, dZ_j - \int q_j \ln q_j\, dZ_j + \text{const}$, a negative KL divergence, where $\ln \tilde{p}(X, Z_j) = \mathbb{E}_{i \neq j}[\ln p(X, Z)] + \text{const}$ and $\mathbb{E}_{i \neq j}[\ln p(X, Z)] = \int \ln p(X, Z) \prod_{i \neq j} q_i\, dZ_i$ (10.6)-(10.8). • Maximize $\mathcal{L}(q)$ keeping $\{q_{i \neq j}\}$ fixed. • This is the same as minimizing the KL divergence between $q_j(Z_j)$ and $\tilde{p}(X, Z_j)$.
  • 22. Optimal Solution 10.1 Variational Inference 22 Today's memo: $\ln q^*_j(Z_j) = \mathbb{E}_{i \neq j}[\ln p(X, Z)] + \text{const}$, or equivalently $q^*_j(Z_j) = \frac{\exp(\mathbb{E}_{i \neq j}[\ln p(X, Z)])}{\int \exp(\mathbb{E}_{i \neq j}[\ln p(X, Z)])\, dZ_j}$ (10.9). (1) Initialize all $q_j$ appropriately. (2) Loop until convergence: • for each $q_i$ • fix all $q_j$, $j \neq i$, and find the optimal $q_i$ • update $q_i$. The next slides are detailed examples.
  • 23. 10.1.2. Properties of factorized approximations 10.1 Variational Inference 23 Approximate a Gaussian distribution with a factorized Gaussian. Consider $p(z) = \mathcal{N}(z|\mu, \Lambda^{-1})$ with $z = (z_1, z_2)$, $\mu = (\mu_1, \mu_2)^T$, $\Lambda = \begin{pmatrix}\Lambda_{11} & \Lambda_{12}\\ \Lambda_{21} & \Lambda_{22}\end{pmatrix}$, approximated using $q(z) = q_1(z_1)\,q_2(z_2)$. The optimal solution from (10.9), keeping only the terms that involve $z_1$: $\ln q^*_1(z_1) = \mathbb{E}_{z_2}[\ln p(z)] + \text{const} = \mathbb{E}_{z_2}\!\left[-\tfrac{1}{2}(z_1 - \mu_1)^2\Lambda_{11} - (z_1 - \mu_1)\Lambda_{12}(z_2 - \mu_2)\right] + \text{const}$.
  • 24. 10.1.2. Properties of factorized approximations 10.1 Variational Inference 24 $\ln q^*_1(z_1) = \int q_2(z_2)\left\{-\tfrac{1}{2}(z_1 - \mu_1)^2\Lambda_{11} - (z_1 - \mu_1)\Lambda_{12}(z_2 - \mu_2)\right\} dz_2 + \text{const} = -\tfrac{1}{2}(z_1 - \mu_1)^2\Lambda_{11} - z_1\Lambda_{12}(\mathbb{E}_{z_2}[z_2] - \mu_2) + \text{const}$, a quadratic form in $z_1$. Then we have $q^*_1(z_1) = \mathcal{N}(z_1|m_1, \Lambda_{11}^{-1})$ with $m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(\mathbb{E}_{z_2}[z_2] - \mu_2)$ (10.11).
  • 25. 10.1.2. Properties of factorized approximations 10.1 Variational Inference 25 Optimal solution of $q(z) = q_1(z_1)\,q_2(z_2)$ (10.12)-(10.15): $q^*_1(z_1) = \mathcal{N}(z_1|m_1, \Lambda_{11}^{-1})$ with $m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(\mathbb{E}_{z_2}[z_2] - \mu_2)$, and $q^*_2(z_2) = \mathcal{N}(z_2|m_2, \Lambda_{22}^{-1})$ with $m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}(\mathbb{E}_{z_1}[z_1] - \mu_1)$. Mutual dependency: • $q^*_1(z_1)$ depends on $\mathbb{E}_{z_2}[z_2]$ (calculated from $q^*_2(z_2)$) • $q^*_2(z_2)$ depends on $\mathbb{E}_{z_1}[z_1]$ (calculated from $q^*_1(z_1)$). Update them alternately until convergence.
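A minimal numerical sketch of these alternating updates; the values of µ and Λ below are made up for illustration and are not from the slides.

```python
import numpy as np

# Alternating updates (10.12)-(10.15) for the toy example: approximate a correlated
# 2-D Gaussian p(z) = N(z | mu, Lambda^{-1}) by a factorized q(z) = q1(z1) q2(z2).
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])     # precision matrix (assumed values)

m1, m2 = 0.0, 0.0                # initial means of q1 and q2
for _ in range(50):
    # q1*(z1) = N(z1 | m1, Lambda_11^{-1}),  m1 = mu1 - Lambda_11^{-1} Lambda_12 (E[z2] - mu2)
    m1 = mu[0] - Lam[0, 1] * (m2 - mu[1]) / Lam[0, 0]
    # q2*(z2) = N(z2 | m2, Lambda_22^{-1}),  m2 = mu2 - Lambda_22^{-1} Lambda_21 (E[z1] - mu1)
    m2 = mu[1] - Lam[1, 0] * (m1 - mu[0]) / Lam[1, 1]

print(m1, m2)   # converges to mu = (1, -1); the variances 1/Lambda_11, 1/Lambda_22
                # underestimate the true marginal variances, as Fig 10.2 illustrates
```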
  • 26. 10.1.2. Properties of factorized approximations 10.1 Variational Inference 26 Fig 10.2: The green contours correspond to 1, 2 and 3 standard deviations of a correlated Gaussian distribution p(z) over two variables z1 and z2; the red contours represent the corresponding levels for an approximating distribution q(z) over the same variables given by the product of two independent univariate Gaussians, obtained by minimizing KL(q||p) and KL(p||q) respectively. • Minimizing KL(q||p): the mean is captured correctly, but the variance is underestimated in the orthogonal direction. • Considering the reverse KL divergence $\mathrm{KL}(p\,\|\,q) = -\int p(Z)\Big[\sum_{i=1}^{M}\ln q_i(Z_i)\Big] dZ + \text{const}$ (10.17): the optimal solution for each factor is the corresponding marginal distribution of p(Z).
  • 27. 10.1.2. Properties of factorized approximations 10.1 Variational Inference 27 Compare minimizing $\mathrm{KL}(q\,\|\,p) = -\int q(Z)\ln\big\{\frac{p(Z)}{q(Z)}\big\} dZ$ with minimizing the reverse $\mathrm{KL}(p\,\|\,q) = -\int p(Z)\big[\sum_{i=1}^{M}\ln q_i(Z_i)\big] dZ + \text{const}$. • For KL(q||p): where p(Z) is near zero, q(Z) tends to be driven close to zero as well (zero-forcing). • For KL(p||q): regions where p(Z) is near zero contribute little, so the behaviour of q(Z) there is not important. • The reverse KL divergence is therefore minimized by distributions q(Z) that are nonzero in regions where p(Z) is nonzero (mass-covering). A numerical illustration of the zero-forcing behaviour follows below.
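The sketch below is an illustration, not from the slides: it fits a single Gaussian q to a made-up bimodal p by minimizing KL(q||p) numerically with scipy, and the fitted q locks onto one mode instead of covering both.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Zero-forcing behaviour of KL(q||p): fit a single Gaussian q to a bimodal p.
z = np.linspace(-10, 10, 4001)
dz = z[1] - z[0]
p = 0.5 * norm.pdf(z, -2.0, 0.5) + 0.5 * norm.pdf(z, 3.0, 1.0)   # bimodal target (assumed)

def kl_q_p(params):
    mu, log_sigma = params
    q = norm.pdf(z, mu, np.exp(log_sigma))
    return np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dz

res = minimize(kl_q_p, x0=[0.0, 0.0], method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))   # mean near one of the two modes, small variance
```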
  • 28. More about divergence 10.1 Variational Inference 28 Fig 10.3: blue contours = a bimodal distribution p(Z); red contours = the single Gaussian distribution q(Z) that best approximates p(Z), (a) by minimizing KL(p||q), (b) and (c) by minimizing KL(q||p), which locks onto one of the two modes. • KL(p||q) and KL(q||p) belong to the alpha family of divergences $D_\alpha(p\,\|\,q)$, where $D_\alpha(p\,\|\,q) \to \mathrm{KL}(q\,\|\,p)$ as $\alpha \to -1$ and $D_\alpha(p\,\|\,q) \to \mathrm{KL}(p\,\|\,q)$ as $\alpha \to 1$. • If $\alpha \le -1$, q(x) will underestimate the support of p(x). • If $\alpha \ge 1$, q(x) will overestimate the support of p(x). • If $\alpha = 0$, the divergence is related to the Hellinger distance.
  • 29. 10.1.3 The univariate Gaussian (I) 10.1 Variational Inference 29 • Goal: infer the posterior distribution of the mean µ and precision τ given a data set D = {x1, ..., xN}. • Likelihood function (10.21). • Priors (10.22)-(10.23). • Approximate the posterior with the factorization q(µ, τ) = qµ(µ) qτ(τ) (10.24).
  • 30. 10.1.3 The univariate Gaussian (II) 10.1 Variational Inference 30 • Optimal solution (from formula 10.9): $\ln q^*_\mu(\mu) = \mathbb{E}_\tau[\ln p(D, \mu, \tau)] + \text{const} = \mathbb{E}_\tau[\ln p(D|\mu, \tau) + \ln p(\mu|\tau)] + \text{const} = -\frac{\mathbb{E}_\tau[\tau]}{2}\left[\sum_{n=1}^{N}(x_n - \mu)^2 + \lambda_0(\mu - \mu_0)^2\right] + \text{const}$ (10.25), a quadratic form in µ.
  • 31. 10.1.3 The univariate Gaussian (II) 10.1 Variational Inference 31 • Optimal solution for the mean: $q^*_\mu(\mu) = \mathcal{N}(\mu|\mu_N, \lambda_N^{-1})$ where $\mu_N = \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0 + N}$ and $\lambda_N = (\lambda_0 + N)\,\mathbb{E}_\tau[\tau]$ (10.26)-(10.27). • Similarly, for the optimal solution of $q_\tau(\tau)$: $\ln q^*_\tau(\tau) = \mathbb{E}_\mu[\ln p(D|\mu, \tau) + \ln p(\mu|\tau) + \ln p(\tau)] + \text{const} = \left(a_0 + \frac{N+1}{2} - 1\right)\ln\tau - \left(b_0 + \frac{1}{2}\mathbb{E}_\mu[\dots]\right)\tau + \text{const}$ (10.28).
  • 32. 10.1.3 The univariate Gaussian (III) 10.1 Variational Inference 32 • Optimal solution of $q_\tau(\tau)$: $q^*_\tau(\tau) = \mathrm{Gam}(\tau|a_N, b_N)$ where $a_N = a_0 + \frac{N+1}{2}$ and $b_N = b_0 + \frac{1}{2}\mathbb{E}_\mu\!\left[\sum_{n=1}^{N}(x_n - \mu)^2 + \lambda_0(\mu - \mu_0)^2\right]$ (10.29)-(10.30). • Use (10.26)-(10.27) and (10.29)-(10.30) alternately to compute the approximate variational posterior for $p(\mu, \tau|D)$.
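A compact sketch of this coordinate-ascent scheme for the univariate Gaussian; the synthetic data and the prior values are assumptions made for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=100)     # synthetic data (assumed, for illustration)
N, xbar = len(x), x.mean()

# Prior hyperparameters (assumed values)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = a0 / b0                        # initial guess for E[tau]
for _ in range(100):
    # q_mu(mu) = N(mu | muN, lamN^{-1})          -- (10.26)-(10.27)
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    # q_tau(tau) = Gam(tau | aN, bN)             -- (10.29)-(10.30)
    aN = a0 + (N + 1) / 2
    E_sq = np.sum((x - muN) ** 2 + 1.0 / lamN) + lam0 * ((muN - mu0) ** 2 + 1.0 / lamN)
    bN = b0 + 0.5 * E_sq
    E_tau = aN / bN

print(muN, E_tau)    # posterior mean of mu and E[tau], roughly 1/variance of the data
```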
  • 33. 10.1.3 The univariate Gaussian (IV) 10.1 Variational Inference 33 Figure: convergence of the factorized approximation qµ(µ)qτ(τ) to the true posterior p(µ, τ|D), alternately re-estimating qµ(µ) and qτ(τ).
  • 34. 10.1.4 Model Comparison 10.1 Variational Inference 34 • Let the prior probabilities on the models be p(m). • Goal: determine p(m|X), where X is the observed data. • Approximate with q(Z, m) = q(Z|m) q(m). • The lower bound is $\mathcal{L} = \sum_m \sum_Z q(Z|m)\,q(m)\ln\!\left\{\frac{p(Z, X, m)}{q(Z|m)\,q(m)}\right\} \le \ln p(X)$ (10.34)-(10.35). • Maximizing $\mathcal{L}$ with respect to q(m) gives $q(m) \propto p(m)\exp(\mathcal{L}_m)$ with $\mathcal{L}_m = \sum_Z q(Z|m)\ln\!\left\{\frac{p(Z, X|m)}{q(Z|m)}\right\}$. • Maximizing $\mathcal{L}$ with respect to q(Z|m): the solutions for different m are coupled due to the conditioning. • In practice, optimize each q(Z|m) individually and subsequently find q(m).
  • 35. Progress… Variational Inference 35 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression 10.7. Expectation Propagation 10.3. Variational Linear Regression
  • 36. 10.2 Variational Mixture of Gaussians 10.2 Variational Mixture of Gaussians 36 • Goal: apply variational inference to the Gaussian mixture model. • Problem formulation: observed data $X = \{x_1, ..., x_N\}$, hidden variables $Z = \{z_1, ..., z_N\}$, with a latent variable $z_n = \{z_{nk}\}$ (a 1-of-K binary vector) corresponding to each observation $x_n$. • Conditional distribution of Z given the mixing coefficients π: $p(Z|\pi) = \prod_{n=1}^{N}\prod_{k=1}^{K}\pi_k^{z_{nk}}$ (10.37). • Conditional distribution of the observed data vectors, given the latent variables and the component parameters: $p(X|Z, \mu, \Lambda) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mathcal{N}(x_n|\mu_k, \Lambda_k^{-1})^{z_{nk}}$ (10.38).
  • 37. 10.2 Variational Mixture of Gaussians 10.2 Variational Mixture of Gaussians 37 • Conjugate prior distributions: • a Dirichlet distribution over the mixing coefficients π, $p(\pi) = \mathrm{Dir}(\pi|\alpha_0) = C(\alpha_0)\prod_{k=1}^{K}\pi_k^{\alpha_0 - 1}$ (10.39) • an independent Gaussian-Wishart prior governing the mean and precision of each Gaussian component (10.40), with $m_0 = 0$ chosen by symmetry. Fig 10.5: Directed acyclic graph representing the Bayesian mixture of Gaussians model.
  • 38. 10.2.1. Variational Distribution (I) 10.2 Variational Mixture of Gaussians 38 • Joint distribution: $p(X, Z, \pi, \mu, \Lambda) = p(X|Z, \mu, \Lambda)\,p(Z|\pi)\,p(\pi)\,p(\mu|\Lambda)\,p(\Lambda)$. • Approximate with $q(Z, \pi, \mu, \Lambda) = q(Z)\,q(\pi, \mu, \Lambda)$. • Optimal solution (from formula 10.9): $\ln q^*(Z) = \mathbb{E}_{\pi,\mu,\Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const} = \mathbb{E}_\pi[\ln p(Z|\pi)] + \mathbb{E}_{\mu,\Lambda}[\ln p(X|Z, \mu, \Lambda)] + \text{const} = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\,\mathbb{E}_\pi[\ln \pi_k] + \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\,\mathbb{E}_{\mu,\Lambda}\!\left[\tfrac{1}{2}\ln|\Lambda_k| - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}(x_n - \mu_k)^T\Lambda_k(x_n - \mu_k)\right] + \text{const}$ (10.42)-(10.44), where D is the dimensionality of the data variable x.
  • 39. 10.2.1 Variational Distribution (II) 10.2 Variational Mixture of Gaussians 39 • Optimal solution for q(Z): $\ln q^*(Z) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\ln\rho_{nk} + \text{const}$, where $\rho_{nk}$ collects the expectations above; then, normalized, $r_{nk} = \rho_{nk}/\sum_{j=1}^{K}\rho_{nj}$ (10.45)-(10.48), where the $r_{nk}$ can also be seen as responsibilities, as in the case of EM.
  • 40. 10.2.1 Variational Distribution (III) 10.2 Variational Mixture of Gaussians 40 • Define the statistics $N_k$, $\bar{x}_k$, $S_k$ of the responsibilities (10.51)-(10.53). • Optimal solution for $q(\pi, \mu, \Lambda)$: $\ln q^*(\pi, \mu, \Lambda) = \mathbb{E}_Z[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const} = \sum_{n=1}^{N}\sum_{k=1}^{K}\mathbb{E}_Z[z_{nk}]\ln\mathcal{N}(x_n|\mu_k, \Lambda_k^{-1}) + \mathbb{E}_Z[\ln p(Z|\pi)] + \ln p(\pi) + \sum_{k=1}^{K}\ln p(\mu_k, \Lambda_k) + \text{const}$ (10.54).
  • 41. 10.2.1 Variational Distribution (IV) 10.2 Variational Mixture of Gaussians 41 The right-hand side of (10.54) decomposes into terms involving only π plus terms involving only (µ, Λ), so the variational posterior factorizes further: $q^*(\pi, \mu, \Lambda) = q^*(\pi)\prod_{k=1}^{K}q^*(\mu_k, \Lambda_k)$ (10.55). From (10.54) and (10.55) we have $\ln q^*(\pi) = \mathbb{E}_Z[\ln p(Z|\pi)] + \ln p(\pi) + \text{const}$ and $\ln q^*(\mu_k, \Lambda_k) = \ln p(\mu_k, \Lambda_k) + \sum_{n=1}^{N}\mathbb{E}_Z[z_{nk}]\ln\mathcal{N}(x_n|\mu_k, \Lambda_k^{-1}) + \text{const}$.
  • 42. 10.2.1 Variational Distribution (VI) 10.2 Variational Mixture of Gaussians 42 $\ln q^*(\pi) = \mathbb{E}_Z[\ln p(Z|\pi)] + \ln p(\pi) + \text{const} = \sum_{n=1}^{N}\sum_{k=1}^{K}\mathbb{E}_Z[z_{nk}]\ln\pi_k + (\alpha_0 - 1)\sum_{k=1}^{K}\ln\pi_k + \text{const} = \sum_{k=1}^{K}(N_k + \alpha_0 - 1)\ln\pi_k + \text{const} = \ln\Big(\prod_{k=1}^{K}\pi_k^{N_k + \alpha_0 - 1}\Big) + \text{const}$, so $q^*(\pi)$ is recognized as a Dirichlet distribution: $q^*(\pi) = \mathrm{Dir}(\pi|\alpha)$ with $\alpha_k = N_k + \alpha_0$ (10.56)-(10.58).
  • 43. 10.2.1 Variational Distribution (VII) 10.2 Variational Mixture of Gaussians 43 ln q⇤ (µk, ⇤k) = ln p(µk, ⇤k) + NX n=1 EZ[znk] ln N(xn|µk, ⇤ 1 k ) + const We have Gaussian-Wishart distribution (exercise 10.13) (10.59) (10.60) - (10.63) Analogous to the M-step of the EM algorithm
  • 44. 10.2.1 Variational Distribution (VIII) 10.2 Variational Mixture of Gaussians 44 • Optimizing the variational posterior for the Gaussian mixture distribution: (1) Initialize the responsibilities $r_{nk}$. (2) Update $N_k$, $\bar{x}_k$, $S_k$ by (10.51)-(10.53). (3) [M step] Fix the responsibilities and use them to recompute the variational distribution over the parameters: use (10.57) to find $q^*(\pi)$ and (10.59) to find $q^*(\mu_k, \Lambda_k)$ (k = 1, ..., K). (4) [E step] Use the current distribution over the parameters to re-evaluate the responsibilities via (10.64)-(10.66) and (10.46)-(10.49). (5) Go back to (2) until convergence.
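In practice one rarely codes these updates by hand: scikit-learn's BayesianGaussianMixture implements essentially this mean-field scheme. The usage sketch below is illustrative only (the synthetic data and the choice K = 6 are assumptions), but it shows the pruning behaviour discussed on the following slides, where redundant mixing coefficients shrink towards zero.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Variational Bayesian GMM, fitted to two well-separated synthetic clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
               rng.normal([3.0, 3.0], 0.5, size=(200, 2))])

vb = BayesianGaussianMixture(
    n_components=6,                                            # deliberately too many
    weight_concentration_prior_type="dirichlet_distribution",  # Dirichlet prior on pi
    max_iter=500, random_state=0,
).fit(X)

print(np.round(vb.weights_, 3))   # mixing coefficients of redundant components are near zero
```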
  • 45. 10.2.1 Variational Distribution (IX) 10.2 Variational Mixture of Gaussians 45 Figure 10.6: Variational Bayesian mixture of K = 6 Gaussians applied to the Old Faithful data set, shown after increasing numbers of iterations. The ellipses denote the one standard-deviation density contours for each of the components, and the density of red ink inside each ellipse corresponds to the mean value of the mixing coefficient for each component. The mixing coefficients of redundant components are driven towards zero, so those components effectively disappear.
  • 46. Compare EM with Variational Bayes 10.2 Variational Mixture of Gaussians 46 • The two have essentially the same computational complexity. • As the number of data points N → ∞, the Bayesian treatment converges to the maximum likelihood EM algorithm. • The advantages of Variational Bayes: A. The singularities that arise in maximum likelihood are absent in the Bayesian treatment; they are removed by the introduction of the prior. B. No over-fitting: it can be used for determining the number of components.
  • 47. 10.2.2 Variational lower bound 10.2 Variational Mixture of Gaussians 47 • At each step of the iterative re-estimation procedure the value of this bound should not decrease • Useful to test convergence • To check on the correctness of both mathematical expression and implementation • For the variational mixture of Gaussians, the lower bound is given by
  • 48. 10.2.3 Predictive density 10.2 Variational Mixture of Gaussians 48 • Predictive density , for a new value with corresponding latent variable • Depends on the posterior distribution of parameters • As the posterior distribution is intractable the variational approximation can be used to obtain an approximate predictive density P(ˆx|X) ˆx ˆz (10.78) q(⇡)q(µ, ⇤) variational approximation
  • 49. 10.2.4 Number of components 10.2 Variational Mixture of Gaussians 49 • For a given mixture model with K components, each parameter setting is a member of a family of K! equivalent settings. Figure 10.7: Plot of the variational lower bound L versus the number K of components in the Gaussian mixture model, for the Old Faithful data, showing a distinct peak at K = 2 components. For each value of K, the model is trained from 100 different random starts, and the results are shown as '+' symbols plotted with small random horizontal perturbations so that they can be distinguished. Note that some solutions find suboptimal local maxima, but that this happens infrequently. • Alternatively, start with a relatively large value of K and let components with insufficient contribution be pruned out: their mixing coefficients are driven to zero.
  • 50. 10.2.5 Induced factorizations 10.2 Variational Mixture of Gaussians 50 Induced factorizations arise from an interaction between the factorization assumption in the variational posterior and the conditional independence properties of the true posterior. • For example, let A, B, C be disjoint groups of latent variables. • Factorization assumption: q(A, B, C) = q(A, B) q(C). • The optimal solution is $\ln q^*(A, B) = \mathbb{E}_C[\ln p(A, B|X, C)] + \text{const}$. • We need to determine whether $q^*(A, B)$ factorizes further into q(A) q(B); this happens iff A and B are conditionally independent given X and C in the true posterior, which can also be determined by d-separation on the directed graphical model.
  • 51. Progress… Variational Inference 51 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression 10.7. Expectation Propagation 10.3. Variational Linear Regression
  • 52. 10.3 Variational Linear Regression 10.3 Variational Linear Regression 52 • Return to the Bayesian linear regression model (section 3.3), with predicted value t, training target values, and the input x omitted from the conditioning for brevity. • There, the integration over the hyper-parameters α, β was approximated by point estimates $\hat{\alpha}, \hat{\beta} = \mathrm{argmax}_{\alpha,\beta}\, p(t|\alpha, \beta)$ obtained by maximizing the log marginal likelihood (with flat hyper-priors this coincides with maximum likelihood estimation of the evidence).
  • 53. 10.3 Variational Linear Regression 10.3 Variational Linear Regression 53 • A fully Bayesian approach would integrate over the hyper-parameters as well as over the parameters (this section). • Suppose that the noise precision parameter β is known, and write $\phi_n = \phi(x_n)$. • The joint distribution of all the variables is $p(t, w, \alpha) = p(t|w)\,p(w|\alpha)\,p(\alpha)$ (10.87)-(10.90), with priors $p(\alpha) = \mathrm{Gam}(\alpha|a_0, b_0) \propto \alpha^{a_0 - 1}\exp(-b_0\alpha)$ and $p(w|\alpha) = \frac{1}{(2\pi)^{M/2}}\frac{1}{|\Sigma|^{1/2}}\exp\!\left\{-\tfrac{1}{2}w^T\Sigma^{-1}w\right\}$ where $\Sigma = \alpha^{-1}I$.
  • 54. 10.3.1 Variational Distribution 10.3 Variational Linear Regression 54 • Goal: find an approximation to the posterior distribution $p(w, \alpha|t)$. • Factorized approximation: $q(w, \alpha) = q(w)\,q(\alpha)$ (10.91). • Optimal solution (from 10.9, $\ln q^*_j(Z_j) = \mathbb{E}_{i \neq j}[\ln p(X, Z)] + \text{const}$): $\ln q^*(\alpha) = \mathbb{E}_w[\ln(p(t|w)\,p(w|\alpha)\,p(\alpha))] + \text{const} = \ln p(\alpha) + \mathbb{E}_w[\ln p(w|\alpha)] + \text{const} = (a_0 - 1)\ln\alpha - b_0\alpha + \frac{M}{2}\ln\alpha - \frac{\alpha}{2}\mathbb{E}_w[w^T w] + \text{const}$ (10.92), so $q^*(\alpha) = \mathrm{Gam}(\alpha|a_N, b_N)$ with $a_N = a_0 + \frac{M}{2}$ and $b_N = b_0 + \frac{1}{2}\mathbb{E}_w[w^T w]$ (10.93)-(10.95), where M is the number of fitting parameters $w_i$, i.e. the input dimension.
  • 55. 10.3.1 Variational Distribution 10.3 Variational Linear Regression 55 • Similar optimal solution of q(w) quadratic form of w q⇤ (w) = N(w|mN , SN ) then (10.99) where (10.96) (10.97) (10.98)
  • 56. 10.3.1 Variational Distribution 10.3 Variational Linear Regression 56 • Summary of the optimal solutions: $q^*(\alpha) = \mathrm{Gam}(\alpha|a_N, b_N)$ with $a_N$, $b_N$ given by (10.94)-(10.95), and $q^*(w) = \mathcal{N}(w|m_N, S_N)$ with $m_N$, $S_N$ given by (10.100)-(10.101). The required moments of these distributions follow from (10.102)-(10.103).
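A minimal sketch of these coupled updates, assuming the noise precision β is known and using made-up data and prior values; the update formulas follow (10.93)-(10.103) as I read them, so treat this as an illustration rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, beta = 30, 4, 25.0
Phi = rng.normal(size=(N, M))                    # design matrix phi(x_n), assumed
w_true = np.array([0.5, -1.0, 0.3, 2.0])
t = Phi @ w_true + rng.normal(scale=beta**-0.5, size=N)

a0, b0 = 1e-2, 1e-2                              # Gamma prior on alpha (assumed)
E_alpha = a0 / b0
for _ in range(100):
    # q*(w) = N(w | m_N, S_N)
    S_N = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    # q*(alpha) = Gam(alpha | a_N, b_N), using E[w^T w] = m_N^T m_N + Tr(S_N)
    a_N = a0 + M / 2
    b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
    E_alpha = a_N / b_N

print(m_N)           # close to w_true
print(E_alpha)       # expected weight precision under q(alpha)
```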
  • 57. More about variational linear regression 10.3 Variational Linear Regression 57 • Predictive distribution over t, given a new input x • Lower bound (10.105) (10.107)
  • 58. More about variational linear regression 10.3 Variational Linear Regression 58 Lower bound Order M for a polynomial model Figure 10.9 The lower bound versus the order M of the polynomial model, in which a set of 10 data points is generated from a polynomial with M=3 sampled over (-5, 5) with additive Gaussian noise of var=0.09. The value of the bounds gives the log probability of the model Peak at M = 3, corresponding to the true model
  • 59. Progress… Variational Inference 59 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.3. Variational Linear Regression 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression Part I: Probabilistic modeling and 
 the variational principle Now: Design the 
 variational algorithms 10.7. Expectation Propagation
  • 60. Progress… Variational Inference 60 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression 10.7. Expectation Propagation 10.3. Variational Linear Regression
  • 61. 10.4 Exponential Family Distributions 10.4 Exponential Family Distributions 61 • For many of the models in this book, the complete data likelihood is drawn from the exponential family • In general, this will not be the case for the marginal likelihood function for the observed data. Ex: in a mixture of Gaussians.
  • 62. 10.4 Exponential Family Distributions 10.4 Exponential Family Distributions 62 • Observed data • Latent variables X = {x1, ..., xN } Z = {z1, ..., zN } • Suppose that the joint distribution is a member of the exponential family where the conjugate prior for ⌘ (10.113) p(⌘|⌫0, 0) = f(⌫0, 0)g(⌘)⌫0 exp{⌫0⌘T 0} (prior number of observations all having the value for the u vector)⌫0 0 (10.114) • Variational distributions q(Z, ⌘) = q(Z)q(⌘)
  • 63. 10.4 Exponential Family Distributions 10.4 Exponential Family Distributions 63 ln q⇤ j (Zj) = Ei6=j[ln p(X, Z)] + const• Optimal solution (from 10.9) = E⌘[ln p(X, Z|⌘)] + const = NX n=1 ln h(xn, zn) + E⌘[⌘T ]u(xn, zn) + const sum of independent things Induced factorization q⇤ (Z) = Y n q⇤ (zn) (10.115) where (10.116) ln q⇤ (Z) = E⌘[ln p(X, Z, ⌘)] = E⌘[ln p(X, Z|⌘)p(⌘|⌫0, 0)p(⌫0, 0)] q⇤ (zn) = h(xn, zn)g(E⌘[⌘]) exp{E⌘[⌘T ]u(xn, zn)}
  • 64. 10.4 Exponential Family Distributions 10.4 Exponential Family Distributions 64 ln q⇤ j (Zj) = Ei6=j[ln p(X, Z)] + const• Optimal solution (from 10.9) ln q⇤ (⌘) = EZ[ln p(X, Z, ⌘)] = EZ[ln p(X, Z|⌘)p(⌘|⌫0, 0)p(⌫0, 0)] = ln p(⌘|⌫0, 0) + EZ[ln p(X, Z|⌘)] + const = ⌫0 ln g(⌘) + ⌫0⌘T 0 + NX n=1 ln g(⌘) + ⌘T Ezn [u(xn, zn)] + const (10.118) q⇤ (⌘) = f(⌫N , N )g(⌘)⌫N exp{⌫N ⌘T N } then where (10.119) (10.120) (10.121)
  • 65. Variational message passing 10.4 Exponential Family Distributions 65 • The joint distribution corresponding to a directed graph then the optimal solution is thus the update of the factors in the variational posterior distribution represents a local calculation on the graph p(x) = Y i p(xi|pai) parent set corresponding to node ivariable(s) associated with node i (latent or observed) • Variational approximation q(x) = Y i qi(xi) Markov blanket
  • 66. Variational message passing 10.4 Exponential Family Distributions 66 • If all the conditional distributions have a conjugate-exponential structure, then the variational update will be: • The distribution associated with a particular node can be updated once that node has received messages from all of its parents and all of its children. • It requires that the children have already received messages from their co- parents
  • 67. Progress… Variational Inference 67 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression 10.7. Expectation Propagation 10.3. Variational Linear Regression
  • 68. 10.5 Local Variational Methods 10.5 Local Variational Methods 68 • Global methods: approximation to the full posterior • Local methods: approximation to individual or groups of variables • Replace the likelihood with a simpler form - lower bound that makes the expectation easy to compute
  • 69. Convex duality 10.5 Local Variational Methods 69 Convex function f(x) One of lower bound but not the best The line is moved vertically to a tangent Convex function f(x) Concave function f(x) g(⌘) = max x {⌘x f(x)} f(x) = max ⌘ {⌘x g(⌘)} f(x) = min ⌘ {⌘x g(⌘)} g(⌘) = min x {⌘x f(x)}(10.130) (10.131) (10.133) (10.132)
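A quick numerical check of these duality relations for a simple convex function; f(x) = x² and its conjugate g(η) = η²/4 are chosen purely for convenience (an illustration, not from the slides).

```python
import numpy as np

# Convex duality: g(eta) = max_x {eta*x - f(x)} and f(x) = max_eta {eta*x - g(eta)}.
f = lambda x: x ** 2
g = lambda eta: eta ** 2 / 4.0        # conjugate of f(x) = x^2

etas = np.linspace(-10, 10, 2001)
x = 1.7
print(np.max(etas * x - g(etas)), f(x))   # both ~2.89; the maximum is attained at eta = 2x,
                                          # i.e. the tangent line touching f at x
```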
  • 70. 10.5 Local Variational Methods 10.5 Local Variational Methods 70 Original problem: the logistic sigmoid $p(y = 1|x) = \frac{1}{1 + \exp(-x)} = \sigma(x)$. Since $f(x) = \ln\sigma(x)$ is a concave function, consider $g(\eta) = \min_x\{\eta x - f(x)\} = -\eta\ln\eta - (1 - \eta)\ln(1 - \eta)$ (10.135)-(10.136). This gives the upper bound $\ln\sigma(x) \le \eta x - g(\eta)$, i.e. $\sigma(x) \le \exp(\eta x - g(\eta))$ (10.137).
  • 71. 10.5 Local Variational Methods 10.5 Local Variational Methods 71 • Gaussian lower bound on the logistic sigmoid (Jaakkola and Jordan, 2000). Write $\ln\sigma(x) = \frac{x}{2} - \ln(e^{x/2} + e^{-x/2})$ and note that $f(x) = -\ln(e^{x/2} + e^{-x/2})$ is a convex function of the variable $x^2$, so consider $g(\eta) = \max_{x^2}\{\eta x^2 - f(\sqrt{x^2})\}$. The stationarity condition (10.139) gives, at the point $x = \xi$, $\eta = \frac{df}{dx^2} = -\frac{1}{4\xi}\tanh\!\left(\frac{\xi}{2}\right) = -\lambda(\xi)$, where $\lambda(\xi) = \frac{1}{2\xi}\left[\sigma(\xi) - \frac{1}{2}\right]$. The resulting bound on f(x) is $f(x) \ge -\lambda(\xi)x^2 + \lambda(\xi)\xi^2 + f(\xi)$, and the bound on the sigmoid is $\sigma(x) \ge \sigma(\xi)\exp\{(x - \xi)/2 - \lambda(\xi)(x^2 - \xi^2)\}$ (10.144).
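A small sketch that evaluates λ(ξ) and verifies numerically that (10.144) really is a lower bound, tight at x = ±ξ; the values of x and ξ below are assumed for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # lambda(xi) = (1 / (2*xi)) * (sigma(xi) - 1/2)
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def sigmoid_lower_bound(x, xi):
    # sigma(x) >= sigma(xi) * exp{ (x - xi)/2 - lambda(xi) * (x^2 - xi^2) }   -- (10.144)
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x ** 2 - xi ** 2))

x = np.linspace(-6, 6, 13)
xi = 2.5
print(np.all(sigmoid_lower_bound(x, xi) <= sigmoid(x) + 1e-12))  # True: it is a lower bound
print(sigmoid_lower_bound(xi, xi), sigmoid(xi))                  # equal (tangent) at x = xi
```

Because the exponent is only linear and quadratic in x, this bound is exactly the "Gaussian-friendly" form promised on the next slides: expectations of the bounded likelihood against a Gaussian q(w) stay tractable.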
  • 72. How the bounds can be used 10.5 Local Variational Methods 72 • Evaluate I = Z (a)p(a)da • The local variational bound (a) f(a, ⇠) • The variational bound (intractable) I Z f(a, ⇠)p(a)da = F(⇠) ⇠ is additional parameter (depends on a)where Finding the compromise to maximize F(⇠)⇠⇤
  • 73. In Reviews… 10.5 Local Variational Methods 73 Original Problem Local Bound Bound with only linear or quadratic terms: expectations, especially against a Gaussian, are easy to compute. p(y = 1|x) = 1 1 + exp( x) = (x) (x) (⇠) exp{(x ⇠)/2 (⇠)(x2 ⇠2 )} ⇠ is additional parameterwhere
  • 74. Progress… Variational Inference 74 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression 10.7. Expectation Propagation 10.3. Variational Linear Regression
  • 75. 10.6 Variational Logistic Regression 10.6 Variational Logistic Regression 75 Return to the Bayesian logistic regression model (section 4.5). • The posterior distribution is $p(w|t) \propto p(w)\,p(t|w)$, where the prior distribution is $p(w) = \mathcal{N}(w|w_0, S_0)$ and the likelihood function is $p(t|w) = \prod_{n=1}^{N} y_n^{t_n}\{1 - y_n\}^{1 - t_n}$ with $y_n = \sigma(w^T\phi_n)$. • In chapter 4 the posterior was maximized to give $w_{MAP}$, yielding the Gaussian (Laplace) approximation $q(w) = \mathcal{N}(w|w_{MAP}, S_N)$.
  • 76. 10.6.1 Variational posterior distribution 10.6 Variational Logistic Regression 76 A practical example of local variational method • Recap of variational framework: maximize a lower bound on the marginal likelihood For the Bayesian logistic regression model, the marginal likelihood is: The conditional distribution for t (10.147) (10.148) where
  • 77. 10.6.1 Variational posterior distribution 10.6 Variational Logistic Regression 77 • Lower bound on the logistic sigmoid function where • We can therefore write • Bound on the joint distribution of t and w where (10.149) (10.151) (10.152) (10.153) where and each training set observation corresponds with( n, tn) ⇠n
  • 78. 10.6.1 Variational posterior distribution 10.6 Variational Logistic Regression 78 • Lower bound on the log of the joint distribution of t and w • Hypothesis for the prior p(w): Gaussian with parameters m0 and S0 considered as fixed Then, the right side of (10.154) becomes as function of w (10.154) (10.155)
  • 79. 10.6.1 Variational posterior distribution 10.6 Variational Logistic Regression 79 • Quantity of interest: exact posterior distribution, requires normalization of the left side in (10.152) but usually intractable • Work instead with the right side of (10.155): a quadratic function of w which is a lower bound of p(w, t) • A Gaussian variational posterior of the form where (10.156) (10.157) (10.158)
  • 80. 10.6.1 Optimizing the variational parameters 10.6 Variational Logistic Regression 80 • Determine the variational parameters by maximizing the lower bound of the marginal likelihood Two approaches Substitute (10.152) back into the marginal likelihood (10.159) • (1) View w as a latent variable and use the EM algorithm • (2) Compute and maximize directly using the fact that p(w) is Gaussian and is a quadratic function of w Re-estimation equations (10.164) (10.163)
  • 81. 10.6.1 Optimizing the variational parameters 10.6 Variational Logistic Regression 81 • Invoking the EM algorithm: 1. Initialize the variational parameters $\xi^{old}$. 2. E step: use $\xi^{old}$ to calculate the Gaussian posterior distribution over w. 3. M step: maximize the expected complete-data log likelihood $Q(\xi, \xi^{old}) = \mathbb{E}_{q(w)}[\ln h(w, \xi)\,p(w)]$ (10.160); solving the stationarity condition (10.162) then gives the re-estimation equation (10.163) for the $\xi_n$.
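A compact sketch of the resulting loop for variational Bayesian logistic regression. The synthetic data and the prior N(0, 10 I) are assumptions made for illustration, and the updates follow (10.156), (10.157) and (10.163) as stated in PRML; treat this as a sketch rather than the author's reference code.

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def lam(xi):    return (sigmoid(xi) - 0.5) / (2.0 * xi)

rng = np.random.default_rng(3)
N, M = 200, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])   # features with a bias column
w_true = np.array([-0.5, 2.0, -1.0])
t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)

m0 = np.zeros(M)
S0_inv = np.eye(M) / 10.0            # prior precision (assumed)
xi = np.ones(N)                      # initial variational parameters
for _ in range(50):
    # variational Gaussian posterior q(w) = N(w | m_N, S_N)       -- (10.156)-(10.157)
    S_N_inv = S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = S_N @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
    # re-estimate xi_n^2 = phi_n^T (S_N + m_N m_N^T) phi_n        -- (10.163)
    xi = np.sqrt(np.einsum("ni,ij,nj->n", Phi, S_N + np.outer(m_N, m_N), Phi))

print(m_N)   # approximate posterior mean of w, close to w_true with enough data
```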
  • 83. 10.6.3 Inference of hyper parameters 10.6 Variational Logistic Regression 83 • Allow hyper parameters to be inferred from dataset • Consider simple Gaussian prior form • Consider conjugate hyper prior given by a gamma distribution • The marginal likelihood where the joint distribution (10.168) (10.167) (10.166) (10.165)
  • 84. 10.6.3 Inference of hyper parameters 10.6 Variational Logistic Regression 84 Combine global and local approach • (1) Global approach: consider a variational distribution and apply the standard decomposition • (2) The lower bound is intractable so apply the local approach as before to get a lower bound on and on L(q) L(q) ln p(t) • (3) Assume that q is factorized as (10.169) (10.172)
  • 85. 10.6.3 Inference of hyper parameters 10.6 Variational Logistic Regression 85 • It follows (quadratic function of w) where (10.174) (10.175) (10.176) • From (10.153) and (10.165) (10.153) (10.165)
  • 86. 10.6.3 Inference of hyper parameters 10.6 Variational Logistic Regression 86 • Similarly for $q(\alpha)$, from (10.165) and (10.166): we have $p(\alpha) = \mathrm{Gam}(\alpha|a_0, b_0) \propto \alpha^{a_0 - 1}\exp(-b_0\alpha)$, and $q(\alpha) = \mathrm{Gam}(\alpha|a_N, b_N) = \frac{1}{\Gamma(a_N)}\,b_N^{a_N}\,\alpha^{a_N - 1} e^{-b_N\alpha}$ (10.177)-(10.179).
  • 87. 10.6.3 Inference of hyper parameters 10.6 Variational Logistic Regression 87 • The variational parameters are obtained by maximizing the lower bound (10.180) (10.181) (10.183) (10.182) • Re-estimation equations where Q(⇠, ⇠old ) = Eq(w) [ln h(w, ⇠)p(w)] (10.160) as we done before with
  • 88. Progress… Variational Inference 88 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression 10.7. Expectation Propagation 10.3. Variational Linear Regression
  • 89. Expectation Propagation (Minka, 2001) 10.7 Expectation Propagation 89 • An alternative form of deterministic approximate inference, based on the reverse KL divergence KL(p||q) (instead of KL(q||p)), where p is the complex distribution: $\mathrm{KL}(q\,\|\,p) = \int q(z)\ln\frac{q(z)}{p(z)}\,dz$, $\mathrm{KL}(p\,\|\,q) = \int p(z)\ln\frac{p(z)}{q(z)}\,dz$. • Consider a fixed distribution p(z) and a member of the exponential family $q(z) = h(z)\,g(\eta)\exp\{\eta^T u(z)\}$ (10.184). • The Kullback-Leibler divergence as a function of η is $\mathrm{KL}(p\,\|\,q) = -\ln g(\eta) - \eta^T\mathbb{E}_{p(z)}[u(z)] + \text{const}$ (10.185)-(10.186); minimize it by setting the gradient with respect to η to zero.
  • 90. Expectation Propagation 10.7 Expectation Propagation 90 From the normalization of the exponential-family member (10.184), $\int h(z)\,g(\eta)\exp\{\eta^T u(z)\}\,dz = 1$; taking the gradient of both sides, $\nabla g(\eta)\int h(z)\exp\{\eta^T u(z)\}\,dz + g(\eta)\int h(z)\exp\{\eta^T u(z)\}\,u(z)\,dz = 0$, hence $-\frac{1}{g(\eta)}\nabla g(\eta) = g(\eta)\int h(z)\exp\{\eta^T u(z)\}\,u(z)\,dz = \mathbb{E}_{q(z)}[u(z)]$, i.e. $-\nabla\ln g(\eta) = \mathbb{E}_{q(z)}[u(z)]$ (10.187). Combining this with (10.186), the minimum of KL(p||q) is attained by moment matching: set the expected sufficient statistics of q(z), e.g. its mean and covariance, equal to those of p(z).
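Moment matching is easy to see numerically: for a Gaussian q, the KL(p||q)-optimal parameters are simply the mean and variance of p, however complicated p is. A tiny illustration with a made-up bimodal p (not from the slides):

```python
import numpy as np

# For q in the exponential family (here a Gaussian), minimizing KL(p||q) sets
# E_q[u(z)] = E_p[u(z)], i.e. q matches the mean and variance of p.
rng = np.random.default_rng(4)
z = np.concatenate([rng.normal(-2.0, 0.5, 5000),
                    rng.normal(3.0, 1.0, 5000)])   # samples from a bimodal p

mu_q, var_q = z.mean(), z.var()        # the KL(p||q)-optimal Gaussian parameters
print(mu_q, var_q)                     # a broad Gaussian covering both modes (mass-covering)
```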
  • 91. Expectation Propagation 10.7 Expectation Propagation 91 • Assume the joint distribution of the data D and the hidden variables and parameters θ is given by a product of factors (10.188). • The quantities of interest are the posterior distribution and the model evidence (10.189)-(10.190).
  • 92. Expectation Propagation 10.7 Expectation Propagation 92 • Expectation propagation is based on an approximation to the posterior distribution which is also given by a product of factors, where each factor comes from the exponential family. • Ideally, we would like to determine the factors by minimizing the KL divergence between the true posterior and the approximation. • Minimizing the KL divergence between each pair of factors independently is simpler, but the resulting product is usually a poor approximation. • EP: optimize each factor in turn using the current values of the remaining factors (this works well for logistic-type models but poorly for mixture-type models, due to multi-modality).
  • 95. The clutter problem 10.7 Expectation Propagation 95
  • 96. The clutter problem 10.7 Expectation Propagation 96 • The observations follow a mixture of Gaussians in which w is the proportion of the background cluster and is assumed to be known; the prior over θ is taken to be Gaussian. • The joint distribution of the N observations and θ is given by the product of these factors. • To apply EP, first identify the factors, then choose the exponential family. • The factor approximations take the form of exponentials of quadratic functions (10.209)-(10.213).
  • 97. The clutter problem 10.7 Expectation Propagation 97
  • 98. The clutter problem 10.7 Expectation Propagation 98 • Evaluate the approximation to the model evidence where (10.223) (10.224)
  • 99. Expectation Propagation on graphs 10.7 Expectation Propagation 99 • The factors are not function of all variables. If the approximating distribution is fully factorized, EP reduces to Loopy Belief Propagation • We seek an approximation q(x) that has the same factorization (10.225) (10.226)
  • 100. Expectation Propagation on graphs 10.7 Expectation Propagation 100 • Restrict to approximations in which the factors factorize with respect to the individual variables so that (10.227)
  • 101. Expectation Propagation on graphs 10.7 Expectation Propagation 101 • Suppose all the factors are initialized and we choose to refine factor • Minimizing the reverse KL when q factorizes, leads to an optimal solution q where factors are the marginals of p
  • 102. Expectation Propagation on graphs 10.7 Expectation Propagation 102
  • 103. Standard belief propagation 10.7 Expectation Propagation 103
  • 104. Standard belief propagation 10.7 Expectation Propagation 104
  • 105. Standard belief propagation 10.7 Expectation Propagation 105
  • 106. Standard belief propagation 10.7 Expectation Propagation 106 Same in the chapter 8
  • 107. Expectation propagation 10.7 Expectation Propagation 107 • Sum-product BP arises as a special case of EP when a fully factorized approximating distribution is used. • EP can be seen as a way to generalize this: group factors and update them together, using a partially connected graph. • An open question remains: how to choose the best grouping and disconnection? • Summary: EP and variational message passing correspond to the optimization of two different KL divergences. • Minka 2005 gives a more general point of view using the family of alpha-divergences, which includes both KL and reverse KL as well as other divergences such as the Hellinger distance, the Chi2-distance, and so on. • He shows that by choosing to optimize one or the other of these divergences, you can derive a broad range of message passing algorithms, including variational message passing, loopy BP, EP, tree-reweighted BP, fractional BP and power EP.
  • 108. Summary Variational Inference 108 10.1. Variational Inference 10.2. Variational Mixture of Gaussians 10.3. Variational Linear Regression 10.4. Exponential Family Distributions 10.5. Local Variational Methods 10.6. Variational Logistic Regression Part I: Probabilistic modeling and 
 the variational principle Part II: Design the 
 variational algorithms 10.7. Expectation Propagation