Approximate Inference (Chapter 10, PRML Reading)

C.M. Bishop’s PRML
Chapter 10: Approximate Inference
Tran Quoc Hoan
@k09hthaduonght.wordpress.com/
13 December 2015, PRML Reading, Hasegawa lab., Tokyo
The University of Tokyo

Excuse me…
[Figure: concentration ability of the speaker and of the audience, plotted against section]
Should we take a break? (or next time)

Outline
10.1. Variational Inference
10.2. Variational Mixture of Gaussians
10.3. Variational Linear Regression
10.4. Exponential Family Distributions
10.5. Local Variational Methods
10.6. Variational Logistic Regression
10.7. Expectation Propagation
Part I: Probabilistic modeling and the variational principle
Part II: Design the variational algorithms

10.1 Variational Inference
Probabilistic Inference
• Statistical inference: any mechanism by which we deduce the probabilities in our model based on data.
• In probabilistic models, we need to reason about the probabilities of events.
• Inference links the observed data with our statistical assumptions and allows us to ask questions of our data: predictions, visualization, model selection.

Modeling and Inference
Bayes’ rule underlies many inferential problems.
Probabilistic modeling will involve:
• Decide on a priori beliefs.
• Posit an explanation of how the observed data is generated, i.e. provide a probabilistic description.
Posterior = Likelihood × Prior / Marginal likelihood (model evidence):
$$p(z|x) = \frac{p(x|z)\,p(z)}{\int p(x, z)\,dz}$$
where x is the observed data and z is the hidden variable.

Modeling and Inference
Most inference problems will be one of:
• Marginalization
• Expectation
• Prediction
Posterior = Likelihood × Prior / Marginal likelihood (model evidence):
$$p(z|x) = \frac{p(x|z)\,p(z)}{\int p(x, z)\,dz}$$
The posterior usually has a complex form for which the expectations are not tractable.

Importance Sampling
Basic idea: transform the integral into an expectation over a simple, known distribution q(z).
[Figure: p(z), f(z) and the proposal q(z) plotted against z]
Conditions:
• q(z) > 0 when f(z)p(z) ≠ 0
• Easy to sample from q(z)
Integral:
$$E[f] = \int f(z)\,p(z)\,dz = \int f(z)\,\frac{p(z)}{q(z)}\,q(z)\,dz$$
Note: the dependence on x is omitted in the formulas.
Proposal: $z^{(s)} \sim q(z)$
Importance weight: $w^{(s)} = p(z^{(s)}) / q(z^{(s)})$
Monte Carlo: $E[f] \approx \frac{1}{S}\sum_s w^{(s)} f(z^{(s)})$
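The estimator above can be sketched in a few lines of NumPy. This is only an illustration under assumed choices that are not on the slide: the target p(z) is a standard normal, the proposal q(z) is a wider normal with standard deviation 2, and f(z) = z², so the exact answer is E[f] = 1. The helper names `normal_pdf` and `importance_sampling` are hypothetical.

```python
import numpy as np

def normal_pdf(z, mean, std):
    """Density of a univariate Gaussian N(z | mean, std^2)."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_sampling(f, p_pdf, q_pdf, q_sample, num_samples, rng):
    """Estimate E_p[f] with samples drawn from the proposal q."""
    z = q_sample(num_samples, rng)   # z^(s) ~ q(z)
    w = p_pdf(z) / q_pdf(z)          # importance weights w^(s) = p(z^(s)) / q(z^(s))
    return np.mean(w * f(z))         # (1/S) * sum_s w^(s) f(z^(s))

rng = np.random.default_rng(0)
estimate = importance_sampling(
    f=lambda z: z ** 2,
    p_pdf=lambda z: normal_pdf(z, 0.0, 1.0),           # assumed target p(z)
    q_pdf=lambda z: normal_pdf(z, 0.0, 2.0),           # assumed proposal q(z)
    q_sample=lambda n, r: r.normal(0.0, 2.0, size=n),
    num_samples=200_000,
    rng=rng,
)
print(estimate)  # should be close to the exact value E[z^2] = 1
```

The proposal is deliberately wider than the target so that q(z) > 0 wherever f(z)p(z) ≠ 0 and the weights stay bounded.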

Importance Sampling
[Figure: p(z), f(z) and the proposal q(z) plotted against z]
$$E[f] \approx \frac{1}{S}\sum_s w^{(s)} f(z^{(s)}), \quad w^{(s)} = \frac{p(z^{(s)})}{q(z^{(s)})}, \quad z^{(s)} \sim q(z)$$
Properties:
• Unbiased estimate of the expectation
• No independent samples from the posterior distribution
• Many draws from the proposal are needed, especially in high dimensions
This is a stochastic approximation (Chapter 11).

Importance Sampling
[Figure: p(z), f(z) and the proposal q(z) plotted against z]
$$E[f] \approx \frac{1}{S}\sum_s w^{(s)} f(z^{(s)}), \quad w^{(s)} = \frac{p(z^{(s)})}{q(z^{(s)})}, \quad z^{(s)} \sim q(z)$$
Take inspiration from importance sampling, but instead:
• Obtain a deterministic algorithm
• Scale up to high-dimensional and large data problems
• Easy convergence assessment
This leads to variational inference.

What is a Variational Method?
Variational Principle
General family of methods for approximating complicated
densities by a simpler class of densities
Approximation class
True posterior
Deterministic approximation procedures
with bounds on probabilities of interest
Fit the variational parameters

Variational Calculus
Functions:
• Variables as input, output is a value
• Full and partial derivatives df/dx
• Ex. Maximize the likelihood p(x|µ) w.r.t. parameters µ
Functionals:
• Functions as input, output is a value
• Functional derivatives δF/δf
• Ex. Maximize the entropy H[p(x)] w.r.t. p(x)
Both types of derivatives are exploited in variational inference.
The variational method derives from the calculus of variations.

From IS to Variational Inference
Integral:
$$\ln p(X) = \ln \int p(X|Z)\,p(Z)\,dZ$$
Importance weight:
$$\ln p(X) = \ln \int p(X|Z)\,\frac{p(Z)}{q(Z)}\,q(Z)\,dZ$$
Jensen’s inequality:
$$\ln \int p(x)\,g(x)\,dx \ge \int p(x) \ln g(x)\,dx$$
Variational (evidence) lower bound:
$$\ln p(X) \ge \int q(Z) \ln\left(p(X|Z)\,\frac{p(Z)}{q(Z)}\right) dZ$$
$$= \int q(Z) \ln p(X|Z)\,dZ - \int q(Z) \ln \frac{q(Z)}{p(Z)}\,dZ$$
$$= E_{q(Z)}[\ln p(X|Z)] - KL[q(Z)\|p(Z)]$$

Variational Lower Bound
$$F(X, q) = E_{q(Z)}[\ln p(X|Z)] - KL[q(Z)\|p(Z)]$$
The first term is the reconstruction term, the second is the penalty, and q(Z) is the approximate posterior.
Interpreting the bound:
• Approximate posterior distribution q(Z): best match to the true posterior p(Z|X), one of the unknown inferential quantities of interest to us.
• Reconstruction cost: the expected log-likelihood measures how well samples from q(Z) are able to explain the data X.
• Penalty: ensures that the explanation of the data q(Z) doesn’t deviate too far from your beliefs p(Z).

How low is the VLB (ELBO)?
Some comments on q:
• Variational parameters: parameters of q(Z) (e.g. for a Gaussian, the mean and variance).
• Integration switched to optimization: optimize q(Z) directly (presenter’s note: it is really q(Z|X)) to minimize KL[q(Z)||p(Z|X)].
The gap between the log evidence and the bound:
$$\ln p(X) - F(X, q) = \int q(Z) \ln p(X)\,dZ - F(X, q)$$
$$= \int q(Z) \ln p(X)\,dZ - \int q(Z) \ln p(X|Z)\,dZ + \int q(Z) \ln \frac{q(Z)}{p(Z)}\,dZ$$
$$= \int q(Z) \ln \frac{p(X)\,q(Z)}{p(X|Z)\,p(Z)}\,dZ = \int q(Z) \ln \frac{q(Z)}{p(Z|X)}\,dZ$$
$$= KL[q(Z)\|p(Z|X)]$$
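The identity ln p(X) − F(X, q) = KL[q(Z)||p(Z|X)] can be checked numerically on a tiny discrete model. The sketch below is illustrative only: the three-state latent variable and all the probability values are made-up numbers, not taken from the slides.

```python
import numpy as np

# Hypothetical toy model: latent Z takes 3 values, X is one fixed observation.
prior = np.array([0.5, 0.3, 0.2])           # p(Z)
likelihood = np.array([0.10, 0.40, 0.70])   # p(X | Z) evaluated at the observed X
q = np.array([0.2, 0.3, 0.5])               # an arbitrary approximation q(Z)

evidence = np.sum(likelihood * prior)       # p(X) = sum_Z p(X|Z) p(Z)
posterior = likelihood * prior / evidence   # p(Z | X) by Bayes' rule

# F(X, q) = E_q[ln p(X|Z)] - KL[q(Z) || p(Z)]
elbo = np.sum(q * np.log(likelihood)) - np.sum(q * np.log(q / prior))
# KL[q(Z) || p(Z|X)]
kl_to_posterior = np.sum(q * np.log(q / posterior))

print(np.log(evidence), elbo, kl_to_posterior)
```

Because the KL term is non-negative, the bound is tight exactly when q(Z) equals the posterior.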

From the book
$$\ln p(X) = L(q) + KL(q\|p) \quad (10.2)$$
$$L(q) = F(X, q) = \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ \quad (10.3)$$
$$KL(q\|p) = -\int q(Z) \ln\left\{\frac{p(Z|X)}{q(Z)}\right\} dZ \quad (10.4)$$
The maximum of L(q) occurs when q(Z) = p(Z|X).
When the true posterior is intractable, approximate the maximum by a variational method.
Open questions:
• What exactly is q(Z)?
• How do we find the variational parameters?
• How do we compute the gradients?
• How do we optimize the model parameters?

Free-form and Fixed-form
Free-form variational method: solves for the exact distribution by setting the functional derivative to zero,
$$\frac{\delta L(q)}{\delta q(Z)} = 0 \quad \text{s.t.} \quad \int q(Z)\,dZ = 1$$
The optimal solution is the true posterior distribution, $q(Z) \propto p(Z)\,p(X|Z, \theta)$, but solving for the normalization is the original problem.
Fixed-form variational method: specifies an explicit form of the q-distribution,
$$q_\phi(Z) = f(Z; \phi)$$
where φ is the variational parameter. Ideally f is a rich class of distributions.

10.1.1 Factorized distributions (I)
• Mean-field methods assume that the distribution is factorized.
Restricted class of approximations: every dimension (or subset of dimensions) of the posterior is independent.
• Let Z be partitioned into disjoint groups Z_i (i = 1…M):
$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$
There is no restriction on the functional form of q_i(Z_i).

Factorized distributions (II)
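$$L(q) = \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ$$
$$= \int q(Z) \{\ln p(X, Z) - \ln q(Z)\}\,dZ$$
$$= \int \left(\prod_i q_i(Z_i)\right) \left\{\ln p(X, Z) - \sum_i \ln q_i(Z_i)\right\} dZ$$
$$= \int \left(\prod_i q_i\right) \ln p(X, Z)\,dZ - \int \left(\prod_i q_i\right) \left(\sum_i \ln q_i\right) dZ$$
where q_i abbreviates q_i(Z_i).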

Factorized distributions (III)
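• Consider the dependence on a single factor q_j(Z_j):
$$L(q) = \int \left(\prod_i q_i\right) \ln p(X, Z)\,dZ - \int \left(\prod_i q_i\right) \left(\sum_i \ln q_i\right) dZ$$
$$= \int q_j \left\{\int \ln p(X, Z) \prod_{i \ne j} q_i\,dZ_i\right\} dZ_j - \int \left(\prod_i q_i\right) \left\{\ln q_j + \sum_{i \ne j} \ln q_i\right\} dZ$$
$$= \int q_j \left\{\int \ln p(X, Z) \prod_{i \ne j} q_i\,dZ_i\right\} dZ_j - \int \left(\prod_i q_i\right) \ln q_j\,dZ - \int \left(\prod_{i \ne j} q_i\right) \left(\sum_{i \ne j} \ln q_i\right) \left(\int q_j\,dZ_j\right) dZ_{i \ne j}$$
$$= \int q_j \left(\ln \tilde{p}(X, Z_j) - \text{const}\right) dZ_j - \int q_j \ln q_j\,dZ_j - \int \left(\prod_{i \ne j} q_i\right) \left(\sum_{i \ne j} \ln q_i\right) dZ_{i \ne j}$$
The last step uses $\int q_j\,dZ_j = 1$.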

Factorized distributions (IV)
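$$L(q) = \int q_j \ln \tilde{p}(X, Z_j)\,dZ_j - \int q_j \ln q_j\,dZ_j + \text{const} \quad (10.6)$$
which is a negative KL divergence, where
$$\ln \tilde{p}(X, Z_j) = E_{i \ne j}[\ln p(X, Z)] + \text{const} \quad (10.7)$$
and
$$E_{i \ne j}[\ln p(X, Z)] = \int \ln p(X, Z) \prod_{i \ne j} q_i\,dZ_i \quad (10.8)$$
• Maximize L(q) with respect to q_j(Z_j) by keeping {q_{i≠j}} fixed.
• This is the same as minimizing the KL divergence between q_j(Z_j) and p̃(X, Z_j).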

Optimal Solution
$$\ln q_j^*(Z_j) = E_{i \ne j}[\ln p(X, Z)] + \text{const} \quad (10.9)$$
or
$$q_j^*(Z_j) = \frac{\exp(E_{i \ne j}[\ln p(X, Z)])}{\int \exp(E_{i \ne j}[\ln p(X, Z)])\,dZ_j}$$
(1) Initialize all q_j appropriately.
(2) Run the loop below until convergence.
• For each q_i:
• Fix all q_j with j ≠ i and find the optimal q_i
• Update q_i
Today’s memo: the next slides are detailed examples.

10.1.2. Properties of factorized approximations
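Approximate a Gaussian distribution with a factorized Gaussian.
Consider
$$p(z) = N(z|\mu, \Lambda^{-1})$$
where z = (z_1, z_2),
$$\mu = (\mu_1, \mu_2)^T, \quad \Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$
Approximate using
$$q(z) = q_1(z_1)\,q_2(z_2)$$
Optimal solution from (10.9), considering only the terms that contain z_1:
$$\ln q_1^*(z_1) = E_{z_2}[\ln p(z)] + \text{const}$$
$$= E_{z_2}\left[-\frac{1}{2}(z_1 - \mu_1)^2 \Lambda_{11} - (z_1 - \mu_1)\Lambda_{12}(z_2 - \mu_2)\right] + \text{const}$$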

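$$\ln q_1^*(z_1) = E_{z_2}\left[-\frac{1}{2}(z_1 - \mu_1)^2 \Lambda_{11} - (z_1 - \mu_1)\Lambda_{12}(z_2 - \mu_2)\right] + \text{const}$$
$$= \int q_2(z_2) \left\{-\frac{1}{2}(z_1 - \mu_1)^2 \Lambda_{11} - (z_1 - \mu_1)\Lambda_{12}(z_2 - \mu_2)\right\} dz_2 + \text{const}$$
$$= -\frac{1}{2}(z_1 - \mu_1)^2 \Lambda_{11} - z_1 \Lambda_{12}(E_{z_2}[z_2] - \mu_2) + \text{const} \quad (10.11)$$
This is a quadratic form in z_1, so q_1^* is Gaussian. Then we have
$$q_1^*(z_1) = N(z_1|m_1, \Lambda_{11}^{-1}), \quad m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(E_{z_2}[z_2] - \mu_2)$$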

Optimal solution, (10.12)-(10.15), for q(z) = q_1(z_1) q_2(z_2):
$$q_1^*(z_1) = N(z_1|m_1, \Lambda_{11}^{-1}), \quad m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(E_{z_2}[z_2] - \mu_2)$$
$$q_2^*(z_2) = N(z_2|m_2, \Lambda_{22}^{-1}), \quad m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}(E_{z_1}[z_1] - \mu_1)$$
Mutual dependency:
• q_1^*(z_1) depends on E_{z_2}[z_2] (calculated from q_2^*(z_2))
• q_2^*(z_2) depends on E_{z_1}[z_1] (calculated from q_1^*(z_1))
Update alternately until convergence
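The alternating updates for m_1 and m_2 can be sketched as below. The numerical values of µ and Λ are made up for illustration; any symmetric positive-definite precision matrix would do. The sketch also compares the factorized variances 1/Λ_11 and 1/Λ_22 with the true marginal variances, which are the diagonal of Λ⁻¹.

```python
import numpy as np

# Hypothetical correlated 2-D Gaussian p(z) = N(z | mu, Lam^{-1}).
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 1.5]])   # precision matrix (symmetric, positive definite)

m = np.zeros(2)                # initial guesses for E[z1] and E[z2]
for _ in range(100):
    # m1 = mu1 - Lam11^{-1} Lam12 (E[z2] - mu2), using the current E[z2] = m[1]
    m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
    # m2 = mu2 - Lam22^{-1} Lam21 (E[z1] - mu1), using the updated E[z1] = m[0]
    m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]

q_var = 1.0 / np.diag(Lam)             # variances of q1 and q2
p_var = np.diag(np.linalg.inv(Lam))    # true marginal variances of p(z)

print(m)             # converges to mu
print(q_var, p_var)  # the factorized variances are smaller than the true marginals
```

The means converge to the true mean, while the variances of the factors are too small, which matches the behaviour described for minimizing KL(q||p).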

Fig 10.2: The green contours correspond to 1, 2 and 3 standard deviations for a correlated Gaussian distribution p(z) over two variables z_1 and z_2. The red contours represent the corresponding levels for an approximating distribution q(z) over the same variables given by the product of two independent univariate Gaussians. Panels: minimize KL(q||p); minimize KL(p||q).
• Minimizing KL(q||p): the mean is captured correctly, but the variance is underestimated in the orthogonal direction
• Considering the reverse KL divergence
$$KL(p\|q) = -\int p(Z) \left[\sum_{i=1}^{M} \ln q_i(Z_i)\right] dZ + \text{const}$$
• Optimal solution: $q_j^*(Z_j) = \int p(Z) \prod_{i \ne j} dZ_i = p(Z_j)$ (10.17), that is, the corresponding marginal distribution of p(Z)

Minimize KL(q||p):
$$KL(q\|p) = -\int q(Z) \ln\left\{\frac{p(Z)}{q(Z)}\right\} dZ$$
• If p(Z) is near zero, then q(Z) tends to be close to zero as well
Minimize KL(p||q) (reverse KL divergence):
$$KL(p\|q) = -\int p(Z) \left[\sum_{i=1}^{M} \ln q_i(Z_i)\right] dZ + \text{const}$$
• If p(Z) is near zero, then q(Z) is not important
• The KL divergence is minimized by distributions q(Z) that are nonzero in regions where p(Z) is nonzero

More about divergence
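Fig 10.3: Blue contours = bimodal distribution p(Z). Red contours = single Gaussian distribution q(Z) that best approximates p(Z). Panels: (a) minimize KL(p||q); (b), (c) minimize KL(q||p), showing two different local minima.
• KL(p||q) and KL(q||p) belong to the alpha family of divergences
$$D_\alpha(p\|q) = \frac{4}{1 - \alpha^2} \left(1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\,dx\right)$$
where
• α → 1: $D_\alpha(p\|q) \to KL(p\|q)$
• α → −1: $D_\alpha(p\|q) \to KL(q\|p)$
• If α ≤ −1, q(x) will underestimate the support of p(x)
• If α ≥ 1, q(x) will overestimate the support of p(x)
• If α = 0, the divergence is related to the Hellinger distance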

10.1.3 The univariate Gaussian (I)
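• Goal: infer the posterior distribution for the mean µ and precision τ given the data set D = {x_1, ..., x_N}
• Likelihood function
$$p(D|\mu, \tau) = \left(\frac{\tau}{2\pi}\right)^{N/2} \exp\left\{-\frac{\tau}{2} \sum_{n=1}^{N} (x_n - \mu)^2\right\} \quad (10.21)$$
• Prior
$$p(\mu|\tau) = N(\mu|\mu_0, (\lambda_0 \tau)^{-1}) \quad (10.22)$$
and
$$p(\tau) = \text{Gam}(\tau|a_0, b_0) \quad (10.23)$$
• Approximate
$$q(\mu, \tau) = q_\mu(\mu)\,q_\tau(\tau) \quad (10.24)$$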

10.1.3 The univariate Gaussian (II)
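• Optimal solution (from formula 10.9)
$$\ln q_\mu^*(\mu) = E_\tau[\ln p(D, \mu, \tau)] + \text{const}$$
$$= E_\tau[\ln (p(D|\mu, \tau)\,p(\mu|\tau)\,p(\tau))] + \text{const}$$
$$= E_\tau[\ln p(D|\mu, \tau) + \ln p(\mu|\tau) + \ln p(\tau)] + \text{const}$$
$$= E_\tau[\ln p(D|\mu, \tau) + \ln p(\mu|\tau)] + \text{const}$$
$$= E_\tau\left[-\frac{\tau}{2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{\lambda_0 \tau}{2} (\mu - \mu_0)^2\right] + \text{const}$$
$$= -\frac{E_\tau[\tau]}{2} \left[\sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2\right] + \text{const} \quad (10.25)$$
This is a quadratic form in µ.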

10.1.3 The univariate Gaussian (II)
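• Optimal solution for the mean
$$q_\mu^*(\mu) = N(\mu|\mu_N, \lambda_N^{-1})$$
where
$$\mu_N = \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N} \quad (10.26)$$
$$\lambda_N = (\lambda_0 + N)\,E_\tau[\tau] \quad (10.27)$$
• Similarly for the optimal solution of q_τ(τ)
$$\ln q_\tau^*(\tau) = E_\mu[\ln p(D|\mu, \tau) + \ln p(\mu|\tau) + \ln p(\tau)] + \text{const}$$
$$= -\frac{\tau}{2} E_\mu\left[\sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2\right] + \frac{N}{2} \ln \tau + \frac{1}{2} \ln \tau + (a_0 - 1) \ln \tau - b_0 \tau + \text{const}$$
$$= \left(a_0 + \frac{N + 1}{2} - 1\right) \ln \tau - \left(b_0 + \frac{1}{2} E_\mu[\ldots]\right) \tau + \text{const} \quad (10.28)$$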

10.1.3 The univariate Gaussian (III)
• Optimal solution of q_τ(τ)
$$q_\tau^*(\tau) \propto \tau^{\left(a_0 + \frac{N+1}{2} - 1\right)} \exp\left(-\left(b_0 + \frac{1}{2} E_\mu[\ldots]\right) \tau\right)$$
or
$$q_\tau^*(\tau) = \text{Gam}(\tau|a_N, b_N)$$
where
$$a_N = a_0 + \frac{N + 1}{2} \quad (10.29)$$
$$b_N = b_0 + \frac{1}{2} E_\mu\left[\sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2\right] \quad (10.30)$$
• Use (10.26)(10.27) and (10.29)(10.30) alternately to compute the variational approximation to the posterior p(µ, τ|D)
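The alternating scheme can be sketched as follows. This is a minimal illustration, not the presenter's code: the synthetic data and the prior hyper-parameters (`mu0`, `lam0`, `a0`, `b0`) are assumptions. The expectation in (10.30) is evaluated under q_µ(µ) = N(µ|µ_N, λ_N⁻¹), which gives E[(x_n − µ)²] = (x_n − µ_N)² + 1/λ_N, and E[τ] = a_N/b_N is the mean of the gamma factor.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=500)   # synthetic data, true precision = 4
N, xbar = x.size, x.mean()

# Hypothetical prior hyper-parameters.
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0                                     # initial guess for E[tau]
for _ in range(50):
    # Update q_mu(mu) = N(mu | mu_N, 1 / lam_N)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)          # (10.26)
    lam_N = (lam0 + N) * E_tau                           # (10.27)
    # Update q_tau(tau) = Gam(tau | a_N, b_N)
    a_N = a0 + (N + 1) / 2.0                             # (10.29)
    expected_sq = (np.sum((x - mu_N) ** 2) + N / lam_N
                   + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
    b_N = b0 + 0.5 * expected_sq                         # (10.30)
    E_tau = a_N / b_N                                    # mean of the gamma factor

print(mu_N, E_tau)  # roughly the data mean and the inverse data variance
```

Only E[τ] couples the two factors here, so a scalar carried across iterations is enough to run the updates.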

10.1.3 The univariate Gaussian (IV)
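[Figure: contours of the true posterior p(µ, τ|D) and of the factorized approximation q_µ(µ) q_τ(τ). Panels: initial approximation; after re-estimating q_µ(µ); after re-estimating q_τ(τ); convergence of the factorized approximation.]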

10.1.4 Model Comparison
• Let the prior probabilities on the models be p(m)
• Goal: determine p(m|X), where X is the observed data
• Approximate
$$q(Z, m) = q(Z|m)\,q(m)$$
$$\ln p(X) = L - \sum_m \sum_Z q(Z|m)\,q(m) \ln\left\{\frac{p(Z, m|X)}{q(Z|m)\,q(m)}\right\} \quad (10.34)$$
where the lower bound is
$$L = \sum_m \sum_Z q(Z|m)\,q(m) \ln\left\{\frac{p(Z, X, m)}{q(Z|m)\,q(m)}\right\} \quad (10.35)$$
• Maximizing L with respect to q(m), we get
$$q(m) \propto p(m) \exp(L_m) \quad \text{with} \quad L_m = \sum_Z q(Z|m) \ln\left\{\frac{p(Z, X|m)}{q(Z|m)}\right\}$$
• Maximizing L with respect to q(Z|m), we find that the solutions for different m are coupled due to the conditioning
• Instead, optimize each q(Z|m) individually and subsequently find q(m)

10.2 Variational Mixture of Gaussians
• Goal: apply variational inference to the Gaussian mixture model
• Problem formulation:
• Observed data: X = {x_1, ..., x_N}
• Hidden variables: Z = {z_1, ..., z_N}
• A latent variable z_n = {z_nk} (1-of-K binary vector) corresponds to each observation x_n
• Conditional distribution of Z, given the mixing coefficients π
$$p(Z|\pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \quad (10.37)$$
• Conditional distribution of the observed data vectors, given the latent variables and the component parameters
$$p(X|Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} N(x_n|\mu_k, \Lambda_k^{-1})^{z_{nk}} \quad (10.38)$$

10.2 Variational Mixture of Gaussians
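• Conjugate prior distributions
• Dirichlet distribution over the mixing coefficients π
$$p(\pi) = \text{Dir}(\pi|\alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1} \quad (10.39)$$
• Independent Gaussian-Wishart prior governing the mean and precision of each Gaussian component
$$p(\mu, \Lambda) = p(\mu|\Lambda)\,p(\Lambda) = \prod_{k=1}^{K} N(\mu_k|m_0, (\beta_0 \Lambda_k)^{-1})\,W(\Lambda_k|W_0, \nu_0) \quad (10.40)$$
Choose m_0 = 0 by symmetry.
Fig 10.5: Directed acyclic graph representing the Bayesian mixture of Gaussians model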

10.2.1. Variational Distribution (I)
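• Joint distribution
$$p(X, Z, \pi, \mu, \Lambda) = p(X|Z, \mu, \Lambda)\,p(Z|\pi)\,p(\pi)\,p(\mu|\Lambda)\,p(\Lambda)$$
• Approximate
$$q(Z, \pi, \mu, \Lambda) = q(Z)\,q(\pi, \mu, \Lambda) \quad (10.42)$$
• Optimal solution (from formula 10.9)
$$\ln q^*(Z) = E_{\pi, \mu, \Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const} \quad (10.43)$$
$$= E_\pi[\ln p(Z|\pi)] + E_{\mu, \Lambda}[\ln p(X|Z, \mu, \Lambda)] + \text{const} \quad (10.44)$$
$$= E_\pi\left[\ln \left(\prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\right)\right] + E_{\mu, \Lambda}\left[\ln \left(\prod_{n=1}^{N} \prod_{k=1}^{K} N(x_n|\mu_k, \Lambda_k^{-1})^{z_{nk}}\right)\right] + \text{const}$$
$$= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} E_\pi[\ln \pi_k] + \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} E_{\mu, \Lambda}\left[\frac{1}{2} \ln |\Lambda_k| - \frac{D}{2} \ln(2\pi) - \frac{1}{2} (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right] + \text{const}$$
D is the dimensionality of the data variable x.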

10.2.1 Variational Distribution (II)
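• Optimal solution for q(Z)
$$\ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const} \quad (10.45)$$
where
$$\ln \rho_{nk} = E[\ln \pi_k] + \frac{1}{2} E[\ln |\Lambda_k|] - \frac{D}{2} \ln(2\pi) - \frac{1}{2} E_{\mu_k, \Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)] \quad (10.46)$$
then
$$q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}} \quad (10.47)$$
and, normalized,
$$q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}} \quad (10.48)$$
where $r_{nk} = \rho_{nk} / \sum_{j=1}^{K} \rho_{nj}$ and $E[z_{nk}] = r_{nk}$. The r_nk can also be seen as responsibilities, as in the case of EM.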

10.2.1 Variational Distribution (III)
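• Optimal solution for q(π, µ, Λ)
Define:
$$N_k = \sum_{n=1}^{N} r_{nk} \quad (10.51)$$
$$\bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\,x_n \quad (10.52)$$
$$S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T \quad (10.53)$$
Optimal solution:
$$\ln q^*(\pi, \mu, \Lambda) = E_Z[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}$$
$$= E_Z[\ln (p(X|Z, \mu, \Lambda)\,p(Z|\pi)\,p(\pi)\,p(\mu|\Lambda)\,p(\Lambda))] + \text{const}$$
$$= E_Z[\ln p(X|Z, \mu, \Lambda)] + E_Z[\ln p(Z|\pi)] + \ln p(\pi) + \ln p(\mu, \Lambda) + \text{const}$$
$$= \sum_{n=1}^{N} \sum_{k=1}^{K} E_Z[z_{nk}] \ln N(x_n|\mu_k, \Lambda_k^{-1}) + E_Z[\ln p(Z|\pi)] + \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + \text{const} \quad (10.54)$$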

10.2.1 Variational Distribution (IV)
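$$\ln q^*(\pi, \mu, \Lambda) = \sum_{n=1}^{N} \sum_{k=1}^{K} E_Z[z_{nk}] \ln N(x_n|\mu_k, \Lambda_k^{-1}) + E_Z[\ln p(Z|\pi)] + \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + \text{const} \quad (10.54)$$
The right-hand side is a sum of terms involving only π plus terms involving only µ, Λ.
So the solution factorizes further:
$$q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k), \quad q^*(\pi, \mu, \Lambda) = q^*(\pi) \prod_{k=1}^{K} q^*(\mu_k, \Lambda_k) \quad (10.55)$$
From (10.54) and (10.55) we have
$$\ln q^*(\pi) = E_Z[\ln p(Z|\pi)] + \ln p(\pi) + \text{const}$$
$$\ln q^*(\mu_k, \Lambda_k) = \ln p(\mu_k, \Lambda_k) + \sum_{n=1}^{N} E_Z[z_{nk}] \ln N(x_n|\mu_k, \Lambda_k^{-1}) + \text{const}$$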

10.2.1 Variational Distribution (VI)
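$$\ln q^*(\pi) = E_Z[\ln p(Z|\pi)] + \ln p(\pi) + \text{const}$$
$$= E_Z\left[\ln \left(\prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\right)\right] + \ln \left(C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}\right) + \text{const}$$
$$= \sum_{n=1}^{N} \sum_{k=1}^{K} E_Z[z_{nk}] \ln \pi_k + (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \text{const}$$
$$= \sum_{k=1}^{K} (N_k + \alpha_0 - 1) \ln \pi_k + \text{const}$$
$$= \ln \left(\prod_{k=1}^{K} \pi_k^{N_k + \alpha_0 - 1}\right) + \text{const} \quad (10.56)$$
q^*(π) is recognized as a Dirichlet distribution:
$$q^*(\pi) = \text{Dir}(\pi|\alpha) \quad (10.57)$$
$$\alpha_k = N_k + \alpha_0 \quad (10.58)$$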
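As a small illustration of this update, the sketch below computes N_k from a matrix of responsibilities and forms α_k = N_k + α_0. The responsibilities and the value of α_0 are invented for the example, and `update_dirichlet` is a hypothetical helper name. The mean of a Dirichlet, E[π_k] = α_k / Σ_j α_j, is used to show how a component that receives almost no responsibility ends up with a mixing coefficient near zero.

```python
import numpy as np

def update_dirichlet(responsibilities, alpha0):
    """Return the Dirichlet parameters alpha_k = N_k + alpha0 and the mean E[pi_k]."""
    N_k = responsibilities.sum(axis=0)     # N_k = sum_n r_nk
    alpha = alpha0 + N_k                   # alpha_k = N_k + alpha0
    expected_pi = alpha / alpha.sum()      # mean of Dir(pi | alpha)
    return alpha, expected_pi

# Hypothetical responsibilities for 6 points and 3 components;
# the third component explains none of the data.
r = np.array([[0.90, 0.10, 0.0],
              [0.80, 0.20, 0.0],
              [0.10, 0.90, 0.0],
              [0.20, 0.80, 0.0],
              [0.95, 0.05, 0.0],
              [0.05, 0.95, 0.0]])

alpha, expected_pi = update_dirichlet(r, alpha0=0.01)
print(alpha)        # Dirichlet parameters after the update
print(expected_pi)  # the unused component gets an expected weight close to zero
```

A small α_0 lets the data dominate, which is what allows unneeded components to be switched off.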

10.2.1 Variational Distribution (VII)
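$$\ln q^*(\mu_k, \Lambda_k) = \ln p(\mu_k, \Lambda_k) + \sum_{n=1}^{N} E_Z[z_{nk}] \ln N(x_n|\mu_k, \Lambda_k^{-1}) + \text{const}$$
We obtain a Gaussian-Wishart distribution (exercise 10.13):
$$q^*(\mu_k, \Lambda_k) = N(\mu_k|m_k, (\beta_k \Lambda_k)^{-1})\,W(\Lambda_k|W_k, \nu_k) \quad (10.59)$$
with parameters given by (10.60)-(10.63).
This is analogous to the M-step of the EM algorithm.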

10.2.1 Variational Distribution (VIII)
• Optimize the variational posterior distribution for the Gaussian mixture
(1) Initialize the responsibilities r_nk
(2) Update N_k, x̄_k, S_k by (10.51)-(10.53)
(3) [M step] Fix the responsibilities and use them to recompute the variational distribution over parameters
• Use (10.57) to find q^*(π)
• Use (10.59) to find q^*(µ_k, Λ_k) (k = 1, ..., K)
(4) [E step] Use the current distribution of parameters to evaluate the responsibilities
• Use (10.64)-(10.66) and (10.46)-(10.49) to update the responsibilities r_nk
(5) Go back to (2) until convergence

10.2.1 Variational Distribution (IX)
Figure 10.6: Variational Bayesian mixture of K = 6 Gaussians applied to the Old Faithful data set. The ellipses denote the one standard-deviation density contours for each of the components, and the density of red ink inside each ellipse corresponds to the mean value of the mixing coefficient for each component. The panels show the state after increasing numbers of iterations.
The mixing coefficients of unneeded components tend to zero, so these components disappear.

Compare EM with Variational Bayes
• Roughly the same computational complexity
• As the number of data points N → ∞, the Bayesian treatment converges to the maximum likelihood EM algorithm
• The advantages of Variational Bayes
A. Singularities that arise in ML are absent in the Bayesian treatment; they are removed by the introduction of the prior
B. No over-fitting when K is chosen large, so it can be used for determining the number of components

10.2.2 Variational lower bound
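• At each step of the iterative re-estimation procedure, the value of this bound should not decrease
• Useful to test convergence
• Useful to check the correctness of both the mathematical expressions and the implementation
• For the variational mixture of Gaussians, the lower bound is given by
$$L = \sum_Z \iiint q(Z, \pi, \mu, \Lambda) \ln\left\{\frac{p(X, Z, \pi, \mu, \Lambda)}{q(Z, \pi, \mu, \Lambda)}\right\} d\pi\,d\mu\,d\Lambda$$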

10.2.3 Predictive density
• Predictive density p(x̂|X) for a new value x̂ with corresponding latent variable ẑ (10.78)
• Depends on the posterior distribution of the parameters
• As the posterior distribution is intractable, the variational approximation q(π)q(µ, Λ) can be used to obtain an approximate predictive density

10.2.4 Number of components
• For a given mixture model of K components, each parameter setting is a member of a family of K! equivalent settings
• Start with a relatively large value of K; components with insufficient contribution are pruned out as their mixing coefficients are driven to zero
Figure 10.7: Plot of the variational lower bound L versus the number K of components in the Gaussian mixture model, for the Old Faithful data, showing a distinct peak at K = 2 components. For each value of K, the model is trained from 100 different random starts, and the results are shown as ‘+’ symbols plotted with small random horizontal perturbations so that they can be distinguished. Note that some solutions find suboptimal local maxima, but that this happens infrequently.

10.2.5 Induced factorizations
Induced factorization arises from an interaction between the factorization assumption in the variational posterior and the conditional independence properties of the true posterior.
• For example, let A, B, C be disjoint groups of latent variables
• Factorization assumption: q(A, B, C) = q(A, B) q(C)
• The optimal solution:
$$\ln q^*(A, B) = E_C[\ln p(A, B|X, C)] + \text{const}$$
• We need to determine whether q^*(A, B) = q^*(A) q^*(B). This is possible iff $\ln p(A, B|X, C) = f(A, C) + g(B, C)$, that is, iff A and B are conditionally independent given X and C.
• This can also be determined from the directed graph of the model

10.3 Variational Linear Regression
• Return to the Bayesian linear regression model (section 3.3)
• There, the integration over α, β was approximated by making point estimates α̂, β̂, obtained by maximizing the log marginal likelihood (the input x is omitted from the notation; t denotes the predicted value and **t** the training target values)
• With a flat prior over α, β, this is a maximum likelihood (MLE) estimate of the hyper-parameters

10.3 Variational Linear Regression
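• Fully Bayesian approach: integrate over the hyper-parameters as well as over the parameters (this section)
• Suppose that the noise precision parameter β is known
• Likelihood
$$p(\mathbf{t}|w) = \prod_{n=1}^{N} N(t_n|w^T \phi_n, \beta^{-1}) \quad (10.87)$$
where $\phi_n = \phi(x_n)$
• Prior
$$p(w|\alpha) = N(w|0, \alpha^{-1} I) = \frac{1}{(2\pi)^{M/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2} w^T \Sigma^{-1} w\right\} \quad (10.88)$$
where $\Sigma = \alpha^{-1} I$, and
$$p(\alpha) = \text{Gam}(\alpha|a_0, b_0) \propto \alpha^{a_0 - 1} \exp(-b_0 \alpha) \quad (10.89)$$
• The joint distribution of all the variables
$$p(\mathbf{t}, w, \alpha) = p(\mathbf{t}|w)\,p(w|\alpha)\,p(\alpha) \quad (10.90)$$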

10.3.1 Variational Distribution
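• Goal: find an approximation to the posterior distribution p(w, α|**t**)
• Factorized approximation
$$q(w, \alpha) = q(w)\,q(\alpha) \quad (10.91)$$
• Optimal solution (from 10.9): $\ln q_j^*(Z_j) = E_{i \ne j}[\ln p(X, Z)] + \text{const}$
$$\ln q^*(\alpha) = E_w[\ln (p(\mathbf{t}|w)\,p(w|\alpha)\,p(\alpha))] + \text{const}$$
$$= \ln p(\alpha) + E_w[\ln p(w|\alpha)] + \text{const}$$
$$= (a_0 - 1) \ln \alpha - b_0 \alpha + \frac{M}{2} \ln \alpha - \frac{\alpha}{2} E_w[w^T w] + \text{const} \quad (10.92)$$
then
$$q^*(\alpha) = \text{Gam}(\alpha|a_N, b_N) \quad (10.93)$$
where
$$a_N = a_0 + \frac{M}{2} \quad (10.94)$$
$$b_N = b_0 + \frac{1}{2} E_w[w^T w] \quad (10.95)$$
M is the number of fitting parameters w_i (the dimension of the feature vector).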

• Similarly for the optimal solution of q(w): ln q^*(w) is a quadratic form in w, (10.96)-(10.98), then
$$q^*(w) = N(w|m_N, S_N) \quad (10.99)$$

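• Optimal solution
$$q^*(\alpha) = \text{Gam}(\alpha|a_N, b_N) \quad (10.93)$$
where
$$a_N = a_0 + \frac{M}{2} \quad (10.94), \qquad b_N = b_0 + \frac{1}{2} E_w[w^T w] \quad (10.95)$$
$$q^*(w) = N(w|m_N, S_N) \quad (10.99)$$
where
$$m_N = \beta S_N \Phi^T \mathbf{t} \quad (10.100), \qquad S_N = (E[\alpha] I + \beta \Phi^T \Phi)^{-1} \quad (10.101)$$
Moments:
$$E[\alpha] = a_N / b_N \quad (10.102), \qquad E[w w^T] = m_N m_N^T + S_N \quad (10.103)$$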
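The re-estimation cycle for q(w) and q(α) can be sketched as below. The equations used are (10.94), (10.95) and (10.100)-(10.102) as given in the book; everything else is an assumption made for illustration: the synthetic quadratic data, the known noise precision `beta`, and the broad hyper-prior values `a0`, `b0`. The expectation E[wᵀw] is obtained from (10.103) as m_Nᵀm_N + Tr(S_N).

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta = 50, 25.0                               # beta: known noise precision (std 0.2)
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x, x ** 2])   # design matrix, rows phi(x_n)
w_true = np.array([0.5, -1.0, 2.0])              # weights used to generate the data
t = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), size=N)

a0, b0 = 1e-2, 1e-2                              # hypothetical broad Gamma hyper-prior
M = Phi.shape[1]

E_alpha = 1.0                                    # initial guess for E[alpha]
for _ in range(50):
    # Update q(w) = N(w | m_N, S_N)
    S_N = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)   # (10.101)
    m_N = beta * S_N @ Phi.T @ t                                    # (10.100)
    # Update q(alpha) = Gam(alpha | a_N, b_N)
    a_N = a0 + M / 2.0                                              # (10.94)
    b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))                    # (10.95)
    E_alpha = a_N / b_N                                             # (10.102)

print(m_N)      # posterior mean of the weights, close to w_true
print(E_alpha)  # inferred prior precision
```

Since a_N does not depend on the data, only b_N and the Gaussian factor change from one sweep to the next.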

More about variational linear regression
• Predictive distribution over t, given a new input x
• Lower bound
(10.105)
(10.107)

More about variational linear regression
Figure 10.9: The lower bound (vertical axis) versus the order M of the polynomial model (horizontal axis), in which a set of 10 data points is generated from a polynomial with M = 3, sampled over (-5, 5), with additive Gaussian noise of variance 0.09. The value of the bound gives the log probability of the model.
The peak is at M = 3, corresponding to the true model.

Progress…
Part I is done. Now Part II: design the variational algorithms.

10.4 Exponential Family Distributions
• For many of the models in this book, the complete data likelihood is drawn
from the exponential family
• In general, this will not be the case for the marginal likelihood function for
the observed data. Ex: in a mixture of Gaussians.

• Observed data: X = {x_1, ..., x_N}
• Latent variables: Z = {z_1, ..., z_N}
• Suppose that the joint distribution is a member of the exponential family
$$p(X, Z|\eta) = \prod_{n=1}^{N} h(x_n, z_n)\,g(\eta) \exp\{\eta^T u(x_n, z_n)\} \quad (10.113)$$
with the conjugate prior for η
$$p(\eta|\nu_0, \chi_0) = f(\nu_0, \chi_0)\,g(\eta)^{\nu_0} \exp\{\nu_0 \eta^T \chi_0\} \quad (10.114)$$
(ν_0 acts as a prior number of observations, all having the value χ_0 for the u vector)
• Variational distribution
$$q(Z, \eta) = q(Z)\,q(\eta)$$

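• Optimal solution (from 10.9): $\ln q_j^*(Z_j) = E_{i \ne j}[\ln p(X, Z)] + \text{const}$
$$\ln q^*(Z) = E_\eta[\ln p(X, Z, \eta)] = E_\eta[\ln p(X, Z|\eta)\,p(\eta|\nu_0, \chi_0)\,p(\nu_0, \chi_0)]$$
$$= E_\eta[\ln p(X, Z|\eta)] + \text{const}$$
$$= \sum_{n=1}^{N} \left\{\ln h(x_n, z_n) + E_\eta[\eta^T]\,u(x_n, z_n)\right\} + \text{const} \quad (10.115)$$
This is a sum of independent terms, one per n, which gives an induced factorization:
$$q^*(Z) = \prod_n q^*(z_n)$$
where
$$q^*(z_n) = h(x_n, z_n)\,g(E_\eta[\eta]) \exp\{E_\eta[\eta^T]\,u(x_n, z_n)\} \quad (10.116)$$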

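• Optimal solution (from 10.9): $\ln q_j^*(Z_j) = E_{i \ne j}[\ln p(X, Z)] + \text{const}$
$$\ln q^*(\eta) = E_Z[\ln p(X, Z, \eta)] = E_Z[\ln p(X, Z|\eta)\,p(\eta|\nu_0, \chi_0)\,p(\nu_0, \chi_0)]$$
$$= \ln p(\eta|\nu_0, \chi_0) + E_Z[\ln p(X, Z|\eta)] + \text{const}$$
$$= \nu_0 \ln g(\eta) + \nu_0 \eta^T \chi_0 + \sum_{n=1}^{N} \left\{\ln g(\eta) + \eta^T E_{z_n}[u(x_n, z_n)]\right\} + \text{const} \quad (10.118)$$
then
$$q^*(\eta) = f(\nu_N, \chi_N)\,g(\eta)^{\nu_N} \exp\{\nu_N \eta^T \chi_N\} \quad (10.119)$$
where
$$\nu_N = \nu_0 + N \quad (10.120)$$
$$\nu_N \chi_N = \nu_0 \chi_0 + \sum_{n=1}^{N} E_{z_n}[u(x_n, z_n)] \quad (10.121)$$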

Variational message passing
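• The joint distribution corresponding to a directed graph
$$p(x) = \prod_i p(x_i|\text{pa}_i)$$
where x_i is the variable (or variables) associated with node i (latent or observed) and pa_i is the parent set corresponding to node i
• Variational approximation
$$q(x) = \prod_i q_i(x_i)$$
then the optimal solution is
$$\ln q_j^*(x_j) = E_{i \ne j}\left[\sum_i \ln p(x_i|\text{pa}_i)\right] + \text{const}$$
Only the terms in the Markov blanket of node j depend on x_j; thus the update of the factors in the variational posterior distribution represents a local calculation on the graph.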

Variational message passing
• If all the conditional distributions have a conjugate-exponential
structure, then the variational update will be:
• The distribution associated with a particular node can be updated once that
node has received messages from all of its parents and all of its children.
• It requires that the children have already received messages from their co-
parents

10.5 Local Variational Methods
• Global methods: approximation to the full posterior
• Local methods: approximation to individual or groups of variables
• Replace the likelihood with a simpler form - lower bound
that makes the expectation easy to compute

Convex duality
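[Figure: a convex function f(x) with a line ηx that is one lower bound but not the best; the line is moved vertically until it becomes a tangent.]
Convex function f(x):
$$g(\eta) = \max_x \{\eta x - f(x)\}, \qquad f(x) = \max_\eta \{\eta x - g(\eta)\}$$
Concave function f(x):
$$f(x) = \min_\eta \{\eta x - g(\eta)\}, \qquad g(\eta) = \min_x \{\eta x - f(x)\}$$
(See equations (10.130)-(10.133) in the book.)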

Original problem (logistic sigmoid):
$$p(y = 1|x) = \frac{1}{1 + \exp(-x)} = \sigma(x)$$
$f(x) = \ln \sigma(x)$ is a concave function, then consider
$$g(\eta) = \min_x \{\eta x - f(x)\} = -\eta \ln \eta - (1 - \eta) \ln(1 - \eta) \quad (10.135)$$
The upper bound:
$$\ln \sigma(x) \le \eta x - g(\eta) \quad (10.136)$$
$$\sigma(x) \le \exp(\eta x - g(\eta)) \quad (10.137)$$

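The lower bound (logistic sigmoid):
• Gaussian lower bound (Jaakkola and Jordan, 2000)
$$f(x) = -\ln(e^{x/2} + e^{-x/2})$$
is a convex function of the variable x², then consider
$$g(\eta) = \max_{x^2} \left\{\eta x^2 - f(\sqrt{x^2})\right\}$$
The stationarity condition:
$$0 = \eta - \frac{dx}{dx^2} \frac{d}{dx} f(x) = \eta + \frac{1}{4x} \tanh\left(\frac{x}{2}\right) \quad (10.139)$$
Denoting the value of x at the stationary point by ξ:
$$\eta = -\frac{1}{4\xi} \tanh\left(\frac{\xi}{2}\right) = -\frac{1}{2\xi} \left[\sigma(\xi) - \frac{1}{2}\right] = -\lambda(\xi)$$
$$g(\lambda(\xi)) = -\lambda(\xi)\xi^2 - f(\xi) = -\lambda(\xi)\xi^2 + \ln(e^{\xi/2} + e^{-\xi/2})$$
The bound on f(x):
$$f(x) \ge -\lambda(\xi)x^2 - g(\lambda(\xi)) = -\lambda(\xi)x^2 + \lambda(\xi)\xi^2 - \ln(e^{\xi/2} + e^{-\xi/2})$$
The bound on the sigmoid:
$$\sigma(x) \ge \sigma(\xi) \exp\{(x - \xi)/2 - \lambda(\xi)(x^2 - \xi^2)\} \quad (10.144)$$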
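The bound (10.144) is easy to check numerically. The sketch below evaluates the sigmoid and the bound on a grid for one arbitrarily chosen value ξ = 2.5; the function names are hypothetical. It also confirms that the bound touches the sigmoid at x = ξ and at x = −ξ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    """lambda(xi) = (sigmoid(xi) - 1/2) / (2 xi), defined for xi != 0."""
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def jj_lower_bound(x, xi):
    """Right-hand side of (10.144): sigma(xi) exp{(x - xi)/2 - lambda(xi)(x^2 - xi^2)}."""
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x ** 2 - xi ** 2))

x = np.linspace(-6.0, 6.0, 241)
xi = 2.5                                   # arbitrary variational parameter
gap = sigmoid(x) - jj_lower_bound(x, xi)   # should be non-negative everywhere

print(gap.min())                                   # approximately zero, never negative
print(sigmoid(xi) - jj_lower_bound(xi, xi))        # tight at x = xi
print(sigmoid(-xi) - jj_lower_bound(-xi, xi))      # tight at x = -xi
```

The exponent is quadratic in x, which is what makes expectations of the bound under a Gaussian tractable.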

How the bounds can be used
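• Evaluate $I = \int \sigma(a)\,p(a)\,da$ (intractable)
• The local variational bound: $\sigma(a) \ge f(a, \xi)$
• The variational bound:
$$I \ge \int f(a, \xi)\,p(a)\,da = F(\xi)$$
where ξ is an additional parameter (its optimal value depends on a). Since the integral is over a, we look for the compromise value ξ* that maximizes F(ξ).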

In review…
Original problem:
$$p(y = 1|x) = \frac{1}{1 + \exp(-x)} = \sigma(x)$$
Local bound:
$$\sigma(x) \ge \sigma(\xi) \exp\{(x - \xi)/2 - \lambda(\xi)(x^2 - \xi^2)\}$$
where ξ is an additional parameter.
The bound has only linear or quadratic terms: expectations, especially against a Gaussian, are easy to compute.

10.6 Variational Logistic Regression
Return to the Bayesian logistic regression model (section 4.5)
• The posterior distribution
$$p(w|\mathbf{t}) \propto p(w)\,p(\mathbf{t}|w)$$
where the prior distribution is
$$p(w) = N(w|w_0, S_0)$$
and the likelihood function is
$$p(\mathbf{t}|w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}, \quad \text{where } y_n = \sigma(w^T \phi_n)$$
• Then maximize the posterior to give w_MAP, and then form
• The Gaussian approximation
$$p(w|\mathbf{t}) \approx N(w|w_N, S_N)$$

10.6.1 Variational posterior distribution
A practical example of local variational method
• Recap of variational framework: maximize a lower bound on the marginal
likelihood
For the Bayesian logistic regression model, the marginal likelihood is:
The conditional distribution for t
(10.147)
(10.148)
where

• Lower bound on the logistic sigmoid function
where
• We can therefore write
• Bound on the joint distribution of t and w
where
(10.149)
(10.151)
(10.152)
(10.153)
where each training set observation (φ_n, t_n) has its own variational parameter ξ_n

• Lower bound on the log of the joint distribution of t and w
• Hypothesis for the prior p(w): Gaussian with parameters m0 and S0
considered as ﬁxed
Then, the right side of (10.154) becomes as function of w
(10.154)
(10.155)

• Quantity of interest: exact posterior distribution, requires normalization of
the left side in (10.152) but usually intractable
• Work instead with the right side of (10.155): a quadratic function of w which
is a lower bound of p(w, t)
• A Gaussian variational posterior of the form
where
(10.156)
(10.157)
(10.158)

10.6.2 Optimizing the variational parameters
• Determine the variational parameters by maximizing the lower
bound of the marginal likelihood
Two approaches
Substitute (10.152) back into the marginal likelihood
(10.159)
• (1) View w as a latent variable and use the EM algorithm
• (2) Compute and maximize directly using the fact that p(w) is Gaussian
and is a quadratic function of w
Re-estimation equations
(10.164)
(10.163)

10.6.2 Optimizing the variational parameters
• Invoke the EM algorithm
1. Initialize values for ξ^old
2. E step
• Use ξ^old to calculate the posterior distribution over w
3. M step
• Maximize the expected complete-data log likelihood
$$Q(\xi, \xi^{old}) = E_{q(w)}[\ln h(w, \xi)\,p(w)] \quad (10.160)$$
• Solve the stationarity condition (10.162), then obtain the re-estimation equation (10.163)
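The EM loop can be sketched as below. The slide cites the update equations only by number, so the formulas in the code are the ones from the book: S_N⁻¹ = S_0⁻¹ + 2 Σ_n λ(ξ_n) φ_n φ_nᵀ (10.158), m_N = S_N(S_0⁻¹ m_0 + Σ_n (t_n − 1/2) φ_n) (10.157), and (ξ_n^new)² = φ_nᵀ(S_N + m_N m_Nᵀ) φ_n (10.163). The synthetic data set and the prior `m0`, `S0` are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    """lambda(xi) = (sigmoid(xi) - 1/2) / (2 xi), for xi > 0."""
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=N)
Phi = np.column_stack([np.ones(N), x])                  # features phi_n = (1, x_n)
t = (x + 0.3 * rng.normal(size=N) > 0).astype(float)    # noisy binary targets

m0 = np.zeros(2)                  # hypothetical prior mean
S0_inv = np.eye(2) / 10.0         # hypothetical broad prior, S0 = 10 I

xi = np.ones(N)                   # one variational parameter per observation
for _ in range(30):
    # E step: Gaussian posterior q(w) = N(w | m_N, S_N) for the current xi
    S_N = np.linalg.inv(S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi)    # (10.158)
    m_N = S_N @ (S0_inv @ m0 + Phi.T @ (t - 0.5))                  # (10.157)
    # M step: re-estimate each xi_n from the second moment of w
    second_moment = S_N + np.outer(m_N, m_N)
    xi = np.sqrt(np.einsum("ni,ij,nj->n", Phi, second_moment, Phi))  # (10.163)

accuracy = np.mean((Phi @ m_N > 0) == (t == 1))
print(m_N, accuracy)
```

Each ξ_n stays strictly positive because S_N + m_N m_Nᵀ is positive definite, so λ(ξ_n) is always well defined.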

Illustration

10.6.3 Inference of hyper parameters
• Allow hyper parameters to be inferred from dataset
• Consider simple Gaussian prior form
• Consider conjugate hyper prior given by a gamma distribution
• The marginal likelihood
where the joint distribution
(10.168)
(10.167)
(10.166)
(10.165)

Combine global and local approach
• (1) Global approach: consider a variational distribution and apply the standard decomposition (10.169)
• (2) The lower bound L(q) is intractable, so apply the local approach as before to get a lower bound on L(q) and hence on ln p(t) (10.172)
• (3) Assume that q is factorized as q(w, α) = q(w) q(α)

• It follows (quadratic function of w)
where
(10.174)
(10.175)
(10.176)
• From (10.153) and (10.165)
(10.153)
(10.165)

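where the parameters of q(w) are given by (10.177)-(10.179).
• Similarly for q(α), from (10.165) and (10.166):
$$p(\alpha) = \text{Gam}(\alpha|a_0, b_0) \propto \alpha^{a_0 - 1} \exp(-b_0 \alpha) \quad (10.166)$$
We have
$$q(\alpha) = \text{Gam}(\alpha|a_N, b_N) = \frac{1}{\Gamma(a_N)}\,b_N^{a_N}\,\alpha^{a_N - 1} e^{-b_N \alpha}$$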

• The variational parameters ξ_n are obtained by maximizing the lower bound, as we did before with
$$Q(\xi, \xi^{old}) = E_{q(w)}[\ln h(w, \xi)\,p(w)] \quad (10.160)$$
• Re-estimation equations: (10.180)-(10.183)

Expectation Propagation (Minka, 2001)
• An alternative form of deterministic approximate inference based on the reverse KL divergence KL(p||q) (instead of KL(q||p)), where p is the complex distribution
$$KL(p\|q) = \int p(z) \ln \frac{p(z)}{q(z)}\,dz, \qquad KL(q\|p) = \int q(z) \ln \frac{q(z)}{p(z)}\,dz$$
• Consider a fixed distribution p(z) and a member of the exponential family q(z):
$$q(z) = h(z)\,g(\eta) \exp\{\eta^T u(z)\} \quad (10.184)$$
• The Kullback-Leibler divergence as a function of η:
$$KL(p\|q) = -\ln g(\eta) - \eta^T E_{p(z)}[u(z)] + \text{const} \quad (10.185)$$
• Setting the gradient with respect to η to zero:
$$-\nabla \ln g(\eta) = E_{p(z)}[u(z)] \quad (10.186)$$

Expectation Propagation
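• Member of the exponential family q(z):
$$q(z) = h(z)\,g(\eta) \exp\{\eta^T u(z)\} \quad (10.184)$$
then
$$\int h(z)\,g(\eta) \exp\{\eta^T u(z)\}\,dz = 1$$
Taking the gradient of both sides with respect to η:
$$\nabla g(\eta) \int h(z) \exp\{\eta^T u(z)\}\,dz + g(\eta) \int h(z) \exp\{\eta^T u(z)\}\,u(z)\,dz = 0$$
$$-\frac{1}{g(\eta)} \nabla g(\eta) = g(\eta) \int h(z) \exp\{\eta^T u(z)\}\,u(z)\,dz = E_{q(z)}[u(z)]$$
$$-\nabla \ln g(\eta) = E_{q(z)}[u(z)]$$
From (10.186) we have
$$E_{q(z)}[u(z)] = E_{p(z)}[u(z)] \quad (10.187)$$
Moment matching: for a Gaussian q(z), set the mean and covariance of q(z) to be the same as those of p(z).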
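Moment matching can be illustrated numerically. In the sketch below, p(z) is an assumed two-component Gaussian mixture (the weights, means and variances are made up), and q(z) is a single Gaussian. KL(p||q) is evaluated by summation on a grid, and the moment-matched Gaussian is compared with two perturbed alternatives.

```python
import numpy as np

def normal_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical bimodal target p(z) on a fine grid.
z = np.linspace(-12.0, 12.0, 4801)
dz = z[1] - z[0]
p = 0.6 * normal_pdf(z, -2.0, 1.0) + 0.4 * normal_pdf(z, 3.0, 0.5)

# Moments of p(z), computed numerically.
mean_p = np.sum(z * p) * dz
var_p = np.sum((z - mean_p) ** 2 * p) * dz

def kl_p_q(mean, var):
    """KL(p || q) for a Gaussian q = N(mean, var), by summation on the grid."""
    q = normal_pdf(z, mean, var)
    return np.sum(p * (np.log(p) - np.log(q))) * dz

kl_matched = kl_p_q(mean_p, var_p)          # q with the same mean and variance as p
kl_shifted = kl_p_q(mean_p + 0.5, var_p)    # same variance, shifted mean
kl_narrow = kl_p_q(mean_p, 0.5 * var_p)     # same mean, halved variance

print(mean_p, var_p)
print(kl_matched, kl_shifted, kl_narrow)    # the matched q gives the smallest value
```

The matched Gaussian is broad and covers both modes, which is the behaviour described earlier for minimizing KL(p||q).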

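• Assume the joint distribution of the data D and the hidden variables and parameters θ is of the form
$$p(D, \theta) = \prod_i f_i(\theta) \quad (10.188)$$
• The quantities of interest are the posterior distribution
$$p(\theta|D) = \frac{1}{p(D)} \prod_i f_i(\theta) \quad (10.189)$$
and the model evidence
$$p(D) = \int \prod_i f_i(\theta)\,d\theta \quad (10.190)$$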

• Expectation propagation is based on an approximation to the posterior distribution which is also given by a product of factors, where each factor comes from the exponential family
• Ideally, we would like to determine the approximating factors by minimizing the KL divergence between the true posterior and the approximation
• Alternatively, minimize the KL divergence between each pair of true and approximating factors independently, but the resulting product is usually a poor approximation
• EP: optimize each factor in turn using the current values for the remaining factors (works well for logistic-type models but poorly for mixture-type models due to multi-modality)

The clutter problem

The clutter problem
• Mixture of Gaussians of the form (10.209), where w is the proportion of background clutter and is assumed to be known. The prior is taken to be Gaussian
• The joint distribution of N observations and θ is given by
• To apply EP, ﬁrst identify the factors
then, choose the exponential family
• The factor approximation takes the form of exponentials of quadratic
functions
(10.209)
(10.210)
(10.211)
(10.212)
(10.213)

The clutter problem

The clutter problem
• Evaluate the approximation to the model evidence
where
(10.223)
(10.224)

Expectation Propagation on graphs
• The factors are not function of all variables. If the approximating
distribution is fully factorized, EP reduces to Loopy Belief Propagation
• We seek an approximation q(x) that has the same factorization
(10.225)
(10.226)

• Restrict to approximations in which the factors factorize with respect to the
individual variables so that
(10.227)

• Suppose all the factors are initialized and we choose to reﬁne factor
• Minimizing the reverse KL when q factorizes, leads to an optimal solution q
where factors are the marginals of p

Standard belief propagation

The same as in Chapter 8

Expectation propagation
• Sum-product BP arises as a special case of EP when a fully factorized approximating distribution is used
• EP can be seen as a way to generalize this: group factors and update them together, or use a partially connected graph
• Open question: how to choose the best grouping and disconnection?
• Summary: EP and variational message passing correspond to the optimization of two different KL divergences
• Minka (2005) gives a more general point of view using the family of alpha-divergences, which includes both KL and reverse KL but also other divergences such as the Hellinger distance and the chi-squared distance
• He shows that by choosing to optimize one or another of these divergences, you can derive a broad range of message passing algorithms, including variational message passing, loopy BP, EP, tree-reweighted BP, fractional BP and power EP

Summary
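10.1. Variational Inference
10.2. Variational Mixture of Gaussians
10.3. Variational Linear Regression
10.4. Exponential Family Distributions
10.5. Local Variational Methods
10.6. Variational Logistic Regression
10.7. Expectation Propagation
Part I: Probabilistic modeling and the variational principle
Part II: Design the variational algorithms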
