Approximate Inference (Chapter 10, PRML Reading)
1. C.M. Bishop's PRML
Chapter 10: Approximate Inference
Tran Quoc Hoan
@k09hthaduonght.wordpress.com/
13 December 2015, PRML Reading, Hasegawa lab., Tokyo
The University of Tokyo
3. Outline
10.1. Variational Inference
10.2. Variational Mixture of Gaussians
10.3. Variational Linear Regression
10.4. Exponential Family Distributions
10.5. Local Variational Methods
10.6. Variational Logistic Regression
10.7. Expectation Propagation
Part I: Probabilistic modeling and the variational principle
Part II: Design the variational algorithms
5. Probabilistic Inference
10.1 Variational Inference 5
Statistical Inference
Any mechanism by which we deduce the probabilities in our model based on data.
In probabilistic models, we need to reason about the probability of events.
Inference links the observed data with our statistical assumptions and allows us to ask questions of our data: predictions, visualization, model selection.
6. Modeling and Inference
Bayes' rule appears in many inferential problems:
$p(z|x) = \frac{p(x|z)\,p(z)}{\int p(x, z)\,dz}$
Posterior = Likelihood × Prior / Marginal likelihood (model evidence), where x is the observed data and z is the hidden variable.
Probabilistic modeling will involve:
• Decide on a priori beliefs.
• Posit an explanation of how the observed data is generated, i.e. provide a probabilistic description.
7. Modeling and Inference
Most inference problems will be one of:
• Marginalization
• Expectation
• Prediction
$p(z|x) = \frac{p(x|z)\,p(z)}{\int p(x, z)\,dz}$
Posterior = Likelihood × Prior / Marginal likelihood (model evidence).
The posterior usually has a complex form for which the expectations are not tractable.
8. Importance Sampling
Basic idea: transform the integral into an expectation over a simple, known distribution q(z).
Conditions:
• q(z) > 0 when f(z)p(z) ≠ 0
• Easy to sample from q(z)
Notice: x is abbreviated in the formulas.
Integral:
$E[f] = \int f(z)\,p(z)\,dz$
Proposal:
$E[f] = \int f(z)\,\frac{p(z)}{q(z)}\,q(z)\,dz$
Importance weight:
$w^{(s)} = \frac{p(z^{(s)})}{q(z^{(s)})}, \quad z^{(s)} \sim q(z)$
Monte Carlo:
$E[f] \approx \frac{1}{S} \sum_s w^{(s)} f(z^{(s)})$
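The Monte Carlo estimate above can be sketched in a few lines. This is only an illustration under assumed choices that are not in the slides: the target p(z) is a standard normal, the proposal q(z) is a wider normal, and f(z) = z², so the true value of E[f] is 1. The function names are hypothetical.

```python
import numpy as np


def log_normal(z, mean, std):
    """Log density of a univariate Gaussian N(z | mean, std^2)."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((z - mean) / std) ** 2


def importance_sampling_mean(f, log_p, log_q, sample_q, num_samples, rng):
    """Estimate E_p[f] as (1/S) * sum_s w_s f(z_s) with z_s ~ q and w_s = p/q."""
    z = sample_q(rng, num_samples)
    weights = np.exp(log_p(z) - log_q(z))  # importance weights w^(s)
    return np.mean(weights * f(z))


rng = np.random.default_rng(0)
estimate = importance_sampling_mean(
    f=lambda z: z**2,                          # assumed test function
    log_p=lambda z: log_normal(z, 0.0, 1.0),   # assumed target p(z)
    log_q=lambda z: log_normal(z, 0.0, 2.0),   # assumed proposal q(z)
    sample_q=lambda r, n: r.normal(0.0, 2.0, size=n),
    num_samples=200_000,
    rng=rng,
)
print(estimate)  # should be close to the true value E_p[z^2] = 1
```

The proposal is deliberately wider than the target so that the weights stay bounded; a proposal narrower than the target would give weights with very high variance.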
9. Importance Sampling
$E[f] \approx \frac{1}{S} \sum_s w^{(s)} f(z^{(s)}), \quad w^{(s)} = \frac{p(z^{(s)})}{q(z^{(s)})}, \quad z^{(s)} \sim q(z)$
Properties:
• Unbiased estimate of the expectation
• No independent samples from the posterior distribution
• Many draws from the proposal are needed, especially in high dimensions
This is a stochastic approximation (Chapter 11).
10. Importance Sampling
$E[f] \approx \frac{1}{S} \sum_s w^{(s)} f(z^{(s)}), \quad w^{(s)} = \frac{p(z^{(s)})}{q(z^{(s)})}, \quad z^{(s)} \sim q(z)$
Take inspiration from importance sampling, but instead:
• Obtain a deterministic algorithm
• Scale up to high-dimensional and large data problems
• Easy convergence assessment
This leads to variational inference.
11. What is a Variational Method?
10.1 Variational Inference 11
Variational Principle
General family of methods for approximating complicated
densities by a simpler class of densities
Approximation class
True posterior
Deterministic approximation procedures
with bounds on probabilities of interest
Fit the variational parameters
12. Variational Calculus
10.1 Variational Inference 12
Functions:
• Variables as input, output is a value
• Full and partial derivatives df/dx
• Ex. Maximize likelihood p(x|µ) w.r.t.
parameters µ
Both types of derivatives are
exploited in variational inference
Functionals:
• Functions as input, output is a value
• Functional derivatives δF/δf
• Ex. Maximize the entropy H[p(x)] w.r.t.
p(x)
Variational method derives from the
Calculus of Variations
13. From IS to Variational Inference
Integral:
$\ln p(X) = \ln \int p(X|Z)\,p(Z)\,dZ$
Importance weight:
$\ln p(X) = \ln \int p(X|Z)\,\frac{p(Z)}{q(Z)}\,q(Z)\,dZ$
Jensen's inequality:
$\ln \int p(x)\,g(x)\,dx \ge \int p(x) \ln g(x)\,dx$
Variational (evidence) lower bound:
$\ln p(X) \ge \int q(Z) \ln\left( p(X|Z)\,\frac{p(Z)}{q(Z)} \right) dZ$
$= \int q(Z) \ln p(X|Z)\,dZ - \int q(Z) \ln \frac{q(Z)}{p(Z)}\,dZ$
$= E_{q(Z)}[\ln p(X|Z)] - KL[q(Z)\,\|\,p(Z)]$
14. Variational Lower Bound
$F(X, q) = E_{q(Z)}[\ln p(X|Z)] - KL[q(Z)\,\|\,p(Z)]$
The first term is the reconstruction term under the approximate posterior q(Z); the second term is the penalty.
Interpreting the bound:
• Approximate posterior distribution q(Z): best match to the true posterior p(Z|X), one of the unknown inferential quantities of interest to us.
• Reconstruction cost: the expected log-likelihood measures how well samples from q(Z) are able to explain the data X.
• Penalty: ensures that the explanation of the data q(Z) doesn't deviate too far from your beliefs p(Z).
15. How low the VLB (ELB)?
$\ln p(X) - F(X, q) = \int q(Z) \ln p(X)\,dZ - F(X, q)$
$= \int q(Z) \ln p(X)\,dZ - \int q(Z) \ln p(X|Z)\,dZ + \int q(Z) \ln \frac{q(Z)}{p(Z)}\,dZ$
$= \int q(Z) \ln \frac{p(X)\,q(Z)}{p(X|Z)\,p(Z)}\,dZ = \int q(Z) \ln \frac{q(Z)}{p(Z|X)}\,dZ$
$= KL[q(Z)\,\|\,p(Z|X)]$
Some comments on q:
• Variational parameters: parameters of q(Z) (e.g. if q is a Gaussian, they are the mean and variance).
• Integration switched to optimization: optimize q(Z) directly (my thought: actually it is q(Z|X)) to minimize $KL[q(Z)\,\|\,p(Z|X)]$.
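The identity ln p(X) − F(X, q) = KL[q(Z)||p(Z|X)] can be checked numerically on a toy model. The sketch below assumes a latent variable with three discrete states and made-up prior and likelihood values (none of these numbers come from the slides); integrals become sums.

```python
import numpy as np

# Assumed toy model: Z takes 3 values, X is a single fixed observation.
prior = np.array([0.5, 0.3, 0.2])          # p(Z)
likelihood = np.array([0.10, 0.40, 0.70])  # p(X | Z) evaluated at the observed X

joint = likelihood * prior                 # p(X, Z)
evidence = joint.sum()                     # p(X) = sum_Z p(X, Z)
posterior = joint / evidence               # p(Z | X)


def elbo(q):
    """F(X, q) = E_q[ln p(X|Z)] - KL[q(Z) || p(Z)]."""
    return np.sum(q * np.log(likelihood)) - np.sum(q * np.log(q / prior))


def kl(a, b):
    """KL[a || b] for discrete distributions with full support."""
    return np.sum(a * np.log(a / b))


q = np.array([0.2, 0.5, 0.3])              # an arbitrary approximate posterior
gap = np.log(evidence) - elbo(q)

print(gap, kl(q, posterior))               # the two numbers should agree
print(np.log(evidence), elbo(posterior))   # the bound is tight at q = p(Z|X)
```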
16. From the book
$\ln p(X) = L(q) + KL(q\,\|\,p)$ (10.2)
$L(q) = F(X, q) = \int q(Z) \ln\left\{ \frac{p(X, Z)}{q(Z)} \right\} dZ$ (10.3)
$KL(q\,\|\,p) = -\int q(Z) \ln\left\{ \frac{p(Z|X)}{q(Z)} \right\} dZ$ (10.4)
The maximum of L(q) occurs when q(Z) = p(Z|X). Approximate the maximum by a variational method.
• What exactly is q(Z)?
• How do we find the variational parameters?
• How do we compute the gradients?
• How do we optimize the model parameters?
17. Free-form and Fixed-form
Free-form variational method: solves for the exact distribution by setting the functional derivative to zero,
$\frac{\delta L(q)}{\delta q(Z)} = 0 \quad \text{s.t.} \quad \int q(Z)\,dZ = 1$
The optimal solution is the true posterior distribution, $q(Z) \propto p(Z)\,p(X|Z, \theta)$, but solving for the normalization is the original problem.
Fixed-form variational method: specifies an explicit form of the q-distribution,
$q_\phi(Z) = f(Z; \phi)$
where φ is the variational parameter. Ideally f is a rich class of distributions.
18. 10.1.1 Factorized distributions (I)
• Mean-field methods assume that the distribution is factorized.
Restricted class of approximations: every dimension (or subset of dimensions) of the posterior is independent.
• Let Z be partitioned into disjoint groups Z_i (i = 1…M):
$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$
No restriction is placed on the functional form of q_i(Z_i).
19. Factorized distributions (II)
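$L(q) = \int q(Z) \ln\left\{ \frac{p(X, Z)}{q(Z)} \right\} dZ$
$= \int q(Z) \left\{ \ln p(X, Z) - \ln q(Z) \right\} dZ$
$= \int \left( \prod_i q_i(Z_i) \right) \left\{ \ln p(X, Z) - \sum_i \ln q_i(Z_i) \right\} dZ$
$= \int \left( \prod_i q_i \right) \ln p(X, Z)\,dZ - \int \left( \prod_i q_i \right) \left( \sum_i \ln q_i \right) dZ$
Here q_i abbreviates q_i(Z_i).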
20. Factorized distributions (III)
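• Consider L(q) as a function of q_j(Z_j):
$L(q) = \int \left( \prod_i q_i \right) \ln p(X, Z)\,dZ - \int \left( \prod_i q_i \right) \left( \sum_i \ln q_i \right) dZ$
$= \int q_j \left\{ \int \ln p(X, Z) \prod_{i \ne j} q_i\,dZ_i \right\} dZ_j - \int \left( \prod_i q_i \right) \left\{ \ln q_j + \sum_{i \ne j} \ln q_i \right\} dZ$
$= \int q_j \left\{ \int \ln p(X, Z) \prod_{i \ne j} q_i\,dZ_i \right\} dZ_j - \int \left( \prod_i q_i \right) \ln q_j\,dZ - \int \left( \prod_{i \ne j} q_i \right) \left( \sum_{i \ne j} \ln q_i \right) \left( \int q_j\,dZ_j \right) dZ_{i \ne j}$
where $\int q_j\,dZ_j = 1$, so
$L(q) = \int q_j \left( \ln \tilde{p}(X, Z_j) - \text{const} \right) dZ_j - \int q_j \ln q_j\,dZ_j - \int \left( \prod_{i \ne j} q_i \right) \left( \sum_{i \ne j} \ln q_i \right) dZ_{i \ne j}$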
21. Factorized distributions (IV)
$L(q) = \int q_j \ln \tilde{p}(X, Z_j)\,dZ_j - \int q_j \ln q_j\,dZ_j + \text{const}$ (10.6)
This is a negative KL divergence, where
$\ln \tilde{p}(X, Z_j) = E_{i \ne j}[\ln p(X, Z)] + \text{const}$ (10.7)
and
$E_{i \ne j}[\ln p(X, Z)] = \int \ln p(X, Z) \prod_{i \ne j} q_i\,dZ_i$ (10.8)
• Maximize L(q) with respect to q_j(Z_j) while keeping $\{q_{i \ne j}\}$ fixed.
• This is the same as minimizing the KL divergence between $q_j(Z_j)$ and $\tilde{p}(X, Z_j)$.
22. Optimal Solution
$\ln q_j^*(Z_j) = E_{i \ne j}[\ln p(X, Z)] + \text{const}$ (10.9)
or
$q_j^*(Z_j) = \frac{\exp(E_{i \ne j}[\ln p(X, Z)])}{\int \exp(E_{i \ne j}[\ln p(X, Z)])\,dZ_j}$
(1) Initialize all q_j appropriately.
(2) Run the loop below until convergence:
• for each q_i
• Fix all q_j with j ≠ i and find the optimal q_i
• Update q_i
Today's memo: (10.9). The next slides are detailed examples.
23. 10.1.2. Properties of factorized approximations
Approximate a Gaussian distribution with a factorized Gaussian.
Consider
$p(z) = N(z|\mu, \Lambda^{-1})$, where $z = (z_1, z_2)$,
$\mu = (\mu_1, \mu_2)^T, \quad \Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$
Approximate using
$q(z) = q_1(z_1)\,q_2(z_2)$
Optimal solution from (10.9), keeping only the terms that involve z_1:
$\ln q_1^*(z_1) = E_{z_2}[\ln p(z)] + \text{const}$
$= E_{z_2}\left[ -\frac{1}{2}(z_1 - \mu_1)^2 \Lambda_{11} - (z_1 - \mu_1)\Lambda_{12}(z_2 - \mu_2) \right] + \text{const}$
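Completing the square in z_1 in the expression above gives a Gaussian factor with precision Λ_11 and mean μ_1 − Λ_11⁻¹ Λ_12 (E[z_2] − μ_2), and symmetrically for z_2 (this is the standard result in PRML section 10.1.2, stated here from the book rather than from the slide). A minimal sketch of the resulting coordinate updates, with an assumed correlated 2-D Gaussian as the target:

```python
import numpy as np


def mean_field_gaussian(mu, Lam, num_iters=200):
    """Coordinate ascent for q(z) = q1(z1) q2(z2) approximating N(z | mu, Lam^-1).

    Each factor is Gaussian; only the means need iterating because the
    factor variances 1/Lam[0,0] and 1/Lam[1,1] do not depend on the other factor.
    """
    m = np.zeros(2)  # initial guess for E[z1], E[z2]
    for _ in range(num_iters):
        m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
        m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])
    var = np.array([1.0 / Lam[0, 0], 1.0 / Lam[1, 1]])
    return m, var


# Assumed example: strongly correlated target with unit marginal variances.
mu = np.array([1.0, -1.0])
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
Lam = np.linalg.inv(cov)

m, var = mean_field_gaussian(mu, Lam)
print(m)    # converges to the true mean
print(var)  # smaller than the true marginal variances (both 1.0)
```

The factor variances come out as 0.19 here, against true marginal variances of 1, which is the compactness of the KL(q||p) solution discussed on the following slides.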
26. 10.1.2. Properties of factorized approximations
Fig 10.2 The green contours correspond to 1, 2 and 3 standard deviations for a correlated Gaussian distribution p(z) over two variables z1 and z2. The red contours represent the corresponding levels for an approximating distribution q(z) over the same variables given by the product of two independent univariate Gaussians. Left panel: minimize KL(q||p). Right panel: minimize KL(p||q).
• Minimizing KL(q||p): the mean is captured correctly, but the variance is underestimated in the orthogonal direction
• Considering the reverse KL divergence
$KL(p\,\|\,q) = -\int p(Z) \left[ \sum_{i=1}^{M} \ln q_i(Z_i) \right] dZ + \text{const}$
• Optimal solution
$q_j^*(Z_j) = \int p(Z) \prod_{i \ne j} dZ_i = p(Z_j)$ (10.17)
(that is, the corresponding marginal distribution of p(Z))
27. 10.1.2. Properties of factorized approximations
Minimize KL(q||p):
$KL(q\,\|\,p) = -\int q(Z) \ln\left\{ \frac{p(Z)}{q(Z)} \right\} dZ$
• If p(Z) is near zero, then q(Z) tends to be close to zero
Minimize KL(p||q) (reverse KL divergence):
$KL(p\,\|\,q) = -\int p(Z) \left[ \sum_{i=1}^{M} \ln q_i(Z_i) \right] dZ + \text{const}$
• If p(Z) is near zero, then q(Z) is not important
• The KL divergence is minimized by distributions q(Z) that are nonzero in regions where p(Z) is nonzero
28. More about divergence
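Fig 10.3 Blue contours = bimodal distribution p(Z). Red contours = single Gaussian distribution q(Z) that best approximates p(Z). (a) Minimize KL(p||q). (b), (c) Minimize KL(q||p).
• KL(p||q) and KL(q||p) belong to the alpha family of divergences
$D_\alpha(p\,\|\,q) = \frac{4}{1 - \alpha^2} \left( 1 - \int p(x)^{(1+\alpha)/2}\,q(x)^{(1-\alpha)/2}\,dx \right)$
where $-\infty < \alpha < \infty$
• As α → −1, $D_\alpha(p\,\|\,q) \to KL(q\,\|\,p)$
• As α → 1, $D_\alpha(p\,\|\,q) \to KL(p\,\|\,q)$
• If α ≤ −1, q(x) will underestimate the support of p(x)
• If α ≥ 1, q(x) will overestimate the support of p(x)
• If α = 0, it is related to the Hellinger distance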
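The different behavior of the two KL directions on a bimodal target can be reproduced numerically in one dimension. The sketch below is an assumed setup, not the figure's data: p is an equal mixture of N(−3, 1) and N(3, 1), q is a single Gaussian, integrals are approximated on a grid, KL(p||q) is minimized by moment matching, and KL(q||p) by brute-force search over a grid of means and standard deviations.

```python
import numpy as np

z = np.linspace(-15.0, 15.0, 6001)
dz = z[1] - z[0]


def log_normal(z, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((z - mean) / std) ** 2


# Assumed bimodal target: 0.5 N(-3, 1) + 0.5 N(3, 1)
log_p = np.logaddexp(log_normal(z, -3.0, 1.0), log_normal(z, 3.0, 1.0)) + np.log(0.5)
p = np.exp(log_p)


def kl_q_p(mean, std):
    """Grid approximation of KL(q || p) for q = N(mean, std^2)."""
    log_q = log_normal(z, mean, std)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dz


# KL(p || q) over Gaussians q is minimized by matching the mean and variance of p.
mean_pq = np.sum(z * p) * dz
std_pq = np.sqrt(np.sum((z - mean_pq) ** 2 * p) * dz)

# KL(q || p): exhaustive search over a coarse parameter grid.
means = np.linspace(-5.0, 5.0, 101)
stds = np.linspace(0.3, 4.0, 75)
_, mean_qp, std_qp = min((kl_q_p(m, s), m, s) for m in means for s in stds)

print("KL(p||q) fit:", mean_pq, std_pq)  # broad Gaussian covering both modes
print("KL(q||p) fit:", mean_qp, std_qp)  # narrow Gaussian locked onto one mode
```

Because the target is symmetric, the KL(q||p) search has two equally good solutions (one per mode); which one is reported depends only on tie-breaking.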
29. 10.1.3 The univariate Gaussian (I)
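• Goal: infer the posterior distribution for the mean μ and precision τ given the data set $D = \{x_1, ..., x_N\}$
• Likelihood function
$p(D|\mu, \tau) = \left( \frac{\tau}{2\pi} \right)^{N/2} \exp\left\{ -\frac{\tau}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}$ (10.21)
• Prior
$p(\mu|\tau) = N(\mu|\mu_0, (\lambda_0 \tau)^{-1})$ (10.22)
and
$p(\tau) = Gam(\tau|a_0, b_0)$ (10.23)
• Approximate
$q(\mu, \tau) = q_\mu(\mu)\,q_\tau(\tau)$ (10.24)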
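The slides with the resulting update equations are not part of this transcript, so the sketch below relies on the factor updates as given in PRML (10.26)-(10.30): q(μ) is Gaussian with mean μ_N and precision λ_N, and q(τ) is a Gamma distribution with parameters a_N and b_N. The function name, the default hyperparameters, and the synthetic data are assumptions for illustration.

```python
import numpy as np


def vi_univariate_gaussian(x, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, num_iters=50):
    """Mean-field VI for a Gaussian with unknown mean mu and precision tau.

    q(mu) = N(mu | mu_n, 1/lam_n), q(tau) = Gam(tau | a_n, b_n).
    """
    n = x.size
    mu_n = (lam0 * mu0 + n * x.mean()) / (lam0 + n)  # does not depend on q(tau)
    a_n = a0 + 0.5 * (n + 1)                         # does not depend on q(mu)
    e_tau = 1.0                                      # initial guess for E[tau]
    for _ in range(num_iters):
        lam_n = (lam0 + n) * e_tau
        # E_mu[sum_n (x_n - mu)^2] and E_mu[lam0 (mu - mu0)^2] under q(mu)
        e_data = np.sum((x - mu_n) ** 2) + n / lam_n
        e_prior = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * (e_data + e_prior)
        e_tau = a_n / b_n                            # mean of the Gamma factor
    return mu_n, lam_n, a_n, b_n


# Assumed synthetic data: true mean 2.0, true standard deviation 0.5 (precision 4).
rng = np.random.default_rng(1)
x = rng.normal(2.0, 0.5, size=2000)
mu_n, lam_n, a_n, b_n = vi_univariate_gaussian(x)
print(mu_n, a_n / b_n)  # posterior mean of mu and expected precision E[tau]
```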
34. 10.1.4 Model Comparison
• Let the prior probabilities on the models be p(m)
• Goal: determine p(m|X), where X is the observed data
• Approximate
$q(Z, m) = q(Z|m)\,q(m)$
$\ln p(X) = L - \sum_m \sum_Z q(Z|m)\,q(m) \ln\left\{ \frac{p(Z, m|X)}{q(Z|m)\,q(m)} \right\}$ (10.34)
where the lower bound is
$L = \sum_m \sum_Z q(Z|m)\,q(m) \ln\left\{ \frac{p(Z, X, m)}{q(Z|m)\,q(m)} \right\}$ (10.35)
• Maximizing L with respect to q(m), we get
$q(m) \propto p(m) \exp(L_m)$ with $L_m = \sum_Z q(Z|m) \ln\left\{ \frac{p(Z, X|m)}{q(Z|m)} \right\}$
• Maximizing L with respect to q(Z|m), we find that the solutions for different m are coupled due to the conditioning
• Optimize each q(Z|m) individually and subsequently find q(m)
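Once each L_m has been computed, the result q(m) ∝ p(m) exp(L_m) is a normalization over models. Lower bounds are typically large negative numbers, so the sketch below subtracts the maximum before exponentiating; the prior and the two bound values are made-up inputs, and `model_posterior` is a hypothetical helper name.

```python
import numpy as np


def model_posterior(log_prior, lower_bounds):
    """q(m) proportional to p(m) * exp(L_m), computed stably in log space."""
    log_unnorm = np.asarray(log_prior, dtype=float) + np.asarray(lower_bounds, dtype=float)
    log_unnorm -= log_unnorm.max()  # avoids underflow of exp(L_m)
    q_m = np.exp(log_unnorm)
    return q_m / q_m.sum()


# Assumed example: two models with equal prior and bounds differing by 3 nats.
q_m = model_posterior(np.log([0.5, 0.5]), [-1050.0, -1047.0])
print(q_m)
```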
36. 10.2 Variational Mixture of Gaussians
• Goal: apply variational inference to the Gaussian mixture model
• Problem formulation:
Observed data: $X = \{x_1, ..., x_N\}$
Hidden variables: $Z = \{z_1, ..., z_N\}$
• A latent variable z_n = {z_nk} (1-of-K binary vector) corresponds to each observation x_n
• Conditional distribution of Z, given the mixing coefficients π
$p(Z|\pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}$ (10.37)
• Conditional distribution of the observed data vectors, given the latent variables and the component parameters
$p(X|Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} N(x_n|\mu_k, \Lambda_k^{-1})^{z_{nk}}$ (10.38)
37. 10.2 Variational Mixture of Gaussians
• Conjugate prior distributions
• Dirichlet distribution over the mixing coefficients π
$p(\pi) = Dir(\pi|\alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}$ (10.39)
• Independent Gaussian-Wishart prior governing the mean and precision of each Gaussian component
$p(\mu, \Lambda) = p(\mu|\Lambda)\,p(\Lambda) = \prod_{k=1}^{K} N(\mu_k|m_0, (\beta_0 \Lambda_k)^{-1})\,W(\Lambda_k|W_0, \nu_0)$ (10.40)
with m_0 = 0 chosen by symmetry
Fig 10.5 Directed acyclic graph representing the Bayesian mixture of Gaussians model
38. 10.2.1. Variational Distribution (I)
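• Joint distribution
$p(X, Z, \pi, \mu, \Lambda) = p(X|Z, \mu, \Lambda)\,p(Z|\pi)\,p(\pi)\,p(\mu|\Lambda)\,p(\Lambda)$
• Approximate
$q(Z, \pi, \mu, \Lambda) = q(Z)\,q(\pi, \mu, \Lambda)$ (10.42)
• Optimal solution (from formula 10.9)
$\ln q^*(Z) = E_{\pi,\mu,\Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}$ (10.43)
$= E_\pi[\ln p(Z|\pi)] + E_{\mu,\Lambda}[\ln p(X|Z, \mu, \Lambda)] + \text{const}$ (10.44)
$= E_\pi\left[ \ln\left( \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \right) \right] + E_{\mu,\Lambda}\left[ \ln\left( \prod_{n=1}^{N} \prod_{k=1}^{K} N(x_n|\mu_k, \Lambda_k^{-1})^{z_{nk}} \right) \right] + \text{const}$
$= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} E_\pi[\ln \pi_k] + \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} E_{\mu,\Lambda}\left[ \frac{1}{2} \ln|\Lambda_k| - \frac{D}{2} \ln(2\pi) - \frac{1}{2}(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \right] + \text{const}$
D is the dimensionality of the data variable x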
39. 10.2.1 Variational Distribution (II)
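• Optimal solution for q(Z)
$\ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}$ (10.45)
where
$\ln \rho_{nk} = E[\ln \pi_k] + \frac{1}{2} E[\ln|\Lambda_k|] - \frac{D}{2} \ln(2\pi) - \frac{1}{2} E_{\mu_k,\Lambda_k}\left[ (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \right]$ (10.46)
then,
$q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}$ (10.47)
normalized,
$q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}$ (10.48)
where
$r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}$
The r_nk are also seen as responsibilities, as in the case of EM.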
41. 10.2.1 Variational Distribution (IV)
$\ln q^*(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + E_Z[\ln p(Z|\pi)] + \sum_{n=1}^{N} \sum_{k=1}^{K} E_Z[z_{nk}] \ln N(x_n|\mu_k, \Lambda_k^{-1}) + \text{const}$ (10.54)
The right-hand side is a sum of terms involving only π and terms involving only μ, Λ, so the distribution factorizes further:
$q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)$ (10.55)
$q^*(\pi, \mu, \Lambda) = q^*(\pi) \prod_{k=1}^{K} q^*(\mu_k, \Lambda_k)$
From (10.54) and (10.55) we have
$\ln q^*(\pi) = E_Z[\ln p(Z|\pi)] + \ln p(\pi) + \text{const}$
$\ln q^*(\mu_k, \Lambda_k) = \ln p(\mu_k, \Lambda_k) + \sum_{n=1}^{N} E_Z[z_{nk}] \ln N(x_n|\mu_k, \Lambda_k^{-1}) + \text{const}$
42. 10.2.1 Variational Distribution (VI)
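$\ln q^*(\pi) = E_Z[\ln p(Z|\pi)] + \ln p(\pi) + \text{const}$
$= E_Z\left[ \ln\left( \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \right) \right] + \ln\left( C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1} \right) + \text{const}$
$= \sum_{n=1}^{N} \sum_{k=1}^{K} E_Z[z_{nk}] \ln \pi_k + (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \text{const}$ (10.56)
$= \sum_{k=1}^{K} (N_k + \alpha_0 - 1) \ln \pi_k + \text{const}$
$= \ln\left( \prod_{k=1}^{K} \pi_k^{N_k + \alpha_0 - 1} \right) + \text{const}$
$q^*(\pi)$ is recognized as a Dirichlet distribution:
$q^*(\pi) = Dir(\pi|\alpha)$ (10.57)
$\alpha_k = N_k + \alpha_0$ (10.58)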
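The Dirichlet update (10.58) only needs the responsibilities, since N_k = Σ_n E_Z[z_nk] = Σ_n r_nk. A small sketch with a made-up responsibility matrix (4 data points, 3 components); the helper name and the value of α_0 are assumptions:

```python
import numpy as np


def dirichlet_update(responsibilities, alpha0):
    """Return alpha_k = alpha0 + N_k and the mean E[pi_k] of Dir(pi | alpha)."""
    n_k = responsibilities.sum(axis=0)  # N_k = sum_n r_nk
    alpha = alpha0 + n_k                # (10.58)
    expected_pi = alpha / alpha.sum()   # mean of a Dirichlet distribution
    return alpha, expected_pi


# Assumed responsibilities: the third component explains none of the data.
r = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.7, 0.3, 0.0],
])
alpha, expected_pi = dirichlet_update(r, alpha0=0.1)
print(alpha)        # [2.6, 1.6, 0.1]
print(expected_pi)  # the unused component keeps only a small prior-driven weight
```

With a small α_0, a component that takes no responsibility ends up with an expected mixing coefficient near zero.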
43. 10.2.1 Variational Distribution (VII)
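$\ln q^*(\mu_k, \Lambda_k) = \ln p(\mu_k, \Lambda_k) + \sum_{n=1}^{N} E_Z[z_{nk}] \ln N(x_n|\mu_k, \Lambda_k^{-1}) + \text{const}$
We have a Gaussian-Wishart distribution (exercise 10.13):
$q^*(\mu_k, \Lambda_k) = N(\mu_k|m_k, (\beta_k \Lambda_k)^{-1})\,W(\Lambda_k|W_k, \nu_k)$ (10.59)
with the parameters β_k, m_k, W_k, ν_k given by (10.60) - (10.63).
These updates are analogous to the M-step of the EM algorithm.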
44. 10.2.1 Variational Distribution (VIII)
• Optimize the variational posterior Gaussian mixture distribution
(1) Initialize the responsibilities r_nk
(2) Update N_k, x̄_k, S_k by (10.51)-(10.53)
(3) [M step] Fix the responsibilities and use them to recompute the variational distribution over parameters
• Use (10.57) to find q*(π)
• Use (10.59) to find q*(μ_k, Λ_k) (k = 1, ..., K)
(4) [E step] Use the current distribution of parameters to evaluate the responsibilities
• Use (10.64)-(10.66) and (10.46) - (10.49) to update the responsibilities r_nk
(5) Back to (2) until convergence
45. 10.2.1 Variational Distribution (IX)
Figure 10.6 Variational Bayesian mixture of K = 6 Gaussians applied to the Old Faithful data set. The ellipses denote the one standard-deviation density contours for each of the components, and the density of red ink inside each ellipse corresponds to the mean value of the mixing coefficient for each component. The four panels show the fit after increasing numbers of iterations.
The mixing coefficients of components that are not needed tend to zero (those components disappear).
46. Compare EM with Variational Bayes
• Nearly the same computational complexity
• As the number of data points N → ∞, the Bayesian treatment converges to the maximum likelihood EM algorithm
• The advantages of Variational Bayes
A. Singularities that arise in ML are absent in the Bayesian treatment, removed by the introduction of the prior
B. No over-fitting: it can be used for determining the number of components
47. 10.2.2 Variational lower bound
• At each step of the iterative re-estimation procedure, the value of this bound should not decrease
• Useful to test convergence
• Useful to check the correctness of both the mathematical expressions and the implementation
• For the variational mixture of Gaussians, the lower bound is given by
$L = \sum_Z \iiint q(Z, \pi, \mu, \Lambda) \ln\left\{ \frac{p(X, Z, \pi, \mu, \Lambda)}{q(Z, \pi, \mu, \Lambda)} \right\} d\pi\,d\mu\,d\Lambda$
48. 10.2.3 Predictive density
• Predictive density p(x̂|X) for a new value x̂ with corresponding latent variable ẑ (10.78)
• Depends on the posterior distribution of the parameters
• As the posterior distribution is intractable, the variational approximation q(π)q(μ, Λ) can be used to obtain an approximate predictive density
49. 10.2.4 Number of components
• For a given mixture model of K components, each parameter setting is a member of a family of K! equivalent settings
Figure 10.7 Plot of the variational lower bound L versus the number K of components in the Gaussian mixture model, for the Old Faithful data, showing a distinct peak at K = 2 components. For each value of K, the model is trained from 100 different random starts, and the results shown as '+' symbols plotted with small random horizontal perturbations so that they can be distinguished. Note that some solutions find suboptimal local maxima, but that this happens infrequently.
• Alternative: start with a relatively large value of K and let components with insufficient contribution be pruned out, i.e. their mixing coefficients are driven to zero
50. 10.2.5 Induced factorizations
Induced factorization arises from an interaction between the factorization assumption in the variational posterior and the conditional independence properties of the true posterior
• For example, let A, B, C be disjoint groups of latent variables
• Factorization assumption
$q(A, B, C) = q(A, B)\,q(C)$
• The optimal solution
$\ln q^*(A, B) = E_C[\ln p(A, B|X, C)] + \text{const}$
• We need to determine if $q^*(A, B) = q^*(A)\,q^*(B)$.
This is possible iff $\ln p(A, B|X, C) = f(A, C) + g(B, C)$, that is, A and B are conditionally independent given X and C
• This can also be determined from the directed-graph model
52. 10.3 Variational Linear Regression
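• Return to the Bayesian linear regression model (section 3.3)
• There, the integration over α, β was approximated by making point estimates obtained by maximizing the log marginal likelihood:
$\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta}\, p(\alpha, \beta|\mathbf{t})$
then, with a flat prior (MLE),
$\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta}\, p(\mathbf{t}|\alpha, \beta)$
Notation: the input x is omitted; t is the predicted value and $\mathbf{t}$ holds the training target values.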
53. 10.3 Variational Linear Regression
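• A fully Bayesian approach would integrate over the hyper-parameters as well as over the parameters (this section)
• Suppose that the noise precision parameter β is known.
• The joint distribution of all the variables
$p(\mathbf{t}, w, \alpha) = p(\mathbf{t}|w)\,p(w|\alpha)\,p(\alpha)$ (10.90)
• Likelihood
$p(\mathbf{t}|w) = \prod_{n=1}^{N} N(t_n|w^T \phi_n, \beta^{-1})$ (10.87)
where $\phi_n = \phi(x_n)$
• Prior
$p(w|\alpha) = N(w|0, \alpha^{-1} I) = \frac{1}{(2\pi)^{M/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} w^T \Sigma^{-1} w \right\}$ (10.88)
where $\Sigma = \alpha^{-1} I$
$p(\alpha) = Gam(\alpha|a_0, b_0) \propto \alpha^{a_0 - 1} \exp(-b_0 \alpha)$ (10.89)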
54. 10.3.1 Variational Distribution
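• Goal: find an approximation to the posterior distribution $p(w, \alpha|\mathbf{t})$
• Factorized approximation
$q(w, \alpha) = q(w)\,q(\alpha)$ (10.91)
• Optimal solution (from 10.9), $\ln q_j^*(Z_j) = E_{i \ne j}[\ln p(X, Z)] + \text{const}$:
$\ln q^*(\alpha) = E_w[\ln(p(\mathbf{t}|w)\,p(w|\alpha)\,p(\alpha))] + \text{const}$
$= \ln p(\alpha) + E_w[\ln p(w|\alpha)] + \text{const}$
$= (a_0 - 1) \ln \alpha - b_0 \alpha + \frac{M}{2} \ln \alpha - \frac{\alpha}{2} E_w[w^T w] + \text{const}$ (10.92)
then, reading off the coefficients of ln α and α,
$q^*(\alpha) = Gam(\alpha|a_N, b_N)$ (10.93)
where
$a_N = a_0 + \frac{M}{2}$ (10.94)
$b_N = b_0 + \frac{1}{2} E_w[w^T w]$ (10.95)
M is the number of fitted parameters w_i (or the input dimension)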
55. 10.3.1 Variational Distribution
• The optimal solution for q(w) is found in a similar way: ln q*(w) is a quadratic form of w, see (10.96) - (10.98), then
$q^*(w) = N(w|m_N, S_N)$ (10.99)
56. 10.3.1 Variational Distribution
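• Optimal solution
$q^*(\alpha) = Gam(\alpha|a_N, b_N)$ (10.93)
where
$a_N = a_0 + \frac{M}{2}$ (10.94)
$b_N = b_0 + \frac{1}{2} E[w^T w]$ (10.95)
$q^*(w) = N(w|m_N, S_N)$ (10.99)
where
$m_N = \beta S_N \Phi^T \mathbf{t}$ (10.100)
$S_N = (E[\alpha] I + \beta \Phi^T \Phi)^{-1}$ (10.101)
Moments:
$E[\alpha] = a_N / b_N$ (10.102)
$E[w w^T] = m_N m_N^T + S_N$ (10.103)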
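The two factors depend on each other only through the moments E[α] and E[wᵀw], so the solution is found by alternating the updates. The sketch below uses the equations as given in PRML (10.94), (10.95), (10.100)-(10.103), with E[wᵀw] = m_Nᵀm_N + Tr(S_N); the function name, the default hyperparameters, and the synthetic straight-line data are assumptions.

```python
import numpy as np


def variational_linear_regression(Phi, t, beta, a0=1e-2, b0=1e-2, num_iters=100):
    """Alternate the updates of q(w) = N(m_n, S_n) and q(alpha) = Gam(a_n, b_n)."""
    num_features = Phi.shape[1]
    a_n = a0 + 0.5 * num_features          # (10.94), fixed across iterations
    e_alpha = a0 / b0                      # initial E[alpha] from the prior
    for _ in range(num_iters):
        S_n = np.linalg.inv(e_alpha * np.eye(num_features) + beta * Phi.T @ Phi)
        m_n = beta * S_n @ Phi.T @ t       # (10.100)
        e_wtw = m_n @ m_n + np.trace(S_n)  # E[w^T w] from (10.103)
        b_n = b0 + 0.5 * e_wtw             # (10.95)
        e_alpha = a_n / b_n                # (10.102)
    return m_n, S_n, a_n, b_n


# Assumed synthetic data: t = 0.5 - 2 x + Gaussian noise with std 0.1 (beta = 100).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
Phi = np.column_stack([np.ones_like(x), x])  # basis functions: bias and x
w_true = np.array([0.5, -2.0])
t = Phi @ w_true + rng.normal(0.0, 0.1, size=x.size)

m_n, S_n, a_n, b_n = variational_linear_regression(Phi, t, beta=100.0)
print(m_n)        # should be close to w_true
print(a_n / b_n)  # inferred E[alpha]
```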
57. More about variational linear regression
10.3 Variational Linear Regression 57
• Predictive distribution over t, given a new input x
• Lower bound
(10.105)
(10.107)
58. More about variational linear regression
Figure 10.9 The lower bound versus the order M of the polynomial model, in which a set of 10 data points is generated from a polynomial with M=3 sampled over (-5, 5) with additive Gaussian noise of var=0.09. The value of the bound gives the log probability of the model.
Axes: lower bound (vertical) versus order M of the polynomial model (horizontal).
Peak at M = 3, corresponding to the true model.
59. Progress…
Part I: Probabilistic modeling and the variational principle
Now: Design the variational algorithms
61. 10.4 Exponential Family Distributions
10.4 Exponential Family Distributions 61
• For many of the models in this book, the complete data likelihood is drawn
from the exponential family
• In general, this will not be the case for the marginal likelihood function for
the observed data. Ex: in a mixture of Gaussians.
62. 10.4 Exponential Family Distributions
• Observed data: $X = \{x_1, ..., x_N\}$
• Latent variables: $Z = \{z_1, ..., z_N\}$
• Suppose that the joint distribution is a member of the exponential family
$p(X, Z|\eta) = \prod_{n=1}^{N} h(x_n, z_n)\,g(\eta) \exp\{\eta^T u(x_n, z_n)\}$ (10.113)
with the conjugate prior for η
$p(\eta|\nu_0, \chi_0) = f(\nu_0, \chi_0)\,g(\eta)^{\nu_0} \exp\{\nu_0 \eta^T \chi_0\}$ (10.114)
(ν_0 acts as a prior number of observations all having the value χ_0 for the u vector)
• Variational distribution
$q(Z, \eta) = q(Z)\,q(\eta)$
63. 10.4 Exponential Family Distributions
• Optimal solution (from 10.9), $\ln q_j^*(Z_j) = E_{i \ne j}[\ln p(X, Z)] + \text{const}$:
$\ln q^*(Z) = E_\eta[\ln p(X, Z, \eta)] + \text{const} = E_\eta[\ln p(X, Z|\eta)\,p(\eta|\nu_0, \chi_0)\,p(\nu_0, \chi_0)] + \text{const}$
$= E_\eta[\ln p(X, Z|\eta)] + \text{const}$
$= \sum_{n=1}^{N} \left\{ \ln h(x_n, z_n) + E_\eta[\eta^T]\,u(x_n, z_n) \right\} + \text{const}$ (10.115)
This is a sum of independent terms, one per data point, which gives an induced factorization:
$q^*(Z) = \prod_n q^*(z_n)$
where
$q^*(z_n) = h(x_n, z_n)\,g(E_\eta[\eta]) \exp\{E_\eta[\eta^T]\,u(x_n, z_n)\}$ (10.116)
65. Variational message passing
• The joint distribution corresponding to a directed graph
$p(x) = \prod_i p(x_i|pa_i)$
where x_i is the variable(s) associated with node i (latent or observed) and pa_i is the parent set corresponding to node i
• Variational approximation
$q(x) = \prod_i q_i(x_i)$
then the optimal solution is
$\ln q_j^*(x_j) = E_{i \ne j}\left[ \sum_i \ln p(x_i|pa_i) \right] + \text{const}$
The only terms that depend on x_j involve the nodes in its Markov blanket, thus the update of the factors in the variational posterior distribution represents a local calculation on the graph
66. Variational message passing
10.4 Exponential Family Distributions 66
• If all the conditional distributions have a conjugate-exponential
structure, then the variational update will be:
• The distribution associated with a particular node can be updated once that
node has received messages from all of its parents and all of its children.
• It requires that the children have already received messages from their co-
parents
68. 10.5 Local Variational Methods
10.5 Local Variational Methods 68
• Global methods: approximation to the full posterior
• Local methods: approximation to individual or groups of variables
• Replace the likelihood with a simpler form - lower bound
that makes the expectation easy to compute
69. Convex duality
Convex function f(x):
$g(\eta) = \max_x \{\eta x - f(x)\}$
$f(x) = \max_\eta \{\eta x - g(\eta)\}$
Concave function f(x):
$g(\eta) = \min_x \{\eta x - f(x)\}$
$f(x) = \min_\eta \{\eta x - g(\eta)\}$
(10.130) - (10.133)
Figure notes (convex case): a line with slope η is one of the lower bounds of f(x), but not the best one. The line is moved vertically until it becomes a tangent.
70. 10.5 Local Variational Methods
Original problem (logistic sigmoid):
$p(y = 1|x) = \frac{1}{1 + \exp(-x)} = \sigma(x)$
$f(x) = \ln \sigma(x)$ is a concave function, then consider
$g(\eta) = \min_x \{\eta x - f(x)\} = -\eta \ln \eta - (1 - \eta) \ln(1 - \eta)$ (10.135)
The upper bound:
$\ln \sigma(x) \le \eta x - g(\eta)$ (10.136)
$\sigma(x) \le \exp(\eta x - g(\eta))$ (10.137)
71. 10.5 Local Variational Methods
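• Gaussian lower bound on the logistic sigmoid (Jaakkola and Jordan, 2000)
$f(x) = -\ln(e^{x/2} + e^{-x/2})$ is a convex function of the variable $x^2$, then consider
$g(\eta) = \max_{x^2} \left\{ \eta x^2 - f\left(\sqrt{x^2}\right) \right\}$ (10.139)
The stationarity condition:
$0 = \eta - \frac{dx}{dx^2} \frac{d}{dx} f(x) = \eta + \frac{1}{4x} \tanh\left( \frac{x}{2} \right)$
Denoting by ξ the value of x at the stationary point,
$\eta = -\frac{1}{4\xi} \tanh\left( \frac{\xi}{2} \right) = -\frac{1}{2\xi}\left[ \sigma(\xi) - \frac{1}{2} \right] = -\lambda(\xi)$
$g(-\lambda(\xi)) = -\lambda(\xi)\xi^2 - f(\xi) = -\lambda(\xi)\xi^2 + \ln(e^{\xi/2} + e^{-\xi/2})$
The bound on f(x):
$f(x) \ge -\lambda(\xi) x^2 - g(-\lambda(\xi)) = -\lambda(\xi) x^2 + \lambda(\xi)\xi^2 - \ln(e^{\xi/2} + e^{-\xi/2})$
The bound on the sigmoid:
$\sigma(x) \ge \sigma(\xi) \exp\{(x - \xi)/2 - \lambda(\xi)(x^2 - \xi^2)\}$ (10.144)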
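Both bounds on the sigmoid can be checked numerically. The sketch below evaluates the upper bound (10.137) and the Gaussian-form lower bound (10.144) on a grid of x values; the particular values η = 0.3 and ξ = 2.5 are arbitrary choices for illustration, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lam(xi):
    """lambda(xi) = (sigma(xi) - 1/2) / (2 xi), defined for xi != 0."""
    return (sigmoid(xi) - 0.5) / (2.0 * xi)


def sigmoid_lower_bound(x, xi):
    """sigma(xi) exp{(x - xi)/2 - lambda(xi)(x^2 - xi^2)}, see (10.144)."""
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x**2 - xi**2))


def sigmoid_upper_bound(x, eta):
    """exp(eta x - g(eta)) with g the binary entropy, see (10.135) and (10.137)."""
    g = -eta * np.log(eta) - (1.0 - eta) * np.log(1.0 - eta)
    return np.exp(eta * x - g)


x = np.linspace(-6.0, 6.0, 241)
xi, eta = 2.5, 0.3  # arbitrary variational parameters

lower_gap = sigmoid(x) - sigmoid_lower_bound(x, xi)
upper_gap = sigmoid_upper_bound(x, eta) - sigmoid(x)
print(lower_gap.min(), upper_gap.min())  # both non-negative up to rounding

# The lower bound touches the sigmoid at x = xi and x = -xi.
print(sigmoid_lower_bound(xi, xi) - sigmoid(xi))
print(sigmoid_lower_bound(-xi, xi) - sigmoid(-xi))
```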
72. How the bounds can be used
• Evaluate
$I = \int \sigma(a)\,p(a)\,da$ (intractable)
• The local variational bound
$\sigma(a) \ge f(a, \xi)$
where ξ is an additional parameter (depends on a)
• The variational bound
$I \ge \int f(a, \xi)\,p(a)\,da = F(\xi)$
• Find the compromise value ξ* that maximizes F(ξ)
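As a rough illustration of the compromise ξ*, the sketch below takes p(a) to be a Gaussian (an assumption, with arbitrarily chosen mean 1.0 and standard deviation 1.5), uses the Gaussian-form sigmoid lower bound for f(a, ξ), approximates both integrals on a grid, and scans ξ to maximize F(ξ).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_lower_bound(a, xi):
    """f(a, xi) = sigma(xi) exp{(a - xi)/2 - lambda(xi)(a^2 - xi^2)}."""
    lam = (sigmoid(xi) - 0.5) / (2.0 * xi)
    return sigmoid(xi) * np.exp((a - xi) / 2.0 - lam * (a**2 - xi**2))


# Assumed p(a): Gaussian with mean 1.0 and standard deviation 1.5, on a grid.
a = np.linspace(-12.0, 12.0, 4801)
da = a[1] - a[0]
p_a = np.exp(-0.5 * ((a - 1.0) / 1.5) ** 2) / (1.5 * np.sqrt(2.0 * np.pi))

# Reference value of the integral I by numerical quadrature.
I = np.sum(sigmoid(a) * p_a) * da

# F(xi) for a range of xi; a single xi has to serve all values of a.
xis = np.linspace(0.1, 6.0, 60)
F = np.array([np.sum(sigmoid_lower_bound(a, xi) * p_a) * da for xi in xis])
xi_star = xis[np.argmax(F)]

print(I, F.max(), xi_star)  # F(xi*) is below I but reasonably close
```

The bound cannot be tight for every a at once because ξ is a single number, which is why F(ξ*) stays strictly below I.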
73. In Reviews…
Original problem:
$p(y = 1|x) = \frac{1}{1 + \exp(-x)} = \sigma(x)$
Local bound:
$\sigma(x) \ge \sigma(\xi) \exp\{(x - \xi)/2 - \lambda(\xi)(x^2 - \xi^2)\}$
where ξ is an additional parameter.
The bound has only linear or quadratic terms in the exponent: expectations, especially against a Gaussian, are easy to compute.
75. 10.6 Variational Logistic Regression
Return to the Bayesian logistic regression model (section 4.5)
• The posterior distribution
$p(w|\mathbf{t}) \propto p(w)\,p(\mathbf{t}|w)$
where the prior distribution is
$p(w) = N(w|w_0, S_0)$
and the likelihood function is
$p(\mathbf{t}|w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$
where $y_n = \sigma(w^T \phi_n)$
• Maximize the posterior to give w_MAP, and then form
• The Gaussian approximation
$p(w|\mathbf{t}) \approx N(w|w_N, S_N)$
76. 10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 76
A practical example of local variational method
• Recap of variational framework: maximize a lower bound on the marginal
likelihood
For the Bayesian logistic regression model, the marginal likelihood is:
The conditional distribution for t
(10.147)
(10.148)
where
77. 10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 77
• Lower bound on the logistic sigmoid function
where
• We can therefore write
• Bound on the joint distribution of t and w
where
(10.149)
(10.151)
(10.152)
(10.153)
where each training set observation $(\phi_n, t_n)$ corresponds with its own variational parameter $\xi_n$
78. 10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 78
• Lower bound on the log of the joint distribution of t and w
• Hypothesis for the prior p(w): Gaussian with parameters m0 and S0
considered as fixed
Then, the right side of (10.154) becomes a function of w
(10.154)
(10.155)
79. 10.6.1 Variational posterior distribution
10.6 Variational Logistic Regression 79
• Quantity of interest: exact posterior distribution, requires normalization of
the left side in (10.152) but usually intractable
• Work instead with the right side of (10.155): a quadratic function of w which
is a lower bound of p(w, t)
• A Gaussian variational posterior of the form
where
(10.156)
(10.157)
(10.158)
80. 10.6.1 Optimizing the variational parameters
• Determine the variational parameters ξ_n by maximizing the lower bound of the marginal likelihood
• Substitute (10.152) back into the marginal likelihood to obtain the lower bound L(ξ) (10.159)
Two approaches:
• (1) View w as a latent variable and use the EM algorithm
• (2) Compute and maximize L(ξ) directly, using the fact that p(w) is Gaussian and ln h(w, ξ) is a quadratic function of w
Re-estimation equations: (10.163), (10.164)
81. 10.6.1 Optimizing the variational parameters
• Invoke the EM algorithm
1. Initialize values for $\xi^{old}$
2. E step
• Use $\xi^{old}$ to calculate the posterior distribution q(w)
3. M step
• Maximize the expected complete-data log likelihood
$Q(\xi, \xi^{old}) = E_{q(w)}[\ln h(w, \xi)\,p(w)]$ (10.160)
• Solve the stationarity condition (10.162), then obtain the re-estimation equation (10.163)
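The equations behind these steps are images that did not survive in this transcript, so the sketch below relies on the forms given in PRML: the E step computes q(w) = N(w|m_N, S_N) with S_N⁻¹ = S_0⁻¹ + 2 Σ_n λ(ξ_n) φ_n φ_nᵀ and m_N = S_N (S_0⁻¹ m_0 + Σ_n (t_n − 1/2) φ_n), and the M step sets ξ_n² = φ_nᵀ (S_N + m_N m_Nᵀ) φ_n. The function name, the prior, and the synthetic data are assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lam(xi):
    return (sigmoid(xi) - 0.5) / (2.0 * xi)


def variational_logistic_regression(Phi, t, m0, S0, num_iters=50):
    """EM-style updates for the local variational bound on logistic regression."""
    S0_inv = np.linalg.inv(S0)
    xi = np.ones(Phi.shape[0])  # one variational parameter per observation
    for _ in range(num_iters):
        # E step: Gaussian posterior q(w) for the current xi
        S_n = np.linalg.inv(S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi)
        m_n = S_n @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
        # M step: xi_n^2 = phi_n^T (S_n + m_n m_n^T) phi_n
        second_moment = S_n + np.outer(m_n, m_n)
        xi = np.sqrt(np.einsum("ni,ij,nj->n", Phi, second_moment, Phi))
    return m_n, S_n, xi


# Assumed synthetic data: bias plus one input, true weights (-0.5, 2.0).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
Phi = np.column_stack([np.ones_like(x), x])
w_true = np.array([-0.5, 2.0])
t = (rng.random(x.size) < sigmoid(Phi @ w_true)).astype(float)

m_n, S_n, xi = variational_logistic_regression(Phi, t, m0=np.zeros(2), S0=10.0 * np.eye(2))
accuracy = np.mean((sigmoid(Phi @ m_n) > 0.5) == (t == 1.0))
print(m_n, accuracy)
```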
83. 10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 83
• Allow hyper parameters to be inferred from dataset
• Consider simple Gaussian prior form
• Consider conjugate hyper prior given by a gamma distribution
• The marginal likelihood
where the joint distribution
(10.168)
(10.167)
(10.166)
(10.165)
84. 10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 84
Combine global and local approach
• (1) Global approach: consider a variational distribution and apply
the standard decomposition
• (2) The lower bound is intractable so apply the local approach as
before to get a lower bound on and on
L(q)
L(q) ln p(t)
• (3) Assume that q is factorized as
(10.169)
(10.172)
85. 10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 85
• It follows (quadratic function of w)
where
(10.174)
(10.175)
(10.176)
• From (10.153) and (10.165)
(10.153)
(10.165)
86. 10.6.3 Inference of hyper parameters
10.6 Variational Logistic Regression 86
where
then
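• Similarly for q(α), from (10.165) and (10.166), where
$p(\alpha) = Gam(\alpha|a_0, b_0) \propto \alpha^{a_0 - 1} \exp(-b_0 \alpha)$ (10.166)
we have
$q(\alpha) = Gam(\alpha|a_N, b_N) = \frac{1}{\Gamma(a_N)} b_N^{a_N} \alpha^{a_N - 1} e^{-b_N \alpha}$ (10.177)
with a_N and b_N given by (10.178) and (10.179)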
87. 10.6.3 Inference of hyper parameters
• The variational parameters are obtained by maximizing the lower bound, as we did before with
$Q(\xi, \xi^{old}) = E_{q(w)}[\ln h(w, \xi)\,p(w)]$ (10.160)
• Re-estimation equations: (10.180) - (10.183)
89. Expectation Propagation (Minka, 2001)
• An alternative form of deterministic approximate inference based on the reverse KL divergence KL(p||q) (instead of KL(q||p)), where p is the complex distribution
$KL(p\,\|\,q) = \int p(z) \ln \frac{p(z)}{q(z)}\,dz, \qquad KL(q\,\|\,p) = \int q(z) \ln \frac{q(z)}{p(z)}\,dz$
• Consider a fixed distribution p(z) and a member of the exponential family q(z):
$q(z) = h(z)\,g(\eta) \exp\{\eta^T u(z)\}$ (10.184)
• The Kullback-Leibler divergence as a function of η:
$KL(p\,\|\,q) = -\ln g(\eta) - \eta^T E_{p(z)}[u(z)] + \text{const}$ (10.185)
• Setting the gradient with respect to η to zero:
$-\nabla \ln g(\eta) = E_{p(z)}[u(z)]$ (10.186)
90. Expectation Propagation
• Member of the exponential family q(z):
$q(z) = h(z)\,g(\eta) \exp\{\eta^T u(z)\}$ (10.184)
then
$\int h(z)\,g(\eta) \exp\{\eta^T u(z)\}\,dz = 1$
Taking the gradient of both sides with respect to η:
$\nabla g(\eta) \int h(z) \exp\{\eta^T u(z)\}\,dz + g(\eta) \int h(z) \exp\{\eta^T u(z)\}\,u(z)\,dz = 0$
$-\frac{1}{g(\eta)} \nabla g(\eta) = g(\eta) \int h(z) \exp\{\eta^T u(z)\}\,u(z)\,dz = E_{q(z)}[u(z)]$
$-\nabla \ln g(\eta) = E_{q(z)}[u(z)]$
From (10.186) we have
$E_{q(z)}[u(z)] = E_{p(z)}[u(z)]$ (10.187)
This is moment matching: for a Gaussian q(z), set the mean and covariance of q(z) to be the same as those of p(z).
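The identity −∇ ln g(η) = E_q[u(z)] can be checked on the simplest exponential-family member. The sketch below assumes a Bernoulli variable written as q(z) = g(η) exp(η z) with h(z) = 1, u(z) = z and g(η) = 1/(1 + e^η) (this example is not from the slides), compares a finite-difference gradient with the exact mean, and then performs moment matching to an assumed target mean of 0.3.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def log_g(eta):
    """ln g(eta) for a Bernoulli in natural form: g(eta) = 1 / (1 + e^eta)."""
    return -np.log1p(np.exp(eta))


def neg_grad_log_g(eta, eps=1e-6):
    """Central finite-difference approximation of -d/d(eta) ln g(eta)."""
    return -(log_g(eta + eps) - log_g(eta - eps)) / (2.0 * eps)


eta = 0.7
expected_u = sigmoid(eta)  # E_q[z] = q(z = 1) = g(eta) e^eta
print(neg_grad_log_g(eta), expected_u)  # the two values should agree

# Moment matching: choose eta so that E_q[z] equals a target mean E_p[z].
target_mean = 0.3  # assumed E_p[u(z)]
eta_star = np.log(target_mean / (1.0 - target_mean))
print(sigmoid(eta_star))  # recovers the target mean
```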
91. Expectation Propagation
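• Assume the joint distribution of the data D and the hidden variables and parameters θ is of the form
$p(D, \theta) = \prod_i f_i(\theta)$ (10.188)
• Quantities of interest are the posterior distribution
$p(\theta|D) = \frac{1}{p(D)} \prod_i f_i(\theta)$ (10.189)
and the model evidence
$p(D) = \int \prod_i f_i(\theta)\,d\theta$ (10.190)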
92. Expectation Propagation
• Expectation propagation is based on an approximation to the posterior distribution which is also given by a product of factors
$q(\theta) = \frac{1}{Z} \prod_i \tilde{f}_i(\theta)$
where each factor $\tilde{f}_i(\theta)$ comes from the exponential family
• Ideally, we would like to determine the $\tilde{f}_i(\theta)$ by minimizing the KL divergence between the true posterior and the approximation, KL(p||q)
• Alternative: minimize the KL divergence between each pair of factors $f_i(\theta)$ and $\tilde{f}_i(\theta)$ independently, but the product is usually a poor approximation
• EP: optimize each factor in turn using the current values for the remaining factors (works well for logistic-type models but poorly for mixture-type models due to multi-modality)
96. The clutter problem
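• Mixture of Gaussians of the form
$p(x|\theta) = (1 - w)\,N(x|\theta, I) + w\,N(x|0, aI)$ (10.209)
where w is the proportion of background clutter and is assumed to be known. The prior is taken to be Gaussian
$p(\theta) = N(\theta|0, bI)$ (10.210)
• The joint distribution of N observations $D = \{x_1, ..., x_N\}$ and θ is given by
$p(D, \theta) = p(\theta) \prod_{n=1}^{N} p(x_n|\theta)$ (10.211)
• To apply EP, first identify the factors $f_0(\theta) = p(\theta)$ and $f_n(\theta) = p(x_n|\theta)$, then choose the exponential family
$q(\theta) = N(\theta|m, vI)$ (10.212)
• The factor approximations take the form of exponentials of quadratic functions
$\tilde{f}_n(\theta) = s_n\,N(\theta|m_n, v_n I)$ (10.213)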
98. The clutter problem
10.7 Expectation Propagation 98
• Evaluate the approximation to the model evidence
where
(10.223)
(10.224)
99. Expectation Propagation on graphs
10.7 Expectation Propagation 99
• The factors are not function of all variables. If the approximating
distribution is fully factorized, EP reduces to Loopy Belief Propagation
• We seek an approximation q(x) that has the same factorization
(10.225)
(10.226)
100. Expectation Propagation on graphs
10.7 Expectation Propagation 100
• Restrict to approximations in which the factors factorize with respect to the
individual variables so that
(10.227)
101. Expectation Propagation on graphs
10.7 Expectation Propagation 101
• Suppose all the factors are initialized and we choose to refine factor
• Minimizing the reverse KL when q factorizes, leads to an optimal solution q
where factors are the marginals of p
107. Expectation propagation
• Sum-product BP arises as a special case of EP when a fully factorized approximating distribution is used
• EP can be seen as a way to generalize this: group factors and update them together, use a partially connected graph
• Open question: how to choose the best grouping and disconnection?
• Summary: EP and variational message passing correspond to the optimization of two different KL divergences
• Minka 2005 gives a more general point of view using the family of alpha-divergences, which includes both KL and reverse KL, but also other divergences like the Hellinger distance and the Chi2-distance.
• He shows that by choosing to optimize one or the other of these divergences, you can derive a broad range of message passing algorithms including variational message passing, Loopy BP, EP, Tree-Reweighted BP, Fractional BP and power EP.
108. Summary
Part I: Probabilistic modeling and the variational principle
Part II: Design the variational algorithms
10.1. Variational Inference
10.2. Variational Mixture of Gaussians
10.3. Variational Linear Regression
10.4. Exponential Family Distributions
10.5. Local Variational Methods
10.6. Variational Logistic Regression
10.7. Expectation Propagation