2. ¤ arXiv 2 6
¤ Yingzhen Li Richard E. Turner
¤ University of Cambridge
¤ Li D3 “Stochastic Expectation Propagation” NIPS
¤ Rényi
¤ VAE importance weighted AE[Burda et al., 2015]
¤ Appendix
¤
4. ¤
¤ SVI [Hoffman et al., 2013]
¤ SEP [Li et al., 2015]
¤ black-box
¤ [Ranganath et al., 2014]
¤ black-box alpha (BB-α) [Hernández-Lobato et al., 2015]
¤
¤ Importance weighted AE (IWAE)[Burda et al., 2015]
¤ VAE ICLR2016
5. ¤
¤ #(.|/)
¤ ,(.)
¤ KL # /
¤
[…] principle literature [Grünwald, 2007].

2.2 Variational Inference

Next we review the variational inference algorithm [Jordan et al., 1999, Beal, 2003] from an optimisation perspective, using posterior approximation as a running example. Consider observing a dataset of N i.i.d. samples $D = \{x_n\}_{n=1}^N$ from a probabilistic model $p(x|\theta)$ parametrised by a random variable $\theta$ that is drawn from a prior $p_0(\theta)$. Bayesian inference involves computing the posterior distribution of the parameters given the data,

$$p(\theta|D) = \frac{p(\theta, D)}{p(D)} = \frac{p_0(\theta) \prod_{n=1}^N p(x_n|\theta)}{p(D)},$$

where $p(D) = \int p_0(\theta) \prod_{n=1}^N p(x_n|\theta)\, d\theta$ is often called marginal likelihood or model evidence. For many powerful models, including Bayesian neural networks, the true posterior is typically intractable. Variational inference introduces an approximation $q(\theta)$ to the true posterior, which is obtained by minimising the KL divergence in some tractable distribution family $\mathcal{Q}$:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(\theta)\,||\,p(\theta|D)]. \tag{8}$$

However the KL divergence in (8) is also intractable, mainly because of the difficult term $p(D)$. Variational inference sidesteps this difficulty by considering an equivalent optimisation problem:

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}_{VI}(q; D), \tag{9}$$

where the variational lower-bound or evidence lower-bound (ELBO) $\mathcal{L}_{VI}(q; D)$ is defined by

$$\mathcal{L}_{VI}(q; D) = \log p(D) - \mathrm{KL}[q(\theta)\,||\,p(\theta|D)] = \mathbb{E}_q\left[\log \frac{p(\theta, D)}{q(\theta)}\right]. \tag{10}$$
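The relation between the ELBO and the log evidence can be checked numerically on a toy problem. The sketch below is an illustration only: the conjugate model (theta ~ N(0, 1), x | theta ~ N(theta, 1)), the single observation `x_obs` and the helper names are assumptions made here, not part of the slides. When q equals the true posterior the integrand log p(theta, D) - log q(theta) is constant and equal to log p(D); any other q gives a strictly smaller value.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian N(x; mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def elbo_mc(x, q_mean, q_var, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E_q[log p(theta, D) - log q(theta)]
    for the assumed toy model theta ~ N(0, 1), x | theta ~ N(theta, 1)."""
    rng = np.random.default_rng(seed)
    theta = q_mean + np.sqrt(q_var) * rng.standard_normal(n_samples)
    log_joint = log_normal(theta, 0.0, 1.0) + log_normal(x, theta, 1.0)
    log_q = log_normal(theta, q_mean, q_var)
    return np.mean(log_joint - log_q)

x_obs = 1.5                                       # hypothetical single observation
log_evidence = log_normal(x_obs, 0.0, 2.0)        # p(D) = N(x; 0, 2) for this model
elbo_exact_q = elbo_mc(x_obs, x_obs / 2.0, 0.5)   # q = true posterior N(x/2, 1/2)
elbo_prior_q = elbo_mc(x_obs, 0.0, 1.0)           # q = prior, a looser bound
```

With the exact posterior the gap KL[q || p(theta|D)] vanishes, so the estimate matches the log evidence up to floating-point error.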
6. VAE
¤ [Kingma et al., 2014]
¤
¤ ℎ
¤
¤
Variational Auto-encoder with Rényi Divergence

The variational auto-encoder (VAE) [Kingma and Welling, 2014, Rezende et al., 2014] is a recently proposed (deep) generative model that parametrizes the variational approximation with a recognition network. The generative model is specified as a hierarchical latent variable model:

$$p(x) = \sum_{h^{(1)} \ldots h^{(L)}} p(h^{(L)})\, p(h^{(L-1)}|h^{(L)}) \cdots p(x|h^{(1)}). \tag{14}$$

Here we drop the parameters $\theta$ but keep in mind that they will be learned using approximate maximum likelihood. However for these models the exact computation of $\log p(x)$ requires marginalisation of all hidden variables and is thus often intractable. Variational expectation-maximisation (EM) methods come to the rescue by approximating

$$\log p(x) \approx \mathcal{L}_{VI}(q; x) = \mathbb{E}_{q(h|x)}\left[\log \frac{p(x, h)}{q(h|x)}\right], \tag{15}$$

where $h$ collects all the hidden variables $h^{(1)}, \ldots, h^{(L)}$ and the approximate posterior $q(h|x)$ is defined as

$$q(h|x) = q(h^{(1)}|x)\, q(h^{(2)}|h^{(1)}) \cdots q(h^{(L)}|h^{(L-1)}). \tag{16}$$

In variational EM, optimisation for q and p are alternated to guarantee convergence. However the core idea of VAE is to jointly optimise p and q, which instead has no guarantee of increasing the MLE objective function in each iteration. Indeed jointly the method is biased [Turner and Sahani, 2011]. This paper explores the possibility that alternative surrogate functions might return estimates that are tighter lower bounds. So the VR bound is considered in this context:

$$\mathcal{L}_\alpha(q; x) = \frac{1}{1-\alpha} \log \mathbb{E}_{q(h|x)}\left[\left(\frac{p(x, h)}{q(h|x)}\right)^{1-\alpha}\right]. \tag{17}$$
$$p(x|\theta) = \sum_{h^1, \ldots, h^L} p(h^L|\theta)\, p(h^{L-1}|h^L, \theta) \cdots p(x|h^1, \theta). \tag{1}$$

Here, $\theta$ is a vector of parameters of the variational autoencoder, and $h = \{h^1, \ldots, h^L\}$ denotes the stochastic hidden units, or latent variables. The dependence on $\theta$ is often suppressed for clarity. For convenience, we define $h^0 = x$. Each of the terms $p(h^\ell|h^{\ell+1})$ may denote a complicated nonlinear relationship, for instance one computed by a multilayer neural network. However, it is assumed that sampling and probability evaluation are tractable for each $p(h^\ell|h^{\ell+1})$. Note that L denotes the number of stochastic hidden layers; the deterministic layers are not shown explicitly here. We assume the recognition model $q(h|x)$ is defined in terms of an analogous factorization:

$$q(h|x) = q(h^1|x)\, q(h^2|h^1) \cdots q(h^L|h^{L-1}), \tag{2}$$

where sampling and probability evaluation are tractable for each of the terms in the product.

In this work, we assume the same families of conditional probability distributions as Kingma & Welling (2014). In particular, the prior $p(h^L)$ is fixed to be a zero-mean, unit-variance Gaussian. In general, each of the conditional distributions $p(h^\ell|h^{\ell+1})$ and $q(h^\ell|h^{\ell-1})$ is a Gaussian with diagonal covariance, where the mean and covariance parameters are computed by a deterministic feed-forward neural network. For real-valued observations, $p(x|h^1)$ is also defined to be such a Gaussian; for binary observations, it is defined to be a Bernoulli distribution whose mean parameters are computed by a neural network.

The VAE is trained to maximize a variational lower bound on the log-likelihood, as derived from Jensen's Inequality:

$$\log p(x) = \log \mathbb{E}_{q(h|x)}\left[\frac{p(x, h)}{q(h|x)}\right] \geq \mathbb{E}_{q(h|x)}\left[\log \frac{p(x, h)}{q(h|x)}\right] = \mathcal{L}(x). \tag{3}$$

Since $\mathcal{L}(x) = \log p(x) - D_{KL}(q(h|x)\,||\,p(h|x))$, the training procedure is forced to trade off the […]
7. VAE
¤
reparameterization trick
¤
¤
Under review as a conference paper at ICLR 2016

[…] a reparameterization of the recognition distribution in terms of […] distributions, such that the samples from the recognition model are […] and auxiliary variables. While they presented the reparameterization […], for convenience we discuss the special case of Gaussians, […]. (The general reparameterization trick can be used with our […]) […] the recognition distribution $q(h^\ell|h^{\ell-1}, \theta)$ always takes the form of a Gaussian […] whose mean and covariance are computed from the states of the hidden units at the previous layer and the model parameters. This can be alternatively expressed by first sampling an auxiliary variable $\epsilon^\ell \sim \mathcal{N}(0, I)$, and then applying the deterministic mapping

$$h^\ell(\epsilon^\ell, h^{\ell-1}, \theta) = \Sigma(h^{\ell-1}, \theta)^{1/2} \epsilon^\ell + \mu(h^{\ell-1}, \theta). \tag{4}$$

The joint recognition distribution $q(h|x, \theta)$ over all latent variables can be expressed in terms of a deterministic mapping $h(\epsilon, x, \theta)$, with $\epsilon = (\epsilon^1, \ldots, \epsilon^L)$, by applying Eqn. 4 for each layer in sequence. Since the distribution of $\epsilon$ does not depend on $\theta$, we can reformulate the gradient of the bound $\mathcal{L}(x)$ from Eqn. 3 by pushing the gradient operator inside the expectation:

$$\nabla_\theta\, \mathbb{E}_{h \sim q(h|x,\theta)}\left[\log \frac{p(x, h|\theta)}{q(h|x, \theta)}\right] = \nabla_\theta\, \mathbb{E}_{\epsilon^1, \ldots, \epsilon^L \sim \mathcal{N}(0, I)}\left[\log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right] \tag{5}$$

$$= \mathbb{E}_{\epsilon^1, \ldots, \epsilon^L \sim \mathcal{N}(0, I)}\left[\nabla_\theta \log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right]. \tag{6}$$

(The slide image prints the left-hand side of Eqn. 5 as $\nabla_\theta \log \mathbb{E}[\cdot]$; since $\mathcal{L}(x)$ in Eqn. 3 is the expectation of the log ratio, the logarithm belongs inside the expectation, as written here.)

Assuming the mapping h is represented as a deterministic feed-forward neural network, for a fixed $\epsilon$, the gradient inside the expectation can be computed using standard backpropagation. In practice, one approximates the expectation in Eqn. 6 by generating k samples of $\epsilon$ and applying the Monte Carlo estimator

$$\frac{1}{k} \sum_{i=1}^{k} \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta) \tag{7}$$

with $w(x, h, \theta) = p(x, h|\theta)/q(h|x, \theta)$. This is an unbiased estimate of $\nabla_\theta \mathcal{L}(x)$. We note that the VAE update and the basic REINFORCE-like update are both unbiased estimators of the same gradient, but the VAE update tends to have lower variance in practice because it makes use of the […]
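The reparameterization idea can be illustrated without any neural network. The sketch below is a toy example under stated assumptions: a scalar Gaussian h ~ N(mu, sigma^2) and the test function f(h) = h^2 are choices made here for illustration, not taken from the slides. Writing h = mu + sigma * eps with eps ~ N(0, 1) lets the gradient with respect to mu move inside the expectation, where it is averaged over samples.

```python
import numpy as np

def reparam_grad_mu(mu, sigma, n_samples=200_000, seed=0):
    """Estimate d/dmu E_{h ~ N(mu, sigma^2)}[h^2] via h = mu + sigma * eps."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)   # eps ~ N(0, 1), independent of mu
    h = mu + sigma * eps                   # deterministic mapping of (eps, mu, sigma)
    # d(h^2)/dmu = 2 * h * dh/dmu, and dh/dmu = 1 under this mapping
    return np.mean(2.0 * h)

# The exact gradient is 2 * mu, because E[h^2] = mu^2 + sigma^2.
grad_estimate = reparam_grad_mu(mu=1.0, sigma=0.5)
```

In a real VAE the per-sample derivative would come from backpropagation rather than from a hand-written expression.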
8. VAE
VAE
¤ VAE
¤
¤ VAE KL
In this section we introduce a practical estimator of the lower bound and its derivatives w.r.t. the parameters. We assume an approximate posterior in the form $q_\phi(z|x)$, but please note that the technique can be applied to the case $q_\phi(z)$, i.e. where we do not condition on x, as well. The fully variational Bayesian method for inferring a posterior over the parameters is given in the appendix.

Under certain mild conditions outlined in section 2.4 for a chosen approximate posterior $q_\phi(z|x)$ we can reparameterize the random variable $\tilde{z} \sim q_\phi(z|x)$ using a differentiable transformation $g_\phi(\epsilon, x)$ of an (auxiliary) noise variable $\epsilon$:

$$\tilde{z} = g_\phi(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon) \tag{4}$$

See section 2.4 for general strategies for choosing such an appropriate distribution $p(\epsilon)$ and function $g_\phi(\epsilon, x)$. We can now form Monte Carlo estimates of expectations of some function $f(z)$ w.r.t. $q_\phi(z|x)$ as follows:

$$\mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] = \mathbb{E}_{p(\epsilon)}\left[f(g_\phi(\epsilon, x^{(i)}))\right] \simeq \frac{1}{L} \sum_{l=1}^{L} f(g_\phi(\epsilon^{(l)}, x^{(i)})) \quad \text{where} \quad \epsilon^{(l)} \sim p(\epsilon) \tag{5}$$

We apply this technique to the variational lower bound (eq. (2)), yielding our generic Stochastic Gradient Variational Bayes (SGVB) estimator $\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)})$:

$$\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) = \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)})$$
$$\text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) \quad \text{and} \quad \epsilon^{(l)} \sim p(\epsilon) \tag{6}$$

  […]
  $g \leftarrow \nabla_{\theta,\phi} \tilde{\mathcal{L}}^M(\theta, \phi; X^M, \epsilon)$ (Gradients of minibatch estimator (8))
  $\theta, \phi \leftarrow$ Update parameters using gradients g (e.g. SGD or Adagrad [DHS10])
  until convergence of parameters $(\theta, \phi)$
  return $\theta, \phi$

Often, the KL-divergence $D_{KL}(q_\phi(z|x^{(i)})\,||\,p_\theta(z))$ of eq. (3) can be integrated analytically (see appendix B), such that only the expected reconstruction error $\mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right]$ requires estimation by sampling. The KL-divergence term can then be interpreted as regularizing $\phi$, encouraging the approximate posterior to be close to the prior $p_\theta(z)$. This yields a second version of the SGVB estimator $\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)})$, corresponding to eq. (3), which typically has less variance than the generic estimator:

$$\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z|x^{(i)})\,||\,p_\theta(z)) + \frac{1}{L} \sum_{l=1}^{L} \left(\log p_\theta(x^{(i)}|z^{(i,l)})\right)$$
$$\text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) \quad \text{and} \quad \epsilon^{(l)} \sim p(\epsilon) \tag{7}$$

Given multiple datapoints from a dataset X with N datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches:

$$\mathcal{L}(\theta, \phi; X) \simeq \tilde{\mathcal{L}}^M(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)}) \tag{8}$$

where the minibatch $X^M = \{x^{(i)}\}_{i=1}^{M}$ is a randomly drawn sample of M datapoints from the full dataset X with N datapoints. In our experiments we found that the number of samples L per datapoint can be set to 1 as long as the minibatch size M was large enough, e.g. M = 100. Derivatives $\nabla_{\theta,\phi} \tilde{\mathcal{L}}(\theta; X^M)$ can be taken, and the resulting gradients can be used in conjunction with stochastic optimization methods such as SGD or Adagrad [DHS10]. See algorithm 1 for a basic approach to compute the stochastic gradients.

A connection with auto-encoders becomes clear when looking at the objective function given at eq. (7). The first term (the KL divergence of the approximate posterior from the prior) acts as a regularizer, while the second term is an expected negative reconstruction error. The function $g_\phi(.)$ is chosen such that it maps a datapoint $x^{(i)}$ and a random noise vector $\epsilon^{(l)}$ to a sample from the […]
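The second SGVB estimator splits the bound into an analytic KL term and a sampled reconstruction term. The sketch below shows that split for the common case of a diagonal Gaussian approximate posterior and a standard normal prior, for which the KL term has a closed form. The function names and the `log_likelihood` callable are hypothetical placeholders for illustration; a real model would compute log p(x|z) with a decoder network.

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)], summed over dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def sgvb_b(mu, log_var, log_likelihood, n_samples=1, seed=0):
    """Estimator of the form -KL + (1/L) * sum_l log p(x | z_l),
    with z_l = mu + sigma * eps_l and eps_l ~ N(0, I).
    `log_likelihood` is a hypothetical callable z -> log p(x | z)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(0.5 * np.asarray(log_var, dtype=float))
    eps = rng.standard_normal((n_samples,) + mu.shape)
    z = mu + sigma * eps                                   # reparameterized samples
    recon = np.mean([log_likelihood(z_l) for z_l in z])    # sampled reconstruction term
    return -kl_diag_gaussian_to_std_normal(mu, log_var) + recon
```

Only the reconstruction term carries sampling noise here, which is the reason the source gives for the lower variance of this version.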
9. Importance weighted AE IWAE
¤ VAE
¤
¤
¤
¤
¤ k=1 VAE
¤ k
[…] distribution must be approximately factorial and predictable with a feed-forward neural network. […] VAE criterion may be too strict; a recognition network which places only a small fraction […] of its samples in the region of high posterior probability may still be sufficient for performing accurate inference. If we lower our standards in this way, this may give us additional flexibility to train a generative network whose posterior distributions do not fit the VAE […]. This is the motivation behind our proposed algorithm, the Importance Weighted Autoencoder (IWAE).

Our IWAE uses the same architecture as the VAE, with both a generative network and a recognition network. The difference is that it is trained to maximize a different lower bound on $\log p(x)$. In particular, we use the following lower bound, corresponding to the k-sample importance weighting estimate of the log-likelihood:

$$\mathcal{L}_k(x) = \mathbb{E}_{h_1, \ldots, h_k \sim q(h|x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i|x)}\right].$$

Here, $h_1, \ldots, h_k$ are sampled independently from the recognition model. The term inside the sum corresponds to the unnormalized importance weights for the joint distribution, which we will denote as $w_i = p(x, h_i)/q(h_i|x)$.

This is a lower bound on the marginal log-likelihood, as follows from Jensen's Inequality and the fact that the average importance weights are an unbiased estimator of $p(x)$:

$$\mathcal{L}_k = \mathbb{E}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_i\right] \leq \log \mathbb{E}\left[\frac{1}{k} \sum_{i=1}^{k} w_i\right] = \log p(x),$$

Theorem 1. For all k, the lower bounds satisfy

$$\log p(x) \geq \mathcal{L}_{k+1} \geq \mathcal{L}_k.$$

Moreover, if $p(h, x)/q(h|x)$ is bounded, then $\mathcal{L}_k$ approaches $\log p(x)$ as k goes to infinity.

See Appendix A.
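The ordering log p(x) ≥ L_{k+1} ≥ L_k can be observed numerically on a model where log p(x) is known. The sketch below assumes a toy model chosen here for illustration (h ~ N(0, 1), x | h ~ N(h, 1), so that p(x) = N(x; 0, 2)) and deliberately uses the prior as a poor recognition model; none of these choices come from the slides. Each bound is estimated by averaging the log of the mean importance weight over many independent repetitions.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian N(x; mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def iwae_bound(x, k, n_repeats=20_000, seed=0):
    """Estimate L_k for the assumed toy model, with q(h|x) set to the prior N(0, 1)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_repeats, k))     # h_i ~ q(h|x) = N(0, 1)
    log_w = log_normal(x, h, 1.0)               # log p(x,h)/q(h|x) = log p(x|h) here
    m = log_w.max(axis=1, keepdims=True)        # stabilised log-mean-exp over k samples
    log_mean_w = m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))
    return np.mean(log_mean_w)                  # average over repetitions

x_obs = 1.5                                     # hypothetical observation
log_px = log_normal(x_obs, 0.0, 2.0)            # exact log-likelihood for this model
l1, l5, l50 = (iwae_bound(x_obs, k) for k in (1, 5, 50))
```

With k = 1 the estimate is the ordinary VAE bound; increasing k moves it towards log p(x).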
10. Rényi α
¤ . # ,
¤ α > 0, α ≠ 1
¤ α → 1: KL divergence
¤ 1 =
8
9
[…] two distributions p and q on a random variable $\theta \in \Theta$:

$$D_\alpha[p||q] = \frac{1}{\alpha - 1} \log \int p(\theta)^\alpha q(\theta)^{1-\alpha}\, d\theta. \tag{1}$$

For $\alpha > 1$ the definition is valid when it is finite, and for discrete random variables the integration is replaced by summation. When $\alpha \to 1$ it recovers the Kullback-Leibler (KL) divergence that plays a crucial role in machine learning and information theory:

$$D_1[p||q] = \lim_{\alpha \to 1} D_\alpha[p||q] = \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, d\theta = \mathrm{KL}[p||q].$$

Similar to $\alpha = 1$, for values $\alpha = 0, +\infty$ the Rényi divergence is defined by continuity in $\alpha$:

$$D_0[p||q] = -\log \int_{p(\theta) > 0} q(\theta)\, d\theta,$$

$$D_{+\infty}[p||q] = \log \max_{\theta \in \Theta} \frac{p(\theta)}{q(\theta)}.$$

Another special case is $\alpha = \frac{1}{2}$, where the corresponding Rényi divergence is a function of the square Hellinger distance $\mathrm{Hel}^2[p||q] = \frac{1}{2} \int \left(\sqrt{p(\theta)} - \sqrt{q(\theta)}\right)^2 d\theta$:

$$D_{\frac{1}{2}}[p||q] = -2 \log(1 - \mathrm{Hel}^2[p||q]).$$

In [van Erven and Harremoës, 2014] the definition (1) is also extended to negative $\alpha$ values, […] it is non-positive and is thus no longer a valid divergence measure. The proposed […]
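The special cases of the Rényi divergence are easy to verify for discrete distributions, where the integral becomes a sum. The sketch below is a minimal illustration assuming both distributions have full support; the function names and the example vectors are made up here. It checks that α = 1/2 matches the Hellinger expression and that values of α near 1 approach the KL divergence.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Renyi alpha-divergence D_alpha[p||q] for discrete p, q with full support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if alpha == 1.0:
        return np.sum(p * np.log(p / q))        # KL divergence, the alpha -> 1 limit
    if np.isposinf(alpha):
        return np.log(np.max(p / q))            # alpha -> +inf limit
    return np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0)

def squared_hellinger(p, q):
    """Hel^2[p||q] = 1/2 * sum (sqrt(p) - sqrt(q))^2 for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

p_example = [0.2, 0.5, 0.3]                     # hypothetical distributions
q_example = [0.4, 0.4, 0.2]
d_half = renyi_divergence(p_example, q_example, 0.5)
d_kl = renyi_divergence(p_example, q_example, 1.0)
```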
11. Rényi
¤ # . / ,(.) KL
¤ Rényi
¤ Rényi α
¤
¤ α ≠ 1
$$\mathcal{L}_{VI}(q; D) = \log p(D) - \mathrm{KL}[q(\theta)\,||\,p(\theta|D)] = \mathbb{E}_q\left[\log \frac{p(\theta, D)}{q(\theta)}\right].$$

Variational Rényi Bound

Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Perhaps variational free-energy approaches can be generalised to the Rényi case? Consider approximating the true posterior $p(\theta|D)$ by minimizing Rényi's α-divergence for some selected $\alpha \geq 0$:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta)\,||\,p(\theta|D)].$$

Now we verify the alternative optimization problem

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \log p(D) - D_\alpha[q(\theta)\,||\,p(\theta|D)].$$

When $\alpha \neq 1$, the objective can be rewritten as

$$\log p(D) - \frac{1}{\alpha - 1} \log \int q(\theta)^\alpha p(\theta|D)^{1-\alpha}\, d\theta$$
$$= \log p(D) - \frac{1}{\alpha - 1} \log \mathbb{E}_q\left[\left(\frac{p(\theta, D)}{q(\theta) p(D)}\right)^{1-\alpha}\right]$$
$$= \frac{1}{1 - \alpha} \log \mathbb{E}_q\left[\left(\frac{p(\theta, D)}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q; D). \tag{13}$$

We name this new objective the variational Rényi bound (VR). Importantly the following theorem […]
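The identity L_α(q; D) = log p(D) - D_α[q || p(θ|D)] can be confirmed numerically when θ takes finitely many values, since every expectation is then a finite sum. The sketch below uses a made-up joint table over three values of θ; the numbers and function names are assumptions for illustration only.

```python
import numpy as np

def vr_bound(log_joint, q, alpha):
    """L_alpha(q; D) = 1/(1-alpha) * log E_q[(p(theta, D)/q(theta))^(1-alpha)],
    for discrete theta and alpha != 1."""
    log_ratio = log_joint - np.log(q)
    return np.log(np.sum(q * np.exp((1.0 - alpha) * log_ratio))) / (1.0 - alpha)

def renyi_divergence(p, q, alpha):
    """D_alpha[p||q] for discrete distributions with full support, alpha != 1."""
    return np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0)

joint = np.array([0.10, 0.25, 0.05])   # hypothetical values of p(theta, D)
q = np.array([0.3, 0.5, 0.2])          # approximate posterior over the same theta
evidence = joint.sum()                 # p(D)
posterior = joint / evidence           # p(theta | D)

lhs = vr_bound(np.log(joint), q, alpha=0.5)
rhs = np.log(evidence) - renyi_divergence(q, posterior, alpha=0.5)
upper = vr_bound(np.log(joint), q, alpha=-1.0)   # a negative alpha gives an upper bound
```

For this table, `lhs` and `rhs` agree, `lhs` lies below log p(D), and the α = -1 value lies above it.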
Rényi VR
12. VR
¤ VR
¤
¤
The VR Bound Optimisation Framework

[…] energy methods sidestep intractabilities in a class of intractable models. Recent work […] approximations based on Monte Carlo to expand the set of models that can be handled. […] be deployed on the same model class as Monte Carlo variational methods, but which […] cope if Monte Carlo methods are not resorted to. This section develops a scalable optimisation method for the VR bound by extending the recent advances of traditional VI. Black-box methods are also discussed to enable its application to arbitrary finite α settings.

Monte Carlo Estimation of the VR Bound

[…] a simple Monte Carlo method that uses finite samples $\theta_k \sim q(\theta)$, $k = 1, \ldots, K$ to approximate […]:

$$\hat{\mathcal{L}}_{\alpha,K}(q; D) = \frac{1}{1 - \alpha} \log \frac{1}{K} \sum_{k=1}^{K} \left[\left(\frac{p(\theta_k, D)}{q(\theta_k)}\right)^{1-\alpha}\right].$$

Unlike traditional VI, here the Monte Carlo estimate is biased, since the expectation over $q(\theta)$ is inside the logarithm. However we can bound the bias by the following theorems proved in the supplementary.

Theorem 2. $\mathbb{E}_{\{\theta_k\}_{k=1}^K}[\hat{\mathcal{L}}_{\alpha,K}(q; D)]$ as a function of α and K is: 1) non-decreasing in K for fixed α […] limiting result is $\mathcal{L}_\alpha$ for $K \to +\infty$ if $|p/q|$ is bounded; 2) continuous and non-increasing in α […] $\cup\, \{|\mathcal{L}_\alpha| < +\infty\}$.

[Figure 2, panels: (a) Sampling approximated VR bounds. (b) Simulated values of divergences.]
Figure 2: (a) An illustration for the bounding properties of sampling approximations to the VR bounds. Here $\alpha_2 < 0 < \alpha_1 < 1$ and $1 < K_1 < K_2 < +\infty$. (b) The bias of sampling estimate of (negative) alpha divergence. In this example p, q are 2-D Gaussian distributions with identity covariance matrix, where the only difference is $\mu_p = [0, 0]$ and $\mu_q = [1, 1]$. Best viewed in colour.

Corollary 1. For $K < +\infty$, there exists $\alpha_K < 0$ such that for all $\alpha \leq \alpha_K$, $\mathbb{E}_{\{\theta_k\}_{k=1}^K}[\hat{\mathcal{L}}_{\alpha,K}(q; D)] \geq \log p(D)$. Furthermore $\alpha_K$ is non-decreasing in K, with $\lim_{K \to 1} \alpha_K = -\infty$ and $\lim_{K \to +\infty} \alpha_K = 0$.

To better understand the above theorems we plot in Figure 2(a) an illustration of the bounding properties. By definition, the exact VR bound is a lower-bound or upper-bound of the log-likelihood $\log p(D)$ when $\alpha > 0$ or $\alpha < 0$, respectively (red lines). However for $\alpha \leq 1$ the sampling approximation $\hat{\mathcal{L}}_{\alpha,K}$ in expectation under-estimates the exact VR bound $\mathcal{L}_\alpha$ (blue dashed lines), where the approximation quality can be improved by using more samples (the blue dashed arrow). Thus for finite samples, negative alpha values ($\alpha_2 < 0$) can be used to improve the accuracy of the approximation (see the red arrow between the two blue dashed lines visualising $\hat{\mathcal{L}}_{\alpha_1,K_1}$ and $\hat{\mathcal{L}}_{\alpha_2,K_1}$, respectively).

We empirically evaluate the theoretical results in Figure 2(b), by computing the exact and Monte […]
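The sampling estimate of the VR bound can be computed from log importance weights alone. The sketch below is one possible implementation, assuming the K values log p(θ_k, D) - log q(θ_k) are already available as an array; the function name and the example weights are invented here. It handles α = 1 and α = -∞ as limits and uses a log-mean-exp for numerical stability. For a fixed set of samples the estimate is a log power mean of the weights, so it does not increase as α grows.

```python
import numpy as np

def vr_estimate(log_w, alpha):
    """K-sample estimate of the VR bound from log importance weights
    log_w[k] = log p(theta_k, D) - log q(theta_k)."""
    log_w = np.asarray(log_w, dtype=float)
    if alpha == 1.0:
        return np.mean(log_w)                   # alpha -> 1 limit: the usual VI estimate
    if np.isneginf(alpha):
        return np.max(log_w)                    # alpha -> -inf limit: largest weight only
    s = (1.0 - alpha) * log_w
    m = np.max(s)                               # shift for a stable log-mean-exp
    return (m + np.log(np.mean(np.exp(s - m)))) / (1.0 - alpha)

log_w_example = np.array([-3.0, -1.0, -2.5, -0.5])   # hypothetical log weights
estimates = [vr_estimate(log_w_example, a) for a in (1.0, 0.5, 0.0, -1.0, -np.inf)]
```

With α = 0 the function returns the log of the average weight, which is the quantity inside the expectation of the IWAE bound.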
13. VR
exact approx.
VR
1 ≤ 1
1
k
14. VR
¤ IWAE
¤ 1
ˆ
1 ≤ 1
1 = 0
IWAE
15. VR-max
¤ Reparameterization trick
¤
¤
¤ α = 1: VAE
¤ α → −∞
¤ importance weight
¤ VR-max
[…] to reduce the clutter of notations. Now we apply the reparameterization trick to the VR bound

$$\mathcal{L}_\alpha(q_\phi; D) = \frac{1}{1 - \alpha} \log \mathbb{E}_\epsilon\left[\left(\frac{p(g_\phi, D)}{q(g_\phi)}\right)^{1-\alpha}\right]. \tag{19}$$

Then the gradient of the VR bound w.r.t. $\phi$ is

$$\nabla_\phi \mathcal{L}_\alpha(q_\phi; D) = \mathbb{E}_\epsilon\left[w_\alpha(\epsilon; \phi, D)\, \nabla_\phi \log \frac{p(g_\phi, D)}{q(g_\phi)}\right], \tag{20}$$

where $w_\alpha(\epsilon; \phi, D) \propto \left(\frac{p(g_\phi, D)}{q(g_\phi)}\right)^{1-\alpha}$ denotes the normalised importance weight. For finite samples $\epsilon_k \sim p(\epsilon)$, $k = 1, \ldots, K$ the gradient is approximated by

$$\nabla_\phi \hat{\mathcal{L}}_{\alpha,K}(q_\phi; D) = \frac{1}{K} \sum_{k=1}^{K} \hat{w}_{\alpha,k}\, \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), D)}{q(g_\phi(\epsilon_k))}. \tag{21}$$

with $\hat{w}_{\alpha,k}$ short-hand for $\hat{w}_\alpha(\epsilon_k; \phi, D)$, the normalised importance weight with finite samples. (With the $1/K$ prefactor in (21), the weights are normalised to average to one over the K samples, so that $\alpha = 1$ gives $\hat{w}_{\alpha,k} = 1$.) One can show that it recovers the stochastic gradients of $\mathcal{L}_{VI}$ by setting $\alpha = 1$ in (21):

$$\nabla_\phi \mathcal{L}_{VI}(q_\phi; D) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), D)}{q(g_\phi(\epsilon_k))}, \tag{22}$$

which means the resulting algorithm unifies the computation for all finite α settings.

To speed-up learning [Burda et al., 2015] suggested back-propagating only one sample $\epsilon_j$ with […]

Algorithm 1 one gradient step for VR-α/VR-max
1: sample $\epsilon_1, \ldots, \epsilon_K \sim p(\epsilon)$
2: for all $k = 1, \ldots, K$, and $n \in S$ the current minibatch, compute the unnormalised weights
   $\log \hat{w}(\epsilon_k; x_n) = \log p(g_\phi(\epsilon_k), x_n) - \log q(g_\phi(\epsilon_k)|x_n)$
3: choose the sample $\epsilon_{j_n}$ to back-propagate:
   if $|\alpha| < \infty$: $j_n \sim p_k$ where $p_k \propto \hat{w}(\epsilon_k; x_n)^{1-\alpha}$
   if $\alpha = -\infty$: $j_n = \arg\max_k \log \hat{w}(\epsilon_k; x_n)$
4: return the gradients to the optimiser
   $\frac{1}{|S|} \sum_{n \in S} \nabla_\phi \log \hat{w}(\epsilon_{j_n}; x_n)$
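Step 3 of the algorithm, choosing which sample to back-propagate, can be sketched in a few lines. The code below is an illustration under assumptions: it takes the K log weights of one datapoint as given, and the function names are invented here. For finite α it samples an index with probability proportional to the weight raised to the power 1 - α; for α = -∞ (VR-max) it takes the index of the largest weight.

```python
import numpy as np

def selection_probs(log_w, alpha):
    """Probabilities p_k proportional to w_k^(1 - alpha); one-hot at the argmax for alpha = -inf."""
    log_w = np.asarray(log_w, dtype=float)
    if np.isneginf(alpha):
        probs = np.zeros_like(log_w)
        probs[np.argmax(log_w)] = 1.0           # VR-max: always pick the largest weight
        return probs
    s = (1.0 - alpha) * log_w
    s = s - np.max(s)                           # shift before exponentiating for stability
    probs = np.exp(s)
    return probs / probs.sum()

def choose_sample(log_w, alpha, rng):
    """Draw the index j_n of the sample to back-propagate for one datapoint."""
    probs = selection_probs(log_w, alpha)
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
log_w_example = np.array([-3.0, -1.0, -2.5, -0.5])   # hypothetical log weights
j_max = choose_sample(log_w_example, -np.inf, rng)
```

Setting α = 1 makes all K samples equally likely, while α = 0 samples in proportion to the importance weights themselves.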
16. ¤ [Li et al., 2015] EP
¤
¤
¤ M
VR
¤ M
¤ Black-box alpha BB-α VR
[…] for VAEs. Note that VR-max does not compute […] the minimum description length principle (MDL), since MDL approximates the true […] minimising the exact VR bound $\mathcal{L}_{-\infty}$ that upper-bounds the exact log-likelihood function.

4.3 Stochastic Approximation for Large-scale Learning

So far we discussed the VR bounds computed on the whole dataset D. However for large datasets full batch learning will be very inefficient. In the appendix of [Li et al., 2015] the authors discussed stochastic EP as a way of approximating the VR bound optimisation for large-scale learning. Here we propose another stochastic approximation method to enable minibatch training, which directly applies to the VR bound.

Using the notation $f_n(\theta) = p(x_n|\theta)$ and defining the "average likelihood" $\bar{f}_D(\theta) = [\prod_{n=1}^N f_n(\theta)]^{\frac{1}{N}}$, the joint distribution can be rewritten as $p(\theta, D) = p_0(\theta) \bar{f}_D(\theta)^N$. Now we sample M datapoints $S = \{x_{n_1}, \ldots, x_{n_M}\} \sim D$ and define the corresponding "subset average likelihood" $\bar{f}_S(\theta) = [\prod_{m=1}^M f_{n_m}(\theta)]^{\frac{1}{M}}$. When M = 1 we also write $\bar{f}_S(\theta) = f_n(\theta)$ for $S = \{x_n\}$. Then we approximate the VR bound (13) by replacing $\bar{f}_D(\theta)$ with $\bar{f}_S(\theta)$:

$$\tilde{\mathcal{L}}_\alpha(q; S) = \frac{1}{1 - \alpha} \log \int q(\theta)^\alpha \left(p_0(\theta) \bar{f}_S(\theta)^N\right)^{1-\alpha} d\theta = \frac{1}{1 - \alpha} \log \mathbb{E}_q\left[\left(\frac{p_0(\theta) \bar{f}_S(\theta)^N}{q(\theta)}\right)^{1-\alpha}\right]. \tag{23}$$

This returns a stochastic estimate of the evidence lower-bound when $\alpha \to 1$. For other $\alpha \neq 1$ settings, increasing the size of the minibatch $M = |S|$ reduces the bias of approximation. This is guaranteed by the following theorem proved in the supplementary.

Theorem 3. If the approximate distribution $q(\theta)$ is Gaussian $\mathcal{N}(\mu, \Sigma)$, and the likelihood function has an exponential family form $p(x|\theta) = \exp[\langle\theta, \Phi(x)\rangle - A(\theta)]$, then for $\alpha \leq 1$ the stochastic approximation is bounded by […]
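In log space, replacing the average likelihood of the full dataset by that of a minibatch amounts to rescaling the summed minibatch log-likelihood by N/M. The sketch below illustrates this under assumptions: `log_prior` and `log_lik` are hypothetical callables supplied by the user, and the Gaussian model and data values are invented for the example.

```python
import numpy as np

def minibatch_log_joint(theta, x_batch, n_total, log_prior, log_lik):
    """log[p0(theta) * fbar_S(theta)^N]
    = log p0(theta) + (N / M) * sum_m log f_{n_m}(theta)."""
    x_batch = np.asarray(x_batch, dtype=float)
    m = len(x_batch)
    return log_prior(theta) + (n_total / m) * np.sum(log_lik(x_batch, theta))

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian N(x; mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

data = np.array([0.3, -1.2, 0.8, 2.0])                 # hypothetical dataset, N = 4
log_prior = lambda th: log_normal(th, 0.0, 1.0)        # assumed prior p0 = N(0, 1)
log_lik = lambda x, th: log_normal(x, th, 1.0)         # assumed likelihood N(x; theta, 1)

full = log_prior(0.5) + np.sum(log_lik(data, 0.5))     # exact log p(theta, D)
approx_full = minibatch_log_joint(0.5, data, len(data), log_prior, log_lik)
approx_two = minibatch_log_joint(0.5, data[:2], len(data), log_prior, log_lik)
```

When the minibatch is the whole dataset the approximation is exact; the resulting log joint can be fed into any estimator of the VR bound in place of log p(θ, D).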
19. 1
¤ 3
¤ VAE: α = 1
¤ IWAE: α = 0
¤ VR-max: α = −∞
¤ α = 0, * = 5000
¤ VR-max IWAE
¤ VR-max
¤ VR-max 25hr29min IWAE 61hr16min
[…] method was implemented upon the publicly available code¹. Note that the original implementation back-propagates all the samples to compute gradients, while VR-max only back-propagates the sample with the largest importance weight.

Three datasets are considered: Frey Face, Caltech 101 Silhouettes and MNIST. The experiments were […] small Frey Face dataset, while the other two were […] consists of L = 1 or 2 stochastic layers with […] architecture is detailed in the supplementary. […] For MNIST we used settings from [Burda et al., 2015] […] and number of epochs. For other two datasets the […] the VI setting. We reproduced the experiments for […] included in [Burda et al., 2015] mismatches those […] by computing $\log p(x) \approx \hat{\mathcal{L}}_{\alpha,K}(q; x)$ with α = 0.0, […] some samples from the VR-max trained models […] almost indistinguishable to IWAEs on all the three […] time to run compared to IWAE with a full backward […] a Tesla C2075 GPU, and when trained on MNIST […] VR-max and IWAE took 25hr29min and 61hr16min, […] also implemented the single backward pass version […] result for IWAE is -85.02, which is slightly worse […] the arguments in Section 4.1 that negative α can be […] computation resources are limited.

[…] value corresponding to the tightest VR bound becomes […] q and the true posterior increases. This is the case […] q is fitted to approximate the typically multimodal […]

[Figure 3, panels: (a) Frey Face (b) Caltech 101 Silhouettes (c) MNIST]
Figure 3: Sampled images from the VR-max trained auto-encoders.

Dataset                    L   K    VAE       IWAE      VR-max
Frey Face                  1   5    1322.96   1380.30   1377.40
(± std. err.)                       ±10.03    ±4.60     ±4.59
Caltech 101 Silhouettes    1   5    -119.69   -117.89   -118.01
                           1   50   -119.61   -117.21   -117.10
MNIST                      1   5    -86.47    -85.41    -85.42
                           1   50   -86.35    -84.80    -84.81
                           2   5    -85.01    -83.92    -84.04
                           2   50   -84.78    -83.12    -83.44

Table 1: Average Test log-likelihood. Results for VAE on MNIST are collected from [Burda et al., 2015]. IWAE results are reproduced using the publicly available implementation.