Variational Inference with Rényi Divergence
D1
¤ Posted to arXiv on 2/6 (February 2016)
¤ Authors: Yingzhen Li, Richard E. Turner
¤ University of Cambridge
¤ Li (D3) is also an author of "Stochastic Expectation Propagation" (NIPS)
¤ Generalises variational inference using the Rényi divergence
¤ Covers the VAE and the importance weighted AE [Burda et al., 2015] as special cases
¤ Proofs are deferred to the Appendix
PRML
¤ Variational inference approximates an intractable posterior with a tractable distribution q
¤ The log marginal likelihood decomposes into the ELBO plus a KL term:

$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)$$
¤ Related work:
¤ SVI [Hoffman et al., 2013]
¤ SEP [Li et al., 2015]
¤ black-box variational inference [Ranganath et al., 2014]
¤ black-box alpha (BB-α) [Hernandez-Lobato et al., 2015]
¤ Importance weighted AE (IWAE) [Burda et al., 2015]
¤ an extension of the VAE (ICLR 2016)
¤ Setting: Bayesian inference over model parameters
¤ the true posterior p(θ|D)
¤ is approximated by a tractable q(θ)
¤ by minimising the KL divergence between q(θ) and p(θ|D)
principle literature [Grünwald, 2007].

2.2 Variational Inference

Next we review the variational inference algorithm [Jordan et al., 1999, Beal, 2003] from an optimisation perspective, using posterior approximation as a running example. Consider observing a dataset of i.i.d. samples $\mathcal{D} = \{x_n\}_{n=1}^N$ from a probabilistic model $p(x|\theta)$ parametrised by a random variable $\theta$ that is drawn from a prior $p_0(\theta)$. Bayesian inference involves computing the posterior distribution of the parameters given the data,

$$p(\theta|\mathcal{D}) = \frac{p(\theta, \mathcal{D})}{p(\mathcal{D})} = \frac{p_0(\theta) \prod_{n=1}^{N} p(x_n|\theta)}{p(\mathcal{D})},$$

where $p(\mathcal{D}) = \int p_0(\theta) \prod_{n=1}^{N} p(x_n|\theta)\, \mathrm{d}\theta$ is often called marginal likelihood or model evidence. For many powerful models, including Bayesian neural networks, the true posterior is typically intractable. Variational inference introduces an approximation $q(\theta)$ to the true posterior, which is obtained by minimising the KL divergence in some tractable distribution family $\mathcal{Q}$:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(\theta) \| p(\theta|\mathcal{D})]. \tag{8}$$

However the KL divergence in (8) is also intractable, mainly because of the difficult term $p(\mathcal{D})$. Variational inference sidesteps this difficulty by considering an equivalent optimisation problem:

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}_{VI}(q; \mathcal{D}), \tag{9}$$

where the variational lower-bound or evidence lower-bound (ELBO) $\mathcal{L}_{VI}(q; \mathcal{D})$ is defined by

$$\mathcal{L}_{VI}(q; \mathcal{D}) = \log p(\mathcal{D}) - \mathrm{KL}[q(\theta) \| p(\theta|\mathcal{D})] = \mathbb{E}_{q}\left[\log \frac{p(\theta, \mathcal{D})}{q(\theta)}\right]. \tag{10}$$
VAE
¤ [Kingma et al., 2014]
¤ A deep generative model with hidden variables ℎ
¤ The variational approximation is parametrised by a recognition network
Variational Auto-encoder with Rényi Divergence

The variational auto-encoder (VAE) [Kingma and Welling, 2014, Rezende et al., 2014] is a recently proposed (deep) generative model that parametrizes the variational approximation with a recognition network. The generative model is specified as a hierarchical latent variable model:

$$p(x) = \sum_{h^{(1)} \ldots h^{(L)}} p(h^{(L)})\, p(h^{(L-1)}|h^{(L)}) \cdots p(x|h^{(1)}). \tag{14}$$

Here we drop the parameters θ but keep in mind that they will be learned using approximate maximum likelihood. However for these models the exact computation of log p(x) requires marginalisation of all hidden variables and is thus often intractable. Variational expectation-maximisation (EM) methods come to the rescue by approximating

$$\log p(x) \approx \mathcal{L}_{VI}(q; x) = \mathbb{E}_{q(h|x)}\left[\log \frac{p(x, h)}{q(h|x)}\right], \tag{15}$$

where h collects all the hidden variables $h^{(1)}, \ldots, h^{(L)}$ and the approximate posterior q(h|x) is defined as

$$q(h|x) = q(h^{(1)}|x)\, q(h^{(2)}|h^{(1)}) \cdots q(h^{(L)}|h^{(L-1)}). \tag{16}$$

In variational EM, optimisation for q and p are alternated to guarantee convergence. However the core idea of VAE is to jointly optimise p and q, which instead has no guarantee of increasing the MLE objective function in each iteration. Indeed jointly the method is biased [Turner and Sahani, 2011]. This explores the possibility that alternative surrogate functions might return estimates that are tighter bounds. So the VR bound is considered in this context:

$$\mathcal{L}_{\alpha}(q; x) = \frac{1}{1-\alpha} \log \mathbb{E}_{q(h|x)}\left[\left(\frac{p(x, h)}{q(h|x)}\right)^{1-\alpha}\right]. \tag{17}$$
$$p(x|\theta) = \sum_{h^1,\ldots,h^L} p(h^L|\theta)\, p(h^{L-1}|h^L, \theta) \cdots p(x|h^1, \theta). \tag{1}$$

Here, θ is a vector of parameters of the variational autoencoder, and $h = \{h^1, \ldots, h^L\}$ denotes the stochastic hidden units, or latent variables. The dependence on θ is often suppressed for clarity. For convenience, we define $h^0 = x$. Each of the terms $p(h^\ell|h^{\ell+1})$ may denote a complicated nonlinear relationship, for instance one computed by a multilayer neural network. However, it is assumed that sampling and probability evaluation are tractable for each $p(h^\ell|h^{\ell+1})$. Note that L denotes the number of stochastic hidden layers; the deterministic layers are not shown explicitly here. We assume the recognition model q(h|x) is defined in terms of an analogous factorization:

$$q(h|x) = q(h^1|x)\, q(h^2|h^1) \cdots q(h^L|h^{L-1}), \tag{2}$$

where sampling and probability evaluation are tractable for each of the terms in the product.

In this work, we assume the same families of conditional probability distributions as Kingma & Welling (2014). In particular, the prior $p(h^L)$ is fixed to be a zero-mean, unit-variance Gaussian. In general, each of the conditional distributions $p(h^\ell|h^{\ell+1})$ and $q(h^\ell|h^{\ell-1})$ is a Gaussian with diagonal covariance, where the mean and covariance parameters are computed by a deterministic feed-forward neural network. For real-valued observations, $p(x|h^1)$ is also defined to be such a Gaussian; for binary observations, it is defined to be a Bernoulli distribution whose mean parameters are computed by a neural network.

The VAE is trained to maximize a variational lower bound on the log-likelihood, as derived from Jensen's Inequality:

$$\log p(x) = \log \mathbb{E}_{q(h|x)}\left[\frac{p(x, h)}{q(h|x)}\right] \geq \mathbb{E}_{q(h|x)}\left[\log \frac{p(x, h)}{q(h|x)}\right] = \mathcal{L}(x). \tag{3}$$

Since $\mathcal{L}(x) = \log p(x) - D_{KL}(q(h|x)\|p(h|x))$, the training procedure is forced to trade off the …
VAE
¤ Trained with the reparameterization trick
¤ The gradient operator is pushed inside the expectation
… proposed a reparameterization of the recognition distribution in terms of auxiliary distributions, such that the samples from the recognition model are deterministic functions of the inputs and auxiliary variables. While they presented the reparameterization trick for a variety of distributions, for convenience we discuss the special case of Gaussians … (The general reparameterization trick can be used with our …)

… the recognition distribution $q(h^\ell|h^{\ell-1}, \theta)$ always takes the form of a Gaussian … whose mean and covariance are computed from the states of the hidden units at the previous layer and the model parameters. This can be alternatively expressed by first sampling an auxiliary variable $\epsilon^\ell \sim \mathcal{N}(0, I)$, and then applying the deterministic mapping

$$h^\ell(\epsilon^\ell, h^{\ell-1}, \theta) = \Sigma(h^{\ell-1}, \theta)^{1/2}\, \epsilon^\ell + \mu(h^{\ell-1}, \theta). \tag{4}$$

The joint recognition distribution $q(h|x, \theta)$ over all latent variables can be expressed in terms of a deterministic mapping $h(\epsilon, x, \theta)$, with $\epsilon = (\epsilon^1, \ldots, \epsilon^L)$, by applying Eqn. 4 for each layer in sequence. Since the distribution of ε does not depend on θ, we can reformulate the gradient of the bound $\mathcal{L}(x)$ from Eqn. 3 by pushing the gradient operator inside the expectation:

$$\nabla_\theta \log \mathbb{E}_{h \sim q(h|x,\theta)}\left[\frac{p(x, h|\theta)}{q(h|x, \theta)}\right] = \nabla_\theta\, \mathbb{E}_{\epsilon^1,\ldots,\epsilon^L \sim \mathcal{N}(0,I)}\left[\log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right] \tag{5}$$

$$= \mathbb{E}_{\epsilon^1,\ldots,\epsilon^L \sim \mathcal{N}(0,I)}\left[\nabla_\theta \log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right]. \tag{6}$$

Assuming the mapping h is represented as a deterministic feed-forward neural network, for a fixed ε, the gradient inside the expectation can be computed using standard backpropagation. In practice, one approximates the expectation in Eqn. 6 by generating k samples of ε and applying the Monte Carlo estimator

$$\frac{1}{k} \sum_{i=1}^{k} \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta) \tag{7}$$

with $w(x, h, \theta) = p(x, h|\theta)/q(h|x, \theta)$. This is an unbiased estimate of $\nabla_\theta \mathcal{L}(x)$. We note that the VAE update and the basic REINFORCE-like update are both unbiased estimators of the same gradient, but the VAE update tends to have lower variance in practice because it makes use of the …
VAE
¤ The SGVB estimator of the VAE bound
¤ The KL term can often be computed analytically and acts as a regulariser
In this section we introduce a practical estimator of the lower bound and its derivatives w.r.t. the parameters. We assume an approximate posterior in the form $q_\phi(z|x)$, but please note that the technique can be applied to the case $q_\phi(z)$, i.e. where we do not condition on x, as well. The fully variational Bayesian method for inferring a posterior over the parameters is given in the appendix. Under certain mild conditions outlined in section 2.4, for a chosen approximate posterior $q_\phi(z|x)$ we can reparameterize the random variable $\tilde{z} \sim q_\phi(z|x)$ using a differentiable transformation $g_\phi(\epsilon, x)$ of an (auxiliary) noise variable ε:

$$\tilde{z} = g_\phi(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon) \tag{4}$$

See section 2.4 for general strategies for choosing such an appropriate distribution p(ε) and function $g_\phi(\epsilon, x)$. We can now form Monte Carlo estimates of expectations of some function f(z) w.r.t. $q_\phi(z|x)$ as follows:

$$\mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] = \mathbb{E}_{p(\epsilon)}\left[f(g_\phi(\epsilon, x^{(i)}))\right] \simeq \frac{1}{L} \sum_{l=1}^{L} f(g_\phi(\epsilon^{(l)}, x^{(i)})) \quad \text{where} \quad \epsilon^{(l)} \sim p(\epsilon) \tag{5}$$

We apply this technique to the variational lower bound (eq. (2)), yielding our generic Stochastic Gradient Variational Bayes (SGVB) estimator $\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)})$:

$$\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) = \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)}) \quad \text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}),\ \epsilon^{(l)} \sim p(\epsilon) \tag{6}$$

(Fragment of Algorithm 1:)
g ← ∇_{θ,φ} L̃^M(θ, φ; X^M, ε)   (gradients of minibatch estimator (8))
θ, φ ← update parameters using gradients g (e.g. SGD or Adagrad [DHS10])
until convergence of parameters (θ, φ)
return θ, φ

Often, the KL-divergence $D_{KL}(q_\phi(z|x^{(i)}) \| p_\theta(z))$ of eq. (3) can be integrated analytically (see appendix B), such that only the expected reconstruction error $\mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right]$ requires estimation by sampling. The KL-divergence term can then be interpreted as regularizing φ, encouraging the approximate posterior to be close to the prior $p_\theta(z)$. This yields a second version of the SGVB estimator $\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) \simeq \mathcal{L}(\theta, \phi; x^{(i)})$, corresponding to eq. (3), which typically has less variance than the generic estimator:

$$\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z|x^{(i)}) \| p_\theta(z)) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)}) \quad \text{where} \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}),\ \epsilon^{(l)} \sim p(\epsilon) \tag{7}$$

Given multiple datapoints from a dataset X with N datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches:

$$\mathcal{L}(\theta, \phi; X) \simeq \tilde{\mathcal{L}}^M(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)}) \tag{8}$$

where the minibatch $X^M = \{x^{(i)}\}_{i=1}^M$ is a randomly drawn sample of M datapoints from the full dataset X with N datapoints. In our experiments we found that the number of samples L per datapoint can be set to 1 as long as the minibatch size M was large enough, e.g. M = 100. Derivatives $\nabla_{\theta,\phi} \tilde{\mathcal{L}}(\theta; X^M)$ can be taken, and the resulting gradients can be used in conjunction with stochastic optimization methods such as SGD or Adagrad [DHS10]. See algorithm 1 for a basic approach to compute the stochastic gradients.

A connection with auto-encoders becomes clear when looking at the objective function given at eq. (7). The first term (the KL divergence of the approximate posterior from the prior) acts as a regularizer, while the second term is an expected negative reconstruction error. The function $g_\phi(\cdot)$ is chosen such that it maps a datapoint $x^{(i)}$ and a random noise vector $\epsilon^{(l)}$ to a sample from the …
Importance weighted AE (IWAE)
¤ Same architecture as the VAE
¤ Trained on a k-sample importance weighted lower bound of log p(x)
¤ The bound is tighter than the ELBO
¤ k = 1 recovers the VAE
¤ Increasing k tightens the bound
… bution must be approximately factorial and predictable with a feed-forward neural network … the VAE criterion may be too strict; a recognition network which places only a small fraction (say, 20%) of its samples in the region of high posterior probability may still be sufficient for performing accurate inference. If we lower our standards in this way, this may give us additional flexibility to train a generative network whose posterior distributions do not fit the VAE assumptions. This is the motivation behind our proposed algorithm, the Importance Weighted Autoencoder (IWAE).

The IWAE uses the same architecture as the VAE, with both a generative network and a recognition network. The difference is that it is trained to maximize a different lower bound on log p(x). In particular, we use the following lower bound, corresponding to the k-sample importance weighting estimate of the log-likelihood:

$$\mathcal{L}_k(x) = \mathbb{E}_{h_1,\ldots,h_k \sim q(h|x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i|x)}\right].$$

Here $h_1, \ldots, h_k$ are sampled independently from the recognition model. The term inside the sum corresponds to the unnormalized importance weights for the joint distribution, which we will denote $w_i = p(x, h_i)/q(h_i|x)$.

This is a lower bound on the marginal log-likelihood, as follows from Jensen's Inequality and the fact that the average importance weights are an unbiased estimator of p(x):

$$\mathcal{L}_k = \mathbb{E}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_i\right] \leq \log \mathbb{E}\left[\frac{1}{k} \sum_{i=1}^{k} w_i\right] = \log p(x).$$

Theorem 1. For all k, the lower bounds satisfy $\log p(x) \geq \mathcal{L}_{k+1} \geq \mathcal{L}_k$. Moreover, if $p(h, x)/q(h|x)$ is bounded, then $\mathcal{L}_k$ approaches log p(x) as k goes to infinity. (Proof: see Appendix A.)
Rényi's α-divergence
¤ Defined between distributions p and q
¤ Valid for α > 0, α ≠ 1
¤ α → 1 recovers the KL divergence
¤ α = 1/2 is related to the (squared) Hellinger distance
For two distributions p and q on a random variable θ ∈ Θ:

$$D_\alpha[p \| q] = \frac{1}{\alpha - 1} \log \int p(\theta)^\alpha\, q(\theta)^{1-\alpha}\, \mathrm{d}\theta. \tag{1}$$

For α > 1 the definition is valid when it is finite, and for discrete random variables the integration is replaced by summation. When α → 1 it recovers the Kullback-Leibler (KL) divergence that plays a crucial role in machine learning and information theory:

$$D_1[p \| q] = \lim_{\alpha \to 1} D_\alpha[p \| q] = \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, \mathrm{d}\theta = \mathrm{KL}[p \| q].$$

Similar to α = 1, for values α = 0, +∞ the Rényi divergence is defined by continuity in α:

$$D_0[p \| q] = -\log \int_{p(\theta) > 0} q(\theta)\, \mathrm{d}\theta, \qquad D_{+\infty}[p \| q] = \log \max_{\theta \in \Theta} \frac{p(\theta)}{q(\theta)}.$$

Another special case is α = 1/2, where the corresponding Rényi divergence is a function of the squared Hellinger distance $\mathrm{Hel}^2[p \| q] = \frac{1}{2} \int \left(\sqrt{p(\theta)} - \sqrt{q(\theta)}\right)^2 \mathrm{d}\theta$:

$$D_{1/2}[p \| q] = -2 \log\left(1 - \mathrm{Hel}^2[p \| q]\right).$$

In [van Erven and Harremoës, 2014] the definition (1) is also extended to negative α values, although it is non-positive and is thus no longer a valid divergence measure. The proposed method …
Rényi divergence variational inference
¤ Standard VI approximates p(θ|D) with q(θ) by minimising the KL divergence
¤ Generalise: minimise a Rényi α-divergence instead
¤ For α ≠ 1 the objective admits a tractable rewriting (the VR bound)
3 Variational Rényi Bound

Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Can variational free-energy approaches be generalised to the Rényi case? Consider approximating the exact posterior p(θ|D) by minimizing Rényi's α-divergence for some selected α ≥ 0:

$$q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta) \| p(\theta|\mathcal{D})].$$

One can verify the alternative optimization problem

$$q(\theta) = \arg\max_{q \in \mathcal{Q}} \log p(\mathcal{D}) - D_\alpha[q(\theta) \| p(\theta|\mathcal{D})].$$

When α ≠ 1, the objective can be rewritten as

$$\log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \int q(\theta)^\alpha\, p(\theta|\mathcal{D})^{1-\alpha}\, \mathrm{d}\theta = \log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \mathbb{E}_q\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)\, p(\mathcal{D})}\right)^{1-\alpha}\right] = \frac{1}{1 - \alpha} \log \mathbb{E}_q\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)}\right)^{1-\alpha}\right] := \mathcal{L}_\alpha(q; \mathcal{D}).$$

We name this new objective the variational Rényi bound (VR). Importantly the following theorem holds: …
The variational Rényi (VR) bound
¤ Monte Carlo estimation of the VR bound
4 VR Bound Optimisation Framework

Variational free-energy methods sidestep intractabilities in a class of intractable models. Recent work … approximations based on Monte Carlo to expand the set of models that can be handled … deployed on the same model class as Monte Carlo variational methods … This section develops a scalable optimisation method for the VR bound by extending the recent advances of traditional VI. Black-box methods are discussed to enable its application to arbitrary finite α settings.

4.1 Monte Carlo Estimation of the VR Bound

We propose a simple Monte Carlo method that uses finite samples $\theta_k \sim q(\theta)$, k = 1, ..., K to approximate $\mathcal{L}_\alpha \approx \hat{\mathcal{L}}_{\alpha,K}$:

$$\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D}) = \frac{1}{1-\alpha} \log \frac{1}{K} \sum_{k=1}^{K} \left[\left(\frac{p(\theta_k, \mathcal{D})}{q(\theta_k)}\right)^{1-\alpha}\right].$$

Unlike traditional VI, here the Monte Carlo estimate is biased, since the expectation over q(θ) is inside the logarithm. However we can bound the bias by the following theorems proved in the supplementary.

Theorem 2. $\mathbb{E}_{\{\theta_k\}_{k=1}^K}\left[\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D})\right]$ as a function of α and K is: 1) non-decreasing in K for fixed α ≤ 1, and the limiting result is $\mathcal{L}_\alpha$ for K → +∞ if |p/q| is bounded; 2) continuous and non-increasing in α with fixed K on $\alpha \in (-\infty, 1] \cup \{\alpha : |\mathcal{L}_\alpha| < +\infty\}$.

Figure 2: (a) An illustration for the bounding properties of sampling approximations to the VR bounds. Here α₂ < 0 < α₁ < 1 and 1 < K₁ < K₂ < +∞. (b) The bias of sampling estimate of (negative) alpha divergence. In this example p, q are 2-D Gaussian distributions with identity covariance matrix, where the only difference is µ_p = [0, 0] and µ_q = [1, 1]. Best viewed in colour.

Corollary 1. For K < +∞, there exists α_K < 0 such that for all α ≤ α_K, $\mathbb{E}_{\{\theta_k\}_{k=1}^K}\left[\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D})\right] \leq \log p(\mathcal{D})$. Furthermore α_K is non-decreasing in K, with $\lim_{K \to 1} \alpha_K = -\infty$ and $\lim_{K \to +\infty} \alpha_K = 0$.

To better understand the above theorems we plot in Figure 2(a) an illustration of the bounding properties. By definition, the exact VR bound is a lower-bound or upper-bound of the log-likelihood log p(D) when α > 0 or α < 0, respectively (red lines). However for α ≤ 1 the sampling approximation $\hat{\mathcal{L}}_{\alpha,K}$ in expectation under-estimates the exact VR bound $\mathcal{L}_\alpha$ (blue dashed lines), where the approximation quality can be improved by using more samples (the blue dashed arrow). Thus for finite samples, negative alpha values (α₂ < 0) can be used to improve the accuracy of the approximation (see the red arrow between the two blue dashed lines visualising $\hat{\mathcal{L}}_{\alpha_1,K_1}$ and $\hat{\mathcal{L}}_{\alpha_2,K_1}$, respectively).

We empirically evaluate the theoretical results in Figure 2(b), by computing the exact and Monte Carlo …
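The bias behaviour described by Theorem 2 can be reproduced in a few lines under the setup of the Figure 2(b) caption (2-D unit-covariance Gaussians with µ_p = [0,0], µ_q = [1,1]); for this pair the closed form D_α[q||p] = α holds, so the exact value being estimated is −α (a NumPy sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(1)

mu_p, mu_q = np.zeros(2), np.ones(2)   # identity covariances, as in the caption

def log_ratio(theta):
    """log p(theta) - log q(theta); the normalising constants cancel."""
    return 0.5 * (((theta - mu_q) ** 2).sum(-1) - ((theta - mu_p) ** 2).sum(-1))

def vr_hat_mean(alpha, K, reps=5_000):
    """Average over `reps` runs of the K-sample estimate of -D_alpha[q||p]."""
    theta = rng.normal(mu_q, 1.0, size=(reps, K, 2))   # theta_k ~ q
    a = (1 - alpha) * log_ratio(theta)                 # shape (reps, K)
    m = a.max(axis=1, keepdims=True)                   # log-sum-exp for stability
    log_mean = m[:, 0] + np.log(np.exp(a - m).mean(axis=1))
    return (log_mean / (1 - alpha)).mean()

for alpha in [-1.0, 0.0, 0.5]:
    for K in [1, 10, 100]:
        est = vr_hat_mean(alpha, K)
        print(f"alpha={alpha:+.1f} K={K:4d}  E[L_hat]={est:+.3f}  exact={-alpha:+.3f}")
```

With K = 1 every α gives the same expected estimate, −KL[q||p] = −1; as K grows the estimates move up toward −α, illustrating the finite-sample under-estimation.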
Sampling approximation of the VR bound
¤ For α ≤ 1, the K-sample estimate under-estimates the exact VR bound in expectation; increasing K reduces the bias
¤ Relation to IWAE: the sampling estimator with α = 0 is exactly the IWAE bound
VR-max
¤ Optimise the VR bound with the reparameterization trick
¤ α = 1 recovers the VAE (ELBO) gradient
¤ α → −∞: the gradient concentrates on the sample with the largest importance weight
¤ Back-propagating only that sample gives the VR-max algorithm
… to reduce the clutter of notations. Now we apply the reparameterization trick to the VR bound

$$\mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \frac{1}{1-\alpha} \log \mathbb{E}_{\epsilon}\left[\left(\frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right)^{1-\alpha}\right]. \tag{19}$$

Then the gradient of the VR bound w.r.t. φ is

$$\nabla_\phi \mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \mathbb{E}_{\epsilon}\left[w_\alpha(\epsilon; \phi, \mathcal{D})\, \nabla_\phi \log \frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right], \tag{20}$$

where $w_\alpha(\epsilon; \phi, \mathcal{D}) \propto \left(\frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right)^{1-\alpha}$ denotes the normalised importance weight. For finite samples $\epsilon_k \sim p(\epsilon)$, k = 1, ..., K the gradient is approximated by

$$\nabla_\phi \hat{\mathcal{L}}_{\alpha,K}(q_\phi; \mathcal{D}) = \frac{1}{K} \sum_{k=1}^{K} \left[\hat{w}_{\alpha,k}\, \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}\right], \tag{21}$$

with $\hat{w}_{\alpha,k}$ short-hand for $\hat{w}_\alpha(\epsilon_k; \phi, \mathcal{D})$, the normalised importance weight with finite samples. One can show that it recovers the stochastic gradients of $\mathcal{L}_{VI}$ by setting α = 1 in (21):

$$\nabla_\phi \mathcal{L}_{VI}(q_\phi; \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))}, \tag{22}$$

which means the resulting algorithm unifies the computation for all finite α settings.
To speed-up learning, [Burda et al., 2015] suggested back-propagating only one sample ε_j, with j chosen according to the importance weights; Algorithm 1 summarises the resulting procedure.

Algorithm 1: one gradient step for VR-α / VR-max
1: sample ε₁, ..., ε_K ∼ p(ε)
2: for all k = 1, ..., K and n ∈ S (the current minibatch), compute the unnormalised log-weight:
   log ŵ(ε_k; x_n) = log p(g_φ(ε_k), x_n) − log q(g_φ(ε_k)|x_n)
3: choose the sample ε_{j_n} to back-propagate:
   if |α| < +∞: j_n ∼ p_k, where p_k ∝ ŵ(ε_k; x_n)^{1−α}
   if α = −∞: j_n = arg max_k log ŵ(ε_k; x_n)
4: return the gradients to the optimiser:
   (1/|S|) Σ_{n∈S} ∇_φ log ŵ(ε_{j_n}; x_n)
¤ The appendix of [Li et al., 2015] discusses stochastic EP as an approximation
¤ Here: a stochastic approximation that enables minibatch training of the VR bound
¤ Sample M datapoints and replace the full-data likelihood with a subset average
¤ Increasing the minibatch size M reduces the bias
¤ Black-box alpha (BB-α) is related to this approximation of the VR bound
… for VAEs. Note that VR-max does not compute the minimum description length principle (MDL), since MDL approximates the true posterior by minimising the exact VR bound $\mathcal{L}_{-\infty}$ that upper-bounds the exact log-likelihood function.

4.3 Stochastic Approximation for Large-scale Learning

So far we discussed the VR bounds computed on the whole dataset D. However for large datasets full batch learning will be very inefficient. In the appendix of [Li et al., 2015] the authors discussed stochastic EP as a way to approximate the VR bound optimisation for large-scale learning. Here we propose another stochastic approximation method to enable minibatch training, which directly applies to the VR bound.

Using the notation $f_n(\theta) = p(x_n|\theta)$ and defining the "average likelihood" $\bar{f}_{\mathcal{D}}(\theta) = \left[\prod_{n=1}^{N} f_n(\theta)\right]^{\frac{1}{N}}$, the joint distribution can be rewritten as $p(\theta, \mathcal{D}) = p_0(\theta)\, \bar{f}_{\mathcal{D}}(\theta)^N$. Now we sample M datapoints $S = \{x_{n_1}, \ldots, x_{n_M}\} \sim \mathcal{D}$ and define the corresponding "subset average likelihood" $\bar{f}_S(\theta) = \left[\prod_{m=1}^{M} f_{n_m}(\theta)\right]^{\frac{1}{M}}$. When M = 1 we also write $\bar{f}_S(\theta) = f_n(\theta)$ for $S = \{x_n\}$. Then we approximate the VR bound (13) by replacing $\bar{f}_{\mathcal{D}}(\theta)$ with $\bar{f}_S(\theta)$:

$$\tilde{\mathcal{L}}_\alpha(q; S) = \frac{1}{1-\alpha} \log \int q(\theta)^\alpha \left[p_0(\theta)\, \bar{f}_S(\theta)^N\right]^{1-\alpha} \mathrm{d}\theta = \frac{1}{1-\alpha} \log \mathbb{E}_q\left[\left(\frac{p_0(\theta)\, \bar{f}_S(\theta)^N}{q(\theta)}\right)^{1-\alpha}\right]. \tag{23}$$

This returns a stochastic estimate of the evidence lower-bound when α → 1. For other α ≠ 1 settings, increasing the size of the minibatch M = |S| reduces the bias of approximation. This is guaranteed by the following theorem proved in the supplementary.

Theorem 3. If the approximate distribution q(θ) is Gaussian N(µ, Σ), and the likelihood functions have an exponential family form $p(x|\theta) = \exp[\langle \theta, \Phi(x) \rangle - A(\theta)]$, then for α ≤ 1 the stochastic approximation is bounded by …
VR bound experiments
¤ Compared models: VAE (α = 1), IWAE (α = 0), VR-max (α = −∞)
¤ Three datasets
¤ Test log-likelihood evaluated with α = 0, K = 5000
¤ VR-max performs almost identically to IWAE
¤ VR-max is cheaper to train: 25hr29min vs. 61hr16min for IWAE on MNIST
The method was implemented upon the publicly available code¹. Note that the original implementation back-propagates all the samples to compute gradients, while VR-max only back-propagates the sample with the largest importance weight.

Three datasets are considered: Frey Face, Caltech 101 Silhouettes and MNIST. The experiments were … small Frey Face dataset, while the other two were … The model consists of L = 1 or 2 stochastic layers with determin… The network architecture is detailed in the supplementary. For MNIST we used settings from [Burda et al., 2015] … and number of epochs. For the other two datasets the … the VI setting. We reproduced the experiments for … results included in [Burda et al., 2015] mismatches those …

Test log-likelihoods are reported in Table 1 by computing log p(x) ≈ L̂_{α,K}(q; x) with α = 0.0 … We also present some samples from the VR-max trained models … visually almost indistinguishable to IWAE's on all the three datasets. VR-max requires less time to run compared to IWAE with a full backward … a Tesla C2075 GPU, and when trained on MNIST, VR-max and IWAE took 25hr29min and 61hr16min, respectively. We also implemented the single backward pass version …; the best log-likelihood result for IWAE is -85.02, which is slightly worse. This supports the arguments in Section 4.1 that negative α can be … when computation resources are limited.

… the α value corresponding to the tightest VR bound becomes … as the mismatch between q and the true posterior increases. This is the case … q is fitted to approximate the typically multimodal …

Figure 3: Sampled images from the VR-max trained auto-encoders. (a) Frey Face, (b) Caltech 101 Silhouettes, (c) MNIST.

Dataset                  L   K     VAE        IWAE       VR-max
Frey Face                1   5     1322.96    1380.30    1377.40
(± std. err.)                      ±10.03     ±4.60      ±4.59
Caltech 101 Silhouettes  1   5     -119.69    -117.89    -118.01
                             50    -119.61    -117.21    -117.10
MNIST                    1   5     -86.47     -85.41     -85.42
                             50    -86.35     -84.80     -84.81
                         2   5     -85.01     -83.92     -84.04
                             50    -84.78     -83.12     -83.44

Table 1: Average test log-likelihood. Results for VAE on MNIST are collected from [Burda et al., 2015]. IWAE results are reproduced using the publicly available implementation.
¤ VR-max performs comparably to IWAE on all datasets (see Table 1 and Figure 3)
¤ Results on Frey Face use L = 1; MNIST is also evaluated with L = 2
¤ Regression on UCI datasets with Bayesian neural networks
¤ Baselines: VI [Graves, 2011] and PBP [Hernandez-Lobato et al., 2015]
¤ BB-α=BO* results are shown for reference (not directly comparable)
Dataset    VI              PBP             BB-α=BO*        VR-0.5          VR-0.0          VR-max
Boston     -2.903±0.071    -2.574±0.089    -2.549±0.019    -2.457±0.066    -2.468±0.071    -2.469±0.072
Concrete   -3.391±0.017    -3.161±0.019    -3.104±0.015    -3.094±0.016    -3.076±0.018    -3.092±0.018
Energy     -2.391±0.029    -2.042±0.019    -0.945±0.012    -1.401±0.029    -1.418±0.020    -1.389±0.018
Wine       -0.980±0.013    -0.968±0.014    -0.949±0.009    -0.948±0.011    -0.952±0.012    -0.949±0.012
Yacht      -3.439±0.163    -1.634±0.016    -1.102±0.039    -1.816±0.011    -1.829±0.014    -1.817±0.013
Protein    -2.992±0.006    -2.973±0.003    NA±NA           -2.923±0.006    -2.911±0.005    -2.938±0.005
Year       -3.622±NA       -3.603±NA       NA±NA           -3.545±NA       -3.550±NA       -3.542±NA

Table 2: Average test log-likelihood. BB-α=BO results are not directly comparable and are available only for small datasets.
Dataset    VI              PBP             BB-α=BO*        VR-0.5          VR-0.0          VR-max
Boston     4.320±0.291     3.104±0.180     3.160±0.109     2.853±0.154     2.852±0.169     2.837±0.181
Concrete   7.128±0.123     5.667±0.093     5.374±0.074     5.343±0.102     5.237±0.114     5.280±0.104
Energy     2.646±0.081     1.804±0.048     0.600±0.018     0.807±0.059     0.883±0.050     0.791±0.041
Wine       0.646±0.008     0.635±0.007     0.632±0.005     0.640±0.009     0.638±0.008     0.639±0.009
Yacht      6.887±0.674     1.015±0.054     0.902±0.051     1.111±0.082     1.239±0.109     1.117±0.085
Protein    4.842±0.003     4.732±0.013     NA±NA           4.505±0.033     4.436±0.030     4.574±0.023
Year       9.034±NA        8.879±NA        NA±NA           8.942±NA        9.133±NA        8.949±NA

Table 3: Average test error. BB-α=BO results are not directly comparable and are available only for small datasets.
Conclusion
¤ Proposed a variational inference framework based on the Rényi divergence
¤ Unifies VI/VB, EP, BB-α, VAE, and IWAE, and yields the new VR-max algorithm
More Related Content

What's hot

What's hot (20)

Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1
 
Chapter 24 aoa
Chapter 24 aoaChapter 24 aoa
Chapter 24 aoa
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithms
 
DS-MLR: Scaling Multinomial Logistic Regression via Hybrid Parallelism
DS-MLR: Scaling Multinomial Logistic Regression via Hybrid ParallelismDS-MLR: Scaling Multinomial Logistic Regression via Hybrid Parallelism
DS-MLR: Scaling Multinomial Logistic Regression via Hybrid Parallelism
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVM
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2
 
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
 
26 Machine Learning Unsupervised Fuzzy C-Means
26 Machine Learning Unsupervised Fuzzy C-Means26 Machine Learning Unsupervised Fuzzy C-Means
26 Machine Learning Unsupervised Fuzzy C-Means
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
Chapter 26 aoa
Chapter 26 aoaChapter 26 aoa
Chapter 26 aoa
 
Chapter 25 aoa
Chapter 25 aoaChapter 25 aoa
Chapter 25 aoa
 
Chapter 23 aoa
Chapter 23 aoaChapter 23 aoa
Chapter 23 aoa
 
[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination
 
Q-Metrics in Theory And Practice
Q-Metrics in Theory And PracticeQ-Metrics in Theory And Practice
Q-Metrics in Theory And Practice
 

Viewers also liked

(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
Masahiro Suzuki
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
Masahiro Suzuki
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
Masahiro Suzuki
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?
Masahiro Suzuki
 
Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)
Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)
Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)
Ohsawa Goodfellow
 

Viewers also liked (20)

(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
 
深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning
 
(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?
 
(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters
 
[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読
 
Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De...
 Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De... Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De...
Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De...
 
論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...
論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...
論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...
 
Deep learning勉強会20121214ochi
Deep learning勉強会20121214ochiDeep learning勉強会20121214ochi
Deep learning勉強会20121214ochi
 
論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks
論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks
論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks
 
Introduction to "Facial Landmark Detection by Deep Multi-task Learning"
Introduction to "Facial Landmark Detection by Deep Multi-task Learning"Introduction to "Facial Landmark Detection by Deep Multi-task Learning"
Introduction to "Facial Landmark Detection by Deep Multi-task Learning"
 
論文輪読資料「Gated Feedback Recurrent Neural Networks」
論文輪読資料「Gated Feedback Recurrent Neural Networks」論文輪読資料「Gated Feedback Recurrent Neural Networks」
論文輪読資料「Gated Feedback Recurrent Neural Networks」
 
Deep Learning を実装する
Deep Learning を実装するDeep Learning を実装する
Deep Learning を実装する
 
JSAI's AI Tool Introduction - Deep Learning, Pylearn2 and Torch7
JSAI's AI Tool Introduction - Deep Learning, Pylearn2 and Torch7JSAI's AI Tool Introduction - Deep Learning, Pylearn2 and Torch7
JSAI's AI Tool Introduction - Deep Learning, Pylearn2 and Torch7
 
Deep Learning 勉強会 (Chapter 7-12)
Deep Learning 勉強会 (Chapter 7-12)Deep Learning 勉強会 (Chapter 7-12)
Deep Learning 勉強会 (Chapter 7-12)
 
Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)
Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)
Learning Deep Architectures for AI (第 3 回 Deep Learning 勉強会資料; 松尾)
 
生成モデルの Deep Learning
生成モデルの Deep Learning生成モデルの Deep Learning
生成モデルの Deep Learning
 

Similar to (DL hacks輪読) Variational Inference with Rényi Divergence

Boolean Programs and Quantified Propositional Proof System -
Boolean Programs and Quantified Propositional Proof System - Boolean Programs and Quantified Propositional Proof System -
Boolean Programs and Quantified Propositional Proof System -
Michael Soltys
 
Lecture 3 qualtifed rules of inference
Lecture 3 qualtifed rules of inferenceLecture 3 qualtifed rules of inference
Lecture 3 qualtifed rules of inference
asimnawaz54
 

Similar to (DL hacks輪読) Variational Inference with Rényi Divergence (20)

Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
 
Bayesian Deep Learning
Bayesian Deep LearningBayesian Deep Learning
Bayesian Deep Learning
 
FPDE presentation
FPDE presentationFPDE presentation
FPDE presentation
 
Divergence center-based clustering and their applications
Divergence center-based clustering and their applicationsDivergence center-based clustering and their applications
Divergence center-based clustering and their applications
 
Variational Inference
Variational InferenceVariational Inference
Variational Inference
 
Divergence clustering
Divergence clusteringDivergence clustering
Divergence clustering
 
Variational Bayes: A Gentle Introduction
Variational Bayes: A Gentle IntroductionVariational Bayes: A Gentle Introduction
Variational Bayes: A Gentle Introduction
 
A Probabilistic Algorithm for Computation of Polynomial Greatest Common with ...
A Probabilistic Algorithm for Computation of Polynomial Greatest Common with ...A Probabilistic Algorithm for Computation of Polynomial Greatest Common with ...
A Probabilistic Algorithm for Computation of Polynomial Greatest Common with ...
 
Cs229 notes8
Cs229 notes8Cs229 notes8
Cs229 notes8
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
Boolean Programs and Quantified Propositional Proof System -
Boolean Programs and Quantified Propositional Proof System - Boolean Programs and Quantified Propositional Proof System -
Boolean Programs and Quantified Propositional Proof System -
 
Introduction to modern Variational Inference.
Introduction to modern Variational Inference.Introduction to modern Variational Inference.
Introduction to modern Variational Inference.
 
Variational autoencoders for speech processing d.bielievtsov dataconf 21 04 18
Variational autoencoders for speech processing d.bielievtsov dataconf 21 04 18Variational autoencoders for speech processing d.bielievtsov dataconf 21 04 18
Variational autoencoders for speech processing d.bielievtsov dataconf 21 04 18
 
Lecture 3 qualtifed rules of inference
Lecture 3 qualtifed rules of inferenceLecture 3 qualtifed rules of inference
Lecture 3 qualtifed rules of inference
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Locality-sensitive hashing for search in metric space
Locality-sensitive hashing for search in metric space Locality-sensitive hashing for search in metric space
Locality-sensitive hashing for search in metric space
 
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 On Clustering Histograms with k-Means by Using Mixed α-Divergences On Clustering Histograms with k-Means by Using Mixed α-Divergences
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 
Value Function Geometry and Gradient TD
Value Function Geometry and Gradient TDValue Function Geometry and Gradient TD
Value Function Geometry and Gradient TD
 
2 borgs
2 borgs2 borgs
2 borgs
 

More from Masahiro Suzuki (7)

深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)
 
確率的推論と行動選択
確率的推論と行動選択確率的推論と行動選択
確率的推論と行動選択
 
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
 
深層生成モデルと世界モデル
深層生成モデルと世界モデル深層生成モデルと世界モデル
深層生成モデルと世界モデル
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究について
 
GAN(と強化学習との関係)
GAN(と強化学習との関係)GAN(と強化学習との関係)
GAN(と強化学習との関係)
 
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...

(DL hacks reading group) Variational Inference with Rényi Divergence

  • 1. Variational Inference with Rényi Divergence D1
  • 2. About the paper
¤ On arXiv (2/6)
¤ By Yingzhen Li and Richard E. Turner, University of Cambridge
¤ Li (a D3 student) is also an author of "Stochastic Expectation Propagation" (NIPS)
¤ Generalises variational inference via the Rényi divergence
¤ Includes the VAE and the importance weighted AE [Burda et al., 2015] as special cases
¤ Proofs are given in the Appendix
  • 3. Background: variational inference (cf. PRML)
¤ The log marginal likelihood decomposes into a lower bound plus a KL term:

\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \| p)

¤ Since the KL term is non-negative, maximising the lower bound \mathcal{L}(q) is equivalent to minimising \mathrm{KL}(q \| p).
  • 4. Related work
¤ Scalable approximate inference:
  ¤ SVI [Hoffman et al., 2013]
  ¤ SEP [Li et al., 2015]
¤ Black-box methods:
  ¤ black-box VI [Ranganath et al., 2014]
  ¤ black-box alpha (BB-α) [Hernandez-Lobato et al., 2015]
¤ Deep generative models:
  ¤ importance weighted AE (IWAE) [Burda et al., 2015], an extension of the VAE (ICLR 2016)
  • 5. Variational inference (review)
¤ True posterior p(\theta|\mathcal{D}), approximation q(\theta)
¤ q is fitted by minimising the KL divergence \mathrm{KL}[q(\theta) \| p(\theta|\mathcal{D})]
¤ Consider i.i.d. samples \mathcal{D} = \{x_n\}_{n=1}^N from a model p(x|\theta), with \theta drawn from a prior p_0(\theta). Bayesian inference computes the posterior

p(\theta|\mathcal{D}) = \frac{p(\theta, \mathcal{D})}{p(\mathcal{D})} = \frac{p_0(\theta) \prod_{n=1}^N p(x_n|\theta)}{p(\mathcal{D})},

where p(\mathcal{D}) = \int p_0(\theta) \prod_{n=1}^N p(x_n|\theta)\, d\theta is the marginal likelihood or model evidence. For powerful models, including Bayesian neural networks, the true posterior is typically intractable, so variational inference introduces an approximation q(\theta) in a tractable family \mathcal{Q}:

q(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(\theta) \| p(\theta|\mathcal{D})].   (8)

The KL divergence in (8) is itself intractable, mainly because of the difficult term p(\mathcal{D}), so VI sidesteps this by solving the equivalent optimisation problem

q(\theta) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}_{VI}(q; \mathcal{D}),   (9)

where the variational lower bound (ELBO) is

\mathcal{L}_{VI}(q; \mathcal{D}) = \log p(\mathcal{D}) - \mathrm{KL}[q(\theta) \| p(\theta|\mathcal{D})] = \mathbb{E}_q\!\left[\log \frac{p(\theta, \mathcal{D})}{q(\theta)}\right].   (10)
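As a concrete illustration of (10), here is a minimal NumPy sketch that estimates the ELBO by Monte Carlo for a 1-D conjugate toy model; the model choice and all function names are my own assumptions for illustration, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(theta, data):
    # log p0(theta) + sum_n log p(x_n | theta), for the toy model
    # p0(theta) = N(0, 1), p(x | theta) = N(theta, 1)
    log_prior = -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)
    log_lik = np.sum(
        -0.5 * (data[None, :] - theta[:, None])**2 - 0.5 * np.log(2 * np.pi),
        axis=1)
    return log_prior + log_lik

def elbo(mu, sigma, data, K=2000):
    # L_VI(q; D) = E_q[log p(theta, D) - log q(theta)], with q = N(mu, sigma^2)
    theta = rng.normal(mu, sigma, size=K)
    log_q = (-0.5 * ((theta - mu) / sigma)**2
             - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    return np.mean(log_joint(theta, data) - log_q)

data = rng.normal(1.0, 1.0, size=20)
print(elbo(mu=0.9, sigma=0.2, data=data))  # always <= log p(D)
```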
  • 6. VAE [Kingma et al., 2014]
¤ A (deep) generative model that parametrises the variational approximation with a recognition network
¤ Hierarchical latent variable model (parameters \theta are dropped from the notation but are learned by approximate maximum likelihood):

p(x) = \sum_{h^{(1)}, \dots, h^{(L)}} p(h^{(L)})\, p(h^{(L-1)}|h^{(L)}) \cdots p(x|h^{(1)})   (14)

Exact computation of \log p(x) requires marginalising all hidden variables and is typically intractable. Variational EM methods come to the rescue by approximating

\log p(x) \approx \mathcal{L}_{VI}(q; x) = \mathbb{E}_{q(h|x)}\!\left[\log \frac{p(x, h)}{q(h|x)}\right],   (15)

where h collects the hidden variables h^{(1)}, \dots, h^{(L)} and the approximate posterior factorises as

q(h|x) = q(h^{(1)}|x)\, q(h^{(2)}|h^{(1)}) \cdots q(h^{(L)}|h^{(L-1)}).   (16)

¤ Variational EM alternates the optimisation of q and p to guarantee convergence. The core idea of the VAE is to optimise p and q jointly, which has no guarantee of increasing the MLE objective at each iteration and is in fact biased [Turner and Sahani, 2011]. This opens the possibility that alternative surrogates return tighter estimates, and the VR bound is considered in this context:

\mathcal{L}_\alpha(q; x) = \frac{1}{1-\alpha} \log \mathbb{E}_{q(h|x)}\!\left[\left(\frac{p(x, h)}{q(h|x)}\right)^{1-\alpha}\right].   (17)

¤ In the IWAE paper's setup: the prior p(h^L) is a zero-mean, unit-variance Gaussian; each conditional p(h^\ell|h^{\ell+1}) and q(h^\ell|h^{\ell-1}) is a diagonal-covariance Gaussian whose mean and covariance are computed by a deterministic feed-forward network; p(x|h^1) is Gaussian for real-valued observations and Bernoulli for binary observations. L counts only the stochastic hidden layers, and sampling and probability evaluation are assumed tractable for every factor.
¤ The VAE maximises a variational lower bound on the log-likelihood, derived from Jensen's inequality:

\log p(x) = \log \mathbb{E}_{q(h|x)}\!\left[\frac{p(x, h)}{q(h|x)}\right] \ge \mathbb{E}_{q(h|x)}\!\left[\log \frac{p(x, h)}{q(h|x)}\right] = \mathcal{L}(x).

Since \mathcal{L}(x) = \log p(x) - \mathrm{KL}(q(h|x) \| p(h|x)), training trades off reconstruction against how well q matches the true posterior.
  • 7. VAE: the reparameterization trick
¤ The recognition distribution is reparameterised in terms of fixed base distributions, so that samples from the recognition model become deterministic functions of the inputs and auxiliary variables (the Gaussian special case is shown here; the general trick also applies)
¤ Each q(h^\ell | h^{\ell-1}, \theta) is a Gaussian whose mean and covariance are computed from the previous layer's state and the model parameters. Sampling can be expressed by first drawing \epsilon^\ell \sim N(0, I) and then applying the deterministic mapping

h^\ell(\epsilon^\ell, h^{\ell-1}, \theta) = \Sigma(h^{\ell-1}, \theta)^{1/2} \epsilon^\ell + \mu(h^{\ell-1}, \theta).   (4)

Applying (4) layer by layer expresses the joint recognition distribution q(h|x, \theta) as a deterministic mapping h(\epsilon, x, \theta) with \epsilon = (\epsilon^1, \dots, \epsilon^L). Since the distribution of \epsilon does not depend on \theta, the gradient of the bound \mathcal{L}(x) can be reformulated by pushing the gradient operator inside the expectation:

\nabla_\theta \mathcal{L}(x) = \nabla_\theta \mathbb{E}_{\epsilon \sim N(0,I)}\!\left[\log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right]   (5)
= \mathbb{E}_{\epsilon \sim N(0,I)}\!\left[\nabla_\theta \log \frac{p(x, h(\epsilon, x, \theta)|\theta)}{q(h(\epsilon, x, \theta)|x, \theta)}\right].   (6)

For fixed \epsilon the gradient inside the expectation is computed by standard backpropagation. In practice one approximates (6) by drawing k samples of \epsilon and applying the Monte Carlo estimator

\frac{1}{k} \sum_{i=1}^k \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta),   (7)

with w(x, h, \theta) = p(x, h|\theta) / q(h|x, \theta). This is an unbiased estimate of \nabla_\theta \mathcal{L}(x); it has the same expectation as the REINFORCE-like update but tends to have much lower variance in practice.
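To make Eqns. (4)-(7) concrete, here is a minimal sketch of a pathwise gradient for a single Gaussian latent variable. The toy model (p(h) = N(0,1), p(x|h) = N(h,1)) and all names are my own assumptions; a real implementation would let an autodiff framework compute the inner gradient instead of deriving it by hand.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_mu(mu, log_sigma, x, k=64):
    # Pathwise (reparameterized) estimate of d/dmu E_q[log w] for the toy
    # model p(h)=N(0,1), p(x|h)=N(h,1), q(h|x)=N(mu, sigma^2).
    # With h = mu + sigma * eps, log q(h(eps)|x) does not depend on mu,
    # so estimator (7) reduces to an average of d log p(x,h)/dh.
    eps = rng.normal(size=k)             # eps ~ N(0, I)
    h = mu + np.exp(log_sigma) * eps     # deterministic map, Eqn. (4)
    dlogp_dh = -h + (x - h)              # d/dh [log N(h;0,1) + log N(x;h,1)]
    return dlogp_dh.mean()               # Monte Carlo estimator, Eqn. (7)

print(grad_mu(mu=0.0, log_sigma=np.log(0.5), x=1.0))
```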
  • 8. VAE: the SGVB estimator
¤ Assume an approximate posterior q_\phi(z|x) (the unconditioned case q_\phi(z) works the same way; a fully Bayesian treatment of the parameters is in the appendix of the VAE paper). Under mild conditions, the random variable \tilde{z} \sim q_\phi(z|x) can be reparameterised via a differentiable transform of auxiliary noise:

\tilde{z} = g_\phi(\epsilon, x), \quad \epsilon \sim p(\epsilon)   (4)

Monte Carlo estimates of expectations of a function f(z) under q_\phi(z|x) then become

\mathbb{E}_{q_\phi(z|x^{(i)})}[f(z)] \simeq \frac{1}{L} \sum_{l=1}^L f(g_\phi(\epsilon^{(l)}, x^{(i)})), \quad \epsilon^{(l)} \sim p(\epsilon).   (5)

Applied to the variational lower bound, this yields the generic Stochastic Gradient Variational Bayes (SGVB) estimator

\tilde{\mathcal{L}}^A(\theta, \phi; x^{(i)}) = \frac{1}{L} \sum_{l=1}^L \left[\log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)}|x^{(i)})\right], \quad z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}).   (6)

¤ Often the KL divergence D_{KL}(q_\phi(z|x^{(i)}) \| p_\theta(z)) can be integrated analytically, so that only the expected reconstruction error \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] must be estimated by sampling. The KL term then acts as a regulariser, encouraging the approximate posterior to stay close to the prior, and gives a second, typically lower-variance estimator:

\tilde{\mathcal{L}}^B(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z|x^{(i)}) \| p_\theta(z)) + \frac{1}{L} \sum_{l=1}^L \log p_\theta(x^{(i)}|z^{(i,l)}).   (7)

¤ With minibatches of M datapoints drawn from a dataset X of N points, the full-dataset bound is estimated as

\mathcal{L}(\theta, \phi; X) \simeq \tilde{\mathcal{L}}^M(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^M \tilde{\mathcal{L}}(\theta, \phi; x^{(i)}),   (8)

and the resulting gradients are used with SGD or Adagrad [DHS10]. Experimentally, L = 1 sample per datapoint suffices when M is large enough (e.g. M = 100). Viewed as an auto-encoder, the first term of (7) regularises while the second is an expected negative reconstruction error.
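A minimal sketch of the second SGVB estimator (7), with the Gaussian-prior KL computed analytically; the decoder log-likelihood here is a stand-in I made up, not the paper's network.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl_to_std_normal(mu, log_var):
    # analytic KL[N(mu, diag(exp(log_var))) || N(0, I)]
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def sgvb_b(x, mu, log_var, log_px_given_z, L=1):
    # \tilde{L}^B = -KL(q||p) + (1/L) sum_l log p(x | z^(l)),
    # with z^(l) = mu + sigma * eps^(l) (reparameterization)
    recon = 0.0
    for _ in range(L):
        eps = rng.normal(size=mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps
        recon += log_px_given_z(x, z)
    return -kl_to_std_normal(mu, log_var) + recon / L

# stand-in Gaussian "decoder": log p(x|z) = log N(x; z, I)
log_px = lambda x, z: np.sum(-0.5 * (x - z)**2 - 0.5 * np.log(2 * np.pi))
x = np.ones(3)
print(sgvb_b(x, mu=np.zeros(3), log_var=np.zeros(3), log_px_given_z=log_px, L=5))
```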
  • 9. Importance weighted AE (IWAE) [Burda et al., 2015]
¤ The VAE criterion may be too strict: it wants a posterior that is approximately factorial and predictable by a feed-forward network. A recognition network that places only a small fraction of its samples in the high-posterior region may still suffice for accurate inference; relaxing the standard gives extra flexibility to train generative networks whose posteriors do not fit the VAE assumptions.
¤ The IWAE uses the same architecture as the VAE (a generative network plus a recognition network) but maximises the k-sample importance-weighted lower bound on the log-likelihood:

\mathcal{L}_k(x) = \mathbb{E}_{h_1, \dots, h_k \sim q(h|x)}\!\left[\log \frac{1}{k} \sum_{i=1}^k \frac{p(x, h_i)}{q(h_i|x)}\right],

where h_1, \dots, h_k are sampled independently from the recognition model and w_i = p(x, h_i)/q(h_i|x) are the unnormalised importance weights.
¤ This is a lower bound on \log p(x) by Jensen's inequality, since the average importance weight is an unbiased estimator of p(x):

\mathcal{L}_k = \mathbb{E}\!\left[\log \frac{1}{k} \sum_{i=1}^k w_i\right] \le \log \mathbb{E}\!\left[\frac{1}{k} \sum_{i=1}^k w_i\right] = \log p(x).

¤ Theorem 1: for all k, \log p(x) \ge \mathcal{L}_{k+1} \ge \mathcal{L}_k; and if p(h, x)/q(h|x) is bounded, \mathcal{L}_k \to \log p(x) as k \to \infty (proof in Appendix A).
¤ k = 1 recovers the VAE; increasing k tightens the bound.
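Given the log-importance-weights \log w_i, the k-sample bound is just a log-mean-exp; a minimal sketch (the function name and toy numbers are mine):

```python
import numpy as np
from scipy.special import logsumexp

def iwae_bound(log_w):
    # one-draw estimate of L_k(x) from k log-weights,
    # log w_i = log p(x, h_i) - log q(h_i | x)
    k = log_w.shape[-1]
    return logsumexp(log_w, axis=-1) - np.log(k)

log_w = np.array([-3.0, -1.2, -2.5, -0.8])  # toy log-weights, k = 4
print(iwae_bound(log_w))                     # k-sample bound
print(log_w.mean())                          # VAE-style average of log w, looser
```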
  • 10. Rényi's α-divergence
¤ For two distributions p and q on a random variable \theta \in \Theta, and \alpha > 0, \alpha \ne 1:

D_\alpha[p \| q] = \frac{1}{\alpha - 1} \log \int p(\theta)^\alpha q(\theta)^{1-\alpha} d\theta.   (1)

For \alpha > 1 the definition is valid when it is finite, and for discrete random variables the integration is replaced by summation.
¤ \alpha \to 1 recovers the Kullback-Leibler (KL) divergence, which plays a crucial role in machine learning and information theory:

D_1[p \| q] = \lim_{\alpha \to 1} D_\alpha[p \| q] = \int p(\theta) \log \frac{p(\theta)}{q(\theta)} d\theta = \mathrm{KL}[p \| q].

¤ The values \alpha = 0, +\infty are defined by continuity in \alpha:

D_0[p \| q] = -\log \int_{p(\theta) > 0} q(\theta)\, d\theta, \qquad D_{+\infty}[p \| q] = \log \max_{\theta \in \Theta} \frac{p(\theta)}{q(\theta)}.

¤ Another special case is \alpha = 1/2, where the divergence is a function of the squared Hellinger distance \mathrm{Hel}^2[p \| q] = \frac{1}{2} \int (\sqrt{p(\theta)} - \sqrt{q(\theta)})^2 d\theta:

D_{1/2}[p \| q] = -2 \log(1 - \mathrm{Hel}^2[p \| q]).

¤ [Van Erven and Harremoës, 2014] also extend definition (1) to negative \alpha, where it is non-positive and thus no longer a valid divergence measure.
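A quick numeric check of definition (1) for the two 2-D Gaussians that appear later in Figure 2(b). The equal-covariance closed form D_\alpha = (\alpha/2)\,\|\mu_p - \mu_q\|^2 is a standard result, and the Monte Carlo estimator follows from rewriting (1) as D_\alpha[p\|q] = \frac{1}{\alpha-1} \log \mathbb{E}_p[(p/q)^{\alpha-1}]; the sketch and its names are my own.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(3)
mu_p, mu_q = np.array([0.0, 0.0]), np.array([1.0, 1.0])
p, q = mvn(mu_p, np.eye(2)), mvn(mu_q, np.eye(2))

def renyi_mc(alpha, n=200_000):
    # D_alpha[p||q] = 1/(alpha-1) * log E_p[(p/q)^(alpha-1)]
    x = p.rvs(size=n, random_state=rng)
    m = (alpha - 1.0) * (p.logpdf(x) - q.logpdf(x))
    return (logsumexp(m) - np.log(n)) / (alpha - 1.0)

alpha = 0.5
print(0.5 * alpha * np.sum((mu_p - mu_q)**2))  # closed form = 0.5
print(renyi_mc(alpha))                          # Monte Carlo, should agree
```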
  • 11. The variational Rényi (VR) bound
¤ VI minimises \mathrm{KL}[q(\theta) \| p(\theta|\mathcal{D})], and the family of Rényi divergences includes the KL divergence: can the variational free-energy approach be generalised to the Rényi case?
¤ Consider approximating the exact posterior by minimising Rényi's α-divergence for some selected \alpha \ge 0:

q(\theta) = \arg\min_{q \in \mathcal{Q}} D_\alpha[q(\theta) \| p(\theta|\mathcal{D})],

which is readily verified to be equivalent to

q(\theta) = \arg\max_{q \in \mathcal{Q}} \log p(\mathcal{D}) - D_\alpha[q(\theta) \| p(\theta|\mathcal{D})].

¤ For \alpha \ne 1 the objective can be rewritten as

\log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \int q(\theta)^\alpha p(\theta|\mathcal{D})^{1-\alpha} d\theta
= \log p(\mathcal{D}) - \frac{1}{\alpha - 1} \log \mathbb{E}_q\!\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)\, p(\mathcal{D})}\right)^{1-\alpha}\right]
= \frac{1}{1 - \alpha} \log \mathbb{E}_q\!\left[\left(\frac{p(\theta, \mathcal{D})}{q(\theta)}\right)^{1-\alpha}\right] =: \mathcal{L}_\alpha(q; \mathcal{D}).

¤ This new objective is called the variational Rényi (VR) bound.
  • 12. Monte Carlo estimation of the VR bound
¤ Variational free-energy methods sidestep intractabilities in a class of models, and recent work uses Monte Carlo approximations to expand the set of models that can be handled. This section develops a scalable optimisation framework for the VR bound by extending recent advances in traditional VI, with black-box methods enabling application to arbitrary finite α settings.
¤ A simple Monte Carlo method uses finite samples \theta_k \sim q(\theta), k = 1, \dots, K to approximate \mathcal{L}_\alpha:

\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D}) = \frac{1}{1 - \alpha} \log \frac{1}{K} \sum_{k=1}^K \left[\left(\frac{p(\theta_k, \mathcal{D})}{q(\theta_k)}\right)^{1-\alpha}\right].

¤ Unlike in traditional VI, this estimate is biased, since the expectation over q(\theta) sits inside the logarithm. The bias can be bounded by the following results, proved in the supplementary:
¤ Theorem 2: \mathbb{E}_{\{\theta_k\}}[\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D})], as a function of \alpha and K, is (1) non-decreasing in K for fixed \alpha, converging to \mathcal{L}_\alpha as K \to +\infty if |p/q| is bounded; (2) continuous and non-increasing in \alpha wherever |\mathcal{L}_\alpha| < +\infty.
¤ Corollary 1: for K < +\infty there exists \alpha_K < 0 such that \mathbb{E}[\hat{\mathcal{L}}_{\alpha,K}(q; \mathcal{D})] \le \log p(\mathcal{D}) for all \alpha \ge \alpha_K; furthermore \alpha_K is non-decreasing in K, with \lim_{K \to 1} \alpha_K = -\infty and \lim_{K \to +\infty} \alpha_K = 0.

[Figure 2: (a) bounding properties of sampling approximations to the VR bounds, for \alpha_2 < 0 < \alpha_1 < 1 and 1 < K_1 < K_2 < +\infty; (b) the bias of the sampling estimate of the (negative) alpha divergence, for 2-D Gaussians p, q with identity covariance and \mu_p = [0, 0], \mu_q = [1, 1].]

¤ By definition the exact VR bound lower-bounds \log p(\mathcal{D}) for \alpha > 0 and upper-bounds it for \alpha < 0. For \alpha \le 1 the sampling approximation \hat{\mathcal{L}}_{\alpha,K} in expectation under-estimates the exact bound \mathcal{L}_\alpha; the approximation improves with more samples, so for finite samples, negative α values can be used to improve the accuracy of the approximation.
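The estimator itself is a one-liner given the log-weights; a minimal sketch (my own naming), which also takes the continuous limit at \alpha = 1 to recover the ELBO average:

```python
import numpy as np
from scipy.special import logsumexp

def vr_bound_hat(log_w, alpha):
    # \hat{L}_{alpha,K} = 1/(1-alpha) * log[(1/K) * sum_k w_k^(1-alpha)],
    # with log w_k = log p(theta_k, D) - log q(theta_k)
    log_w = np.asarray(log_w, dtype=float)
    K = log_w.shape[-1]
    if np.isclose(alpha, 1.0):       # continuous limit: the ELBO estimate
        return log_w.mean(axis=-1)
    return (logsumexp((1.0 - alpha) * log_w, axis=-1) - np.log(K)) / (1.0 - alpha)

log_w = np.log(np.array([0.2, 1.5, 0.7, 3.0]))  # toy log-weights, K = 4
for a in (1.0, 0.5, 0.0, -2.0):
    print(a, vr_bound_hat(log_w, a))            # non-increasing in alpha
```

The printed values are non-increasing in α, mirroring the monotonicity in Theorem 2 (here it holds deterministically, by the power-mean inequality).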
  • 13. VR bounds: exact vs. approximated
[Figure 2(a): sampling-approximated VR bounds, for \alpha_2 < 0 < \alpha_1 < 1 and 1 < K_1 < K_2 < +\infty.]
¤ For \alpha \le 1, the K-sample estimate \hat{\mathcal{L}}_{\alpha,K} in expectation under-estimates the exact bound \mathcal{L}_\alpha, and the gap shrinks as K grows.
  • 14. Special case of the VR bound: IWAE
¤ Setting \alpha = 0 in the sampled VR bound recovers the IWAE objective: \hat{\mathcal{L}}_{0,K} = \log \frac{1}{K} \sum_k w_k.
[Figure 2(b): simulated bias of the sampling estimate of the (negative) alpha divergence for the 2-D Gaussian example.]
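Reusing vr_bound_hat and iwae_bound from the sketches above, the correspondence can be checked numerically:

```python
import numpy as np

log_w = np.log(np.array([0.2, 1.5, 0.7, 3.0]))  # toy log-importance-weights
print(vr_bound_hat(log_w, alpha=0.0))  # = log (1/K) sum_k w_k, the IWAE bound
print(iwae_bound(log_w))               # same value
print(vr_bound_hat(log_w, alpha=1.0))  # ELBO-style average of log w (lower)
```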
  • 15. VR-α and VR-max
¤ Apply the reparameterization trick \theta = g_\phi(\epsilon) to the VR bound:

\mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \frac{1}{1 - \alpha} \log \mathbb{E}_\epsilon\!\left[\left(\frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right)^{1-\alpha}\right].   (19)

¤ The gradient of the VR bound w.r.t. \phi is

\nabla_\phi \mathcal{L}_\alpha(q_\phi; \mathcal{D}) = \mathbb{E}_\epsilon\!\left[w_\alpha(\epsilon; \phi, \mathcal{D})\, \nabla_\phi \log \frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right],   (20)

where w_\alpha(\epsilon; \phi, \mathcal{D}) \propto \left(\frac{p(g_\phi, \mathcal{D})}{q(g_\phi)}\right)^{1-\alpha} is the normalised importance weight. For finite samples \epsilon_k \sim p(\epsilon), k = 1, \dots, K the gradient is approximated by

\nabla_\phi \hat{\mathcal{L}}_{\alpha,K}(q_\phi; \mathcal{D}) = \sum_{k=1}^K \hat{w}_{\alpha,k}\, \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))},   (21)

with \hat{w}_{\alpha,k} = w_k^{1-\alpha} / \sum_j w_j^{1-\alpha} the self-normalised importance weight over the finite samples.
¤ Setting \alpha = 1 makes the weights uniform (\hat{w}_{1,k} = 1/K), so (21) recovers the stochastic gradient of \mathcal{L}_{VI} (the VAE case):

\nabla_\phi \mathcal{L}_{VI}(q_\phi; \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^K \nabla_\phi \log \frac{p(g_\phi(\epsilon_k), \mathcal{D})}{q(g_\phi(\epsilon_k))},   (22)

which means the resulting algorithm unifies the computation for all finite α settings.
¤ To speed up learning, [Burda et al., 2015] suggested back-propagating only one sample \epsilon_j. Taking \alpha \to -\infty concentrates the weight on the sample with the largest importance weight: this is VR-max.

Algorithm 1 (one gradient step for VR-α / VR-max):
1: sample \epsilon_1, \dots, \epsilon_K \sim p(\epsilon)
2: for all k = 1, \dots, K and n \in S (the current minibatch), compute the log-weights \log \hat{w}(\epsilon_k; x_n) = \log p(g_\phi(\epsilon_k), x_n) - \log q(g_\phi(\epsilon_k))
3: choose the sample \epsilon_{j_n} to back-propagate:
   if |\alpha| < \infty: j_n \sim p_k, where p_k \propto \hat{w}(\epsilon_k; x_n)^{1-\alpha}
   if \alpha = -\infty: j_n = \arg\max_k \log \hat{w}(\epsilon_k; x_n)
4: return the gradients \frac{1}{|S|} \sum_{n \in S} \nabla_\phi \log \hat{w}(\epsilon_{j_n}; x_n)
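Step 3 of Algorithm 1 is the only part that changes with α; a minimal sketch of the sample-selection rule (function name and toy numbers are mine):

```python
import numpy as np

rng = np.random.default_rng(4)

def choose_backprop_sample(log_w, alpha):
    # Algorithm 1, step 3: pick the single epsilon_j to back-propagate.
    # Finite alpha: sample j with probability p_k ∝ w_k^(1-alpha);
    # alpha = -inf (VR-max): deterministically take the argmax weight.
    log_w = np.asarray(log_w, dtype=float)
    if np.isneginf(alpha):
        return int(np.argmax(log_w))
    logits = (1.0 - alpha) * log_w
    p = np.exp(logits - logits.max())    # stable softmax over w^(1-alpha)
    p /= p.sum()
    return int(rng.choice(len(log_w), p=p))

log_w = np.log(np.array([0.2, 1.5, 0.7, 3.0]))
print(choose_backprop_sample(log_w, alpha=0.0))      # IWAE-style resampling
print(choose_backprop_sample(log_w, alpha=-np.inf))  # VR-max: index 3
```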
  • 16. Stochastic approximation for large-scale learning
¤ The VR bounds so far are computed on the whole dataset \mathcal{D}, so full-batch learning would be very inefficient for large datasets. The appendix of [Li et al., 2015] discusses stochastic EP as a way of approximating the VR bound optimisation at scale; here another stochastic approximation enables minibatch training and applies directly to the VR bound.
¤ Write f_n(\theta) = p(x_n|\theta) and define the "average likelihood" \bar{f}_{\mathcal{D}}(\theta) = [\prod_{n=1}^N f_n(\theta)]^{1/N}, so the joint distribution can be rewritten as p(\theta, \mathcal{D}) = p_0(\theta)\, \bar{f}_{\mathcal{D}}(\theta)^N. Now sample M datapoints S = \{x_{n_1}, \dots, x_{n_M}\} \sim \mathcal{D} and define the corresponding "subset average likelihood" \bar{f}_S(\theta) = [\prod_{m=1}^M f_{n_m}(\theta)]^{1/M} (for M = 1, \bar{f}_S(\theta) = f_n(\theta) with S = \{x_n\}). The VR bound (13) is approximated by replacing \bar{f}_{\mathcal{D}} with \bar{f}_S:

\tilde{\mathcal{L}}_\alpha(q; S) = \frac{1}{1 - \alpha} \log \int q(\theta)^\alpha \left(p_0(\theta)\, \bar{f}_S(\theta)^N\right)^{1-\alpha} d\theta = \frac{1}{1 - \alpha} \log \mathbb{E}_q\!\left[\left(\frac{p_0(\theta)\, \bar{f}_S(\theta)^N}{q(\theta)}\right)^{1-\alpha}\right].   (23)

¤ This returns a stochastic estimate of the evidence lower bound as \alpha \to 1. For other \alpha \ne 1 settings, increasing the minibatch size M = |S| reduces the bias of the approximation; Theorem 3 (proved in the supplementary) bounds the stochastic approximation when q(\theta) is Gaussian N(\mu, \Sigma), the likelihood has exponential-family form p(x|\theta) = \exp[\langle\theta, \phi(x)\rangle - A(\theta)], and \alpha \le 1.
¤ Black-box alpha (BB-α) connects to the VR bound within this framework.
¤ Note that VR-max is not an instance of the minimum description length principle (MDL): MDL approximates the true posterior by minimising the exact VR bound \mathcal{L}_{-\infty}, which upper-bounds the exact log-likelihood function.
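A minimal sketch of the minibatch energy (23): the subset average likelihood enters only through a rescaled sum of per-datapoint log-likelihoods. Array shapes and all names are my assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def vr_bound_minibatch(log_lik_S, log_prior, log_q, N, alpha):
    # log w_k = log p0(theta_k) + (N/M) * sum_{x in S} log p(x|theta_k)
    #           - log q(theta_k),  since  log f_S(theta)^N = (N/M) * sum log f
    K, M = log_lik_S.shape           # K samples theta_k ~ q, M-point minibatch S
    log_w = log_prior + (N / M) * log_lik_S.sum(axis=1) - log_q
    if np.isclose(alpha, 1.0):       # alpha -> 1: stochastic ELBO estimate
        return log_w.mean()
    return (logsumexp((1.0 - alpha) * log_w) - np.log(K)) / (1.0 - alpha)

# toy shapes: K = 8 posterior samples, minibatch of M = 4 from N = 100 points
rng = np.random.default_rng(5)
print(vr_bound_minibatch(rng.normal(-1.0, 0.1, (8, 4)),
                         log_prior=rng.normal(-2.0, 0.1, 8),
                         log_q=rng.normal(-1.5, 0.1, 8), N=100, alpha=0.5))
```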
  • 17. VR
  • 19. Experiment 1: generative modelling
¤ Three datasets: Frey Face, Caltech 101 Silhouettes, MNIST
¤ Compared methods: VAE (α = 1), IWAE (α = 0), VR-max (α = −∞)
¤ Test log-likelihood evaluated as \log p(x) \approx \hat{\mathcal{L}}_{\alpha,K}(q; x) with α = 0, K = 5000
¤ VR-max was implemented on top of the publicly available IWAE code; the original implementation back-propagates all samples to compute gradients, while VR-max back-propagates only the sample with the largest importance weight
¤ Networks have L = 1 or 2 stochastic layers with deterministic layers in between (architecture in the supplementary); MNIST uses the settings of [Burda et al., 2015] (learning rate, number of epochs), the other two datasets use the VI setting
¤ VR-max matches IWAE quality while needing far less run time: training on MNIST on a single Tesla C2075 GPU took 25hr29min for VR-max vs. 61hr16min for IWAE with a full backward pass; a single-backward-pass IWAE variant scored -85.02, slightly worse, supporting the argument of Section 4.1 that negative α helps when computational resources are limited
¤ The α giving the tightest VR bound shifts as the mismatch grows between q and the typically multimodal true posterior
¤ Samples from VR-max-trained models are almost indistinguishable from IWAE's on all three datasets
  • 20. Experiment 1: results
[Figure 3: sampled images from the VR-max trained auto-encoders on (a) Frey Face, (b) Caltech 101 Silhouettes, (c) MNIST.]

Table 1: average test log-likelihood. VAE results on MNIST are collected from [Burda et al., 2015]; IWAE results were reproduced using the publicly available implementation.

Dataset                  L   K    VAE        IWAE       VR-max
Frey Face                1   5    1322.96    1380.30    1377.40
(± std. err.)                     ±10.03     ±4.60      ±4.59
Caltech 101 Silhouettes  1   5    -119.69    -117.89    -118.01
                         1   50   -119.61    -117.21    -117.10
MNIST                    1   5    -86.47     -85.41     -85.42
                         1   50   -86.35     -84.80     -84.81
                         2   5    -85.01     -83.92     -84.04
                         2   50   -84.78     -83.12     -83.44
  • 21. ¤ VR-max ¤ Frey Face ¤ α
  • 22. Experiment 2: Bayesian neural network regression
¤ UCI regression datasets
¤ Baselines: VI [Graves, 2011] and PBP [Hernandez-Lobato et al., 2015]
¤ BB-α=BO results are not directly comparable and are available only for the small datasets

Table 2: average test log-likelihood.

Dataset    VI             PBP            BB-α=BO*       VR-0.5         VR-0.0         VR-max
Boston     -2.903±0.071   -2.574±0.089   -2.549±0.019   -2.457±0.066   -2.468±0.071   -2.469±0.072
Concrete   -3.391±0.017   -3.161±0.019   -3.104±0.015   -3.094±0.016   -3.076±0.018   -3.092±0.018
Energy     -2.391±0.029   -2.042±0.019   -0.945±0.012   -1.401±0.029   -1.418±0.020   -1.389±0.018
Wine       -0.980±0.013   -0.968±0.014   -0.949±0.009   -0.948±0.011   -0.952±0.012   -0.949±0.012
Yacht      -3.439±0.163   -1.634±0.016   -1.102±0.039   -1.816±0.011   -1.829±0.014   -1.817±0.013
Protein    -2.992±0.006   -2.973±0.003   NA             -2.923±0.006   -2.911±0.005   -2.938±0.005
Year       -3.622         -3.603         NA             -3.545         -3.550         -3.542

Table 3: average test error.

Dataset    VI             PBP            BB-α=BO*       VR-0.5         VR-0.0         VR-max
Boston     4.320±0.291    3.104±0.180    3.160±0.109    2.853±0.154    2.852±0.169    2.837±0.181
Concrete   7.128±0.123    5.667±0.093    5.374±0.074    5.343±0.102    5.237±0.114    5.280±0.104
Energy     2.646±0.081    1.804±0.048    0.600±0.018    0.807±0.059    0.883±0.050    0.791±0.041
Wine       0.646±0.008    0.635±0.007    0.632±0.005    0.640±0.009    0.638±0.008    0.639±0.009
Yacht      6.887±0.674    1.015±0.054    0.902±0.051    1.111±0.082    1.239±0.109    1.117±0.085
Protein    4.842±0.003    4.732±0.013    NA             4.505±0.033    4.436±0.030    4.574±0.023
Year       9.034          8.879          NA             8.942          9.133          8.949
  • 23. Conclusion
¤ Variational inference was generalised using the Rényi divergence
¤ VI/VB, EP, BB-α, VAE, IWAE and VR-max arise as special cases of the framework
¤ How to choose α in practice remains open