How to Train Deep Variational
Autoencoders and Probabilistic
Ladder Networks
D2
¤ ICML 2016
¤ Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, Ole Winther (Technical University of Denmark)
¤ Same group as Maaløe's auxiliary deep generative models (VAE) and the VAE+GAN work
¤ Reports state-of-the-art log-likelihoods for VAE-type generative models
¤ Topic: how to train deep VAEs (many stochastic layers)
¤ Key ideas: a Ladder-network-like structure as the VAE inference model, plus a warm-up period and batch normalization during training
¤ Also studies what the deep VAE learns in its latent layers
Variational Autoencoder
¤ Variational Autoencoder [Kingma+ 13][Rezende+ 14]
¤ A deep generative model with latent variables z
¤ Generative model (decoder): z ~ p_θ(z), x ~ p_θ(x|z)
¤ Inference model (encoder): q_φ(z|x), an approximation to the posterior p_θ(z|x)
¤ Trained by maximizing a variational lower bound on log p_θ(x)
¤ Both p_θ(x|z) and q_φ(z|x) are parameterized by neural networks
¤ Observation model: N(x|µ, σ²) for continuous data or B(x|µ) for binary data

(From the paper, Sec. 2:)

Figure 2. Flow of information in the inference and generative models of a) probabilistic ladder network and b) VAE. The probabilistic ladder network allows direct integration (+ in figure, see Eq. (21)) of bottom-up and top-down information in the inference model. In the VAE the top-down information is incorporated indirectly through the conditional priors in the generative model.

The generative model p_θ is specified as follows:

  p_θ(x|z_1) = N(x | µ_θ(z_1), σ²_θ(z_1))   (1)
  or
  p_θ(x|z_1) = B(x | µ_θ(z_1))   (2)

for continuous-valued (Gaussian N) or binary-valued (Bernoulli B) data, respectively. The latent variables z are split into L layers z_i, i = 1 ... L:

  p_θ(z_i|z_{i+1}) = N(z_i | µ_{θ,i}(z_{i+1}), σ²_{θ,i}(z_{i+1}))   (3)
  p_θ(z_L) = N(z_L | 0, I).   (4)

The hierarchical specification allows the lower layers of the latent variables to be highly correlated but still maintain the computational efficiency of fully factorized models.

Each layer in the inference model q_φ(z|x) is specified using a fully factorized Gaussian distribution:

  q_φ(z_1|x) = N(z_1 | µ_{φ,1}(x), σ²_{φ,1}(x))   (5)
  q_φ(z_i|z_{i−1}) = N(z_i | µ_{φ,i}(z_{i−1}), σ²_{φ,i}(z_{i−1}))   (6)

for i = 2 ... L.

Functions µ(·) and σ²(·) in both the generative and the inference models are implemented as:

  d(y) = MLP(y)   (7)
  µ(y) = Linear(d(y))   (8)
  σ²(y) = Softplus(Linear(d(y))),   (9)

where MLP is a two layered multilayer perceptron network, Linear is a single linear layer, and Softplus applies the log(1 + exp(·)) non-linearity to each component of its argument vector. In our notation, each MLP(·) or Linear(·) gives a new mapping with its own parameters, so the deterministic variable d is used to mark that the MLP-part is shared between µ and σ² whereas the last Linear layer is not shared. (For Bernoulli data the mean µ_θ(z_1) is produced with a Sigmoid output.)
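As a rough illustration of Eqs. (7)-(9) (not the authors' code), the sketch below builds one Gaussian layer in numpy: the mean and variance share a two-layer MLP d(y) but use separate Linear output heads. Layer sizes, the leaky-rectifier nonlinearity and the initialization are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # a single linear layer y -> y W + b with its own parameters
    W = rng.normal(0.0, 0.01, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda y: y @ W + b

def mlp(in_dim, hidden_dim):
    # two-layer MLP with leaky rectifier nonlinearities max(a, 0.1a)
    l1, l2 = linear(in_dim, hidden_dim), linear(hidden_dim, hidden_dim)
    leaky = lambda a: np.maximum(a, 0.1 * a)
    return lambda y: leaky(l2(leaky(l1(y))))

def softplus(a):
    return np.log1p(np.exp(a))

def gaussian_layer(in_dim, hidden_dim, z_dim):
    d = mlp(in_dim, hidden_dim)            # Eq. (7): shared deterministic mapping d(y)
    mu_head = linear(hidden_dim, z_dim)    # Eq. (8): Linear head for the mean
    var_head = linear(hidden_dim, z_dim)   # Eq. (9): separate Linear head + Softplus for the variance
    def params(y):
        h = d(y)
        return mu_head(h), softplus(var_head(h))
    return params

# e.g. the parameters of q_phi(z_1|x) for flattened 28x28 inputs and a 64-dim latent layer
q1 = gaussian_layer(28 * 28, 512, 64)
mu, var = q1(rng.random((8, 28 * 28)))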
Abstract
Variational autoencoders are a powerful frame-
work for unsupervised learning. However, pre-
vious work has been restricted to shallow mod-
els with one or two layers of fully factorized
stochastic latent variables, limiting the flexibil-
ity of the latent representation. We propose three
advances in training algorithms of variational au-
toencoders, for the first time allowing to train
deep models of up to five stochastic layers, (1)
using a structure similar to the Ladder network
as the inference model, (2) warm-up period to
support stochastic units staying active in early
training, and (3) use of batch normalization. Us-
ing these improvements we show state-of-the-art
log-likelihood results for generative modeling on
several benchmark datasets.
1. Introduction
The recently introduced variational autoencoder (VAE)
(Kingma & Welling, 2013; Rezende et al., 2014) provides
a framework for deep generative models (DGM). DGMs
have later been shown to be a powerful framework for
semi-supervised learning (Kingma et al., 2014; Maaloee
Figure 1 (paper). Inference (or encoder/recognition) and generative (or decoder) models: a) VAE inference model, b) probabilistic ladder inference model, and c) generative model, with stochastic variables sampled from the approximate posterior with mean and variances parameterized by neural networks, each conditioned on the layer above.
¤ Objective: maximize the variational lower bound (ELBO) on the log-likelihood

  log p_θ(x) ≥ E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ] = L(θ, φ; x)

¤ Gradients w.r.t. θ and φ are estimated with the reparameterization trick (z = µ_φ(x) + σ_φ(x) ⊙ ε, ε ~ N(0, I)) and Monte Carlo sampling:

  ∇_{θ,φ} L(θ, φ; x) = ∇_{θ,φ} E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ]
                     = E_{N(ε|0,I)}[ ∇_{θ,φ} log p_θ(x, z) / q_φ(z|x) ]
                     ≈ (1/M) Σ_{m=1}^{M} ∇_{θ,φ} log [ p_θ(x, z^(m)) / q_φ(z^(m)|x) ]
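A minimal numpy sketch of the estimator above for a single-layer VAE with a Gaussian latent and a Bernoulli likelihood; encoder(x) -> (mu, var) and decoder(z) -> Bernoulli mean are assumed helpers (e.g. built as in the sketch of Eqs. (7)-(9)). It illustrates the reparameterization trick, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, encoder, decoder):
    mu, var = encoder(x)                              # parameters of q_phi(z|x)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.sqrt(var) * eps                       # reparameterization trick
    px_mean = decoder(z)                              # Bernoulli mean of p_theta(x|z)

    # single-sample Monte Carlo estimate of E_q[log p_theta(x|z)]
    log_px_z = np.sum(x * np.log(px_mean) + (1 - x) * np.log(1 - px_mean), axis=-1)
    # analytic KL(q_phi(z|x) || N(0, I)), i.e. the second form of the bound
    kl = 0.5 * np.sum(mu**2 + var - np.log(var) - 1.0, axis=-1)
    return np.mean(log_px_z - kl)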
¤ The bound can be written in two equivalent forms:

  L(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ]
  L(θ, φ; x) = −KL[ q_φ(z|x) ∥ p_θ(z) ] + E_{q_φ(z|x)}[ log p_θ(x|z) ]

¤ In the second form the KL term acts as a variational regularizer and the expectation term is the reconstruction term
¤ The IWAE bound (introduced later) tightens the first form by averaging multiple importance-weighted samples inside the log
¤ With L layers of latent variables the generative and inference models factorize as

  p_θ(x, z) = p(z_L) p(z_{L−1}|z_L) ··· p(z_1|z_2) p(x|z_1)
  q_φ(z|x) = q(z_1|x) q(z_2|z_1) ··· q(z_L|z_{L−1})

¤ so the two forms of the bound become

  E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ]
    = E_{q_φ(z|x)}[ log ( p(z_L) p(z_{L−1}|z_L) ··· p(x|z_1) ) / ( q(z_1|x) q(z_2|z_1) ··· q(z_L|z_{L−1}) ) ]

  −KL[ q_φ(z|x) ∥ p_θ(z) ] + E_{q_φ(z|x)}[ log p_θ(x|z) ]
    = −KL[ q_φ(z|x) ∥ p_θ(z_L) ] + E_{q_φ(z|x)}[ log p(z_{L−1}|z_L) ··· p(z_1|z_2) p(x|z_1) ]
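To make the top-down factorization concrete, here is a small numpy sketch of ancestral sampling from p(z_L) p(z_{L−1}|z_L) ··· p(x|z_1); prior_layers and obs_mean stand in for the per-layer parameter functions of Eqs. (2)-(4) and are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(0)

def sample_top_down(prior_layers, obs_mean, z_dim_top, n_samples=16):
    # prior_layers[i] maps z_{i+1} -> (mu, var) of p(z_i | z_{i+1}), ordered top to bottom
    z = rng.standard_normal((n_samples, z_dim_top))       # z_L ~ N(0, I), Eq. (4)
    for layer in prior_layers:                            # z_i ~ p(z_i | z_{i+1}), Eq. (3)
        mu, var = layer(z)
        z = mu + np.sqrt(var) * rng.standard_normal(mu.shape)
    x_mean = obs_mean(z)                                  # Bernoulli mean of p(x|z_1), Eq. (2)
    return rng.random(x_mean.shape) < x_mean              # binary samples x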
VAE
¤ Many extensions condition the VAE on additional observed variables such as labels
¤ CVAE [Kingma+ 2014]
¤ Variational Fair Auto Encoder [Louizos+ 15]
¤ uses an MMD penalty so that the latent representation is invariant to a sensitive variable s
The inference network q (z|x) (3) is used during training of the model using both the labelled and
unlabelled data sets. This approximate posterior is then used as a feature extractor for the labelled
data set, and the features used for training the classifier.
3.1.2 Generative Semi-supervised Model Objective
For this model, we have two cases to consider. In the first case, the label corresponding to a data
point is observed and the variational bound is a simple extension of equation (5):
  log p_θ(x, y) ≥ E_{q_φ(z|x,y)}[ log p_θ(x|y, z) + log p_θ(y) + log p(z) − log q_φ(z|x, y) ] = −L(x, y).   (6)

For the case where the label is missing, it is treated as a latent variable over which we perform posterior inference and the resulting bound for handling data points with an unobserved label y is:

  log p_θ(x) ≥ E_{q_φ(y,z|x)}[ log p_θ(x|y, z) + log p_θ(y) + log p(z) − log q_φ(y, z|x) ]
            = Σ_y q_φ(y|x)(−L(x, y)) + H(q_φ(y|x)) = −U(x).   (7)

The bound on the marginal likelihood for the entire dataset is now:

  J = Σ_{(x,y)∼p̃_l} L(x, y) + Σ_{x∼p̃_u} U(x)   (8)
The distribution q (y|x) (4) for the missing labels has the form a discriminative classifier, and
we can use this knowledge to construct the best classifier possible as our inference model. This
distribution is also used at test time for predictions of any unseen data.
In the objective function (8), the label predictive distribution q (y|x) contributes only to the second
term relating to the unlabelled data, which is an undesirable property if we wish to use this distribu-
tion as a classifier. Ideally, all model and variational parameters should learn in all cases. To remedy
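A hedged sketch of the unlabelled-data bound in Eq. (7): the label is marginalized out with the classifier q_φ(y|x) and its entropy is added. labelled_bound is assumed to return the per-example bound −L(x, y) of Eq. (6); none of these names come from the cited paper's code.

import numpy as np

def unlabelled_bound(x, q_y_probs, labelled_bound):
    # q_y_probs: shape (num_classes,), the classifier q_phi(y|x) for one example x
    bound = 0.0
    for y in range(len(q_y_probs)):
        bound += q_y_probs[y] * labelled_bound(x, y)           # sum_y q(y|x) * (-L(x, y))
    entropy = -np.sum(q_y_probs * np.log(q_y_probs + 1e-12))   # H(q(y|x))
    return bound + entropy                                     # = -U(x) in Eq. (7)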
¤ Auxiliary Deep Generative Models [Maaløe+ 16]
¤ Extend the variational distribution with auxiliary variables [Agakov+ 2004]
¤ The marginal q(z|x) becomes more expressive than a diagonal Gaussian
¤ state-of-the-art semi-supervised results
Auxiliary Deep Generative Models (paper excerpt)

Kingma (2013); Rezende et al. (2014) have coupled the approach of variational inference with deep learning, giving rise to powerful probabilistic models constructed by an inference neural network q(z|x) and a generative neural network p(x|z). This approach can be perceived as a probabilistic equivalent to the deep auto-encoder, in which q(z|x) acts as the encoder and p(x|z) the decoder. However, the difference is that these models ensure efficient inference over various continuous distributions in the latent space z and complex input datasets x, where the posterior distribution p(x|z) is intractable. Furthermore, the gradients of the variational upper bound are easily defined by backpropagation through the network(s). To keep the computational requirements low the variational distribution q(z|x) is usually chosen to be a diagonal Gaussian, limiting the expressive power of the inference model.

In this paper we propose a variational auxiliary variable approach (Agakov and Barber, 2004) to improve the variational distribution: The generative model is extended with variables a to p(x, z, a) such that the original model is invariant to marginalization over a: p(x, z, a) = p(a|x, z)p(x, z). In the variational distribution, on the other hand, a is used such that the marginal q(z|x) = ∫ q(z|a, x)p(a|x)da is a general non-Gaussian distribution. This hierarchical specification allows the latent variables to be correlated through a, while maintaining the computational efficiency of fully factorized models (cf. Fig. 1).

Figure 1. Probabilistic graphical model of the ADGM for semi-supervised learning. The incoming joint connections to each variable are deep neural networks with parameters θ and φ. (a) Generative model P; (b) Inference model Q.

2.2. Auxiliary variables
We propose to extend the variational distribution with auxiliary variables a: q(a, z|x) = q(z|a, x)q(a|x) such that the marginal distribution q(z|x) can fit more complicated posteriors p(z|x). In order to have an unchanged generative model, p(x|z), it is required that the joint p(x, z, a) gives back the original p(x, z) under marginalization over a, thus p(x, z, a) = p(a|x, z)p(x, z). Auxiliary variables are used in the EM algorithm and Gibbs sampling and have previously been considered for variational learning by Agakov and Barber (2004).
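A tiny sketch of the auxiliary-variable idea: even when q(a|x) and q(z|a, x) are diagonal Gaussians, the marginal q(z|x) = ∫ q(z|a, x) q(a|x) da becomes a flexible continuous mixture. The functions q_a and q_z_given_a are assumed to return (mean, variance); this is an illustration, not the ADGM implementation.

import numpy as np

rng = np.random.default_rng(0)

def sample_marginal_q_z(x, q_a, q_z_given_a, n_samples=1000):
    mu_a, var_a = q_a(x)
    a = mu_a + np.sqrt(var_a) * rng.standard_normal((n_samples, mu_a.shape[-1]))
    mu_z, var_z = q_z_given_a(a, x)                  # conditioned on both a and x
    z = mu_z + np.sqrt(var_z) * rng.standard_normal(mu_z.shape)
    return z                                         # samples from the (non-Gaussian) marginal q(z|x)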
Importance Weighted AE
¤ Importance Weighted Autoencoder [Burda+ 15]
¤ A tighter bound is obtained by averaging k importance-weighted samples inside the log
¤ The bound becomes monotonically tighter as k increases
¤ Generalized by Rényi-divergence variational inference [Li+ 16]

  log p_θ(x) ≥ E_{q_φ(z^(1)|x) ··· q_φ(z^(K)|x)}[ log (1/K) Σ_{k=1}^{K} p_θ(x, z^(k)) / q_φ(z^(k)|x) ] ≥ L(θ, φ; x)
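A short numpy sketch of the K-sample importance weighted bound: K reparameterized samples are drawn per data point and their importance ratios are averaged inside the log, using a log-sum-exp for numerical stability. encoder, log_p_xz and log_q_zx are assumed helper functions; this is not the IWAE reference code.

import numpy as np

rng = np.random.default_rng(0)

def iw_bound(x, encoder, log_p_xz, log_q_zx, K=50):
    mu, var = encoder(x)                                   # q_phi(z|x) parameters, shape (batch, d)
    eps = rng.standard_normal((K,) + mu.shape)
    z = mu + np.sqrt(var) * eps                            # K reparameterized samples
    log_w = log_p_xz(x, z) - log_q_zx(z, mu, var)          # log importance weights, shape (K, batch)
    m = log_w.max(axis=0)
    # log (1/K) sum_k w_k, averaged over the batch
    return np.mean(m + np.log(np.mean(np.exp(log_w - m), axis=0)))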
(Excerpt, Li & Turner 2016:) Recall from Section 2.1 that the family of Rényi divergences includes the KL divergence. Perhaps variational free-energy approaches can be generalised to the Rényi case? Consider approximating the true posterior p(θ|D) by minimizing Rényi's α-divergence for some selected α ≥ 0:

  q(θ) = arg min_{q∈Q} D_α[ q(θ) || p(θ|D) ].

Now we verify the alternative optimization problem

  q(θ) = arg max_{q∈Q} { log p(D) − D_α[ q(θ) || p(θ|D) ] }.

When α ≠ 1, the objective can be rewritten as

  log p(D) − 1/(α−1) log ∫ q(θ)^α p(θ|D)^{1−α} dθ
    = log p(D) − 1/(α−1) log E_q[ ( p(θ, D) / (q(θ) p(D)) )^{1−α} ]
    = 1/(1−α) log E_q[ ( p(θ, D) / q(θ) )^{1−α} ] := L_α(q; D).

We name this new objective the variational Rényi bound (VR). Importantly the following theorem is a direct result of Proposition 1.
Theorem 1. The objective L_α(q; D) is continuous and non-increasing on α ∈ [0, 1] ∪ {α : |L_α| < +∞}. Especially for all 0 < α < 1,
¤ Normalizing Flows [Rezende+ 15]
¤ transform a simple initial density q_0(z) through a sequence of invertible mappings to obtain a flexible approximate posterior
¤ Variational Gaussian Process [Tran+ 15]
¤ a non-parametric variational distribution: latent variables are generated by random non-linear mappings of latent inputs
Variational Inference with Normalizing Flows
and involve matrix inverses that can be numerically unsta-
ble. We therefore require normalizing flows that allow for
low-cost computation of the determinant, or where the Ja-
cobian is not needed at all.
4.1. Invertible Linear-time Transformations
We consider a family of transformations of the form:

  f(z) = z + u h(w⊤z + b),   (10)

where λ = {w ∈ R^D, u ∈ R^D, b ∈ R} are free parameters and h(·) is a smooth element-wise non-linearity, with derivative h′(·). For this mapping we can compute the logdet-Jacobian term in O(D) time (using the matrix determinant lemma):

  ψ(z) = h′(w⊤z + b) w   (11)
  |det ∂f/∂z| = |det(I + u ψ(z)⊤)| = |1 + u⊤ψ(z)|.   (12)

From (7) we conclude that the density q_K(z) obtained by transforming an arbitrary initial density q_0(z) through the sequence of maps f_k of the form (10) is implicitly given by:

  z_K = f_K ∘ f_{K−1} ∘ ··· ∘ f_1(z)
  ln q_K(z_K) = ln q_0(z) − Σ_{k=1}^{K} ln |1 + u_k⊤ ψ_k(z_k)|.   (13)

The flow defined by the transformation (13) modifies the initial density q_0 by applying a series of contractions and expansions in the direction perpendicular to the hyperplane w⊤z + b = 0, hence we refer to these maps as planar flows.

As an alternative, we can consider a family of transformations that modify an initial density q_0 around a reference point z_0. The transformation family is:

  f(z) = z + β h(α, r)(z − z_0),   (14)

Figure 1. Effect of normalizing flow on two distributions (initial densities q_0: uniform and unit Gaussian; planar and radial flows with K = 1, 2, 10).
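A minimal numpy sketch of a single planar flow step, Eqs. (10)-(12), and the resulting log-density correction entering Eq. (13); the parameters w, u, b are arbitrary illustrative values and h = tanh.

import numpy as np

rng = np.random.default_rng(0)
D = 2
w, u, b = rng.normal(size=D), rng.normal(size=D), 0.0

def planar_flow(z):
    # z: array of shape (n, D) of samples from the current density q_k
    a = z @ w + b
    f_z = z + np.outer(np.tanh(a), u)             # Eq. (10) with h = tanh
    psi = np.outer(1 - np.tanh(a) ** 2, w)        # Eq. (11): psi(z) = h'(w^T z + b) w
    log_det = np.log(np.abs(1 + psi @ u))         # Eq. (12)
    return f_z, log_det

z0 = rng.standard_normal((1000, D))               # samples from q_0 = N(0, I)
zK, log_det = planar_flow(z0)
# Eq. (13) with K = 1: ln q_K(z_K) = ln q_0(z_0) - log|1 + u^T psi(z_0)|
log_qK = -0.5 * np.sum(z0**2, axis=1) - 0.5 * D * np.log(2 * np.pi) - log_det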
Inference network Generative model
Figure 2. Inference and generative models. Left: Inference net-
work maps the observations to the parameters of the flow; Right:
generative model which receives the posterior samples from the
inference network during training time. Round containers repre-
sent layers of stochastic variables whereas square containers rep-
resent deterministic layers.
4.2. Flow-Based Free Energy Bound
If we parameterize the approximate posterior distribution
with a flow of length K, q (z|x) := qK(zK), the free en-
ergy (3) can be written as an expectation over the initial
distribution q0(z):
  F(x) = E_{q_φ(z|x)}[ log q_φ(z|x) − log p(x, z) ]
       = E_{q_0(z_0)}[ ln q_K(z_K) − log p(x, z_K) ]
       = E_{q_0(z_0)}[ ln q_0(z_0) ] − E_{q_0(z_0)}[ log p(x, z_K) ] − E_{q_0(z_0)}[ Σ_{k=1}^{K} ln |1 + u_k⊤ψ_k(z_k)| ]
…distribution:

  q(z′) = q(z) |det ∂f⁻¹/∂z′| = q(z) |det ∂f/∂z|⁻¹,   (5)

where the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying (5). The density q_K(z) obtained by successively transforming a random variable z_0 with distribution q_0 through a chain of K transformations f_k is:

  z_K = f_K ∘ ··· ∘ f_2 ∘ f_1(z_0)   (6)
  ln q_K(z_K) = ln q_0(z_0) − Σ_{k=1}^{K} ln |det ∂f_k/∂z_k|,   (7)

where equation (6) will be used throughout the paper as a shorthand for the composition f_K(f_{K−1}(. . . f_1(x))). The path traversed by the random variables z_k = f_k(z_{k−1}) with initial distribution q_0(z_0) is called the flow and the path formed by the successive distributions q_k is a normalizing flow. A property of such transformations, often referred to as the law of the unconscious statistician (LOTUS), is that expectations w.r.t. the transformed density q_K can be computed without explicitly knowing q_K. Any expectation E_{q_K}[h(z)] can be written as an expectation under q_0 as:

  E_{q_K}[h(z)] = E_{q_0}[ h(f_K ∘ f_{K−1} ∘ ··· ∘ f_1(z_0)) ],   (8)

which does not require computation of the logdet-Jacobian terms when h(z) does not depend on q_K.

We can understand the effect of invertible flows as a sequence of expansions or contractions on the initial density. For an expansion, the map z′ = f(z) pulls the points z away from a region in R^d, reducing the density in that region while increasing the density outside the region.
Figure 1: (a) Graphical model of the variational Gaussian process. The VGP generates samples of
latent variables z by evaluating random non-linear mappings of latent inputs ⇠, and then drawing
mean-field samples parameterized by the mapping. These latent variables aim to follow the posterior
distribution for a generative model (b), conditioned on data x.
¤ Proposed techniques: (1) a probabilistic ladder network as the inference model, (2) a warm-up period, (3) batch normalization
¤ Together they allow training VAEs with up to five layers of stochastic latent variables
¤ Ladder network [Valpola, 14][Rasmus+ 15]: basis for the proposed probabilistic ladder network inference model
¤ The inference model combines a deterministic bottom-up pathway ("likelihood") with a stochastic top-down pathway ("prior") to form the approximate "posterior" (the + / "copy" operations in Figure 2a)
¤ In the VAE (Figure 2b) the top-down pathway exists only in the generative model, so top-down information reaches the inference model only indirectly, through the KL-divergences against the conditional priors

(Figure 2: flow of information in the inference and generative models of a) the probabilistic ladder network and b) the VAE.)
Probabilistic ladder network vs. VAE (two-layer example)
¤ bottom-up (inference): z_1 ~ q(z_1|x), then z_2 ~ q(z_2|z_1)
¤ top-down (generative): p(z_2), p(z_1|z_2), p(x|z_1)
¤ both appear in the bound E_{q_φ(z|x)}[ log ( p(z_2) p(z_1|z_2) p(x|z_1) ) / ( q(z_1|x) q(z_2|z_1) ) ]
¤ In the VAE the top-down information reaches the inference model only indirectly through this bound; the probabilistic ladder network instead combines the bottom-up and top-down passes directly inside the inference model
¤ Training criterion and construction of the probabilistic ladder inference model (from the paper):

The variational principle provides a tractable lower bound on the log likelihood which can be used as a training criterion L:

  log p(x) ≥ E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ] = L(θ, φ; x)   (10)
           = −KL(q_φ(z|x) || p_θ(z)) + E_{q_φ(z|x)}[ log p_θ(x|z) ],   (11)

where KL is the Kullback-Leibler divergence. A strictly tighter bound on the likelihood may be obtained at the expense of a K-fold increase of samples by using the importance weighted bound (Burda et al., 2015):

  log p(x) ≥ E_{q_φ(z^(1)|x) ··· q_φ(z^(K)|x)}[ log (1/K) Σ_{k=1}^{K} p_θ(x, z^(k)) / q_φ(z^(k)|x) ]   (12)

The generative and inference parameters, θ and φ, are jointly trained by optimizing Eq. (11) using stochastic gradient descent, where we use the reparametrization trick for stochastic backpropagation through the Gaussian latent variables (Kingma & Welling, 2013; Rezende et al., 2014). The expectations are approximated using Monte Carlo samples from the corresponding q distribution.

…2015) as the inference model of a VAE, as shown in Figure 1. The generative model is the same as before. The inference is constructed to first make a deterministic upward pass:

  d_1 = MLP(x)   (13)
  µ_{d,i} = Linear(d_i),  i = 1 ... L   (14)
  σ²_{d,i} = Softplus(Linear(d_i)),  i = 1 ... L   (15)
  d_i = MLP(µ_{d,i−1}),  i = 2 ... L   (16)

followed by a stochastic downward pass:

  q_φ(z_L|x) = N(µ_{d,L}, σ²_{d,L})   (17)
  t_i = MLP(z_{i+1}),  i = 1 ... L−1   (18)
  µ_{t,i} = Linear(t_i)   (19)
  σ²_{t,i} = Softplus(Linear(t_i))   (20)
  q_θ(z_i|z_{i+1}, x) = N( (µ_{t,i}/σ²_{t,i} + µ_{d,i}/σ²_{d,i}) / (1/σ²_{t,i} + 1/σ²_{d,i}),  1 / (1/σ²_{t,i} + 1/σ²_{d,i}) ).   (21)
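The key step is Eq. (21): the bottom-up ("likelihood") estimate (µ_d, σ²_d) and the top-down ("prior") estimate (µ_t, σ²_t) are merged precision-weighted into the approximate posterior. A small sketch with illustrative inputs:

import numpy as np

def combine(mu_d, var_d, mu_t, var_t):
    # precision-weighted combination of two Gaussian estimates, Eq. (21)
    prec_d, prec_t = 1.0 / var_d, 1.0 / var_t
    var_q = 1.0 / (prec_d + prec_t)                   # posterior variance
    mu_q = var_q * (mu_d * prec_d + mu_t * prec_t)    # precision-weighted mean
    return mu_q, var_q

# e.g. a confident bottom-up estimate dominates an uncertain top-down one
mu_q, var_q = combine(mu_d=np.array([1.0]), var_d=np.array([0.1]),
                      mu_t=np.array([-1.0]), var_t=np.array([10.0]))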
¤ Eq. (21) is the standard precision-weighted combination (product) of two Gaussian densities (cf. PRML)
Warm-up
¤ Proposal 2: warm-up of the variational regularization (KL) term
¤ The KL term regularizes each unit's approximate posterior q(z_{i,k}|·) towards its prior p(z_{i,k}|·), so many latent units become inactive early in training [MacKay 01]
¤ Units pruned early tend to stay inactive for the rest of training

  L(θ, φ; x) = −KL[ q_φ(z|x) ∥ p_θ(z) ] + E_{q_φ(z|x)}[ log p_θ(x|z) ]
phenomenon. The probabilistic ladder network provides
a framework with the wanted interaction, while keeping
complications manageable. A further extension could be
to make the inference in k steps over an iterative inference
procedure (Raiko et al., 2014).
2.2. Warm-up from deterministic to variational
autoencoder
The variational training criterion in Eq. (11) contains the
reconstruction term p✓(x|z) and the variational regular-
ization term. The variational regularization term causes
some of the latent units to become inactive during train-
ing (MacKay, 2001) because the approximate posterior for
unit k, q(zi,k| . . . ) is regularized towards its own prior
p(zi,k| . . . ), a phenomenon also recognized in the VAE set-
ting (Burda et al., 2015). This can be seen as a virtue
of automatic relevance determination, but also as a prob-
lem when many units are pruned away early in training
before they learned a useful representation. We observed
that such units remain inactive for the rest of the training,
presumably trapped in a local minima or saddle point at
KL(qi,k|pi,k) ⇡ 0, with the optimization algorithm unable
to re-activate them.
¤ Early pruning wastes latent units before they can learn useful representations
¤ → Warm-up: start from a pure reconstruction objective (a deterministic auto-encoder) and gradually introduce the KL term
¤ Batch Normalization [Ioffe+ 15]
¤ Proposal 3: apply batch normalization to all layers except the output layers
¤ essential for learning deep hierarchies with more than 2 stochastic latent layers
We propose to alleviate the problem by initializing train-
ing using the reconstruction error only (corresponding to
training a standard deterministic auto-encoder), and then
gradually introducing the variational regularization term:
  L(θ, φ; x)_T = −β KL(q_φ(z|x) || p_θ(z)) + E_{q_φ(z|x)}[ log p_θ(x|z) ],   (22)

where β is increased linearly from 0 to 1 during the first N_t
epochs of training. We denote this scheme warm-up (ab-
breviated WU in tables and graphs) because the objective
goes from having a delta-function solution (correspond-
ing to zero temperature) and then move towards the fully
stochastic variational objective. A similar idea has previ-
ously been considered in Raiko et al. (2007, Section 6.2),
however here used for Bayesian models trained with a co-
ordinate descent algorithm.
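A minimal sketch of the warm-up (WU) schedule implied by Eq. (22): the KL term is scaled by a coefficient β that grows linearly from 0 to 1 over the first N_t epochs, so training starts close to a deterministic auto-encoder. The helper names and the training-loop interface are assumptions of this sketch.

def warmup_beta(epoch, n_t=200):
    # epoch is 0-based; after n_t epochs the full variational objective is used
    return min(1.0, epoch / float(n_t))

def training_objective(log_px_z, kl, epoch, n_t=200):
    # log_px_z = E_q[log p_theta(x|z)], kl = KL(q_phi(z|x) || p_theta(z))
    return log_px_z - warmup_beta(epoch, n_t) * kl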
Experiments
¤ 3 datasets: MNIST, OMNIGLOT [Lake+ 13], NORB [LeCun+ 04]
¤ Up to 5 layers of stochastic latent variables with sizes 64-32-16-8-4
¤ All mappings are 2-layer MLPs: 512 units between x and z_1, then 256, 128, 64 and 32 units for the subsequent layers
¤ Nonlinearities: leaky rectifiers max(x, 0.1x) (hyperbolic tangent for NORB)
¤ Warm-up: N_t = 200 for MNIST, N_t = 1000 for NORB
¤ Test log-likelihood estimated with k = 5000 importance weighted samples
MNIST
¤ The vanilla VAE does not improve beyond two layers of stochastic latent variables
¤ With batch normalization & warm-up, additional layers consistently improve performance
¤ The probabilistic ladder network performs as well as or better than the VAE variants

Figure 3 (paper): MNIST test-set log-likelihood values for VAEs and the probabilistic ladder networks with different number of latent layers, batch normalization (BN) and warm-up (WU).

Figure 4 (paper): MNIST train (full lines) and test set performance L(x) during training (roughly −90 to −82) for the different models (VAE, VAE+BN, ...). The test set performance was estimated using 5000 importance weighted samples, providing a tighter bound than the training bound, which explains the better performance here.
¤ Fine-tuning: anneal the learning rate and increase the number of Monte Carlo (MC) and importance weighted (IW) samples
¤ Best result: −81.20 (probabilistic ladder network, MC=10, IW=10)
¤ Compares favourably with previous permutation invariant MNIST results: −82.90 [Burda+ 15] and −81.90 [Tran+ 15]
Table 1. Fine-tuned test log-likelihood values for 5 layered VAE and probabilistic ladder networks trained on MNIST. ANN. LR: annealed learning rate, MC: Monte Carlo samples to approximate E_q(·)[·], IW: importance weighted samples.

FINETUNING      NONE     ANN. LR.    ANN. LR.    ANN. LR.    ANN. LR.
                         MC=1 IW=1   MC=10 IW=1  MC=1 IW=10  MC=10 IW=10
VAE             -82.14   -81.97      -81.84      -81.41      -81.30
PROB. LADDER    -81.87   -81.54      -81.46      -81.35      -81.20
training in deep neural networks by normalizing the outputs
from each layer. We show that batch normalization (abbre-
viated BN in tables and graphs), applied to all layers except
the output layers, is essential for learning deep hierarchies
of latent variables for L > 2.
¤ How many latent units are actually used? A unit is counted as active if KL(q_{i,k}||p_{i,k}) > 0.01
¤ The vanilla VAE has almost no active units above the second layer
¤ Batch normalization (BN), Warm-up (WU) and the probabilistic ladder network all increase the number of active units (Table 2)
Table 2. Number of active latent units in five layer VAE and probabilistic ladder networks trained on MNIST. A unit was defined as active if KL(q_{i,k}||p_{i,k}) > 0.01.

           VAE    VAE+BN   VAE+BN+WU   PROB. LADDER+BN+WU
LAYER 1    20     20       34          46
LAYER 2    1      9        18          22
LAYER 3    0      3        6           8
LAYER 4    0      3        2           3
LAYER 5    0      2        1           2
TOTAL      21     37       61          81
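A small sketch of the "active units" count used in Table 2, assuming per-unit posterior and prior means/variances are available as arrays; the 0.01 threshold is the one from the table caption.

import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    # elementwise KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def count_active_units(mu_q, var_q, mu_p, var_p, threshold=0.01):
    # all arrays of shape (num_datapoints, num_units); average the KL over the data
    kl_per_unit = kl_diag_gaussians(mu_q, var_q, mu_p, var_p).mean(axis=0)
    return int(np.sum(kl_per_unit > threshold))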
For fine-tuning we increased the number of Monte Carlo and importance weighted samples to 10 to reduce the variance in the approximation of the expectations in Eq. (10) and improve the inference model, respectively.
Models trained on the OMNIGLOT dataset, consisting of 28x28 binary images, were trained similarly to above except that the number of training epochs was 1500.
Models trained on the NORB dataset, consisting of 32x32 gray-scale images with color-coding rescaled to [0, 1], used a Gaussian observation model with mean and variance parameterized by a linear and a softplus output layer. Otherwise the models were similar to those above, except that hyperbolic tangent was used as the nonlinearity, the learning rate was 0.002, N_t = 1000 and the number of training epochs was 4000.
OMNIGLOT / NORB
¤ BN and WU also help consistently on OMNIGLOT and NORB
¤ On NORB the probabilistic ladder network does not give a clear improvement over VAE+BN+WU
¤ NORB models use a Gaussian observation model and tanh nonlinearities
Table 3. Test set log-likelihood values for models trained on the OMNIGLOT and NORB datasets. The left most column shows the dataset and the number of latent variables in each model.

                 VAE       VAE+BN    VAE+BN+WU   PROB. LADDER+BN+WU
OMNIGLOT
64               -114.45   -108.79   -104.63
64-32            -112.60   -106.86   -102.03     -102.12
64-32-16         -112.13   -107.09   -101.60     -101.26
64-32-16-8       -112.49   -107.66   -101.68     -101.27
64-32-16-8-4     -112.10   -107.94   -101.86     -101.59
NORB
64               2630.8    3263.7    3481.5
64-32            2830.8    3140.1    3532.9      3522.7
64-32-16         2757.5    3247.3    3346.7      3458.7
64-32-16-8       2832.0    3302.3    3393.6      3499.4
64-32-16-8-4     3064.1    3258.7    3393.6      3430.3
tic layers. The performance of the vanilla VAE model did
not improve with more than two layers of stochastic latent
variables. Contrary to this, models trained with batch nor-
malization and warm-up consistently increase the model
performance for additional layers of stochastic latent vari-
ables. As expected the improvement in performance is de-
creasing for each additional layer, but we emphasize that
the improvements are consistent even for the addition of
¤ Which layers does the VAE actually use?
¤ Track KL(q(z_{i,k}|z_{i−1,k}) || p(z_i|z_{i+1})) for each stochastic unit during training
¤ If this KL is 0 the inference model is independent of the data and the unit has collapsed
¤ With BN, WU and the ladder network the higher layers stay active, unlike the vanilla VAE
To study this effect we calculated the KL-divergence between q(z_{i,k}|z_{i−1,k}) and p(z_i|z_{i+1}) for each stochastic latent variable k during training, as seen in Figure 5. The term is zero if the inference model is independent of the data, i.e. q(z_{i,k}|z_{i−1,k}) = q(z_{i,k}), and hence collapsed
¤ Qualitative analysis of the latent representations (PCA of samples from q in each layer)
¤ With BN, WU and the ladder network the upper layers capture clear high-level structure
¤ KL between q(z_i|z_{i−1}) and a standard normal per layer: roughly zero above the second layer in the vanilla VAE
¤ The ladder network with BN and WU has more active intermediate layers than VAE+BN+WU
structured high level latent representations that are likely
useful for semi-supervised learning.
The hierarchical latent variable models used here allows
highly flexible distributions of the lower layers conditioned
on the layers above. We measure the divergence between
these conditional distributions and the restrictive mean field
approximation by calculating the KL-divergence between
q(zi|zi 1) and a standard normal distribution for several
models trained on MNIST, see Figure 6 a). As expected
the lower layers have highly non (standard) Gaussian dis-
tributions when conditioned on the layers above. Interest-
ingly the probabilistic ladder network seems to have more
active intermediate layers than the VAE with batch nor-
malization and warm-up. Again this might be explained
by the deterministic upward pass easing flow of informa-
tion to the intermediate and upper layers. We further note
that the KL-divergence is approximately zero in the vanilla
VAE model above the second layer confirming the inactiv-
ity of these layers. Figure 6 b) shows generative samples
from the probabilistic ladder network created by injecting
Conclusions
¤ Three techniques for training VAEs with many (up to 5) layers of stochastic latent variables: the probabilistic ladder network, warm-up and batch normalization
¤ State-of-the-art log-likelihood results on MNIST and OMNIGLOT
¤ The probabilistic ladder network performs as well as or better than the VAE
¤ The deeper models use more latent layers and learn richer latent representations, which should be useful for semi-supervised learning
Implementing VAEs
¤ VAE-family models can be implemented compactly on top of Theano with libraries such as Lasagne and Parmesan
¤ Define the encoder q and decoder p as small stacks of neural network layers, then combine them into the model
¤ The same building blocks extend to other models (ladder variants, RNN-based VAEs, …)
# recognition model (encoder) q(z|x)
q_h = [Dense(rng, 28*28, 200, activation=T.nnet.relu),
       Dense(rng, 200, 200, activation=T.nnet.relu)]
q_mean = [Dense(rng, 200, 50, activation=None)]
q_sigma = [Dense(rng, 200, 50, activation=T.nnet.softplus)]
q = [Gaussian(q_h, q_mean, q_sigma)]

# generative model (decoder) p(x|z)
p_h = [Dense(rng, 50, 200, activation=T.nnet.relu),
       Dense(rng, 200, 200, activation=T.nnet.relu)]
p_mean = [Dense(rng, 200, 28*28, activation=T.nnet.sigmoid)]
p = [Bernoulli(p_h, p_mean)]

# build the VAE from q and p
model = VAE_z_x(q, p, k=1, alpha=1, random=rseed)

# train on the training set
model.train(train_x)

# estimate the test log-likelihood with k=1000 importance weighted samples
log_likelihood_test = model.log_likelihood_test(test_x, k=1000, mode='iw')

# decode: mean of p(x|z) for given latent samples
sample_x = model.p_sample_mean_x(sample_z)
¤ Rényi bound, CNN, Lasagne

More Related Content

What's hot

What's hot (20)

[DL輪読会]Disentangling by Factorising
[DL輪読会]Disentangling by Factorising[DL輪読会]Disentangling by Factorising
[DL輪読会]Disentangling by Factorising
 
[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-
[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-
[DL輪読会]`強化学習のための状態表現学習 -より良い「世界モデル」の獲得に向けて-
 
猫でも分かるVariational AutoEncoder
猫でも分かるVariational AutoEncoder猫でも分かるVariational AutoEncoder
猫でも分かるVariational AutoEncoder
 
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
 
Transformerを用いたAutoEncoderの設計と実験
Transformerを用いたAutoEncoderの設計と実験Transformerを用いたAutoEncoderの設計と実験
Transformerを用いたAutoEncoderの設計と実験
 
[DL輪読会]Flow-based Deep Generative Models
[DL輪読会]Flow-based Deep Generative Models[DL輪読会]Flow-based Deep Generative Models
[DL輪読会]Flow-based Deep Generative Models
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究について
 
[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.
[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.
[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.
 
[DL輪読会]NVAE: A Deep Hierarchical Variational Autoencoder
[DL輪読会]NVAE: A Deep Hierarchical Variational Autoencoder[DL輪読会]NVAE: A Deep Hierarchical Variational Autoencoder
[DL輪読会]NVAE: A Deep Hierarchical Variational Autoencoder
 
[DL輪読会]MetaFormer is Actually What You Need for Vision
[DL輪読会]MetaFormer is Actually What You Need for Vision[DL輪読会]MetaFormer is Actually What You Need for Vision
[DL輪読会]MetaFormer is Actually What You Need for Vision
 
【DL輪読会】The Forward-Forward Algorithm: Some Preliminary
【DL輪読会】The Forward-Forward Algorithm: Some Preliminary【DL輪読会】The Forward-Forward Algorithm: Some Preliminary
【DL輪読会】The Forward-Forward Algorithm: Some Preliminary
 
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
 
近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer
 
Contrastive learning 20200607
Contrastive learning 20200607Contrastive learning 20200607
Contrastive learning 20200607
 
Transformerを多層にする際の勾配消失問題と解決法について
Transformerを多層にする際の勾配消失問題と解決法についてTransformerを多層にする際の勾配消失問題と解決法について
Transformerを多層にする際の勾配消失問題と解決法について
 
【基調講演】『深層学習の原理の理解に向けた理論の試み』 今泉 允聡(東大)
【基調講演】『深層学習の原理の理解に向けた理論の試み』 今泉 允聡(東大)【基調講演】『深層学習の原理の理解に向けた理論の試み』 今泉 允聡(東大)
【基調講演】『深層学習の原理の理解に向けた理論の試み』 今泉 允聡(東大)
 
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...[DL輪読会]Diffusion-based Voice Conversion with Fast  Maximum Likelihood Samplin...
[DL輪読会]Diffusion-based Voice Conversion with Fast Maximum Likelihood Samplin...
 
Curriculum Learning (関東CV勉強会)
Curriculum Learning (関東CV勉強会)Curriculum Learning (関東CV勉強会)
Curriculum Learning (関東CV勉強会)
 
【DL輪読会】Transformers are Sample Efficient World Models
【DL輪読会】Transformers are Sample Efficient World Models【DL輪読会】Transformers are Sample Efficient World Models
【DL輪読会】Transformers are Sample Efficient World Models
 
点群深層学習 Meta-study
点群深層学習 Meta-study点群深層学習 Meta-study
点群深層学習 Meta-study
 

Viewers also liked

(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
Masahiro Suzuki
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
Masahiro Suzuki
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?
Masahiro Suzuki
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
Masahiro Suzuki
 

Viewers also liked (20)

(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning
 
(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks
 
Iclr2016 vaeまとめ
Iclr2016 vaeまとめIclr2016 vaeまとめ
Iclr2016 vaeまとめ
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
 
(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters
 
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
 
[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読
 
Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De...
 Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De... Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De...
Semi-Supervised Autoencoders for Predicting Sentiment Distributions(第 5 回 De...
 
論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...
論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...
論文輪読資料「A review of unsupervised feature learning and deep learning for time-s...
 
Deep learning勉強会20121214ochi
Deep learning勉強会20121214ochiDeep learning勉強会20121214ochi
Deep learning勉強会20121214ochi
 
Variational autoencoder talk
Variational autoencoder talkVariational autoencoder talk
Variational autoencoder talk
 
論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks
論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks
論文輪読資料「Why regularized Auto-Encoders learn Sparse Representation?」DL Hacks
 
Introduction to "Facial Landmark Detection by Deep Multi-task Learning"
Introduction to "Facial Landmark Detection by Deep Multi-task Learning"Introduction to "Facial Landmark Detection by Deep Multi-task Learning"
Introduction to "Facial Landmark Detection by Deep Multi-task Learning"
 

Similar to (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
Devansh16
 
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
Cemal Ardil
 

Similar to (DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks (20)

A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
 
Quantum Deep Learning
Quantum Deep LearningQuantum Deep Learning
Quantum Deep Learning
 
GAN(と強化学習との関係)
GAN(と強化学習との関係)GAN(と強化学習との関係)
GAN(と強化学習との関係)
 
An introduction to deep learning
An introduction to deep learningAn introduction to deep learning
An introduction to deep learning
 
論文紹介:Learning With Neighbor Consistency for Noisy Labels
論文紹介:Learning With Neighbor Consistency for Noisy Labels論文紹介:Learning With Neighbor Consistency for Noisy Labels
論文紹介:Learning With Neighbor Consistency for Noisy Labels
 
Analysis_molf
Analysis_molfAnalysis_molf
Analysis_molf
 
Deep learning ensembles loss landscape
Deep learning ensembles loss landscapeDeep learning ensembles loss landscape
Deep learning ensembles loss landscape
 
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...
 
Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...
Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...
Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...
 
Improving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning AlgorithmImproving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning Algorithm
 
Bistablecamnets
BistablecamnetsBistablecamnets
Bistablecamnets
 
Evaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsEvaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernels
 
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
 
graph_embeddings
graph_embeddingsgraph_embeddings
graph_embeddings
 
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
 
Dycops2019
Dycops2019 Dycops2019
Dycops2019
 
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkkOBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
 
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkkOBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
 

Recently uploaded

Recently uploaded (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

(DL hacks reading group) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

  • 1. How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks D2
  • 2. About the paper
    ¤ ICML 2016
    ¤ Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, Ole Winther (Technical University of Denmark)
    ¤ Related: Maaløe's auxiliary VAE and VAE+GAN, both state-of-the-art deep generative models
    ¤ This paper: training deep VAEs with a Ladder-network-style inference model, warm-up, and batch normalization
  • 5. Variational Autoencoder
    ¤ Variational Autoencoder [Kingma+ 13][Rezende+ 14]
    ¤ Generative model: z ~ p_θ(z), x ~ p_θ(x|z)
    ¤ Inference (recognition) model: q_φ(z|x), which approximates the posterior p_θ(z|x)
  • 6. Model specification (from the paper)
    ¤ Generative model p_θ:
        p_θ(x|z_1) = N(x | μ_θ(z_1), σ²_θ(z_1))   (continuous data)                      (1)
        p_θ(x|z_1) = B(x | μ_θ(z_1))              (binary data)                          (2)
        p_θ(z_i|z_{i+1}) = N(z_i | μ_{θ,i}(z_{i+1}), σ²_{θ,i}(z_{i+1})), i = 1 ... L-1   (3)
        p_θ(z_L) = N(z_L | 0, I)                                                         (4)
    ¤ The latent variables z are split into L layers z_i, i = 1 ... L; the hierarchical specification lets the lower latent layers be highly correlated while keeping the computational efficiency of fully factorized models
    ¤ Inference model q_φ(z|x), each layer a fully factorized Gaussian:
        q_φ(z_1|x) = N(z_1 | μ_{φ,1}(x), σ²_{φ,1}(x))                                    (5)
        q_φ(z_i|z_{i-1}) = N(z_i | μ_{φ,i}(z_{i-1}), σ²_{φ,i}(z_{i-1})), i = 2 ... L     (6)
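To make eqs. (1)-(4) concrete, the sketch below draws one sample from such a hierarchy by ancestral sampling: start from the standard-normal top layer and walk down through the conditional Gaussians to a Bernoulli observation model. The layer sizes, hidden width, tanh nonlinearity and random weights are illustrative assumptions, not the paper's trained architecture; only the factorization follows the slide.

    import numpy as np

    rng = np.random.default_rng(0)

    def softplus(a):
        return np.log1p(np.exp(a))

    def make_dense(n_in, n_out):
        # A toy random linear layer; in the paper these are learned MLP/Linear blocks.
        W = rng.standard_normal((n_in, n_out)) * 0.1
        b = np.zeros(n_out)
        return lambda h: h @ W + b

    latent_sizes = [8, 4, 2]          # z1, z2, z3 (L = 3), purely illustrative
    x_dim = 16

    # One (mu, sigma^2) head per conditional p(z_i | z_{i+1}), plus the observation model.
    heads = []
    for lower, upper in zip(latent_sizes[:-1], latent_sizes[1:]):
        d = make_dense(upper, 32)     # shared deterministic part, cf. d(y) = MLP(y)
        heads.append((d, make_dense(32, lower), make_dense(32, lower)))
    d_x = make_dense(latent_sizes[0], 32)
    mu_x = make_dense(32, x_dim)

    def sample_generative():
        z = rng.standard_normal(latent_sizes[-1])            # z_L ~ N(0, I)
        for d, mu_head, var_head in reversed(heads):         # z_i ~ N(mu(z_{i+1}), sigma^2(z_{i+1}))
            h = np.tanh(d(z))
            mu, var = mu_head(h), softplus(var_head(h))
            z = mu + np.sqrt(var) * rng.standard_normal(mu.shape)
        p = 1.0 / (1.0 + np.exp(-mu_x(np.tanh(d_x(z)))))     # Bernoulli mean for p(x | z_1)
        return rng.binomial(1, p)

    print(sample_generative())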
  • 7. Parameterization of the conditionals
    ¤ Gaussian N(x | μ, σ²) for continuous data, Bernoulli B(x | μ) for binary data
    ¤ The functions μ(·) and σ²(·) in both the generative and inference models are implemented as:
        d(y) = MLP(y)                             (7)
        μ(y) = Linear(d(y))                       (8)
        σ²(y) = Softplus(Linear(d(y)))            (9)
    ¤ MLP is a two-layer multilayer perceptron, Linear a single linear layer, and Softplus applies log(1 + exp(·)) to each component of its argument; each MLP(·) or Linear(·) has its own parameters, so d marks that the MLP part is shared between μ and σ² while the final Linear layers are not
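A small numpy sketch of this shared-trunk parameterization (eqs. 7-9). The hidden width and the tanh nonlinearity inside the MLP are assumptions; the paper specifies a two-layer MLP and uses leaky rectifiers.

    import numpy as np

    rng = np.random.default_rng(1)

    def linear(n_in, n_out):
        W = rng.standard_normal((n_in, n_out)) * 0.1
        b = np.zeros(n_out)
        return lambda h: h @ W + b

    def gaussian_head(n_in, n_hidden, n_out):
        # d(y) = MLP(y): two linear layers with a nonlinearity (eq. 7)
        l1, l2 = linear(n_in, n_hidden), linear(n_hidden, n_hidden)
        mu_lin, var_lin = linear(n_hidden, n_out), linear(n_hidden, n_out)
        def forward(y):
            d = np.tanh(l2(np.tanh(l1(y))))
            mu = mu_lin(d)                          # eq. (8): mu(y) = Linear(d(y))
            var = np.log1p(np.exp(var_lin(d)))      # eq. (9): sigma^2(y) = Softplus(Linear(d(y)))
            return mu, var
        return forward

    head = gaussian_head(n_in=784, n_hidden=256, n_out=64)
    mu, var = head(rng.standard_normal(784))
    print(mu.shape, var.shape, var.min() > 0)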
  • 8. Variational lower bound and its gradient
    ¤ The variational principle gives a tractable lower bound on the log-likelihood, used as the training criterion:
        log p_θ(x) ≥ E_{q_φ(z|x)}[ log p_θ(x, z) - log q_φ(z|x) ] = L(θ, φ; x)
    ¤ The gradient is estimated by Monte Carlo sampling with the reparameterization trick:
        ∇_{θ,φ} L(θ, φ; x) = ∇_{θ,φ} E_{q_φ(z|x)}[ log p_θ(x, z) - log q_φ(z|x) ]
                           ≈ (1/M) Σ_{m=1}^{M} ∇_{θ,φ} [ log p_θ(x, z^(m)) - log q_φ(z^(m)|x) ],  z^(m) ~ q_φ(z|x)
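A minimal numpy sketch of the single-layer estimator: draw z with the reparameterization trick and average log p_θ(x, z) - log q_φ(z|x). The toy encode/decode stand-ins are assumptions just to make it runnable; a real model would backpropagate through z = μ + σ·ε with a framework such as Theano.

    import numpy as np

    rng = np.random.default_rng(2)

    def log_normal(z, mu, var):
        # log N(z | mu, diag(var)), summed over dimensions
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

    def elbo_estimate(x, encode, decode, n_samples=1):
        # encode(x) -> (mu, var) of q(z|x); decode(z) -> Bernoulli mean of p(x|z)
        mu, var = encode(x)
        total = 0.0
        for _ in range(n_samples):
            eps = rng.standard_normal(mu.shape)
            z = mu + np.sqrt(var) * eps                      # reparameterization trick
            p = decode(z)
            log_px_z = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
            log_pz = log_normal(z, np.zeros_like(z), np.ones_like(z))
            log_qz_x = log_normal(z, mu, var)
            total += log_px_z + log_pz - log_qz_x            # log p(x, z) - log q(z|x)
        return total / n_samples

    # toy encoder/decoder stand-ins (assumptions, only to exercise the function)
    x = rng.binomial(1, 0.5, size=20).astype(float)
    encode = lambda inp: (np.zeros(5), np.ones(5))
    decode = lambda z: np.full(20, 0.5)
    print(elbo_estimate(x, encode, decode, n_samples=10))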
  • 9. Two ways to write the bound
    ¤ Form A: L(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x, z) - log q_φ(z|x) ]
    ¤ Form B: L(θ, φ; x) = -KL[ q_φ(z|x) ∥ p_θ(z) ] + E_{q_φ(z|x)}[ log p_θ(x|z) ]
    ¤ Form B separates an analytically computable KL regularization term from the reconstruction term
    ¤ IWAE uses form A
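For diagonal Gaussians with a standard-normal prior, the KL term of form B has the usual closed form. A short sketch:

    import numpy as np

    def kl_diag_gaussian_to_std_normal(mu, var):
        # KL( N(mu, diag(var)) || N(0, I) ), summed over dimensions
        return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

    mu = np.array([0.3, -1.2, 0.0])
    var = np.array([0.8, 1.5, 1.0])
    print(kl_diag_gaussian_to_std_normal(mu, var))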
  • 10. The bound for a hierarchy of latent layers
    ¤ Generative and inference models factorize layer-wise:
        p_θ(x, z) = p_θ(z_L) ∏_{i=1}^{L-1} p_θ(z_i|z_{i+1}) · p_θ(x|z_1)
        q_φ(z|x) = q_φ(z_1|x) ∏_{i=2}^{L} q_φ(z_i|z_{i-1})
    ¤ Form A becomes
        E_{q_φ(z|x)}[ log ( p_θ(z_L) p_θ(z_{L-1}|z_L) ... p_θ(x|z_1) ) - log ( q_φ(z_1|x) q_φ(z_2|z_1) ... q_φ(z_L|z_{L-1}) ) ]
    ¤ In form B only the top-layer prior contributes a KL term directly; the conditional priors p_θ(z_i|z_{i+1}) enter through the expectation:
        -KL[ q_φ(z|x) ∥ p_θ(z) ] + E_{q_φ(z|x)}[ log p_θ(x|z) ] = -KL[ q_φ(z|x) ∥ p_θ(z_L) ] + E_{q_φ(z|x)}[ log p_θ(z_{L-1}|z_L) ... p_θ(x|z_1) ]
  • 12. Conditional extensions of the VAE
    ¤ Semi-supervised / conditional VAE (CVAE) [Kingma+ 2014]: the generative and inference models are conditioned on a label y; when the label is missing it is treated as a latent variable, giving per-example bounds L(x, y) and U(x) and the dataset objective J = Σ_{(x,y)} L(x, y) + Σ_x U(x); the classifier q_φ(y|x) is also used for predictions at test time
    ¤ Variational Fair Autoencoder [Louizos+ 15]: the model is conditioned on a sensitive variable s, and an MMD penalty encourages the latent representation to be invariant to s
  • 13. Auxiliary Deep Generative Models [Maaløe+ 16]
    ¤ Extends the variational distribution with auxiliary variables a [Agakov & Barber 2004]: q(a, z|x) = q(z|a, x) q(a|x), so the marginal q(z|x) = ∫ q(z|a, x) q(a|x) da is a general non-Gaussian distribution that can fit more complicated posteriors p(z|x)
    ¤ The generative model is extended to p(x, z, a) = p(a|x, z) p(x, z), which leaves the original p(x, z) unchanged under marginalization over a
    ¤ This hierarchical specification lets the latent variables be correlated through a while maintaining the computational efficiency of fully factorized models
    ¤ Achieves state-of-the-art semi-supervised results
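This is not the ADGM implementation, just a toy numpy sketch of why the marginal q(z|x) = ∫ q(z|a, x) q(a|x) da stops being Gaussian even when both factors are Gaussian: here the mean of q(z|a, x) depends non-linearly on a, so the marginal over z is a continuous mixture.

    import numpy as np

    rng = np.random.default_rng(3)

    def sample_q_z_given_x(x, n=10000):
        # q(a|x): diagonal Gaussian whose mean depends (here, trivially) on x
        a = x.mean() + 1.0 * rng.standard_normal(n)
        # q(z|a, x): Gaussian whose mean is a non-linear function of a,
        # so the marginal q(z|x) = E_{q(a|x)}[q(z|a, x)] is non-Gaussian
        mu_z = np.tanh(3.0 * a)
        return mu_z + 0.2 * rng.standard_normal(n)

    z = sample_q_z_given_x(x=np.zeros(4))
    # sample moments reveal the strongly non-Gaussian (bimodal) marginal
    print(z.mean(), z.std(), ((z - z.mean()) ** 4).mean() / z.std() ** 4)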
  • 15. Importance Weighted Autoencoder
    ¤ Importance Weighted Autoencoder (IWAE) [Burda+ 15]: a tighter bound obtained by averaging K importance weights inside the log:
        log p_θ(x) ≥ E_{z^(1),...,z^(K) ~ q_φ(z|x)}[ log (1/K) Σ_{k=1}^{K} p_θ(x, z^(k)) / q_φ(z^(k)|x) ] ≥ L(θ, φ; x)
    ¤ The bound tightens as the number of samples K grows
    ¤ Variational Rényi bound [Li+ 16]: generalizes the variational free energy to Rényi α-divergences; minimizing D_α[q(θ) ∥ p(θ|D)] corresponds to maximizing
        L_α(q; D) = 1/(1-α) · log E_q[ ( p(θ, D) / q(θ) )^(1-α) ],
      which is continuous and non-increasing in α
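A sketch of the K-sample bound as a numerically stable log-mean-exp over log importance weights; the numbers in the usage lines are made up.

    import numpy as np

    def iwae_bound(log_p_xz, log_q_zx):
        # log_p_xz, log_q_zx: arrays of shape (K,) with log p(x, z^(k)) and log q(z^(k)|x)
        log_w = log_p_xz - log_q_zx                     # log importance weights
        m = log_w.max()
        return m + np.log(np.mean(np.exp(log_w - m)))   # log (1/K) sum_k w_k, computed stably

    # With K = 1 this reduces to the standard single-sample ELBO estimate
    log_p = np.array([-90.0, -95.0, -88.0, -102.0])
    log_q = np.array([-10.0, -12.0, -9.0, -14.0])
    print(iwae_bound(log_p, log_q))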
  • 16. More flexible variational posteriors
    ¤ Normalizing flows [Rezende+ 15]: transform a simple initial density q_0(z_0) through a chain of K invertible maps f_k:
        z_K = f_K ∘ ... ∘ f_2 ∘ f_1(z_0)
        ln q_K(z_K) = ln q_0(z_0) - Σ_{k=1}^{K} ln |det ∂f_k/∂z_k|
    ¤ Planar flows use the invertible linear-time transformation f(z) = z + u·h(wᵀz + b); with ψ(z) = h'(wᵀz + b)·w the log-det-Jacobian is ln|1 + uᵀψ(z)|, computable in O(D) via the matrix determinant lemma
    ¤ The flow applies a series of contractions and expansions perpendicular to the hyperplane wᵀz + b = 0; an inference network maps the observation to the flow parameters
    ¤ Variational Gaussian Process [Tran+ 15]: the variational model generates latent variables by evaluating random non-linear mappings of latent inputs ξ and then drawing mean-field samples parameterized by those mappings
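A minimal numpy sketch of one planar-flow step and the running log-density update. The random parameters are illustrative, and the constraint that keeps the map invertible (the paper reparameterizes u so that wᵀu ≥ -1 when h = tanh) is omitted here.

    import numpy as np

    def planar_flow_step(z, u, w, b):
        # f(z) = z + u * tanh(w.z + b); returns f(z) and log|det Jacobian|
        a = w @ z + b
        f_z = z + u * np.tanh(a)
        psi = (1.0 - np.tanh(a) ** 2) * w              # psi(z) = h'(w.z + b) w
        log_det = np.log(np.abs(1.0 + u @ psi))
        return f_z, log_det

    rng = np.random.default_rng(4)
    D = 5
    z = rng.standard_normal(D)
    log_q = -0.5 * np.sum(np.log(2 * np.pi) + z ** 2)   # ln q0(z0) = ln N(z0 | 0, I)
    for _ in range(3):                                   # a K = 3 flow with random parameters
        u = 0.5 * rng.standard_normal(D)
        w = rng.standard_normal(D)
        b = float(rng.standard_normal())
        z, log_det = planar_flow_step(z, u, w, b)
        log_q -= log_det                                 # ln qK(zK) = ln q0(z0) - sum_k ln|det|
    print(z, log_q)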
  • 19. Proposed techniques for training deep VAEs
    ¤ 1. A probabilistic ladder network structure for the inference model
    ¤ 2. A warm-up period for the variational regularization term
    ¤ 3. Batch normalization
  • 20. Probabilistic ladder network
    ¤ Inspired by the Ladder network [Valpola, 14][Rasmus+ 15]; here the ladder structure is used as the inference model of a VAE
    ¤ (Figure 2 of the paper) The inference model has a deterministic bottom-up pathway and a stochastic top-down pathway; "likelihood" (bottom-up) and "prior" (top-down) information are combined directly (+) into the "posterior"
    ¤ In a standard VAE, top-down information reaches the inference model only indirectly, through the conditional priors and KL terms of the generative model
  • 21. Top-down information in the inference model
    ¤ In the VAE the inference model is purely bottom-up: z_1 ~ q(z_1|x), z_2 ~ q(z_2|z_1), ...
    ¤ The top-down terms p_θ(z_L), p_θ(z_i|z_{i+1}), p_θ(x|z_1) influence the approximate posterior only through the bound
        E_{z ~ q_φ(z|x)}[ log ( p_θ(z_L) p_θ(z_{L-1}|z_L) ... p_θ(x|z_1) ) - log ( q(z_1|x) q(z_2|z_1) ... q(z_L|z_{L-1}) ) ]
    ¤ The probabilistic ladder network instead combines the bottom-up estimates with the top-down information from the generative path directly inside the inference model (see slide 22)
  • 22. Inference in the probabilistic ladder network
    ¤ The ladder structure is used as the inference model of a VAE; the generative model is the same as before
    ¤ Deterministic upward pass:
        d_1 = MLP(x)                                          (13)
        μ_{d,i} = Linear(d_i),             i = 1 ... L        (14)
        σ²_{d,i} = Softplus(Linear(d_i)),  i = 1 ... L        (15)
        d_i = MLP(μ_{d,i-1}),              i = 2 ... L        (16)
    ¤ Stochastic downward pass:
        q_φ(z_L|x) = N(μ_{d,L}, σ²_{d,L})                     (17)
        t_i = MLP(z_{i+1}),                i = 1 ... L-1      (18)
        μ_{t,i} = Linear(t_i)                                 (19)
        σ²_{t,i} = Softplus(Linear(t_i))                      (20)
        q_φ(z_i|z_{i+1}, x) = N( (μ_{t,i}·σ_{t,i}^{-2} + μ_{d,i}·σ_{d,i}^{-2}) / (σ_{t,i}^{-2} + σ_{d,i}^{-2}), 1 / (σ_{t,i}^{-2} + σ_{d,i}^{-2}) )   (21)
    ¤ Eq. (21) is the precision-weighted combination of the bottom-up and top-down Gaussian estimates (the product-of-Gaussians formula, cf. PRML)
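Eq. (21) in code: a small sketch of the precision-weighted combination of the bottom-up (d) and top-down (t) Gaussian parameters.

    import numpy as np

    def combine_precision_weighted(mu_d, var_d, mu_t, var_t):
        # Eq. (21): q(z_i | z_{i+1}, x) = N(mu, var), with precisions added
        prec_d, prec_t = 1.0 / var_d, 1.0 / var_t
        var = 1.0 / (prec_d + prec_t)
        mu = (mu_d * prec_d + mu_t * prec_t) * var
        return mu, var

    # The more confident (lower-variance) pathway dominates the combined estimate
    mu, var = combine_precision_weighted(
        mu_d=np.array([1.0, 1.0]), var_d=np.array([0.1, 10.0]),
        mu_t=np.array([-1.0, -1.0]), var_t=np.array([10.0, 0.1]))
    print(mu, var)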
  • 24. Inactive latent units
    ¤ L(θ, φ; x) = -KL[ q_φ(z|x) ∥ p_θ(z) ] + E_{q_φ(z|x)}[ log p_θ(x|z) ]
    ¤ The variational regularization (KL) term causes some latent units to become inactive during training [MacKay 01]: the approximate posterior of unit k, q(z_{i,k}|...), is regularized towards its own prior p(z_{i,k}|...), a phenomenon also seen in the VAE setting [Burda+ 15]
    ¤ This can be seen as a virtue of automatic relevance determination, but it is a problem when many units are pruned away early, before they have learned a useful representation
    ¤ Such units remain inactive for the rest of training, presumably trapped in a local minimum or saddle point at KL(q_{i,k} ∥ p_{i,k}) ≈ 0, where the optimizer cannot re-activate them
  • 25. Warm-up and batch normalization
    ¤ Warm-up (WU): initialize training with the reconstruction error only (corresponding to a deterministic autoencoder) and gradually introduce the variational regularization term:
        L_T(θ, φ; x) = -β · KL[ q_φ(z|x) ∥ p_θ(z) ] + E_{q_φ(z|x)}[ log p_θ(x|z) ]   (22)
      where β is increased linearly from 0 to 1 during the first N_t epochs; the objective moves from a zero-temperature (delta-function) solution towards the fully stochastic variational objective (a similar idea appears in Raiko et al. 2007, Sec. 6.2)
    ¤ Batch normalization (BN) [Ioffe+ 15]: normalizes the outputs of each layer; applied to all layers except the output layers, and found to be essential for learning deep hierarchies with more than two stochastic layers
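A sketch of the warm-up schedule of Eq. (22); the epoch values below are only examples (the paper uses N_t = 200 for MNIST).

    def warmup_beta(epoch, n_warmup_epochs):
        # Linear warm-up: beta goes from 0 to 1 over the first n_warmup_epochs, then stays at 1
        return min(1.0, epoch / float(n_warmup_epochs))

    def warmed_up_objective(log_px_z, kl, epoch, n_warmup_epochs=200):
        # L_T = -beta * KL[q(z|x) || p(z)] + E_q[log p(x|z)]   (eq. 22)
        return log_px_z - warmup_beta(epoch, n_warmup_epochs) * kl

    for epoch in (0, 50, 100, 200, 300):
        print(epoch, warmup_beta(epoch, 200))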
  • 26. Experimental setup
    ¤ Three datasets: MNIST, OMNIGLOT [Lake+ 13], NORB [LeCun+ 04]
    ¤ Up to five stochastic layers with 64, 32, 16, 8 and 4 latent units (going from the data towards the top); shallower models are obtained by removing the top layers
    ¤ Two-layer MLPs between the stochastic layers: 512 units between x and z_1, then 256, 128, 64 and 32 for the subsequent connections
    ¤ Leaky rectifiers (max(x, 0.1x)) as nonlinearities for MNIST and OMNIGLOT, hyperbolic tangent for NORB
    ¤ Test log-likelihood estimated with k = 5000 importance-weighted samples [Burda+ 15]
    ¤ Trained with the Adam optimizer; implemented with Theano, Lasagne and the Parmesan framework
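A small sketch that just enumerates the layer plan implied by these sizes; the function name and print format are mine, not the paper's code.

    latent_sizes = [64, 32, 16, 8, 4]      # z1 ... z5 (drop top entries for shallower models)
    mlp_sizes = [512, 256, 128, 64, 32]    # width of the two-layer MLP below each stochastic layer

    def layer_plan(n_layers):
        plan = []
        for i in range(n_layers):
            below = "x" if i == 0 else f"z{i}"
            plan.append(f"{below} -> MLP({mlp_sizes[i]}, {mlp_sizes[i]}) -> z{i + 1} ({latent_sizes[i]} units)")
        return plan

    for line in layer_plan(5):
        print(line)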
  • 27. MNIST results
    ¤ The plain VAE does not improve beyond two stochastic layers
    ¤ With batch normalization and warm-up, performance improves consistently as layers are added
    ¤ The probabilistic ladder network performs as well as or better than the best VAE variants
    ¤ (Figure 3: MNIST test-set log-likelihood for VAEs and probabilistic ladder networks with different numbers of latent layers, with and without BN and WU. Figure 4: MNIST train (full lines) and test performance during training; the test-set performance is estimated with 5000 importance-weighted samples, giving a tighter bound than the training bound and hence the better values)
  • 28. Fine-tuning (Table 1: test log-likelihood for 5-layer models on MNIST)
    ¤ Fine-tuning options: annealed learning rate (ANN. LR), more Monte Carlo samples (MC) to approximate E_q[·], and more importance-weighted samples (IW)
        FINETUNING      NONE     ANN.LR MC=1 IW=1   ANN.LR MC=10 IW=1   ANN.LR MC=1 IW=10   ANN.LR MC=10 IW=10
        VAE             -82.14   -81.97             -81.84              -81.41              -81.30
        PROB. LADDER    -81.87   -81.54             -81.46              -81.35              -81.20
    ¤ For comparison on permutation-invariant MNIST: -82.90 [Burda+ 15], -81.90 [Tran+ 15]
  • 29. Number of active latent units (Table 2: five-layer models on MNIST; a unit counts as active if KL(q_{i,k} ∥ p_{i,k}) > 0.01)
                    VAE   VAE+BN   VAE+BN+WU   PROB. LADDER+BN+WU
        Layer 1     20    20       34          46
        Layer 2     1     9        18          22
        Layer 3     0     3        6           8
        Layer 4     0     3        2           3
        Layer 5     0     2        1           2
        Total       21    37       61          81
    ¤ Batch normalization, warm-up and the probabilistic ladder structure each increase the number of active units, especially in the higher layers
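Table 2's active-unit criterion can be computed from per-unit KL terms. A simplified numpy sketch, assuming a standard-normal prior per unit; in the paper the relevant prior for the lower layers is the conditional p(z_i|z_{i+1}) and the KL is averaged over the data.

    import numpy as np

    def per_unit_kl(mu_q, var_q, mu_p, var_p):
        # KL( N(mu_q, var_q) || N(mu_p, var_p) ) per latent unit (diagonal Gaussians)
        return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

    def count_active_units(mu_q, var_q, mu_p, var_p, threshold=0.01):
        return int(np.sum(per_unit_kl(mu_q, var_q, mu_p, var_p) > threshold))

    rng = np.random.default_rng(5)
    mu_q = np.concatenate([rng.standard_normal(20), np.zeros(44)])   # 20 informative units, 44 collapsed
    var_q = np.concatenate([np.full(20, 0.3), np.ones(44)])
    print(count_active_units(mu_q, var_q, np.zeros(64), np.ones(64)))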
  • 30. OMNIGLOT and NORB results (Table 3: test-set log-likelihood; the left column gives the latent layer sizes of each model)
        OMNIGLOT        VAE       VAE+BN    VAE+BN+WU   PROB. LADDER+BN+WU
        64              -114.45   -108.79   -104.63
        64-32           -112.60   -106.86   -102.03     -102.12
        64-32-16        -112.13   -107.09   -101.60     -101.26
        64-32-16-8      -112.49   -107.66   -101.68     -101.27
        64-32-16-8-4    -112.10   -107.94   -101.86     -101.59
        NORB
        64              2630.8    3263.7    3481.5
        64-32           2830.8    3140.1    3532.9      3522.7
        64-32-16        2757.5    3247.3    3346.7      3458.7
        64-32-16-8      2832.0    3302.3    3393.6      3499.4
        64-32-16-8-4    3064.1    3258.7    3393.6      3430.3
    ¤ The vanilla VAE does not improve with more than two stochastic layers, whereas models trained with batch normalization and warm-up consistently gain from additional layers (with diminishing returns per layer)
    ¤ NORB uses continuous (Gaussian) observations with hyperbolic tangent nonlinearities, hence the positive log-likelihood values
  • 31. Latent unit activity during training
    ¤ For each stochastic latent unit k, the KL divergence between q(z_{i,k}|z_{i-1}) and p(z_i|z_{i+1}) is tracked during training
    ¤ This KL term is zero if the inference model is independent of the data, i.e. q(z_{i,k}|z_{i-1}) = q(z_{i,k}): the unit has collapsed onto the prior and carries no information about x
    ¤ In the plain VAE most units above the lower layers collapse early; BN, WU and the probabilistic ladder network keep far more units active (cf. Table 2)
  • 32. PCA visualization of the latent space
    ¤ PCA plots of the latent representations are compared for the VAE, the VAE with batch normalization, and the probabilistic ladder network
  • 33. How non-Gaussian are the conditional posteriors?
    ¤ The hierarchical models allow highly flexible distributions in the lower layers when conditioned on the layers above
    ¤ The divergence from the restrictive mean-field approximation is measured as the KL between q(z_i|z_{i-1}) and a standard normal distribution for several models trained on MNIST (Figure 6a)
    ¤ As expected, the lower layers are highly non-(standard-)Gaussian when conditioned on the layers above
    ¤ The probabilistic ladder network has more active intermediate layers than the VAE with BN and WU, possibly because the deterministic upward pass eases the flow of information to the intermediate and upper layers
    ¤ In the vanilla VAE the KL is approximately zero above the second layer, confirming that those layers are inactive
    ¤ The multi-layer models capture structured high-level latent representations that are likely useful for semi-supervised learning
    ¤ Figure 6b shows generative samples from the probabilistic ladder network
  • 36. Summary
    ¤ Three techniques for training deep VAEs: a ladder-like inference structure, warm-up, and batch normalization
    ¤ With these, VAEs with up to five stochastic layers can be trained, and performance keeps improving as layers are added
    ¤ State-of-the-art log-likelihoods on MNIST and OMNIGLOT
    ¤ The probabilistic ladder network performs as well as or better than strong VAE baselines and keeps more latent units active
  • 38. Implementing VAEs
    ¤ Most of a VAE implementation reduces to standard neural-network components (→ NN), so existing frameworks can be reused
    ¤ Lasagne and Parmesan provide Theano-based building blocks for VAE-style models
  • 39. Example: defining a VAE with the presenter's library
    ¤ The library covers the basic VAE and is being extended (e.g. ladder and RNN variants, the Rényi bound, CNN layers, closer Lasagne integration)
    ¤ Minimal usage:

    #recognition model (encoder)
    q_h = [Dense(rng, 28*28, 200, activation=T.nnet.relu),
           Dense(rng, 200, 200, activation=T.nnet.relu)]
    q_mean = [Dense(rng, 200, 50, activation=None)]
    q_sigma = [Dense(rng, 200, 50, activation=T.nnet.softplus)]
    q = [Gaussian(q_h, q_mean, q_sigma)]

    #generative model (decoder)
    p_h = [Dense(rng, 50, 200, activation=T.nnet.relu),
           Dense(rng, 200, 200, activation=T.nnet.relu)]
    p_mean = [Dense(rng, 200, 28*28, activation=T.nnet.sigmoid)]
    p = [Bernoulli(p_h, p_mean)]

    #VAE
    model = VAE_z_x(q, p, k=1, alpha=1, random=rseed)

    #training
    model.train(train_x)

    #test log-likelihood (importance weighted, k=1000)
    log_likelihood_test = model.log_likelihood_test(test_x, k=1000, mode='iw')

    #generate samples from latent codes
    sample_x = model.p_sample_mean_x(sample_z)