ICLR 2016 VAE Summary
Masahiro Suzuki (鈴木雅大)
About this talk
¤ Today's content
¤ I will mainly cover the VAE-related papers presented at ICLR.
¤ ICLR 2016
¤ May 2-4, 2016
¤ San Juan, Puerto Rico
¤ Number of presentations:
¤ Conference track: 80
¤ Workshop: 55
Trends at ICLR 2016
Reinforcement Learning
Unsupervised Learning
Incorporating Structure
Compressing Networks
Initializing Networks
Backprop Tricks
Attention
Deep Metric Learning
Computer Vision Applications
Visualizing Networks
Do Deep Convolutional Nets Really Need to be Deep?
Training-Free Methods
Geometric Methods Gaussian Processes and Auto Encoders
ResNet
http://www.computervisionblog.com
VAE papers at ICLR
¤ Five of the papers accepted to ICLR are VAE papers (or are related to VAE):
¤ Importance Weighted Autoencoders
¤ The Variational Fair Autoencoder
¤ Generating Images from Captions with Attention
¤ The Variational Gaussian Process
¤ Variationally Auto-Encoded Deep Gaussian Processes
¤ I will explain these papers while covering VAE from the basics.
Discriminative and generative models
¤ A discriminative model only cares about separating the data.
¤ Deep NNs and SVMs are discriminative models (strictly speaking, discriminant functions).
¤ A generative model also considers the source that generates the data.

Discriminative model (discriminant function): models the classification probability (posterior) p(C_k | x).
Generative model: models the distribution of the data (the joint distribution) p(x, C_k), from which p(C_k | x) = p(x, C_k) / p(x).
Variational inference and VAE

Background: variational inference
¤ Learning a generative model = modeling the distribution p(x) from data
→ obtained by maximizing the likelihood p(x).
¤ When latent variables z are also modeled, as in p(x) = ∫ p(x, z) dz:
¤ the likelihood cannot be maximized directly.
¤ Instead, we maximize a lower bound that bounds the log-likelihood from below.
¤ Introduce a distribution q(z|x) that approximates p(z|x).
¤ The log-likelihood then decomposes as
log p(x) = L(x) + KL(q(z|x) || p(z|x)),
where L(x) is the lower bound and the KL term is the gap between the true and approximate posteriors, which is always ≥ 0.
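For reference, the decomposition above written out as a worked identity (L(x) is the evidence lower bound, ELBO):

```latex
% Decomposition of the log-likelihood into the ELBO and a KL gap
\log p(x)
  = \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{q(z|x)}\right]
    + \mathrm{KL}\!\left(q(z|x)\,\|\,p(z|x)\right)
  = \mathcal{L}(x) + \mathrm{KL}\!\left(q(z|x)\,\|\,p(z|x)\right),
\quad
\mathcal{L}(x) = \mathbb{E}_{q(z|x)}\!\left[\log p(x|z)\right]
    - \mathrm{KL}\!\left(q(z|x)\,\|\,p(z)\right).
```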
(Excerpt from PRML, Ch. 9 "Mixture Models and EM":)

Figure 9.12: Illustration of the E step of the EM algorithm. The q distribution is set equal to the posterior distribution for the current parameter values θ_old, causing the lower bound L(q, θ_old) to move up to the same value as the log likelihood function ln p(X|θ_old), with the KL divergence vanishing: KL(q||p) = 0.

If we substitute q(Z) = p(Z|X, θ_old) into (9.71), we see that, after the E step, the lower bound takes the form
L(q, θ) = Σ_Z p(Z|X, θ_old) ln p(X, Z|θ) − Σ_Z p(Z|X, θ_old) ln p(Z|X, θ_old) = Q(θ, θ_old) + const   (9.74)
where the constant is simply the negative entropy of the q distribution and is therefore independent of θ. Thus in the M step, the quantity that is being maximized is the expectation of the complete-data log likelihood, as we saw earlier in the case of mixtures of Gaussians. Note that the variable θ over which we are optimizing appears only inside the logarithm. If the joint distribution p(Z, X|θ) comprises a member of the exponential family, or a product of such members, then we see that the logarithm will cancel the exponential and lead to an M step that will be typically much simpler than the maximization of the corresponding incomplete-data log likelihood function p(X|θ). The operation of the EM algorithm can also be viewed in the space of parameters, as illustrated schematically in Figure 9.14.

Figure 9.13: Illustration of the M step of the EM algorithm. The distribution q(Z) is held fixed and the lower bound L(q, θ) is maximized with respect to the parameter vector θ to give a revised value θ_new. Because the KL divergence is nonnegative, this causes the log likelihood ln p(X|θ) to increase by at least as much as the lower bound does. (Annotations: ln p(X|θ_new), L(q, θ_new), KL(q||p).)
Variational Autoencoder
¤ Variational Autoencoder [Kingma+ 13][Rezende+ 14]
¤ A generative model in which the probability distributions are represented by deep neural networks.
¤ For simplicity, z is the only latent variable.
¤ The lower bound to maximize is
L(x) = E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ] ≈ (1/T) Σ_{t=1}^{T} log p_θ(x, z^(t)) / q_φ(z^(t)|x),
where z^(t) = μ + diag(σ) ⊙ ε, ε ~ N(0, I) (the reparameterization trick).
¤ The approximate distribution q(z|x) can be regarded as an encoder,
¤ and the generative model z ~ p(z), x ~ p(x|z) as a decoder.
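Below is a minimal sketch of this estimator with the reparameterization trick, written in PyTorch; the layer sizes and the Bernoulli decoder are illustrative assumptions, not the setup of any particular paper.

```python
# Minimal VAE sketch: Gaussian encoder q_phi(z|x), Bernoulli decoder p_theta(x|z).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Linear(z_dim, h_dim)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def lower_bound(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        logits = self.dec_out(F.relu(self.dec(z)))
        # L(x) = E_q[log p(x|z)] - KL(q(z|x) || p(z)), single-sample estimate
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction='none').sum(-1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return (rec - kl).mean()   # maximize this (e.g. minimize its negative with Adam)
```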
Modeling the VAE
¤ The distributions are modeled by neural networks.
(Figure: the inference model (encoder) and the generative model (decoder) drawn as multilayer networks.)
Inference model
sampling: z^(l) = μ + σ ⊙ ε^(l),  ε^(l) ~ N(0, I).
(Excerpt shown on the slide, from a paper on a multimodal VAE (MVAE) with two modalities x and w:)

Figure 2: The network architecture of MVAE. This represents the same model as Fig. 1.

Hence,
L(x, w) = −D_KL(q_φ(z|x, w) || p(z)) + E_{q_φ(z|x,w)}[log p_θx(x|z)] + E_{q_φ(z|x,w)}[log p_θw(w|z)].   (7)
By the SGVB algorithm, the estimator of the lower bound is as follows:
L̂(x, w) = −D_KL(q_φ(z|x, w) || p(z)) + (1/L) Σ_{l=1}^{L} [ log p_θx(x|z^(l)) + log p_θw(w|z^(l)) ],   (8)
where z^(l) = μ + σ ⊙ ε^(l), ε^(l) ~ N(0, I).

The most significant difference from the estimator of the VAE's lower bound (Eq. (3)) is that there are two negative reconstruction terms in Eq. (8); these terms correspond to each modality. As in the VAE, we call q_φ(z|x, w) the encoder and both p_θx(x|z^(l)) and p_θw(w|z^(l)) decoders.

We can parameterize the encoder and decoder distributions as deep neural networks. Fig. 2 draws the same model as Fig. 1, but represented by deep neural networks. Considering the encoder q_φ(z|x, w) as a Gaussian distribution, we can estimate the mean and variance of the distribution by neural networks as follows:
y(x) = MLP_φx(x),   y(w) = MLP_φw(w),
μ_φ = Linear(y(x), y(w)),
log σ²_φ = Tanh(Linear(y(x), y(w))),   (9)
where MLP_φx and MLP_φw are deep neural networks corresponding to each modality, and Linear and Tanh denote a linear layer and a tanh layer. Linear(a, b) means that the network has multiple input layers corresponding to a and b.

Because each modality has a different feature representation, we need a different network for each decoder, p_θx(x|z) and p_θw(w|z). The type of distribution and the network architecture depend on the representation of each modality, e.g., a Gaussian distribution when the modality is continuous and a Bernoulli distribution when it is binary (0 or 1). In case p_θw(w|z) is a Bernoulli distribution B(w|μ_θw), its parameter μ_θw can be estimated as follows:
y(z) = MLP_θw(z),   μ_θ = Linear(y(z)).   (10)
In case the decoder is a Gaussian distribution, its parameters can be estimated in the same way as Eq. (9), except that the input of the Linear network is single.
(Background text from the same pasted paper:) Srivastava & Salakhutdinov (2012) applied deep restricted Boltzmann machines (RBMs), an early deep generative model, to multimodal learning. As in Ngiam et al. (2011), they joined the latent variables of multiple networks and extracted high-level features from multimodal data (images and texts); in their experiments the model outperformed Ngiam et al. (2011), suggesting that deep generative models may extract better representations than discriminative ones.

2.2 Variational autoencoders
Variational autoencoders (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) are recently proposed deep generative models. Given observed variables x and corresponding latent variables z, we consider their generative process as follows:
z ~ p(z);  x ~ p_θ(x|z),
where θ is the model parameter of p. In variational inference, we consider q_φ(z|x), where φ is the model parameter of q, in order to approximate the posterior distribution p_θ(z|x). The goal is to maximize the lower bound.
q(z|x)
x ~ p(x|z)
z ~ p(z)
Generative model
How do we learn a good model?
¤ What we maximize is the lower bound, but what we really want to maximize is the (log-)likelihood.
¤ The lower bound is a good proxy if it approximates the log-likelihood well,
¤ which depends on how well the approximate distribution can approximate the true posterior:
L(x) = log p(x) − KL(q(z|x) || p(z|x))
If the approximate distribution matches the true posterior, the KL divergence becomes 0 and the lower bound equals the log-likelihood.
¤ In practice, however, the approximate distribution is constrained by the VAE's lower bound:
¤ posterior samples that fail to explain x even slightly are heavily penalized.
¤ Solution:
¤ consider a new lower bound that approximates the log-likelihood more tightly.
Importance Weighted AE
¤ Importance Weighted Autoencoders [Burda+ 15; ICLR 2016]
¤ Proposes the following new lower bound,
¤ an importance-weighted estimator with k samples:
L_k(x) = E_{z^(1),...,z^(k) ~ q_φ(z|x)}[ log (1/k) Σ_{i=1}^{k} p_θ(x, z^(i)) / q_φ(z^(i)|x) ]
¤ The following relation is proven for this bound:
log p(x) ≥ L_{k+1}(x) ≥ L_k(x) ≥ L_1(x) = L(x)
¤ Simply increasing the number of samples relaxes the constraint and brings the bound closer to the true log-likelihood.
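A minimal sketch of the k-sample bound, assuming hypothetical helpers log_joint(x, z) for log p_theta(x, z) and log_q(z, mu, logvar) for log q_phi(z|x); the point is only the log-mean-exp over importance weights.

```python
# k-sample importance-weighted bound L_k(x) (IWAE-style estimator).
import math
import torch

def iwae_bound(x, mu, logvar, log_joint, log_q, k=50):
    # k posterior samples per data point via the reparameterization trick
    eps = torch.randn(k, *mu.shape)                 # shape (k, batch, z_dim)
    z = mu + torch.exp(0.5 * logvar) * eps
    # importance weights w_i = p(x, z_i) / q(z_i | x), kept in log space
    log_w = log_joint(x, z) - log_q(z, mu, logvar)  # shape (k, batch)
    # L_k(x) = E[ log (1/k) sum_i w_i ]  ->  log-mean-exp over the k samples
    return (torch.logsumexp(log_w, dim=0) - math.log(k)).mean()
```

With k = 1 this reduces to the ordinary VAE bound, matching L_1(x) = L(x) above.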
IWAE: experimental results
¤ The test likelihood can be seen to improve.

(Excerpt:)
# stoch. layers  k    MNIST VAE NLL (active units)  MNIST IWAE NLL (active units)  OMNIGLOT VAE NLL (active units)  OMNIGLOT IWAE NLL (active units)
1                1    86.76 (19)                    86.76 (19)                     108.11 (28)                      108.11 (28)
1                5    86.47 (20)                    85.54 (22)                     107.62 (28)                      106.12 (34)
1                50   86.35 (20)                    84.78 (25)                     107.80 (28)                      104.67 (41)
2                1    85.33 (16+5)                  85.33 (16+5)                   107.58 (28+4)                    107.56 (30+5)
2                5    85.01 (17+5)                  83.89 (21+5)                   106.31 (30+5)                    104.79 (38+6)
2                50   84.78 (17+5)                  82.90 (26+7)                   106.30 (30+5)                    103.38 (44+7)

Table 1: Results on density estimation and the number of active latent dimensions. For models with two latent layers, "k1+k2" denotes k1 active units in the first layer and k2 in the second layer. The generative performance of IWAEs improved with increasing k, while that of VAEs benefitted only slightly. Two-layer models achieved better generative performance than one-layer models.

The log-likelihood results are reported in Table 1. Our VAE results are comparable to those previously reported in the literature. We observe that training a VAE with k > 1 helped only slightly. By contrast, using multiple samples improved the IWAE results considerably on both datasets.
Conditional VAE and semi-supervised learning

Semi-supervised learning with a VAE
¤ Semi-Supervised Learning with Deep Generative Models [Kingma+ 2014; NIPS 2014]
¤ Semi-supervised learning with a conditional VAE (CVAE).
¤ The lower bound of the conditional VAE is
L(x|y) = E_{q_φ(z|x,y)}[ log p_θ(x, z|y) / q_φ(z|x, y) ].
¤ The overall objective (lower bound) is therefore
L(x) + L(x|y) + α E[ −log q_φ(y|x) ].
¤ The last term is the label-prediction model.
(Graphical model: the latent variable z and the label y generate x.)
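For reference, a sketch (in the slide's notation) of how the unlabeled-data part of this objective arises in [Kingma+ 2014]: the unseen label y is treated as a latent variable and marginalized out, with p(y) the label prior.

```latex
% Unlabeled x: marginalize the label out of the joint bound
\log p(\mathbf{x}) \;\ge\;
  \sum_{\mathbf{y}} q_\phi(\mathbf{y}\,|\,\mathbf{x})
    \bigl[\mathcal{L}(\mathbf{x}\,|\,\mathbf{y}) + \log p(\mathbf{y})\bigr]
  \;+\; \mathcal{H}\!\bigl(q_\phi(\mathbf{y}\,|\,\mathbf{x})\bigr)
% Labeled (x, y): the conditional bound L(x|y) is used directly, and the term
% \alpha\,\mathbb{E}[-\log q_\phi(\mathbf{y}\,|\,\mathbf{x})] trains the classifier on labeled pairs.
```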
Problems with CVAE semi-supervised learning
¤ In the model, the label and the latent variable are independent.
¤ However, because the approximate distribution is q_φ(z|x, y), a dependency between y and z arises.
¤ We want latent variables that keep the information in the data but are independent of the label.
¤ If we regard y as a domain, we should be able to obtain representations with the domain removed.
(Graphical model: z, x, y.)
Variational Fair Autoencoder
¤ The Variational Fair Autoencoder [Louizos+ 15; ICLR 2016]
¤ To make x and s independent (s is the sensitive variable; the label on the previous slide), minimize the following maximum mean discrepancy (MMD):
¤ make the latent variables for s = 0 and s = 1 indistinguishable.
¤ This penalty is added to the VAE's lower bound.
¤ The MMD is usually computed through a kernel.
¤ However, computing a high-dimensional Gram matrix under SGD is expensive, so the feature map is approximated in the following form:
(Excerpt:) An empirical estimate of the MMD is (Gretton et al., 2006):
ℓ_MMD(X, X′) = (1/N_0²) Σ_{n=1}^{N_0} Σ_{m=1}^{N_0} k(x_n, x_m) + (1/N_1²) Σ_{n=1}^{N_1} Σ_{m=1}^{N_1} k(x′_n, x′_m) − (2/(N_0 N_1)) Σ_{n=1}^{N_0} Σ_{m=1}^{N_1} k(x_n, x′_m).   (7)
Asymptotically, for a universal kernel such as the Gaussian kernel k(x, x′) = e^{−γ‖x−x′‖²}, ℓ_MMD(X, X′) is 0 if and only if P_0 = P_1. Equivalently, minimizing MMD can be viewed as matching all of the moments of P_0 and P_1. Therefore, we can use it as an extra "regularizer" and force the model to try to match the moments between the marginal posterior distributions of our latent variables, i.e., q_φ(z_1|s = 0) and q_φ(z_1|s = 1) (in the case of binary nuisance information s_1). By adding the MMD penalty into the lower bound of our aforementioned VAE architecture we obtain our proposed model, the "Variational Fair Autoencoder" (VFAE):
F_VFAE(φ, θ; x_n, x_m, s_n, s_m, y_n) = F_VAE(φ, θ; x_n, x_m, s_n, s_m, y_n) − ℓ_MMD(Z_{1,s=0}, Z_{1,s=1})   (8)
where:
ℓ_MMD(Z_{1,s=0}, Z_{1,s=1}) = ‖ E_{p̃(x|s=0)}[ E_{q(z_1|x,s=0)}[ψ(z_1)] ] − E_{p̃(x|s=1)}[ E_{q(z_1|x,s=1)}[ψ(z_1)] ] ‖².   (9)

2.4 Fast MMD via random Fourier features
A naive implementation of MMD in minibatch stochastic gradient descent would require computing the M × M Gram matrix for each minibatch during training, where M is the minibatch size. Instead, we can use random kitchen sinks (Rahimi & Recht, 2009) to compute a feature expansion such that computing the estimator (6) approximates the full MMD (7). To compute this, we draw a random K × D matrix W, where K is the dimensionality of x, D is the number of random features and each entry of W is drawn from a standard isotropic Gaussian. The feature expansion is then given as:
ψ_W(x) = sqrt(2/D) · cos( sqrt(2/γ) · xW + b ),   (10)
where b is a D-dimensional uniform random vector with entries in [0, 2π]. Zhao & Meng (2015) have successfully applied the idea of using random kitchen sinks to approximate MMD. This estimator is fairly accurate, and is typically much faster than the full MMD penalty. We use D = 500 in our experiments.
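A minimal sketch of this fast MMD penalty with random kitchen sinks; the kernel bandwidth is assumed to be folded into the scale of W, and D = 500 matches the value quoted above.

```python
# Fast MMD between two batches of latent codes via random Fourier features.
import math
import torch

def rff_features(z, W, b, D):
    # psi_W(z) = sqrt(2/D) * cos(z W + b); any 1/gamma bandwidth scaling is
    # assumed to be absorbed into W when it is drawn.
    return math.sqrt(2.0 / D) * torch.cos(z @ W + b)

def fast_mmd(z0, z1, D=500):
    # z0: latents with s = 0, z1: latents with s = 1, both of shape (batch, K)
    K = z0.shape[1]
    W = torch.randn(K, D)            # entries from a standard isotropic Gaussian
    b = 2 * math.pi * torch.rand(D)  # uniform entries in [0, 2*pi]
    diff = rff_features(z0, W, b, D).mean(0) - rff_features(z1, W, b, D).mean(0)
    return diff.pow(2).sum()         # squared norm of the mean-feature difference
```

In training, this scalar is subtracted from the VAE lower bound, as in Eq. (8) above.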
Experiment: verifying fairness
¤ Verify whether the information about s has been removed from z.
¤ Evaluated by the accuracy of classifying s from z.

(Excerpt:) Figure 3: Fair classification results. Columns correspond to each evaluation scenario (in order): Random/RF/LR accuracy on s, Discrimination/Discrimination prob. against s and Random/Model ... Panels: (a) Adult dataset, (b) German dataset, (c) Health dataset.
Experiment: domain adaptation
¤ Domain adaptation between different domains.
¤ Experiments in a semi-supervised setting (no labels in the target domain).
¤ The Amazon reviews dataset.
¤ y is the sentiment (positive or negative).
¤ Result:
¤ 9 of the 12 tasks outperform prior work ([Ganin+ 15]).

(Excerpt:) 2 Learning invariant representations
Figure 1: Unsupervised model (x, z, s; plate over N). Figure 2: Semi-supervised model (x, z_1, z_2, s, y; plate over N).
2.1 Unsupervised model
Factoring out undesired variations from the data can be easily formulated as a general probabilistic model which admits two distinct (independent) "sources": an observed variable s, which denotes the variations that we want to remove, and a continuous latent variable z which models all the remaining information. This generative process can be formally defined as:
z ~ p(z);  x ~ p_θ(x|z, s)
where p_θ(x|z, s) is an appropriate probability distribution for the data we are modelling. With this formulation we explicitly encode a notion of 'invariance' in our model, since the latent representation ...
As far as domain adaptation is concerned, we compared against a recent neural-network-based state-of-the-art method for domain adaptation, Domain Adversarial Neural Network (DANN) (Ganin et al., 2015). As we can observe in Table 1, our accuracy on the labels y is higher on 9 out of the 12 domain adaptation tasks whereas on the remaining 3 it is quite similar to the DANN architecture.

Table 1: Results on the Amazon reviews dataset. The DANN column is taken directly from Ganin et al. (2015) (the column that uses the original representation as input). (In the original table the RF and LR columns fall under "S" and the VFAE and DANN columns under "Y".)

Source - Target          RF     LR     VFAE   DANN
books - dvd              0.535  0.564  0.799  0.784
books - electronics      0.541  0.562  0.792  0.733
books - kitchen          0.537  0.583  0.816  0.779
dvd - books              0.537  0.563  0.755  0.723
dvd - electronics        0.538  0.566  0.786  0.754
dvd - kitchen            0.543  0.589  0.822  0.783
electronics - books      0.562  0.590  0.727  0.713
electronics - dvd        0.556  0.586  0.765  0.738
electronics - kitchen    0.536  0.570  0.850  0.854
kitchen - books          0.560  0.593  0.720  0.709
kitchen - dvd            0.561  0.599  0.733  0.740
kitchen - electronics    0.533  0.565  0.838  0.843
Uses of the CVAE
¤ A conditional VAE can generate images conditioned on labels and the like.
¤ It can also generate data that does not exist among the training samples.
¤ Conditioning on digit labels [Kingma+ 2014; NIPS 2014]

(Excerpt:) Figure 1: (a) Visualisation of handwriting styles learned by the model with 2D z-space, obtained by fixing the class label and varying the 2D latent variable z. (b, c) Analogical reasoning with generative semi-supervised models using a high-dimensional z-space ((b) MNIST analogies, (c) SVHN analogies). The leftmost columns show images from the test set; the other columns show analogical fantasies of x by the generative model, where the latent variable z of each row is set to the value inferred from the corresponding test image.
Conditional alignDRAW
¤ Generating Images from Captions with Attention [Mansimov+ 16; ICLR 2016]
¤ A model that conditions DRAW on a bidirectional RNN.
¤ DRAW [Gregor+ 15]
¤ Brings RNNs into the VAE framework.
¤ The image (canvas) is overwritten at each time step.
¤ Attention is modeled by looking at the difference from the previous step.

(Figure residue from "DRAW: A Recurrent Neural Network for Image Generation": Figure 2. Left: a conventional variational auto-encoder (encoder FNN → sample z → decoder FNN, giving P(x|z)). Right: the DRAW network, in which an encoder RNN with read operations and a decoder RNN with write operations update a canvas c_t over time steps t = 1..T; latents are drawn from Q(z_t | x, z_{1:t−1}) during encoding (inference) and the final canvas defines P(x | z_{1:T}) (generative model).

Figure 7. MNIST generation sequences for DRAW without attention. Notice how the network first generates a very blurry image ...)
Conditional alignDRAW
¤ Overview of conditional alignDRAW.
¤ A model that conditions DRAW on a bidirectional RNN.
¤ The conditioning uses a weighted sum of the bidirectional RNN outputs.

(Excerpt:) Figure 2: AlignDRAW model for generating images by learning an alignment between the input captions and generating canvas. The caption is encoded using the Bidirectional RNN (left). The generative RNN takes a latent sequence z_{1:T} sampled from the prior along with the dynamic caption representation s_{1:T} to generate the canvas matrix c_T, which is then used to generate the final image x (right). The inference RNN is used to compute the approximate posterior Q over the latent sequence.

3.2 Image model: the conditional DRAW network
To generate an image x conditioned on the caption information y, we extended the DRAW network (Gregor et al., 2015) to include the caption representation h^lang at each step, as shown in Fig. 2. The conditional DRAW network is a stochastic recurrent neural network that consists of a sequence of latent variables Z_t ∈ R^D, t = 1, .., T, where the output is accumulated over all T time-steps. For simplicity in notation, the images x ∈ R^{h×w} are assumed to have size h-by-w and only one color channel. (Graphical model on the slide: y conditions z, which generates x.)

3.3 Learning
The model is trained to maximize a variational lower bound L on the marginal likelihood of the correct image x given the input caption y:
L = Σ_Z Q(Z | x, y) log P(x | y, Z) − D_KL(Q(Z | x, y) || P(Z | y)) ≤ log P(x | y).   (9)
Similar to the DRAW model, the inference recurrent network produces an approximate posterior Q(Z_{1:T} | x, y) via a read operator, which reads a patch from an input image x using two arrays of 1D Gaussian filters (the inverse of write from Section 3.2) at each time-step t. Specifically,
x̂_t = x − σ(c_{t−1}),   (10)
r_t = read(x_t, x̂_t, h^gen_{t−1}),   (11)
h^infer_t = LSTM^infer(h^infer_{t−1}, [r_t, h^gen_{t−1}]),   (12)
Q(Z_t | x, y, Z_{1:t−1}) = N( μ(h^infer_t), σ(h^infer_t) ),   (13)
where x̂ is the error image and h^infer_0 is initialized to the learned bias b. Note that the inference LSTM^infer takes as its input both the output of the read operator r_t ∈ R^{p×p}, which depends on the original input image x, and the previous state of the generative decoder h^gen_{t−1}, which depends on the latent sample history z_{1:t−1} and dynamic sentence representation s_{t−1} (see Eq. 3). Hence, the approximate posterior Q will depend on the input image x, the corresponding caption y, and the latent history Z_{1:t−1}, except for the first step Q(Z_1|x), which depends only on x.
The terms in the variational lower bound Eq. 9 can be rearranged using the law of total expectation. Therefore, the variational bound L is calculated as follows:
L = E_{Q(Z_{1:T} | y, x)}[ log p(x | y, Z_{1:T}) − Σ_{t=2}^{T} D_KL( Q(Z_t | Z_{1:t−1}, y, x) || P(Z_t | Z_{1:t−1}, y) ) ] − D_KL( Q(Z_1 | x) || P(Z_1) ).   (14)
(Footnote 1: We also experimented with a conditional Gaussian observation model, but it worked worse compared to the Bernoulli model.)
Experiment: MNIST with captions
¤ Trained on MNIST with artificial captions.
¤ The captions specify where the digits are placed.
¤ Left: configurations that appear in the training data; right: configurations that do not.
¤ Multiple digits are also generated appropriately.

(Excerpt:) Figure 6: Examples of generating 60 × 60 MNIST images corresponding to respective captions. The captions on the left column were part of the training set. The digits described in the captions on the right column were hidden during training for the respective configurations.

Appendix A: MNIST with captions
As an additional experiment, we trained our model on the MNIST dataset with artificial captions. Either one or two digits from the MNIST training dataset were placed on a 60 × 60 blank image. One digit was placed in one of the four (top-left, top-right, bottom-left or bottom-right) corners of the image. Two digits were either placed horizontally or vertically in non-overlapping fashion. The corresponding artificial captions specified the identity of each digit along with their relative positions, e.g. "The digit three is at the top of the digit one", or "The digit seven is at the bottom left of the image". The generated images together with the attention alignments are displayed in Figure 6.
Experiment: MS COCO dataset
¤ Only a part of the caption (underlined in the slides) is changed.
¤ Generation from captions that do not exist in the data.

(Excerpt:) Figure 3: Top: examples of changing the color while keeping the caption fixed ("A yellow / red / green / blue school bus parked in a parking lot."). Bottom: examples of changing the object while keeping the caption fixed ("The decadent chocolate dessert is on the table." / "A bowl of bananas is on the table."; "A vintage photo of a cat." / "A vintage photo of a dog."). The shown images are the probabilities σ(c_T). Best viewed in colour.

The expectation can be approximated by L Monte Carlo samples z̃_{1:T} from Q(Z_{1:T} | y, x):
L ≈ (1/L) Σ_{l=1}^{L} [ log p(x | y, z̃^l_{1:T}) − Σ_{t=2}^{T} D_KL( Q(Z_t | z̃^l_{1:t−1}, y, x) || P(Z_t | z̃^l_{1:t−1}, y) ) ] − D_KL( Q(Z_1 | x) || P(Z_1) ).   (15)
The model can be trained using stochastic gradient descent. In all of our experiments, we used only a single sample from Q(Z_{1:T} | y, x) for parameter learning. Training details, hyperparameter settings, and the overall model architecture are specified in Appendix B. The code is available at https://github.com/emansim/text2image.

3.4 Generating images from captions
During the image generation step, we discard the inference network and instead sample from the prior distribution. Due to the blurriness of samples generated by the DRAW model, we perform an additional post-processing step where we use an adversarial network trained on residuals of a Laplacian pyramid.

Figure 1: Examples of generated images based on captions that describe novel scene compositions that are highly unlikely to occur in real life ("A stop sign is flying in blue skies.", "A herd of elephants flying in the blue skies.", "A toilet seat sits open in the grass field.", "A person skiing on sand clad vast desert."). The captions describe a common object doing unusual things or set in a strange location.
Gaussian processes and VAE

Gaussian processes
¤ What is a Gaussian process?
¤ A probability distribution over functions.
¤ For any dataset of D-dimensional input vectors, the joint distribution of the corresponding vector of function outputs is always Gaussian.
¤ It is completely described by a mean vector and a covariance matrix.
(Excerpt:) We show that the proposed model outperforms state-of-the-art stand-alone deep learning architectures and Gaussian processes with advanced kernel learning procedures on a wide range of datasets, demonstrating its practical significance. We achieve scalability while retaining non-parametric model structure by leveraging the very recent KISS-GP approach (Wilson and Nickisch, 2015) and extensions in Wilson et al. (2015) for efficiently representing kernel functions, to produce scalable deep kernels.

3 Gaussian Processes
We briefly review the predictive equations and marginal likelihood for Gaussian processes (GPs), and the associated computational requirements, following the notational conventions in Wilson et al. (2015). See, for example, Rasmussen and Williams (2006) for a comprehensive discussion of GPs.
We assume a dataset D of n input (predictor) vectors X = {x_1, ..., x_n}, each of dimension D, which index an n × 1 vector of targets y = (y(x_1), ..., y(x_n))^T. If f(x) ~ GP(μ, k_γ), then any collection of function values f has a joint Gaussian distribution,
f = f(X) = [f(x_1), ..., f(x_n)]^T ~ N(μ, K_{X,X}),   (1)
with a mean vector, μ_i = μ(x_i), and covariance matrix, (K_{X,X})_ij = k_γ(x_i, x_j), determined from the mean function and covariance kernel of the Gaussian process. The kernel, k_γ, is parametrized by γ. Assuming additive Gaussian noise, y(x)|f(x) ~ N(y(x); f(x), σ²), the predictive distribution of the GP evaluated at the n* test points indexed by X*, is given by
f* | X*, X, y, γ, σ² ~ N(E[f*], cov(f*)),   (2)
E[f*] = μ_{X*} + K_{X*,X}[K_{X,X} + σ²I]^{-1} y,
cov(f*) = K_{X*,X*} − K_{X*,X}[K_{X,X} + σ²I]^{-1} K_{X,X*}.
K_{X*,X}, for example, is an n* × n matrix of covariances between the GP evaluated at X* and X. μ_{X*} is the n* × 1 mean vector, and K_{X,X} is the n × n covariance matrix evaluated at training inputs X. All covariance (kernel) matrices implicitly depend on the kernel hyperparameters γ.
(Figure 15.2, §15.2 "GPs for regression": Left: some functions sampled from a GP prior with SE kernel. Right: some samples from a GP posterior, after conditioning on 5 noise-free observations. The shaded area represents E[f(x)] ± 2 std(f(x)).)
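A minimal numpy sketch of these predictive equations with a zero mean function and a squared-exponential kernel; the lengthscale and noise level are illustrative values.

```python
# GP regression: predictive mean and covariance at test inputs X_star.
import numpy as np

def se_kernel(A, B, lengthscale=1.0, variance=1.0):
    # k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_star, noise=0.1):
    K = se_kernel(X, X) + noise**2 * np.eye(len(X))      # K_{X,X} + sigma^2 I
    K_s = se_kernel(X_star, X)                           # K_{X*,X}
    K_ss = se_kernel(X_star, X_star)                     # K_{X*,X*}
    mean = K_s @ np.linalg.solve(K, y)                   # E[f*]
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)         # cov(f*)
    return mean, cov

# usage: 5 noisy 1-D observations, predictions on a grid
X = np.linspace(-5, 5, 5)[:, None]
y = np.sin(X).ravel()
mean, cov = gp_predict(X, y, np.linspace(-5, 5, 100)[:, None])
```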
Deep Gaussian processes
¤ To represent more complex samples, stack layers by process composition [Lawrence & Moore, 07]
→ a deep Gaussian process (deep GP).
¤ Consider the following multi-layer graphical model,
¤ where Y is the data and the X's are latent variables.
(Excerpt:) Figure 1: A deep Gaussian process with two hidden layers. (Graphical model: latent layers X_3, X_2, X_1 and observed Y, with mappings f_1, f_2, f_3 ~ GP between consecutive layers.)
2 Deep Gaussian Processes
Gaussian processes provide flexible, non-parametric, probabilistic approaches to function estimation. However, their tractability comes at a price: they can only represent a restricted class of functions. Indeed, even though sophisticated definitions and combinations of covariance functions can lead to powerful models (Durrande et al., 2011; Gönen & Alpaydin, 2011; Hensman et al., 2013; Duvenaud et al., 2013; Wilson & Adams, 2013), the assumption about joint normal distribution of instantiations of the latent function remains; this limits the applicability of the models. One line of recent research to address this limitation focused on function composition (Snelson et al., 2004; Calandra et al., 2014). Inspired by deep neural networks, a deep Gaussian process instead employs process composition (Lawrence & Moore, 2007; Damianou et al., 2011; Lázaro-Gredilla, 2012; Damianou & Lawrence, 2013; Hensman & Lawrence, 2014).
A deep GP is a deep directed graphical model that consists of multiple layers of latent variables and employs Gaussian processes to govern the mapping between consecutive layers (Lawrence & Moore, 2007; Damianou, 2015). Observed outputs are placed in the down-most layer and observed inputs (if any) are placed in the upper-most layer, as illustrated in Figure 1. More formally, consider a set of data Y ∈ R^{N×D} with N datapoints and D dimensions. A deep GP then defines L layers of latent variables, {X_l}_{l=1}^{L}, X_l ∈ R^{N×Q_l}, through the following nested noise model definition:
Y = f_1(X_1) + ε_1,   ε_1 ~ N(0, σ_1² I)   (1)
X_{l−1} = f_l(X_l) + ε_l,   ε_l ~ N(0, σ_l² I),   l = 2 ... L   (2)
where the functions f_l are drawn from Gaussian processes with covariance functions k_l, i.e. f_l(x) ~ GP(0, k_l(x, x′)). In the unsupervised case, the top hidden layer is assigned a unit Gaussian as a fairly uninformative prior which also provides soft regularization, i.e. X_L ~ N(0, I). In the supervised learning scenario, the inputs of the top hidden layer are observed and govern its hidden outputs.
The expressive power of a deep GP is significantly greater than that of a standard GP, because the successive warping of latent variables through the hierarchy allows for modeling non-stationarities and sophisticated, non-parametric functional "features" (see Figure 2). Similarly to how a GP is the limit of an infinitely wide neural network, a deep GP is the limit where the parametric function composition of a deep neural network turns into a process composition.
Sampling from a deep Gaussian process
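A minimal sketch of drawing one sample path from a two-layer deep GP prior by composing single-layer GP draws (SE kernel; the layer noise terms from Eqs. (1)-(2) are omitted for brevity, and the lengthscales are illustrative).

```python
# Compose GP prior draws: Y = f_1(X_1), X_1 = f_2(X_2), both f_l ~ GP(0, k_l).
import numpy as np

def se_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def sample_gp(X, lengthscale=1.0, jitter=1e-6):
    # one draw from a zero-mean GP prior evaluated at the rows of X
    K = se_kernel(X, X, lengthscale) + jitter * np.eye(len(X))
    return np.linalg.cholesky(K) @ np.random.randn(len(X))

X2 = np.linspace(-5, 5, 200)[:, None]          # top-layer inputs
X1 = sample_gp(X2)                             # hidden layer: X_1 = f_2(X_2)
Y = sample_gp(X1[:, None], lengthscale=0.5)    # observed layer: Y = f_1(X_1)
```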
VAE-DGP
¤ A framework for variational inference in DGPs had been proposed [Damianou & Lawrence 13], but it could only be trained on small datasets,
¤ because of the inversion of covariance matrices and the huge number of parameters.
¤ Treat the inference part of the DGP as the recognition model (encoder) of a VAE.
¤ This adds constraints, reduces the number of parameters, and makes inference faster.
¤ It also suppresses overfitting compared with conventional DGPs.
→ VAE-DGP
Variationally Auto-Encoded Deep Gaussian Processes [Dai+ 15; ICLR 2016]
(Excerpt:) Figure 3: A deep Gaussian process with three hidden layers and back-constraints. (Graphical model: layers X_3, X_2, X_1, Y with mappings f_1, f_2, f_3 ~ GP, and back-constraint mappings {g_1(y^(n))}, {g_2(μ_1^(n))}, {g_3(μ_2^(n))}, n = 1..N.)
Experiment: imputing missing data
¤ Imputation of missing regions on test data.
¤ In each example, the rightmost column is the original image.

(Excerpt:) Figure 5: (a) The samples generated from VAE-DGP trained on the combination of Frey faces and Yale faces (Frey-Yale). (b) Imputation from the test set of Frey-Yale. (c) Imputation from the test set of SVHN. The gray color indicates the missing area. The 1st column shows the input images, the 2nd column shows the imputed images and the 3rd column shows the original full images.
Experiment: quantitative evaluation
¤ Log-likelihood (MNIST)
¤ Supervised learning (regression)
¤ Datasets:
¤ the Abalone dataset
¤ the Creep dataset

(Excerpt:) 5.1 Unsupervised learning

Model               MNIST
DBN                 138 ± 2
Stacked CAE         121 ± 1.6
Deep GSN            214 ± 1.1
Adversarial nets    225 ± 2
GMMN+AE             282 ± 2
VAE-DGP (5)         301.67
VAE-DGP (10-50)     674.86
VAE-DGP (5-20-50)   723.65

Table 1: Log-likelihood for the MNIST test data with different models. The baselines are DBN and Stacked CAE (Bengio et al., 2013), Deep GSN (Bengio et al., 2014), Adversarial nets (Goodfellow et al., 2014) and GMMN+AE (Li et al., 2015).

Figure 6: Samples of imputation on the test sets. The gray color indicates the missing area. The 1st column shows the input images, the 2nd column shows the imputed images and the 3rd column shows the original full images.

We first apply our model to the combination of Frey faces and Yale faces (Frey-Yale). The Frey faces contains 1956 20 × 28 frames taken from a video clip. The Yale faces contains 2414 images, which are resized to 20 × 28. We take the last 200 frames from the Frey faces and 300 images randomly from Yale faces as the test set and use the rest for training. The intensity of the original gray-scale images is normalized to [0, 1]. The applied VAE-DGP has two hidden layers (a 2D top hidden layer and a 20D middle hidden layer). The exponentiated quadratic kernel is used for all the layers with 100 inducing points. All the MLPs in the recognition model have two hidden layers with widths (500-300). As a generative model, we can draw samples from the learned model by sampling first from the prior distribution of the top hidden layer (a 2D unit Gaussian distribution in this case) and layer-wise downwards. The generated images are shown in Figure 5a.
To evaluate the ability of our model to learn the data distribution, we train the VAE-DGP on MNIST (LeCun et al., 1998). We use the whole training set for learning, which consists of 60,000 28 × 28 images. The intensity of the original gray-scale images is normalized to [0, 1]. We train our model with three different model settings (one, two and three hidden layers).

Model      Abalone            Creep
VAE-DGP    825.31 ± 64.35     575.39 ± 29.10
GP         888.96 ± 78.22     602.11 ± 29.59
Lin. Reg.  917.31 ± 53.76     1865.76 ± 23.36

Table 2: MSE obtained from our VAE-DGP, standard GP and linear regression for the Abalone and Creep benchmarks.
Mean-field approximation in variational inference
¤ So far we have taken the VAE's approximate distribution to be q(z|x),
¤ with q(z|x) represented by a neural network.
¤ In general, the approximate distribution is built with a mean-field approximation.
¤ We can also consider richer approximate distributions:
¤ treat the parameters λ as random variables and place a prior on them (hierarchical variational models).

(Recap) log p(x) = L(x) + KL(q(z|x) || p(z|x))

(Excerpt from the pasted slide "Variational Models":)
• We want to compute the posterior p(z|x) (z: latent variables, x: data).
• Variational inference seeks to minimize KL(q(z; λ) || p(z|x)) for a family q(z; λ).
• Maximizing the evidence lower bound (ELBO): log p(x) ≥ E_{q(z;λ)}[log p(x|z)] − KL(q(z; λ) || p(z)).
• (Common) Mean-field distribution: q(z; λ) = Π_i q(z_i; λ_i).
• Hierarchical variational models: (newer) interpret the family as a variational model for posterior latent variables z (introducing new latent variables) [1].
Lawrence, N. (2000). Variational Inference in Probabilistic Models. PhD thesis.
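For reference, a compact statement of the two families mentioned above, in the notation of hierarchical variational models (λ are the variational parameters, θ the parameters of the prior placed over them):

```latex
% Mean-field family: independent factors with free parameters \lambda_i
q_{\mathrm{MF}}(z;\lambda) = \prod_i q(z_i;\lambda_i)
% Hierarchical variational model: put a prior q(\lambda;\theta) on the mean-field
% parameters and marginalize them out, which couples the z_i
q_{\mathrm{HVM}}(z;\theta) = \int q(\lambda;\theta)\,\prod_i q(z_i\,|\,\lambda_i)\,\mathrm{d}\lambda
```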
Variational Gaussian process
¤ The Variational Gaussian Process [Tran+ 15; ICLR 2016]
¤ Proposes a very powerful variational model.
¤ Treating D as variational data (parameters), consider the following generative process for z:
¤ draw a latent input ξ;
¤ draw a non-linear mapping from a Gaussian process conditioned on D;
¤ generate the latent variable z.
Likelihood of the variational Gaussian process
¤ From the generative process on the previous page, the marginal likelihood of the latent variable z is given by Eq. (4) below.
¤ No matter what distribution p(z|x) is, there exist parameters of this approximate distribution that make KL(q || p(z|x)) = 0 (universal approximation).
¤ In other words, it is a far more flexible model than any previous approach.

(Excerpt:) ... variational distribution. (This idea appears in a different context in Blei & Lafferty (2006).) The VGP specifies the following generative process for posterior latent variables z:
1. Draw latent input ξ ∈ R^c: ξ ~ N(0, I).
2. Draw non-linear mapping f: R^c → R^d conditioned on D: f ~ Π_{i=1}^{d} GP(0, K_ξξ) | D.
3. Draw approximate posterior samples z ∈ supp(p): z = (z_1, ..., z_d) ~ Π_{i=1}^{d} q(f_i(ξ)).
Figure 1 displays a graphical model for the VGP. Marginalizing over all non-linear mappings and latent inputs, the VGP is
q_VGP(z; θ, D) = ∫∫ [ Π_{i=1}^{d} q(z_i | f_i(ξ)) ] [ Π_{i=1}^{d} GP(f_i; 0, K_ξξ) | D ] N(ξ; 0, I) df dξ,   (4)
which is parameterized by kernel hyperparameters θ and variational data.
As a variational model, the VGP forms an infinite ensemble of mean-field distributions. A mean-field distribution is specified conditional on a fixed function f(·) and input ξ; the d outputs f_i(ξ) = λ_i are the mean-field's parameters. The VGP is a form of a hierarchical variational model (Ranganath et al., 2015); it places a continuous Bayesian nonparametric prior over mean-field parameters.
Note that the VGP evaluates the d draws from a GP at the same latent input ξ, which induces correlation between their outputs, the mean-field parameters. In turn, this induces correlation between latent variables of the variational model, correlations that are not captured in classical mean-field.
Finally, the complex non-linear mappings drawn from the GP make the VGP a flexible model for complex discrete and continuous posteriors.
We emphasize that the VGP needs variational data because, unlike typical GP regression, there is no observed data available to learn a distribution over non-linear mappings.
(Recap) log p(x) = L(x) + KL(q(z|x) || p(z|x))

(Excerpt:) Inference can be viewed as a sequence of domain mappings, from the variational latent variable space to the posterior latent variable space Q to the data space P: variational inference is performed in the posterior space and auxiliary inference in the variational space. Conditional on variational data with parameters to learn, the distribution of the GP learns to concentrate around the optimal mapping f* during inference. This perspective provides intuition behind the following result.
Theorem (Universal approximation). Let q(z; θ, D) denote the variational Gaussian process. For any posterior distribution p(z | x) with a finite number of latent variables and continuous quantile function (inverse CDF), there exist a set of parameters (θ, D) such that KL(q(z; θ, D) || p(z | x)) = 0.
Theorem 1 states that any posterior distribution with strictly positive density can be represented by a VGP. Thus the VGP is a flexible model for learning posterior distributions.
Lower bound
¤ Training maximizes the following lower bound.
¤ The picture is roughly as follows:
¤ the approximate model generates z from x;
¤ the auxiliary model generates the mapping and the latent input from x and z.
(Excerpt:) 3 Black box inference
3.1 Variational objective
We derive an algorithm for performing black box inference over a wide class of generative models. The original ELBO (Eq. 1) is analytically intractable due to the log density log q_VGP(z) (Eq. 4). We derive a tractable variational objective inspired by auto-encoders.
Specifically, a tractable lower bound to the model evidence log p(x) can be derived by subtracting an expected KL divergence term from the ELBO:
log p(x) ≥ E_qVGP[log p(x | z)] − KL(q_VGP(z) || p(z)) − E_qVGP[ KL(q(ξ, f | z) || r(ξ, f | z)) ],
where r(ξ, f | z) is an auxiliary model. Such an objective has been considered independently by Salimans et al. (2015) and Ranganath et al. (2015). Variational inference is performed in the posterior latent variable space, minimizing KL(q||p) to learn the variational model; for this to occur auxiliary inference is performed in the variational latent variable space, minimizing KL(q||r) to learn an auxiliary model. See Figure 2.
Unlike previous approaches, we rewrite this variational objective to connect to auto-encoders:
L̃ = E_qVGP[log p(x | z)] − E_qVGP[ KL(q(z | f(ξ)) || p(z)) + KL(q(ξ, f) || r(ξ, f | z)) ],   (5)
where the KL divergences are now taken over tractable distributions (see Appendix C). In auto-encoder parlance, we maximize the expected negative reconstruction error, regularized by an expected divergence between the variational model and the original model's prior, and an expected divergence between the auxiliary model and the variational model's prior. This is simply a nested instantiation of the variational auto-encoder bound (Kingma & Welling, 2014): a KL divergence between the inference model and a prior is taken as regularizers on both the posterior and variational spaces. This interpretation justifies the previously proposed bound for variational models; as we shall see, it also enables lower variance gradients during stochastic optimization.
(In Eq. (5): the first expectation is the (negative) reconstruction error, the KL terms act as regularizers, and r is the auxiliary model.)
(Excerpt:) 3.2 Auto-encoding variational models
Inference networks provide a flexible parameterization of approximating distributions as used in Helmholtz machines (Hinton & Zemel, 1994), deep Boltzmann machines (Salakhutdinov & Larochelle, 2010), and variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014). It replaces local variational parameters with global parameters coming from a neural network. Specifically, for latent variables z_n which correspond to a data point x_n, an inference network specifies a neural network which takes x_n as input and its local variational parameters λ_n as output. This amortizes inference by only defining a set of global parameters.
To auto-encode the VGP we specify inference networks to parameterize both the variational and auxiliary models. Unique from other auto-encoder approaches, we let the auxiliary model take both the observed data point x_n and variational data point z_n as input:
x_n → q(z_n | x_n; θ_n),   x_n, z_n → r(ξ_n, f_n | x_n, z_n; φ_n),
where q has local variational parameters given by the variational data D_n, and r is specified as a fully factorized Gaussian with local variational parameters φ_n = (μ_n ∈ R^{c+d}, σ²_n ∈ R^{c+d}). Note that by letting r's inference network take both x_n and z_n as input, we avoid the restrictive explicit specification of r(ε, f | z). This idea was first suggested but not implemented by Ranganath et al. (2015).
Experiment: log-likelihood
¤ The first to break into the 70s, which no model had reached before.
¤ The best model uses DRAW for the generative part and the VGP as the approximate distribution.

(Excerpt; two values are shown where the paper reports both an estimate and a bound:)
Model                                    − log p(x)
DLGM + VAE [1]                           86.76
DLGM + HVI (8 leapfrog steps) [2]        85.51  (≤ 88.30)
DLGM + NF (k = 80) [3]                   85.10
EoNADE-5 2hl (128 orderings) [4]         84.68
DBN 2hl [5]                              84.55
DARN 1hl [6]                             84.13
Convolutional VAE + HVI [2]              81.94  (≤ 83.49)
DLGM 2hl + IWAE (k = 50) [1]             82.90
DRAW [7]                                 80.97
DLGM 1hl + VGP                           84.79
DLGM 2hl + VGP                           81.32
DRAW + VGP                               79.88

Table 1: Negative predictive log-likelihood for binarized MNIST. Previous best results are [1] (Burda et al., 2016), [2] (Salimans et al., 2015), [3] (Rezende & Mohamed, 2015), [4] (Raiko et al., 2014), [5] (Murray & Salakhutdinov, 2009), [6] (Gregor et al., 2014), [7] (Gregor et al., 2015).
Summary
¤ This talk summarized the VAE research at ICLR 2016, covering
¤ variational inference and VAE,
¤ conditional VAE and semi-supervised learning,
¤ Gaussian processes and VAE.
¤ Impressions
¤ I got a rough sense of the trends at ICLR.
¤ It was hard to summarize.

Iclr2016 vaeまとめ

  • 2. 今回の発表について ¤ 今⽇の内容 ¤ ICLRで発表されたVAE関連を中⼼に発表します. ¤ ICLR 2016 ¤ 2016年5⽉2⽇~4⽇ ¤ プエルトリコ,サンフアン ¤ 発表数: ¤ 会議トラック:80 ¤ ワークショップ:55
  • 3. ICLR2016のトレンド Reinforcement Learning Unsupervised Learning Incorporating Structure Compressing Networks Incorporating Structure Initializing Networks Backprop Tricks Attention Deep Metric Learning Computer Vision Applications Visualizing Networks Do Deep Convolutional Nets Really Need to be Deep? Training-Free Methods Geometric Methods Gaussian Processes and Auto Encoders ResNet http://www.computervisionblog.com
  • 4. ICLRにおけるVAE論⽂ ¤ ICLRに採録されたVAE(もしくはVAEに関連する)論⽂は5本 ¤ Importance Weighted Autoencoders ¤ The Variational Fair Autoencoder ¤ Generating Images from Captions with Attention ¤ Variational Gaussian Process ¤ Variationally Auto-Encoded Deep Gaussian Processes ¤ VAEを基礎から説明しつつ,これらの論⽂の説明をします.
  • 5. 識別モデルと⽣成モデル ¤ データを分けることのみに興味があるのが識別モデル ¤ 深層NNやSVMなどは識別モデル(正確には識別関数) ¤ データの⽣成源も考えるのが⽣成モデル 識別モデル(識別関数) ⽣成モデル 𝑝 𝐶# 𝑥 = 𝑝(𝑥, 𝐶#) 𝑝(𝑥) 𝑝 𝐶# 𝑥 分類確率(事後確率)をモデル化 データの分布(同時分布)をモデル化
  • 7. 背景知識:変分推論 ¤ ⽣成モデルの学習=データから分布𝑝(𝑥)をモデル化したい ➡尤度𝑝(𝑥)を最⼤化することで求められる. ¤ 𝑝 𝑥 = ∫ 𝑝 𝑥, 𝑧 𝑑𝑧のように潜在変数𝑧もモデル化した場合・・・ ¤ そのまま最⼤化できない. ¤ よって,代わりに対数尤度を常に下から抑える下界を最⼤化する. ¤ 𝑝 𝑧|𝑥 を近似する分布𝑞(𝑧|𝑥)を考える. ¤ このとき,対数尤度は次のように分解できる. log 𝑝 𝑥 = 𝐿 𝑥 + 𝐾𝐿(𝑞(𝑧|𝑥)||𝑝(𝑧|𝑥)) 下界 真の分布と近似分布の差 必ず0以上になる 452 9. MIXTURE MODELS AND EM Figure 9.12 Illustration of the E step of the EM algorithm. The q distribution is set equal to the posterior distribution for the current parameter val- ues θold , causing the lower bound to move up to the same value as the log like- lihood function, with the KL divergence vanishing. ln p(X|θold )L(q, θold ) KL(q||p) = 0 shown in Figure 9.13. If we substitute q(Z) = p(Z|X, θold ) into (9.71), we see that, after the E step, the lower bound takes the form L(q, θ) = Z p(Z|X, θold ) ln p(X, Z|θ) − Z p(Z|X, θold ) ln p(Z|X, θold ) = Q(θ, θold ) + const (9.74) where the constant is simply the negative entropy of the q distribution and is there- fore independent of θ. Thus in the M step, the quantity that is being maximized is the expectation of the complete-data log likelihood, as we saw earlier in the case of mix- tures of Gaussians. Note that the variable θ over which we are optimizing appears only inside the logarithm. If the joint distribution p(Z, X|θ) comprises a member of the exponential family, or a product of such members, then we see that the logarithm will cancel the exponential and lead to an M step that will be typically much simpler than the maximization of the corresponding incomplete-data log likelihood function p(X|θ). The operation of the EM algorithm can also be viewed in the space of parame- ters, as illustrated schematically in Figure 9.14. Here the red curve depicts the (in- Figure 9.13 Illustration of the M step of the EM algorithm. The distribution q(Z) is held fixed and the lower bound L(q, θ) is maximized with respect to the parameter vector θ to give a revised value θnew . Because the KL divergence is nonnegative, this causes the log likelihood ln p(X|θ) to increase by at least as much as the lower bound does. ln p(X|θnew )L(q, θnew ) KL(q||p)
  • 8. Variational Autioencoder ¤ Variational Autoencoder [Kingma+ 13][Rezende+ 14] ¤ 確率分布を多層ニューラルネットワークで表現した⽣成モデル ¤ 単純のため,潜在変数は𝑧のみとする ¤ 最⼤化する下界は 𝐿 𝐱 = 𝐸89 𝐳 𝐱 log 𝑝; 𝐱, 𝐳 𝑞< 𝐳 𝐱 = 1 𝑇 ? log @ ABC 𝑝; 𝐱, 𝐳(𝒕) 𝑞< 𝐳(𝒕) 𝐱 𝑥 𝑧 𝑞(𝑧|𝑥) 近似分布 =エンコーダーと考える ただし,𝐳(𝒕) = 𝝁 + diag 𝝈 ⨀𝜺, 𝜺~𝑵(𝟎, I) reparameterization trick 𝑥~𝑝(𝑥|𝑧) 𝑧~𝑝(𝑧) デコーダーと考える
  • 9. VAEのモデル化 ¤ ニューラルネットワークによってモデル化する ・・・ ・・・ ・・・ ・・・ ・・・ ・・・ ・・・ 推論モデル sampling z(l) = µ + σ ⊙ ϵ(l) , ϵ(l) ∼ N(0, I). ost significant difference from the estimator of VAE’s lower bound (Eq. (3)) is that there are gative reconstruction terms in Eq. (8). These terms are correspondent to each modality. Same , we call qφ(z|x, w) as encoder and both pθx (x|z(l) ) and pθw (w|z(l) ) as decoder. parameterize encoder and decoder distribution as deep neural networks. Figs.2 draws the which is same as Fig. 1 but represented by deep neural networks. ering the encoder qφ(z|x, w) as a Gaussian distribution, we can estimate mean and variance istribution by neural networks as follows: y(x) = MLPφx (x) y(w) = MLPφw (w) µφ = Linear(y(x), y(w)) log σ2 φ = Tanh(Linear(y(x), y(w)), (9) MLPφx and MLPφw mean deep neural networks corresponding each modality. Moreover, r and Tanh mean a linear layer and a tanh layer. Linear(a, b) means that this network has e input layers, which are corresponding to a and b. e each modality has different feature representation, we should make different networks for coder, pθx (x|z) and pθw (w|z). The type of distribution and the network architecture depend epresentation of each modality, e.g., Gaussian distribution when the representation of modal- ntinuous, Bernoulli distribution when binary value, 0 or 1. In case that pθw (w|z) is Bernoulli tion B(w|µθw ), the parameter of Bernoulli distribution µθw can estimate as follows: y(z) = MLPθw (z) µθ = Linear(y(z)) (10) that the decoder is Gaussian distribution, you can estimate the parameter of this distribution where z(l) = µ + σ ⊙ ϵ(l) , ϵ(l) ∼ N(0, I). The most significant difference from the estimator of VAE’s lower bound (Eq. (3)) is that there are two negative reconstruction terms in Eq. (8). These terms are correspondent to each modality. Same as VAE, we call qφ(z|x, w) as encoder and both pθx (x|z(l) ) and pθw (w|z(l) ) as decoder. We can parameterize encoder and decoder distribution as deep neural networks. Figs.2 draws the model which is same as Fig. 1 but represented by deep neural networks. Considering the encoder qφ(z|x, w) as a Gaussian distribution, we can estimate mean and variance of the distribution by neural networks as follows: y(x) = MLPφx (x) y(w) = MLPφw (w) µφ = Linear(y(x), y(w)) log σ2 φ = Tanh(Linear(y(x), y(w)), (9) where MLPφx and MLPφw mean deep neural networks corresponding each modality. Moreover, Linear and Tanh mean a linear layer and a tanh layer. Linear(a, b) means that this network has multiple input layers, which are corresponding to a and b. Because each modality has different feature representation, we should make different networks for each decoder, pθx (x|z) and pθw (w|z). The type of distribution and the network architecture depend on the representation of each modality, e.g., Gaussian distribution when the representation of modal- ity is continuous, Bernoulli distribution when binary value, 0 or 1. In case that pθw (w|z) is Bernoulli distribution B(w|µθw ), the parameter of Bernoulli distribution µθw can estimate as follows: y(z) = MLPθw (z) µθ = Linear(y(z)) (10) In case that the decoder is Gaussian distribution, you can estimate the parameter of this distribution in the same way as Eq. (9), except that the input of the Linear network is single. The main advantage of this model is following: Figure 2: The network architecture of MVAE. 
This represents the sam Hense, L(x, w) = −DKL(qφ(z|x, w)||p(z)) +Eqφ(z|x,w)[log pθx (x|z)] + Eqφ(z|x,w)[log p By SGBM algorithm, the estimator of the lower bound is as follows: ˆL(x, w) = −DKL(qφ(z|x, w)||p(z)) + 1 L L l=1 log pθx (x|z(l) ) + log pθw (w|z( where z(l) = µ + σ ⊙ ϵ(l) , ϵ(l) ∼ N(0, I). The most significant difference from the estimator of VAE’s lower bound (E two negative reconstruction terms in Eq. (8). These terms are correspondent as VAE, we call qφ(z|x, w) as encoder and both pθx (x|z(l) ) and pθw (w|z(l) We can parameterize encoder and decoder distribution as deep neural netw model which is same as Fig. 1 but represented by deep neural networks. $ #% $& rk architecture of MVAE. This represents the same model as Fig.1. = −DKL(qφ(z|x, w)||p(z)) +Eqφ(z|x,w)[log pθx (x|z)] + Eqφ(z|x,w)[log pθw (w|z)] (7) estimator of the lower bound is as follows: w) = −DKL(qφ(z|x, w)||p(z)) + 1 L L l=1 log pθx (x|z(l) ) + log pθw (w|z(l) ), (8) , ϵ(l) ∼ N(0, I). ence from the estimator of VAE’s lower bound (Eq. (3)) is that there are ・・・ ・・・ settings and found that their proposed model can extract better representations than sing settings. Srivastava & Salakhutdinov (2012) used deep restricted Boltzmann machines (RBM), w earliest deep generative model, to multimodal learning settings. The same as one by usin they jointed latent variables in multiple networks and tried to extract high-level feature timodal features: images and texts. In their experiment, they showed that their model ou Ngiam et al. (2011). It suggests that deep generative models may extract better represen discriminative ones. 2.2 VARIATIONAL AUTOENCODERS Variational autoencoders (VAE) (Welling, 2014; Rezende et al., 2014) are recent propose erative models. Given observation variables x and corresponding latent variables z, we consider their processes as follow: z ∼ p(z); x ∼ pθ(x|z), ˆx ˆw where θ means the model parameter of p. In varaitional inference, we consider qφ(z|x), where φ is the model parameter of q, in proximate the posterior distribution pθ(z|x). The goal of this problem is that maximize t 𝑞(𝑧|𝑥) 𝑥~𝑝(𝑥|𝑧) 𝑧~𝑝(𝑧) ⽣成モデル
  • 10. 良いモデルを学習するには? ¤ 最⼤化しているのは下界だが,本当は(対数)尤度を最⼤化したい ¤ 下界が対数尤度をうまく近似できればよい ¤ これは近似分布がどれだけ近似できるか次第 𝐿 𝑥 = log 𝑝 𝑥 − 𝐾𝐿(𝑞(𝑧|𝑥)||𝑝(𝑧|𝑥)) ¤ しかし実際には,近似分布はVAEの下界によって制約を受けてしまう. ¤ 事後サンプルが少しでも𝑥を説明できないと,⼤きな制約となる. ¤ 解決策: ¤ より対数尤度を近似するような新たな下界を考える 近似分布が真の事後分布を近似できれば KL距離は0になる=下界と対数尤度が等しくなる
  • 11. Importance Weighted AE ¤ Importance Weighted Autoencoders [Bruda+ 15; ICLR 2016] ¤ 次のような新たな下界を提案 ¤ サンプル数kによる重要度重み推定量 𝐿# 𝑥 = 𝐸RS,…,RU~89 𝑧 𝑥 log 1 𝑘 ? 𝑝; 𝐱, 𝐳(𝐤) 𝑞< 𝐳(𝐤) 𝐱 X #BC ¤ この下界は,次の関係が証明されている log 𝑝 𝑥 ≥ 𝐿#ZC 𝑥 ≥ 𝐿# 𝑥 ≥ 𝐿C 𝑥 = 𝐿 𝑥 ¤ サンプル数を増やすだけで,制約が緩和され,真の下界に近づく.
  • 12. IWAE : 実験結果 ¤ テスト尤度が向上していることが確認できる Under review as a conference paper at ICLR 2016 MNIST OMNIGLOT VAE IWAE VAE IWAE # stoch. layers k NLL active units NLL active units NLL active units NLL active units 1 1 86.76 19 86.76 19 108.11 28 108.11 28 5 86.47 20 85.54 22 107.62 28 106.12 34 50 86.35 20 84.78 25 107.80 28 104.67 41 2 1 85.33 16+5 85.33 16+5 107.58 28+4 107.56 30+5 5 85.01 17+5 83.89 21+5 106.31 30+5 104.79 38+6 50 84.78 17+5 82.90 26+7 106.30 30+5 103.38 44+7 Table 1: Results on density estimation and the number of active latent dimensions. For models with two latent layers, “k1+k2” denotes k1 active units in the first layer and k2 in the second layer. The generative performance of IWAEs improved with increasing k, while that of VAEs benefitted only slightly. Two-layer models achieved better generative performance than one-layer models. The log-likelihood results are reported in Table 1. Our VAE results are comparable to those previ- ously reported in the literature. We observe that training a VAE with k > 1 helped only slightly. By contrast, using multiple samples improved the IWAE results considerably on both datasets. Note that
  • 14. VAEによる半教師あり学習 ¤ Semi-Supervised Learning with Deep Generative Models [Kingma+ 2014 ; NIPS 2014] ¤ 条件付きVAE(CVAE)による半教師あり学習 ¤ 条件付きVAEの下界は 𝐿 𝐱|𝒚 = 𝐸89 𝐳 𝐱, 𝐲 log ]^ 𝐱,𝐳|𝐲 89 𝐳 𝐱, 𝐲 ¤ よって,下界は 𝐿 𝐱 + 𝐿 𝐱|𝐲 + 𝛼𝔼[−log𝑞< 𝐲 𝐱 ] ¤ 最後の項はラベル予測するモデル 𝑧 𝑥 𝑦 ラベル
  • 15. CVAE半教師あり学習の問題点 ¤ ラベルと潜在変数はモデル上独⽴になっている ¤ しかし,近似分布が𝑞< 𝐳 𝐱, 𝐲 となっているため,𝐲と𝐳に依存関係が⽣じ てしまう ¤ データの情報を保持しながら,ラベルとは独⽴な潜在変数を獲得したい. ¤ 𝐲をドメインと考えれば,ドメインを除去した表現が獲得できるはず. 𝑧 𝑥 𝑦
  • 16. Variational Fair Autoencoder ¤ The Variational Fair Autoencoder [Louizos+ 15; ICLR 2016] ¤ 𝑥と𝑠(sensitive変数.前ページでいうラベル)を独⽴にするために,次の maximum mean discrepancy(MMD)を⼩さくするようにする. ¤ s=0とs=1のときの潜在変数の差がなくなるようにする. ¤ これをVAEの下界に追加する. ¤ MMDは通常カーネルの計算に持っていく. ¤ しかし,SGDで⾼次元のグラム⾏列を計算するのは⼤変なので,写像 を次の形で求める (7) Asymptotically, for a universal kernel such as the Gaussian kernel k(x, x0 ) = e kx x0 k2 , `MMD(X, X0 ) is 0 if and only if P0 = P1. Equivalently, minimizing MMD can be viewed as matching all of the moments of P0 and P1. Therefore, we can use it as an extra “regularizer” and force the model to try to match the moments between the marginal posterior distributions of our latent variables, i.e., q (z1|s = 0) and q (z1|s = 1) (in the case of binary nuisance information s1 ). By adding the MMD penalty into the lower bound of our aforementioned VAE architecture we obtain our proposed model, the “Variational Fair Autoencoder” (VFAE): FVFAE( , ✓; xn, xm, sn, sm, yn) = FVAE( , ✓; xn, xm, sn, sm, yn) `MMD(Z1s=0, Z1s=1) (8) where: `MMD(Z1s=0, Z1s=1) = k E˜p(x|s=0)[Eq(z1|x,s=0)[ (z1)]] E˜p(x|s=1)[Eq(z1|x,s=1)[ (z1)]]k2 (9) 2.4 FAST MMD VIA RANDOM FOURIER FEATURES A naive implementation of MMD in minibatch stochastic gradient descent would require computing the M ⇥M Gram matrix for each minibatch during training, where M is the minibatch size. Instead, we can use random kitchen sinks (Rahimi & Recht, 2009) to compute a feature expansion such that computing the estimator (6) approximates the full MMD (7). To compute this, we draw a random K ⇥ D matrix W, where K is the dimensionality of x, D is the number of random features and each entry of W is drawn from a standard isotropic Gaussian. The feature expansion is then given as: W(x) = r 2 D cos ✓r 2 xW + b ◆ . (10) where b is a D-dimensional uniform random vector with entries in [0, 2⇡]. Zhao & Meng (2015) have successfully applied the idea of using random kitchen sinks to approximate MMD. This esti- mator is fairly accurate, and is typically much faster than the full MMD penalty. We use D = 500 et al., 2006): `MMD(X, X0 ) = 1 N2 0 N0X n=1 N0X m=1 k(xn, xm) + 1 N2 1 N1X n=1 N1X m=1 k(x0 n, x0 m) 2 N0N1 N0X n=1 N1X m=1 k(xn, x0 m). (7) Asymptotically, for a universal kernel such as the Gaussian kernel k(x, x0 ) = e kx x0 k2 , `MMD(X, X0 ) is 0 if and only if P0 = P1. Equivalently, minimizing MMD can be viewed as matching all of the moments of P0 and P1. Therefore, we can use it as an extra “regularizer” and force the model to try to match the moments between the marginal posterior distributions of our latent variables, i.e., q (z1|s = 0) and q (z1|s = 1) (in the case of binary nuisance information s1 ). By adding the MMD penalty into the lower bound of our aforementioned VAE architecture we obtain our proposed model, the “Variational Fair Autoencoder” (VFAE): FVFAE( , ✓; xn, xm, sn, sm, yn) = FVAE( , ✓; xn, xm, sn, sm, yn) `MMD(Z1s=0, Z1s=1) (8) where: `MMD(Z1s=0, Z1s=1) = k E˜p(x|s=0)[Eq(z1|x,s=0)[ (z1)]] E˜p(x|s=1)[Eq(z1|x,s=1)[ (z1)]]k2 (9) 2.4 FAST MMD VIA RANDOM FOURIER FEATURES A naive implementation of MMD in minibatch stochastic gradient descent would require computing the M ⇥M Gram matrix for each minibatch during training, where M is the minibatch size. Instead, we can use random kitchen sinks (Rahimi & Recht, 2009) to compute a feature expansion such that computing the estimator (6) approximates the full MMD (7). 
To compute this, we draw a random K ⇥ D matrix W, where K is the dimensionality of x, D is the number of random features and each entry of W is drawn from a standard isotropic Gaussian. The feature expansion is then given as: W(x) = r 2 D cos ✓r 2 xW + b ◆ . (10) where b is a D-dimensional uniform random vector with entries in [0, 2⇡]. Zhao & Meng (2015) have successfully applied the idea of using random kitchen sinks to approximate MMD. This esti- mator is fairly accurate, and is typically much faster than the full MMD penalty. We use D = 500 in our experiments.
  • 17. 実験:公平性の検証 ¤ zからsの情報がなくなっているかどうかを検証 ¤ zからsを分類したときの正解率で評価 Under review as a conference paper at ICLR 2016 (a) Adult dataset (b) German dataset (c) Health dataset Figure 3: Fair classification results. Columns correspond to each evaluation scenario (in order): Random/RF/LR accuracy on s, Discrimination/Discrimination prob. against s and Random/Model
  • 18. 実験:ドメイン適応の検証 ¤ 異なるドメイン間でのドメイン適応 ¤ 半教師あり学習で実験(⽬標ドメインのラベルがない). ¤ the Amazon reviews dataset ¤ 𝑦はセンチメント(positiveかnegative) ¤ 結果: ¤ 12のうち9が既存研究([Ganin+ 15])を上回った Under review as a conference paper at ICLR 2016 2 LEARNING INVARIANT REPRESENTATIONS x zs N Figure 1: Unsupervised model x z1s z2 y N Figure 2: Semi-supervised model 2.1 UNSUPERVISED MODEL Factoring out undesired variations from the data can be easily formulated as a general probabili model which admits two distinct (independent) “sources”; an observed variable s, which denotes variations that we want to remove, and a continuous latent variable z which models all the remain information. This generative process can be formally defined as: z ⇠ p(z); x ⇠ p✓(x|z, s) where p✓(x|z, s) is an appropriate probability distribution for the data we are modelling. With formulation we explicitly encode a notion of ‘invariance’ in our model, since the latent repres Under review as a conference paper at ICLR 2016 is concerned, we compared against a recent neural network based state of the art method for domain adaptation, Domain Adversarial Neural Network (DANN) (Ganin et al., 2015). As we can observe in table 1, our accuracy on the labels y is higher on 9 out of the 12 domain adaptation tasks whereas on the remaining 3 it is quite similar to the DANN architecture. Table 1: Results on the Amazon reviews dataset. The DANN column is taken directly from Ganin et al. (2015) (the column that uses the original representation as input). Source - Target S Y RF LR VFAE DANN books - dvd 0.535 0.564 0.799 0.784 books - electronics 0.541 0.562 0.792 0.733 books - kitchen 0.537 0.583 0.816 0.779 dvd - books 0.537 0.563 0.755 0.723 dvd - electronics 0.538 0.566 0.786 0.754 dvd - kitchen 0.543 0.589 0.822 0.783 electronics - books 0.562 0.590 0.727 0.713 electronics - dvd 0.556 0.586 0.765 0.738 electronics - kitchen 0.536 0.570 0.850 0.854 kitchen - books 0.560 0.593 0.720 0.709 kitchen - dvd 0.561 0.599 0.733 0.740 kitchen - electronics 0.533 0.565 0.838 0.843
  • 19. CVAEの活⽤ ¤ 条件付きVAEは,ラベル等に条件づけられた画像を⽣成できる ¤ 学習サンプルに存在していないデータも⽣成可能 ¤ 数字ラベルで条件付け[Kingma+ 2014 ; NIPS 2014] (a) Handwriting styles for MNIST obtained by fixing the class label and varying the 2D latent variable z (b) MNIST analogies (c) SVHN analogies Figure 1: (a) Visualisation of handwriting styles learned by the model with 2D z-space. (b,c) Analogical reasoning with generative semi-supervised models using a high-dimensional z-space. The leftmost columns show images from the test set. The other columns show analogical fantasies of x by the generative model, where the latent variable z of each row is set to the value inferred from
  • 20. Conditional alignDRAW ¤ Generating Images from Captions with Attention [Mansimov+ 16 ; ICLR 2016] ¤ DRAWにbidirectional RNNで条件づけたモデル ¤ DRAW [Gregor+ 14] ¤ VAEの枠組みでRNNを使えるようにしたもの. ¤ 各時間ステップで画像を上書きしていく ¤ 前のステップとの差分をみることで注意(attention)をモデル化 Recurrent Neural Network For Image Generation onstructs scenes s emitted by the encoder. s step by step is the scene while e past few years captured by a se- , than by a sin- chelle & Hinton, ; Ranzato, 2014; et al., 2014; Ser- ed by sequential read x zt zt+1 P(x|z1:T )write encoder RNN sample decoder RNN read x write encoder RNN sample decoder RNN ct 1 ct cT henc t 1 hdec t 1 Q(zt|x, z1:t 1) Q(zt+1|x, z1:t) . . . decoding (generative model) encoding (inference) encoder FNN sample decoder FNN z P(x|z) Figure 2. Left: Conventional Variational Auto-Encoder. Dur- DRAW: A Recurrent Neural Network For Image Generati Time Figure 7. MNIST generation sequences for DRAW without at- tention. Notice how the network first generates a very blurry im-
  • 21. Conditional alignDRAW ¤ Conditional alignDRAWの全体像 ¤ DRAWにbidirectional RNNで条件づけたモデル ¤ Bidirectional RNNの出⼒を重み付け和したもので条件付ける. Published as a conference paper at ICLR 2016 Figure 2: AlignDRAW model for generating images by learning an alignment between the input captions and generating canvas. The caption is encoded using the Bidirectional RNN (left). The generative RNN takes a latent sequence z1:T sampled from the prior along with the dynamic caption representation s1:T to generate the canvas matrix cT , which is then used to generate the final image x (right). The inference RNN is used to compute approximate posterior Q over the latent sequence. 3.2 IMAGE MODEL: THE CONDITIONAL DRAW NETWORK To generate an image x conditioned on the caption information y, we extended the DRAW net- work (Gregor et al., 2015) to include caption representation hlang at each step, as shown in Fig. 2. The conditional DRAW network is a stochastic recurrent neural network that consists of a sequence of latent variables Zt 2 RD , t = 1, .., T, where the output is accumulated over all T time-steps. For simplicity in notation, the images x 2 Rh⇥w are assumed to have size h-by-w and only one color 𝑦 𝑥 𝑧 3.3 LEARNING The model is trained to maximize a variational lower bound L on the marginal likelihood of the correct image x given the input caption y: L = X Z Q(Z | x, y) log P(x | y, Z) DKL (Q(Z | x, y) k P(Z | y))  log P(x | y). (9) Similar to the DRAW model, the inference recurrent network produces an approximate posterior Q(Z1:T | x, y) via a read operator, which reads a patch from an input image x using two arrays of 1D Gaussian filters (inverse of write from section 3.2) at each time-step t. Specifically, ˆxt = x (ct 1), (10) rt = read(xt, ˆxt, hgen t 1), (11) hinfer t = LSTM infer (hinfer t 1 , [rt, hgen t 1]), (12) Q(Zt|x, y, Z1:t 1) = N ⇣ µ(hinfer t ), (hinfer t ) ⌘ , (13) where ˆx is the error image and hinfer 0 is initialized to the learned bias b. Note that the inference LSTM infer takes as its input both the output of the read operator rt 2 Rp⇥p , which depends on the original input image x, and the previous state of the generative decoder hgen t 1, which depends on the latent sample history z1:t 1 and dynamic sentence representation st 1 (see Eq. 3). Hence, the approximate posterior Q will depend on the input image x, the corresponding caption y, and the latent history Z1:t 1, except for the first step Q(Z1|x), which depends only on x. The terms in the variational lower bound Eq. 9 can be rearranged using the law of total expectation. Therefore, the variational bound L is calculated as follows: L =EQ(Z1:T | y,x) " log p(x | y, Z1:T ) TX t=2 DKL (Q(Zt | Z1:t 1, y, x) k P(Zt | Z1:t 1, y)) # DKL (Q(Z1 | x) k P(Z1)) . (14) 1 We also experimented with a conditional Gaussian observation model, but it worked worse compared to the Bernoulli model.
  • 22. 実験:キャプション付きMNIST ¤ キャプション付きのMNISTで学習 ¤ キャプションはMNISTの場所を指定 ¤ 左が訓練データにあるもの,右はないもの. ¤ 複数の数字でも適切に⽣成されている. Published as a conference paper at ICLR 2016 Figure 6: Examples of generating 60 ⇥ 60 MNIST images corresponding to respective captions. The captions on the left column were part of the training set. The digits described in the captions on the right column were hidden during training for the respective configurations. APPENDIX A: MNIST WITH CAPTIONS As an additional experiment, we trained our model on the MNIST dataset with artificial captions. Either one or two digits from the MNIST training dataset were placed on a 60 ⇥ 60 blank image. One digit was placed in one of the four (top-left, top-right, bottom-left or bottom-right) corners of the image. Two digits were either placed horizontally or vertically in non-overlapping fashion. The corresponding artificial captions specified the identity of each digit along with their relative positions, e.g. “The digit three is at the top of the digit one”, or “The digit seven is at the bottom left of the image”. The generated images together with the attention alignments are displayed in Figure 6. The model
  • 23. 実験:MSCOCOデータセット ¤ キャプションの⼀部(下線部)だけを変換 ¤ 存在していないキャプションから⽣成 Published as a conference paper at ICLR 2016 A yellow school bus parked in a parking lot. A red school bus parked in a parking lot. A green school bus parked in a parking lot. A blue school bus parked in a parking lot. The decadent chocolate desert is on the table. A bowl of bananas is on the table. A vintage photo of a cat. A vintage photo of a dog. Figure 3: Top: Examples of changing the color while keeping the caption fixed. Bottom: Examples of changing the object while keeping the caption fixed. The shown images are the probabilities (cT ). Best viewed in colour. The expectation can be approximated by L Monte Carlo samples ˜z1:T from Q(Z1:T | y, x): L ⇡ 1 L LX l=1 " log p(x | y, ˜zl 1:T ) TX t=2 DKL Q(Zt | ˜zl 1:t 1, y, x) k P(Zt | ˜zl 1:t 1, y) # DKL (Q(Z1 | x) k P(Z1)) . (15) The model can be trained using stochastic gradient descent. In all of our experiments, we used only a single sample from Q(Z1:T | y, x) for parameter learning. Training details, hyperparameter settings, and the overall model architecture are specified in Appendix B. The code is available at https://github.com/emansim/text2image. 3.4 GENERATING IMAGES FROM CAPTIONS During the image generation step, we discard the inference network and instead sample from the prior distribution. Due to the blurriness of samples generated by the DRAW model, we perform an additional post processing step where we use an adversarial network trained on residuals of a Lapla- Published as a conference paper at ICLR 2016 A stop sign is flying in blue skies. A herd of elephants fly- ing in the blue skies. A toilet seat sits open in the grass field. A person skiing on sand clad vast desert. Figure 1: Examples of generated images based on captions that describe novel scene compositions that are highly unlikely to occur in real life. The captions describe a common object doing unusual things or set in a strange location.
  • 25. ガウス過程 ¤ ガウス過程とは・・・ ¤ 関数の確率分布 ¤ D次元の⼊⼒ベクトルのデータセット X = {x_1, …, x_n} に対する関数の出⼒ベクトル f = [f(x_1), …, f(x_n)]^T の同時分布が常にガウス分布 f ~ N(μ, K_{X,X}) になる ¤ 平均ベクトル μ_i = μ(x_i) と共分散⾏列 (K_{X,X})_{ij} = k(x_i, x_j) で完全に記述される(kはカーネルで,ハイパーパラメータを持つ) ¤ 加法的ガウスノイズ y(x) | f(x) ~ N(y(x); f(x), σ^2) を仮定すると,テスト点 X_* における予測分布は f_* | X_*, X, y ~ N(E[f_*], cov(f_*)),E[f_*] = μ_{X_*} + K_{X_*,X}[K_{X,X} + σ^2 I]^{-1} y,cov(f_*) = K_{X_*,X_*} − K_{X_*,X}[K_{X,X} + σ^2 I]^{-1} K_{X,X_*}(K_{X_*,X} はテスト点と訓練点の間の n_* × n の共分散⾏列) [図:SEカーネルのGP事前分布からの関数サンプルと,5点のノイズなし観測で条件づけた事後分布からのサンプル]
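上の予測式 E[f_*],cov(f_*) をそのまま書き下した最⼩スケッチを⽰す.カーネルはRBF(指数⼆次)カーネルを仮定した例で,平均関数は0としている.

```python
# GP回帰の予測分布(平均と共分散)を計算する最小スケッチ.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # (K_{A,B})_{ij} = variance * exp(-||a_i - b_j||^2 / (2 * lengthscale^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, X_star, noise_var=0.1):
    n = len(X)
    K = rbf_kernel(X, X) + noise_var * np.eye(n)   # K_{X,X} + σ^2 I
    K_s = rbf_kernel(X_star, X)                    # K_{X_*,X}
    K_ss = rbf_kernel(X_star, X_star)              # K_{X_*,X_*}
    mean = K_s @ np.linalg.solve(K, y)             # 平均関数 μ = 0 を仮定
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov

# 使⽤例(1次元⼊⼒)
X = np.linspace(-5, 5, 8)[:, None]
y = np.sin(X[:, 0])
mean, cov = gp_predict(X, y, np.linspace(-5, 5, 50)[:, None])
```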
  • 26. 深層ガウス過程 ¤ より複雑なサンプルを表現するため,process compositionによって多層化する[Lawrence & Moore, 07] ➡ 深層ガウス過程(deep GP) ¤ 以下のように,多層グラフィカルモデルを考える [Figure 1: A deep Gaussian process with two hidden layers] ¤ ここでは𝑌がデータ,𝑋が潜在変数. ¤ データ Y ∈ R^{N×D} に対し,L層の潜在変数 {X_l}_{l=1}^{L},X_l ∈ R^{N×Q_l} を次の⼊れ⼦のノイズモデルで定義する:Y = f_1(X_1) + ε_1, ε_1 ~ N(0, σ_1^2 I),X_{l−1} = f_l(X_l) + ε_l, ε_l ~ N(0, σ_l^2 I) (l = 2…L).各 f_l はガウス過程 f_l(x) ~ GP(0, k_l(x, x′)) から⽣成される. ¤ 教師なしの場合,最上位層には弱い事前分布 X_L ~ N(0, I) を置く.教師ありの場合は最上位層の⼊⼒が観測される. ¤ 潜在変数が階層的に変換されていくため,標準的なGPより表現⼒が⾼く,⾮定常性や複雑な⾮パラメトリックな「特徴」を表現できる.
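上の⼊れ⼦ノイズモデルの雰囲気をつかむため,2層のdeep GP事前分布からサンプルを⽣成する最⼩スケッチを⽰す.カーネルや次元は説明⽤の仮のもので,各層の関数を⼊⼒点上で有限次元化してサンプルしている.

```python
# 2層のdeep GP事前分布からのサンプリングの最小スケッチ.
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def sample_gp_layer(X_in, noise_var=0.01, out_dim=1):
    # f_l ~ GP(0, k_l) を入力点上でサンプルし,ノイズ ε_l ~ N(0, σ_l^2 I) を加える
    K = rbf(X_in, X_in) + 1e-6 * np.eye(len(X_in))
    f = np.random.multivariate_normal(np.zeros(len(X_in)), K, size=out_dim).T
    return f + np.sqrt(noise_var) * np.random.randn(*f.shape)

N = 100
X2 = np.random.randn(N, 1)            # 最上位層:X_L ~ N(0, I)
X1 = sample_gp_layer(X2, out_dim=2)   # X_1 = f_2(X_2) + ε_2
Y  = sample_gp_layer(X1, out_dim=3)   # Y   = f_1(X_1) + ε_1
```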
  • 28. VAE-DGP ¤ DGPで変分推論する枠組みは提案されている[Damianou & Lawrence 13]が,少ないデータでしか学習できなかった. ¤ 共分散⾏列の逆⾏列の計算や,膨⼤な変分パラメータのため. ¤ DGPの推論をVAEの識別モデル(エンコーダー)と考え,各層の変分パラメータを認識モデル {g_l}(1つ下の層の表現を⼊⼒とするMLP)で与える(back-constraints). ¤ 制約が加わり,パラメータを減らして推論が速くなる. ¤ 従来のDGPより過学習を抑えられる. ➡VAE-DGP Variationally Auto-Encoded Deep Gaussian Processes [Dai+ 15; ICLR 2016] [Figure 3: A deep Gaussian process with three hidden layers and back-constraints]
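back-constraintの考え⽅のイメージだけを⽰す最⼩スケッチを下に⽰す.「各層の変分事後分布の平均を,1つ下の層の表現からMLPで直接出⼒する」という仮定のもとでの書き⽅であり,論⽂のGP部分や厳密な変分下界は省略している.次元や層数も仮の値である.

```python
# back-constraint(認識モデル)のイメージの最小スケッチ.
import torch
import torch.nn as nn

class BackConstraintEncoder(nn.Module):
    def __init__(self, dims=(784, 50, 20, 5)):  # Y → X_1 → X_2 → X_3 の次元(仮の値)
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(d_in, 300), nn.Tanh(), nn.Linear(300, d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])
        ])

    def forward(self, y):
        mus = []               # 各層の変分平均 μ_1, ..., μ_L
        h = y
        for g in self.layers:  # g_1(y), g_2(μ_1), g_3(μ_2), ...
            h = g(h)
            mus.append(h)
        return mus

mus = BackConstraintEncoder()(torch.randn(8, 784))
```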
  • 29. 実験:⽋損補間 ¤ テストデータの⽋損補間(Frey-Yale,SVHN) ¤ 灰⾊の部分が⽋損領域.各例は左から⼊⼒画像,補間結果,元画像(右端が元画像) [Figure 5: (a) Frey-Yaleで学習したVAE-DGPからのサンプル (b) Frey-Yaleテストセットの補間 (c) SVHNテストセットの補間]
  • 30. 実験:精度評価 ¤ 対数尤度(MNISTテストデータ):DBN 138±2,Stacked CAE 121±1.6,Deep GSN 214±1.1,Adversarial nets 225±2,GMMN+AE 282±2,VAE-DGP (5) 301.67,VAE-DGP (10-50) 674.86,VAE-DGP (5-20-50) 723.65 ¤ 教師あり学習(回帰)のMSE: ¤ The Abalone dataset:VAE-DGP 825.31±64.35,GP 888.96±78.22,Lin. Reg. 917.31±53.76 ¤ The Creep dataset:VAE-DGP 575.39±29.10,GP 602.11±29.59,Lin. Reg. 1865.76±23.36
  • 31. 変分推論における平均場近似 ¤ VAEでは近似分布は𝑞(𝑧|𝑥)と考えてきた ¤ 𝑞(𝑧|𝑥)はニューラルネットワークで表現 ¤ 変分推論では KL(𝑞(𝑧; 𝜆)||𝑝(𝑧|𝑥)) を最⼩化する.これは下界(ELBO) log 𝑝(𝑥) ≥ 𝐸_{𝑞(𝑧;𝜆)}[log 𝑝(𝑥|𝑧)] − KL(𝑞(𝑧; 𝜆)||𝑝(𝑧)) の最⼤化と等価(再掲:log 𝑝(𝑥) = 𝐿(𝑥) + 𝐾𝐿(𝑞(𝑧|𝑥)||𝑝(𝑧|𝑥)))[Lawrence, N. (2000). Variational Inference in Probabilistic Models. PhD thesis.] ¤ ⼀般的に近似分布は平均場近似 𝑞(𝑧; 𝜆) = ∏_i 𝑞(𝑧_i; 𝜆_i) によって近似される. ¤ もっとリッチな近似分布を考えることもできる ¤ パラメータ𝜆を確率変数として事前分布を考える(階層変分モデル)
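平均場近似(対⾓ガウス)のもとでELBOを1サンプルで⾒積もる最⼩スケッチを⽰す.デコーダーはベルヌーイのlogitsを返すネットワークという仮定である.

```python
# 平均場近似 q(z;λ) = Π_i N(z_i; μ_i, σ_i^2) のもとでの
# ELBO = E_q[log p(x|z)] − KL(q(z;λ)||p(z)) の1サンプル近似.
import torch
from torch.distributions import Normal, kl_divergence

def mean_field_elbo(x, mu, log_sigma, decoder):
    q = Normal(mu, log_sigma.exp())                         # 平均場近似分布(次元ごとに独立)
    p = Normal(torch.zeros_like(mu), torch.ones_like(mu))   # 事前分布 p(z) = N(0, I)
    z = q.rsample()                                         # reparameterization trick
    logits = decoder(z)
    log_px = -torch.nn.functional.binary_cross_entropy_with_logits(
        logits, x, reduction="sum")                         # E_q[log p(x|z)] の1サンプル近似
    kl = kl_divergence(q, p).sum()                          # KL(q(z;λ)||p(z)) は解析的に計算
    return log_px - kl
```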
  • 32. 変分ガウス過程 ¤ The Variational Gaussian Process [Tran+ 15; ICLR 2016] ¤ とても強⼒な変分モデルを提案 ¤ 𝐷を変分データ(パラメータ)とし,次のような𝑧の⽣成過程を考える. ¤ 潜在⼊⼒ 𝜉 ∈ R^c を 𝜉 ~ 𝑁(0, I) から⽣成 ¤ ⾮線形写像 𝑓: R^c → R^d を,𝐷によって条件づけられたガウス過程 𝑓 ~ ∏_{i=1}^{d} GP(0, K_{𝜉𝜉}) | 𝐷 から⽣成 ¤ 潜在変数 𝑧 = (𝑧_1, …, 𝑧_d) ~ ∏_{i=1}^{d} 𝑞(𝑓_i(𝜉)) を⽣成
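⽣成過程1〜3をそのまま書き下した最⼩スケッチを⽰す.変分データ D = (S, T),カーネル,最後の 𝑞(𝑧_i | 𝑓_i(𝜉)) をガウスとするのは説明⽤の仮定で,ステップ2ではGP事後分布の平均で条件づけを近似している(論⽂の実装そのものではない).

```python
# VGPの生成過程(ξ → f|D → z)の最小スケッチ.
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def sample_vgp(S, T, c=2, d=3, sigma=0.1):
    xi = np.random.randn(1, c)                 # 1. ξ ~ N(0, I)
    K_ss = rbf(S, S) + 1e-6 * np.eye(len(S))
    K_xs = rbf(xi, S)
    A = np.linalg.solve(K_ss, K_xs.T).T        # GP回帰と同じ条件づけ
    f_xi = (A @ T).ravel()                     # 2. f_i(ξ) | D(事後平均で近似)
    return f_xi + sigma * np.random.randn(d)   # 3. z_i ~ q(f_i(ξ))(ガウスと仮定)

S = np.random.randn(10, 2)   # 変分データの入力(仮)
T = np.random.randn(10, 3)   # 変分データの出力(仮)
z = sample_vgp(S, T)
```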
  • 33. 変分ガウス過程の尤度 ¤ 前ページの⽣成過程から,潜在変数𝑧の周辺尤度は 𝑞_VGP(𝑧; 𝜃, 𝐷) = ∬ [∏_{i=1}^{d} 𝑞(𝑧_i | 𝑓_i(𝜉))] [∏_{i=1}^{d} GP(𝑓_i; 0, K_{𝜉𝜉}) | 𝐷] 𝑁(𝜉; 0, I) d𝑓 d𝜉(𝜃はカーネルのハイパーパラメータ) ¤ このようにしてモデル化した近似分布は,𝑝(𝑧|𝑥)がどんな分布であろうと(潜在変数が有限個で分位関数が連続であれば),KL(𝑞(𝑧; 𝜃, 𝐷)||𝑝(𝑧|𝑥)) = 0 とするパラメータ(𝜃, 𝐷)が存在する(Universal Approximation) ¤ つまり,これまでのどの⼿法よりも限りなく柔軟な変分モデルとなる. 再掲:log 𝑝(𝑥) = 𝐿(𝑥) + 𝐾𝐿(𝑞(𝑧|𝑥)||𝑝(𝑧|𝑥))
  • 34. 下界 ¤ 学習では次の下界を最⼤化する:L̃ = 𝐸_{𝑞_VGP}[log 𝑝(𝑥|𝑧)](再構成誤差)− 𝐸_{𝑞_VGP}[ KL(𝑞(𝑧|𝑓(𝜉))||𝑝(𝑧))(正規化項)+ KL(𝑞(𝜉, 𝑓)||𝑟(𝜉, 𝑓|𝑧))(補助モデルとのKL)] ¤ ここで𝑟(𝜉, 𝑓|𝑧)は補助モデル.事後潜在変数の空間では KL(𝑞||𝑝) を最⼩化して変分モデルを学習し,変分潜在変数の空間では KL(𝑞||𝑟) を最⼩化して補助モデルを学習する. ¤ イメージとしては次のような感じ ¤ 近似モデルでxからzを⽣成:𝑥_n ↦ 𝑞(𝑧_n | 𝑥_n) ¤ 補助モデルでxとzから写像と潜在変数を⽣成:(𝑥_n, 𝑧_n) ↦ 𝑟(𝜉_n, 𝑓_n | 𝑥_n, 𝑧_n) ¤ これはVAEの下界(Kingma & Welling, 2014)を⼊れ⼦にした形であり,確率的最適化の際に勾配の分散を下げられる.
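下界 L̃ のかたちだけを追うための最⼩スケッチを⽰す.各項の対数密度(または解析的なKL)を返す関数が既にあるという仮定で,1サンプル推定を組み⽴てるだけのものであり,VGP本体の実装ではない.

```python
# 下界 L̃ の1サンプル推定を組み立てる最小スケッチ.
def vgp_elbo_single_sample(x, z, xi, f_xi,
                           log_px_given_z,      # log p(x|z) を返す関数
                           kl_qz_pz,            # KL(q(z|f(ξ)) || p(z)) を返す関数(解析的)
                           log_q_xi_f,          # log q(ξ, f) を返す関数
                           log_r_xi_f_given_z): # log r(ξ, f | z) を返す関数(補助モデル)
    recon = log_px_given_z(x, z)                 # 再構成誤差
    reg = kl_qz_pz(f_xi)                         # 正規化項
    # KL(q(ξ,f) || r(ξ,f|z)) を,(ξ, f) ~ q からの1サンプルの対数比で近似
    aux = log_q_xi_f(xi, f_xi) - log_r_xi_f_given_z(xi, f_xi, z)
    return recon - (reg + aux)
```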
  • 35. 実験:対数尤度 ¤ binarized MNISTの負の予測対数尤度が前⼈未到の70台に突⼊ ¤ ⽣成部分のモデルをDRAW,近似分布をVGPとしたモデル(DRAW + VGP: 79.88)が⼀番良い ¤ Table 1(−log p(x),括弧内は上界):DLGM + VAE [1] 86.76 / DLGM + HVI (8 leapfrog steps) [2] 85.51 (≤ 88.30) / DLGM + NF (k = 80) [3] 85.10 / EoNADE-5 2hl (128 orderings) [4] 84.68 / DBN 2hl [5] 84.55 / DARN 1hl [6] 84.13 / Convolutional VAE + HVI [2] 81.94 (≤ 83.49) / DLGM 2hl + IWAE (k = 50) [1] 82.90 / DRAW [7] 80.97 / DLGM 1hl + VGP 84.79 / DLGM 2hl + VGP 81.32 / DRAW + VGP 79.88 ¤ [1] Burda et al., 2016 [2] Salimans et al., 2015 [3] Rezende & Mohamed, 2015 [4] Raiko et al., 2014 [5] Murray & Salakhutdinov, 2009 [6] Gregor et al., 2014 [7] Gregor et al., 2015
  • 36. まとめ ¤ 今回はICLR2016のVAE研究を中⼼に ¤ 変分推論とVAE ¤ 条件付きVAEと半教師あり学習 ¤ ガウス過程とVAE についてまとめた ¤ 感想 ¤ ICLRの傾向がなんとなくわかった ¤ まとめるのが難しかった