Multimodal Learning with Deep Generative Models

¤
¤ autoencoder [Ngiam+ 11], DBM [Srivastava+ 12]
¤ deep Boltzmann machine
¤ variational autoencoder (VAE)
→ VAE
• VAE (MVAE)
•
• MVAE
•

¤ autoencoder [Ngiam+ 11]
¤ AE
¤ 2
¤ DBM [Srivastava+ 12]
¤ deep Boltzmann machine
¤ AE
[Figure 2 panels: (a) Audio RBM, (b) Video RBM, (c) Shallow Bimodal RBM, (d) Bimodal DBN]
Figure 2: RBM Pretraining Models. We train RBMs for (a) audio and (b) video separately as
a baseline. The shallow model (c) is limited and we find that this model is unable to capture
correlations across the modalities. The bimodal deep belief network (DBN) model (d) is trained
in a greedy layer-wise fashion by first training models (a) & (b). We later “unroll” the deep
model (d) to train the deep autoencoder models presented in Figure 3.
[Figure 3 panels: (a) Video-Only Deep Autoencoder, (b) Bimodal Deep Autoencoder]
Figure 3: Deep Autoencoder Models. A “video-only” model is shown in (a) where the model
learns to reconstruct both modalities given only video as the input. A similar model can be
drawn for the “audio-only” setting. We train the (b) bimodal deep autoencoder in a denoising
fashion, using an augmented dataset with examples that require the network to reconstruct both
modalities given only one. Both models are pre-trained using sparse RBMs (Figure 2d). Since
we use a sigmoid transfer function in the deep network, we can initialize the network using the
conditional probability distributions p(h|v) and p(v|h) of the learned RBM.
Therefore, we consider greedily training a RBM over the pre-trained layers for each modality, as motivated by deep learning methods (Figure 2d). In particular, the posteriors (Equation 2) of the first layer hidden variables are used as the training data for the new layer. By representing the data through learned first layer representations, it can be easier for the model to learn higher-order correlations across modalities. Informally, the first layer representations correspond to phonemes and visemes and the second layer models the relationships between them. Figure 4 shows […]
[…] it is possible for the model to find representations such that some hidden units are tuned only for audio while others are tuned only for video. Second, the models are clumsy to use in a cross modality learning setting where only one modality is present during supervised training and testing. With only a single modality present, one would need to integrate out the unobserved visible variables to perform inference.
Thus, we propose a deep autoencoder that resolves both issues. We first consider the cross modality learning setting where both modalities are present during […]
[Figure 2 panels: Image-specific DBM, Text-specific DBM, Multimodal DBM]
Figure 2: Left: Image-specific two-layer DBM that uses a Gaussian model to model the distribution over real-valued image features. Middle: Text-specific two-layer DBM that uses a Replicated Softmax model to model its distribution over the word count vectors. Right: A Multimodal DBM that models the joint distribution over image and text inputs.
We illustrate the construction of a multimodal DBM using an image-text bi-modal DBM as our running example. Let $v_m \in \mathbb{R}^D$ denote an image input and $v_t \in \mathbb{N}^K$ denote a text input. Consider modeling each data modality using separate two-layer DBMs (Fig. 2). The image-specific two-layer DBM assigns probability to $v_m$ that is given by (ignoring bias terms on the hidden units for clarity):
$P(v_m; \theta) = \sum_{h^{(1)}, h^{(2)}} P(v_m, h^{(1)}, h^{(2)}; \theta) =$ […] (4)
→ deep Boltzmann machine

Variational Autoencoder
¤ Variational autoencoder [Kingma+ 13]
¤ $p(x)$
¤ $p_\theta(x|z)$ $q_\phi(z|x)$
¤ $\mathcal{L}(x)$
¤ NN
$\mathcal{L}(x) = -D_{KL}[q_\phi(z|x) \,\|\, p(z)] + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$
In the variational inference framework, we introduce an approximate posterior $q_\phi(z|x)$ over the latent variable ($\phi$ denotes its parameters) and train the model to maximize the following lower bound on the marginal log-likelihood.
$$
\begin{aligned}
\log p(x) &= \log \int p_\theta(x, z)\,dz \\
&= \log \int q_\phi(z|x) \frac{p_\theta(x, z)}{q_\phi(z|x)}\,dz \\
&\geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)}\,dz \\
&= -D_{KL}(q_\phi(z|x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \\
&= \mathcal{L}(x) \qquad (2)
\end{aligned}
$$
In this paper, $q_\phi(z|x)$ is called the encoder and $p_\theta(x|z)$ the decoder.
To optimize the lower bound $\mathcal{L}(x)$ with respect to the parameters $\theta, \phi$, the VAE uses a method called stochastic gradient variational Bayes (SGVB) [Kingma 13] or stochastic backpropagation [Rezende 14]. When $z \sim q_\phi(z|x)$ is a Gaussian $\mathcal{N}(z|\mu, \sigma^2)$ (with $\phi = \{\mu, \sigma^2\}$), $z$ can be reparameterized as $z = \mu + \sigma\epsilon$ (with $\epsilon \sim \mathcal{N}(0, 1)$). The expectation $\mathbb{E}_{q_\phi(z)}[f_\theta(z)]$ can then be replaced by $\mathbb{E}_{\mathcal{N}(\epsilon; 0, 1)}[f_\theta(\mu + \sigma\epsilon)]$ and estimated by Monte Carlo as $\frac{1}{L}\sum_{l=1}^{L} f_\theta(\mu + \sigma\epsilon^{(l)})$ (with $\epsilon^{(l)} \sim \mathcal{N}(0, 1)$). The estimator of the lower bound in Eq. (2) is therefore
$$
\hat{\mathcal{L}}(x) = -D_{KL}(q_\phi(z|x) \,\|\, p(z)) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta(x|z^{(l)}), \quad z^{(l)} = \mu + \sigma \odot \epsilon^{(l)},\ \epsilon^{(l)} \sim \mathcal{N}(0, I) \qquad (3)
$$
The first term of Eq. (3) is a regularization term and the second term is the negative reconstruction error. This lower bound is used as the objective function and maximized with respect to the parameters $\phi, \theta$.
By using SGVB, the VAE obtains a lower-variance estimator than other estimation methods. On the other hand, it requires choosing an approximate posterior that is both flexible and easy to compute; normalizing flows [Rezende 15] and importance weighted autoencoders [Burda 15] have been proposed to address this.
Other models that extend the VAE include the conditional VAE [Kingma 14a], the deep Kalman filter [Krishnan 15], and the variational Gaussian process [Dai 15].
→ VAE
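The reparameterized estimator described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names, the toy Bernoulli decoder, and the diagonal-Gaussian encoder with a standard normal prior (for which the KL term has a closed form) are all assumptions made for the example.

```python
import math
import random


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, I)), summed over dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def toy_bernoulli_log_likelihood(x, z):
    """Hypothetical decoder: element i of x is 1 with probability sigmoid(z[i])."""
    return sum(-softplus(-zi) if xi == 1 else -softplus(zi) for xi, zi in zip(x, z))


def elbo_estimate(x, mu, sigma, log_likelihood, num_samples, rng):
    """Monte Carlo estimate: -KL + (1/L) * sum_l log p(x | z_l)."""
    recon = 0.0
    for _ in range(num_samples):
        z = reparameterize(mu, sigma, rng)
        recon += log_likelihood(x, z)
    return -kl_to_standard_normal(mu, sigma) + recon / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print(elbo_estimate([1, 0], [0.5, -0.5], [0.3, 0.3],
                        toy_bernoulli_log_likelihood, 100, rng))
```

In a real VAE, `mu` and `sigma` would be outputs of the encoder network and the gradient would flow through `reparameterize`; here they are fixed numbers so the arithmetic stays visible.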

¤ VAE MVAE
¤ $x, w$ 2 $p(x, w)$
¤ $x$ $w$
¤ $p(x, w)$ $\mathcal{L}(x, w)$
$q_\phi(z|x, w)$ $p_{\theta_x}(x|z)$ $p_{\theta_w}(w|z)$
$\mathcal{L}(x, w) = -D_{KL}[q_\phi(z|x, w) \,\|\, p(z)] + \mathbb{E}_{q_\phi(z|x, w)}[\log p_{\theta_x}(x|z) + \log p_{\theta_w}(w|z)]$
3. Proposed model
Next, we describe the multimodal VAE (MVAE) proposed in this paper.
Let $\{X\}$ and $\{W\}$ be datasets of two different modalities and let $z$ be the latent variable. Their generative process is
$z \sim p(z), \quad x, w \sim p_\theta(x, w|z)$ (4)
With $x$ and $w$ as observed variables, the two modalities are conditionally independent given $z$:
$p_\theta(x, w|z) = p_{\theta_x}(x|z)\, p_{\theta_w}(w|z)$ (5)
where $\theta_x$ and $\theta_w$ are the model parameters of the respective distributions.
Figure 1: Graphical model of the MVAE
Figure 1 shows this process as a graphical model. A graphical model represents the dependencies among the random variables of a generative model as a directed graph; white circles denote latent variables and black circles denote observed variables.
With the approximate posterior $q_\phi(z|x, w)$, the variational lower bound on the log-likelihood $\log p(x, w)$ is
$$
\begin{aligned}
\mathcal{L}(x, w) &= \int q_\phi(z|x, w) \log \frac{p_\theta(x, w, z)}{q_\phi(z|x, w)}\,dz \\
&= -\int q_\phi(z|x, w) \log \frac{q_\phi(z|x, w)}{p(z)}\,dz + \int q_\phi(z|x, w)\,[\log p_{\theta_x}(x|z) + \log p_{\theta_w}(w|z)]\,dz \\
&= -D_{KL}(q_\phi(z|x, w) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z|x, w)}[\log p_{\theta_x}(x|z)] + \mathbb{E}_{q_\phi(z|x, w)}[\log p_{\theta_w}(w|z)]
\end{aligned}
$$
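The two-modality lower bound can be estimated the same way as the single-modality one: the KL term is unchanged and the reconstruction term is the sum of the two decoders' log-likelihoods. The sketch below assumes Bernoulli decoders for both modalities and a diagonal-Gaussian joint encoder; `decode_x`, `decode_w`, and the other names are hypothetical placeholders for trained networks.

```python
import math
import random


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def bernoulli_log_prob(target, logits):
    """log p(target | logits) for independent Bernoulli variables."""
    return sum(-softplus(-l) if t == 1 else -softplus(l) for t, l in zip(target, logits))


def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, I))."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))


def mvae_lower_bound(x, w, mu, sigma, decode_x, decode_w, num_samples, rng):
    """Estimate -KL(q(z|x,w) || p(z)) + E_q[log p(x|z) + log p(w|z)].

    mu and sigma stand in for the output of the joint encoder q(z|x,w).
    """
    recon = 0.0
    for _ in range(num_samples):
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        recon += bernoulli_log_prob(x, decode_x(z)) + bernoulli_log_prob(w, decode_w(z))
    return -kl_to_standard_normal(mu, sigma) + recon / num_samples


if __name__ == "__main__":
    # Toy decoders mapping a 2-D latent code to Bernoulli logits.
    decode_x = lambda z: [z[0], -z[0], z[1]]
    decode_w = lambda z: [z[0] + z[1]]
    bound = mvae_lower_bound([1, 0, 1], [1], [0.4, 0.2], [0.5, 0.5],
                             decode_x, decode_w, 100, random.Random(0))
    print(bound)
```

Because both log-likelihood terms are evaluated at the same sample of $z$, the shared latent code is what ties the two modalities together.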

¤
¤
¤ $p(x, w)$ $p(x)$ $p(x|w)$
¤ 2
[Figure 2: Network architecture of the MVAE]
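When […] is a Bernoulli distribution $B(w|\mu_{\theta_w})$, the Bernoulli parameter $\mu_{\theta_w}$ is obtained as follows.
$y(z) = \mathrm{MLP}_{\theta_w}(z)$
$\mu_{\theta_w} = \mathrm{Linear}(y(z))$ (9)
When the decoder is a Gaussian distribution, the computation is the same as Eq. (8) with a single input to Linear.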
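As a rough illustration of Eq. (9), the sketch below computes a Bernoulli mean from a latent code with a one-hidden-layer MLP followed by a linear layer. Two things are assumptions of this example rather than statements from the text: the MLP has a single ReLU hidden layer, and a sigmoid is applied after the linear layer so that the result lies in (0, 1). The weights are made-up numbers.

```python
import math


def linear(weights, bias, inputs):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def bernoulli_decoder_mean(z, hidden, output):
    """mu = sigmoid(Linear(y(z))) with y(z) = ReLU(Linear(z))."""
    y = relu(linear(hidden["W"], hidden["b"], z))   # y(z) = MLP(z)
    logits = linear(output["W"], output["b"], y)    # Linear(y(z))
    return [sigmoid(l) for l in logits]             # assumed squashing to (0, 1)


if __name__ == "__main__":
    hidden = {"W": [[1.0, 0.0], [0.0, -1.0]], "b": [0.0, 0.0]}
    output = {"W": [[1.0, 1.0]], "b": [0.0]}
    print(bernoulli_decoder_mean([2.0, 1.0], hidden, output))
```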
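As a study of multiple domains using the VAE, Louizos et al. proposed the variational fair autoencoder (VFAE) [Louizos 15]. In that work, the domain variable and the latent variable […]
[…] the layer that combines them has 2048 units. The decoder architectures are 2048-1024-1024-3857 for $p_{\theta_x}(x|z)$ and 2048-1024-1024-2000 for $p_{\theta_w}(w|z)$. Each layer uses the rectified linear unit as its activation function, and Adam [Kingma 14b] is used as the optimization algorithm.
To confirm that appropriate feature representations are learned, we train the model on the training set, sample the latent variable $z$ from the encoder $q_\phi(z|x, w)$ for the given test examples, and learn a mapping $f: z \to y$, with the category information as the output $y$, using logistic regression, a linear classifier. The logistic regression model trained on the training set is evaluated on the test set […]
$q_\phi(z|x, w)$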
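The linear evaluation described above can be sketched as a plain logistic regression fitted on latent codes. The example below is a toy, assuming a binary category and hand-written 2-D codes in place of samples from $q_\phi(z|x, w)$; the function names, learning rate, and epoch count are arbitrary choices made for the illustration.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def train_logistic_regression(codes, labels, lr=0.5, epochs=200):
    """Fit f: z -> y by full-batch gradient descent on the logistic loss."""
    dim = len(codes[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(codes)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for z, y in zip(codes, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, z)) + bias) - y
            for i, v in enumerate(z):
                grad_w[i] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(weights, bias, codes, labels):
    """Fraction of codes whose predicted class matches the label."""
    correct = 0
    for z, y in zip(codes, labels):
        predicted = 1 if sum(w * v for w, v in zip(weights, z)) + bias > 0 else 0
        correct += int(predicted == y)
    return correct / len(labels)


if __name__ == "__main__":
    # Stand-ins for latent codes sampled from the encoder.
    train_codes = [[-2.0, 0.1], [-1.5, -0.3], [1.8, 0.2], [2.2, -0.1]]
    train_labels = [0, 0, 1, 1]
    test_codes = [[-1.0, 0.0], [1.2, 0.3]]
    test_labels = [0, 1]
    w, b = train_logistic_regression(train_codes, train_labels)
    print(accuracy(w, b, test_codes, test_labels))
```

A linear classifier is a deliberately weak probe: if it separates the categories well, the separation is attributable to the learned representation rather than to the classifier.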

¤ Conditional VAE (CVAE) [Kingma+ 2014][Sohn+ 2015]
¤ $p(x|y)$
¤
¤ Conditional multimodal autoencoder [Pandey 16]
¤ $p(x|y)$
¤ CVAE
¤
¤
¤ $y$
¤ $x$ $y$ $p(y|x)$ 2
Since the conditional log-likelihood of the proposed model is intractable, we use variational methods for training the model, whereby the posterior of the latent variables given the faces and the attributes is approximated by a tractable distribution. While variational methods have long been a popular tool for training graphical models, their usage for deep learning became popular after the reparametrization trick of [15, 21, 24]. Prior to that, mean-field approximation has been used for training deep Boltzmann machines (DBM) [22]. However, the training of DBM involves solving a variational approximation problem for every instance in the training data. On the other hand, reparametrizing the posterior allows one to solve a single parametrized variational approximation problem for all instances in the training data simultaneously.
The proposed model is referred to as conditional multimodal autoencoder (CMMA). We use CMMA for the task of generating faces from attributes, and to modify faces in the training data, by modifying the corresponding attributes. The dataset used is the cropped Labelled Faces in the Wild dataset (LFW) [11]. We also compare the qualitative and quantitative performance of CMMA against […]
Figure 1: A graphical representation of CMMA
[…] we wish to generate and the attributes correspond to the modality that we wish to condition on.
A formal description of the problem is as follows. We are given an i.i.d. sequence of N datapoints $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$. For a fixed datapoint $(x, y)$, let $x$ be the modality that we wish to generate and $y$ be the modality that we wish to condition on. We assume that $x$ is generated by first sampling a real-valued latent representation $z$ from the distribution $p(z|y)$, and then sampling $x$ from the distribution $p(x|z)$. The graphical representation of the model is given in Figure 1. Furthermore, we assume that the conditional distribution of the latent representation $z$ given $y$ and the distribution of $x$ given $z$ are […]

¤
¤ MNIST
¤ CelebA
¤
1. MVAE
2.
¤ $x \to x$ $p(x)$
¤ $w \to x$ $p(x|w)$

1 MNIST
¤
¤ $x$
¤ $w$
¤ 50,000 10,000
¤
¤ $p_{\theta_x}(x|z)$ $p_{\theta_w}(w|z)$
¤

¤ VAE MVAE
¤ t-SNE
¤ $w$
¤ VAE MVAE
¤ $w$
VAE MVAE

$x \to x$
¤ $p(x)$
¤
¤ MVAE
VAE -82.81
MVAE -83.40

$w \to x$
¤ $p(x|w)$
¤ $p(x|w) = \int p(x|z)\, q(z|w)\, dz$ MVAE $q(z|w)$
¤ $q(z|x, w)$ $x$ 0 $q(z|w)$
¤ $x \sim p(x|w)$
¤
¤ $x$ 0 $z \sim q(z|x, w)$
¤ $x$
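The integral $p(x|w) = \int p(x|z)\, q(z|w)\, dz$ on this slide can be approximated by sampling $z$ from $q(z|w)$ and averaging $p(x|z)$. The sketch below does the averaging in log space for numerical stability. It assumes $q(z|w)$ is a diagonal Gaussian whose mean and standard deviation are already available; how the MVAE obtains them is not shown here, and all names are hypothetical.

```python
import math
import random


def log_mean_exp(values):
    """Stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def conditional_log_likelihood(x, mu_w, sigma_w, log_p_x_given_z, num_samples, rng):
    """Estimate log p(x|w) as log (1/L) sum_l p(x | z_l), z_l ~ q(z|w)."""
    log_probs = []
    for _ in range(num_samples):
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_w, sigma_w)]
        log_probs.append(log_p_x_given_z(x, z))
    return log_mean_exp(log_probs)


def toy_log_p_x_given_z(x, z):
    """Hypothetical decoder: x[i] is 1 with probability sigmoid(z[i])."""
    total = 0.0
    for xi, zi in zip(x, z):
        log_p_one = -math.log1p(math.exp(-zi))   # log sigmoid(z)
        log_p_zero = -math.log1p(math.exp(zi))   # log (1 - sigmoid(z))
        total += log_p_one if xi == 1 else log_p_zero
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    print(conditional_log_likelihood([1, 0], [1.0, -1.0], [0.5, 0.5],
                                     toy_log_p_x_given_z, 100, rng))
```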

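¤ $q_\phi(z|x, w)$ $q_{\phi_w}(z|w)$ $q_{\phi_x}(z|x)$
¤ $\mathcal{L}(x, w)$ $q_\phi(z|x, w)$ $q_{\phi_w}(z|w)$ $q_{\phi_x}(z|x)$ KL
$\mathcal{L}^*(x, w) = \mathcal{L}(x, w) + \alpha \cdot \left(-D_{KL}[q_\phi(z|x, w) \,\|\, q_{\phi_x}(z|x)\, q_{\phi_w}(z|w)]\right)$
[Figure labels: $z$, $q_\phi(z|x, w)$, $q_{\phi_w}(z|w)$]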
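The KL penalty added to the lower bound on this slide can be computed in closed form when every encoder is a diagonal Gaussian. The sketch below is one possible reading, not a confirmed reconstruction of the slide: it assumes the penalty is the sum of the KL divergences from the joint encoder $q_\phi(z|x, w)$ to each single-modality encoder, weighted by $\alpha$. Function names and the example numbers are hypothetical.

```python
import math


def kl_diag_gaussians(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form D_KL(N(mu_a, sigma_a^2) || N(mu_b, sigma_b^2)), diagonal case."""
    total = 0.0
    for ma, sa, mb, sb in zip(mu_a, sigma_a, mu_b, sigma_b):
        total += math.log(sb / sa) + (sa * sa + (ma - mb) ** 2) / (2.0 * sb * sb) - 0.5
    return total


def regularized_bound(lower_bound, joint, enc_x, enc_w, alpha):
    """L*(x, w) = L(x, w) - alpha * (KL(joint || enc_x) + KL(joint || enc_w)).

    Each encoder argument is a (mu, sigma) pair. Splitting the penalty into
    two KL terms is an assumption of this sketch.
    """
    penalty = (kl_diag_gaussians(joint[0], joint[1], enc_x[0], enc_x[1])
               + kl_diag_gaussians(joint[0], joint[1], enc_w[0], enc_w[1]))
    return lower_bound - alpha * penalty


if __name__ == "__main__":
    joint = ([0.0], [1.0])   # q(z|x,w)
    enc_x = ([1.0], [1.0])   # q(z|x)
    enc_w = ([0.0], [1.0])   # q(z|w)
    print(regularized_bound(-80.0, joint, enc_x, enc_w, alpha=2.0))
```

The penalty pulls the single-modality encoders toward the joint encoder, which is what makes it possible to infer $z$ when only one modality is observed.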

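$x \sim p(x|w)$
¤ $q_{\phi_w}(z|w)$ $\mu$
¤
¤ $z \sim q_{\phi_w}(z|w)$ $\mu + \epsilon\sigma$, $\epsilon \sim \mathcal{N}(0, 1)$
¤
[Figure labels: $z$, $q_{\phi_w}(z|w)$, $\sigma$, $\mu$]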

2 CelebA
¤ CelebA [Yang+ 2015]
¤ 20
¤ 40
¤ -1,1 2 Male, Young, Eyeglasses
¤ $x$ $w$
¤ VAE
¤ VAE-GAN [Larsen+ 2015] VAE generative adversarial net (GAN) [Goodfellow+ 2014] VAE
¤ $x$ GAN MVAE-GAN
¤ MNIST
¤ CNN

¤ CelebA
¤ VAE-GAN
¤ MVAE-GAN
¤ MVAE VAE

$x \to x$
¤ $p(x)$
¤ MVAE
¤ MVAE-GAN
VAE-GAN -375.21
MVAE-GAN -292.69

$w \to x$
¤ $p(x|w)$
¤ Conditional VAE [Kingma+ 2014]-GAN $p(x|w)$
¤ MVAE
¤ MVAE $p(x)$ $p(x|w)$
CVAE-GAN -392.07
MVAE-GAN -258.13

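MVAE-GAN $x \sim p(x|w)$
¤ $q_{\phi_w}(z|w)$ $\mu$
¤ $w$
¤
¤ $z \sim q_{\phi_w}(z|w)$ $\mu + \epsilon\sigma$, $\epsilon \sim \mathcal{N}(0, 1)$
¤
[Figure labels: $z$, $\sigma$, $\mu$, $q_{\phi_w}(z|w)$]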

¤ $x$ $w \sim p(w|x)$
¤ $w$ Smiling=3 $w^*$
¤ z
Male Eyeglasses Smiling
-0.29 -1 -0.98
[Figure labels: $z$, $x$, $w$, $w^*$]

¤ Smiling 3
¤
Smiling=3
Male = 3
Eyeglasses = 1
Young = -5
Mustache = 5

¤
VAE (MVAE)
¤ KL
¤ MVAE-GAN
¤ MVAE
¤ MNIST t-SNE CelebA
¤
¤ $x$ $x \leftrightarrow w$
¤ CelebA VAE CVAE
¤
¤
¤
