Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, Honglak Lee (ICML 2009)
These are my notes from reading and analyzing the paper for a master's-course seminar presentation. The CDBN combines the strengths of CNNs and DBNs to obtain translation invariance and computational scalability, and probabilistic max-pooling makes it possible to construct an undirected, DBM-style model capable of image restoration.
1. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations
Jaehyun Ahn
(jaehyunahn@sogang.ac.kr)
Data Mining Lab
Computer Science Department
Sogang University
Honglak Lee (ICML 2009, 744 citations)
2. Key References
• Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations (Honglak Lee et al., ICML 2009)
• Lecture at ICML 2009 (http://videolectures.net/icml09_lee_cdb/)
• Learning Multiple Layers of Features from Tiny Images (Alex Krizhevsky, 2009)
• To be Bernoulli or to be Gaussian, for a Restricted Boltzmann Machine (Takayoshi Yamashita et al., ICPR 2014)
4. What this paper wants to say
• Taking advantage of Deep Belief Networks through convolutional networks
– Translation invariance
• Max-pooling
– Scalable to realistic image sizes
• Max-pooling
– Hierarchical probabilistic inference by combining
bottom-up and top-down information
• Probabilistic max-pooling
6. Basic Design of Convolutional Networks
What is this?
(Image: http://www.slideshare.net/sogo1127/101-convolutional-neural-networks)
Alternate between “Detection” and “Pooling” layers
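A minimal numpy sketch of one such detection-plus-pooling stage (the filter, sizes, and sigmoid nonlinearity are assumed for illustration, not taken from the paper):

```python
import numpy as np
from scipy.signal import convolve2d

def detection(image, filt, bias):
    """Detection layer: 'valid' convolution followed by a nonlinearity."""
    act = convolve2d(image, filt, mode="valid") + bias
    return 1.0 / (1.0 + np.exp(-act))            # sigmoid detection units

def max_pool(fmap, C=2):
    """Pooling layer: non-overlapping C x C max-pooling."""
    H, W = fmap.shape
    fmap = fmap[: H - H % C, : W - W % C]        # crop to a multiple of C
    blocks = fmap.reshape(H // C, C, W // C, C)
    return blocks.max(axis=(1, 3))

image = np.random.rand(28, 28)
filt = np.random.randn(5, 5)
pooled = max_pool(detection(image, filt, bias=0.0))  # 24x24 -> 12x12
```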
7. Advantages of Convolutional Networks
1. Translation Invariance
• In LeNet (LeCun et al., 1998), max-pooling provides a form of translation invariance: if max-pooling is done over a 2×2 region, 4 possible input configurations produce exactly the same pooled output (see the small numpy check after the image credit).
Image: IEEE 2013, http://www.computer.org/csdl/trans/tp/2013/08/ttp2013081930-abs.html
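As a quick, entirely illustrative check of the 2×2 claim above, all 4 positions of a feature inside one pooling block give the same pooled output:

```python
import numpy as np

def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

outputs = []
for di in range(2):
    for dj in range(2):
        x = np.zeros((4, 4))
        x[di, dj] = 1.0          # same feature at 4 positions inside block (0, 0)
        outputs.append(max_pool_2x2(x))

# All 4 configurations pool to an identical 2x2 output.
assert all(np.array_equal(outputs[0], o) for o in outputs[1:])
```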
8. Advantages of Convolutional Networks
2. Scalable to realistic image sizes
• This paper starts from the basic question: “How can we scale to realistic image sizes (e.g., 200×200 pixels)?”
• Max-pooling shrinks the representation in higher layers (see the size arithmetic below).
Image: IEEE 2013, http://www.computer.org/csdl/trans/tp/2013/08/ttp2013081930-abs.html
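A back-of-the-envelope sketch of that shrinkage for a 200×200 input (the filter and pooling sizes are assumed here, not the paper's exact choices):

```python
# Each stage: 'valid' convolution with an NW x NW filter, then C x C pooling.
size = 200
for layer, (filt, pool) in enumerate([(10, 2), (10, 2)], start=1):
    size = (size - filt + 1) // pool
    print(f"layer {layer}: {size}x{size}")
# layer 1: 95x95, layer 2: 43x43 -- higher layers see far fewer units.
```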
9. Key Idea on top of the advantages of Convolutional Networks
3. Hierarchical probabilistic inference: probabilistic max-pooling
• Max-pooling is deterministic and feed-forward only. However, this paper gives max-pooling a probabilistic semantics, which makes it possible to combine bottom-up and top-down information.
11. General Deep Belief Network
Z: partition function (normalizing constant)
If the visible units are binary-valued, the model is a standard binary RBM;
if the visible units are real-valued, the visible units are Gaussian with diagonal covariance.
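The energy equations on this slide were images in the original deck; the forms below follow Lee et al. (2009), Section 2 (my reconstruction, not copied from the deck):

```latex
P(v,h) = \frac{1}{Z} \exp(-E(v,h))

% binary visible units
E(v,h) = -\sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i

% real-valued visible units (Gaussian with diagonal covariance)
E(v,h) = \frac{1}{2} \sum_i v_i^2 - \sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i
```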
12. General Deep Belief Network
The visible units are Gaussian with diagonal covariance.
[Learning Multiple Layers of Features from Tiny Images, Krizhevsky 2009, p.13, §1.4.3 Gaussian-Bernoulli RBMs]
14. General Deep Belief Network
[Learning Multiple Layers of Features from Tiny Images, Krizhevsky 2009, p.13, §1.4.3 Gaussian-Bernoulli RBMs]
Given the hidden units, the visible units form a V-dimensional Gaussian distribution with diagonal covariance, with the mean in dimension i given below.
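The equation images can be filled in from Krizhevsky (2009, §1.4.3); with per-dimension scale parameters \sigma_i, the conditional over the visible units is:

```latex
v \mid h \sim \mathcal{N}(\mu, \Sigma), \qquad
\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_V^2), \qquad
\mu_i = c_i + \sigma_i \sum_j w_{ij} h_j
```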
15. General Deep Belief Network
Therefore, we can perform efficient block Gibbs sampling by alternately sampling all of one layer's units in parallel, given the other layer.
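A minimal sketch of this block Gibbs sampling for a fully connected binary RBM (shapes and hyperparameters assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c):
    # All hidden units are conditionally independent given v, and vice
    # versa, so each whole layer is sampled in one block.
    h = (rng.random(b.shape) < sigmoid(v @ W + b)).astype(float)    # h | v
    v = (rng.random(c.shape) < sigmoid(h @ W.T + c)).astype(float)  # v | h
    return v, h

V, H = 36, 16
W = 0.01 * rng.standard_normal((V, H))
b, c = np.zeros(H), np.zeros(V)
v = (rng.random(V) < 0.5).astype(float)
for _ in range(100):             # run the alternating chain
    v, h = gibbs_step(v, W, b, c)
```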
16. General Deep Belief Network
[Learning Multiple Layers of Features from Tiny Images, Krizhevsky 2009, p.13, §1.4.3 Gaussian-Bernoulli RBMs]
17. General Deep Belief Network
P(v,h) = \frac{1}{Z} \exp(-E(v,h)), \qquad Z = \sum_{v,h} \exp(-E(v,h))
To find the model weights W that maximize the likelihood by gradient ascent, we have to compute Z, the partition function.
Carreira-Perpiñán and Hinton (2005) derived the gradient of the data log-likelihood using the chain rule. Since computing the average over the true model distribution is intractable, Hinton et al. (2006) approximate that derivative with contrastive divergence: the infinite chain average is replaced by a small number of Gibbs steps k.
Quoted from: [Representational Power of RBMs and DBNs, Nicolas Le Roux et al., p.3]
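A minimal CD-k update for a binary RBM, matching the description above (learning rate and shapes are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd_k(v0, W, b, c, k=1, lr=0.1):
    """One CD-k update on a single binary training vector v0."""
    h0 = sigmoid(v0 @ W + b)                            # positive statistics
    h = (rng.random(b.shape) < h0).astype(float)
    v = v0
    for _ in range(k):                                  # truncated Gibbs chain
        v = (rng.random(c.shape) < sigmoid(h @ W.T + c)).astype(float)
        h = (rng.random(b.shape) < sigmoid(v @ W + b)).astype(float)
    hk = sigmoid(v @ W + b)                             # negative statistics
    W += lr * (np.outer(v0, h0) - np.outer(v, hk))      # gradient ascent
    b += lr * (h0 - hk)
    c += lr * (v0 - v)
    return W, b, c
```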
21. Convolutional RBM’s Energy Function
K is the number of groups (“feature maps”) of hidden units; all units in group k share the filter W^k and bias b_k.
E(v,h) = -\sum_{k=1}^{K} \sum_{i,j=1}^{N_H} \sum_{r,s=1}^{N_W} h_{ij}^k W_{rs}^k v_{i+r-1,\, j+s-1} - \sum_{k=1}^{K} b_k \sum_{i,j=1}^{N_H} h_{ij}^k - c \sum_{i,j=1}^{N_V} v_{ij}
N_V: visible layer width, N_W: filter width, N_H = N_V - N_W + 1: detection layer width, N_P: pooling layer width
B_\alpha := \{(i, j) : h_{ij} \text{ belongs to the block } \alpha\}
\alpha: a C \times C pooling block
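A small numpy sketch of evaluating this energy (shapes assumed; the first term is a 'valid' correlation of v with each filter, implemented as convolution with the flipped kernel):

```python
import numpy as np
from scipy.signal import convolve2d

def crbm_energy(v, h, W, b, c):
    """v: (NV, NV) visible; h: (K, NH, NH) detection; W: (K, NW, NW) filters;
    b: (K,) hidden biases; c: scalar visible bias."""
    E = 0.0
    for k in range(W.shape[0]):
        # sum_{ij} h^k_ij * sum_{rs} W^k_rs v_{i+r-1, j+s-1}
        E -= np.sum(h[k] * convolve2d(v, W[k][::-1, ::-1], mode="valid"))
        E -= b[k] * h[k].sum()
    return E - c * v.sum()

NV, NW, K = 10, 3, 4
v = (np.random.rand(NV, NV) < 0.5).astype(float)
h = (np.random.rand(K, NV - NW + 1, NV - NW + 1) < 0.5).astype(float)
print(crbm_energy(v, h, np.random.randn(K, NW, NW), np.zeros(K), 0.0))
```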
22. Conditional distributions for Gibbs sampling
Gibbs sampling alternates between the two conditionals:
P(h_{ij}^k = 1 \mid v) = \sigma\big((\tilde{W}^k * v)_{ij} + b_k\big)
P(v_{ij} = 1 \mid h) = \sigma\big((\textstyle\sum_k W^k * h^k)_{ij} + c\big)
where \tilde{W}^k is W^k flipped horizontally and vertically. Alternating these two block-sampling steps draws samples from the model distribution P(v,h).
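One such Gibbs sweep sketched with scipy convolutions (binary units, single-channel v; names and shapes assumed):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def crbm_gibbs_step(v, W, b, c):
    """One sweep: sample all h^k given v, then v given h. W: (K, NW, NW)."""
    NH = v.shape[0] - W.shape[1] + 1
    h = np.empty((W.shape[0], NH, NH))
    for k in range(W.shape[0]):
        # P(h_ij^k = 1 | v) = sigmoid((W~^k * v)_ij + b_k)
        ph = sigmoid(convolve2d(v, W[k][::-1, ::-1], mode="valid") + b[k])
        h[k] = (rng.random(ph.shape) < ph).astype(float)
    # P(v_ij = 1 | h) = sigmoid((sum_k W^k * h^k)_ij + c)
    act = sum(convolve2d(h[k], W[k], mode="full") for k in range(W.shape[0])) + c
    v = (rng.random(act.shape) < sigmoid(act)).astype(float)
    return v, h
```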
26. Meaning of probabilistic max-pooling
“Max-pooling was intended only for feed-forward architectures. In contrast, we are interested in a generative model of images which supports both top-down and bottom-up inference. Therefore, we designed our generative model so that inference involves max-pooling-like behavior.”
(Section 3.3, Probabilistic max-pooling)
32. Training Deep Boltzmann Machine
Deep Learning lecture, Russ Salakhutdinov, University of Toronto (http://bit.ly/1Mg9mAi)
34. Let’s reconstruct our energy function
Deep Learning lecture, Russ Salakhutdinov, University of Toronto (http://bit.ly/1Mg9mAi)
[Figure: layer stack, top to bottom: h' (second hidden layer), weights \Gamma, pooling layer p, detection layer h, weights W, visible layer v]
E(v,h,p,h') = -\sum_k v \bullet (W^k * h^k) - \sum_k b_k \sum_{i,j} h_{ij}^k - \sum_{k,l} p^k \bullet (\Gamma^{kl} * h'^l) - \sum_l b'_l \sum_{i,j} h_{ij}'^l
36. Compare with DBM learning method
Deep Learning lecture, Russ Salakhutdinov, University of Toronto (http://bit.ly/1Mg9mAi)
E(v,h,p,h') = -\sum_k v \bullet (W^k * h^k) - \sum_k b_k \sum_{i,j} h_{ij}^k - \sum_{k,l} p^k \bullet (\Gamma^{kl} * h'^l) - \sum_l b'_l \sum_{i,j} h_{ij}'^l
(v is the visible layer; h, p, and h' are all hidden layers, as in a DBM.)
37. The conditional probabilities are given by
P(h_{i,j}^k = 1 \mid v, h') = \frac{\exp\big(I(h_{i,j}^k) + I(p_\alpha^k)\big)}{1 + \sum_{(i',j') \in B_\alpha} \exp\big(I(h_{i',j'}^k) + I(p_\alpha^k)\big)}
P(p_\alpha^k = 1 \mid v, h') = \frac{\sum_{(i',j') \in B_\alpha} \exp\big(I(h_{i',j'}^k) + I(p_\alpha^k)\big)}{1 + \sum_{(i',j') \in B_\alpha} \exp\big(I(h_{i',j'}^k) + I(p_\alpha^k)\big)}
I(h_{i,j}^k) carries the bottom-up signal (from v); I(p_\alpha^k) carries the top-down signal (from h').
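These conditionals say that, within each C×C block, the detection units and the “all off” state form a single multinomial; a minimal sampling sketch (names assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pool_block(I_h, I_p):
    """I_h: (C, C) bottom-up signals; I_p: scalar top-down signal."""
    logits = (I_h + I_p).ravel()
    probs = np.append(np.exp(logits), 1.0)   # last entry: all h off, p = 0
    probs /= probs.sum()
    choice = rng.choice(probs.size, p=probs)
    h = np.zeros(I_h.size)
    if choice < I_h.size:                    # at most one detection unit fires
        h[choice] = 1.0
    p = float(choice < I_h.size)             # p = 1 iff some h in the block is on
    return h.reshape(I_h.shape), p
```

Because both I_h (bottom-up, from v) and I_p (top-down, from h') enter the same softmax, sampling a block automatically combines the two directions of inference, which is exactly the property the deterministic max-pooling of slide 9 lacks.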