
Learning RBM (Restricted Boltzmann Machine) in Practice

In deep learning, the RBM is a basic building block of each layer in a hierarchical model. These slides cover the basic components of an RBM: the bipartite graph structure, Gibbs sampling, contrastive divergence (CD-1), and the energy function.



  1. A Practical Guide to Training Restricted Boltzmann Machines, Aug 2010, Geoffrey Hinton (University of Toronto). Learning Multiple Layers of Representation, Science Direct 2007, Geoffrey Hinton (University of Toronto). Jaehyun Ahn, Nov. 27, 2015, Sogang University
  2. A Practical Guide to Training Restricted Boltzmann Machines, Aug 2010, Geoffrey Hinton (University of Toronto). Learning Multiple Layers of Representation, Science Direct 2007, Geoffrey Hinton (University of Toronto)
  3. A Practical Guide to Training Restricted Boltzmann Machines • Overview • An RBM requires 7 meta-parameters to learn: – the learning rate – the momentum – the weight-cost – the sparsity target – the initial values of the weights – the number of hidden units – the size of each mini-batch • But this list does not explain why the decisions were made or how minor changes will affect performance
  4. A Practical Guide to Training Restricted Boltzmann Machines • Overview (same meta-parameter list as slide 3) [figure: A comparison of neural network architectures]
  5. A Practical Guide to Training Restricted Boltzmann Machines • Hopfield energy function of an RBM, with binary units $v_i, h_j \in \{0, 1\}$: $E(v, h) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}$ • This energy decides the probability distribution over visible and hidden vectors: $p(v, h) = \frac{1}{Z} e^{-E(v,h)}$, where $Z = \sum_{v,h} e^{-E(v,h)}$
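To make the energy function concrete, here is a minimal NumPy sketch (the variable names a, b, W for the visible biases, hidden biases, and weight matrix are mine, chosen to match the formulas above) that evaluates $E(v, h)$ and the unnormalized probability $e^{-E(v,h)}$ for one binary configuration:

      import numpy as np

      def energy(v, h, a, b, W):
          # E(v, h) = -a.v - b.h - v.W.h, with v and h binary vectors
          return -np.dot(a, v) - np.dot(b, h) - np.dot(v, W @ h)

      # toy RBM with 3 visible and 2 hidden units
      rng = np.random.default_rng(0)
      a, b = rng.normal(size=3), rng.normal(size=2)
      W = rng.normal(size=(3, 2))
      v, h = np.array([1, 0, 1]), np.array([0, 1])

      # unnormalized probability; dividing by Z (the sum of exp(-E) over all
      # 2^3 * 2^2 binary configurations) would give p(v, h)
      p_unnorm = np.exp(-energy(v, h, a, b, W))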
  6. A Practical Guide to Training Restricted Boltzmann Machines • The probability that the network assigns to a visible vector is given by summing over all possible hidden vectors: $p(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}$, with $Z = \sum_{v,h} e^{-E(v,h)}$
  7. A Practical Guide to Training Restricted Boltzmann Machines • The probability that the network assigns to a training image can be raised by adjusting the weights and biases to lower that image's energy: $p(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}$, with $Z = \sum_{v,h} e^{-E(v,h)}$
  8. A Practical Guide to Training Restricted Boltzmann Machines • Maximizing the log likelihood gives the gradient $\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$ and the learning rule $\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}})$, where $\epsilon$ is the learning rate • Contrastive Divergence: CD-k = k-step CD
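The gradient on this slide follows directly from the two definitions above; as a quick check (a standard derivation, not spelled out in the slides), note that $-\partial E(v,h)/\partial w_{ij} = v_i h_j$, so

      \log p(v) = \log \sum_h e^{-E(v,h)} - \log Z
      \frac{\partial \log p(v)}{\partial w_{ij}}
        = \sum_h p(h \mid v)\, v_i h_j - \sum_{v',h} p(v',h)\, v'_i h_j
        = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}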
  9. A Practical Guide to Training Restricted Boltzmann Machines • How to get: $\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}})$
  10. A Practical Guide to Training Restricted Boltzmann Machines • How to get $\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}})$ • $\langle v_i h_j \rangle_{\text{data}}$ is the result we obtain from the training data $v$ (the positive phase) • $\langle v_i h_j \rangle_{\text{model}}$ is the weight derivative of the probability distribution formed by the ideal weights (the negative phase)
  11. A Practical Guide to Training Restricted Boltzmann Machines • How to get $\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}})$ • $\langle v_i h_j \rangle_{\text{data}}$ is the result we obtain from the training data $v$ • $\langle v_i h_j \rangle_{\text{model}}$, the weight derivative of the probability distribution formed by the ideal weights, is the hard part to compute. Why? Because we do not know the weights that would produce the ideal distribution of the hidden nodes (features)
  12. A Practical Guide to Training Restricted Boltzmann Machines • How to get $\langle v_i h_j \rangle_{\text{model}}$: by using Gibbs sampling we can draw from the joint distribution $p(v, h)$, but we need to know $p(v, h)$
  13. A Practical Guide to Training Restricted Boltzmann Machines • How to get $\langle v_i h_j \rangle_{\text{model}}$: Gibbs sampling alternates between $p(v \mid h)$ and $p(h \mid v)$ • Sampling update rule: $p(h_j = 1 \mid v) = \sigma(b_j + \sum_i v_i w_{ij})$
  14. A Practical Guide to Training Restricted Boltzmann Machines • How to get $\langle v_i h_j \rangle_{\text{model}}$: Gibbs sampling alternates between $p(v \mid h)$ and $p(h \mid v)$ • Sampling update rules: $p(h_j = 1 \mid v) = \sigma(b_j + \sum_i v_i w_{ij})$ and $p(v_i = 1 \mid h) = \sigma(a_i + \sum_j h_j w_{ij})$ • Alternating these samples drives the chain toward energy equilibrium
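As a minimal sketch of one alternating Gibbs step under these two update rules (NumPy; the helper names sigmoid and gibbs_step are mine):

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def gibbs_step(v, a, b, W, rng):
          # Because the graph is bipartite, all hidden (resp. visible) units
          # are conditionally independent and can be sampled in parallel.
          p_h = sigmoid(b + v @ W)                        # p(h_j = 1 | v)
          h = (rng.random(p_h.shape) < p_h).astype(int)   # sample h ~ p(h | v)
          p_v = sigmoid(a + W @ h)                        # p(v_i = 1 | h)
          v_new = (rng.random(p_v.shape) < p_v).astype(int)
          return v_new, h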
  15. A Practical Guide to Training Restricted Boltzmann Machines • How to get $\langle v_i h_j \rangle_{\text{model}}$ via Gibbs sampling of $p(v \mid h)$ and $p(h \mid v)$ • Recognition: given an image input vector 01010110 …, compute $p(h_j = 1 \mid v) = \sigma(b_j + \sum_i v_i w_{ij})$ and set $h_j$ to 1 if $p(h_j = 1 \mid v) > 1/2$, else 0
  16. A Practical Guide to Training Restricted Boltzmann Machines • Generation (= inference): $p(v_i = 1 \mid h) = \sigma(a_i + \sum_j h_j w_{ij})$ produces a reconstruction 11110110 …, which is compared with the input 01010110 … for the weight update $\Delta w_{ij}$
  17. A Practical Guide to Training Restricted Boltzmann Machines • Now we can get $\langle v_i h_j \rangle_{\text{model}}$ by Gibbs sampling of $p(v \mid h)$ and $p(h \mid v)$: $w_{ij}$ is decided when the N-th Gibbs sampling step has reached energy equilibrium
  18. A Practical Guide to Training Restricted Boltzmann Machines • Now we can get $\langle v_i h_j \rangle_{\text{model}}$, so the update $\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}})$ can be computed
  19. A Practical Guide to Training Restricted Boltzmann Machines [figure]
  20. A Practical Guide to Training Restricted Boltzmann Machines [figure]
  21. A Practical Guide to Training Restricted Boltzmann Machines • $\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}})$
  22. A Practical Guide to Training Restricted Boltzmann Machines • $p(v_i = 1 \mid h) = \sigma(a_i + \sum_j h_j w_{ij})$
  23. A Practical Guide to Training Restricted Boltzmann Machines • $p(h_j = 1 \mid v) = \sigma(b_j + \sum_i v_i w_{ij})$
  24. A Practical Guide to Training Restricted Boltzmann Machines • When the algorithm terminates, CD-1 yields the weights and the biases (b, c), and training is complete
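Putting the pieces together, a minimal CD-1 step for one data vector might look like the sketch below (my own illustration in the notation above, not Hinton's reference code; following common practice the pairwise statistics use hidden probabilities rather than sampled states, which reduces sampling noise; it reuses numpy as np and sigmoid from the Gibbs-step sketch above):

      def cd1_update(v0, a, b, W, lr, rng):
          ph0 = sigmoid(b + v0 @ W)                        # positive phase: p(h | v0)
          h0 = (rng.random(ph0.shape) < ph0).astype(int)   # sampled hidden states
          pv1 = sigmoid(a + W @ h0)                        # reconstruct the visible units
          v1 = (rng.random(pv1.shape) < pv1).astype(int)
          ph1 = sigmoid(b + v1 @ W)                        # negative phase: p(h | v1)
          # eps * (<v h>_data - <v h>_recon), plus the matching bias updates
          W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
          a += lr * (v0 - v1)
          b += lr * (ph0 - ph1)
          return W, a, b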
  25. A Practical Guide to Training Restricted Boltzmann Machines [figure]
  26. An Analysis of Single-Layer Networks in Unsupervised Feature Learning • Effective learning features of a 1-hidden-layer RBM: – features (# of hidden nodes) – receptive fields (filters, field size) – whitening. 2011, Honglak Lee. There are two things we are trying to accomplish with whitening: 1. Make the features less correlated with one another. 2. Give all of the features the same variance. Whitening has two simple steps: 1. Project the dataset onto the eigenvectors. This rotates the dataset so that there is no correlation between the components. 2. Normalize the dataset to have a variance of 1 for all components, by simply dividing each component by the square root of its eigenvalue.
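A minimal sketch of those two whitening steps with NumPy (the small eps term is my addition, to avoid dividing by near-zero eigenvalues):

      import numpy as np

      def whiten(X, eps=1e-5):
          Xc = X - X.mean(axis=0)                  # center the data first
          cov = np.cov(Xc, rowvar=False)
          eigvals, eigvecs = np.linalg.eigh(cov)   # covariance is symmetric
          Xrot = Xc @ eigvecs                      # step 1: rotate / decorrelate
          return Xrot / np.sqrt(eigvals + eps)     # step 2: unit variance per component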
  27. Example: Olivetti faces • 64x64-pixel grayscale images, 400 samples • 40 classes, 10 faces of each person. Source: http://corpocrat.com/2014/10/17/machine-learning-using-restricted-boltzmann-machines/
  28. Example: Olivetti faces. 1. {0-1} scaling 2. Convolve (up, down, left, right). Convolve: http://juanreyero.com/article/python/python-convolution.html

      X = np.asarray(X, 'float32')
      X = (X - np.min(X, 0)) / (np.max(X, 0) + 0.0001)  # scale to 0 < x < 1
  29. Example: Olivetti faces. 2. Convolve (up, down, left, right). Convolve: http://juanreyero.com/article/python/python-convolution.html

      from scipy.ndimage import convolve
      import numpy as np

      def nudge_dataset(X, Y):
          """This produces a dataset 5 times bigger than the original one,
          by moving the 64x64 images in X around by 1px to left, right,
          down, up."""
          direction_vectors = [
              [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
              [[0, 0, 0], [1, 0, 0], [0, 0, 0]],
              [[0, 0, 0], [0, 0, 1], [0, 0, 0]],
              [[0, 0, 0], [0, 0, 0], [0, 1, 0]]]
          shift = lambda x, w: convolve(x.reshape((64, 64)), mode='constant',
                                        weights=w).ravel()
          X = np.concatenate([X] + [np.apply_along_axis(shift, 1, X, vector)
                                    for vector in direction_vectors])
          Y = np.concatenate([Y for _ in range(5)], axis=0)
          return X, Y

      # Convert image array to binary with threshold
      X = X > 0.5  # True / False
  30. Example: Olivetti faces. 3. Training (*n_components: # of binary hidden units)

      from sklearn import linear_model, metrics, cross_validation
      from sklearn.neural_network import BernoulliRBM
      from sklearn.pipeline import Pipeline

      logistic = linear_model.LogisticRegression(C=10)
      rbm = BernoulliRBM(n_components=180, learning_rate=0.01, batch_size=10,
                         n_iter=50, verbose=True, random_state=None)
      clf = Pipeline(steps=[('rbm', rbm), ('clf', logistic)])
      X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(
          X, Y, test_size=0.2, random_state=0)
      clf.fit(X_train, Y_train)
      Y_pred = clf.predict(X_test)
      print 'Score: ', (metrics.classification_report(Y_test, Y_pred))
  31. Example: Olivetti faces. 4. Plot RBM components of the first 16 faces

      import matplotlib.pyplot as plt

      comp = rbm.components_
      image_shape = (64, 64)

      def plot_gallery(title, images, n_col, n_row):
          plt.figure(figsize=(2. * n_col, 2.26 * n_row))
          plt.suptitle(title, size=16)
          for i, comp in enumerate(images):
              plt.subplot(n_row, n_col, i + 1)
              vmax = max(comp.max(), -comp.min())
              plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray,
                         vmin=-vmax, vmax=vmax)
              plt.xticks(())
              plt.yticks(())
          plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.)
          plt.show()

      plot_gallery('RBM components', comp[:16], 4, 4)
  32. A Practical Guide to Training Restricted Boltzmann Machines, Aug 2010, Geoffrey Hinton (University of Toronto). Learning Multiple Layers of Representation, Science Direct 2007, Geoffrey Hinton (University of Toronto)
  33. A Practical Guide to Training Restricted Boltzmann Machines, Aug 2010, Geoffrey Hinton (University of Toronto). Learning Multiple Layers of Representation, Science Direct 2007, Geoffrey Hinton (University of Toronto)
  34. Learning Multiple Layers of Representation • Overview – Multilayer generative models – Approximate inference for multilayer generative models – Learning many layers of features by composing RBMs
  35. Learning Multiple Layers of Representation • Overview – Multilayer generative models – Approximate inference for multilayer generative models – Learning many layers of features by composing RBMs • We already covered this part, in slides 11 to 25
  36. Learning Multiple Layers of Representation • Overview – Multilayer generative models – Approximate inference for multilayer generative models – Learning many layers of features by composing RBMs • What is a generative model? A generated sample (image) is optimized for recognition
  37. Learning Multiple Layers of Representation • Multilayer generative model vs. generative model • Why do we use a multilayer generative model for complex recognition? (= Why deep learning?) "Generative models with only one hidden layer are much too simple for modeling the high-dimensional and richly structured sensory data that arrive at the cortex, but they have been pressed into service because, until recently, it was too difficult to perform inference…"
  38. Learning Multiple Layers of Representation • Why do we use a multilayer generative model for complex recognition? "…they have been pressed into service because, until recently, it was too difficult to perform inference…" Who? Yann LeCun!
  39. Learning Multiple Layers of Representation • Multilayer generative model: take advantage of high-dimensional, richly structured data for recognition
  40. Learning Multiple Layers of Representation • Then why does the network attend to linear elements like these first? We could just compute the result directly. "The role of the bottom-up connections is to enable the network to determine activations for the features in each layer that constitute a plausible explanation (…) some test images that the network classifies correctly even though it has never seen them before" (figure: Yann LeCun)
  41. Learning Multiple Layers of Representation • Then how can we obtain weights that pick out features at these levels: linear parts, sub-structures, and whole-object structures? (figure: Yann LeCun)
  42. Learning Multiple Layers of Representation • (a) Two separate restricted Boltzmann machines (RBMs). The higher-level RBM is trained by using the hidden activities of the lower RBM as data. (b) Composing the two RBMs. Note that the connections in the lower level of the composite generative model are directed. The hidden states are still inferred by using bottom-up recognition connections, but these are no longer part of the generative model.
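A minimal sketch of this greedy, layer-wise composition with scikit-learn's BernoulliRBM (the layer sizes and learning parameters here are illustrative, not from the paper): train the lower RBM on the data, then train the higher-level RBM on the lower RBM's hidden activities.

      from sklearn.neural_network import BernoulliRBM

      # assume X is a binary (n_samples, n_features) array,
      # e.g. the thresholded Olivetti faces from the example above
      rbm1 = BernoulliRBM(n_components=180, learning_rate=0.01,
                          batch_size=10, n_iter=20, random_state=0)
      H1 = rbm1.fit_transform(X)   # hidden activities p(h=1|v) of the lower RBM

      rbm2 = BernoulliRBM(n_components=100, learning_rate=0.01,
                          batch_size=10, n_iter=20, random_state=0)
      H2 = rbm2.fit_transform(H1)  # the higher RBM treats H1 as its data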
  43. Learning Multiple Layers of Representation • While updating weights: recognition when $p(h_j = 1 \mid v) > 1/2$; inference (generation) when $p(v_i = 1 \mid h) > 1/2$
  44. Learning Multiple Layers of Representation • While updating weights: recognition when $p(h_j = 1 \mid v) > 1/2$; inference (generation) when $p(v_i = 1 \mid h) > 1/2$
  45. Learning Multiple Layers of Representation • While updating weights: recognition when $p(h_j = 1 \mid v) > 1/2$; inference (generation) when $p(v_i = 1 \mid h) > 1/2$
  46. Learning Multiple Layers of Representation • While updating weights, inference only. Why inference only? – quick, fast recognition: no repeated weight calculation – misclassifications can be detected and tolerated, which makes it possible to secure local features
