NIPS Reading Group 2013: One-shot learning by inverting a compositional causal process
1. One-shot learning by inverting a compositional causal process
Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum
能地宏 @NII
* The figures in these slides are quoted from the paper.
2. Paper excerpt (Section 1, Introduction): "People can acquire a new concept from only the barest of experience – just one or a handful of examples in a high-dimensional space of raw perceptual input. While the standard MNIST benchmark dataset for digit recognition has 6000 training examples per class [19], people can classify new images of a foreign handwritten character from just one example (Figure 1b) [23, 16, 17]. Similarly, while classifiers are generally trained on hundreds of images per class, using benchmark datasets such as ImageNet [4] and CIFAR-10/100 [14], people can learn a new visual object from just one example (e.g., a 'Segway' in Figure 1a)."
[Figure 1 from the paper: "Can you learn a new concept from just one example? (a & b) Where are the other examples of the concept shown in red? Answers for b) are row 4 column 3 (left) and row 2 column 4 (right). c) The learned concepts also support many other abilities such as generating examples and parsing."]
7. Overview
‣ Humans can extract the characteristic features of a symbol from just a single example
- Classification: pick out similar instances
- Generation: produce new samples of the symbol
‣ Machine learning typically requires large amounts of data per label
- e.g., MNIST: 6000 training examples / class
‣ Task and contributions
- Can machine learning imitate this human ability?
- A carefully specified generative model yields results comparable to humans
- This suggests humans may extract features through a similar mechanism
8. Paper abstract: "People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one example. We evaluated performance on a challenging one-shot classification task, where our model achieved a human-level error rate while substantially outperforming two deep learning models. We also tested the model on another conceptual task, generating new examples, by using a 'visual Turing test' to show that our model produces human-like performance."
9. Data and learning
‣ Omniglot dataset
- 50 alphabets; 1600 characters; 20 examples / character
[Figure 2 from the paper: "Four alphabets from Omniglot, each with five characters drawn by four different people."]
[Figure 4 from the paper: learned hyperparameters, showing a) the library of motor primitives, b) the empirical frequency of the number of strokes, and c) empirical stroke start positions for a drawing's first, second, third, and later strokes.]
Paper excerpt (Section 2.3, "Learning high-level knowledge of motor programs"): "The Omniglot dataset was randomly split into a 30 alphabet 'background' set and a 20 alphabet 'evaluation' set, constrained such that the background set included the six most common alphabets as determined by Google hits. Background images, paired with their motor data, were used to learn the hyperparameters of the HBPL model, including a set of 1000 primitive motor elements (Figure 4a) and position models for a drawing's first, second, and third stroke, etc. (Figure 4c). Where possible, cross-validation (within the background set) was used to decide issues of model complexity within the conditional probability distributions of HBPL."
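The constrained background/evaluation split described in the excerpt above can be sketched as follows. This is a minimal illustration, not the paper's code; the function name `split_alphabets` and its arguments are hypothetical.

```python
import random

def split_alphabets(alphabets, must_include, n_background=30, rng=random):
    """Split alphabet names into a 'background' set of size n_background and an
    'evaluation' set, constrained so every name in must_include (e.g., the six
    most common alphabets) lands in the background set."""
    rest = [a for a in alphabets if a not in must_include]
    rng.shuffle(rest)
    n_extra = n_background - len(must_include)
    background = list(must_include) + rest[:n_extra]
    evaluation = rest[n_extra:]
    return background, evaluation
```

With 50 alphabets and 6 forced-in names, this yields the paper's 30/20 split with the common alphabets guaranteed to be in the training portion.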
10. Model
[Figure 3 from the paper: the generative process for two character types, with type-level variables above the dotted line and token-level variables (superscripted (m)) below it.]
Figure 3 caption: "An illustration of the HBPL model generating two character types (left and right), where the dotted line separates the type-level from the token-level variables. Legend: number of strokes κ, relations R, primitive id z (color-coded to highlight sharing), control points x (open circles), scale y, start locations L, trajectories T, transformation A, noise ε and σb, and image I."
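As a loose illustration of the type/token split in the HBPL figure, here is a heavily simplified, hypothetical sketch: a character *type* fixes the number of strokes, the primitive ids, and the start locations, while a *token* perturbs the type. All names, ranges, and distributions here are illustrative assumptions, not the paper's actual model.

```python
import random
from dataclasses import dataclass

@dataclass
class CharacterType:
    kappa: int            # number of strokes
    primitive_ids: list   # z: one primitive id per stroke, from a shared library
    start_positions: list # L: start location per stroke

def sample_type(library_size=1000, max_strokes=4, rng=random):
    """Sample a character type (illustrative distributions only)."""
    kappa = rng.randint(1, max_strokes)
    return CharacterType(
        kappa=kappa,
        primitive_ids=[rng.randrange(library_size) for _ in range(kappa)],
        start_positions=[(rng.random(), rng.random()) for _ in range(kappa)],
    )

def sample_token(char_type, jitter=0.02, rng=random):
    """A token reuses the type's strokes but perturbs each start position."""
    return [(x + rng.gauss(0, jitter), y + rng.gauss(0, jitter))
            for (x, y) in char_type.start_positions]
```

The point of the two-level structure is that many tokens (noisy renderings) share one type, which is what makes one-shot generalization possible.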
11. Learning the hyperparameters
‣ Learn "common sense" about how symbols are drawn
- Uses the motor data (drawing videos), not just the static images
[Figure 4 from the paper: the learned library of motor primitives (a), number-of-strokes frequencies (b), and stroke start positions (c).]
Paper excerpt (the image model, whose noise parameters are among the learned quantities): "Image. An image transformation A(m) ∈ R4 is sampled from P(A(m)) = N([1, 1, 0, 0], ΣA), where the first two elements control a global re-scaling and the second two control a global translation of the center of mass of T(m). The transformed trajectories can then be rendered as a grayscale image, using an ink model adapted from [10] (see Section SI-2). This grayscale image is then perturbed by two noise processes, which make the gradient more robust during optimization and encourage partial solutions during classification. These processes include convolution with a Gaussian filter with standard deviation σb(m) and pixel flipping with probability ε(m), where the amounts of noise σb(m) and ε(m) are drawn uniformly on a pre-specified range. These pixels then parameterize 105x105 independent Bernoulli distributions, completing the full model of binary images P(I(m)|θ(m)) = P(I(m)|T(m), A(m), σb(m), ε(m))."
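The two noise processes in the image model (Gaussian blur with standard deviation σb, then pixel flips with probability ε, yielding per-pixel Bernoulli parameters) can be sketched in a few lines. This is a minimal stand-in, assuming pure-Python lists for images and zero padding at the borders; the paper's ink model and exact rendering are not reproduced.

```python
import math

def gaussian_kernel1d(sigma, radius=2):
    """1-D Gaussian taps, normalized to sum to 1."""
    taps = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    s = sum(taps)
    return [t / s for t in taps]

def blur(img, sigma):
    """Separable Gaussian blur of a 2-D list of floats (zero padding at edges)."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    h, w = len(img), len(img[0])
    # horizontal pass
    tmp = [[sum(k[j + r] * img[y][x + j] for j in range(-r, r + 1) if 0 <= x + j < w)
            for x in range(w)] for y in range(h)]
    # vertical pass
    return [[sum(k[j + r] * tmp[y + j][x] for j in range(-r, r + 1) if 0 <= y + j < h)
             for x in range(w)] for y in range(h)]

def bernoulli_image_probs(ink, sigma_b, eps):
    """Per-pixel P(pixel = 1): blur the rendered ink, then allow a flip with prob eps."""
    blurred = blur(ink, sigma_b)
    return [[(1 - eps) * p + eps * (1 - p) for p in row] for row in blurred]
```

Intuitively, blur spreads probability mass to neighboring pixels (tolerating small rendering errors), while the flip probability ε puts a floor under every pixel's likelihood, which is what "makes the gradient more robust" during optimization.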
12. Inference for one-shot classification
‣ For each image, estimate the posterior over stroke parses
‣ Then compute the probability of generating the target image from the estimated type
- For a test image I(T) and classes c = 1, ..., 20, a Bayesian classification rule computes an approximate solution to argmax_c log P(I(T)|I(c))
- The approximation uses the HBPL search algorithm to get K parses of the training image, runs MCMC chains to estimate the local type-level variability around each parse, and re-optimizes the token-level variables θ(T) with gradient-based searches
- It can be improved by incorporating some of the local variance around the token-level variables θ(m), which closely track the image (see Section SI-7); the score involves terms of the form P(I(T)|θ(T)) P(θ(T)|ψ) Q(θ(c), ψ, I(c))
‣ Human baseline: forty participants in the USA were tested on one-shot classification; on each trial, as in Figure 1b, participants were shown an image of a new character and had to find another image showing the same character, with just one randomly selected trial per task so that characters never repeated across trials, plus two practice trials with the Latin and Greek alphabets, and feedback
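The Bayesian classification rule argmax_c log P(I(T)|I(c)) reduces, under the discrete parse approximation, to a log-sum-exp over per-parse scores followed by an argmax over classes. Below is a hedged sketch; `classify_one_shot` and the structure of its input (one list of log-scores per class, each entry standing for log of weight times re-fit likelihood) are illustrative assumptions, not the paper's API.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def classify_one_shot(per_class_terms):
    """per_class_terms[c] holds, for each stored parse of training image I(c),
    a log-score (hypothetically: log of parse weight times the likelihood of
    the test image under the re-fit parse). Their log-sum-exp approximates
    log P(I_test | I(c)); the Bayesian rule returns argmax_c."""
    log_liks = {c: logsumexp(terms) for c, terms in per_class_terms.items()}
    return max(log_liks, key=log_liks.get)
```

Working in log space matters here: the slide's example scores (around -60 to -2000 nats) would underflow to zero if exponentiated directly.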
13. Posterior inference: approximating with selected parse candidates
‣ Step 1: perform random walks over the symbol to obtain stroke samples (150 of them)
‣ Step 2: compute each candidate's score under the prior, and keep only the top K
‣ The search finds K high-probability parses ψ[1], θ(m)[1], ..., ψ[K], θ(m)[K], the most promising candidates proposed by a fast, bottom-up image analysis (detailed in Section SI-5). These parses approximate the posterior with a discrete distribution:
P(ψ, θ(m) | I(m)) ≈ Σ_{i=1..K} w_i δ(θ(m) − θ(m)[i]) δ(ψ − ψ[i]),
where each weight w_i is proportional to the parse score, marginalizing over shape variables: w_i ∝ w̃_i = P(ψ[i], θ(m)[i], I(m)), normalized such that Σ_i w_i = 1.
‣ Rather than using just a point estimate for each parse, the approximation can be improved by incorporating some of the local variance. With the token level fixed, it is inexpensive to draw conditional samples from the type level P(ψ | θ(m)[i], I(m)), since it does not require evaluating the likelihood of the image. Metropolis Hastings is run to produce N samples for each parse θ(m)[i], denoted ψ[i1], ..., ψ[iN], giving the improved approximation
P(ψ, θ(m) | I(m)) ≈ Q(ψ, θ(m), I(m)) = Σ_{i=1..K} Σ_{j=1..N} (w_i / N) δ(θ(m) − θ(m)[i]) δ(ψ − ψ[ij]).
[Figure 5 residue: the slide shows example parses of a binary image with their log-weight scores (e.g., −59.6, −88.9, −159, −168) and planning scores.]
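Turning the unnormalized parse scores w̃_i into weights with Σ_i w_i = 1 is a log-space normalization, since the scores are log-probabilities spanning a huge dynamic range. A minimal sketch (function name is illustrative):

```python
import math

def normalize_parse_weights(log_scores):
    """Convert unnormalized log parse scores log w~_i = log P(psi[i], theta[i], I)
    into weights w_i >= 0 that sum to 1, computed stably by subtracting the max
    before exponentiating."""
    m = max(log_scores)
    unnorm = [math.exp(s - m) for s in log_scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

With the slide's example scores, the best parse at −59.6 dominates: the next parse at −88.9 gets weight on the order of e^(−29.3), so the discrete posterior is nearly a point mass.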
14. Computing the token level (adjustment)
‣ Using the estimated type variables, estimate the target's token variables (by MCMC)
‣ Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables (bottom of Figure 5b)
Figure 5 caption: "Parsing a raw image. a) The raw image (i) is processed by a thinning algorithm [18] (ii) and represented as an undirected graph [20] (iii), where parses are guided random walks (Section SI-5). b) The five best parses found for that image (top row) are shown with their log wj (Eq. 5), where numbers inside circles denote stroke order and starting position, and smaller open circles denote sub-stroke breaks. These five parses were re-fit to three different raw images of characters (left in image triplets), where the best parse (top right) and its associated image reconstruction (bottom right) are shown above its score (Eq. 9)."
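The posterior predictive scoring step (re-fit each stored parse to the new image, then combine the re-fit likelihoods under the parse weights) can be sketched as a weighted log-sum-exp. This is a hypothetical outline: `refit_loglik` stands in for the re-optimization of the token-level variables, which in the paper is a gradient/MCMC procedure, not a black-box function.

```python
import math

def posterior_predictive_score(parses, refit_loglik):
    """Score a new image under an approximate posterior. `parses` is a list of
    (weight, theta) pairs from the training image; refit_loglik(theta) stands in
    for log P(I_new | theta') after re-optimizing the token-level variables
    starting from theta. Returns log sum_i w_i * exp(refit_loglik(theta_i))."""
    terms = [math.log(w) + refit_loglik(theta) for w, theta in parses if w > 0]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Because each stored parse only needs a local re-fit rather than a full parse search on the new image, scoring a test image against a one-shot training image stays cheap.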