Matching Networks for One Shot Learning
¤ DeepMind
¤ Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra
¤ 2016/06/13 arXiv
¤ One-shot learning
¤ Matching Nets: state-of-the-art
¤ one-shot learning
One-shot learning
¤ One-shot learning
¤ 1
¤ deep learning
¤ Deep learning
¤ AI
¤ One-shot learning
¤ Li Fei-Fei, Brenden Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum
One shot learning of simple visual concepts
Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Abstract
People can learn visual concepts from just one encounter, but it remains a mystery how this is accomplished. Many authors have proposed that transferred knowledge from more familiar concepts is a route to one shot learning, but what is the form of this abstract knowledge? One hypothesis is that the sharing of parts is core to one shot learning, but there have been few attempts to test this hypothesis on a large scale. This paper works in the domain of handwritten characters, which contain a rich component structure of strokes. We introduce a generative model of how characters are composed from strokes, and how knowledge from previous characters helps to infer the latent strokes in novel characters. After comparing several models and humans on one shot character learning, we find that our stroke model outperforms a state-of-the-art character model by a large margin, and it provides a closer fit to human perceptual data.
Keywords: category learning; transfer learning; Bayesian modeling; neural networks
A hallmark of human cognition is learning from just a few examples. For instance, a person only needs to see one Segway to acquire the concept and be able to discriminate future Segways from other vehicles like scooters and unicycles (Fig. 1 left). Similarly, children can acquire a new word from one encounter (Carey & Bartlett, 1978). How is one shot learning possible?
New concepts are almost never learned in a vacuum. Experience with other, more familiar concepts in a domain…
Where are the others?
Figure 1: Test yourself on one shot learning. From the example boxed in red, can you find the others in the grid below? On the left is a Segway and on the right is the first character of the Bengali alphabet. Answer for the Bengali character: Row 2, Column 2; Row 3, Column 4.
[Lake+ 2011] One shot learning of simple visual concepts
Zero-shot learning
¤ Zero-shot learning 1
¤ side information
¤ One-shot learning
[Socher+ 2013] Zero-Shot Learning Through Cross-Modal Transfer
¤ One-shot learning
[Pan+ 2010] A Survey on Transfer Learning
¤ [Pan+2010] [ +2010]
One-shot learning
¤ Fei-Fei [Fei-Fei+ 2006]
¤ Zero-shot learning [Larochelle+ 2008]
¤ Hierarchical Bayesian Program Learning (HBPL) [Lake+ 2011; 2012; 2013; 2015]
¤ 1
(Excerpt from Lake et al.'s Bayesian Program Learning work, cf. [Lake+ 2015].)
…the pen (Fig. 3A, ii). To construct a new character type, first the model samples the number of parts k and the number of subparts ni, for each part i = 1, ..., k, from their empirical distributions as measured from the background set. Second, a template for a part Si is constructed by sampling subparts from a set of discrete primitive actions learned from the background set (Fig. 3A, i), such that the probability of the next action depends on the previous. Third, parts are then grounded as parameterized curves (splines) by sampling the control points and scale parameters for each subpart. Last, parts are roughly positioned to begin either independently, at the beginning, at the end, or along previous parts, as defined by relation Ri (Fig. 3A, iv).
Character tokens θ(m) are produced by executing the parts and the relations and modeling how ink flows from the pen to the page. First, motor noise is added to the control points and the scale of the subparts to create token-level stroke trajectories S(m). Second, the trajectory's precise start location L(m) is sampled from the schematic provided by its relation Ri to previous strokes. Third, global transformations are sampled, including an affine warp A(m) and adaptive noise parameters that ease probabilistic inference (30). Last, a binary image I(m) is created by a stochastic rendering function, lining the stroke trajectories with grayscale ink and interpreting the pixel values as independent Bernoulli probabilities.
Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image I(m). Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses. The most promising candidates are refined by using continuous optimization and local search, forming a discrete approximation to the posterior distribution (section S3). Figure 4A shows the best discovered programs for a training image and how they are refit to different test images to compute a classification score (the log posterior predictive probability), where higher scores indicate that they are more likely to belong to the same class. A high score is achieved when at least one set of parses can successfully explain both the training and the test images, without violating the soft constraints of the learned within-class variability model. Figure 4B compares the model's best-scoring parses with the ground-truth human parses for several characters.
Results
People, BPL, and alternative models were compared side by side on five concept learning tasks that examine different forms of generalization from just one or a few examples (Fig. 5). All behavioral experiments were run through Amazon's Mechanical Turk, and the experimental procedures are described in the supplement.
Fig. 3. A generative model of handwritten characters. (A) New types are generated by choosing primitive actions (color coded) from a library (i), combining these subparts (ii) to make parts (iii), and combining parts with relations to define simple programs (iv). New tokens are generated by running these programs (v), which are then rendered as raw data (vi). (B) Pseudocode for generating new types ψ and new token images I(m) for m = 1, ..., M. The function f(·, ·) transforms a subpart sequence and start location into a trajectory.
(Figure panels: "Human or Machine?"; human drawings with corresponding human and machine parses; a training item with the model's five best parses.)
one-shot learning
¤ 2
¤ One-shot learning
¤ One-shot same or different
¤ One-shot
Siamese Neural Networks for One-shot Image Recognition
(Excerpt from [Koch+ 2015].) …should generalize to one-shot classification. The verification model learns to identify input pairs according to the probability that they belong to the same class or different classes. This model can then be used to evaluate new images, exactly one per novel class, in a pairwise manner against the test image. The pairing with the highest score according to the verification network is then awarded the highest probability for the one-shot task. If the features learned by the verification model are sufficient to confirm or deny the identity of characters from one set of alphabets, then they ought to be sufficient for other alphabets, provided that the model has been exposed to a variety of alphabets to encourage variance amongst the learned features.
Figure 2. Our general strategy. 1) Train a model to discriminate between a collection of same/different pairs. 2) Generalize to evaluate new categories based on learned feature mappings for verification.
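A minimal sketch of this evaluation protocol, assuming a trained pairwise scorer `verify(a, b)` (a hypothetical stand-in for the verification network) that returns a same-class score:

```python
import numpy as np

def one_shot_predict(verify, test_image, support_images, support_labels):
    """Score the test image against exactly one support example per novel
    class and return the label of the best-scoring pairing."""
    scores = [verify(test_image, img) for img in support_images]
    return support_labels[int(np.argmax(scores))]

# Toy usage: a stand-in scorer that just compares mean pixel intensity.
dummy_verify = lambda a, b: -abs(float(a.mean()) - float(b.mean()))
support = [np.full((28, 28), v) for v in (0.1, 0.5, 0.9)]
print(one_shot_predict(dummy_verify, np.full((28, 28), 0.48),
                       support, ["a", "b", "c"]))  # -> "b"
```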
Siamese Network
¤ Siamese Network one-shot learning [Koch+ 2015]
¤ Siamese Network [Bromley+ 1993]
Siamese Neural Networks for One-shot Image Recognition
Figure 3. A simple 2 hidden layer siamese network for binary classification with logistic prediction p. The structure of the network is replicated across the top and bottom sections to form twin networks, with shared weight matrices at each layer.
…sets where very few examples exist for some classes, providing a flexible and continuous means of incorporating inter-class information into the model.
…by the energy loss, whereas we fix the metric as specified above, following the approach in Facebook's DeepFace paper (Taigman et al., 2014).
Our best-performing models use multiple convolutional layers before the fully-connected layers and top-level energy function. Convolutional neural networks have achieved exceptional results in many large-scale computer vision applications, particularly in image recognition tasks (Bengio, 2009; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Srivastava, 2013).
Several factors make convolutional networks especially appealing. Local connectivity can greatly reduce the number of parameters in the model, which inherently provides some form of built-in regularization, although convolutional layers are computationally more expensive than standard nonlinearities. Also, the convolution operation used in these networks has a direct filtering interpretation, where each feature map is convolved against input features to identify patterns as groupings of pixels. Thus, the outputs of each convolutional layer correspond to important spatial features in the original input space and offer some robustness to simple transforms. Finally, very fast CUDA libraries are now available in order to build large convolutional networks without an unacceptable amount of training time.
…activation function. This final layer induces a metric on the learned feature space of the (L−1)th hidden layer and scores the similarity between the two feature vectors. The αj are additional parameters that are learned by the model during training, weighting the importance of the component-wise distance. This defines a final Lth fully-connected layer for the network which joins the two siamese twins.
We depict one example above (Figure 4), which shows the largest version of our model that we considered. This network also gave the best result for any network on the verification task.
3.2. Learning
Loss function. Let M represent the minibatch size, where i indexes the ith minibatch. Now let y(x1(i), x2(i)) be a length-M vector which contains the labels for the minibatch, where we assume y(x1(i), x2(i)) = 1 whenever x1 and x2 are from the same character class and y(x1(i), x2(i)) = 0 otherwise. We impose a regularized cross-entropy objective on our binary classifier of the following form:
$$\mathcal{L}\big(x_1^{(i)}, x_2^{(i)}\big) = y\big(x_1^{(i)}, x_2^{(i)}\big) \log p\big(x_1^{(i)}, x_2^{(i)}\big) + \big(1 - y\big(x_1^{(i)}, x_2^{(i)}\big)\big) \log\big(1 - p\big(x_1^{(i)}, x_2^{(i)}\big)\big) + \lambda^{\top} |\mathbf{w}|^{2}$$
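As a sketch, the objective above can be computed as follows (in the usual minimized form; the scalar `lam` stands in for the per-parameter regularization weights λ):

```python
import numpy as np

def siamese_loss(p, y, w, lam=1e-4):
    """Regularized cross-entropy over a minibatch of pairs: y[i] = 1 for
    same-class pairs and 0 otherwise, p[i] is the predicted same-class
    probability, and lam * |w|^2 is the weight-decay term."""
    eps = 1e-12  # numerical guard, not part of the printed formula
    ce = y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps)
    return -ce.mean() + lam * float(np.sum(w ** 2))

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=8)           # toy pair probabilities
y = rng.integers(0, 2, size=8).astype(float)  # toy same/different labels
w = rng.normal(size=100)                      # toy network weights
print(siamese_loss(p, y, w))
```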
Optimization. This objective is combined with the standard backpropagation algorithm, where the gradient is additive across the twin networks due to the tied weights. We fix…
Weight initialization. We initialized all network weights in the convolutional layers from a normal distribution with zero mean and a standard deviation of 10⁻². Biases were also initialized from a normal distribution, but with mean 0.5 and standard deviation 10⁻². In the fully-connected layers, the biases were initialized in the same way as the convolutional layers, but the weights were drawn from a much wider normal distribution with zero mean and standard deviation 2 × 10⁻¹.
Learning schedule. Although we allowed for a different learning rate for each layer, learning rates were decayed uniformly across the network, so that η_j(T) = 0.99 η_j(T−1). We found that by annealing the learning rate, the network was able to converge to local minima more easily without getting stuck in the error surface. We fixed momentum to start at 0.5 in every layer, increasing linearly each epoch until reaching the individual momentum term μ_j for the jth layer.
We trained each network for a maximum number of epochs and monitored one-shot validation error on one-shot learning tasks generated from the alphabets and drawers in the validation set. When the one-shot validation error did not decrease for 20 epochs, we stopped and used the parameters of the model at the best epoch according to the one-shot validation error. If the validation error continued to decrease for the entire learning schedule, we saved the final state of the model generated by this procedure.
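A sketch of this recipe in NumPy; the layer shapes, initial learning rates, momentum caps, and `max_epochs` are placeholder assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Initialization as described above (shapes are placeholders).
conv_W = rng.normal(0.0, 1e-2, size=(64, 1, 10, 10))  # conv weights ~ N(0, 1e-2)
conv_b = rng.normal(0.5, 1e-2, size=(64,))            # conv biases ~ N(0.5, 1e-2)
fc_W = rng.normal(0.0, 2e-1, size=(4096, 1024))       # fc weights ~ wider N(0, 2e-1)
fc_b = rng.normal(0.5, 1e-2, size=(4096,))            # fc biases ~ N(0.5, 1e-2)

# Per-layer learning-rate decay eta_j(T) = 0.99 * eta_j(T-1), and a momentum
# ramp from 0.5 toward per-layer caps mu_j; early stopping on one-shot
# validation error with patience 20 (validation loop omitted here).
eta = np.full(4, 1e-3)        # assumed initial per-layer learning rates
mu_j = np.full(4, 0.9)        # assumed per-layer momentum caps
mu = np.full(4, 0.5)
max_epochs, patience = 100, 20
for epoch in range(max_epochs):
    eta *= 0.99
    mu = np.minimum(mu_j, mu + (mu_j - 0.5) / max_epochs)
```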
Hyperparameter optimization. …
¤ …
¤ 1 one-shot
Neural Turing Machine
¤ Neural Turing Machine (NTM) [Graves+ 2014]
Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads to and writes from a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world.
Memory Augmented Neural Network
¤ one-shot learning
Task setup
• This whole sequence of steps is called an "episode"
• At the start of an episode, the model can only guess the labels at random
• As the episode proceeds, the accuracy rises.
• Accuracy that rises quickly = good one-shot learning
Correct! 2 1
We want to train the task: "after seeing only a few examples of a character, quickly become able to recognize it"
…continues for 50 steps
Memory
http://www.slideshare.net/YusukeWatanabe3/metalearning-with-memory-augmented-neural-network
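A sketch of this episode construction; `dataset` (class id mapped to a list of images) is an assumed structure, and the key point is the per-episode label shuffle:

```python
import random

def make_episode(dataset, n_classes=5, length=50):
    """Build one episode: class-to-label assignments are re-shuffled every
    episode, so early predictions can only be random guesses; doing well
    later requires remembering (image, label) pairs seen in this episode."""
    classes = random.sample(sorted(dataset), n_classes)
    shuffled = random.sample(range(n_classes), n_classes)
    labels = {c: shuffled[i] for i, c in enumerate(classes)}
    return [(random.choice(dataset[c]), labels[c])
            for c in random.choices(classes, k=length)]

toy = {f"char{i}": [f"img{i}_{j}" for j in range(20)] for i in range(10)}
print(make_episode(toy, n_classes=3, length=6))
```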
Matching Networks for One Shot Learning
¤ One-shot learning
¤ N-way classification: N classes, k (1 or 5) examples per class
¤ one-shot learning episodes:
1. Sample a label set L of N classes, with k (1 or 5) examples each
2. From L, sample a support set S and a batch B
¤ One-shot learning: given a support set S = {(x_i, y_i)} for i = 1, ..., k and a test example x̂, predict its label ŷ: S, x̂ → ŷ
¤ Model P(ŷ|x̂, S)
¤ Should transfer to a new support set S′ without retraining: S → S′
¤ P(ŷ|x̂, S′)
support set S, and adds "depth" to the computation of attention (see appendix for more details).
2.2 Training Strategy
In the previous subsection we described Matching Networks which map a support set to a classification function, S → c(x̂). We achieve this via a modification of the set-to-set paradigm augmented with attention, with the resulting mapping being of the form P_θ(·|x̂, S), noting that θ are the parameters of the model (i.e. of the embedding functions f and g described previously).
The training procedure has to be chosen carefully so as to match inference at test time. Our model has to perform well with support sets S′ which contain classes never seen during training.
More specifically, let us define a task T as a distribution over possible label sets L. Typically we consider T to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few examples per class (e.g., up to 5). In this case, a label set L sampled from a task T, L ~ T, will typically have 5 to 25 examples.
To form an "episode" to compute gradients and update our model, we first sample L from T (e.g., L could be the label set {cats, dogs}). We then use L to sample the support set S and a batch B (i.e., both S and B are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error predicting the labels in the batch B conditioned on the support set S. This is a form of meta-learning since the training procedure explicitly learns to learn from a given support set to minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:
$$\theta = \arg\max_{\theta} \, \mathbb{E}_{L \sim T}\!\left[ \mathbb{E}_{S \sim L,\, B \sim L}\!\left[ \sum_{(x,y) \in B} \log P_{\theta}(y \mid x, S) \right] \right] \qquad (2)$$
Training θ with eq. 2 yields a model which works well when sampling S′ ~ T′ from a different distribution of novel labels. Crucially, our model does not need any fine tuning on the classes it has never seen.
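A sketch of one such episode under the objective above; `model_loss(batch, support)` is a hypothetical stand-in for the negated sum of log P_θ(y|x, S) over the batch:

```python
import random

def episode_loss(task_classes, model_loss, n_way=5, k_shot=1, per_class=2):
    """One meta-training episode: sample a label set L ~ T, then a support
    set S ~ L and a batch B ~ L, and return the loss on B given S.
    task_classes maps class -> list of examples (assumed structure)."""
    L = random.sample(sorted(task_classes), n_way)
    support, batch = [], []
    for c in L:
        xs = random.sample(task_classes[c], k_shot + per_class)
        support += [(x, c) for x in xs[:k_shot]]   # S ~ L
        batch += [(x, c) for x in xs[k_shot:]]     # B ~ L
    return model_loss(batch, support)              # minimize w.r.t. theta

# Toy usage with a dummy loss that just counts the batch size.
toy = {f"class{i}": list(range(10)) for i in range(20)}
print(episode_loss(toy, lambda B, S: len(B)))
```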
Matching Networks
¤ Matching networks model P(ŷ|x̂, S)
¤ one-shot learning, trained end-to-end
Figure 1: Matching Networks architecture (labels: support set S, test input x̂, prediction ŷ)
Matching Networks
¤ Matching network
¤ attention mechanism a
¤ nearest-neighbor
¤ neural machine translation alignment model [Bahdanau+ 2015]
¤ a acts as attention; the y_i act as memories bound to the x_i
…a new support set of examples S′ from which to one-shot learn, we simply use the parametric neural network defined by P to make predictions about the appropriate label ŷ for each test example x̂: P(ŷ|x̂, S′). In general, our predicted output class for a given input unseen example x̂ and a support set S becomes arg max_y P(y|x̂, S).
Our model in its simplest form computes ŷ as follows:
$$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i \qquad (1)$$
where x_i, y_i are the samples and labels from the support set S = {(x_i, y_i)} for i = 1, ..., k, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on X × X, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest x_i from x̂ according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to '(k − b)'-nearest neighbours (although this requires an extension to the attention mechanism that we describe in Section 2.1.2). Thus (1) subsumes both KDE and kNN methods.
Another view of (1) is where a acts as an attention mechanism and the y_i act as memories bound to the corresponding x_i. In this case we can understand this as a particular kind of associative memory where, given an input, we "point" to the corresponding example in the support set, retrieving its label. However, unlike other attentional memory mechanisms [2], (1) is non-parametric in nature: as the support set size grows, so does the memory used. Hence the functional form defined by the classifier c_S(x̂) is very flexible and can adapt easily to any new support set.
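A NumPy sketch of this simplest form, using the cosine-softmax attention of Section 2.1.1 and assuming the embeddings f(x̂) and g(x_i) are already computed:

```python
import numpy as np

def matching_predict(x_hat_emb, support_embs, support_onehot):
    """Eq. (1): y_hat = sum_i a(x_hat, x_i) y_i, with a(.,.) a softmax over
    cosine similarities between the embedded query and support points."""
    def unit(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    cos = unit(support_embs) @ unit(x_hat_emb)   # c(f(x_hat), g(x_i)) for all i
    a = np.exp(cos - cos.max())
    a /= a.sum()                                 # softmax attention weights
    return a @ support_onehot                    # linear combination of labels

# Toy 3-way, 1-shot support set with 4-d embeddings.
rng = np.random.default_rng(0)
embs = rng.normal(size=(3, 4))
query = embs[1] + 0.01 * rng.normal(size=4)      # near class 1
print(matching_predict(query, embs, np.eye(3)))  # mass concentrates on class 1
```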
¤ attention a: softmax over the cosine distance c
¤ g: bidirectional RNN (LSTM)
¤ f: LSTM with read-attention over g(S)
¤ base embeddings: VGG or Inception
…train it by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task.
Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of tasks that can be used to benchmark other approaches on both ImageNet and small scale language modeling. We hope that our results will encourage others to work on this challenging problem.
We organized the paper by first defining and explaining our model whilst linking its several components to related work. Then in the following section we briefly elaborate on some of the related work to the task and our model. In Section 4 we describe both our general setup and the experiments we performed, demonstrating strong results on one-shot learning on a variety of tasks and setups.
2 Model
Our non-parametric approach to solving one-shot learning is based on two components which we
describe in the following subsections. First, our model architecture follows recent advances in neural
networks augmented with memory (as discussed in Section 3). Given a (small) support set S, our
model defines a function c_S (or classifier) for each S, i.e. a mapping S → c_S(·). Second, we employ
a training strategy which is tailored for one-shot learning from the support set S.
2.1 Model Architecture
In recent years, many groups have investigated ways to augment neural network architectures with
external memories and other components that make them more “computer-like”. We draw inspiration
from models such as sequence to sequence (seq2seq) with attention [2], memory networks [29] and
pointer networks [27].
In all these models, a neural attention mechanism, often fully differentiable, is defined to access (or
read) a memory matrix which stores useful information to solve the task at hand. Typical uses of
this include machine translation, speech recognition, or question answering. More generally, these
architectures model P(B|A) where A and/or B can be a sequence (like in seq2seq models), or, more
interestingly for us, a set [26].
Our contribution is to cast the problem of one-shot learning within the set-to-set framework [26].
Appendix
A Model Description
In this section we fully specify the models which condition the embedding functions f and g on the
whole support set S. Much previous work has fully described similar mechanisms, which is why we
left the precise details for this appendix.
A.1 The Fully Conditional Embedding f
As described in section 2.1.2, the embedding function for an example x̂ in the batch B is as follows:
$$f(\hat{x}, S) = \mathrm{attLSTM}(f'(\hat{x}), g(S), K)$$
where f′ is a neural network (e.g., VGG or Inception, as described in the main text). We define K to be the number of "processing" steps following work from [26] from their "Process" block. g(S) represents the embedding function g applied to each element x_i from the set S.
Thus, the state after k processing steps is as follows:
$$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}) \qquad (3)$$
$$h_k = \hat{h}_k + f'(\hat{x}) \qquad (4)$$
$$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \qquad (5)$$
$$a(h_{k-1}, g(x_i)) = \mathrm{softmax}\!\big(h_{k-1}^{\top} g(x_i)\big) \qquad (6)$$
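A runnable sketch of these K processing steps; `lstm_step` is a stand-in for the learned LSTM cell, and the shapes are toy:

```python
import numpy as np

def att_lstm_read(f_prime_x, g_S, K, lstm_step):
    """attLSTM(f'(x_hat), g(S), K): K content-based reads over the support
    embeddings g(S), with the skip connection of eq. (4) and the softmax
    read-out of eqs. (5)-(6). Returns h_K = f(x_hat, S)."""
    d = f_prime_x.shape[0]
    h, c, r = np.zeros(d), np.zeros(d), np.zeros(d)
    for _ in range(K):
        h_hat, c = lstm_step(f_prime_x, np.concatenate([h, r]), c)  # eq. (3)
        h = h_hat + f_prime_x                                       # eq. (4)
        logits = g_S @ h                                            # h^T g(x_i)
        a = np.exp(logits - logits.max()); a /= a.sum()             # eq. (6)
        r = a @ g_S                                                 # eq. (5)
    return h

def fake_lstm(x, hr, c):
    """Stand-in cell: mixes input, recurrent state, and cell state."""
    h_new = np.tanh(x + hr[: x.shape[0]] + c)
    return h_new, 0.5 * c + 0.5 * h_new

rng = np.random.default_rng(0)
g_S = rng.normal(size=(5, 8))                    # 5 support embeddings, dim 8
print(att_lstm_read(rng.normal(size=8), g_S, K=3, lstm_step=fake_lstm))
```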
2.1.1 The Attention Kernel
Equation 1 relies on choosing a(·, ·), the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with common attention models and kernel functions) is to use the softmax over the cosine distance c:
$$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}),\, g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}),\, g(x_j))}}$$
with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed x̂ and x_i. In our experiments we examine examples where f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG [22] or Inception [24]) or a simple form word embedding for language tasks (see Section 4).
We note that, though related to metric learning, the classifier defined by Equation 1 is discriminative. For a given support set S and sample to classify x̂, it is enough for x̂ to be sufficiently aligned with pairs (x′, y′) ∈ S such that y′ = y and misaligned with the rest. This kind of loss is also related to methods such as Neighborhood Component Analysis (NCA) [18], triplet loss [9] or large margin nearest neighbor [28].
However, the objective that we are trying to optimize is precisely aligned with multi-way, one-shot classification, and thus we expect it to perform better than its counterparts. Additionally, the loss is simple and differentiable so that one can find the optimal parameters in an "end-to-end" fashion.
2.1.2 Full Context Embeddings
The main novelty of our model lies in reinterpreting a well studied framework (neural networks with external memories) to do one-shot learning. Closely related to metric learning, the embedding functions f and g act as a lift to feature space X to achieve maximum accuracy through the classifier c_S.
noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as "content" based attention, and the softmax in eq. 6 normalizes w.r.t. g(x_i). The read-out r_{k−1} from g(S) is concatenated to h_{k−1}. Since we do K steps of "reads", attLSTM(f′(x̂), g(S), K) = h_K, where h_k is as described in eq. 3.
A.2 The Fully Conditional Embedding g
In section 2.1.2 we described the encoding function for the elements in the support set S, g(x_i, S), as a bidirectional LSTM. More precisely, let g′(x_i) be a neural network (similar to f′ above, e.g. a VGG or Inception model). Then we define $g(x_i, S) = \vec{h}_i + \overleftarrow{h}_i + g'(x_i)$ with:
$$\vec{h}_i, \vec{c}_i = \mathrm{LSTM}(g'(x_i), \vec{h}_{i-1}, \vec{c}_{i-1})$$
$$\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})$$
where, as above, LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. Note that the recursion for $\overleftarrow{h}$ starts from i = |S|. As in eq. 3, we add a skip connection between input and outputs.
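A sketch of this bidirectional encoding with the same kind of stand-in cell as before:

```python
import numpy as np

def fce_g(g_prime_embs, lstm_step):
    """g(x_i, S) = h_fwd_i + h_bwd_i + g'(x_i): a forward and a backward
    LSTM pass over the support embeddings plus a skip connection. The
    backward recursion starts from i = |S|, as in the text."""
    n, d = g_prime_embs.shape
    h_fwd, h_bwd = np.zeros((n, d)), np.zeros((n, d))
    h, c = np.zeros(d), np.zeros(d)
    for i in range(n):                        # forward pass
        h, c = lstm_step(g_prime_embs[i], h, c)
        h_fwd[i] = h
    h, c = np.zeros(d), np.zeros(d)
    for i in reversed(range(n)):              # backward pass, from i = |S|
        h, c = lstm_step(g_prime_embs[i], h, c)
        h_bwd[i] = h
    return h_fwd + h_bwd + g_prime_embs

fake_lstm = lambda x, h, c: (np.tanh(x + h + c), 0.5 * c + 0.5 * np.tanh(x))
rng = np.random.default_rng(0)
print(fce_g(rng.normal(size=(4, 6)), fake_lstm).shape)  # (4, 6)
```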
B ImageNet Class Splits
Here we define the two class splits used in our full ImageNet experiments – these classes were
excluded for training during our one-shot experiments described in section 4.1.2.
Set-to-set
¤ seq2seq
¤ Matching network
¤ Order Matters: Sequence to sequence for sets [Vinyals+ 2015]
¤ Seq2seq
All these empirical findings point to the same story: often for optimization purposes, the order in
which input data is shown to the model has an impact on the learning performance.
Note that we can define an ordering which is independent of the input sequence or set X (e.g., always
reversing the words in a translation task), but also an ordering which is input dependent (e.g., sorting
the input points in the convex hull case). This distinction also applies in the discussion about output
sequences and sets in Section 5.1.
Recent approaches which pushed the seq2seq paradigm further by adding memory and computation
to these models allowed us to define a model which makes no assumptions about input ordering,
whilst preserving the right properties which we just discussed: a memory that increases with the
size of the set, and which is order invariant. In the next sections, we explain such a modification,
which could also be seen as a special case of a Memory Network (Weston et al., 2015) or Neural
Turing Machine (Graves et al., 2014) – with a computation flow as depicted in Figure 1.
4.2 ATTENTION MECHANISMS
Neural models with memories coupled to differentiable addressing mechanisms have been successfully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bahdanau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al., 2015). Since we are interested in associative memories we employed a "content" based attention. This has the property that the vector retrieved from our memory would not change if we randomly shuffled the memory. This is crucial for proper treatment of the input set X as such. In particular, our process block based on an attention mechanism uses the following:
$$q_t = \mathrm{LSTM}(q^{*}_{t-1}) \qquad (3)$$
$$e_{i,t} = f(m_i, q_t) \qquad (4)$$
$$a_{i,t} = \frac{\exp(e_{i,t})}{\sum_{j} \exp(e_{j,t})} \qquad (5)$$
$$r_t = \sum_{i} a_{i,t}\, m_i \qquad (6)$$
$$q^{*}_t = [q_t\; r_t] \qquad (7)$$
Figure 1: The Read-Process-and-Write model.
where i indexes through each memory vector m_i (typically equal to the cardinality of X), q_t is a query vector which allows us to read r_t from the memories, f is a function that computes a single scalar from m_i and q_t (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. q*_t is the state which this LSTM evolves, and is formed by concatenating the query q_t with the resulting attention readout r_t. t is the index which indicates…
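A sketch of this process block; `lstm_step` again stands in for the input-less LSTM cell. Shuffling the rows of `memories` leaves each read-out r_t unchanged, which is the order-invariance property the text highlights:

```python
import numpy as np

def process_block(memories, T, lstm_step, q_dim):
    """Eqs. (3)-(7): an input-less LSTM evolves q*_t; each step scores all
    memories m_i with a dot product, takes a softmax, reads out r_t, and
    concatenates it to the query q_t."""
    q_star = np.zeros(2 * q_dim)
    c = np.zeros(q_dim)
    for _ in range(T):
        q, c = lstm_step(q_star, c)              # eq. (3): no external input
        e = memories @ q                         # eq. (4): f = dot product
        a = np.exp(e - e.max()); a /= a.sum()    # eq. (5)
        r = a @ memories                         # eq. (6)
        q_star = np.concatenate([q, r])          # eq. (7)
    return q_star

fake_lstm = lambda q_star, c: (np.tanh(q_star[: c.shape[0]] + c), 0.9 * c + 0.1)
m = np.random.default_rng(0).normal(size=(6, 8))  # 6 memory vectors, dim 8
print(process_block(m, T=3, lstm_step=fake_lstm, q_dim=8))
```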
¤ N-way k-shot learning
¤ One-shot: k = 1
¤ Given k labelled examples for each of N unseen classes, classify new examples among the N classes
¤ Chance accuracy is 1/N
¤ Optionally fine-tuning on the N classes
Experiment 1:
¤ Omniglot
¤ 1623 characters, 20 examples each
¤ Pixels nearest neighbor
¤ Baseline CNN nearest neighbor
¤ N
¤ MANN
¤ Siamese network
¤ Fine-tuning
¤ Lake's result (noted by Karpathy): 1-shot 20-way 95.2% [Lake+ 2011]
2
¤ One-shot generalization [Rezende+ 2016]
¤ VAE-based sequential generative model
¤ One-shot generation
One-shot Generalization in Deep Generative Models
Figure 2. Stochastic computational graph showing conditional probabilities and computational steps for sequential generative models. A represents an attentional mechanism that uses function f_w for writing and function f_r for reading. (a) Unconditional generative model. (b) One step of the conditional generative model.
…and our transition is specified as a long short-term memory network (LSTM; Hochreiter & Schmidhuber, 1997). We explicitly represent the creation of a set of hidden variables c_t that is a hidden canvas of the model (equation (6)). The canvas function f_c allows for many different transformations, and it is here where generative (writing) attention is used; we describe a number of choices for this function in section 3.2.3. The generated image (7) is sampled using an observation function f_o(c; θ_o) that maps the last hidden canvas c_T to the parameters of the observation model. The set of all parameters of the generative model is θ = {θ_h, θ_c, θ_o}.
3.2.2. FREE ENERGY OBJECTIVE
Given the probabilistic model (3)-(7) we can obtain an objective…
…smaller in size and can have any number of channels (four in this paper). We consider two ways with which to update the hidden canvas:
Additive Canvas. As the name implies, an additive canvas updates the canvas by simply adding a transformation of the hidden state f_w(h_t; θ_c) to the previous canvas state c_{t−1}. This is a simple, yet effective (see results) update rule:
$$f_c(c_{t-1}, h_t; \theta_c) = c_{t-1} + f_w(h_t; \theta_c) \qquad (9)$$
Gated Recurrent Canvas. The canvas function can be updated using a convolutional gated recurrent unit (CGRU) architecture (Kaiser & Sutskever, 2015), which provides a non-linear and recursive updating mechanism for the canvas and is a simplified version of convolutional LSTMs (further details of the CGRU are given in appendix B).
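A sketch of the additive update (9); the writer f_w is a placeholder linear map here, whereas in the paper it is an attentional writing function with parameters θ_c:

```python
import numpy as np

def additive_canvas_step(c_prev, h_t, f_w):
    """Eq. (9): f_c(c_{t-1}, h_t) = c_{t-1} + f_w(h_t)."""
    return c_prev + f_w(h_t)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))          # placeholder writer parameters theta_c
f_w = lambda h: W @ h                 # stand-in writing function
canvas = np.zeros(16)                 # flattened toy canvas
for t in range(5):                    # T sequential writing steps
    canvas = additive_canvas_step(canvas, rng.normal(size=8), f_w)
print(canvas)
```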
One-shot Generalization in Deep Generative Models
Figure 8. Unconditional samples for 52 × 52 omniglot (task 1). For a video of the generation process, see https://www.youtube.com/watch?v=HQEI2xfTgm4
Figure 9. Generating new exemplars of a given character for the weak generalization test (task 2a). The first row shows the test images and the next 10 are one-shot samples from the model.
Figure 10. Generating new exemplars of a given character for the strong generalization test (task 2b,c), with models trained with different amounts of data. Left: samples from a model trained on the 30-20 train-test split; Middle: 40-10 split; Right: 45-5 split.
¤ One-shot learning
¤ Zero-shot learning
¤ 3
¤ Matching Networks end-to-end
¤ https://www.quora.com/How-is-one-shot-learning-different-from-deep-learning#
¤ https://www.quora.com/What-is-the-difference-between-one-shot-learning-and-transfer-learning
¤ Karpathy: https://github.com/karpathy/paper-notes/blob/master/matching_networks.md

More Related Content

What's hot

PRML学習者から入る深層生成モデル入門
PRML学習者から入る深層生成モデル入門PRML学習者から入る深層生成モデル入門
PRML学習者から入る深層生成モデル入門tmtm otm
 
GAN(と強化学習との関係)
GAN(と強化学習との関係)GAN(と強化学習との関係)
GAN(と強化学習との関係)Masahiro Suzuki
 
[Ridge-i 論文よみかい] Wasserstein auto encoder
[Ridge-i 論文よみかい] Wasserstein auto encoder[Ridge-i 論文よみかい] Wasserstein auto encoder
[Ridge-i 論文よみかい] Wasserstein auto encoderMasanari Kimura
 
Layer Normalization@NIPS+読み会・関西
Layer Normalization@NIPS+読み会・関西Layer Normalization@NIPS+読み会・関西
Layer Normalization@NIPS+読み会・関西Keigo Nishida
 
【DL輪読会】How Much Can CLIP Benefit Vision-and-Language Tasks?
【DL輪読会】How Much Can CLIP Benefit Vision-and-Language Tasks? 【DL輪読会】How Much Can CLIP Benefit Vision-and-Language Tasks?
【DL輪読会】How Much Can CLIP Benefit Vision-and-Language Tasks? Deep Learning JP
 
近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer近年のHierarchical Vision Transformer
近年のHierarchical Vision TransformerYusuke Uchida
 
Generating Diverse High-Fidelity Images with VQ-VAE-2
Generating Diverse High-Fidelity Images with VQ-VAE-2Generating Diverse High-Fidelity Images with VQ-VAE-2
Generating Diverse High-Fidelity Images with VQ-VAE-2harmonylab
 
[DL輪読会]A closer look at few shot classification
[DL輪読会]A closer look at few shot classification[DL輪読会]A closer look at few shot classification
[DL輪読会]A closer look at few shot classificationDeep Learning JP
 
Tensor コアを使った PyTorch の高速化
Tensor コアを使った PyTorch の高速化Tensor コアを使った PyTorch の高速化
Tensor コアを使った PyTorch の高速化Yusuke Fujimoto
 
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...SSII
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情Yuta Kikuchi
 
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...Deep Learning JP
 
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion ModelsDeep Learning JP
 
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...Deep Learning JP
 
[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習Deep Learning JP
 
深層学習の不確実性 - Uncertainty in Deep Neural Networks -
深層学習の不確実性 - Uncertainty in Deep Neural Networks -深層学習の不確実性 - Uncertainty in Deep Neural Networks -
深層学習の不確実性 - Uncertainty in Deep Neural Networks -tmtm otm
 
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...Deep Learning JP
 
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介Deep Learning JP
 
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)Deep Learning JP
 
[DL輪読会]Few-Shot Unsupervised Image-to-Image Translation
[DL輪読会]Few-Shot Unsupervised Image-to-Image Translation[DL輪読会]Few-Shot Unsupervised Image-to-Image Translation
[DL輪読会]Few-Shot Unsupervised Image-to-Image TranslationDeep Learning JP
 

What's hot (20)

PRML学習者から入る深層生成モデル入門
PRML学習者から入る深層生成モデル入門PRML学習者から入る深層生成モデル入門
PRML学習者から入る深層生成モデル入門
 
GAN(と強化学習との関係)
GAN(と強化学習との関係)GAN(と強化学習との関係)
GAN(と強化学習との関係)
 
[Ridge-i 論文よみかい] Wasserstein auto encoder
[Ridge-i 論文よみかい] Wasserstein auto encoder[Ridge-i 論文よみかい] Wasserstein auto encoder
[Ridge-i 論文よみかい] Wasserstein auto encoder
 
Layer Normalization@NIPS+読み会・関西
Layer Normalization@NIPS+読み会・関西Layer Normalization@NIPS+読み会・関西
Layer Normalization@NIPS+読み会・関西
 
【DL輪読会】How Much Can CLIP Benefit Vision-and-Language Tasks?
【DL輪読会】How Much Can CLIP Benefit Vision-and-Language Tasks? 【DL輪読会】How Much Can CLIP Benefit Vision-and-Language Tasks?
【DL輪読会】How Much Can CLIP Benefit Vision-and-Language Tasks?
 
近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer
 
Generating Diverse High-Fidelity Images with VQ-VAE-2
Generating Diverse High-Fidelity Images with VQ-VAE-2Generating Diverse High-Fidelity Images with VQ-VAE-2
Generating Diverse High-Fidelity Images with VQ-VAE-2
 
[DL輪読会]A closer look at few shot classification
[DL輪読会]A closer look at few shot classification[DL輪読会]A closer look at few shot classification
[DL輪読会]A closer look at few shot classification
 
Tensor コアを使った PyTorch の高速化
Tensor コアを使った PyTorch の高速化Tensor コアを使った PyTorch の高速化
Tensor コアを使った PyTorch の高速化
 
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情
 
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
 
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
 
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
[DL輪読会]Set Transformer: A Framework for Attention-based Permutation-Invariant...
 
[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習
 
深層学習の不確実性 - Uncertainty in Deep Neural Networks -
深層学習の不確実性 - Uncertainty in Deep Neural Networks -深層学習の不確実性 - Uncertainty in Deep Neural Networks -
深層学習の不確実性 - Uncertainty in Deep Neural Networks -
 
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
 
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介
[DL輪読会]Convolutional Conditional Neural Processesと Neural Processes Familyの紹介
 
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)
 
[DL輪読会]Few-Shot Unsupervised Image-to-Image Translation
[DL輪読会]Few-Shot Unsupervised Image-to-Image Translation[DL輪読会]Few-Shot Unsupervised Image-to-Image Translation
[DL輪読会]Few-Shot Unsupervised Image-to-Image Translation
 

Similar to Siamese Neural Networks for One-Shot Image Recognition via Metric Learning

Enhancing the Design pattern Framework of Robots Object Selection Mechanism -...
Enhancing the Design pattern Framework of Robots Object Selection Mechanism -...Enhancing the Design pattern Framework of Robots Object Selection Mechanism -...
Enhancing the Design pattern Framework of Robots Object Selection Mechanism -...INFOGAIN PUBLICATION
 
Visualizing and Understanding Convolutional Networks
Visualizing and Understanding Convolutional NetworksVisualizing and Understanding Convolutional Networks
Visualizing and Understanding Convolutional NetworksWilly Marroquin (WillyDevNET)
 
AUTOMATIC TRAINING DATA SYNTHESIS FOR HANDWRITING RECOGNITION USING THE STRUC...
AUTOMATIC TRAINING DATA SYNTHESIS FOR HANDWRITING RECOGNITION USING THE STRUC...AUTOMATIC TRAINING DATA SYNTHESIS FOR HANDWRITING RECOGNITION USING THE STRUC...
AUTOMATIC TRAINING DATA SYNTHESIS FOR HANDWRITING RECOGNITION USING THE STRUC...ijaia
 
IRJET- Image Captioning using Multimodal Embedding
IRJET-  	  Image Captioning using Multimodal EmbeddingIRJET-  	  Image Captioning using Multimodal Embedding
IRJET- Image Captioning using Multimodal EmbeddingIRJET Journal
 
Machine Intelligence.html
Machine Intelligence.htmlMachine Intelligence.html
Machine Intelligence.htmlJohnChan191
 
MULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATION
MULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATIONMULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATION
MULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATIONijaia
 
Learning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLLearning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLlauratoni4
 
Scene Description From Images To Sentences
Scene Description From Images To SentencesScene Description From Images To Sentences
Scene Description From Images To SentencesIRJET Journal
 
Shallow vs. Deep Image Representations: A Comparative Study with Enhancements...
Shallow vs. Deep Image Representations: A Comparative Study with Enhancements...Shallow vs. Deep Image Representations: A Comparative Study with Enhancements...
Shallow vs. Deep Image Representations: A Comparative Study with Enhancements...CSCJournals
 
Citython presentation
Citython presentationCitython presentation
Citython presentationAnkit Tewari
 
Study of Different Multi-instance Learning kNN Algorithms
Study of Different Multi-instance Learning kNN AlgorithmsStudy of Different Multi-instance Learning kNN Algorithms
Study of Different Multi-instance Learning kNN AlgorithmsEditor IJCATR
 
A Novel GA-SVM Model For Vehicles And Pedestrial Classification In Videos
A Novel GA-SVM Model For Vehicles And Pedestrial Classification In VideosA Novel GA-SVM Model For Vehicles And Pedestrial Classification In Videos
A Novel GA-SVM Model For Vehicles And Pedestrial Classification In Videosijtsrd
 
Laplacian-regularized Graph Bandits
Laplacian-regularized Graph BanditsLaplacian-regularized Graph Bandits
Laplacian-regularized Graph Banditslauratoni4
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsDevansh16
 
NIPS読み会2013: One-shot learning by inverting a compositional causal process
NIPS読み会2013: One-shot learning by inverting  a compositional causal processNIPS読み会2013: One-shot learning by inverting  a compositional causal process
NIPS読み会2013: One-shot learning by inverting a compositional causal processnozyh
 
A proposed accelerated image copy-move forgery detection-vcip2014
A proposed accelerated image copy-move forgery detection-vcip2014A proposed accelerated image copy-move forgery detection-vcip2014
A proposed accelerated image copy-move forgery detection-vcip2014SondosFadl
 
Matching networks for one shot learning
Matching networks for one shot learningMatching networks for one shot learning
Matching networks for one shot learningKazuki Fujikawa
 
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...Pourya Jafarzadeh
 
GROUPING OBJECTS BASED ON THEIR APPEARANCE
GROUPING OBJECTS BASED ON THEIR APPEARANCEGROUPING OBJECTS BASED ON THEIR APPEARANCE
GROUPING OBJECTS BASED ON THEIR APPEARANCEijaia
 

Similar to Siamese Neural Networks for One-Shot Image Recognition via Metric Learning (20)

Enhancing the Design pattern Framework of Robots Object Selection Mechanism -...
Enhancing the Design pattern Framework of Robots Object Selection Mechanism -...Enhancing the Design pattern Framework of Robots Object Selection Mechanism -...
Enhancing the Design pattern Framework of Robots Object Selection Mechanism -...
 
FULL PAPER.PDF
FULL PAPER.PDFFULL PAPER.PDF
FULL PAPER.PDF
 
Visualizing and Understanding Convolutional Networks
Visualizing and Understanding Convolutional NetworksVisualizing and Understanding Convolutional Networks
Visualizing and Understanding Convolutional Networks
 
AUTOMATIC TRAINING DATA SYNTHESIS FOR HANDWRITING RECOGNITION USING THE STRUC...
AUTOMATIC TRAINING DATA SYNTHESIS FOR HANDWRITING RECOGNITION USING THE STRUC...AUTOMATIC TRAINING DATA SYNTHESIS FOR HANDWRITING RECOGNITION USING THE STRUC...
AUTOMATIC TRAINING DATA SYNTHESIS FOR HANDWRITING RECOGNITION USING THE STRUC...
 
IRJET- Image Captioning using Multimodal Embedding
IRJET-  	  Image Captioning using Multimodal EmbeddingIRJET-  	  Image Captioning using Multimodal Embedding
IRJET- Image Captioning using Multimodal Embedding
 
Machine Intelligence.html
Machine Intelligence.htmlMachine Intelligence.html
Machine Intelligence.html
 
MULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATION
MULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATIONMULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATION
MULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATION
 
Learning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLLearning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RL
 
Scene Description From Images To Sentences
Scene Description From Images To SentencesScene Description From Images To Sentences
Scene Description From Images To Sentences
 
Shallow vs. Deep Image Representations: A Comparative Study with Enhancements...
Shallow vs. Deep Image Representations: A Comparative Study with Enhancements...Shallow vs. Deep Image Representations: A Comparative Study with Enhancements...
Shallow vs. Deep Image Representations: A Comparative Study with Enhancements...
 
Citython presentation
Citython presentationCitython presentation
Citython presentation
 
Study of Different Multi-instance Learning kNN Algorithms
Study of Different Multi-instance Learning kNN AlgorithmsStudy of Different Multi-instance Learning kNN Algorithms
Study of Different Multi-instance Learning kNN Algorithms
 
A Novel GA-SVM Model For Vehicles And Pedestrial Classification In Videos
A Novel GA-SVM Model For Vehicles And Pedestrial Classification In VideosA Novel GA-SVM Model For Vehicles And Pedestrial Classification In Videos
A Novel GA-SVM Model For Vehicles And Pedestrial Classification In Videos
 
Laplacian-regularized Graph Bandits
Laplacian-regularized Graph BanditsLaplacian-regularized Graph Bandits
Laplacian-regularized Graph Bandits
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
 
NIPS読み会2013: One-shot learning by inverting a compositional causal process
NIPS読み会2013: One-shot learning by inverting  a compositional causal processNIPS読み会2013: One-shot learning by inverting  a compositional causal process
NIPS読み会2013: One-shot learning by inverting a compositional causal process
 
A proposed accelerated image copy-move forgery detection-vcip2014
A proposed accelerated image copy-move forgery detection-vcip2014A proposed accelerated image copy-move forgery detection-vcip2014
A proposed accelerated image copy-move forgery detection-vcip2014
 
Matching networks for one shot learning
Matching networks for one shot learningMatching networks for one shot learning
Matching networks for one shot learning
 
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
 
GROUPING OBJECTS BASED ON THEIR APPEARANCE
GROUPING OBJECTS BASED ON THEIR APPEARANCEGROUPING OBJECTS BASED ON THEIR APPEARANCE
GROUPING OBJECTS BASED ON THEIR APPEARANCE
 

More from Masahiro Suzuki

深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)Masahiro Suzuki
 
確率的推論と行動選択
確率的推論と行動選択確率的推論と行動選択
確率的推論と行動選択Masahiro Suzuki
 
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについてMasahiro Suzuki
 
深層生成モデルと世界モデル
深層生成モデルと世界モデル深層生成モデルと世界モデル
深層生成モデルと世界モデルMasahiro Suzuki
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究についてMasahiro Suzuki
 
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習Masahiro Suzuki
 
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural NetworksMasahiro Suzuki
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural NetworkMasahiro Suzuki
 
深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習Masahiro Suzuki
 
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...Masahiro Suzuki
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi DivergenceMasahiro Suzuki
 
(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman FiltersMasahiro Suzuki
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural NetworksMasahiro Suzuki
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel LearningMasahiro Suzuki
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...Masahiro Suzuki
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task LearningMasahiro Suzuki
 
(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target PropagationMasahiro Suzuki
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization TrickMasahiro Suzuki
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?Masahiro Suzuki
 

More from Masahiro Suzuki (19)

深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)
 
確率的推論と行動選択
確率的推論と行動選択確率的推論と行動選択
確率的推論と行動選択
 
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
 
深層生成モデルと世界モデル
深層生成モデルと世界モデル深層生成モデルと世界モデル
深層生成モデルと世界モデル
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究について
 
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
 
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network
 
深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習
 
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence
 
(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
 
(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?
 

  • 1. Matching Networks for One Shot Learning
  • 2. ¤ DeepMind ¤ Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra ¤ arXiv, 2016/06/13 ¤ Proposes Matching Nets, reporting state-of-the-art one-shot learning results ¤ This talk: background on one-shot learning, then the model and experiments
  • 4. One-shot learning ¤ Learning a new concept from a single example ¤ Hard for deep learning, which typically needs large amounts of labelled data, and thus a notable weakness of current AI ¤ Studied by Li Fei-Fei, and by Brenden Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum ¤ [Lake+ 2011] One shot learning of simple visual concepts (the slide shows the paper's first page, including the Segway / Bengali-character one-shot quiz of its Fig. 1)
  • 5. Zero-shot learning ¤ Recognizing classes for which not even one training example is available ¤ Relies on side information such as attributes or text descriptions ¤ Closely related to one-shot learning ¤ [Socher+ 2013] Zero-Shot Learning Through Cross-Modal Transfer
  • 6. Transfer learning ¤ Transferring knowledge from more familiar tasks is one route to one-shot learning ¤ [Pan+ 2010] A Survey on Transfer Learning
  • 7. Categories of transfer learning ¤ [Pan+2010] [ +2010]
  • 9. Prior approaches to one-shot learning ¤ Bayesian one-shot learning by Fei-Fei et al. [Fei-Fei+ 2006] ¤ Zero-shot learning [Larochelle+ 2008] ¤ Hierarchical Bayesian Program Learning (HBPL) [Lake+ 2011; 2012; 2013; 2015]: a generative model that composes handwritten characters from strokes and parts, so a new character can be classified from a single example (the slide shows Fig. 3 of the BPL paper, its generative process for character types and tokens, and human vs. machine parses)
  • 11. One-shot learning via verification ¤ Reduce one-shot classification to a pairwise same-or-different decision ¤ Train a verification model to predict the probability that two images belong to the same class ¤ At test time, pair the test image with exactly one example per novel class; the pairing with the highest verification score wins the one-shot task (Siamese Neural Networks for One-shot Image Recognition, Fig. 2)
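To make the pairwise strategy concrete, here is a minimal sketch in Python; `verification_score` is a hypothetical stand-in for the trained same-or-different network, not an API from the paper:

```python
def one_shot_classify(test_image, support, verification_score):
    """Pairwise one-shot classification with a trained verification model.

    support: exactly one (image, label) exemplar per novel class.
    verification_score: hypothetical callable returning P(same class)
    for a pair of images.
    """
    scores = {label: verification_score(test_image, exemplar)
              for exemplar, label in support}
    # The pairing with the highest verification score wins the one-shot task.
    return max(scores, key=scores.get)
```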
  • 12. Siamese Network ¤ Siamese networks applied to one-shot learning [Koch+ 2015] ¤ The Siamese architecture dates back to [Bromley+ 1993] ¤ Twin convolutional networks with shared weights embed the two inputs; a final layer scores their similarity with learned per-component distance weights αⱼ ¤ Trained as a binary same/different classifier with a regularized cross-entropy loss, L(x₁, x₂) = y·log p + (1−y)·log(1−p) + λᵀ|w|², with layer-wise learning-rate and momentum schedules and early stopping on one-shot validation error
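A small sketch of the verification loss above, assuming a single scalar regularization weight `lam` in place of the paper's learned per-layer λ vector:

```python
import numpy as np

def siamese_verification_loss(p, y, weights, lam=1e-4):
    """Regularized cross-entropy for one verification pair.

    p: predicted probability that the two inputs share a class.
    y: 1 if the pair really is same-class, else 0.
    weights: list of weight arrays, penalized with an L2 term
    (a scalar `lam` simplifies the paper's lambda^T |w|^2 -- an assumption).
    """
    log_likelihood = y * np.log(p) + (1.0 - y) * np.log(1.0 - p)
    l2 = lam * sum(float(np.sum(w ** 2)) for w in weights)
    # Training minimizes the negative log-likelihood plus the penalty.
    return -log_likelihood + l2
```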
  • 13. Learning to learn with external memory ¤ Can one-shot learning be cast as a memory problem? ¤ Neural Turing Machine (NTM) [Graves+ 2014]: on each update cycle a controller network receives inputs, emits outputs, and reads from / writes to a memory matrix via a set of parallel read and write heads (Fig. 1 of the NTM paper)
  • 14. Memory Augmented Neural Network ¤ One-shot learning with an NTM-style memory, evaluated over "episodes" ¤ Task setup: the model sees a stream of character images with labels; this whole sequence is called an episode ¤ At the start of an episode the label can only be guessed at random; accuracy rises as the episode proceeds ¤ The faster accuracy rises, the better the one-shot learning ¤ Goal: learn the task of "becoming able to recognize a character after seeing only a few examples" (the episode continues for 50 steps, relying on the memory) http://www.slideshare.net/YusukeWatanabe3/metalearning-with-memory-augmented-neural-network
  • 15. Matching Networks for One Shot Learning
  • 16. Benchmarking one-shot learning ¤ One-shot learning is evaluated as N-way classification ¤ The support set contains N classes with k examples each, typically k = 1 or 5
  • 17. Training strategy ¤ Episode-based training that mirrors the one-shot test setting: 1. sample a label set L of N classes (k = 1 to 5 examples each) from a task distribution T; 2. from L, sample a support set S and a batch B ¤ One-shot learning: given a support set S = {(xᵢ, yᵢ)}ᵢ₌₁ᵏ, predict the label ŷ of a test input x̂, i.e. model P(ŷ|x̂, S) ¤ The model maps S to a classifier, so a novel support set S′ can be used at test time ¤ Objective: θ = argmax_θ E_{L∼T}[ E_{S∼L, B∼L}[ Σ_{(x,y)∈B} log P_θ(y|x, S) ] ] (eq. 2) ¤ Trained this way, the model needs no fine-tuning on classes it has never seen
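A minimal sketch of this episode construction; the `data_by_class` layout and the helper name are assumptions for illustration, not code from the paper:

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, batch_per_class=2):
    """One training episode: a label set L, a support set S, and a batch B.

    data_by_class: dict mapping class label -> list of examples
    (a hypothetical layout). Mirrors L ~ T, then S, B ~ L.
    """
    label_set = random.sample(list(data_by_class), n_way)        # L ~ T
    support, batch = [], []
    for y in label_set:
        xs = random.sample(data_by_class[y], k_shot + batch_per_class)
        support += [(x, y) for x in xs[:k_shot]]                 # S ~ L
        batch += [(x, y) for x in xs[k_shot:]]                   # B ~ L
    return support, batch
```

The loss on batch B is then conditioned on S, so each gradient step literally practices the one-shot test protocol.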
  • 18. Matching Networks ¤ P(ŷ|x̂, S) is modeled by the Matching Networks architecture ¤ Trained end-to-end for one-shot learning, switching the task from minibatch to minibatch just as at test time (Fig. 1: the support set S and the test input x̂ are embedded and matched to produce ŷ)
  • 19. Matching Networks ¤ The prediction is an attention-weighted combination over the support set: ŷ = Σᵢ₌₁ᵏ a(x̂, xᵢ) yᵢ (eq. 1) ¤ With a kernel attention this is akin to a kernel density estimator; with 0/1 attention on the b furthest points it reduces to k−b nearest neighbours ¤ a also plays the role of the alignment model in neural machine translation [Bahdanau+ 2015] ¤ Another view: the yᵢ act as memories bound to the xᵢ, a non-parametric associative memory that grows with the support set, so the classifier c_S(x̂) adapts easily to any new support set
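A sketch of eq. (1) combined with the cosine-softmax attention kernel introduced on the next slide; the embeddings f(x̂) and g(xᵢ) are assumed to be precomputed vectors:

```python
import numpy as np

def matching_net_predict(f_xhat, g_support, support_labels, n_classes):
    """Eq. (1): y_hat = sum_i a(x_hat, x_i) y_i.

    f_xhat: embedding f(x_hat) of the test example, shape (dim,).
    g_support: embeddings g(x_i) of the support set, shape (k, dim).
    support_labels: integer labels, turned into one-hot y_i below.
    """
    # Cosine similarity between the query and each support embedding.
    sims = g_support @ f_xhat / (
        np.linalg.norm(g_support, axis=1) * np.linalg.norm(f_xhat) + 1e-8)
    # Softmax attention weights a(x_hat, x_i).
    a = np.exp(sims - sims.max())
    a /= a.sum()
    # Convex combination of one-hot labels y_i gives class probabilities.
    return a @ np.eye(n_classes)[support_labels]
```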
  • 20. Embedding functions ¤ The attention is a softmax over a cosine similarity c: a(x̂, xᵢ) = e^{c(f(x̂), g(xᵢ))} / Σⱼ₌₁ᵏ e^{c(f(x̂), g(xⱼ))} ¤ Full Context Embeddings condition f and g on the whole support set S ¤ f(x̂, S) = attLSTM(f′(x̂), g(S), K): an LSTM with read attention over g(S), run for K "processing" steps ¤ g(xᵢ, S) = h⃗ᵢ + h⃖ᵢ + g′(xᵢ): a bidirectional LSTM over the support set, with a skip connection between input and outputs ¤ f′ and g′ are deep convolutional networks (VGG or Inception) for vision tasks, or simple word embeddings for language tasks
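A sketch of the fully conditional embedding g as a bidirectional LSTM with a skip connection, assuming a PyTorch-style base encoder `g_prime`; module names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class FullyConditionalG(nn.Module):
    """g(x_i, S) = h_fwd_i + h_bwd_i + g'(x_i): a bidirectional LSTM run
    over the whole support set, plus a skip connection from the encoder."""

    def __init__(self, g_prime, dim):
        super().__init__()
        self.g_prime = g_prime                        # base encoder g' (e.g. a CNN)
        self.lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)

    def forward(self, support_images):
        feats = self.g_prime(support_images)          # (|S|, dim) embeddings g'(x_i)
        out, _ = self.lstm(feats.unsqueeze(0))        # (1, |S|, 2*dim)
        h_fwd, h_bwd = out[0].chunk(2, dim=-1)        # split the two directions
        return h_fwd + h_bwd + feats                  # skip connection
```

Because the LSTM runs over the whole set, each support embedding can depend on every other support example, which is the point of the "full context" in FCE.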
  • 21. Set-to-set framework ¤ Matching Networks cast one-shot learning in the set-to-set framework, an extension of seq2seq to sets ¤ Order Matters: Sequence to sequence for sets [Vinyals+ 2015] ¤ The order in which inputs are shown affects seq2seq learning, so sets need a memory that is order invariant and grows with the set, accessed by content-based attention ¤ Read-Process-and-Write block (eqs. 3–7): qₜ = LSTM(q*ₜ₋₁); e_{i,t} = f(mᵢ, qₜ); a_{i,t} = exp(e_{i,t}) / Σⱼ exp(e_{j,t}); rₜ = Σᵢ a_{i,t} mᵢ; q*ₜ = [qₜ, rₜ], where the LSTM takes no external inputs and the readout is unchanged if the memories are shuffled
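One possible reading of the Read-Process-and-Write equations as code; sizes and initialization are assumptions, memories are taken to have the same dimension as q, and the dot product stands in for the scoring function f(mᵢ, qₜ):

```python
import torch
import torch.nn as nn

def process_block(memories, dim, steps):
    """Evolve an order-invariant summary of a memory set (eqs. 3-7):
    an LSTM with no external inputs updates a query q_t, which attends
    over the memories; the readout r_t is appended to form q*_t.

    memories: tensor of shape (n, dim), one row per memory vector m_i.
    """
    cell = nn.LSTMCell(2 * dim, dim)                 # q* = [q; r] has size 2*dim
    q_star = torch.zeros(1, 2 * dim)                 # q*_0
    state = (torch.zeros(1, dim), torch.zeros(1, dim))
    for _ in range(steps):
        h, c = cell(q_star, state)                   # q_t = LSTM(q*_{t-1})
        state = (h, c)
        e = memories @ h[0]                          # e_{i,t} = <m_i, q_t>
        a = torch.softmax(e, dim=0)                  # shuffling memories leaves r_t unchanged
        r = a @ memories                             # r_t = sum_i a_{i,t} m_i
        q_star = torch.cat([h[0], r]).unsqueeze(0)   # q*_t = [q_t, r_t]
    return q_star
```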
  • 22. Evaluation protocol ¤ N-way k-shot learning: classify a test example given k labelled examples from each of N unseen classes ¤ Larger N makes the task harder; chance level is 1/N ¤ Baselines are compared with and without fine-tuning on the support set
  • 23. Experiment 1: Omniglot ¤ 1623 characters, 20 examples each ¤ Baselines: nearest neighbour on raw pixels, nearest neighbour on baseline-CNN features ¤ Also compared: MANN, Siamese network
  • 24. Results ¤ Matching Nets improve over the nearest-neighbour, MANN, and Siamese baselines ¤ Fine-tuning on the support set gives a further small gain ¤ For reference, Lake's model (as reported by Karpathy) reaches 95.2% on 1-shot 20-way [Lake+ 2011]
  • 25. Experiment 2: ImageNet and small-scale language modeling
  • 26. Related: one-shot generation ¤ One-shot Generalization in Deep Generative Models [Rezende+ 2016] ¤ A sequential, attention-based extension of the VAE ¤ Performs one-shot generation: producing new exemplars of a character from a single example (the slide shows the paper's generative/inference computation graphs and its Omniglot samples under weak and strong generalization splits)
  • 27. Summary ¤ Reviewed one-shot learning and zero-shot learning ¤ Three contributions: the model, the episode-based training objective matched to the test setting, and tasks for benchmarking one-shot learning ¤ Matching Networks are trained end-to-end and achieve strong one-shot results