4. One-shot learning
¤ One-shot learning
¤ 1
¤
¤ deep learning
¤ Deep learning
¤ AI
¤ One-shot learning
¤ Li Fei-Fei Brenden Lake Ruslan Salakhutdinov
Joshua B. Tenenbaum
One shot learning of simple visual concepts
Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Abstract
People can learn visual concepts from just one encounter, but it remains a mystery how this is accomplished. Many authors have proposed that transferred knowledge from more familiar concepts is a route to one shot learning, but what is the form of this abstract knowledge? One hypothesis is that the sharing of parts is core to one shot learning, but there have been few attempts to test this hypothesis on a large scale. This paper works in the domain of handwritten characters, which contain a rich component structure of strokes. We introduce a generative model of how characters are composed from strokes, and how knowledge from previous characters helps to infer the latent strokes in novel characters. After comparing several models and humans on one shot character learning, we find that our stroke model outperforms a state-of-the-art character model by a large margin, and it provides a closer fit to human perceptual data.
Keywords: category learning; transfer learning;
Bayesian modeling; neural networks
A hallmark of human cognition is learning from just a few examples. For instance, a person only needs to see one Segway to acquire the concept and be able to discriminate future Segways from other vehicles like scooters and unicycles (Fig. 1 left). Similarly, children can acquire a new word from one encounter (Carey & Bartlett, 1978). How is one shot learning possible?
New concepts are almost never learned in a vacuum. Experience with other, more familiar concepts in a do-
Where are the others?
Figure 1: Test yourself on one shot learning. From
the example boxed in red, can you find the others in
the grid below? On the left is a Segway and on
the right is the first character of the Bengali alphabet.
Answer for the Bengali character: Row 2, Column 2; Row 3, Column 4.
[Lake+ 2011] One shot learning of simple visual concepts
5. Zero-shot learning
¤ Zero-shot learning 1
¤ side information
¤ One-shot learning
[Socher+ 2013] Zero-Shot Learning Through Cross-Modal Transfer
9. One-shot learning
¤ Fei-Fei [Fei-Fei+ 2006]
¤
¤ Zero-shot learning [Larochelle+ 2008]
¤ Hierarchical Bayesian Program Learning (HBPL) [Lake+ 2011; 2012; 2013; 2015]
¤ 1
¤
for each subpart. Last, parts are roughly positioned to begin either independently, at the beginning, at the end, or along previous parts, as defined by relation R_i (Fig. 3A, iv).
Character tokens θ^(m) are produced by executing the parts and the relations and modeling how ink flows from the pen to the page. First, motor noise is added to the control points and the scale of the subparts to create token-level stroke trajectories S^(m). Second, the trajectory’s precise start location L^(m) is sampled from the schematic provided by its relation R_i to previous strokes. Third, global transformations are sampled, including an affine warp A^(m) and adaptive noise parameters that ease probabilistic inference (30). Last, a binary image I^(m) is created by a stochastic rendering function, lining the stroke trajectories with grayscale ink and interpreting the pixel values as independent Bernoulli probabilities.
Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image I^(m). Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses. The most promising candidates are refined by using continuous optimization
and local search, forming a
tion to the posterior distrib
(section S3). Figure 4A show
ered programs for a train
how they are refit to differen
compute a classification scor
log posterior predictive proba
scores indicate that they ar
long to the same class. A hi
when at least one set of par
successfully explain both th
test images, without violating
of the learned within-class
Figure 4B compares the m
parses with the ground-trut
several characters.
Results
People, BPL, and alternative
pared side by side on five co
that examine different form
from just one or a few exam
Fig. 5). All behavioral exp
through Amazon’s Mechanic
perimental procedures are d
Human or Machine?
the pen (Fig. 3A, ii). To construct a new character type, first the model samples the number of parts k and the number of subparts n_i, for each part i = 1, ..., k, from their empirical distributions as measured from the background set. Second, a template for a part S_i is constructed by sampling subparts from a set of discrete primitive actions learned from the background set (Fig. 3A, i), such that the probability of the next action depends on the previous. Third, parts are then grounded as parameterized curves (splines) by sampling the control points and scale parameters
Fig. 3. A generative model of handwritten characters. (A) New types are generated by choosing primitive actions (color coded) from a library (i), combining these subparts (ii) to make parts (iii), and combining parts with relations to define simple programs (iv). New tokens are generated by running these programs (v), which are then rendered as raw data (vi). (B) Pseudocode for generating new types ψ and new token images I^(m) for m = 1, ..., M. The function f(·, ·) transforms a subpart sequence and start location into a trajectory.
[Figure panels: Human drawings; Human parses; Machine parses. Training item with model’s five best parses (scores -505, -593, -655, -695, -723).]
11. one-shot learning
¤
¤
¤ 2
¤ One-shot learning
¤ One-shot same or different
¤ One-shot
Siamese Neural Networks for One-shot Image Recognition
should generalize to one-shot classification. The verification model learns to identify input pairs according to the probability that they belong to the same class or different classes. This model can then be used to evaluate new images, exactly one per novel class, in a pairwise manner against the test image. The pairing with the highest score according to the verification network is then awarded the highest probability for the one-shot task. If the features learned by the verification model are sufficient to confirm or deny the identity of characters from one set of alpha-
Figure 2. Our general strategy. 1) Train a model to discriminate
bets, then they
provided that
alphabets to en
tures.
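The pairwise procedure described in this excerpt (score the test image against exactly one example per novel class, then pick the highest-scoring pairing) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `one_shot_classify` and `toy_verify` are hypothetical names, and `toy_verify` is a stand-in for a trained verification network.

```python
import math

def one_shot_classify(test_item, support, verify):
    """Score test_item against one example per class; return (best index, all scores)."""
    scores = [verify(test_item, example) for example in support]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def toy_verify(a, b):
    """Hypothetical stand-in for a verification network.

    Returns a similarity in (0, 1] that decays with the L1 distance
    between two feature vectors.
    """
    dist = sum(abs(x - y) for x, y in zip(a, b))
    return math.exp(-dist)

if __name__ == "__main__":
    # One support example per novel class (toy 2-D "features").
    support = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    idx, scores = one_shot_classify([0.9, 1.2], support, toy_verify)
    print("predicted class index:", idx)
```

Any function that returns a same-class score for a pair can be passed as `verify`; the classification step itself needs no retraining on the novel classes.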
2. Related W
12. Siamese Network
¤ Siamese Network one-shot learning [Koch+ 2015]
¤ Siamese Network [Bromley+ 1993]
¤
¤
Siamese Neural Networks for One-shot Image Recognition
Figure 3. A simple 2 hidden layer siamese network for binary classification with logistic prediction p. The structure of the network is replicated across the top and bottom sections to form twin networks, with shared weight matrices at each layer.
sets where very few examples exist for some classes, providing a flexible and continuous means of incorporating inter-class information into the model.
by the energy loss, whereas we fix the metric as spec
above, following the approach in Facebook’s DeepFac
per (Taigman et al., 2014).
Our best-performing models use multiple convolut
layers before the fully-connected layers and top-
energy function. Convolutional neural networks
achieved exceptional results in many large-scale com
vision applications, particularly in image recognition
(Bengio, 2009; Krizhevsky et al., 2012; Simonyan &
serman, 2014; Srivastava, 2013).
Several factors make convolutional networks especiall
pealing. Local connectivity can greatly reduce the n
ber of parameters in the model, which inherently prov
some form of built-in regularization, although conv
tional layers are computationally more expensive than
dard nonlinearities. Also, the convolution operation us
these networks has a direct filtering interpretation, w
each feature map is convolved against input featur
identify patterns as groupings of pixels. Thus, the
puts of each convolutional layer correspond to impo
spatial features in the original input space and offer s
robustness to simple transforms. Finally, very fast CU
libraries are now available in order to build large conv
tional networks without an unacceptable amount of t
activation function. This final layer induces a metric on the learned feature space of the (L − 1)th hidden layer and scores the similarity between the two feature vectors. The α_j are additional parameters that are learned by the model during training, weighting the importance of the component-wise distance. This defines a final Lth fully-connected layer for the network which joins the two siamese twins.
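A minimal sketch of such a joining layer is given below. It assumes the component-wise distance is the absolute difference of the twin feature vectors and that the activation is a logistic sigmoid (consistent with the "logistic prediction p" in the Figure 3 caption, but not spelled out in the visible excerpt). The function name `siamese_score` and the `bias` parameter are hypothetical.

```python
import math

def siamese_score(h1, h2, alpha, bias=0.0):
    """Similarity score from two twin feature vectors.

    h1, h2 : feature vectors from the (L-1)th layer of each twin
    alpha  : learned weights on each component-wise distance
    bias   : hypothetical offset term (assumption, not from the excerpt)
    """
    # Weighted component-wise (L1) distance between the two feature vectors.
    z = bias + sum(a * abs(u - v) for a, u, v in zip(alpha, h1, h2))
    # Logistic squashing to a probability-like score in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    # Negative weights make larger distances push the score toward 0.
    print(siamese_score([1.0, 2.0], [1.0, 2.5], alpha=[-1.0, -1.0], bias=2.0))
```

Because the weights are tied across the twins, swapping `h1` and `h2` leaves the score unchanged.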
We depict one example above (Figure 4), which shows the largest version of our model that we considered. This network also gave the best result for any network on the verification task.
3.2. Learning
Loss function. Let M represent the minibatch size, where i indexes the ith minibatch. Now let y(x_1^(i), x_2^(i)) be a length-M vector which contains the labels for the minibatch, where we assume y(x_1^(i), x_2^(i)) = 1 whenever x_1 and x_2 are from the same character class and y(x_1^(i), x_2^(i)) = 0 otherwise. We impose a regularized cross-entropy objective on our binary classifier of the following form:
$$\mathcal{L}(x_1^{(i)}, x_2^{(i)}) = \mathbf{y}(x_1^{(i)}, x_2^{(i)}) \log \mathbf{p}(x_1^{(i)}, x_2^{(i)}) + \left(1 - \mathbf{y}(x_1^{(i)}, x_2^{(i)})\right) \log \left(1 - \mathbf{p}(x_1^{(i)}, x_2^{(i)})\right) + \boldsymbol{\lambda}^{T} |\mathbf{w}|^{2}$$
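As a rough illustration of this objective, the sketch below computes a regularized binary cross-entropy over one minibatch of pair labels and predictions. Two assumptions are made explicit: the log-likelihood terms are negated so that lower values are better (the excerpt writes them without a leading minus sign), and the penalty uses one coefficient per weight. The name `regularized_cross_entropy` and the `eps` clamp are hypothetical.

```python
import math

def regularized_cross_entropy(y, p, weights, lam, eps=1e-12):
    """Negated binary log-likelihood over a minibatch plus an L2 penalty.

    y       : list of 0/1 labels (1 = same class) for each pair
    p       : list of predicted same-class probabilities
    weights : flat list of model weights
    lam     : per-weight regularization coefficients (assumed layout)
    eps     : clamp to keep log() finite (assumption for numerical safety)
    """
    log_lik = sum(
        yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
        for yi, pi in zip(y, p)
    )
    penalty = sum(l * w * w for l, w in zip(lam, weights))
    return -log_lik + penalty

if __name__ == "__main__":
    loss = regularized_cross_entropy(
        y=[1, 0], p=[0.9, 0.1], weights=[1.0, 2.0], lam=[0.1, 0.1]
    )
    print("loss:", loss)
```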
Optimization. This objective is combined with standard
backpropagation algorithm, where the gradient is additive
across the twin networks due to the tied weights. We fix
Weight initialization. We in
in the convolutional layers fro
zero-mean and a standard dev
also initialized from a norma
0.5 and standard deviation 1
layers, the biases were initia
convolutional layers, but the
much wider normal distributi
dard deviation 2 × 10^{-1}.
Learning schedule. Althoug
learning rate for each layer,
uniformly across the network
that η_j^(T) = 0.99 η_j^(T−1). We
learning rate, the network w
minima more easily without g
face. We fixed momentum t
increasing linearly each epoc
the individual momentum term
We trained each network for a
monitored one-shot validatio
shot learning tasks generated
and drawers in the validation s
did not decrease for 20 epoc
parameters of the model at th
one-shot validation error. If t
to decrease for the entire lear
final state of the model genera
Hyperparameter optimizat
13. ¤ …
¤ 1 one-shot
Neural Turing Machine
¤ Neural Turing Machine (NTM) [Graves+ 2014]
¤
Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller
network receives inputs from an external environment and emits outputs in response. It also
reads to and writes from a memory matrix via a set of parallel read and write heads. The dashed
17. ¤ one-shot learning
1. N
k 1 5 L
2. L S B
¤ One-shot learning (",$ %&) ∈ ) * = {"-, %-}-/0
1
S x̂ → ŷ
¤ P(ŷ | x̂, S)
¤ S → S′
¤ P(ŷ | x̂, S)
support set S, and adds “depth” to the computation of attention (see appendix for more details).
2.2 Training Strategy
In the previous subsection we described Matching Networks which map a support set to a classification function, S → c(x̂). We achieve this via a modification of the set-to-set paradigm augmented with attention, with the resulting mapping being of the form P_θ(·|x̂, S), noting that θ are the parameters of the model (i.e. of the embedding functions f and g described previously).
The training procedure has to be chosen carefully so as to match inference at test time. Our model has to perform well with support sets S′ which contain classes never seen during training.
More specifically, let us define a task T as distribution over possible label sets L. Typically we consider T to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few examples per class (e.g., up to 5). In this case, a label set L sampled from a task T, L ∼ T, will typically have 5 to 25 examples.
To form an “episode” to compute gradients and update our model, we first sample L from T (e.g., L could be the label set {cats, dogs}). We then use L to sample the support set S and a batch B (i.e., both S and B are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error predicting the labels in the batch B conditioned on the support set S. This is a form of meta-learning since the training procedure explicitly learns to learn from a given support set to minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:
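$$\theta = \arg\max_{\theta} \; \mathbb{E}_{L \sim T} \left[ \mathbb{E}_{S \sim L,\, B \sim L} \left[ \sum_{(x,y) \in B} \log P_{\theta}(y \mid x, S) \right] \right] \qquad (2)$$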
Training θ with eq. 2 yields a model which works well when sampling S′ ∼ T′ from a different distribution of novel labels. Crucially, our model does not need any fine tuning on the classes it has
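The episode construction described in this excerpt (sample a label set L, then sample a support set S and a batch B from those labels) can be sketched as below. This is an illustrative sketch only: `sample_episode` and its parameters are hypothetical names, and keeping S and B disjoint is an assumption that the excerpt does not state.

```python
import random

def sample_episode(data, n_classes, n_support, n_batch, rng):
    """Sample one training episode.

    data      : dict mapping label -> list of examples
    n_classes : number of labels in the sampled label set L
    n_support : support examples per class (goes into S)
    n_batch   : batch examples per class (goes into B)
    rng       : random.Random instance, passed in for reproducibility
    """
    # L ~ T: pick a label set uniformly at random.
    labels = rng.sample(sorted(data), n_classes)
    support, batch = [], []
    for label in labels:
        # Draw without replacement so S and B do not overlap (assumption).
        items = rng.sample(data[label], n_support + n_batch)
        support += [(x, label) for x in items[:n_support]]
        batch += [(x, label) for x in items[n_support:]]
    return labels, support, batch

if __name__ == "__main__":
    toy = {c: [f"{c}_{i}" for i in range(10)] for c in ["cat", "dog", "fox", "owl"]}
    labels, S, B = sample_episode(toy, n_classes=2, n_support=1, n_batch=3, rng=random.Random(0))
    print(labels, S, B)
```

A training loop would draw a fresh episode per update and accumulate log P_θ(y | x, S) over the pairs in B, so that the conditions seen during training resemble those at test time.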
18. Matching Networks
¤ P(ŷ | x̂, S) Matching networks
¤ one-shot learning end-to-end
Figure 1: Matching Networks architecture
train it by showing only a few examples per class, switching the task from minibatch to minibatch,
much like how it will be tested when presented with a few examples of a new task.
Besides our contributions in defining a model and training criterion amenable for one-shot learning,
we contribute by the definition of tasks that can be used to benchmark other approaches on both
[Figure labels: S, x̂, ŷ]
19. Matching Networks
¤ Matching network
¤ a
¤ nearest-neighbor
¤ neural machine
translation alignment model
¤ [Bahdanau+ 2016]
¤ a y memories bound
new support set of examples S′ from which to one-shot learn, we simply use the parametric neural network defined by P to make predictions about the appropriate label ŷ for each test example x̂: P(ŷ|x̂, S′). In general, our predicted output class for a given input unseen example x̂ and a support set S becomes arg max_y P(y|x̂, S).
Our model in its simplest form computes ŷ as follows:
$$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i \qquad (1)$$
where x_i, y_i are the samples and labels from the support set S = {(x_i, y_i)}_{i=1}^{k}, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on X × X, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest x_i from x̂ according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to ‘k − b’-nearest neighbours (although this requires an extension to the attention mechanism that we describe in Section 2.1.2). Thus (1) subsumes both KDE and kNN methods.
Another view of (1) is where a acts as an attention mechanism and the y_i act as memories bound to the corresponding x_i. In this case we can understand this as a particular kind of associative memory where, given an input, we “point” to the corresponding example in the support set, retrieving its label. However, unlike other attentional memory mechanisms [2], (1) is non-parametric in nature: as the support set size grows, so does the memory used. Hence the functional form defined by the classifier c_S(x̂) is very flexible and can adapt easily to any new support set.
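Eq. 1 can be illustrated with a few lines of code. The sketch below uses a softmax over cosine similarities as the attention mechanism a (the simplest choice the paper describes in its section on the attention kernel) and treats the embedding functions f and g as the identity, which is an assumption made purely to keep the example self-contained. The names `cosine`, `attention` and `predict` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (small constant avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def attention(x_hat, support_xs):
    """Softmax over cosine similarities between x_hat and each support example."""
    sims = [cosine(x_hat, x) for x in support_xs]
    m = max(sims)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x_hat, support, n_classes):
    """Eq. 1: attention-weighted sum of one-hot support labels."""
    weights = attention(x_hat, [x for x, _ in support])
    probs = [0.0] * n_classes
    for w, (_, y) in zip(weights, support):
        probs[y] += w                   # adding w to class y equals w * one_hot(y)
    return probs

if __name__ == "__main__":
    S = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([0.9, 0.1], 0)]
    print(predict([1.0, 0.05], S, n_classes=2))
```

Adding or removing support pairs changes the classifier without touching any parameters, which is the non-parametric behaviour the excerpt highlights.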
20. ¤ a c softmax
¤ g bidirectional RNN
¤ f LSTM
¤ VGG Inception
Figure 1: Matching Networks architecture
train it by showing only a few examples per class, switching the task from minibatch to minibatch,
much like how it will be tested when presented with a few examples of a new task.
Besides our contributions in defining a model and training criterion amenable for one-shot learning,
we contribute by the definition of tasks that can be used to benchmark other approaches on both
ImageNet and small scale language modeling. We hope that our results will encourage others to work
on this challenging problem.
We organized the paper by first defining and explaining our model whilst linking its several components to related work. Then in the following section we briefly elaborate on some of the related work to the task and our model. In Section 4 we describe both our general setup and the experiments we performed, demonstrating strong results on one-shot learning on a variety of tasks and setups.
2 Model
Our non-parametric approach to solving one-shot learning is based on two components which we describe in the following subsections. First, our model architecture follows recent advances in neural networks augmented with memory (as discussed in Section 3). Given a (small) support set S, our model defines a function c_S (or classifier) for each S, i.e. a mapping S → c_S(·). Second, we employ a training strategy which is tailored for one-shot learning from the support set S.
2.1 Model Architecture
In recent years, many groups have investigated ways to augment neural network architectures with
external memories and other components that make them more “computer-like”. We draw inspiration
from models such as sequence to sequence (seq2seq) with attention [2], memory networks [29] and
pointer networks [27].
In all these models, a neural attention mechanism, often fully differentiable, is defined to access (or
read) a memory matrix which stores useful information to solve the task at hand. Typical uses of
this include machine translation, speech recognition, or question answering. More generally, these
architectures model P(B|A) where A and/or B can be a sequence (like in seq2seq models), or, more
interestingly for us, a set [26].
Our contribution is to cast the problem of one-shot learning within the set-to-set framework [26].
Appendix
A Model Description
In this section we fully specify the models which condition the embedding functions f and g on the
whole support set S. Much previous work has fully described similar mechanisms, which is why we
left the precise details for this appendix.
A.1 The Fully Conditional Embedding f
As described in section 2.1.2, the embedding function for an example x̂ in the batch B is as follows:
$$f(\hat{x}, S) = \mathrm{attLSTM}(f'(\hat{x}), g(S), K)$$
where f′ is a neural network (e.g., VGG or Inception, as described in the main text). We define K to be the number of “processing” steps following work from [26] from their “Process” block. g(S) represents the embedding function g applied to each element x_i from the set S.
Thus, the state after k processing steps is as follows:
2.1.1 The Attention Kernel
Equation 1 relies on choosing a(., .), the attention mechanism, which fully specifies
fier. The simplest form that this takes (and which has very tight relationships with
attention models and kernel functions) is to use the softmax over the cosine distan
$$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}),\, g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}),\, g(x_j))}}$$
with embedding functions f and g being
ate neural networks (potentially with f = g) to embed ˆx and xi. In our experiments we
examples where f and g are parameterised variously as deep convolutional networks
tasks (as in VGG[22] or Inception[24]) or a simple form word embedding for language
Section 4).
We note that, though related to metric learning, the classifier defined by Equation 1 is disc
For a given support set S and sample to classify x̂, it is enough for x̂ to be sufficiently ali
pairs (x′, y′) ∈ S such that y′ = y and misaligned with the rest. This kind of loss is also
methods such as Neighborhood Component Analysis (NCA) [18], triplet loss [9] or lar
nearest neighbor [28].
However, the objective that we are trying to optimize is precisely aligned with multi-way
classification, and thus we expect it to perform better than its counterparts. Additionally,
simple and differentiable so that one can find the optimal parameters in an “end-to-end” f
2.1.2 Full Context Embeddings
The main novelty of our model lies in reinterpreting a well studied framework (neural netw
external memories) to do one-shot learning. Closely related to metric learning, the embed
tions f and g act as a lift to feature space X to achieve maximum accuracy through the cla
noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as “content” based attention, and the softmax in eq. 6 normalizes w.r.t. g(x_i). The read-out r_{k−1} from g(S) is concatenated to h_{k−1}. Since we do K steps of “reads”, attLSTM(f′(x̂), g(S), K) = h_K where h_k is as described in eq. 3.
A.2 The Fully Conditional Embedding g
In section 2.1.2 we described the encoding function for the elements in the support set S, g(x_i, S), as a bidirectional LSTM. More precisely, let g′(x_i) be a neural network (similar to f′ above, e.g. a VGG or Inception model). Then we define $g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$ with:
$$\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})$$
$$\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})$$
where, as in above, LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. Note that the recursion for $\overleftarrow{h}$ starts from i = |S|. As in eq. 3, we add a skip connection between input and outputs.
B ImageNet Class Splits
Here we define the two class splits used in our full ImageNet experiments – these classes were
excluded for training during our one-shot experiments described in section 4.1.2.
21. Set-to-set
¤ seq2seq
¤ Matching network
¤ Order Matters: Sequence to sequence for sets [Vinyals+ 2015]
¤
¤ Seq2seq
Published as a conference paper at ICLR 2016
All these empirical findings point to the same story: often for optimization purposes, the order in
which input data is shown to the model has an impact on the learning performance.
Note that we can define an ordering which is independent of the input sequence or set X (e.g., always
reversing the words in a translation task), but also an ordering which is input dependent (e.g., sorting
the input points in the convex hull case). This distinction also applies in the discussion about output
sequences and sets in Section 5.1.
Recent approaches which pushed the seq2seq paradigm further by adding memory and computation
to these models allowed us to define a model which makes no assumptions about input ordering,
whilst preserving the right properties which we just discussed: a memory that increases with the
size of the set, and which is order invariant. In the next sections, we explain such a modification,
which could also be seen as a special case of a Memory Network (Weston et al., 2015) or Neural
Turing Machine (Graves et al., 2014) – with a computation flow as depicted in Figure 1.
4.2 ATTENTION MECHANISMS
Neural models with memories coupled to differentiable addressing mechanism have been successfully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bahdanau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al., 2015). Since we are interested in associative memories we employed a “content” based attention. This has the property that the vector retrieved from our memory would not change if we randomly shuffled the memory. This is crucial for proper treatment of the input set X as such. In particular, our process block based on an attention mechanism uses the following:
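$$q_t = \mathrm{LSTM}(q^{*}_{t-1}) \qquad (3)$$
$$e_{i,t} = f(m_i, q_t) \qquad (4)$$
$$a_{i,t} = \frac{\exp(e_{i,t})}{\sum_{j} \exp(e_{j,t})} \qquad (5)$$
$$r_t = \sum_{i} a_{i,t}\, m_i \qquad (6)$$
$$q^{*}_{t} = [q_t \;\; r_t] \qquad (7)$$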
Figure 1: The Read-Process-and-Write model.
where i indexes through each memory vector m_i (typically equal to the cardinality of X), q_t is a query vector which allows us to read r_t from the memories, f is a function that computes a single scalar from m_i and q_t (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. q*_t is the state which this LSTM evolves, and is formed by concatenating the query q_t with the resulting attention readout r_t. t is the index which indicates
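The content-based read in eqs. 4 to 7 can be sketched as follows, with f taken to be a dot product as the excerpt suggests. The LSTM update of eq. 3 is deliberately left out: the query q_t is simply passed in, so this is only the attention readout, not the full process block. The function name `read` is hypothetical.

```python
import math

def read(memories, q):
    """One content-based read over a set of memory vectors.

    memories : list of memory vectors m_i (all the same length)
    q        : query vector q_t (produced by the LSTM in eq. 3, omitted here)
    Returns (r, q_star), where r is the readout and q_star = [q, r].
    """
    # Eq. 4: scalar score per memory, using a dot product for f.
    e = [sum(mi * qi for mi, qi in zip(m, q)) for m in memories]
    # Eq. 5: softmax over scores (max subtracted for numerical stability).
    mx = max(e)
    w = [math.exp(v - mx) for v in e]
    total = sum(w)
    a = [v / total for v in w]
    # Eq. 6: attention-weighted sum of the memory vectors.
    dim = len(memories[0])
    r = [sum(a[i] * memories[i][d] for i in range(len(memories))) for d in range(dim)]
    # Eq. 7: concatenate the query with the readout.
    return r, list(q) + r

if __name__ == "__main__":
    mem = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    r, q_star = read(mem, [0.5, -0.2])
    r_shuffled, _ = read(list(reversed(mem)), [0.5, -0.2])
    print(r, r_shuffled)  # equal up to floating-point rounding
```

Because the readout is a weighted sum over all memories, reordering the memory leaves r unchanged, which is the order-invariance property the text relies on for treating the input as a set.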
22. ¤ N-way k-shot learning
¤ One-shot
¤ N k
¤ N
N
1/N
¤ fine-tuning N
¤ One-shot generalization [Rezende+ 2016]
¤ VAE
¤ One-shot generation
One-shot Generalization in Deep Generative Models
[Figure 2 panels: (a) Unconditional generative model. (b) One-step of the conditional generative model. Each panel shows a generative model and an inference model.]
Figure 2. Stochastic computational graph showing conditional probabilities and computational steps for sequential generative models.
A represents an attentional mechanism that uses function fw for writings and function fr for reading.
and our transition is specified as a long short-term memory network (LSTM; Hochreiter & Schmidhuber, 1997). We explicitly represent the creation of a set of hidden variables c_t that is a hidden canvas of the model (equation (6)). The canvas function f_c allows for many different transformations, and it is here where generative (writing) attention is used; we describe a number of choices for this function in section 3.2.3. The generated image (7) is sampled using an observation function f_o(c; θ_o) that maps the last hidden canvas c_T to the parameters of the observation model. The set of all parameters of the generative model is θ = {θ_h, θ_c, θ_o}.
3.2.2. FREE ENERGY OBJECTIVE
Given the probabilistic model (3)-(7) we can obtain an ob-
smaller in size and can have any number of channels (four in this paper). We consider two ways with which to update the hidden canvas:
Additive Canvas. As the name implies, an additive canvas updates the canvas by simply adding a transformation of the hidden state f_w(h_t; θ_c) to the previous canvas state c_{t−1}. This is a simple, yet effective (see results) update rule:
$$f_c(c_{t-1}, h_t; \theta_c) = c_{t-1} + f_w(h_t; \theta_c), \qquad (9)$$
Gated Recurrent Canvas. The canvas function can be updated using a convolutional gated recurrent unit (CGRU) architecture (Kaiser & Sutskever, 2015), which provides a non-linear and recursive updating mechanism for the canvas and are simplified versions of convolutional LSTMs (further details of the CGRU are given in appendix B). The
Figure 8. Unconditional samples for 52 × 52 omniglot (task 1). For a video of the generation process, see https://www.youtube.com/watch?v=HQEI2xfTgm4
Figure 9. Generating new exemplars of a given character for the weak generalization test (task 2a). The first row shows the test images and the next 10 are one-shot samples from the model.
Figure 10. Generating new exemplars of a given character for the strong generalization test (task 2b,c), with models trained with different amounts of data. Left: Samples from model trained on 30-20 train-test split; Middle: 40-10 split; Right: 45-5 split.