Presented by indico co-founder Madison May at ODSC East.
Abstract: Transfer learning, the practice of applying knowledge gained on one machine learning task to aid the solution of a second task, has seen historic success in the field of computer vision. The output representations of generic image classification models trained on ImageNet have been leveraged to build models that detect the presence of custom objects in natural images. Image classification tasks that would typically require hundreds of thousands of images can be tackled with mere dozens of training examples per class thanks to the use of these pretrained representations. The field of natural language processing, however, has seen more limited gains from transfer learning, with most approaches limited to the use of pretrained word representations. In this talk, we explore parameter- and data-efficient mechanisms for transfer learning on text, and show practical improvements on real-world tasks. In addition, we demo the use of Enso, a newly open-sourced library designed to simplify benchmarking of transfer learning methods on a variety of target tasks. Enso provides tools for the fair comparison of varied feature representations and target task models as the amount of training data made available to the target model is incrementally increased.
3. Machine Learning Architect @ Indico Data Solutions
Solve big problems with small data.
Email: madison@indico.io
Twitter: @pragmaticml
Github: @madisonmay
4. Overview:
- Deep learning and its limitations
- Transfer learning primer
- Practical recommendations for transfer learning
- Enso + transfer learning benchmarking
- Transfer learning in recent literature
6. A better term for “deep learning”:
“representation learning”
"Visualizing and Understanding Convolutional Networks”
Zeiler, Fergus
Input
Layer 1
activation
Layer 2
activation
Layer 3
activation
Pre-trained
ImageNet model
Feature responds
to car wheels
Feature responds
to faces
7. Representation learning in NLP: word2vec
CBOW objective for word2vec model
https://www.tensorflow.org/tutorials/word2vec
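To make the CBOW objective concrete, here is a minimal sketch using gensim (assuming gensim 4.x; the toy corpus and hyperparameters are purely illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus; a real word2vec model is trained on billions of tokens.
sentences = [
    ["transfer", "learning", "reuses", "pretrained", "representations"],
    ["word2vec", "learns", "word", "representations", "from", "context"],
]

# sg=0 selects the CBOW objective: predict the center word from its context.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
vector = model.wv["representations"]  # a 100-dimensional word embedding
```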
8. Learned word2vec representations have
semantic meaning
“Distributed Representations of Words and Phrases and their Compositionality”
Mikolov, Sutskever, et al.
Advances in neural information processing systems, 3111-3119
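A quick illustration of that semantic structure, using gensim's pretrained Google News vectors (model name from the gensim-data catalog; the printed similarity is approximate):

```python
import gensim.downloader as api

# Load pretrained Google News vectors (a ~1.6 GB download).
wv = api.load("word2vec-google-news-300")

# The classic analogy: vector('king') - vector('man') + vector('woman')
# lands near vector('queen').
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71)]
```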
15. A shuffled tiger
Each pixel treated as an independent feature → the model can tell that tigers are generally orange and black, but not much more. Taken independently, each pixel has little predictive value.
17. In practice, learned features aren’t this interpretable.
However, the relationship between input feature
and target is typically simpler, and learning simpler
relationships requires less data and less compute.
18. Basic transfer learning outline:
1) Train base model on large, general corpus
2) Compute base model’s representations of input data for target task
3) Train lightweight model on top of pre-trained feature representations (see the sketch below)
[Diagram: a shared pre-trained encoder ("featurizer") from a source model (ex. movie review sentiment) feeds multiple lightweight custom classifiers ("target models"), e.g. box office results, movie sentiment aspect, and movie genre prediction]
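A minimal sketch of steps 2 and 3 of the outline above, with a stand-in featurize function in place of a real frozen source model (all names and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# `featurize` stands in for a frozen, pre-trained source model.
def featurize(texts, dim=512):
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), dim))

train_texts = ["great movie", "terrible plot"] * 25
train_labels = [1, 0] * 25

# 2) Compute the base model's representations of the target-task inputs.
X_train = featurize(train_texts)

# 3) Train a lightweight model on top of the frozen representations.
target_model = LogisticRegression().fit(X_train, train_labels)
```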
19. How does transfer learning fix deep learning’s problems?
Training data requirements:
● Pre-trained representations → simpler models → less training data
Memory Requirements:
● A single copy of the base model can fuel many transfer models
● Target models have thousands rather than millions of parameters
● Target model size measured in KBs rather than GBs
Training Time Requirements:
● Target model training takes seconds rather than days
20. HBO’s Silicon Valley “Not Hotdog” app
Transfer learning for computer vision in a "practical" application
21. Transfer learning for NLP vs transfer learning for computer
vision
● More variety in types of target tasks (entity extraction,
classification, seq. labeling)
● More variety in input data (source language, field-specific
terminology)
● No clear “ImageNet” equivalent -- lack of large, generic,
labeled corpora
● Lack of consensus on what source tasks produce good
representations
23. Source model is the single most important variable
Keep source model and target model well-aligned when possible
● Source vocabulary should be aligned with target vocabulary
● Source task should be aligned with target task
Good: product review sentiment → product review category
Good: hotel ratings → restaurant ratings
Less Good: product review sentiment → biology paper classification
[Diagram: source models paired with target tasks; shape ≅ vocabulary, color ≅ task type]
24. What source tasks produce good, general representations?
● Natural language inference
○ Are two sentences in agreement, disagreement, or neither?
● Machine translation
○ English → French
● Multi-task learning
○ Learning to solve many supervised problems at once
● Language modeling
○ Learning to model the distribution of natural language.
○ Predicting the next word in a sequence given context
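To make the language modeling objective concrete, here is a toy next-word predictor built from bigram counts (purely illustrative; real source models are neural networks trained on far larger corpora):

```python
from collections import Counter, defaultdict

# Toy corpus for a bigram language model built from raw counts.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word and its estimated probability."""
    counts = bigrams[word]
    total = sum(counts.values())
    next_word, count = counts.most_common(1)[0]
    return next_word, count / total

print(predict_next("the"))  # e.g. ('cat', 0.25)
```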
25. Keep target models simple
● Limiting model complexity is a strong implicit regularizer
● Logistic regression goes a long way
● Use L2 regularization / dropout as additional regularization
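A minimal sketch of such a target model with scikit-learn (the featurized inputs here are random placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# `X_train` stands in for featurized inputs from a frozen source model.
X_train = np.random.randn(200, 512)
y_train = np.random.randint(0, 2, size=200)

# Small C = strong L2 regularization; with few examples and many features,
# limiting model complexity is the main defense against overfitting.
clf = LogisticRegression(penalty="l2", C=0.1).fit(X_train, y_train)
```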
26. Consider second-order optimization methods
● Transfer learning necessitates simple models with few parameters
because of limited training data
● L-BFGS is usually overlooked in deep learning because it scales
poorly with the number of parameters and examples
● L-BFGS performs well in practice for transfer learning applications (see the sketch below)
First order methods: move a step in the direction of the gradient.
Second order methods: move to the minimum of a second-order approximation of the curve.
[Diagram legend: weight update; approximation of loss surface; true loss surface]
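A sketch of fitting a small logistic regression with L-BFGS via scipy's L-BFGS-B implementation (data and penalty strength are illustrative; scikit-learn's LogisticRegression(solver="lbfgs") uses the same family of methods):

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

def loss(w):
    """Negative log-likelihood with an L2 penalty."""
    logits = X @ w
    z = 2 * y - 1  # map {0, 1} -> {-1, +1}
    # log(1 + exp(-z * logits)) in a numerically stable form
    nll = np.logaddexp(0, -z * logits).sum()
    return nll + 0.1 * (w ** 2).sum()

result = minimize(loss, x0=np.zeros(20), method="L-BFGS-B")
print(result.fun, result.success)
```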
27. When comparing approaches, measure performance variance
● Limited labeled training data → limited test and validation data
● High variance across CV splits may correspond with poor
generalization
[Plots: model accuracy vs. training data volume]
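One way to put this into practice: report the spread across CV splits, not just the mean (the featurized inputs below are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder featurized data and labels.
features = np.random.randn(100, 512)
labels = np.random.randint(0, 2, size=100)

# Mean AND standard deviation across splits: high variance may
# correspond with poor generalization.
scores = cross_val_score(LogisticRegression(), features, labels, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")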
28. “Classic” machine learning problems are exaggerated at small
training dataset sizes
● Ex: class imbalance can lead to degenerate models that predict
only a single class -- consider oversampling / undersampling
● Ex: unrepresentative dataset -- small sample sizes increase the
likelihood that a model will pick up on spurious correlations
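A minimal oversampling sketch with scikit-learn's resample utility (class sizes and feature dimensions are illustrative):

```python
import numpy as np
from sklearn.utils import resample

# Naive oversampling of the minority class.
X = np.random.randn(100, 512)
y = np.array([0] * 90 + [1] * 10)

minority = X[y == 1]
# Sample with replacement until the classes are balanced.
upsampled = resample(minority, replace=True, n_samples=90, random_state=0)

X_balanced = np.vstack([X[y == 0], upsampled])
y_balanced = np.array([0] * 90 + [1] * 90)
```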
29. “Feature engineering” has its place
● Modern day “feature engineering” takes the form of model
architecture decisions
● Ex: when trying to determine whether or not a job description and a
resume are a good match, use the absolute difference of the two
feature representations as input to the model (sketched below).
[Diagram: a job description and a resume are each featurized; the combined representation forms the model input]
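A sketch of the absolute-difference trick described above; featurize is a trivial stand-in for a pre-trained source model:

```python
import numpy as np

# `featurize` stands in for a pre-trained source model; here it is a
# trivial placeholder that maps text to a fixed-length vector.
def featurize(text, dim=512):
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.normal(size=dim)

job_vec = featurize("Seeking ML engineer with NLP experience ...")
resume_vec = featurize("5 years building NLP pipelines ...")

# Symmetric input: |a - b| highlights where the two documents diverge,
# and is invariant to which document comes first.
model_input = np.abs(job_vec - resume_vec)
```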
31. Enso:
provides a standard interface for the benchmarking
of embeddings and transfer learning methods for
NLP tasks.
32. The need:
● Eliminate human “overfitting” of hyperparameters
to values that work well for a single task
● Ensure higher fidelity baselines
● Benchmark on many datasets to better
understand where an approach is effective
33. Enso workflow:
● Download 2 dozen included datasets for benchmarking on diverse tasks
● “Featurize” all examples in the dataset via a pre-trained source model
● Train target model using the featurized training examples as inputs
● Repeat process for all combinations of featurizers, dataset sizes, target
model architectures, etc.
● Visualize and manually inspect results
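A schematic of this workflow in plain Python. This is not Enso's actual API, just an illustration of the loop over featurizers and dataset sizes with stand-in components:

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
texts = ["example document %d" % i for i in range(500)]
labels = rng.integers(0, 2, size=500)

def featurizer_a(docs):  # stand-in for e.g. a word-vector featurizer
    return rng.normal(size=(len(docs), 300))

def featurizer_b(docs):  # stand-in for e.g. a language-model featurizer
    return rng.normal(size=(len(docs), 768))

# Repeat for all combinations of featurizers and training set sizes.
results = []
for featurize, size in product([featurizer_a, featurizer_b], [50, 100, 200]):
    X, y = featurize(texts[:size]), labels[:size]
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    results.append((featurize.__name__, size, scores.mean(), scores.std()))

for row in results:
    print(row)
```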
39. Recent Papers of Note:
● “Learning General Purpose Distributed Sentence
Representations via Large Scale Multi-task Learning”
by Subramanian, et al.
● “Fine-tuned Language Models for Text Classification”
by Howard, Ruder
● “Deep contextualized word representations”
by Peters, et al.
40. “Deep contextualized word representations”
by Peters, et al. (AllenAI)
● Language modeling is a good objective for source model
● Many different layers of representation are useful, attend over
layers of representation and learn to weight on a per-task basis
● Per token representations mean applicability to broader range of
tasks than vanilla document representation
"Embeddings from Language Models" (ELMo) layer weights learned on a variety of target tasks
41. Shared encoder -- "featurizer"
[Diagram: input and hidden layers each produce a "representation" or "feature vector" (colored blocks); each representation is weighted (e.g. 0.5, 0.2, 0.3) then summed to produce a feature vector of the same dimensions]
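A numpy sketch of that weighted combination (shapes and weights are illustrative; the paper also learns a task-specific scaling factor, omitted here):

```python
import numpy as np

# ELMo-style combination: softmax-normalized scalar weights per layer,
# then a weighted sum of the per-layer representations.
seq_len, hidden_dim, n_layers = 10, 512, 3
layer_outputs = [np.random.randn(seq_len, hidden_dim) for _ in range(n_layers)]

s = np.array([0.5, 0.2, 0.3])    # per-layer scores (learned per task)
w = np.exp(s) / np.exp(s).sum()  # softmax normalization

# The weighted sum keeps the original (seq_len, hidden_dim) shape.
combined = sum(w_i * h_i for w_i, h_i in zip(w, layer_outputs))
print(combined.shape)  # (10, 512)
```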
46. ● Small data problems are more common than big data
problems.
● Transfer learning enables taking advantage of deep learning
without massive labeled corpora.
● When in doubt, trend toward simplicity.
48. Other Resources for Transfer Learning on NLP tasks
● http://ruder.io, Sebastian Ruder’s blog
● https://arxiv.org/list/cs.CL (Arxiv Computation and Language)
● https://fast.ai (Making neural nets uncool again)
49. “Learning General Purpose Distributed Sentence Representations via
Large Scale Multi-task Learning”
by Subramanian, et al.
● Learning document representations using bidirectional LSTM
trained on a multi-task learning objective
● Tasks included skip-thought vectors, neural machine translation,
parse tree construction, and natural language inference
● Diverse source tasks led to document representations that
produced strong empirical results when applied to a dozen
different target tasks
[Diagram: a shared encoder over the input feeds separate heads for Task 1, Task 2, ...]
50. “Fine-tuned Language Models for Text Classification”
by Howard, Ruder
● Outlines a “bag of tricks” for applying transfer learning to NLP
● Language modeling is an effective source task
● Fine-tune the source model rather than using a static
representation
● Use separate learning rate per layer to keep the first layer relatively
static while updating the final layer more
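A sketch of per-layer ("discriminative") learning rates using PyTorch parameter groups; the toy model is illustrative, and the decay factor of 2.6 follows the paper's suggestion:

```python
import torch
from torch import nn, optim

# Toy stack of layers standing in for a language model being fine-tuned.
model = nn.Sequential(
    nn.Embedding(10000, 128),  # "first layer": kept relatively static
    nn.Linear(128, 128),
    nn.Linear(128, 2),         # "final layer": updated more aggressively
)

base_lr = 1e-3
# Learning rate shrinks by a factor of 2.6 per layer, back to front.
param_groups = [
    {"params": layer.parameters(),
     "lr": base_lr / (2.6 ** (len(model) - 1 - i))}
    for i, layer in enumerate(model)
]
optimizer = optim.SGD(param_groups, lr=base_lr)
```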