@ODSC: Open Data Science Conference
Boston | May 1 - 4, 2018
Effective Transfer Learning
for NLP
Madison May
madison@indico.io
Machine Learning Architect @ Indico Data Solutions
Solve big problems with small data.
Email: madison@indico.io
Twitter: @pragmaticml
Github: @madisonmay
Overview:
- Deep learning and its limitations
- Transfer learning primer
- Practical recommendations for transfer learning
- Enso + transfer learning benchmarking
- Transfer learning in recent literature
Deep learning and its limitations
A better term for “deep learning”:
“representation learning”
“Visualizing and Understanding Convolutional Networks”
Zeiler, Fergus
[Figure: input image with layer 1, layer 2, and layer 3 activations of a pre-trained ImageNet model; one feature responds to car wheels, another responds to faces]
Representation learning in NLP: word2vec
CBOW objective for word2vec model
https://www.tensorflow.org/tutorials/word2vec
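To make the CBOW objective concrete, here is a minimal sketch of how training pairs could be built: each word becomes a target to be predicted from the words around it. The function name `cbow_pairs` and the `window` parameter are illustrative choices, not part of any word2vec library.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for a CBOW-style objective."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


sentence = "the quick brown fox jumps".split()
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

A real word2vec model would then learn embeddings so that the averaged context vectors predict the target word; that training step is omitted here.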
Learned word2vec representations have
semantic meaning
“Distributed Representations of Words and Phrases and their Compositionality”
Mikolov, Sutskever, et al.
Advances in neural information processing systems, 3111-3119
Training data requirements
[Figure: performance vs. labeled training data for deep learning and traditional ML, annotated at ~10,000+ labeled examples]
Training Time + Computational Expense
Transfer learning primer
Everyone has problems.
Not everyone has data.
Small data problems are more
common than big data problems.
<1k examples = small data
Transfer learning:
the application of knowledge gained in
one context to a different context
A shuffled tiger
Each pixel treated as an independent feature →
Can tell that tigers are generally orange and black but not much more
Independently each pixel
has little predictive value
Transfer learning: re-represent new
data in terms of existing concepts
large: 0.8, orange: 0.9, striped: 0.7, cat: 0.8
In practice, learned features aren’t this interpretable.
However, the relationship between input feature
and target is typically simpler, and learning simpler
relationships requires less data and less compute.
Basic transfer learning outline:
1) Train base model on large, general corpus
2) Compute base model’s representations of input data for target task
3) Train lightweight model on top of pre-trained feature representations
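The three steps above can be sketched end to end in a few lines. This is a toy illustration under loud assumptions: `TOY_VECTORS` is a hand-made stand-in for a frozen base model (not real pre-trained embeddings), and the target model is a tiny logistic regression trained with per-example gradient updates.

```python
import math

# Stand-in for step 1: a frozen "base model". These 2-d word vectors are
# hand-made toy values, not real pre-trained embeddings.
TOY_VECTORS = {
    "great": [1.0, 0.2], "loved": [0.9, 0.1], "fun": [0.8, 0.3],
    "awful": [-1.0, 0.2], "boring": [-0.8, 0.1], "hated": [-0.9, 0.3],
}


def featurize(text):
    """Step 2: average the frozen vectors of the known words in the text."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(features, labels, lr=0.5, epochs=200):
    """Step 3: fit a logistic regression with stochastic gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)


texts = ["loved it , great fun", "awful and boring", "great movie", "hated it"]
labels = [1, 0, 1, 0]
w, b = train_logreg([featurize(t) for t in texts], labels)
print([predict(w, b, featurize(t)) for t in texts])
```

Only `w` and `b` are trained; the featurizer is never updated, which is what keeps the target model small.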
[Figure: a shared encoder (“featurizer”) taken from the “source model” (ex. movie review sentiment), shown as input and two hidden layers, feeds custom classifiers (“target models”) for tasks such as box office results, movie sentiment aspect, and movie genre prediction]
How does transfer learning fix deep learning’s problems?
Training data requirements:
● Pre-trained representations → simpler models → less training data
Memory Requirements:
● A single copy of the base model can fuel many transfer models
● Target models have thousands rather than millions of parameters
● Target model size measured in KBs rather than GBs
Training Time Requirements:
● Target model training takes seconds rather than days
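A quick back-of-the-envelope calculation shows why target models land in the KB range. The feature dimension (300) and class count (5) below are assumed values for illustration, not figures from the talk.

```python
def dense_param_count(n_features, n_classes):
    """Weights plus biases of a single logistic/softmax layer."""
    return (n_features + 1) * n_classes


# Assumed sizes, for illustration only.
n_params = dense_param_count(n_features=300, n_classes=5)
size_kb = n_params * 4 / 1024  # 4 bytes per float32 parameter
print(n_params, round(size_kb, 1))
```

Under these assumptions the target model has about 1.5 thousand parameters and takes a few kilobytes, while the shared base model is stored once.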
HBO’s Silicon Valley “Not Hotdog” app
Transfer learning for computer vision for
“practical” application
Transfer learning for NLP vs transfer learning for computer
vision
● More variety in types of target tasks (entity extraction,
classification, seq. labeling)
● More variety in input data (source language, field-specific
terminology)
● No clear “ImageNet” equivalent -- lack of large, generic,
labeled corpora
● Lack of consensus on what source tasks produce good
representations
Practical recommendations for
transfer learning
Source model is the single most important variable
Keep source model and target model well-aligned when possible
● Source vocabulary should be aligned with target vocabulary
● Source task should be aligned with target task
Good: product review sentiment → product review category
Good: hotel ratings → restaurant ratings
Less Good: product review sentiment → biology paper classification
[Figure: source models matched to target tasks; shape ≅ vocabulary, color ≅ task type]
What source tasks produce good, general representations?
● Natural language inference
○ Are two sentences in agreement, disagreement, or neither?
● Machine translation
○ English → French
● Multi-task learning
○ Learning to solve many supervised problems at once
● Language modeling
○ Learning to model the distribution of natural language.
○ Predicting the next word in a sequence given context
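As a rough illustration of the language modeling objective, the sketch below estimates next-word probabilities from bigram counts. It is a deliberately simple count-based stand-in; the neural language models used as source models learn the same kind of distribution with far richer context.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the counts for one preceding word into probabilities."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}


tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)
print(next_word_probs(model, "the"))
```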
Keep target models simple
● Limiting model complexity is a strong implicit regularizer
● Logistic regression goes a long way
● Use L2 regularization / dropout as additional regularization
Consider second-order optimization methods
● Limited training data calls for a simple target model with few
parameters
● L-BFGS, a quasi-Newton method, is usually overlooked in deep learning
because it scales poorly with the number of parameters and examples
● L-BFGS performs well in practice for transfer learning applications
First-order methods: move a step in the direction of the gradient
Second-order methods: move to the minimum of a second-order approximation of the curve
[Figure legend: ■ weight update, ■ approx. of loss surface, ■ true loss surface]
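The difference between the two kinds of update can be seen on a one-dimensional quadratic loss. This sketch uses an exact Newton step with a known second derivative; L-BFGS instead builds an approximation of the curvature from recent gradients. The loss function and step size are assumptions chosen for illustration.

```python
def loss(w):
    return (w - 3.0) ** 2 + 1.0   # minimum at w = 3


def grad(w):
    return 2.0 * (w - 3.0)        # first derivative


def hess(w):
    return 2.0                    # second derivative (constant here)


w0 = 0.0
gd_step = w0 - 0.1 * grad(w0)           # first order: small step downhill
newton_step = w0 - grad(w0) / hess(w0)  # second order: jump to the minimum
                                        # of the quadratic approximation
print(gd_step, newton_step)
```

Because the loss here is exactly quadratic, the second-order step reaches the minimum in a single update, while the first-order step only moves part of the way.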
When comparing approaches, measure performance variance
● Limited labeled training data → limited test and validation data
● High variance across CV splits may signal poor
generalization
[Figure: two plots of model accuracy vs. training data volume]
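One way to act on this advice is to report the spread of scores across cross-validation folds, not just the mean. The sketch below builds simple k-fold index splits and summarizes per-fold scores; the accuracy numbers are hypothetical values made up for illustration, and in practice the examples should be shuffled before splitting.

```python
import statistics


def kfold_indices(n_examples, k):
    """Split range(n_examples) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        size = n_examples // k + (1 if i < n_examples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical per-fold accuracies for two approaches.
scores_a = [0.80, 0.82, 0.78, 0.81, 0.79]
scores_b = [0.95, 0.60, 0.90, 0.65, 0.90]
mean_a, std_a = summarize(scores_a)
mean_b, std_b = summarize(scores_b)
print(f"A: {mean_a:.2f} +/- {std_a:.3f}")
print(f"B: {mean_b:.2f} +/- {std_b:.3f}")
```

In this made-up example both approaches have the same mean accuracy, but approach B varies far more from fold to fold, which the mean alone would hide.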
“Classic” machine learning problems are exaggerated at small
training dataset sizes
● Ex: class imbalance can lead to degenerate models that predict
only a single class -- consider oversampling / undersampling
● Ex: unrepresentative dataset -- small sample sizes increase the
likelihood that a model will pick up on spurious correlations
[Figure: class balance]
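Random oversampling is one simple response to class imbalance: minority-class examples are repeated until every class is as large as the biggest one. This is a minimal sketch with an illustrative function name; it should be applied to the training split only, so that duplicated examples never leak into validation or test data.

```python
import random


def oversample(examples, labels, seed=0):
    """Repeat minority-class examples until all classes have equal counts."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    out_x, out_y = [], []
    for y, xs in by_label.items():
        # Draw extra examples with replacement to reach the target count.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


xs, ys = oversample(["a", "b", "c", "d", "e"], [0, 0, 0, 0, 1])
print(ys)
```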
“Feature engineering” has its place
● Modern day “feature engineering” takes the form of model
architecture decisions
● Ex: when trying to determine whether or not a job description and a
resume are a good match, use the absolute difference of the two
feature representations as input to the model.
[Figure: model input built from the Job Description and Resume representations]
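The job description and resume example amounts to a one-line transformation of the two feature vectors. The vectors below are toy values; in practice they would come from the pre-trained featurizer.

```python
def abs_difference(vec_a, vec_b):
    """Element-wise absolute difference of two feature representations."""
    if len(vec_a) != len(vec_b):
        raise ValueError("feature vectors must have the same length")
    return [abs(a - b) for a, b in zip(vec_a, vec_b)]


# Toy feature vectors, for illustration only.
job_vec = [1.0, 0.5, 0.0]
resume_vec = [0.5, 0.5, 1.0]
model_input = abs_difference(job_vec, resume_vec)
print(model_input)
```

The result is symmetric in its two arguments, so the model sees the same input regardless of which document is passed first.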
Introducing: Enso
Enso:
provides a standard interface for the benchmarking
of embeddings and transfer learning methods for
NLP tasks.
The need:
● Eliminate human “overfitting” of hyperparameters
to values that work well for a single task
● Ensure higher fidelity baselines
● Benchmark on many datasets to better
understand where an approach is effective
Enso workflow:
● Download 2 dozen included datasets for benchmarking on diverse tasks
● “Featurize” all examples in the dataset via a pre-trained source model
● Train target model using the featurized training examples as inputs
● Repeat process for all combinations of featurizers, dataset sizes, target
model architectures, etc.
● Visualize and manually inspect results
> python -m enso.download
> python -m enso.featurize
> python -m enso.experiment
> python -m enso.visualize
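The "all combinations" step of the workflow is a plain grid over experiment settings. The sketch below shows that idea with `itertools.product`; every name in it (the featurizer labels, sizes, model labels, and `experiment_grid` itself) is hypothetical and is not taken from Enso's API or configuration.

```python
from itertools import product

# Hypothetical settings, not Enso's actual configuration values.
featurizers = ["featurizer_a", "featurizer_b"]
dataset_sizes = [50, 100, 500]
target_models = ["logistic_regression", "small_mlp"]


def experiment_grid(featurizers, dataset_sizes, target_models):
    """Enumerate one experiment per combination of settings."""
    return [
        {"featurizer": f, "n_train": n, "target_model": m}
        for f, n, m in product(featurizers, dataset_sizes, target_models)
    ]


grid = experiment_grid(featurizers, dataset_sizes, target_models)
print(len(grid), grid[0])
```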
Comparison of transfer model architectures
Comparison of optimizer used
http://github.com/IndicoDataSolutions/enso
http://enso.readthedocs.io
Research spotlight
Recent Papers of Note:
● “Learning General Purpose Distributed Sentence
Representations via Large Scale Multi-task Learning”
by Subramanian et al.
● “Fine-tuned Language Models for Text Classification”
by Howard, Ruder
● “Deep contextualized word representations”
by Peters et al.
“Deep contextualized word representations”
by Peters et al. (AllenAI)
● Language modeling is a good objective for source model
● Many different layers of representation are useful, attend over
layers of representation and learn to weight on a per-task basis
● Per token representations mean applicability to broader range of
tasks than vanilla document representation
“Embeddings from Language Models”
(ELMo) layer weights
learned on a variety of target tasks
[Figure: shared encoder (“featurizer”) with input and two hidden layers, weighted 0.5, 0.2, and 0.3]
Each colored block is a “representation”
or “feature vector”
Each representation is weighted then
summed to produce a feature vector of
the same dimensions
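The weighted sum of layer representations can be sketched as follows. Per-layer scores are normalized with a softmax so the weights sum to one, and the result is multiplied by a scale factor `gamma`. In ELMo both the scores and the scale are learned per task; here they are fixed toy values, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, layer_scores, gamma=1.0):
    """Weighted sum of same-sized layer representations."""
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    mixed = [0.0] * dim
    for w, vec in zip(weights, layer_vectors):
        for i in range(dim):
            mixed[i] += w * vec[i]
    return [gamma * v for v in mixed]


# Three toy 2-d layer representations for a single token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = mix_layers(layers, layer_scores=[0.0, 0.0, 0.0])
print(mixed)
```

The output has the same dimension as each input layer, so the downstream task model does not need to change when the layer weights do.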
Source: Chris Olah's personal blog
Bidirectional LSTM
Source + Task RNNs
[Figure: the input plus forward (FW) and backward (BW) states of the source RNN (frozen weights) are combined by a learned average and fed to the task RNN (task-specific architecture)]
Empirical Results
Conclusions
● Small data problems are more common than big data
problems.
● Transfer learning enables taking advantage of deep learning
without massive labeled corpora.
● When in doubt, tend toward simplicity.
Appendix
Other Resources for Transfer Learning on NLP tasks
● http://ruder.io, Sebastian Ruder’s blog
● https://arxiv.org/list/cs.CL (arXiv Computation and Language)
● https://fast.ai (Making neural nets uncool again)
“Learning General Purpose Distributed Sentence Representations via
Large Scale Multi-task Learning”
by Subramanian et al.
● Learning document representations using bidirectional LSTM
trained on a multi-task learning objective
● Tasks included skip-thought vectors, neural machine translation,
parse tree construction, and natural language inference
● Diverse source tasks led to document representations that
produced strong empirical results when applied to a dozen
different target tasks
[Figure: a shared encoder over the input feeds Task 1 and Task 2]
“Fine-tuned Language Models for Text Classification”
by Howard, Ruder
● Outlines a “bag of tricks” for applying transfer learning to NLP
● Language modeling is an effective source task
● Fine-tune the source model rather than using a static
representation
● Use a separate learning rate per layer so the first layers stay
relatively static while the final layers are updated more
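Per-layer learning rates can be generated by starting from the learning rate of the final layer and shrinking it for each earlier layer. This is a sketch of that idea only; the function name and the decay factor are illustrative assumptions, not values taken from the slides.

```python
def layerwise_learning_rates(top_lr, n_layers, decay=2.0):
    """Return learning rates ordered from the first layer to the last.

    The last layer gets `top_lr`; each earlier layer gets the rate of the
    layer above it divided by `decay`.
    """
    return [top_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]


rates = layerwise_learning_rates(top_lr=0.01, n_layers=3, decay=2.0)
print(rates)
```

With these settings the first layer moves four times more slowly than the last, which keeps general low-level features close to their pre-trained values.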

More Related Content

What's hot

Talk from NVidia Developer Connect
Talk from NVidia Developer ConnectTalk from NVidia Developer Connect
Talk from NVidia Developer ConnectAnuj Gupta
 
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question AnsweringSujit Pal
 
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)Márton Miháltz
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningBigDataCloud
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersRoelof Pieters
 
Transfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningTransfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningSebastian Ruder
 
Deep Learning For Practitioners, lecture 2: Selecting the right applications...
Deep Learning For Practitioners,  lecture 2: Selecting the right applications...Deep Learning For Practitioners,  lecture 2: Selecting the right applications...
Deep Learning For Practitioners, lecture 2: Selecting the right applications...ananth
 
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systemsQi He
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentationBushra Jbawi
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Saurabh Kaushik
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingSebastian Ruder
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information RetrievalRoelof Pieters
 
Introduction To Applied Machine Learning
Introduction To Applied Machine LearningIntroduction To Applied Machine Learning
Introduction To Applied Machine Learningananth
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageRoelof Pieters
 
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskSaurabh Saxena
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPAnuj Gupta
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSebastian Ruder
 
Convolutional Neural Networks: Part 1
Convolutional Neural Networks: Part 1Convolutional Neural Networks: Part 1
Convolutional Neural Networks: Part 1ananth
 

What's hot (20)

Talk from NVidia Developer Connect
Talk from NVidia Developer ConnectTalk from NVidia Developer Connect
Talk from NVidia Developer Connect
 
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question Answering
 
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ers
 
Transfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningTransfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine Learning
 
Deep Learning For Practitioners, lecture 2: Selecting the right applications...
Deep Learning For Practitioners,  lecture 2: Selecting the right applications...Deep Learning For Practitioners,  lecture 2: Selecting the right applications...
Deep Learning For Practitioners, lecture 2: Selecting the right applications...
 
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentation
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language Processing
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information Retrieval
 
Introduction To Applied Machine Learning
Introduction To Applied Machine LearningIntroduction To Applied Machine Learning
Introduction To Applied Machine Learning
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
 
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLP
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Deep learning for nlp
Deep learning for nlpDeep learning for nlp
Deep learning for nlp
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep Learning
 
Convolutional Neural Networks: Part 1
Convolutional Neural Networks: Part 1Convolutional Neural Networks: Part 1
Convolutional Neural Networks: Part 1
 

Similar to ODSC East: Effective Transfer Learning for NLP

How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...Wee Hyong Tok
 
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupYves Peirsman
 
Bridging the gap between AI and UI - DSI Vienna - full version
Bridging the gap between AI and UI - DSI Vienna - full versionBridging the gap between AI and UI - DSI Vienna - full version
Bridging the gap between AI and UI - DSI Vienna - full versionLiad Magen
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 
OReilly AI Transfer Learning
OReilly AI Transfer LearningOReilly AI Transfer Learning
OReilly AI Transfer LearningDanielle Dean
 
Single Responsibility Principle
Single Responsibility PrincipleSingle Responsibility Principle
Single Responsibility PrincipleBADR
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Introduction to object oriented language
Introduction to object oriented languageIntroduction to object oriented language
Introduction to object oriented languagefarhan amjad
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer modelsDing Li
 
Concepts In Object Oriented Programming Languages
Concepts In Object Oriented Programming LanguagesConcepts In Object Oriented Programming Languages
Concepts In Object Oriented Programming Languagesppd1961
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB
 
conceptsinobjectorientedprogramminglanguages-12659959597745-phpapp02.pdf
conceptsinobjectorientedprogramminglanguages-12659959597745-phpapp02.pdfconceptsinobjectorientedprogramminglanguages-12659959597745-phpapp02.pdf
conceptsinobjectorientedprogramminglanguages-12659959597745-phpapp02.pdfSahajShrimal1
 
Multi-Task Learning and Web Search Ranking
Multi-Task Learning and Web Search RankingMulti-Task Learning and Web Search Ranking
Multi-Task Learning and Web Search Rankingbutest
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDatabricks
 
Rsqrd AI: ML Tooling at an AI-first Startup
Rsqrd AI: ML Tooling at an AI-first StartupRsqrd AI: ML Tooling at an AI-first Startup
Rsqrd AI: ML Tooling at an AI-first StartupSanjana Chowdhury
 
NLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsNLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsSanghamitra Deb
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Envisioning the Future of Language Workbenches
Envisioning the Future of Language WorkbenchesEnvisioning the Future of Language Workbenches
Envisioning the Future of Language WorkbenchesMarkus Voelter
 
Natural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application TrendsNatural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application TrendsShreyas Suresh Rao
 

Similar to ODSC East: Effective Transfer Learning for NLP (20)

How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...
 
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
 
Bridging the gap between AI and UI - DSI Vienna - full version
Bridging the gap between AI and UI - DSI Vienna - full versionBridging the gap between AI and UI - DSI Vienna - full version
Bridging the gap between AI and UI - DSI Vienna - full version
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
OReilly AI Transfer Learning
OReilly AI Transfer LearningOReilly AI Transfer Learning
OReilly AI Transfer Learning
 
Single Responsibility Principle
Single Responsibility PrincipleSingle Responsibility Principle
Single Responsibility Principle
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Introduction to object oriented language
Introduction to object oriented languageIntroduction to object oriented language
Introduction to object oriented language
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Concepts In Object Oriented Programming Languages
Concepts In Object Oriented Programming LanguagesConcepts In Object Oriented Programming Languages
Concepts In Object Oriented Programming Languages
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDB
 
conceptsinobjectorientedprogramminglanguages-12659959597745-phpapp02.pdf
conceptsinobjectorientedprogramminglanguages-12659959597745-phpapp02.pdfconceptsinobjectorientedprogramminglanguages-12659959597745-phpapp02.pdf
conceptsinobjectorientedprogramminglanguages-12659959597745-phpapp02.pdf
 
Multi-Task Learning and Web Search Ranking
Multi-Task Learning and Web Search RankingMulti-Task Learning and Web Search Ranking
Multi-Task Learning and Web Search Ranking
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
 
Rsqrd AI: ML Tooling at an AI-first Startup
Rsqrd AI: ML Tooling at an AI-first StartupRsqrd AI: ML Tooling at an AI-first Startup
Rsqrd AI: ML Tooling at an AI-first Startup
 
NLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsNLP and Deep Learning for non_experts
NLP and Deep Learning for non_experts
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Envisioning the Future of Language Workbenches
Envisioning the Future of Language WorkbenchesEnvisioning the Future of Language Workbenches
Envisioning the Future of Language Workbenches
 
Bp301
Bp301Bp301
Bp301
 
Natural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application TrendsNatural Language Processing - Research and Application Trends
Natural Language Processing - Research and Application Trends
 

More from indico data

Small Data for Big Problems: Practical Transfer Learning for NLP
Small Data for Big Problems: Practical Transfer Learning for NLPSmall Data for Big Problems: Practical Transfer Learning for NLP
Small Data for Big Problems: Practical Transfer Learning for NLPindico data
 
Getting to AI ROI: Finding Value in Your Unstructured Content
Getting to AI ROI: Finding Value in Your Unstructured ContentGetting to AI ROI: Finding Value in Your Unstructured Content
Getting to AI ROI: Finding Value in Your Unstructured Contentindico data
 
Everything You Wanted to Know About Optimization
Everything You Wanted to Know About OptimizationEverything You Wanted to Know About Optimization
Everything You Wanted to Know About Optimizationindico data
 
TensorFlow in Practice
TensorFlow in PracticeTensorFlow in Practice
TensorFlow in Practiceindico data
 
The Unreasonable Benefits of Deep Learning
The Unreasonable Benefits of Deep LearningThe Unreasonable Benefits of Deep Learning
The Unreasonable Benefits of Deep Learningindico data
 
How Machine Learning is Shaping Digital Marketing
How Machine Learning is Shaping Digital MarketingHow Machine Learning is Shaping Digital Marketing
How Machine Learning is Shaping Digital Marketingindico data
 
Deep Advances in Generative Modeling
Deep Advances in Generative ModelingDeep Advances in Generative Modeling
Deep Advances in Generative Modelingindico data
 
Machine Learning for Non-technical People
Machine Learning for Non-technical PeopleMachine Learning for Non-technical People
Machine Learning for Non-technical Peopleindico data
 
Getting started with indico APIs [Python]
Getting started with indico APIs [Python]Getting started with indico APIs [Python]
Getting started with indico APIs [Python]indico data
 
Introduction to Deep Learning with Python
Introduction to Deep Learning with PythonIntroduction to Deep Learning with Python
Introduction to Deep Learning with Pythonindico data
 

More from indico data (10)

Small Data for Big Problems: Practical Transfer Learning for NLP
Small Data for Big Problems: Practical Transfer Learning for NLPSmall Data for Big Problems: Practical Transfer Learning for NLP
Small Data for Big Problems: Practical Transfer Learning for NLP
 
Getting to AI ROI: Finding Value in Your Unstructured Content
Getting to AI ROI: Finding Value in Your Unstructured ContentGetting to AI ROI: Finding Value in Your Unstructured Content
Getting to AI ROI: Finding Value in Your Unstructured Content
 
Everything You Wanted to Know About Optimization
Everything You Wanted to Know About OptimizationEverything You Wanted to Know About Optimization
Everything You Wanted to Know About Optimization
 
TensorFlow in Practice
TensorFlow in PracticeTensorFlow in Practice
TensorFlow in Practice
 
The Unreasonable Benefits of Deep Learning
The Unreasonable Benefits of Deep LearningThe Unreasonable Benefits of Deep Learning
The Unreasonable Benefits of Deep Learning
 
How Machine Learning is Shaping Digital Marketing
How Machine Learning is Shaping Digital MarketingHow Machine Learning is Shaping Digital Marketing
How Machine Learning is Shaping Digital Marketing
 
Deep Advances in Generative Modeling
Deep Advances in Generative ModelingDeep Advances in Generative Modeling
Deep Advances in Generative Modeling
 
Machine Learning for Non-technical People
Machine Learning for Non-technical PeopleMachine Learning for Non-technical People
Machine Learning for Non-technical People
 
Getting started with indico APIs [Python]
Getting started with indico APIs [Python]Getting started with indico APIs [Python]
Getting started with indico APIs [Python]
 
Introduction to Deep Learning with Python
Introduction to Deep Learning with PythonIntroduction to Deep Learning with Python
Introduction to Deep Learning with Python
 

ODSC East: Effective Transfer Learning for NLP

  • 2. Effective Transfer Learning for NLP Madison May madison@indico.io
  • 3. Machine Learning Architect @ Indico Data Solutions Solve big problems with small data. Email: madison@indico.io Twitter: @pragmaticml Github: @madisonmay
  • 4. Overview: - Deep learning and its limitations - Transfer learning primer - Practical recommendations for transfer learning - Enso + transfer learning benchmarking - Transfer learning in recent literature
  • 5. Deep learning and its limitations
  • 6. A better term for “deep learning”: “representation learning”. [Figure from “Visualizing and Understanding Convolutional Networks” (Zeiler, Fergus): input image and layer 1, 2, and 3 activations of a pre-trained ImageNet model; one feature responds to car wheels, another to faces.]
  • 7. Representation learning in NLP: word2vec CBOW objective for word2vec model https://www.tensorflow.org/tutorials/word2vec
  • 8. Learned word2vec representations have semantic meaning “Distributed Representations of Words and Phrases and their Compositionality” Mikolov, Sutskever, et al. Advances in neural information processing systems, 3111-3119
  • 9. Training data requirements. [Plot: performance vs. labeled training data for deep learning and traditional ML, annotated at ~10,000+ labeled examples.]
  • 10. Training Time + Computational Expense
  • 12. Everyone has problems. Not everyone has data. Small data problems are more common than big data problems. <1k examples = small data
  • 13. Transfer learning: the application of knowledge gained in one context to a different context
  • 15. A shuffled tiger: each pixel treated as an independent feature → can tell that tigers are generally orange and black, but not much more. Independently, each pixel has little predictive value.
  • 16. Transfer learning: re-represent new data in terms of existing concepts 0.8 0.9 0.7 0.8 large orange striped cat
  • 17. In practice, learned features aren’t this interpretable. However, the relationship between input feature and target is typically simpler, and learning simpler relationships requires less data and less compute.
  • 18. Basic transfer learning outline: 1) Train a base model on a large, general corpus. 2) Compute the base model’s representations of the input data for the target task. 3) Train a lightweight model on top of the pre-trained feature representations. [Diagram: a “source model” (ex. movie review sentiment) whose input and hidden layers form a shared encoder, the “featurizer”; custom classifier “target models” built on top of it: box office results, movie sentiment aspect, movie genre prediction.]
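The three-step outline can be sketched in plain Python. This is a minimal illustration, not the speaker's implementation: `featurize` is a hypothetical stand-in for a frozen pre-trained source model (here just a fixed word-hashing function), the four training texts are made-up toy data, and the target model is an L2-regularized logistic regression trained with simple gradient steps.

```python
import math

def featurize(text, dim=16):
    """Hypothetical frozen 'source model': hash words into a fixed-size vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Stable hash (sum of code points) so features are reproducible.
        vec[sum(ord(ch) for ch in word) % dim] += 1.0
    return vec

def train_logistic_regression(features, labels, lr=0.5, epochs=200, l2=0.01):
    """Lightweight target model trained on top of fixed features."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            # Gradient step on log loss plus an L2 penalty on the weights.
            weights = [w - lr * (error * xi + l2 * w)
                       for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, x):
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

# Toy target task (assumed data): 1 = positive, 0 = negative.
texts = ["great fun movie", "awful boring movie", "great acting", "boring plot"]
labels = [1, 0, 1, 0]

features = [featurize(t) for t in texts]                    # step 2
weights, bias = train_logistic_regression(features, labels)  # step 3
predictions = [predict(weights, bias, x) for x in features]
```

Only the small weight vector is learned here; the featurizer is never updated, which is what keeps the target model cheap to train and store.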
  • 19. How does transfer learning fix deep learning’s problems? Training data requirements: ● Pre-trained representations → simpler models → less training data Memory Requirements: ● A single copy of the base model can fuel many transfer models ● Target models have thousands rather than millions of parameters ● Target model size measured in KBs rather than GBs Training Time Requirements: ● Target model training takes seconds rather than days
  • 20. HBO’s Silicon Valley “Not Hotdog” app Transfer learning for computer vision for “practical” application
  • 21. Transfer learning for NLP vs transfer learning for computer vision ● More variety in types of target tasks (entity extraction, classification, seq. labeling) ● More variety in input data (source language, field-specific terminology) ● No clear “ImageNet” equivalent -- lack of large, generic, labeled corpora ● Lack of consensus on what source tasks produce good representations
  • 23. Source model is the single most important variable. Keep source model and target model well-aligned when possible ● Source vocabulary should be aligned with target vocabulary ● Source task should be aligned with target task. Good: product review sentiment → product review category. Good: hotel ratings → restaurant ratings. Less good: product review sentiment → biology paper classification. [Diagram legend: source models and target tasks drawn as shapes; shape ≅ vocabulary, color ≅ task type.]
  • 24. What source tasks produce good, general representations? ● Natural language inference ○ Are two sentences in agreement, disagreement, or neither? ● Machine translation ○ English → French ● Multi-task learning ○ Learning to solve many supervised problems at once ● Language modeling ○ Learning to model the distribution of natural language. ○ Predicting the next word in a sequence given context
  • 25. Keep target models simple ● Limiting model complexity is a strong implicit regularizer ● Logistic regression goes a long way ● Use L2 regularization / dropout as additional regularization
  • 26. Consider second-order optimization methods ● Limited training data means transfer learning calls for a simple model with few parameters ● L-BFGS is usually overlooked in deep learning because it scales poorly with the number of parameters and examples ● L-BFGS performs well in practice for transfer learning applications. First-order methods: move a step in the direction of the gradient. Second-order methods: move to the minimum of a second-order approximation of the curve. [Diagram legend: weight update, approximation of loss surface, true loss surface.]
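The difference between the two kinds of update can be seen on a one-dimensional quadratic loss. This is a sketch under stated assumptions: the loss function, starting point, and learning rate are made up for illustration, and the second-order update shown is a plain Newton step using the exact second derivative, not L-BFGS itself (which builds an approximation of the curvature from past gradients).

```python
def loss(w):
    # Toy quadratic loss with its minimum at w = 3 (assumed for illustration).
    return 2.0 * (w - 3.0) ** 2 + 1.0

def gradient(w):
    return 4.0 * (w - 3.0)

def second_derivative(w):
    return 4.0

w0 = 0.0
learning_rate = 0.1

# First-order update: a fixed-size step in the direction of the negative gradient.
first_order = w0 - learning_rate * gradient(w0)

# Second-order (Newton) update: jump to the minimum of the local quadratic
# approximation. The loss is itself quadratic, so one step reaches the minimum.
second_order = w0 - gradient(w0) / second_derivative(w0)
```

On this toy problem the first-order step moves only part of the way toward the minimum, while the Newton step lands on it directly. On real loss surfaces the quadratic approximation is only local, so several steps are still needed.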
  • 27. When comparing approaches, measure performance variance ● Limited labeled training data → limited test and validation data ● High variance across CV splits may correspond with poor generalization. [Plots: model accuracy vs. training data volume.]
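One way to measure that variance is to report the spread of scores across cross-validation folds alongside the mean. The sketch below uses only the standard library; the one-dimensional data, the threshold classifier, and the fold assignment are all toy assumptions chosen to keep the example short.

```python
from statistics import mean, stdev

def k_fold_indices(n, k):
    """Assign example i to fold i % k (deterministic; shuffle first in practice)."""
    return [[i for i in range(n) if i % k == fold] for fold in range(k)]

def fit_threshold(xs, ys):
    """Toy target model: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2.0

def accuracy(threshold, xs, ys):
    correct = sum(int((x > threshold) == (y == 1)) for x, y in zip(xs, ys))
    return correct / len(xs)

def cross_val_scores(xs, ys, k=4):
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        threshold = fit_threshold(train_x, train_y)
        scores.append(accuracy(threshold,
                               [xs[i] for i in test_idx],
                               [ys[i] for i in test_idx]))
    return scores

# Assumed toy data: one feature per example, binary labels.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.6, 0.4]
ys = [0, 1, 0, 1, 0, 1, 0, 1]

scores = cross_val_scores(xs, ys, k=4)
print("mean accuracy:", mean(scores), "std across folds:", stdev(scores))
```

With so few examples per fold, a single misclassified example swings a fold's accuracy by a large amount, which is exactly why the standard deviation belongs next to the mean.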
  • 28. “Classic” machine learning problems are exaggerated at small training dataset sizes ● Ex: class imbalance can lead to degenerate models that predict only a single class -- consider oversampling / undersampling ● Ex: unrepresentative dataset -- small sample sizes increase the likelihood that a model will pick up on spurious correlations. [Figure: class balance.]
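Random oversampling, one of the remedies mentioned for class imbalance, can be sketched as follows. The example data and the `oversample` helper are illustrative assumptions; minority-class examples are duplicated at random until every class matches the size of the largest one.

```python
import random
from collections import Counter

def oversample(examples, labels, seed=0):
    """Duplicate minority-class examples until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())

    out_x, out_y = [], []
    for y, items in by_class.items():
        # Sample with replacement to fill the gap up to the largest class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Assumed toy data: five examples of class 0, one of class 1.
examples = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 0, 0, 1]

balanced_x, balanced_y = oversample(examples, labels)
class_counts = Counter(balanced_y)
```

Oversampling should be applied to the training split only; duplicating examples before splitting would leak copies of training examples into the validation data.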
  • 29. “Feature engineering” has its place ● Modern day “feature engineering” takes the form of model architecture decisions ● Ex: when trying to determine whether or not a job description and a resume are a good match, use the absolute difference of the two feature representations as input to the model. [Diagram: job description and resume representations combined into the model input.]
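The job description / resume example amounts to an element-wise absolute difference of two feature vectors. A minimal sketch, assuming both documents were featurized by the same source model into vectors of equal length (the numbers below are made up):

```python
def match_features(job_vec, resume_vec):
    """Element-wise absolute difference of two feature representations."""
    if len(job_vec) != len(resume_vec):
        raise ValueError("both representations must have the same dimension")
    return [abs(j - r) for j, r in zip(job_vec, resume_vec)]

# Assumed toy feature vectors from a shared featurizer.
job = [0.2, 0.9, 0.4]
resume = [0.1, 0.5, 0.4]

model_input = match_features(job, resume)
```

The absolute difference is symmetric, so the model sees the same input regardless of which document is passed first, and small values directly signal that the two documents agree on a feature.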
  • 31. Enso: provides a standard interface for the benchmarking of embeddings and transfer learning methods for NLP tasks.
  • 32. The need: ● Eliminate human “overfitting” of hyperparameters to values that work well for a single task ● Ensure higher fidelity baselines ● Benchmark on many datasets to better understand where an approach is effective
  • 33. Enso workflow: ● Download 2 dozen included datasets for benchmarking on diverse tasks ● “Featurize” all examples in the dataset via a pre-trained source model ● Train target model using the featurized training examples as inputs ● Repeat process for all combinations of featurizers, dataset sizes, target model architectures, etc. ● Visualize and manually inspect results
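The "repeat for all combinations" step is a grid over experiment settings. The sketch below shows only that loop structure; it does not use or reproduce Enso's API, and every name in it (`featurizers`, `train_sizes`, `target_models`, `run_experiment`) is a hypothetical placeholder.

```python
from itertools import product

# Hypothetical experiment settings, not Enso configuration values.
featurizers = ["featurizer_a", "featurizer_b"]
train_sizes = [50, 100, 500]
target_models = ["logistic_regression", "small_mlp"]

def run_experiment(featurizer, train_size, target_model):
    """Placeholder: a real run would featurize, train, and score here."""
    return {
        "featurizer": featurizer,
        "train_size": train_size,
        "target_model": target_model,
        "accuracy": None,  # filled in by a real evaluation
    }

# One run per combination of featurizer, dataset size, and target model.
results = [run_experiment(*combo)
           for combo in product(featurizers, train_sizes, target_models)]
```

Collecting every run in one list of records makes it straightforward to group results by any setting when visualizing them afterwards.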
  • 34. > python -m enso.download
        > python -m enso.featurize
        > python -m enso.experiment
        > python -m enso.visualize
  • 35. Comparison of transfer model architectures
  • 39. Recent Papers of Note: ● “Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning” by Subramanian, et al. ● “Fine-tuned Language Models for Text Classification” by Howard, Ruder ● “Deep contextualized word representations” by Peters, et al.
  • 40. “Deep contextualized word representations” by Peters, et al. (AllenAI) ● Language modeling is a good objective for the source model ● Many different layers of representation are useful; attend over the layers of representation and learn to weight them on a per-task basis ● Per-token representations apply to a broader range of tasks than a vanilla document representation. “Embeddings from Language Models” (ELMo) layer weights are learned on a variety of target tasks
  • 41. [Diagram: shared encoder (“featurizer”) with input and two hidden layers, weighted 0.5, 0.2, and 0.3.] Each colored block is a “representation” or “feature vector”. Each representation is weighted, then summed, to produce a feature vector of the same dimensions.
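The weighted sum over layer representations can be written out directly. In this sketch the three layer vectors are made-up toy values and the weights 0.5, 0.2, 0.3 are taken from the slide; the `softmax` helper illustrates one common way to keep learned layer weights positive and summing to one, which is an assumption about how such weights might be parameterized rather than a description of the slide.

```python
import math

def softmax(scores):
    """Turn unconstrained scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_layers(layer_vectors, layer_weights):
    """Weight each layer's representation, then sum element-wise."""
    dim = len(layer_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(layer_weights, layer_vectors))
            for i in range(dim)]

# Assumed toy representations from three layers of a shared encoder.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.2, 0.3]

mixed = mix_layers(layers, weights)

# Before any task-specific training, equal scores give equal layer weights.
initial_weights = softmax([0.0, 0.0, 0.0])
```

The output has the same dimension as each individual layer, so the target model's input size does not grow with the number of layers being mixed.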
  • 42. Bidirectional LSTM (source: Chris Olah's personal blog)
  • 43. Source + Task RNNs: Source RNN (frozen weights), Task RNN (task-specific arch.), Input + FW + BW (learned avg.)
  • 46. ● Small data problems are more common than big data problems. ● Transfer learning enables taking advantage of deep learning without massive labeled corpora. ● When in doubt, trend toward simplicity.
  • 48. Other Resources for Transfer Learning on NLP tasks ● http://ruder.io, Sebastian Ruder’s blog ● https://arxiv.org/list/cs.CL (arXiv Computation and Language) ● https://fast.ai (Making neural nets uncool again)
  • 49. “Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning” by Subramanian, et al. ● Learning document representations using a bidirectional LSTM trained on a multi-task learning objective ● Tasks included skip-thought vectors, neural machine translation, parse tree construction, and natural language inference ● Diverse source tasks led to document representations that produced strong empirical results when applied to a dozen different target tasks. [Diagram: a shared input feeding Task 1 and Task 2.]
  • 50. “Fine-tuned Language Models for Text Classification” by Howard, Ruder ● Outlines a “bag of tricks” for applying transfer learning to NLP ● Language modeling is an effective source task ● Fine-tune the source model rather than using a static representation ● Use separate learning rate per layer to keep the first layer relatively static while updating the final layer more
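A per-layer learning rate schedule of the kind described in the last bullet can be sketched as a geometric decay from the final layer down to the first. The `decay` factor, layer count, and top learning rate below are illustrative assumptions, not values taken from the slide.

```python
def layer_learning_rates(num_layers, top_lr, decay):
    """Learning rate per layer, first (lowest) layer first.

    The final layer gets top_lr; each layer below it gets the rate of the
    layer above divided by `decay`, so early layers change the least.
    """
    return [top_lr / (decay ** (num_layers - 1 - layer))
            for layer in range(num_layers)]

# Assumed settings: three layers, final layer at 0.01, halving per layer down.
rates = layer_learning_rates(num_layers=3, top_lr=0.01, decay=2.0)
```

With these settings the first layer is updated four times more slowly than the final layer, which keeps the general-purpose lower-level features close to their pre-trained values while the task-specific top layer adapts.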