Small Data for Big Problems
Practical Transfer Learning for NLP
Who Am I?
• Founder and CTO, Indico
• Research done at Olin College of Engineering
• Indico focuses on Intelligent Process Automation for Unstructured Content
• Leverages Indico innovation in Transfer Learning for text and image content
Agenda
• Overview of Traditional Approaches to
Feature Engineering in NLP
• Introduction to Transfer Learning and
Text Embeddings
• Word Embeddings vs. Text
Embeddings
• Takeaways and Resources
Assumed
Knowledge
• Traditional NLP Basics (e.g. tf-idf
vectors)
• Traditional Data Science Basics (e.g.
Logistic Regression)
• Generic Math Background (e.g. Vector
Spaces)
The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
Name
Issue(s)
• Out of Vocabulary
Traditional Solution(s)
• Tf-idf
• Soundex/NYSIIS encoding
• Ignore – low algorithmic value
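The out-of-vocabulary issue can be made concrete with a minimal sketch. The tiny training corpus, the whitespace tokenizer, and the function names below are illustrative assumptions, not part of the talk. The point is only that a tf-idf vocabulary fixed at training time has no slot for an unseen name such as "Malkovitch".

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization with basic punctuation stripping.
    return [t.strip(".,").lower() for t in text.split() if t.strip(".,")]

def fit_tfidf(docs):
    """Learn a vocabulary and smoothed idf weights from the training documents."""
    doc_tokens = [set(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*doc_tokens))
    n = len(docs)
    idf = {w: math.log((1 + n) / (1 + sum(w in d for d in doc_tokens))) + 1 for w in vocab}
    return vocab, idf

def transform(text, vocab, idf):
    """Return (tf-idf vector, tokens that the fixed vocabulary cannot represent)."""
    counts = Counter(tokenize(text))
    vector = [counts[w] * idf[w] for w in vocab]
    oov = [t for t in counts if t not in idf]
    return vector, oov

# Hypothetical training notes; none of them mention the new patient or town.
train_docs = [
    "The patient plays tennis twice a week.",
    "The patient reports soreness in the elbow.",
]
vocab, idf = fit_tfidf(train_docs)
vector, oov = transform("John Malkovitch plays tennis in Winchester.", vocab, idf)
print(oov)  # tokens silently dropped from the feature vector
```

Only the tokens already seen in training contribute to the vector; the name and the location vanish, which is the failure mode the slide labels "Out of Vocabulary".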
Feature(s)
• Gender
• Location
• Age
Issue(s)
• Local Context: his birthday vs. his daughter’s birthday
• Brittle gender detection
• Location detection
Traditional Solution(s)
• Tf-idf
• Hand-coded features (e.g. gender)
• Location dictionary
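To illustrate why hand-coded features are brittle with respect to local context, here is a small sketch. The regular expression and the `extract_age` helper are hypothetical, written only for this example. The rule reads an age off any "Nth birthday" mention and cannot tell whose birthday it is.

```python
import re

# Hypothetical hand-coded rule: treat any "<N>th birthday" mention as the patient's age.
BIRTHDAY = re.compile(r"(\d+)(?:st|nd|rd|th) birthday")

def extract_age(note):
    """Return the first ordinal that precedes 'birthday', or None if there is none."""
    match = BIRTHDAY.search(note)
    return int(match.group(1)) if match else None

print(extract_age("His 60th birthday is in two weeks."))             # the patient's age
print(extract_age("His daughter's 16th birthday is in two weeks."))  # someone else's age
```

Both calls return a number, but only the first one describes the patient. Fixing this with more rules is exactly the labor-intensive path the slide warns about.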
Feature(s)
• Activity
• Prior Affliction/Treatment
• Travel
Issue(s)
• Linguistic Context (Semantics)
• Error-prone parse trees
• Maintaining the lexicon
Traditional Solution(s)
• Tf-idf
• Parse trees (soreness ->
elbow)
• Domain-specific lexicon
The Problem With Text
Problem | Traditional Solution | Traditional Problem
Linguistic Context | Stemming; Synonym sets; Lexicons | Brittle; Labor-intensive; Messy real-world data
Local Context | Parse trees; N-grams; Phrase lexicon | Inaccurate parsing; Limited context; Messy real-world data
Out of Vocabulary Issues | Lemmatization; Expanded vocabulary; Ignore | Computationally expensive; Diminishing returns; Messy real-world data
Problems with Small Data
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Enter Embeddings: Transfer Learning
What is an Embedding?
[Figure: an Embedding Method (e.g. Word2Vec), informed by Linguistic Context (e.g. Wikipedia), maps Text Space (e.g. English) to Embedding Space (e.g. R^300), producing vectors such as [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …]]
Pitfalls
• Sufficient, Diverse Linguistic Context
• Clean Test/Train Splits
• The Curse of Dimensionality
• Effective Benchmarking
How do Embeddings Work?
• Meaning is “encoded” into the embedding space
• Individual dimensions are not human interpretable
• Embedding method learns by examining large corpora of generic language
• Goal is accurate language representation as a proxy for downstream performance
[Figure: King - man + woman ≈ Queen (Royalty)]
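The King/Queen arithmetic can be sketched with a few toy vectors. The 3-dimensional values below are made up by hand for illustration (an assumption); real embeddings are learned from large corpora and typically have hundreds of dimensions.

```python
import math

# Toy, hand-made 3-d vectors (assumption); not the output of any trained model.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "tennis": [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))
```

No single dimension here means "royalty" or "gender" on its own; the relationship shows up only in the geometry, which matches the point that individual dimensions are not human interpretable.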
“Word” Embeddings
Examples
• Word2vec
• GloVe
• fastText
In Practice
Token Value
“great” [0.1, 0.3, …]
… …
Training
CBOW: The quick brown fox _____ over the lazy dog
Skip Gram: ___ ___ ____ ___ jumps ___ __ ___ ___
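The two training setups differ only in what is predicted from what. The sketch below generates training examples for each; the window size of 2 and the function names are assumptions made for illustration, and no actual model is trained.

```python
def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from its surrounding context words."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """Skip-gram: predict each surrounding context word from the center word."""
    return [(target, ctx)
            for context, target in cbow_examples(tokens, window)
            for ctx in context]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(cbow_examples(tokens)[4])       # context words -> "jumps"
print(skipgram_examples(tokens)[:3])  # center word -> one context word at a time
```

CBOW yields one example per position, while skip-gram yields one example per (center, context) pair, so the same sentence produces more skip-gram examples.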
Do They Really Preserve Algorithmic Value?
• Embeddings generally
outperform raw text at low data
volumes
• Leveraging large, generic text
corpora improves
generalizability
• GloVe is 4-year-old tech. Embeddings have improved drastically. Text has not.
[Figure: GloVe Benchmark (Movie Review Sentiment Analysis). Accuracy (y-axis, 0.5 to 0.9) vs. Number of Data Points (x-axis, 50 to 500) for tf-idf and GloVe.]
Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV within which Logistic Regression hyperparameters are optimized. Generated using Enso.
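The reported protocol (repeated random test/train splits with cross-validated hyperparameter search) can be sketched as follows. This is one plausible reading of the description, not Enso's actual implementation: the function names, the 20% test fraction, and the toy threshold "model" are all assumptions. The key property is that hyperparameters are selected using the training portion only.

```python
import random
from statistics import mean

def k_folds(items, k):
    """Split items into k disjoint folds of near-equal size."""
    return [items[i::k] for i in range(k)]

def cross_val_score(train, fit, score, param, k=5):
    """Average validation score of one hyperparameter value over k folds."""
    folds = k_folds(train, k)
    scores = []
    for i, held_out in enumerate(folds):
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(score(fit(rest, param), held_out))
    return mean(scores)

def benchmark(data, fit, score, params, runs=5, test_fraction=0.2, seed=0):
    """Average test score over several random test/train splits."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        # Model selection sees the training portion only; the test set stays untouched.
        best = max(params, key=lambda p: cross_val_score(train, fit, score, p))
        results.append(score(fit(train, best), test))
    return mean(results)

# Toy stand-ins (assumptions): the "model" is just a decision threshold on one number.
def fit(train, threshold):
    return threshold

def accuracy(model, examples):
    return sum((x > model) == y for x, y in examples) / len(examples)

data = [(i / 100, i / 100 > 0.5) for i in range(100)]
result = benchmark(data, fit, accuracy, params=[0.1, 0.5, 0.9])
print(result)
```

In a real benchmark, `fit` would train Logistic Regression on tf-idf or embedding features and `params` would be its regularization settings; the splitting logic stays the same.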
Text Embeddings
Examples
• Doc2vec
• ELMo
• ULMFiT
In Practice
Often built on top of pre-trained word embeddings
Training
[Figure: each token of "The quick brown fox jumps over the lazy" is mapped to a vector such as [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …]. Two training targets are shown: Language (next word: "dog") and Supervised (label: True).]
Text Embeddings
CNN-Style
[Figure: the token vectors for "The quick brown fox jumps over the lazy" are combined by a CNN-style model into a Prediction]
Example: https://arxiv.org/pdf/1408.5882.pdf
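A CNN-style text embedding can be sketched in a few lines: filters slide over consecutive word vectors, and max-over-time pooling keeps one number per filter, so the output size does not depend on sentence length. The toy vectors, the filter weights, and the choice of ReLU below are assumptions for illustration; a real model learns its filters.

```python
def conv_max_pool(word_vectors, filters, bias=0.0):
    """Slide each filter over word windows, apply ReLU, keep the max per filter.

    Assumes the sentence has at least as many words as the widest filter.
    """
    features = []
    for filt in filters:                       # filt holds one weight vector per word in the window
        width = len(filt)
        activations = []
        for start in range(len(word_vectors) - width + 1):
            window = word_vectors[start:start + width]
            total = bias + sum(w * x
                               for weights, vec in zip(filt, window)
                               for w, x in zip(weights, vec))
            activations.append(max(0.0, total))  # ReLU
        features.append(max(activations))        # max-over-time pooling
    return features

# Toy 2-d word vectors for a 4-word sentence (made-up values).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two hand-written filters, each spanning 2 consecutive words.
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
print(conv_max_pool(words, filters))
```

The resulting fixed-length feature list is what a final classification layer would consume to produce the prediction.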
Text Embeddings
RNN-Style
[Figure: the token vectors for "The quick brown fox jumps over the lazy" are fed in sequence. Panel labels: Output, Memory (starting from an all-zero vector [0.0, 0.0, …]), σ at each step, Prediction.]
Example: https://arxiv.org/pdf/1802.05365.pdf
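The RNN-style idea of carrying a memory across tokens can be sketched with a plain recurrent cell. This is a deliberate simplification and an assumption on my part: the linked example uses LSTM-based models, whereas the cell below is a single sigmoid layer with made-up weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_encode(word_vectors, w_in, w_mem):
    """Run a simple recurrent cell over the sequence and return the final memory."""
    memory = [0.0] * len(w_mem)                # memory starts as all zeros
    for vec in word_vectors:
        # Each new memory unit mixes the current input with the previous memory.
        memory = [
            sigmoid(sum(w * x for w, x in zip(w_in[i], vec)) +
                    sum(w * m for w, m in zip(w_mem[i], memory)))
            for i in range(len(memory))
        ]
    return memory

# Made-up weights for a cell with 2 memory units and 2-d inputs.
w_in = [[1.0, -1.0], [0.5, 0.5]]
w_mem = [[0.5, 0.0], [0.0, -0.5]]
sentence = [[1.0, 0.0], [0.0, 1.0]]

forward = rnn_encode(sentence, w_in, w_mem)
backward = rnn_encode(sentence[::-1], w_in, w_mem)
print(forward, backward)
```

Unlike an average of word vectors, the final memory depends on word order, which is how this family of models adds local context.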
The Power of Context
“We used a bytepair encoding (BPE) vocabulary… significantly improving upon the state of the art in 9 out of the 12 tasks studied”
- Improving Language Understanding by Generative Pre-Training*
* https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
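Byte-pair encoding avoids out-of-vocabulary tokens by building words from frequent subword pieces. The sketch below follows the general merge-learning idea; the toy corpus and helper names are assumptions, and it is not the tokenizer used in the quoted paper.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn merges by repeatedly fusing the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Segment a word, seen or unseen, by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Toy corpus (made up): "lowest" never appears in it.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, num_merges=4)
print(encode("lowest", merges))
```

An unseen word is still representable because, in the worst case, it falls back to individual characters rather than an unknown token.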
Do They Really Preserve Algorithmic Value?
• Newer transfer learning
techniques have made deep
learning at low data volumes
tractable
• Even when operating on top of byte-pair encodings, sufficient context is retained to achieve state-of-the-art performance
• 4x error reduction over tf-idf
[Figure: Finetune Benchmark (Movie Review Sentiment Analysis). Accuracy (y-axis, 0.5 to 0.9) vs. Number of Data Points (x-axis, 50 to 500) for tf-idf, GloVe, and Finetune.]
Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV within which Logistic Regression hyperparameters are optimized. Generated using Enso.
Takeaways
• At low data volumes embeddings
drastically improve accuracy via
transfer learning
• The transfer learning space moves very quickly. Adoption of GloVe is still very low, yet it is already out of date
• This is just basic framing. Practical use
of embeddings is more complex. See
our session at DSS to learn more
Resources
• Github library – Finetune
(https://github.com/indicodatasolutions/finetune)
• Github library – Enso
(https://github.com/indicodatasolutions/enso)
• Indico Machine learning newsletter
(indico.io)
• Deep Learning Book
(https://www.deeplearningbook.org/)
Questions?
• slater@indico.io
• Quora: https://www.quora.com/profile/Slater-Ryan-Victoroff
The Real Problem With Text
[Figure: workflow steps: Select Features, Optimize Hyperparameters, Test/Train Split, Train Model, Evaluate Errors and View Test Error. Panel labels: Feature Engineering?, Standard Data Science?, Overfitting, Test/Train Contamination.]
Manual feature engineering leads to inaccurate perceptions of performance

Small Data for Big Problems: Practical Transfer Learning for NLP

  • 1. Small Data for Big Problems Practical Transfer Learning for NLP
  • 2. Who Am I? • Founder, CTO Indico • Research done at Olin College of Engineering • Indico focuses Intelligent Process Automation for Unstructured Content • Leverages Indico innovation in Transfer Learning for text and image content
  • 3. Agenda • Overview of Traditional Approaches to Feature Engineering in NLP • Introduction to Transfer Learning and Text Embeddings • Word Embeddings vs. Text Embeddings • Takeaways and Resources
  • 4. Assumed Knowledge • Traditional NLP Basics (e.g. tf-idf vectors) • Traditional Data Science Basics (e.g. Logistic Regression) • Generic Math Background (e.g. Vector Spaces)
  • 5. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Feature(s) Name
  • 6. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Feature(s) Name Traditional Solution(s) • Tf-idf • Soundex/NYSIIS encoding • Ignore – low algorithmic value
  • 7. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Feature(s) Name Issues(s) • Out of Vocabulary Traditional Solution(s) • Tf-idf • Soundex/NYSIIS encoding • Ignore – low algorithmic value
  • 8. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Feature(s) • Gender • Location • Age
  • 9. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Feature(s) • Gender • Location • Age Traditional Solution(s) • Tf-idf • Hand-coded features (i.e. gender) • Location dictionary
  • 10. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Feature(s) • Gender • Location • Age Issues(s) • Local Context: His birthday vs his daughter’s birthday • Brittle gender detection • Location detection Traditional Solution(s) • Tf-idf • Hand-coded features (i.e. gender) • Location dictionary
  • 11. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Feature(s) • Activity • Prior Affliction/Treatment • Travel
  • 12. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Feature(s) • Activity • Prior Affliction/Treatment • Travel Traditional Solution(s) • Tf-idf • Parse trees (soreness -> elbow) • Domain-specific lexicon
  • 13. The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Feature(s) • Activity • Prior Affliction/Treatment • Travel Issues(s) • Linguistic Context (Semantics) • Error-prone parse trees • Maintaining the lexicon Traditional Solution(s) • Tf-idf • Parse trees (soreness -> elbow) • Domain-specific lexicon
  • 14. The Problem With Text Problem Traditional Solution Traditional Problem Linguistic Context • Stemming • Synonym sets • Lexicons • Brittle • Labor-intensive • Messy real-world data Local Context • Parse trees • N-grams • Phrase lexicon • Inaccurate parsing • Limited Context • Messy real-world data Out of Vocabulary Issues • Lemmatization • Expanded vocabulary • Ignore • Computationally expensive • Diminishing returns • Messy real-world data
  • 15. Problems with Small Data Add Linguistic Context (Semantics) Add Local Context Prevent Out of Vocabulary Issues
  • 17. What is an Embedding? Text Space (e.g. English) Embedding Space (e.g. R300) Embedding Method (e.g. Word2Vec) 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 …
  • 18. What is an Embedding? Text Space (e.g. English) Embedding Space (e.g. R300) 0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 … Embedding Method (e.g. Word2Vec) Linguistic Context (e.g. Wikipedia)
  • 19. Pitfalls • Sufficient, Diverse Linguistic Context • Clean Test/Train Splits • The Curse of Dimensionality • Effective Benchmarking
  • 20. King Queen - man + woman (Royalty) How do Embeddings Work? • Meaning is “encoded” into the embedding space • Individual dimensions are not human interpretable • Embedding method learns by examining large corpora of generic language • Goal is accurate language representation as a proxy for downstream performance
  • 22. “Word” Embeddings Token Value “great” [0.1, 0.3, …] … … Examples In Practice • Word2vec • GloVe • fastText
  • 23. “Word” Embeddings Token Value “great” [0.1, 0.3, …] … … Examples In Practice • Word2vec • GloVe • fastText Training • CBOW: fill in the blank in “The quick brown fox _____ over the lazy dog” • Skip Gram: given “jumps”, fill in the surrounding words “___ ___ ____ ___ jumps ___ __ ___ ___”
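The two training setups on this slide differ only in which side of the (context, word) relationship is predicted. The following sketch builds the training examples for each; the function names and the window size of 2 are assumptions for illustration, and the model that would consume these examples is left out.

```python
def cbow_examples(tokens, window=2):
    # CBOW: predict the center word from its surrounding context words.
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    # Skip-gram: predict each surrounding context word from the center word.
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + 1 + window)
        for j in range(lo, hi):
            if j != i:
                examples.append((center, tokens[j]))
    return examples

tokens = "the quick brown fox jumps over the lazy dog".split()
cbow = cbow_examples(tokens)
skip = skipgram_examples(tokens)
```

For the center word “jumps”, `cbow` holds a single example with four context words as input, while `skip` holds four separate (center, context) pairs.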
  • 24. Do They Really Preserve Algorithmic Value? • Embeddings generally outperform raw text at low data volumes • Leveraging large, generic text corpora improves generalizability • This is 4-year-old tech. Embeddings have improved drastically. Text has not. Figure: GloVe Benchmark (Movie Review Sentiment Analysis), accuracy (0.5 to 0.9) vs. number of data points (50 to 500); series: tf-idf, GloVe. Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV, within which Logistic Regression hyperparameters are optimized. Generated using Enso.
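The benchmarking note on this slide describes a nested averaging scheme. Below is one possible reading of it as a stdlib-only sketch: 5 runs, each drawing a random sample of the requested size and averaging a 5-fold CV over it. The names `k_fold_splits`, `repeated_cv_accuracy`, and `majority_baseline` are hypothetical, the hyperparameter search inside each fold is omitted, and the baseline model is a stand-in for the Logistic Regression used in the talk. This is not Enso's implementation.

```python
import random
from statistics import mean

def k_fold_splits(items, folds):
    # Yield (train, held_out) pairs; each item is held out exactly once.
    for k in range(folds):
        held_out = items[k::folds]
        train = [x for i, x in enumerate(items) if i % folds != k]
        yield train, held_out

def repeated_cv_accuracy(examples, train_and_score, n_points, runs=5, folds=5, seed=0):
    rng = random.Random(seed)
    run_scores = []
    for _ in range(runs):
        # Each run draws a fresh random subset of n_points labeled examples.
        sample = rng.sample(examples, n_points)
        fold_scores = [train_and_score(tr, te) for tr, te in k_fold_splits(sample, folds)]
        run_scores.append(mean(fold_scores))
    return mean(run_scores)

def majority_baseline(train, test):
    # Hypothetical stand-in model: always predict the most common training label.
    labels = [label for _, label in train]
    guess = max(set(labels), key=labels.count)
    return mean(1.0 if label == guess else 0.0 for _, label in test)

# Synthetic (text, label) pairs, used only to exercise the protocol.
data = [("doc%d" % i, i % 4 == 0) for i in range(200)]
score = repeated_cv_accuracy(data, majority_baseline, n_points=100)
```

Averaging over several random samples matters most at the left edge of the chart, where a single draw of 50 examples can swing accuracy by several points.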
  • 25. Problems with Small Data Add Linguistic Context (Semantics) Add Local Context Prevent Out of Vocabulary Issues
  • 27. Text Embeddings Examples In Practice Often built on top of pre-trained word embeddings • Doc2vec • ELMo • ULMFiT
  • 28. Text Embeddings Examples In Practice Often built on top of pre-trained word embeddings • Doc2vec • ELMo • ULMFiT Training (figure): each token of “The quick brown fox jumps over the lazy” is mapped to a vector (0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 …); the training target is either Language (the next word, “dog”) or Supervised (a label, “True”)
  • 29. Text Embeddings CNN-Style (figure): each token of “The quick brown fox jumps over the lazy” is mapped to a vector (0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 …), and the vectors are combined into a Prediction. Example: https://arxiv.org/pdf/1408.5882.pdf
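As a rough illustration of the CNN-style approach, the sketch below slides one filter over consecutive word vectors, applies a ReLU, and keeps the maximum response (max-over-time pooling), which is the core operation in the paper linked on this slide. The function name, the two-dimensional toy vectors, and the filter weights are assumptions; a real model learns many filters of several widths and feeds the pooled features to a classifier.

```python
def conv_max_pool(word_vectors, filt, bias=0.0):
    # filt spans `width` consecutive word vectors; the sentence must have
    # at least `width` tokens.
    width = len(filt)
    activations = []
    for start in range(len(word_vectors) - width + 1):
        window = word_vectors[start:start + width]
        total = bias + sum(
            w * x
            for filt_row, vec in zip(filt, window)
            for w, x in zip(filt_row, vec)
        )
        activations.append(max(0.0, total))  # ReLU
    # Max-over-time pooling keeps the strongest response, wherever it occurred.
    return max(activations)

# Four tokens with two-dimensional toy embeddings.
sentence = [[0.1, 0.2], [0.8, 0.1], [0.3, 0.6], [0.8, 0.3]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],   # dim 0 of one token, then dim 1 of the next
    [[-1.0, 0.0], [1.0, 0.0]],  # an increase in dim 0 between adjacent tokens
]
features = [conv_max_pool(sentence, f) for f in filters]
```

Because of the pooling step, `features` has one value per filter regardless of sentence length, which is what makes it usable as a fixed-size text embedding.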
  • 30. Text Embeddings RNN-Style (figure): each token of “The quick brown fox jumps over the lazy” is mapped to a vector (0.1 0.2 0.8 0.1 0.3 0.6 0.8 0.3 …); recurrent units (σ) carry a Memory, initialized to 0.0 0.0 0.0 …, and emit an Output at each token, ending in a Prediction. Example: https://arxiv.org/pdf/1802.05365.pdf
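The recurrent picture on this slide can be sketched as a plain Elman-style update, where the new memory is a sigmoid of the current word vector and the previous memory. This is a deliberately simplified stand-in: the paper linked on the slide uses bidirectional LSTMs, and the weights, sizes, and function names below are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_step(memory, word_vec, w_in, w_mem):
    # memory[i] <- sigmoid(w_in[i] . word_vec + w_mem[i] . memory)
    return [
        sigmoid(
            sum(w * x for w, x in zip(w_in[i], word_vec))
            + sum(u * h for u, h in zip(w_mem[i], memory))
        )
        for i in range(len(memory))
    ]

def encode(sentence, w_in, w_mem, size):
    memory = [0.0] * size  # memory starts at zero, as in the figure
    for word_vec in sentence:
        memory = rnn_step(memory, word_vec, w_in, w_mem)
    return memory  # the final memory serves as the text embedding

# Hand-picked toy weights; a trained model would learn these.
w_in = [[1.0, -1.0], [0.5, 0.5]]
w_mem = [[0.5, 0.0], [0.0, 0.5]]
sentence = [[0.1, 0.2], [0.8, 0.1]]

forward = encode(sentence, w_in, w_mem, size=2)
backward = encode(sentence[::-1], w_in, w_mem, size=2)
```

Unlike an average of word vectors, the result depends on token order: `forward` and `backward` differ even though they see the same two tokens.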
  • 31. Add Linguistic Context (Semantics) Add Local Context Prevent Out of Vocabulary Issues Problems with Small Data
  • 32. The Power of Context “We used a bytepair encoding (BPE) vocabulary… significantly improving upon the state of the art in 9 out of the 12 tasks studied” - Improving Language Understanding by Generative Pre-Training* * https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
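Byte-pair encoding avoids out-of-vocabulary tokens by building words from learned subword units. The sketch below follows the widely used greedy merge procedure: start from characters, repeatedly merge the most frequent adjacent pair. The function names, the `</w>` end-of-word marker, and the toy word counts are assumptions for illustration; this is not the tokenizer used in the quoted paper.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, num_merges):
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges

def segment(word, merges):
    # Apply the learned merges in order; unseen words fall back to smaller units.
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_counts, num_merges=4)
pieces = segment("lowest", merges)
```

The word “lowest” never appears in `word_counts`, yet `segment` still represents it using pieces learned from the other words, so nothing is dropped as out of vocabulary.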
  • 33. Problems with Small Data Add Linguistic Context (Semantics) Add Local Context Prevent Out of Vocabulary Issues
  • 34. Do They Really Preserve Algorithmic Value? • Newer transfer learning techniques have made deep learning at low data volumes tractable • Even when operating on top of byte-pair encodings, sufficient context is retained to achieve state-of-the-art performance • 4x error reduction over tf-idf Figure: Finetune Benchmark (Movie Review Sentiment Analysis), accuracy (0.5 to 0.9) vs. number of data points (50 to 500); series: tf-idf, GloVe, Finetune. Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV, within which Logistic Regression hyperparameters are optimized. Generated using Enso.
  • 35. Takeaways • At low data volumes, embeddings drastically improve accuracy via transfer learning • The transfer learning space moves very quickly. Adoption of GloVe is still very low, yet GloVe is already out of date • This is just basic framing. Practical use of embeddings is more complex. See our session at DSS to learn more
  • 36. Resources • GitHub library – Finetune (https://github.com/indicodatasolutions/finetune) • GitHub library – Enso (https://github.com/indicodatasolutions/enso) • Indico machine learning newsletter (indico.io) • Deep Learning Book (https://www.deeplearningbook.org/)
  • 37. Questions? • slater@indico.io • Quora: https://www.quora.com/profile/Slater-Ryan-Victoroff
  • 38. The Real Problem With Text Select Features Optimize Hyperparameters Test/Train Split Train Model Evaluate Errors and View Test Error Feature Engineering? Standard Data Science?
  • 39. The Real Problem With Text Select Features Optimize Hyperparameters Test/Train Split Train Model Evaluate Errors and View Test Error Feature Engineering? Standard Data Science? Overfitting Test/Train Contamination
  • 40. The Real Problem With Text Select Features Optimize Hyperparameters Test/Train Split Train Model Evaluate Errors and View Test Error Feature Engineering? Standard Data Science? Overfitting Test/Train Contamination Manual feature engineering leads to inaccurate perceptions of performance