Despite all the advances in using AI and machine learning to create value from structured data, enterprises are not seeing the same benefits and ROI from unstructured content: all the text, images, documents, contracts, and customer interactions that make up more than 80% of the data in most organizations. Traditional keyword-based approaches, including taxonomies, classifiers, expert systems, and pre-trained dictionary-based systems, are simply too complex, too inflexible, and too expensive to maintain. It’s time for a new approach.
In this webinar, Indico’s Founder & CTO Slater Victoroff discusses modern transfer learning techniques for NLP to help you avoid common pitfalls when working in low-data environments.
Small Data for Big Problems: Practical Transfer Learning for NLP
1. Small Data for Big
Problems
Practical Transfer
Learning for NLP
2. Who Am I?
• Founder and CTO, Indico
• Research done at Olin College of Engineering
• Indico focuses on Intelligent Process Automation for unstructured content
• Leverages Indico’s innovation in transfer learning for text and image content
3. Agenda
• Overview of Traditional Approaches to
Feature Engineering in NLP
• Introduction to Transfer Learning and
Text Embeddings
• Word Embeddings vs. Text
Embeddings
• Takeaways and Resources
4. Assumed
Knowledge
• Traditional NLP basics (e.g. tf-idf vectors)
• Traditional data science basics (e.g. logistic regression)
• Generic math background (e.g. vector spaces)
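For readers wanting a concrete refresher on the assumed tf-idf basics, here is a minimal, illustrative sketch in pure Python (not the exact formulation used by production libraries, which typically add smoothing and normalization):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Simple tf-idf: tf = raw count / doc length, idf = log(N / doc frequency)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency per token
    for tokens in tokenized:
        df.update(set(tokens))
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] / len(tokens) * idf[t] for t in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors(["the cat sat", "the dog sat", "the cat ran"])
# "the" occurs in every document, so its idf (and tf-idf weight) is zero.
```

Note how a token appearing in every document gets weight zero: tf-idf downweights uninformative words automatically, which is its main advantage over raw counts.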
5. The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
Name
6. The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
Name
Traditional Solution(s)
• Tf-idf
• Soundex/NYSIIS encoding
• Ignore – low algorithmic value
7. The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
Name
Issue(s)
• Out of Vocabulary
Traditional Solution(s)
• Tf-idf
• Soundex/NYSIIS encoding
• Ignore – low algorithmic value
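The Soundex encoding mentioned above maps similar-sounding names to the same short code, which is one traditional way to handle misspelled names. A minimal sketch of classic American Soundex (real systems often prefer refined variants such as NYSIIS):

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus up to three digits."""
    mapping = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            mapping[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not reset the previous code
        code = mapping.get(ch, "")  # vowels map to "" and reset the code
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]  # pad with zeros to four characters

soundex("Robert")      # "R163"
soundex("Malkovitch")  # same code as soundex("Malkovich")
```

This illustrates both the appeal and the brittleness: spelling variants collapse to one code, but so do many unrelated names.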
8. The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
• Gender
• Location
• Age
9. The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
• Gender
• Location
• Age
Traditional Solution(s)
• Tf-idf
• Hand-coded features (e.g. gender)
• Location dictionary
10. The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
• Gender
• Location
• Age
Issue(s)
• Local Context: His birthday vs
his daughter’s birthday
• Brittle gender detection
• Location detection
Traditional Solution(s)
• Tf-idf
• Hand-coded features (e.g. gender)
• Location dictionary
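The brittleness of hand-coded gender features is easy to demonstrate. The heuristic below is hypothetical, invented for illustration: it counts gendered pronouns, works on simple sentences, and fails exactly where local context matters, because it cannot tell whose pronouns they are.

```python
import re

def guess_gender(text: str) -> str:
    """Hypothetical hand-coded feature: count gendered pronouns.
    Brittle by design; it has no notion of who each pronoun refers to."""
    male = len(re.findall(r"\b(?:he|him|his)\b", text, re.I))
    female = len(re.findall(r"\b(?:she|her|hers)\b", text, re.I))
    if male > female:
        return "male"
    if female > male:
        return "female"
    return "unknown"

# Works on the simple case:
guess_gender("He has been reporting soreness in his elbow.")  # "male"
# But local context defeats it: the patient here is male, yet pronouns
# referring to his visitor dominate the count.
guess_gender("She visited him. Her father, the patient, reports pain.")  # "female"
```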
11. The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
• Activity
• Prior Affliction/Treatment
• Travel
12. The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
• Activity
• Prior Affliction/Treatment
• Travel
Traditional Solution(s)
• Tf-idf
• Parse trees (soreness ->
elbow)
• Domain-specific lexicon
13. The Problem With Text
John Malkovitch plays tennis in Winchester. He
has been reporting soreness in his elbow. His
60th birthday is in two weeks. After he returns
from his birthday trip to Casablanca we will
recommend a steroid shot to reduce
inflammation.
Feature(s)
• Activity
• Prior Affliction/Treatment
• Travel
Issue(s)
• Linguistic Context (Semantics)
• Error-prone parse trees
• Maintaining the lexicon
Traditional Solution(s)
• Tf-idf
• Parse trees (soreness ->
elbow)
• Domain-specific lexicon
14. The Problem With Text
Problem                  | Traditional Solution                       | Traditional Problem
Linguistic Context       | Stemming; synonym sets; lexicons           | Brittle; labor-intensive; messy real-world data
Local Context            | Parse trees; n-grams; phrase lexicons      | Inaccurate parsing; limited context; messy real-world data
Out-of-Vocabulary Issues | Lemmatization; expanded vocabulary; ignore | Computationally expensive; diminishing returns; messy real-world data
17. What is an Embedding?
[Diagram: an embedding method (e.g. Word2Vec) maps items from a text space (e.g. English) into an embedding space (e.g. R^300), represented as dense numeric vectors such as [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …]]
18. What is an Embedding?
[Diagram: the embedding method (e.g. Word2Vec) again maps the text space (e.g. English) into the embedding space (e.g. R^300), now trained on a source of linguistic context (e.g. Wikipedia)]
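Operationally, an embedding is just a lookup table from tokens to dense vectors, plus a similarity measure. A sketch with invented 4-dimensional values (real models such as Word2Vec learn roughly 300 dimensions from data):

```python
import math

# Toy lookup table; these 4-dimensional values are invented for illustration.
embedding = {
    "tennis":   [0.9, 0.1, 0.0, 0.2],
    "squash":   [0.8, 0.2, 0.1, 0.3],
    "contract": [0.0, 0.9, 0.8, 0.1],
}

def cosine(u, v):
    """Cosine similarity: the standard closeness measure in embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Related concepts land closer together than unrelated ones:
cosine(embedding["tennis"], embedding["squash"])    # high
cosine(embedding["tennis"], embedding["contract"])  # low
```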
19. Pitfalls
• Sufficient, Diverse Linguistic Context
• Clean Test/Train Splits
• The Curse of Dimensionality
• Effective Benchmarking
20. How do Embeddings Work?
[Diagram: King - man + woman ≈ Queen, illustrating directions in embedding space (Royalty)]
• Meaning is “encoded” into the embedding space
• Individual dimensions are not human-interpretable
• The embedding method learns by examining large corpora of generic language
• The goal is accurate language representation as a proxy for downstream performance
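The famous king - man + woman ≈ queen arithmetic can be sketched with toy vectors. The two axes below are hand-picked for illustration; learned embeddings are high-dimensional and their individual axes are not interpretable.

```python
# Toy 2-dim vectors invented for illustration: axis 0 loosely encodes
# "royalty", axis 1 loosely encodes "maleness".
vocab = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
}

def analogy(a, b, c):
    """Return the vocab word nearest to vec(a) - vec(b) + vec(c),
    excluding the query words themselves (standard practice)."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    def dist(w):
        return sum((x - t) ** 2 for x, t in zip(vocab[w], target))
    return min((w for w in vocab if w not in (a, b, c)), key=dist)

analogy("king", "man", "woman")  # "queen"
```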
23. “Word” Embeddings
Lookup table: token → value, e.g. “great” → [0.1, 0.3, …]
Training objectives:
• CBOW predicts the center word from its context: The quick brown fox _____ over the lazy dog
• Skip-gram predicts the context from the center word: ___ ___ ___ ___ jumps ___ ___ ___ ___
Examples in practice: Word2vec, GloVe, fastText
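The CBOW and skip-gram objectives differ only in which side of the (center, context) relationship is predicted. Generating the training examples from raw text is straightforward; a sketch:

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs for skip-gram: predict each context
    word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context, center) examples for CBOW: predict the center word
    from its surrounding context."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

sentence = "the quick brown fox jumps over the lazy dog".split()
```

For "jumps" with a window of 2, CBOW sees the context ["brown", "fox", "over", "the"], while skip-gram emits one (center, context) pair per context word.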
24. Do They Really Preserve Algorithmic Value?
• Embeddings generally outperform raw text features at low data volumes
• Leveraging large, generic text corpora improves generalizability
• This is four-year-old technology; embeddings have improved drastically since, while raw text features have not
Reported numbers are the average of 5 runs over randomly sampled test/train splits, each reporting the average of a 5-fold cross-validation within which logistic regression hyperparameters are optimized. Generated using Enso.
[Chart: GloVe Benchmark (Movie Review Sentiment Analysis); accuracy (0.5 to 0.9) versus number of data points (50 to 500), comparing tf-idf and GloVe]
31. Problems with Small Data
• Add Linguistic Context (Semantics)
• Add Local Context
• Prevent Out-of-Vocabulary Issues
32. The Power of Context
“We used a bytepair encoding (BPE) vocabulary… significantly improving upon the state of the art in 9 out of the 12 tasks studied”
- Improving Language Understanding by Generative Pre-Training*
* https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
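Byte-pair encoding itself is simple: repeatedly merge the most frequent adjacent symbol pair, so common subwords become single vocabulary entries and out-of-vocabulary words can still be decomposed into known pieces. A minimal sketch in the style of Sennrich et al.:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent
    symbol pair across the corpus."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "low", "low"], num_merges=2)
# After two merges the frequent subword "low" is a single symbol.
```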
34. Do They Really Preserve Algorithmic Value?
• Newer transfer learning techniques have made deep learning at low data volumes tractable
• Even when operating on top of byte-pair encodings, sufficient context is retained to achieve state-of-the-art performance
• 4x error reduction over tf-idf
Reported numbers are the average of 5 runs over randomly sampled test/train splits, each reporting the average of a 5-fold cross-validation within which logistic regression hyperparameters are optimized. Generated using Enso.
[Chart: Finetune Benchmark (Movie Review Sentiment Analysis); accuracy (0.5 to 0.9) versus number of data points (50 to 500), comparing tf-idf, GloVe, and Finetune]
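Note that the “4x error reduction” claim is a statement about error rate, not accuracy. With illustrative numbers (chosen for the arithmetic, not read off the chart):

```python
def error_reduction(acc_baseline, acc_new):
    """Factor by which the error rate shrinks when accuracy improves."""
    return (1 - acc_baseline) / (1 - acc_new)

# Illustrative values: a baseline at 80% accuracy versus a fine-tuned
# model at 95% means error drops from 20% to 5%, a 4x reduction.
error_reduction(0.80, 0.95)  # 4.0
```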
35. Takeaways
• At low data volumes, embeddings drastically improve accuracy via transfer learning
• The transfer learning space moves very quickly: adoption of GloVe is still low, yet it is already out of date
• This is just basic framing; practical use of embeddings is more complex. See our session at DSS to learn more
38. The Real Problem With Text
[Workflow diagram: Select Features → Optimize Hyperparameters → Test/Train Split → Train Model → Evaluate Errors and View Test Error]
Feature Engineering? Standard Data Science?
39. The Real Problem With Text
[Workflow diagram: Select Features → Optimize Hyperparameters → Test/Train Split → Train Model → Evaluate Errors and View Test Error]
Feature Engineering? Standard Data Science?
• Overfitting
• Test/Train Contamination
40. The Real Problem With Text
[Workflow diagram: Select Features → Optimize Hyperparameters → Test/Train Split → Train Model → Evaluate Errors and View Test Error]
Feature Engineering? Standard Data Science?
• Overfitting
• Test/Train Contamination
Manual feature engineering leads to inaccurate perceptions of performance.
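The fix for test/train contamination is ordering: split first, then fit any feature selection or vocabulary on the training fold only, and merely apply it to the held-out fold. A minimal sketch with a hypothetical bag-of-words featurizer (all names here are invented for illustration):

```python
import random

def fit_vocabulary(docs):
    """'Feature selection' step: the vocabulary is learned ONLY here."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def featurize(docs, vocab):
    """Transform step: unseen tokens are dropped, never added."""
    return [[doc.lower().split().count(t) for t in vocab] for doc in docs]

docs = ["the cat sat", "the dog ran", "a bird flew", "the fish swam"]

# 1. Split FIRST...
random.seed(0)
idx = list(range(len(docs)))
random.shuffle(idx)
train_idx, test_idx = idx[:2], idx[2:]
train_docs = [docs[i] for i in train_idx]
test_docs = [docs[i] for i in test_idx]

# 2. ...then select features on the training fold only...
vocab = fit_vocabulary(train_docs)

# 3. ...and only apply them to the held-out fold.
X_test = featurize(test_docs, vocab)
```

Fitting the vocabulary (or any hand-tuned feature) on all the data before splitting leaks test-set information into the model and inflates the reported accuracy.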