Moshe Wasserblat, Intel AI, presents on Efficient Deep Learning in Natural Language Processing Production to an online NLP meetup audience, August 3, 2020. Visit https://www.meetup.com/NY-NLP for the New York NLP meetup.
2. BIO
● NICE Systems
● Led the Speech & Text Analytics research group
● First company to productize Speech2Text, ED, and Voice Biometrics in the call center
● INTEL
● Innovate for our products
● Collaborate with top academics
● Explore compute features that disrupt our HW
3. AGENDA
● Efficiency
● Large model intro
● Inference efficiency: models with lower computational complexity
● Examples
● SustaiNLP Workshop at EMNLP, Nov. 2020
● Data challenges
● Extensibility: addressing new domains with limited data and minimal supervision
● Weakly-supervised ABSA example
5. The advantages of BERT
1. Efficient transfer learning
Leverage a large model that was pre-trained on a generic task with a large amount of data, and fine-tune it for a specific task with a small amount of data: high accuracy from a small amount of labeled data.
2. Context embeddings
Produces vectors that represent each word in the context of a sentence, e.g. "bank" in "river bank" vs. "investment bank".
[Diagram: input sentence → 12/24 stacked transformer encoder layers (110/330M parameters) → context embeddings → task-specific classifier → task output]
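For illustration, a minimal sketch of context embeddings in practice (assuming the HuggingFace transformers and torch packages; the embedding_of helper and the bert-base-uncased checkpoint are our own example choices, not from the talk):

```python
# Minimal sketch: BERT gives "bank" different vectors in different contexts.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence, word):
    """Contextual vector of `word` taken from the last encoder layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river = embedding_of("he sat on the river bank", "bank")
invest = embedding_of("she works at an investment bank", "bank")
print(torch.cosine_similarity(river, invest, dim=0))            # well below 1.0
```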
6. Pre-trained LMs have become extremely large and deep
[Chart: number of parameters (in billions) of recent pre-trained LMs, up to T5 at 11B. Source: HuggingFace]
7.
• Heavy computation
• Large memory footprint
• Hard to train/fine-tune
• Hard to deploy
How should we put these monsters in production?
9. Vectors for optimization
• Quantization of weights to int8 or other lower-precision representations
• Pruning of weights, and structural pruning (complete layers, self-attention heads)
• Early prediction of samples using predictors attached to shallow layers (see the sketch after this list)
• Sharing the weights of the self-attention and feed-forward modules across all model blocks
• Training smaller models using distillation and other novel techniques
• Replacing Transformer modules and searching for the best architecture using Neural Architecture Search
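As referenced above, a toy sketch of the early-prediction idea (illustrative sizes and an untrained model, not the speaker's implementation): a classifier head is attached after every encoder block, and a sample exits as soon as one head is confident enough.

```python
# Toy early-exit sketch: easy samples stop at a shallow layer instead of running the full stack.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=128, num_layers=4, num_classes=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_layers))

    def forward(self, x, threshold=0.9):
        for layer, exit_head in zip(self.layers, self.exits):
            x = layer(x)
            logits = exit_head(x.mean(dim=1))            # mean-pool tokens, then classify
            if logits.softmax(-1).max().item() >= threshold:
                return logits                            # confident enough: exit early
        return logits                                    # otherwise use the deepest head

logits = EarlyExitEncoder()(torch.randn(1, 16, 128))     # (batch, seq_len, hidden)
```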
10. Quantization
• Quantization of BERT models to 16/8-bit weights: 4x compression with minimal loss in accuracy
"We Scaled Bert To Serve 1+ Billion Daily Requests on CPUs"
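A minimal post-training dynamic quantization sketch with PyTorch (the bert-base-uncased checkpoint is just an example): Linear weights are stored as int8 and dequantized on the fly, giving roughly the 4x compression mentioned above.

```python
# Minimal dynamic-quantization sketch: int8 weights for all Linear modules.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# `quantized` is used exactly like `model` at inference, with a ~4x smaller footprint;
# accuracy should still be validated on your own task.
```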
11. Pruning
It is possible, for some tasks, to prune up to 9 of the top layers from a 12-layer model without degrading performance by more than 3%.
Poor Man's BERT: Smaller and Faster Transformer Models
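A minimal sketch in the spirit of Poor Man's BERT (assuming a HuggingFace bert-base checkpoint; the number of kept layers is a hyperparameter, not the paper's recommendation for your task): drop the top encoder layers, then fine-tune the smaller model as usual.

```python
# Minimal top-layer-dropping sketch: keep the bottom 6 of 12 encoder layers.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
keep = 6                                                   # layers retained from the bottom
model.bert.encoder.layer = model.bert.encoder.layer[:keep]
model.config.num_hidden_layers = keep
# Fine-tune `model` on the downstream task; expect a small accuracy drop and a large speedup.
```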
13. Naïve approach (Thieves on Sesame Street, Krishna et al., ICLR 2020)
[Diagram: a fine-tuned BERT teacher (FF classifier for fine-tuning) and a small student. Labeled examples (e.g. "Mulan is highly recommended" → Sent: POS) train the student with annotated labels via the task loss; unlabeled examples (e.g. "The movie was as good as the book") are labeled by the teacher and fed to the student as pseudo labels (Sent: POS).]
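A minimal sketch of this naïve recipe (the teacher_predict stub stands in for a fine-tuned BERT teacher; the TF-IDF + logistic regression student and the toy sentences are our own example choices):

```python
# Pseudo-labeling sketch: the teacher hard-labels unlabeled text, and the student
# trains on the gold labels plus those pseudo labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts, gold_labels = ["Mulan is highly recommended", "a total waste of time"], ["POS", "NEG"]
unlabeled_texts = ["The movie was as good as the book", "I walked out halfway through"]

def teacher_predict(text):
    return "POS" if "good" in text else "NEG"   # stand-in for the fine-tuned BERT teacher

pseudo_labels = [teacher_predict(t) for t in unlabeled_texts]

student = make_pipeline(TfidfVectorizer(), LogisticRegression())
student.fit(labeled_texts + unlabeled_texts, gold_labels + pseudo_labels)
print(student.predict(["highly recommended movie"]))
```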
14. *Distillation: mimic the teacher's output probabilities
[Diagram: the teacher (a fine-tuned BERT with an FF classifier) produces output probabilities on unlabeled examples; the student is trained with a total loss that combines the task loss on labeled data with a distillation loss (**MSE over logits) against the teacher.]
• Surprisingly, works well
• Great for low-resource tasks
*Hinton et al. **Tang et al.
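A minimal sketch of such a loss (after Hinton et al. / Tang et al.; the alpha weighting and toy shapes are assumptions, not the talk's settings): the task loss on labeled batches is mixed with an MSE term between student and teacher logits.

```python
# Distillation-loss sketch: task loss + MSE between student and teacher logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels=None, alpha=0.5):
    distill = F.mse_loss(student_logits, teacher_logits)    # mimic the teacher's outputs
    if labels is None:                                       # unlabeled batch: distill only
        return distill
    task = F.cross_entropy(student_logits, labels)           # usual supervised loss
    return alpha * task + (1 - alpha) * distill              # total loss

loss = distillation_loss(torch.randn(4, 2), torch.randn(4, 2), torch.tensor([0, 1, 1, 0]))
```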
16. Can we do more?
LSTM/CNN: >100x, or CBOW: >1000x
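For a sense of what such a small student looks like, a toy CBOW classifier (illustrative sizes, our own sketch): average the word embeddings and apply a single linear layer; it can then be trained with the distillation recipe above.

```python
# Toy CBOW student: averaged word embeddings followed by one linear classifier.
import torch
import torch.nn as nn

class CBOWStudent(nn.Module):
    def __init__(self, vocab_size=30000, dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")   # bag-of-words average
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.classifier(self.embed(token_ids, offsets))

student = CBOWStudent()
# two sentences packed into one flat id tensor; `offsets` marks where each starts
logits = student(torch.tensor([3, 17, 256, 9, 4, 88]), torch.tensor([0, 3]))
```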
17. Real use-case example
• Named Entity Recognition (NER) is a widely used Information Extraction task in many industrial applications and use cases
• Ramping up on a new domain can be difficult
§ Lots of unlabeled data, little or no labeled data, and often not good enough for training a model with good performance
Solution A
• Hire a linguist or data scientist to tune/build the model
• Hire annotators to label more data, or buy a similar dataset
• Time/compute resource limitations
Solution B
• Pre-trained Language Models such as BERT, GPT, and ELMo are great in low-resource scenarios
• They require large compute and memory resources and suffer from high latency at inference
• Deploying such models in production or on edge devices is a major issue
18.
[Chart: accuracy vs. number of labeled samples (150/300/750/3000) on Named Entity Recognition (CoNLL-2003) for BERT (compression x1), a distilled LSTM (x36), and a distilled ID-CNN (x36).]
• Train a small LSTM/CNN model using BERT
• Utilize unlabeled data via the teacher
• Student is competitive with the teacher
Peter et al., NeurIPS 2019
19. 21
78
80
82
84
86
88
90
92
94
Agnews 0.4K
samples
Dair's Emotions
16K samples
IMDB 1K samples STS-2 7K samples
Accuracy Text Classification
BERT Distill LSTM Distill CBOW
Compression Rate x1 x100 x1500
•Train a small CBOW
model using BERT
•Utilizing unlabeled data
via Teacher
•Student competitive
with Teacher in specific
dataset
Wasserblat, more details coming soon
20. Takeaways
• Compact models perform as well as pre-trained LMs in low-resource scenarios, with superior inference speed and a high compression rate
• Practical tips:
• Set a simpler classifier as the baseline
• Fine-tune DistilBERT/BERT on your task
• High resources for labeled data: go with DistilBERT or another compact pre-trained model
• Low resources for labeled data: distill BERT into a simpler NN and compare to BERT
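A minimal fine-tuning sketch for the "fine-tune DistilBERT on your task" tip (assuming the transformers and datasets packages; the IMDB dataset, subset size, and hyperparameters are placeholders, not the talk's setup):

```python
# Minimal DistilBERT fine-tuning sketch with the HuggingFace Trainer.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = load_dataset("imdb").map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=dataset["train"].shuffle(seed=0).select(range(1000)),   # small subset for the sketch
)
trainer.train()
```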
21.
• Data and training efficiency: models requiring less training data and/or fewer computational resources and/or less time
• Inference efficiency: models with lower computational complexity of prediction/inference
https://sites.google.com/view/sustainlp2020
22. AGENDA
● Efficiency
● Large model intro
● Inference efficiency: models with lower computational complexity
● Examples
● SustaiNLP Workshop 2020
● Data challenges
● Extensibility: addressing new domains with limited data and minimal supervision
● Weakly-supervised ABSA example
23. NLP today
● Models are created per individual task and domain
● Requires a large team of domain experts and a large amount of labeled data, and is very time consuming
● Hard to scale and adapt solutions across different domains
● No adaptation to the business environment
24. ABSA example and usage
[Example: "the owner is super friendly and service is fast" — aspect terms: "owner", "service"; opinion terms: "friendly", "fast"]
25. The advantages of the algorithm
Aspect-Based SA: produces knowledge about specific aspects, which enables targeted business insight.
Unsupervised / Domain Adaptive: an unsupervised method that does not require costly, manually tagged data for training.
Explainable AI: displaying the relation between opinion terms and aspects makes the results interpretable.
• ABSA recommended among the Top 10 ML Code Examples on Azure and included by MSFT in their NLP Recipes
• Published at EMNLP 2019
• ABSA used by the University of British Columbia and the British Columbia CDC to analyze COVID-19 related tweets in North America. See Jang et al., 2020.