Moshe Wasserblat, Intel AI, presents on Efficient Deep Learning in Natural Language Processing Production to an online NLP meetup audience, August 3, 2020. Visit https://www.meetup.com/NY-NLP for the New York NLP meetup.
2. BIO
● NICE Systems
● Led the Speech & Text Analytics research group
● First company to productize Speech2Text, ED, and Voice Biometrics in the call center
● INTEL
● Innovate for our products
● Collaborate with top academics
● Explore compute features that disrupt our HW
3. AGENDA
● Efficiency
● Large model intro
● Inference efficiency: models with lower computational complexity
● Examples
● SustaiNLP Workshop at EMNLP, Nov. 2020
● Data challenges
● Extensibility: addressing new domains with limited data and minimal supervision
● Weakly-supervised ABSA example
5. The advantages of BERT
1. Efficient transfer learning
Leverage a large model that was pre-trained on a generic task with a large amount of data, and fine-tune it for a specific task with a small amount of data: high accuracy from a small amount of labeled data.
2. Context embeddings
Produces vectors that represent each word in the context of a sentence, e.g. "bank" in "river bank" vs. "investment bank".
[Diagram: input sentence → 12/24 stacked transformer encoder layers (110/330M parameters) → context embeddings → task-specific classifier → task output]
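For illustration, a minimal sketch of context embeddings in practice (assuming the HuggingFace transformers and torch packages; the embedding_of helper and the bert-base-uncased checkpoint are our own example choices, not from the talk):

```python
# Minimal sketch: BERT gives "bank" different vectors in different contexts.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence, word):
    """Contextual vector of `word` taken from the last encoder layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river = embedding_of("he sat on the river bank", "bank")
invest = embedding_of("she works at an investment bank", "bank")
print(torch.cosine_similarity(river, invest, dim=0))            # well below 1.0
```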
6. Pre-trained LMs have become extremely large and deep
[Chart: number of parameters (in billions) of recent pre-trained LMs, up to T5 at 11B. Source: HuggingFace]
7.
• Heavy computation
• Large memory footprint
• Hard to train/fine-tune
• Hard to deploy
How should we put these monsters in production?
9. Vectors for optimization
• Quantization of weights to int8 or other lower-precision representations
• Pruning of weights, and structural pruning (complete layers, self-attention heads)
• Early prediction of samples using predictors attached to shallow layers (see the sketch after this list)
• Sharing the weights of the self-attention and feed-forward modules across all model blocks
• Training smaller models using distillation and other novel techniques
• Replacing Transformer modules and searching for the best architecture using Neural Architecture Search
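As referenced above, a toy sketch of the early-prediction idea (illustrative sizes and an untrained model, not the speaker's implementation): a classifier head is attached after every encoder block, and a sample exits as soon as one head is confident enough.

```python
# Toy early-exit sketch: easy samples stop at a shallow layer instead of running the full stack.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=128, num_layers=4, num_classes=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_layers))

    def forward(self, x, threshold=0.9):
        for layer, exit_head in zip(self.layers, self.exits):
            x = layer(x)
            logits = exit_head(x.mean(dim=1))            # mean-pool tokens, then classify
            if logits.softmax(-1).max().item() >= threshold:
                return logits                            # confident enough: exit early
        return logits                                    # otherwise use the deepest head

logits = EarlyExitEncoder()(torch.randn(1, 16, 128))     # (batch, seq_len, hidden)
```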
10. Quantization
• Quantization of BERT models to 16/8-bit weights: 4x compression with minimal loss in accuracy
"We Scaled Bert To Serve 1+ Billion Daily Requests on CPUs"
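A minimal post-training dynamic quantization sketch with PyTorch (the bert-base-uncased checkpoint is just an example): Linear weights are stored as int8 and dequantized on the fly, giving roughly the 4x compression mentioned above.

```python
# Minimal dynamic-quantization sketch: int8 weights for all Linear modules.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# `quantized` is used exactly like `model` at inference, with a ~4x smaller footprint;
# accuracy should still be validated on your own task.
```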
11. Pruning
It is possible, for some tasks, to prune up to 9 of the top layers from a 12-layer model without degrading performance by more than 3%.
Poor Man's BERT: Smaller and Faster Transformer Models
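A minimal sketch in the spirit of Poor Man's BERT (assuming a HuggingFace bert-base checkpoint; the number of kept layers is a hyperparameter, not the paper's recommendation for your task): drop the top encoder layers, then fine-tune the smaller model as usual.

```python
# Minimal top-layer-dropping sketch: keep the bottom 6 of 12 encoder layers.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
keep = 6                                                   # layers retained from the bottom
model.bert.encoder.layer = model.bert.encoder.layer[:keep]
model.config.num_hidden_layers = keep
# Fine-tune `model` on the downstream task; expect a small accuracy drop and a large speedup.
```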
13. Naïve approach (Thieves on Sesame Street, Krishna et al., ICLR 2020)
[Diagram: a fine-tuned BERT teacher (FF classifier for fine-tuning) and a small student. Labeled examples (e.g. "Mulan is highly recommended" → Sent: POS) train the student with annotated labels via the task loss; unlabeled examples (e.g. "The movie was as good as the book") are labeled by the teacher and fed to the student as pseudo labels (Sent: POS).]
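A minimal sketch of this naïve recipe (the teacher_predict stub stands in for a fine-tuned BERT teacher; the TF-IDF + logistic regression student and the toy sentences are our own example choices):

```python
# Pseudo-labeling sketch: the teacher hard-labels unlabeled text, and the student
# trains on the gold labels plus those pseudo labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts, gold_labels = ["Mulan is highly recommended", "a total waste of time"], ["POS", "NEG"]
unlabeled_texts = ["The movie was as good as the book", "I walked out halfway through"]

def teacher_predict(text):
    return "POS" if "good" in text else "NEG"   # stand-in for the fine-tuned BERT teacher

pseudo_labels = [teacher_predict(t) for t in unlabeled_texts]

student = make_pipeline(TfidfVectorizer(), LogisticRegression())
student.fit(labeled_texts + unlabeled_texts, gold_labels + pseudo_labels)
print(student.predict(["highly recommended movie"]))
```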
14. *Distillation: mimic the teacher's output probabilities
[Diagram: the teacher (a fine-tuned BERT with an FF classifier) produces output probabilities on unlabeled examples; the student is trained with a total loss that combines the task loss on labeled data with a distillation loss (**MSE over logits) against the teacher.]
• Surprisingly, works well
• Great for low-resource tasks
*Hinton et al. **Tang et al.
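A minimal sketch of such a loss (after Hinton et al. / Tang et al.; the alpha weighting and toy shapes are assumptions, not the talk's settings): the task loss on labeled batches is mixed with an MSE term between student and teacher logits.

```python
# Distillation-loss sketch: task loss + MSE between student and teacher logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels=None, alpha=0.5):
    distill = F.mse_loss(student_logits, teacher_logits)    # mimic the teacher's outputs
    if labels is None:                                       # unlabeled batch: distill only
        return distill
    task = F.cross_entropy(student_logits, labels)           # usual supervised loss
    return alpha * task + (1 - alpha) * distill              # total loss

loss = distillation_loss(torch.randn(4, 2), torch.randn(4, 2), torch.tensor([0, 1, 1, 0]))
```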
16. Can we do more?
LSTM/CNN: >100x, or CBOW: >1000x
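For a sense of what such a small student looks like, a toy CBOW classifier (illustrative sizes, our own sketch): average the word embeddings and apply a single linear layer; it can then be trained with the distillation recipe above.

```python
# Toy CBOW student: averaged word embeddings followed by one linear classifier.
import torch
import torch.nn as nn

class CBOWStudent(nn.Module):
    def __init__(self, vocab_size=30000, dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")   # bag-of-words average
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.classifier(self.embed(token_ids, offsets))

student = CBOWStudent()
# two sentences packed into one flat id tensor; `offsets` marks where each starts
logits = student(torch.tensor([3, 17, 256, 9, 4, 88]), torch.tensor([0, 3]))
```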
17. Real use-case example
• Named Entity Recognition (NER) is a widely used Information Extraction task in many industrial applications and use cases
• Ramping up on a new domain can be difficult
§ Lots of unlabeled data, little or no labeled data, and often not good enough for training a model with good performance
Solution A
• Hire a linguist or data scientist to tune/build the model
• Hire annotators to label more data, or buy a similar dataset
• Time/compute resource limitations
Solution B
• Pre-trained Language Models such as BERT, GPT, and ELMo are great in low-resource scenarios
• They require large compute and memory resources and suffer from high latency at inference
• Deploying such models in production or on edge devices is a major issue
18.
[Chart: accuracy vs. number of labeled samples (150/300/750/3000) on Named Entity Recognition (CoNLL-2003) for BERT (compression x1), a distilled LSTM (x36), and a distilled ID-CNN (x36).]
• Train a small LSTM/CNN model using BERT
• Utilize unlabeled data via the teacher
• Student is competitive with the teacher
Peter et al., NeurIPS 2019
19. 21
78
80
82
84
86
88
90
92
94
Agnews 0.4K
samples
Dair's Emotions
16K samples
IMDB 1K samples STS-2 7K samples
Accuracy Text Classification
BERT Distill LSTM Distill CBOW
Compression Rate x1 x100 x1500
•Train a small CBOW
model using BERT
•Utilizing unlabeled data
via Teacher
•Student competitive
with Teacher in specific
dataset
Wasserblat, more details coming soon
20. Takeaways
• Compact models perform as well as pre-trained LMs in low-resource scenarios, with superior inference speed and a high compression rate
• Practical tips:
• Set a simpler classifier as the baseline
• Fine-tune DistilBERT/BERT on your task
• High resources for labeled data: go with DistilBERT or another compact pre-trained model
• Low resources for labeled data: distill BERT into a simpler NN and compare to BERT
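A minimal fine-tuning sketch for the "fine-tune DistilBERT on your task" tip (assuming the transformers and datasets packages; the IMDB dataset, subset size, and hyperparameters are placeholders, not the talk's setup):

```python
# Minimal DistilBERT fine-tuning sketch with the HuggingFace Trainer.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = load_dataset("imdb").map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=dataset["train"].shuffle(seed=0).select(range(1000)),   # small subset for the sketch
)
trainer.train()
```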
21.
• Data and training efficiency: models requiring less training data and/or fewer computational resources and/or less time
• Inference efficiency: models with lower computational complexity of prediction/inference
https://sites.google.com/view/sustainlp2020
22. AGENDA
● Efficiency
● Large model intro
● Inference efficiency: models with lower computational complexity
● Examples
● SustaiNLP Workshop 2020
● Data challenges
● Extensibility: addressing new domains with limited data and minimal supervision
● Weakly-supervised ABSA example
23. NLP today
● Models are created per individual task and domain
● Requires a large team of domain experts and a large amount of labeled data, and is very time consuming
● Hard to scale and adapt solutions across different domains
● No adaptation to the business environment
24. ABSA example and usage
[Example: "the owner is super friendly and service is fast" — aspect terms: "owner", "service"; opinion terms: "friendly", "fast"]
25. The advantages of the algorithm
Aspect-Based SA: produces knowledge about specific aspects, which enables targeted business insight.
Unsupervised / Domain Adaptive: an unsupervised method that does not require costly, manually tagged data for training.
Explainable AI: displaying the relation between opinion terms and aspects makes the results interpretable.
• ABSA recommended among the Top 10 ML Code Examples on Azure and included by MSFT in their NLP Recipes
• Published at EMNLP 2019
• ABSA used by the University of British Columbia and the British Columbia CDC to analyze COVID-19 related tweets in North America. See Jang et al., 2020.