BERT: Pre-training of Deep
Bidirectional Transformers for
Language Understanding
Devlin et al., 2018 (Google AI Language)
Presenter
Phạm Quang Nhật Minh
NLP Researcher
Alt Vietnam
al+ AI Seminar No. 7
2018/12/21
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
12/21/18 al+ AI Seminar No.7 2
Research context
• Language model pre-training has been used to
improve many NLP tasks
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018)
• ULMFiT (Howard and Ruder, 2018)
• Two existing strategies for applying pre-trained
language representations to downstream tasks
• Feature-based: include pre-trained representations as
additional features (e.g., ELMo)
• Fine-tuning: introduce task-specific parameters and
fine-tune the pre-trained parameters (e.g., OpenAI GPT,
ULMFiT)
12/21/18 al+ AI Seminar No.7 3
Limitations of current techniques
• The language models used in pre-training are unidirectional,
which restricts the power of the pre-trained
representations
• OpenAI GPT uses a left-to-right architecture
• ELMo concatenates independently trained forward and backward
language models
• Solution: BERT (Bidirectional Encoder
Representations from Transformers)
12/21/18 al+ AI Seminar No.7 4
BERT: Bidirectional Encoder
Representations from Transformers
• Main ideas
• Propose a new pre-training objective so that a deep
bidirectional Transformer can be trained
• The “masked language model” (MLM): the objective is to
predict the original identity of a masked word based only on its
context
• “Next sentence prediction” (NSP)
• Merits of BERT
• Simply fine-tuning the pre-trained BERT model on a specific task achieves
state-of-the-art performance
• BERT advances the state-of-the-art for eleven NLP tasks
12/21/18 al+ AI Seminar No.7 5
12/21/18 al+ AI Seminar No.7 6
BERT: Bidirectional Encoder
Representations from Transformers
Model architecture
• BERT’s model architecture is a multi-layer
bidirectional Transformer encoder
• (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
• BERT-Base: L=12, H=768, A=12, total parameters = 110M
• (L: number of layers (Transformer blocks), H: hidden size,
A: number of self-attention heads)
• BERT-Large: L=24, H=1024, A=16, total parameters = 340M
12/21/18 al+ AI Seminar No.7 7
12/21/18 al+ AI Seminar No.7 8
Differences in pre-training model architectures: BERT,
OpenAI GPT, and ELMo
Transformer Encoders
• The Transformer is an attention-based architecture for NLP
• The Transformer is composed of two parts: an encoding
component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
12/21/18 al+ AI Seminar No.7 9
Vaswani et al. (2017) Attention is all you need
(Figure: a stack of Encoder Blocks applied on top of the input sequence.)
Inside an Encoder Block
12/21/18 al+ AI Seminar No.7 10
Source: https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3
In the BERT experiments, the number of blocks N is
12 (BERT-Base) or 24 (BERT-Large).
Blocks do not share weights with
each other.
Transformer Encoders: Key Concepts
12/21/18 al+ AI Seminar No.7 11
• Self-attention and multi-head self-attention
• Position encoding
• Residual connections and layer normalization
• Position-wise feed-forward network
Self-Attention
12/21/18 al+ AI Seminar No.7 12
Image source: https://jalammar.github.io/illustrated-transformer/
Self-Attention in Detail
• Attention maps a query and a set of key-value pairs
to an output
• queries, keys, values, and the output are all vectors
12/21/18 al+ AI Seminar No.7 13
(Figure: two input vectors x1 and x2 are projected into queries q1, q2, keys k1, k2, and values v1, v2.)
Use matrices W^Q, W^K and W^V to project the input into
query, key and value vectors
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where d_k is the dimension of the key vectors
Multi-Head Attention
12/21/18 al+ AI Seminar No.7 14
(Figure: each input vector x1, x2, ... is processed by several attention heads (Head #0, Head #1, ..., Head #7); the head outputs are concatenated and passed through a linear projection.)
Use a weight matrix W^O for the final linear projection
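A minimal sketch of multi-head self-attention using PyTorch's built-in nn.MultiheadAttention, which bundles the per-head projections, the concatenation, and the output projection W^O. The sequence length and batch size are arbitrary assumptions; the hidden size 768 and 12 heads match BERT-Base (the figure above shows 8 heads).

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 768, 12, 5          # BERT-Base sizes; seq_len is arbitrary
mha = nn.MultiheadAttention(embed_dim, num_heads)   # includes the output projection (W^O)

x = torch.randn(seq_len, 1, embed_dim)              # (seq_len, batch, hidden)
out, attn_weights = mha(x, x, x)                    # self-attention: queries = keys = values = x
print(out.shape, attn_weights.shape)
```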
Position Encoding
• Position Encoding is used to make use of the order of the
sequence
• Since the model contains no recurrence and no convolution
• In Vaswani et al. (2017), the authors used sine and cosine functions of
different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
• pos is the position and i is the dimension
12/21/18 al+ AI Seminar No.7 15
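For reference, a small NumPy sketch of the sinusoidal position encoding above. Note that BERT itself uses learned position embeddings (next slide); this only illustrates the formula from Vaswani et al. (2017). The function name and sizes are illustrative assumptions.

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                         # (max_len, 1)
    i = np.arange(d_model)[None, :]                           # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

print(sinusoidal_position_encoding(128, 64).shape)            # (128, 64)
```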
Input Representation
12/21/18 al+ AI Seminar No.7 16
• Token Embeddings: WordPiece embeddings with a 30,000-token vocabulary
• Position Embeddings: learned position embeddings
• Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N.
Dauphin. Convolutional sequence to sequence learning. arXiv preprint
arXiv:1705.03122v2, 2017.
• A segment (sentence A/B) embedding is added to every token of each sentence
• The first token is the special [CLS] token, used for classification tasks
• Sentences are separated by the special token [SEP]
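A minimal sketch of how the three embeddings are summed per token. The module name, the toy token ids, and the default sizes are assumptions for illustration, not a reference implementation.

```python
import torch
import torch.nn as nn

class BertEmbeddingsSketch(nn.Module):
    """Sum of WordPiece token, learned position, and segment (sentence A/B) embeddings."""
    def __init__(self, vocab_size=30000, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)     # WordPiece token embeddings
        self.position = nn.Embedding(max_len, hidden)     # learned position embeddings
        self.segment = nn.Embedding(2, hidden)            # sentence A (0) / sentence B (1)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1)).unsqueeze(0)
        return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)

# Assumed ids for "[CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]"
token_ids = torch.tensor([[101, 7, 8, 9, 102, 11, 12, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]])
print(BertEmbeddingsSketch()(token_ids, segment_ids).shape)   # (1, 8, 768)
```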
Task#1: Masked LM
• 15% of the tokens are masked at random
• and the task is to predict the masked tokens based on their
left and right context
• Not all selected tokens are replaced in the same way
(example sentence “My dog is hairy”)
• 80% are replaced by the [MASK] token: “My dog is
[MASK]”
• 10% are replaced by a random token: “My dog is
apple”
• 10% are left intact: “My dog is hairy”
12/21/18 al+ AI Seminar No.7 17
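A short sketch of the 80/10/10 masking procedure described above. The function name and tiny vocabulary are assumptions; the real preprocessing works on WordPiece token ids rather than strings.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Pick ~15% of positions; of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    tokens = list(tokens)
    targets = {}                                   # position -> original token to predict
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_rate:
            continue
        targets[i] = tok
        r = random.random()
        if r < 0.8:
            tokens[i] = "[MASK]"                   # 80%: replace with the mask token
        elif r < 0.9:
            tokens[i] = random.choice(vocab)       # 10%: replace with a random token
        # else: 10%, leave the token unchanged
    return tokens, targets

vocab = ["my", "dog", "is", "hairy", "apple", "store"]
print(mask_tokens("[CLS] my dog is hairy [SEP]".split(), vocab))
```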
Task#2: Next Sentence Prediction
• Motivation
• Many downstream tasks are based on understanding the
relationship between two text sentences
• Question Answering (QA) and Natural Language Inference (NLI)
• Language modeling does not directly capture that
relationship
• BERT is therefore pre-trained on a binarized next-sentence
prediction task
12/21/18 al+ AI Seminar No.7 18
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon
[MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are
flight ##less birds [SEP]
Label = NotNext
Pre-training procedure
• Training data: BooksCorpus (800M words) + English
Wikipedia (2,500M words)
• To generate each training input sequence: sample
two spans of text (A and B) from the corpus
• The combined length is ≤ 512 tokens
• 50% of the time B is the actual next sentence that follows A, and 50%
of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked
LM likelihood and the mean next sentence
prediction likelihood
12/21/18 al+ AI Seminar No.7 19
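A minimal sketch of how a next-sentence-prediction pair could be sampled under the 50/50 rule above. The function name and data structures are assumptions; the actual pipeline packs multi-sentence spans A and B up to the length limit.

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """50% of the time B is the sentence that actually follows A, 50% a random sentence."""
    i = random.randrange(len(doc_sentences) - 1)
    a = doc_sentences[i]
    if random.random() < 0.5:
        return a, doc_sentences[i + 1], "IsNext"
    return a, random.choice(all_sentences), "NotNext"

doc = ["the man went to the store", "he bought a gallon of milk"]
corpus = doc + ["penguins are flightless birds"]
print(make_nsp_example(doc, corpus))
```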
Fine-tuning procedure
• For sequence-level classification tasks
• Obtain the representation of the input sequence from the
final hidden state at the position of the special
token [CLS], C ∈ ℝ^H
• Add a classification layer with parameters W ∈ ℝ^(K×H) and use
softmax to compute label probabilities:
P = softmax(C W^T)
12/21/18 al+ AI Seminar No.7 20
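A minimal sketch of this classification head: a single linear layer on top of the [CLS] representation C, followed by a softmax. The sizes are illustrative, and nn.Linear also adds a bias term that the formula above omits.

```python
import torch
import torch.nn as nn

hidden, num_labels = 768, 2                 # illustrative sizes (H, K)
classifier = nn.Linear(hidden, num_labels)  # the added parameters W (plus a bias)

c = torch.randn(1, hidden)                  # final hidden state of the [CLS] token
probs = torch.softmax(classifier(c), dim=-1)
print(probs)                                # label probabilities, sum to 1
```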
Fine-tuning procedure
• For sequence-level classification tasks
• All of the parameters of BERT and W are fine-tuned
jointly
• Most model hyperparameters are the same as in
pre-training
• except the batch size, learning rate, and number of
training epochs
12/21/18 al+ AI Seminar No.7 21
Fine-tuning procedure
• Token tagging task (e.g., Named Entity Recognition)
• Feed the final hidden representation T_i ∈ ℝ^H of each
token i into a classification layer over the tagset (NER
label set)
• To make the task compatible with WordPiece
tokenization
• Predict the tag for the first sub-token of each word
• No prediction is made for the remaining sub-tokens (labeled X)
12/21/18 al+ AI Seminar No.7 22
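A minimal sketch of the token-tagging head: the same kind of linear layer, applied to every token's final hidden state T_i. The tagset size and sequence length are assumptions for illustration.

```python
import torch
import torch.nn as nn

hidden, num_tags, seq_len = 768, 9, 11       # e.g., a 9-tag BIO scheme (assumed)
tagger = nn.Linear(hidden, num_tags)

token_states = torch.randn(1, seq_len, hidden)   # final hidden states T_i for each token
tag_logits = tagger(token_states)                # (1, seq_len, num_tags)
predictions = tag_logits.argmax(dim=-1)          # one tag per (sub-)token
print(predictions.shape)                         # only first sub-tokens are actually scored
```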
Fine-tuning procedure
12/21/18 al+ AI Seminar No.7 23
Single Sentence Tagging Tasks: CoNLL-2003 NER
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Input Question:
Where do water droplets collide with ice
crystals to form precipitation?
• Input Paragraph:
.... Precipitation forms as smaller
droplets coalesce via collision with other rain
drops or ice crystals within a cloud. ...
• Output Answer:
within a cloud
12/21/18 al+ AI Seminar No.7 24
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Represent the input question and paragraph as a single
packed sequence
• The question uses the A embedding and the paragraph uses
the B embedding
• The new parameters learned during fine-tuning are a start
vector S ∈ ℝ^H and an end vector E ∈ ℝ^H
• The probability of token i being the start of the
answer span is
P_i = e^(S·T_i) / Σ_j e^(S·T_j)
(and analogously for the end position, using E)
• The training objective is the log-likelihood of the correct
start and end positions
12/21/18 al+ AI Seminar No.7 25
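A minimal sketch of the span scoring with the start vector S and end vector E. The token states are random stand-ins for BERT outputs, and the argmax here ignores the constraint that the end position should not precede the start.

```python
import torch

hidden, seq_len = 768, 20                     # illustrative sizes
token_states = torch.randn(seq_len, hidden)   # final hidden states T_i of the passage tokens
start_vec = torch.randn(hidden)               # learned start vector S
end_vec = torch.randn(hidden)                 # learned end vector E

start_probs = torch.softmax(token_states @ start_vec, dim=0)  # P_i = e^(S·T_i) / sum_j e^(S·T_j)
end_probs = torch.softmax(token_states @ end_vec, dim=0)
span = (start_probs.argmax().item(), end_probs.argmax().item())
print(span)                                   # predicted (start, end) token positions
```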
Comparison of BERT and OpenAI GPT
• Training data: OpenAI GPT is trained on BooksCorpus (800M words);
BERT is trained on BooksCorpus (800M) + Wikipedia (2,500M)
• Special tokens: OpenAI GPT uses the sentence separator ([SEP])
and classifier token ([CLS]) only at fine-tuning time;
BERT learns [SEP], [CLS] and sentence A/B embeddings during
pre-training
• Batch size: OpenAI GPT is trained for 1M steps with a batch size
of 32,000 words; BERT is trained for 1M steps with a batch size
of 128,000 words
• Learning rate: OpenAI GPT uses the same learning rate of 5e-5
for all fine-tuning experiments; BERT chooses a task-specific
learning rate which performs best on the development set
12/21/18 al+ AI Seminar No.7 26
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
12/21/18 al+ AI Seminar No.7 27
Experiments
• GLUE (General Language Understanding Evaluation)
benchmark
• GLUE distributes canonical Train, Dev and Test splits
• Labels for the Test set are not provided
• Datasets in GLUE:
• MNLI: Multi-Genre Natural Language Inference
• QQP: Quora Question Pairs
• QNLI: Question Natural Language Inference
• SST-2: Stanford Sentiment Treebank
• CoLA: The Corpus of Linguistic Acceptability
• STS-B: The Semantic Textual Similarity Benchmark
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: Recognizing Textual Entailment
• WNLI: Winograd NLI
12/21/18 al+ AI Seminar No.7 28
GLUE Results
12/21/18 al+ AI Seminar No.7 29
SQuAD v1.1
12/21/18 al+ AI Seminar No.7 30
Reference: https://rajpurkar.github.io/SQuAD-explorer
Named Entity Recognition
12/21/18 al+ AI Seminar No.7 31
SWAG
• The Situations with Adversarial Generations (SWAG)
• The only task-specific parameter is a vector V ∈ ℝ^H
• The probability distribution is the softmax over the four
choices:
P_i = e^(V·C_i) / Σ_{j=1..4} e^(V·C_j)
where C_i is the [CLS] representation of choice i
12/21/18 al+ AI Seminar No.7 32
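A minimal sketch of this scoring: a dot product of the task vector V with each choice's [CLS] representation C_i, followed by a softmax over the four choices. The random tensors are stand-ins for actual BERT outputs.

```python
import torch

hidden = 768                                   # illustrative size
cls_states = torch.randn(4, hidden)            # [CLS] vectors C_j, one per choice
v = torch.randn(hidden)                        # the single task-specific vector V

choice_probs = torch.softmax(cls_states @ v, dim=0)   # P_i = e^(V·C_i) / sum_j e^(V·C_j)
print(choice_probs, choice_probs.argmax().item())     # probabilities and the chosen ending
```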
SWAG Result
12/21/18 al+ AI Seminar No.7 33
Ablation Studies
• To understand
• Effect of Pre-training Tasks
• Effect of model sizes
• Effect of number of training steps
• Feature-based approach with BERT
12/21/18 al+ AI Seminar No.7 34
Ablation Studies
• Main findings:
• “Next sentence prediction” (NSP) pre-training is important
for sentence-pair classification tasks
• Bidirectionality (using MLM pre-training task) contributes
significantly to the performance improvement
12/21/18 al+ AI Seminar No.7 35
Ablation Studies
• Main findings
• Bigger model sizes are better even for small-scale tasks
12/21/18 al+ AI Seminar No.7 36
Conclusions
• Unsupervised pre-training (language-model pre-training)
is increasingly adopted for many NLP tasks
• The major contribution of the paper is a deep
bidirectional pre-training architecture based on the Transformer
• It advances the state-of-the-art for many important NLP tasks
12/21/18 al+ AI Seminar No.7 37
Links
• TensorFlow code and pre-trained models for BERT:
https://github.com/google-research/bert
• PyTorch Pretrained BERT:
https://github.com/huggingface/pytorch-pretrained-BERT
• BERT-pytorch: https://github.com/codertimo/BERT-pytorch
• BERT-keras: https://github.com/Separius/BERT-keras
12/21/18 al+ AI Seminar No.7 38
Remark: Applying BERT to non-English languages
• Pre-trained BERT models are provided for more than
100 languages (including Vietnamese)
• https://github.com/google-research/bert/blob/master/multilingual.md
• Be careful with tokenization!
• For Japanese (and Chinese): “spaces were added
around every character in the CJK Unicode range before
applying WordPiece” => not an ideal approach
• Use SentencePiece:
https://github.com/google/sentencepiece
• We may need to pre-train our own BERT model
12/21/18 al+ AI Seminar No.7 39
References
1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.
(2018). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. arXiv
preprint arXiv:1810.04805.
2. Vaswani et al. (2017). Attention Is All You Need. arXiv
preprint arXiv:1706.03762.
https://arxiv.org/abs/1706.03762
3. The Annotated Transformer:
http://nlp.seas.harvard.edu/2018/04/03/attention.html, by harvardnlp.
4. Dissecting BERT: https://medium.com/dissecting-bert
12/21/18 al+ AI Seminar No.7 40