BERT was developed by Google AI Language and released in October 2018. It achieved state-of-the-art performance on many NLP tasks, so if you are interested in NLP, BERT is well worth studying.
BERT: Bidirectional Encoder Representations from Transformers
1. BERT: Bidirectional Encoder Representations from Transformers
Liangqun Lu
MS in CS and PhD in Biology
2019-02-25
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT:
Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv [cs.CL].
http://arxiv.org/abs/1810.04805.
2. Related Previous Work
● Attention: Neural Machine Translation by Jointly Learning to Align and
Translate (Bahdanau et al. 2014)
● Transformer: Attention Is All You Need (Vaswani et al. 2017)
● ELMo: Deep Contextualized Word Representations (Peters et al. 2018)
● GPT: Improving language understanding by generative pre-training (Radford
et al. 2018)
(Diagram: lineage from Word2Vec, GloVe, ELMo, and GPT, and from Seq2seq, NMT, Attention, and Transformer, to BERT)
3. Sequence to sequence neural network
● Many NLP tasks can be phrased as sequence-to-sequence:
○ Language translation (input → output)
○ Summarization (long text → short text)
○ Dialogue (previous utterances → next utterance)
○ Parsing (input text → output parse as sequence)
○ Code generation (natural language → Python code)
(Diagram: Input → Encoder → Decoder → Output)
4. NMT: Neural machine translation
● Two RNNs are involved: an encoder and a decoder
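As a rough illustration of the encoder-decoder setup, here is a minimal PyTorch sketch; the GRU cells, layer sizes, and vocabulary sizes are arbitrary choices for this example, not the exact architecture of any particular NMT system:

```python
# Minimal sketch of an encoder-decoder (seq2seq) model for NMT, assuming PyTorch.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                          # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        return outputs, hidden                       # hidden summarizes the source sentence

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, hidden):                  # tgt: (batch, tgt_len)
        outputs, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(outputs), hidden             # logits over the target vocabulary

# Toy usage: encode a source sentence, then decode conditioned on its final state.
enc, dec = Encoder(vocab_size=1000), Decoder(vocab_size=1200)
src = torch.randint(0, 1000, (2, 7))                 # batch of 2 source sentences
tgt = torch.randint(0, 1200, (2, 5))                 # shifted target sentences
_, hidden = enc(src)
logits, _ = dec(tgt, hidden)                         # (2, 5, 1200)
```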
6. Pros and cons of NMT
● Pros:
○ Better performance than previous statistical-based machine translation
○ Requires much less human engineering effort
○ A single neural network to be optimized end-to-end
● Cons:
○ Less interpretable
○ Difficult to control (can’t easily specify rules or guidelines for translation)
○ Information bottleneck: the encoder must compress the whole source sentence into a single fixed-size vector
14. Attention is great!
● Attention significantly improves NMT performance
● Attention helps with the vanishing gradient problem
● Attention provides some interpretability
○ By inspecting the attention distribution, we can see the alignment between source and target words, which shows that the network learns this alignment on its own
Attention is a way to focus on particular parts of the input; it substantially improves sequence-to-sequence models.
15. Attention is a general Deep Learning technique
● More general definition of attention:
● Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
● For example, in the seq2seq + attention model, each decoder hidden state
attends to the encoder hidden states.
16. ● Intuition:
● The weighted sum is a selective summary of the information
contained in the values, where the query determines which values to
focus on.
● Attention is a way to obtain a fixed-size representation of an arbitrary
set of representations (the values), dependent on some other
representation (the query).
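As a sketch of this definition, the snippet below computes a query-dependent weighted sum of value vectors in plain NumPy; the dot-product scoring function and the dimensions are illustrative assumptions, not a specific model:

```python
# Attention as a weighted sum of value vectors, where the weights depend on a query.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, values):
    """query: (d,), values: (n, d) -> (weighted sum of values (d,), weights (n,))."""
    scores = values @ query            # one dot-product score per value vector
    weights = softmax(scores)          # attention distribution over the values
    return weights @ values, weights   # fixed-size summary, dependent on the query

# In the seq2seq + attention model, `values` are the encoder hidden states
# and `query` is the current decoder hidden state.
encoder_states = np.random.randn(6, 4)      # e.g. 6 encoder states of dimension 4
decoder_state = np.random.randn(4)
context, weights = attend(decoder_state, encoder_states)
print(weights.round(2), context.shape)      # weights sum to 1; context has shape (4,)
```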
17. Transformer Overview
● Sequence-to-sequence encoder-decoder architecture
● Task: machine translation with a parallel corpus
● Predict each translated word
● The final cost/error function is standard cross-entropy error on top of a softmax classifier
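To make the last point concrete, here is a minimal PyTorch sketch of that loss; the batch, length, and vocabulary sizes are made up for illustration:

```python
# Per-word cross-entropy loss on top of a softmax over the target vocabulary.
import torch
import torch.nn.functional as F

vocab_size = 1200
logits = torch.randn(2, 5, vocab_size)          # decoder outputs: (batch, tgt_len, vocab)
targets = torch.randint(0, vocab_size, (2, 5))  # gold translated words: (batch, tgt_len)

# F.cross_entropy applies log-softmax to the logits internally, i.e. it is
# exactly "cross-entropy error on top of a softmax classifier" for each word.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```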
24. BERT outline
● Contextual word representations
● Masked language model
● Next sentence prediction
● Model architecture
● Experiments
a. Sentence Pair Classification [MNLI]
b. Single Sentence Classification [SST-2]
c. Question Answering [SQuAD]
d. Single Sentence Tagging [CoNLL-NER]
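As a hedged illustration of the masked language model item above, the sketch below applies BERT-style token masking; the 15% masking rate and the 80/10/10 replacement split follow the paper, but this is toy code, not the official preprocessing:

```python
# Toy BERT-style masking: corrupt the input and record which tokens must be predicted.
import random

MASK_TOKEN = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_prob=0.15):
    """Return (corrupted input, labels); labels are None where nothing is predicted."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                           # model must recover the original token
            r = random.random()
            if r < 0.8:
                inputs.append(MASK_TOKEN)                # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(TOY_VOCAB))  # 10%: replace with a random token
            else:
                inputs.append(tok)                       # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split()))
```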
44. Conclusion
● BERT is a strong pre-trained language model that uses a bidirectional Transformer encoder
● BERT can be fine-tuned to achieve good performance on many NLP tasks
● The source code is available on GitHub
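For example, one common (though unofficial) way to load a pre-trained BERT checkpoint for fine-tuning today is the Hugging Face `transformers` library; the slide refers to the original google-research/bert repository, so the snippet below is just one illustrative route:

```python
# Load a pre-trained BERT checkpoint with a classification head (untrained head,
# ready for fine-tuning) and run it on one sentence.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("BERT can be fine-tuned for many NLP tasks.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # (1, 2) classification logits
print(logits)
```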
46. References
● Stanford CS224n: Natural Language Processing with Deep Learning
● Stanford CS231n: Convolutional Neural Networks for Visual Recognition
● http://people.ee.duke.edu/~lcarin/Kevin8.3.2018.pdf
● https://zhuanlan.zhihu.com/p/52282552
● https://zhuanlan.zhihu.com/p/46178084
● https://zhuanlan.zhihu.com/p/39034683