BERT
Bidirectional Encoder
Representations from
Transformers
Multimodal Dialogue Systems Seminar
Abdallah Bashir
CS UdS
19 Sep 2019
Saarbrücken, Germany
1
2
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
3
Keywords!
● Transformers
● Contextual Embeddings
● Bidirectional Training
● Fine-Tuning
● SOTA, a lot of SOTA!
4
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Introduction
● 2018 was a turning point for Natural Language
Processing
○ BERT
○ OpenAI GPT
○ ELMo
○ ULMFiT
[1]
5
Introduction
● BERT [2] (Bidirectional Encoder Representations
from Transformers) is an open-source language
representation model developed by researchers at
Google AI.
● At that time, the paper presented SOTA results on
eleven NLP tasks.
6
(Models that outperformed BERT are mentioned at the end.)
Main Contributions
The paper summarizes its contributions in three main
points:
● Demonstrating the importance of bidirectional pre-training
and why it is better than a unidirectional approach.
● Showing that, with fine-tuning, BERT outperforms
task-specific architectures.
● Showing that BERT advances the SOTA on eleven NLP tasks.
7
8
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Related Work
● The paper briefly goes through previous approaches
to pre-training general language representations,
which are grouped under three main points:
○ Unsupervised Feature-based Approaches
○ Unsupervised Fine-tuning Approaches
○ Transfer Learning from Supervised Data
9
Unsupervised Feature-based
Approaches
● Approaches to learning representations from
unlabeled text, i.e. word embeddings, include both
non-neural and neural methods.
● Using pre-trained word embeddings instead of
training them from scratch has brought significant
improvements in performance.
10
11
The GloVe word embedding of the word "stick" - a vector of 200 floats (rounded to two decimals). It goes on
for two hundred values. [1]
Non Contextual Embeddings
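To make the contrast with contextual embeddings concrete, below is a minimal sketch (not from the slides) of a static GloVe lookup: the word "stick" always maps to the same 200-dimensional vector, whatever sentence it appears in. The file name glove.6B.200d.txt is an assumed local copy of the pre-trained GloVe vectors [10]; any GloVe text file with 200-dimensional vectors would work.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Assumed local path to a downloaded set of 200-d GloVe vectors.
glove = load_glove("glove.6B.200d.txt")
stick = glove["stick"]          # one fixed 200-float vector, independent of context
print(stick.shape)              # (200,)
print(np.round(stick[:5], 2))   # first few values, rounded to two decimals
```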
● Unidirectional
○ Feature-based (ELMo)
○ Fine-tuning (OpenAI GPT).
● Bidirectional
○ BERT
Contextual Embeddings
12
ELMo: What about sentence
context?
13
14
[16]
15
● When ELMo contextual representations were used
in task-specific architectures, ELMo advanced the
SOTA on several benchmarks.
16
17
Transformer
Unsupervised Fine-tuning
Approaches
18-23
[Figure slides: the Transformer encoder-decoder architecture and self-attention; see the editor's notes at the end.]
Self-Attention in a Nutshell!
Unsupervised Fine-tuning
Approaches
● A perk of such approaches is that only a few
parameters need to be trained from scratch.
● OpenAI GPT [4] previously achieved SOTA on
many sentence-level tasks from the GLUE benchmark [19].
24
Transfer Learning from
Supervised Data
● Transfer learning is a means to extract knowledge
from a source setting and apply it to a different
target setting.
● Researchers in the field of Computer Vision have
used transfer learning extensively, fine-tuning
models pre-trained on ImageNet.
25
26
27
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
● BERT implementation:
○ Pretraining
○ Fine-tuning
BERT | The Model
28
[16]
29
● BERT uses almost the same model architecture
across different tasks, with only small differences
between the pre-trained architecture and the final
downstream architecture.
30
Model Architecture
● The BERT architecture consists of a multi-layer
bidirectional Transformer encoder [5].
● The BERT models provided in the paper come in
two sizes:
BERTBASE: Layers = 12, Hidden size = 768, Self-attention heads = 12, Total parameters = 110M
BERTLARGE: Layers = 24, Hidden size = 1024, Self-attention heads = 16, Total parameters = 340M
31
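As an illustration only, the two sizes can be written down as Transformer encoder configurations. This sketch assumes the HuggingFace transformers library (not used in the paper) and a feed-forward (intermediate) size of 4x the hidden size, as in the original Transformer [5]; the parameter counts it prints come out close to the 110M / 340M quoted above.

```python
from transformers import BertConfig, BertModel  # assumption: HuggingFace transformers is installed

# The two sizes reported in the paper, expressed as configurations.
base = BertConfig(hidden_size=768, num_hidden_layers=12,
                  num_attention_heads=12, intermediate_size=3072)
large = BertConfig(hidden_size=1024, num_hidden_layers=24,
                   num_attention_heads=16, intermediate_size=4096)

for name, cfg in [("BERT-base", base), ("BERT-large", large)]:
    model = BertModel(cfg)  # randomly initialized, only used here to count parameters
    print(name, f"{model.num_parameters() / 1e6:.0f}M parameters")  # roughly 110M / 340M
```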
● BERTBASE was chosen to have the same size as
OpenAI GPT for benchmarking purposes.
● The difference between the two models is that BERT
trains the Transformer language model in both
directions, while OpenAI GPT trains it only
left-to-right.
Comparisons with OpenAI GPT
32
Model Input/Output
Representation
● The paper defines a ‘sentence’ as any sequence of
text. As mentioned before, it could be one
sentence or two sentences packed together.
● The BERT input can be either a single sentence or
a pair of sentences (as in Question Answering).
● BERT uses WordPiece embeddings [23] with a
30,000-token vocabulary.
33
[24]
34
● The first token in each sequence is always a
[CLS] token, which is used for classification.
● Sentences in a pair are kept apart in two ways:
○ Separating the sentences with a [SEP] token,
○ Adding a learned embedding to each token in order to
state which sentence it belongs to.
Model Input/Output
Representation
35
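The following is a small sketch of what such an input looks like in practice, assuming the HuggingFace transformers WordPiece tokenizer for BERT (an implementation detail not covered in the slides); the question/passage pair is made up.

```python
from transformers import BertTokenizer  # assumption: HuggingFace transformers is installed

# The pre-trained WordPiece vocabulary (~30,000 tokens) used by BERT-base.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair, e.g. a (question, passage) input for Question Answering.
enc = tokenizer("Who wrote BERT?",
                "BERT was written by researchers at Google AI.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# starts with [CLS]; the two sentences are separated and terminated by [SEP]
print(enc["token_type_ids"])
# segment ids: 0 for sentence A tokens, 1 for sentence B tokens
```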
● Each token is represented by summing the
corresponding token, segment and position
embeddings, as seen in the figure above.
36
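A minimal sketch of that sum, using plain PyTorch embeddings with BERT-base sizes (hidden size 768, maximum sequence length 512). The token ids are toy values; the real implementation also applies layer normalization and dropout after the sum, which is omitted here.

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768      # sizes as reported for BERT-base

token_emb    = nn.Embedding(vocab_size, hidden)    # one vector per WordPiece token
segment_emb  = nn.Embedding(2, hidden)             # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)       # learned position embeddings

input_ids      = torch.tensor([[101, 2040, 2626, 102, 14324, 102]])  # toy token ids
token_type_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])                  # toy segment ids
positions      = torch.arange(input_ids.size(1)).unsqueeze(0)

# Each input token's representation is the sum of the three embeddings.
embeddings = token_emb(input_ids) + segment_emb(token_type_ids) + position_emb(positions)
print(embeddings.shape)  # torch.Size([1, 6, 768])
```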
37
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Pre-training BERT
● In order to pre-train the BERT Transformer
bidirectionally, the paper proposes two
unsupervised tasks: Masked LM (MLM) and
Next Sentence Prediction (NSP).
38
Pre-training Data
● One of the reasons for BERT's success is the
amount of data it was trained on.
● The two main corpora used to pre-train the
language model are:
○ the BooksCorpus (800M words)
○ English Wikipedia (2,500M words)
39
Masked Language Modeling
● Usually, language models can only be trained in
one specific direction.
● BERT handles this issue by using a “Masked
Language Model” objective from previous literature
(the Cloze task [24]).
● BERT does this by randomly masking 15% of the
WordPiece input tokens. Only the masked tokens
are predicted later.
40
[16]
41
● [MASK] tokens appear only during pre-training
and not during fine-tuning!
● To alleviate this mismatch, after randomly choosing
the 15% of tokens, BERT either:
○ Switches the token to [MASK] (80% of the time),
○ Switches the token to another random token (10% of the time),
○ Leaves the token unchanged (10% of the time).
● The original token is then predicted with a cross-entropy
loss (a small sketch of this procedure follows below).
Masked Language Modeling
42
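A minimal sketch of the 80/10/10 corruption rule, on a toy whitespace-tokenized sentence rather than real WordPiece ids; in the real model, the targets at the selected positions are predicted with a cross-entropy loss over the full vocabulary.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Return (corrupted tokens, prediction targets) following BERT's 80/10/10 rule."""
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:          # select ~15% of positions
            targets[i] = tok                     # only these positions are predicted
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                corrupted[i] = MASK
            elif r < 0.9:                        # 10%: replace with a random token
                corrupted[i] = random.choice(VOCAB)
            # remaining 10%: leave the token unchanged
    return corrupted, targets

print(mask_tokens("the cat sat on the mat".split()))
```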
Next Sentence Prediction (NSP)
● In the pre-training phase, the model receives pairs
of sentences and is trained to predict whether the
second sentence is the actual successor of the
first.
● In the training data, 50% of the inputs are actual pairs,
labeled IsNext, while the other 50% pair the first
sentence with a random sentence from the corpus,
labeled NotNext.
● NSP is crucial for downstream tasks like Question
Answering and Natural Language Inference.
43
Next Sentence Prediction (NSP)
44
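A minimal sketch of how one NSP training pair could be constructed; the sentences and the helper name make_nsp_example are made up for illustration.

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """Build one (sentence A, sentence B, label) training pair for NSP."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:                       # 50%: the true next sentence
        return sent_a, doc_sentences[i + 1], "IsNext"
    else:                                           # 50%: a random sentence from the corpus
        return sent_a, random.choice(corpus_sentences), "NotNext"

doc = ["The man went to the store.",
       "He bought a gallon of milk.",
       "Then he went home."]
corpus = ["Penguins are flightless birds.",
          "The stock market fell today."]
print(make_nsp_example(doc, corpus))
```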
● This stage is, in total, inexpensive relative to the
pre-training phase.
● For classification tasks (e.g. sentiment analysis), a
classification feed-forward layer is added on top of the
final output representation of the [CLS] token.
● For Question Answering-like tasks, BERT trains two
extra vectors that are responsible for marking the
beginning and the end of the answer span (see the
sketch after this slide).
Fine-tuning BERT
45
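A minimal PyTorch sketch of the two kinds of fine-tuning heads described above; the class names and the random stand-in for the encoder output are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

hidden, num_labels = 768, 2  # BERT-base hidden size; e.g. positive/negative sentiment

class ClassificationHead(nn.Module):
    """Single feed-forward layer on top of the final [CLS] representation."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, sequence_output):       # (batch, seq_len, hidden) from the encoder
        cls_vector = sequence_output[:, 0]    # position 0 is the [CLS] token
        return self.classifier(cls_vector)    # (batch, num_labels) logits

class SpanHead(nn.Module):
    """Two vectors scoring each token as the start or end of the answer span (QA)."""
    def __init__(self):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden, 2)  # start and end score per token

    def forward(self, sequence_output):
        start_logits, end_logits = self.qa_outputs(sequence_output).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

dummy = torch.randn(1, 16, hidden)              # stand-in for BERT encoder output
print(ClassificationHead()(dummy).shape, SpanHead()(dummy)[0].shape)
```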
Classification Example
[16]
46
47
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Experiments
48
49
50
Experiments
● GLUE
○ General Language Understanding
Evaluation benchmark
● SQuAD v1.1
○ Stanford Question Answering Dataset
● SQuAD v2.0
○ extends the SQuAD 1.1
● SWAG
○ Situations With Adversarial Generations
(SWAG) dataset
1. GLUE
● “The General Language Understanding Evaluation
benchmark (GLUE) [19] is a collection of various
natural language understanding tasks.”
● For fine-tuning, the same classification approach as
in pre-training is used.
51
Table 1: GLUE Test results [2]
52
2. SQuAD v1.1
● “SQuAD v1.1 [26], the Stanford Question
Answering Dataset, is a collection of 100k
crowdsourced question/answer pairs.”
● The input is represented as one sequence containing
both the question and the passage containing the answer.
Table 2: SQuAD 1.1 results
2. SQuAD v1.1
54
● The SQuAD 2.0 task extends SQuAD 1.1 by:
○ allowing for the possibility that no short answer
exists in the provided paragraph.
○ treating questions without an answer as having an
answer span that starts and ends at the [CLS]
token. Unlike the SQuAD 1.1 entry, TriviaQA data
was not used for this model.
3. SQuAD v2.0
55
Table 3: SQuAD 2.0 results
56
● “The Situations With Adversarial Generations
(SWAG) dataset contains 113k sentence-pair
completion examples that evaluate grounded
common-sense inference [28].
● Given a sentence, the task is to choose the most
plausible continuation among four choices.”
● The paper fine-tunes on the SWAG dataset by
creating four input sequences, each concatenating
the given sentence with one possible continuation
(see the sketch below).
4. SWAG
57
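A minimal sketch of building the four input sequences for one SWAG example, assuming the HuggingFace tokenizer; the context and endings are invented. In the paper, a task-specific vector scores the [CLS] representation of each sequence, and a softmax over the four scores selects the continuation.

```python
from transformers import BertTokenizer  # assumption: HuggingFace transformers is installed

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

context = "The chef put the pan on the stove."
endings = [
    "She turned on the heat.",
    "The pan flew out of the window.",
    "He started singing the alphabet.",
    "The stove walked away.",
]

# One input sequence per candidate ending: [CLS] context [SEP] ending [SEP]
batch = tokenizer([context] * len(endings), endings,
                  padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (4, seq_len): the four sequences are scored jointly
```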
Table 4: SWAG results
58
59
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Ablation
Studies
60
1. Effect of Pre-training Tasks
● The next experiments will showcase the
importance of the bidirectionality of BERT.
● The same pre-training data, fine-tuning scheme, and
hyperparameters as BERTBASE are used
throughout these experiments.
61
Experiments and descriptions:
● No NSP: a bidirectional model trained only with the
Masked Language Model (MLM) objective, without the
Next Sentence Prediction (NSP) task.
● LTR & No NSP: a standard Left-to-Right (LTR)
language model trained without the MLM and NSP
objectives. This model can be compared to OpenAI GPT.
1. Effect of Pre-training Tasks
62
Table 5: Ablation over the pre-training tasks using the BERTBASE architecture [2]
63
● The effect of BERT model size on fine-tuning tasks
was tested with different numbers of layers, hidden
units, and attention heads, while using the same
hyperparameters otherwise.
● Results from fine-tuning on GLUE are shown in
Table 6, which includes the average Dev set
accuracy.
● It is clear that the larger the model, the better the
accuracy.
2. Effect of Model Size
64
Table 6: Ablation over BERT model size [2]
65
● BERT shows for the first time that increasing the
model size can improve results even for
downstream tasks whose data is very small,
because they benefit from the larger, more expressive
pre-trained representations.
66
3. Feature-based Approach with
BERT
● There is a way to use BERT for downstream
tasks other than fine-tuning: using the
contextualized word embeddings generated by the
pre-trained BERT as fixed features in other models.
● Advantages of using such an approach:
○ Some tasks require a task-specific model architecture and
can’t be modeled with a Transformer encoder
architecture.
67
● The two approaches were compared by applying BERT to
the CoNLL-2003 Named Entity Recognition (NER) task.
● The contextual embeddings are used as input to a
randomly initialized two-layer 768-dimensional BiLSTM
before the classification layer.
● The paper tried different ways of building the embeddings
for the feature-based approach (all shown in Table 7).
Concatenating the output of the last four hidden layers
achieved the best result (sketched below).
3. Feature-based Approach with
BERT
68
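A minimal sketch of the feature-based setup, assuming the HuggingFace transformers library: the pre-trained encoder is frozen, the last four hidden layers are concatenated per token, and the result feeds a two-layer BiLSTM. The BiLSTM sizing here (hidden size 384 per direction, giving 768-dimensional outputs) is an assumption made to match the "768-dimensional BiLSTM" wording, and the input sentence is arbitrary.

```python
import torch
from transformers import BertModel, BertTokenizer  # assumption: HuggingFace transformers is installed

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()                                         # the pre-trained weights stay frozen

inputs = tokenizer("Jim Henson was a puppeteer", return_tensors="pt")
with torch.no_grad():
    hidden_states = bert(**inputs).hidden_states    # tuple: embeddings + 12 encoder layers

# Concatenate the last four hidden layers: the best-performing variant in Table 7.
features = torch.cat(hidden_states[-4:], dim=-1)    # (1, seq_len, 4 * 768)

lstm = torch.nn.LSTM(input_size=4 * 768, hidden_size=768 // 2,
                     num_layers=2, bidirectional=True, batch_first=True)
lstm_out, _ = lstm(features)                        # would feed a classification layer for NER tags
print(features.shape, lstm_out.shape)
```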
Table 7: CoNLL-2003 Named Entity Recognition results [1]
69
In Conclusion
● BERT's major contribution was adding more
generalization to existing Transfer Learning
methods by using a bidirectional architecture.
● The BERT model also contributed broadly to the
field: fine-tuning BERT can now tackle a lot of
NLP tasks.
70
71
Papers that outperformed BERT
● RoBERTa, FAIR
● XLNet, CMU + Google AI
● CTRL, Salesforce Research
72
References
1. http://jalammar.github.io/illustrated-bert/
2. J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pretraining of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
3. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In NAACL.
4. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving
language understanding with unsupervised learning. Technical report, OpenAI.
5. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural
Information Processing Systems, pages 6000–6010.
6. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai.
1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–
479.
7. Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from
multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.
73
8. John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation
with structural correspondence learning. In Proceedings of the 2006 conference on empirical
methods in natural language processing, pages 120–128. Association for Computational
Linguistics.
9. Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp
Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in
statistical language modeling. arXiv preprint arXiv:1312.3005.
10. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove:
Global vectors for word representation. In Empirical Methods in Natural Language
Processing (EMNLP), pages 1532– 1543.
11. Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A
simple and general method for semi-supervised learning. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics, ACL ’10, pages 384–394.
12. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun,
Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural
information processing systems, pages 3294–3302.
13. Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for
learning sentence representations. In International Conference on Learning Representations.
74
14. Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and
documents. In International Conference on Machine Learning, pages 1188–1196.
15. Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based
objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557.
16. http://jalammar.github.io/illustrated-bert/
17. Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances
in neural information processing systems, pages 3079–3087.
18. Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for
text classification. In ACL. Association for Computational Linguistics.
19. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel
Bowman. 2018a. Glue: A multi-task benchmark and analysis platform for natural language
understanding. In Proceedings of the 2018 EMNLP Workshop Black box NLP: Analyzing and
Interpreting Neural Networks for NLP, pages 353–355.
20. Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes.
2017. Supervised learning of universal sentence representations from natural language inference
data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,
pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
75
21. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in
translation: Contextualized word vectors. In NIPS.
22. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale
Hierarchical Image Database. In CVPR09.
23. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation
system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
24. Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism
Bulletin, 30(4):415–433
25. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching
movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–
27.
26. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+
questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing, pages 2383–2392.
27. Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale
distantly supervised challenge dataset for reading comprehension. In ACL.
76
28. Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale
adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing (EMNLP).
29. Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003
shared task: Language-independent named entity recognition. In CoNLL.
30. http://jalammar.github.io/illustrated-transformer/
77

More Related Content

What's hot

GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentationbhavesh_physics
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaAlexey Grigorev
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understandinggohyunwoong
 
Gpt1 and 2 model review
Gpt1 and 2 model reviewGpt1 and 2 model review
Gpt1 and 2 model reviewSeoung-Ho Choi
 
A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)Shuntaro Yada
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language ModelsLeon Dohmen
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer modelsDing Li
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsOVHcloud
 
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language modelJiWenKim
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and TransformerArvind Devaraj
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
Notes on attention mechanism
Notes on attention mechanismNotes on attention mechanism
Notes on attention mechanismKhang Pham
 

What's hot (20)

Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
Gpt1 and 2 model review
Gpt1 and 2 model reviewGpt1 and 2 model review
Gpt1 and 2 model review
 
Word embedding
Word embedding Word embedding
Word embedding
 
A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
 
BERT
BERTBERT
BERT
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Gpt models
Gpt modelsGpt models
Gpt models
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP models
 
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language model
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
Notes on attention mechanism
Notes on attention mechanismNotes on attention mechanism
Notes on attention mechanism
 
Transformers AI PPT.pptx
Transformers AI PPT.pptxTransformers AI PPT.pptx
Transformers AI PPT.pptx
 

Similar to Bert

QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137Anant Corporation
 
Pay attention to MLPs
Pay attention to MLPsPay attention to MLPs
Pay attention to MLPsSeolhokim
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Deep Learning Italia
 
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...Vimukthi Wickramasinghe
 
Lexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchLexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchSatoru Katsumata
 
Paper presentation on LLM compression
Paper presentation on LLM compression Paper presentation on LLM compression
Paper presentation on LLM compression SanjanaRajeshKothari
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingSubhajit Sahu
 
Exploring Strategies for Training Deep Neural Networks paper review
Exploring Strategies for Training Deep Neural Networks paper reviewExploring Strategies for Training Deep Neural Networks paper review
Exploring Strategies for Training Deep Neural Networks paper reviewVimukthi Wickramasinghe
 
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODELEXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODELijaia
 
Extractive Summarization with Very Deep Pretrained Language Model
Extractive Summarization with Very Deep Pretrained Language ModelExtractive Summarization with Very Deep Pretrained Language Model
Extractive Summarization with Very Deep Pretrained Language Modelgerogepatton
 
Machine learning testing survey, landscapes and horizons, the Cliff Notes
Machine learning testing  survey, landscapes and horizons, the Cliff NotesMachine learning testing  survey, landscapes and horizons, the Cliff Notes
Machine learning testing survey, landscapes and horizons, the Cliff NotesHeemeng Foo
 
Comparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP ModelsComparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP Modelssaurav singla
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFJayavardhan Reddy Peddamail
 
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...taeseon ryu
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Modelstaeseon ryu
 
Universal job embedding in recommendation (public ver.)
Universal job embedding in recommendation (public ver.)Universal job embedding in recommendation (public ver.)
Universal job embedding in recommendation (public ver.)Marsan Ma
 
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...Thien Q. Tran
 

Similar to Bert (20)

QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
 
Pay attention to MLPs
Pay attention to MLPsPay attention to MLPs
Pay attention to MLPs
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
 
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
 
Lexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchLexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam search
 
Paper presentation on LLM compression
Paper presentation on LLM compression Paper presentation on LLM compression
Paper presentation on LLM compression
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning Training
 
Study_of_Sequence_labeling_Systems
Study_of_Sequence_labeling_SystemsStudy_of_Sequence_labeling_Systems
Study_of_Sequence_labeling_Systems
 
Exploring Strategies for Training Deep Neural Networks paper review
Exploring Strategies for Training Deep Neural Networks paper reviewExploring Strategies for Training Deep Neural Networks paper review
Exploring Strategies for Training Deep Neural Networks paper review
 
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODELEXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
 
Extractive Summarization with Very Deep Pretrained Language Model
Extractive Summarization with Very Deep Pretrained Language ModelExtractive Summarization with Very Deep Pretrained Language Model
Extractive Summarization with Very Deep Pretrained Language Model
 
Machine learning testing survey, landscapes and horizons, the Cliff Notes
Machine learning testing  survey, landscapes and horizons, the Cliff NotesMachine learning testing  survey, landscapes and horizons, the Cliff Notes
Machine learning testing survey, landscapes and horizons, the Cliff Notes
 
Comparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP ModelsComparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP Models
 
CSSC ML Workshop
CSSC ML WorkshopCSSC ML Workshop
CSSC ML Workshop
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
 
Universal job embedding in recommendation (public ver.)
Universal job embedding in recommendation (public ver.)Universal job embedding in recommendation (public ver.)
Universal job embedding in recommendation (public ver.)
 
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
 
nnUNet
nnUNetnnUNet
nnUNet
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Bert

  • 1. BERT Bidirectional Encoder Representations from TransformersMultimodal Dialogue Systems Seminar Abdallah Bashir CS UdS 19 Sep 2019 Saarbrucken Germany 1
  • 2. 2 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 3. 3 Keywords! ● Transformers ● Contextual Embeddings ● Bidirectional Training ● Fine-Tuning ● SOTA, alot of SOTA!
  • 4. 4 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 5. Introduction ● 2018 was a turning point for Natural Language Processing ○ BERT ○ OpenAI GPT ○ ELMO ○ ULM fit [1] 5
  • 6. Introduction ● BERT [2] (Bidirectional Encoder Representations from Transformers) is an Open-Source Language Representation Model developed by researchers in Google AI. ● At that time, the paper presented SOTA results in eleven NLP tasks. 6Models that outperformed bert mentioned at the end.
  • 7. Main Contributions The paper summarizes the contributions in main three points ● demonstrate the importance of Bidirectional pre-training and why it is better than unidirectional approach. ● With the usage of fine-tuning, BERT outperforms task- specific architectures. ● Showing the performance achievements where BERT advances the SOTA for eleven NLP tasks 7
  • 8. 8 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 9. Related Work ● The paper briefly go through previous approaches in pre-training general language representations, which are listed under main three points: ○ Unsupervised Feature-based Approaches ○ Unsupervised Fine-tuning Approaches ○ Transfer Learning from Supervised Data 9
  • 10. Unsupervised Feature-based Approaches ● Approaches to Learning representations from unlabeled text, i.e word embeddings included non- neural approaches and neural approaches. ● Using pre-trained word-embeddings instead of training it from scratch have proved significant improvements in performance. 10
  • 11. 11 The GloVe word embedding of the word "stick" - a vector of 200 floats (rounded to two decimals). It goes on for two hundred values. [1] Non Contextual Embeddings
  • 12. ● Unidirectional ○ Feature-Based(ELMo) ○ Fine-tuning(OpenAI GPT). ● Bidirectional ○ BERT Contextual Embeddings 12
  • 13. ELMO, What about sentences’ context 13
  • 14. 14
  • 16. ● When ELMo contextual representations was used in task-specific architecture, ELMo advanced the SOTA benchmarks. 16
  • 18. 18
  • 19. 19
  • 20. 20
  • 21. 21
  • 22. 22
  • 24. Unsupervised Fine-tuning Approaches ● Perks of using such approaches that only few parameters need to be trained from scratch. ● OpenAI GPT [4] previously achieved SOTA on many sentence level tasks in GLUE benchmark [19] 24
  • 25. Transfer Learning from Supervised Data ● Transfer learning is a means to extract knowledge from a source setting and apply it to a different target setting. ● Researchers in the field of Computer Vision have used transfer learning continuously where they fine-tune models pre-trained with ImageNet. 25
  • 26. 26
  • 27. 27 Outlines ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 28. ● BERT implementation: ○ Pretraining ○ Fine-tuning BERT | The Model 28
  • 30. ● BERT almost have the same model architecture across different tasks with small changes between the pre-trained architecture and the final downstream architecture 30
  • 31. Model Architecture ● BERT architecture consist of multi-layer bidirectional Transformer encoder [5]. ● BERT model provided in the paper came in two sizes BERTBASE BERTLARGE Layers = 12 Layers = 24 Hidden size = 768 Hidden size = 1024 self-Attention heads = 12 self-Attention heads = 16 Total parameters = 110M Total parameters = 340M 31
  • 32. ● BERTbase was chosen to have the same size as OpenAI GPT for benchmarks purposes. ● The difference between the two models is BERT train the Language Model transformers in both directions while the OpenAI GPT context trains it only left-to-right. Comparisons with OpenAI GPT 32
  • 33. Model Input/Output Representation ● The paper defines a ‘sentence’ as any sequence of text. Like mentioned before, it could be one sentence or two sentences stacked together. ● BERT input can be represented by either a single sentence, or pair of sentences (like Question Answering). ● BERT use WordPiece embeddings [23] with 30,000 token vocabulary. 33
  • 35. ● The first token in each sequence is always a [CLS] token that is used in classification. ● Separating Sentences apart can be done in two steps: ○ Separate between sentences using [SEP] token, ○ Adding a learned embedding to each token in order to state which sentence it belongs to. Model Input/Output Representation 35
  • 36. ● Each token is represented by summing the corresponding token, segment and position embedding as seen in the above figure 36
  • 37. 37 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 38. Pre-training BERT ● In order to pre-train BERT transformers bidirectionally the paper proposed two unsupervised tasks which are Masked LM and Next Sentence Prediction(NSP). 38
  • 39. Pre-training Data ● One of the reasons of BERT success is the amount of data that it got trained on. ● The main two corpuses that were used to pretrain the language models are: ○ the BooksCorpus (800M words) ○ English Wikipedia (2,500M words) 39
  • 40. Masked Language Modeling ● Usually, Language Models can only be trained on one specific direction. ● BERT handled this issue by using a “Masked Language Model” from previous literature (Cloze [24]). ● BERT do this by randomly masking 15% of the wordpiece input tokens. Only the masked tokens get to be predicted later. 40
  • 42. ● [MASK] tokens would appear only in the pretraining and not during the fine-tuning! ● To alleviate the masking, after choosing the randomly 15% tokens, BERT either: ○ Switch the token to [MASK] (80% of the time). ○ Switch the token to a random other token (10% of the time). ○ Leave the token unchanged (10% of the time). ● The original token is then predicted with cross entropy loss. Masked Language Modeling 42
  • 43. Next Sentence Prediction (NSP) ● In the pretraining phase, the model (if provided) receives pairs of sentences and it will be trained to predict if the second sentence is the subsequent of the first. ● In the training data, 50% of the inputs are actual pairs with a label [IsNext], where the other 50% have random sentences from the corpus as the successor for the first sentence [NotNext]. ● NSP is crucial for downstream tasks like Question Answering and Natural Language Inference. 43
  • 45. ● this stage in total is considered to be inexpensive relative to the pretraining phase. ● For classification tasks (e.g sentiment analysis) we add a classification FFN for the [CLS] token input representation on top of the final output. ● For Question Answering alike tasks Bert train two extra vectors that are responsible for marking the beginning and the end of the answer. Fine-tuning BERT 45
  • 47. 47 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 49. 49
  • 50. 50 Experiments ● GLUE ○ General Language Understanding Evaluation benchmark ● SQuAD v1.1 ○ Stanford Question Answering Dataset ● SQuAD v2.0 ○ extends the SQuAD 1.1 ● SWAG ○ Situations With Adversarial Generations (SWAG) dataset
  • 51. 1. GLUE ● “The General Language Understanding Evaluation benchmark (GLUE) [19] is a collection of various natural language understanding tasks.” ● For fine-tuning,the same approach of classification in pre-training is used. 51
  • 52. Table 1: GLUE Test results [2] 52
  • 53. 2. SQuAD v1.1 ● “SQuAD v1.1 [26], The Stanford Question Answering Dataset is a collection of 100k crowdsourced question,answer pairs.” ● Input get represented as one sequence containing the question and the text containing the answer. 53
  • 54. Table 2: SQuAD 1.1 results 2. SQuAD v1.1 54
  • 55. ● The SQuAD 2.0 task extends the SQuAD 1.1 by: ○ making sure that no short answer exists in the provided paragraph. ○ marking questions that do not have an answer with [CLS] token at the start and the end. The TriviaQA data was used for this model. 3. SQuAD v2.0 55
  • 56. Table 3: SQuAD 2.0 results 56
  • 57. ● “The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded common sense inference [28], ● Given a sentence, the task is to choose the most plausible continuation among four choices” ● The paper fine-tune on the SWAG dataset by creating four input sequences, each include the given sentence and concatenated with a possible continuation. 4. SWAG 57
  • 58. Table 4: SWAG Result‫س‬ 58
  • 59. 59 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 61. 1. Effect of Pre-training Tasks ● The next experiments will showcase the importance of the bidirectionality of BERT. ● the same pretraining data, fine-tuning scheme, and hyperparameters as BERTBASE will be used throughout the experiments. 61
  • 62. Experiments Description No NSP bidirectional model trained only using the Masked Language Model (MLM) without the Next Sentence Prediction NSP. LTR & No NSP A standard Left-to-Right (LTR) language model which is trained without the MLM and the NSP. this model can be compared to the OpenAI GPT. 1. Effect of Pre-training Tasks 62
  • 63. Table 5: Ablation over the pre-training tasks using the BERTBASE architecture [2] 63
  • 64. ● The effect of Bert model size on fine-tuning tasks was tested with different number of layers, hidden units, and attention heads while using the same hyperparameters. ● Results from fine-tuning on GLUE are shown in Table 6 which include the average Dev Set accuracy. ● It is clear that the larger the model, the better the accuracy. 2. Effect of Model Size 64
  • 65. Table 6: Ablation over BERT model size [2] 65
  • 66. ● Bert shows for the first time that increasing the model can improve the results even for downstream tasks that its data is very small because it benefit from the larger, more expressive pre-trained representations. 66
  • 67. 3. Feature-based Approach with BERT ● There are other ways to use Bert for downstream tasks other than fine-tuning which is using the contextualized word embeddings that are generated from pre-training BERT, and then use these fixed features in other models. ● Advantages of using such an approach: ○ Some tasks require a task-specific model architecture and can’t be modeled with a Transformer encoder architecture. 67
  • 68. ● The two approaches were compared by applying Bert to the CoNLL-2003 Named Entity Recognition (NER) task. ● These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer. ● The paper tried different approaches to represent the embeddings (all shown in Table 7.) in the feature-based approach. Concatenate the last four-hidden layers output achieved the best. 3. Feature-based Approach with BERT 68
  • 69. Table 7: CoNLL-2003 Named Entity Recognition results [1] 69
  • 70. In Conclusion ● Bert major contribution was in adding more generalization to existing Transfer Learning methods by using bidirectional architecture. ● Bert model also added more contribution to the field. Fine-tuning Bert model will now tackle a lot of NLP tasks. 70
  • 71. 71 Papers that outperformed Bert ● Roberta, FAIR ● Xlnet, CMU + Google AI ● CTRL, Salesforce Research
  • 72. 72
  • 73. References 1. http://jalammar.github.io/illustrated-bert/ 2. J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 3. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In NAACL. 4. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI. 5. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010. 6. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467– 479. 7. Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853. 73
  • 74. 8. John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 120–128. Association for Computational Linguistics. 9. Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005. 10. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532– 1543. 11. Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 384–394. 12. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302. 13. Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations. 74
  • 75. 14. Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. 15. Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557. 16. http://jalammar.github.io/illustrated-bert/ 17. Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087. 18. Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics. 19. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018a. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop Black box NLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355. 20. Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo¨ıc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics. 75
  • 76. References (continued)
  21. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
  22. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.
  23. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  24. Wilson L. Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.
  25. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.
  26. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  27. Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
  • 77. References (continued)
  28. Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  29. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.
  30. http://jalammar.github.io/illustrated-transformer/

Editor's Notes

  1. Transformers: a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease.
  2. Models that outperformed BERT are mentioned at the end.
  3. ELMo [3] extracts contextual features by using a bidirectional LSTM language model. Each token is represented by concatenating its left-to-right and right-to-left representations.
  4. ELMo builds each contextual embedding by grouping together the hidden states (and the initial embedding) in a certain way: concatenation followed by a weighted summation. A minimal sketch of this combination appears after these notes.
  5. Transformers: a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. Both OpenAI GPT and BERT depend on it.
  6. Looking at the model as a single black box: in a machine translation application, it would take a sentence in one language and output its translation in another.
  7. Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
  8. The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers. The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. The outputs of the self-attention layer are fed to a feed-forward neural network; the exact same feed-forward network is independently applied to each position. A single-head sketch of these two sub-layers is given after these notes.
  9. The decoder has both of those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence.
  10. An encoder receives a list of vectors as input. It processes this list by passing these vectors into a self-attention layer, then into a feed-forward neural network, and then sends the output upward to the next encoder.
  11. Say the following sentence is an input sentence we want to translate: “The animal didn't cross the street because it was too tired.” What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question for a human, but not as simple for an algorithm. When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”. As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
  12. References for this slide: [4] OpenAI GPT; [17] Semi-supervised sequence learning; [18] Universal language model fine-tuning for text classification (ULMFiT).
  13. Previous work: transfer learning in computer vision.
  14. Input/output representation, Transformers.
  15. Pre-training: the model is first trained on large amounts of unlabeled data. Fine-tuning: the model is initialized with the parameters learned in the previous step; afterwards, all parameters are fine-tuned using labeled data from the chosen downstream task.
  16. Two sizes: BERT-Base and BERT-Large.
  17. Mention RoBERTa.
  18. The Next Sentence Prediction (NSP) task in the paper is related to [13] and [15]; the only difference is that [13] and [15] transfer only sentence embeddings to downstream tasks, whereas BERT transfers all of its parameters to the various downstream NLP tasks.
  19. In the next slides, BERT fine-tuning results on 11 NLP tasks will be discussed.
  20. Fine-tuning: we go through the same steps as earlier; the final hidden vector C, which corresponds to the first input token [CLS], is passed to a feed-forward layer used for classification. A schematic sketch of this fine-tuning setup is given after these notes.
  21. We can see from Table 1 that both BERT-Base and BERT-Large outperform all previous state-of-the-art systems, with 4.5% and 7.0% respective average accuracy improvements. Another finding is that BERT-Large outperforms BERT-Base across all tasks.
  22. The probability Pi of word i being the start of the answer span is computed as a dot product between the start vector S and the token’s final hidden vector Ti, followed by a softmax over all words in the paragraph: Pi = exp(S·Ti) / Σj exp(S·Tj). The same formula with an end vector is used to predict the word at the end of the span. A small numeric sketch of this is given after these notes.
  23. BERT-Large achieved SOTA performance on SQuAD. The model that achieved the highest score was an ensemble of BERT-Large models, with the training data augmented with TriviaQA [27] before fine-tuning on SQuAD. “The best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system.”
  24. From Table 3 we can see that BERT achieved a +5.1 F1 improvement over the previous best systems.
  25. “ BERTLARGE outperforms the authors’ baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%”
  26. The paper performs numerous useful ablation studies over different aspects of the BERT architecture in order to understand why they are important.
  27. But BERT was trained with a larger training dataset, and it uses its own input representation and fine-tuning scheme.
  28. In Table 5 we can see that removing NSP reduces performance, especially on QNLI, MNLI, and SQuAD 1.1.
  29. #L = number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data. Using BERT-Large improved performance over BERT-Base on the selected GLUE tasks, even though BERT-Base already has a large number of parameters (110M) compared to the largest Transformer examined in the literature (100M).
  30. Computationally efficient: using the already pre-trained contextual embeddings as features with relatively small task-specific models is less expensive than fine-tuning the entire model.
  31. BERT-Large performs competitively with state-of-the-art methods. The best performing feature-based variant concatenates the token representations from the top four hidden layers of the pre-trained Transformer, and is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both fine-tuning and feature-based approaches; a small sketch of this feature extraction is given below.
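
Sketch for notes 3–4 (ELMo-style combination). This is a minimal numpy illustration, not the actual ELMo implementation: per-layer forward and backward LSTM states are concatenated, and the layers are then mixed with a softmax-normalized weighted sum. The function name, dimensions, and weights are illustrative stand-ins.

```python
import numpy as np

def elmo_style_embedding(fwd_states, bwd_states, layer_weights, gamma=1.0):
    """Combine per-layer bi-LSTM states into one contextual vector per token.

    fwd_states, bwd_states: lists of (seq_len, dim) arrays, one per layer
    layer_weights: unnormalized scalar weights, one per layer (learned in ELMo)
    """
    # 1) concatenate left-to-right and right-to-left states at each layer
    layers = [np.concatenate([f, b], axis=-1) for f, b in zip(fwd_states, bwd_states)]
    # 2) softmax-normalize the layer weights
    w = np.exp(layer_weights - np.max(layer_weights))
    w = w / w.sum()
    # 3) weighted summation over layers, scaled by gamma
    return gamma * sum(wi * layer for wi, layer in zip(w, layers))

# toy usage: 3 layers (incl. the initial embedding layer), 5 tokens, 4-dim states
rng = np.random.default_rng(0)
fwd = [rng.normal(size=(5, 4)) for _ in range(3)]
bwd = [rng.normal(size=(5, 4)) for _ in range(3)]
emb = elmo_style_embedding(fwd, bwd, layer_weights=np.array([0.1, 0.5, 0.4]))
print(emb.shape)  # (5, 8): one contextual vector per token
```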
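
Sketch for notes 8–11 (the two encoder sub-layers). A minimal, single-head numpy version of scaled dot-product self-attention followed by the position-wise feed-forward network; residual connections, layer normalization, and multiple heads are omitted for brevity, and all weight matrices are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Each position attends to every other position in the sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) attention scores
    return softmax(scores) @ V                # weighted sum of value vectors

def feed_forward(X, W1, b1, W2, b2):
    """The same two-layer network is applied independently at every position."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

# toy encoder layer: 6 tokens, model dim 8, FFN dim 16 (sizes are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
out = feed_forward(self_attention(X, Wq, Wk, Wv), W1, b1, W2, b2)
print(out.shape)  # (6, 8): one updated vector per input position
```

The attention-score matrix is where “it” can put most of its weight on “animal”: each row is a probability distribution over all positions in the sentence.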
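
Sketch for notes 15 and 20 (fine-tuning with a classifier on [CLS]). A schematic PyTorch sketch, not the actual BERT code: the encoder is initialized from pre-training, the final hidden vector C of the first token ([CLS]) feeds a small classification layer, and all parameters are fine-tuned on labeled data. The ClassifierOnCLS class, the stand-in encoder, and the hyperparameters are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class ClassifierOnCLS(nn.Module):
    """Pre-trained encoder plus a classification layer on the [CLS] vector C."""

    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder                                 # initialized from pre-training
        self.classifier = nn.Linear(hidden_size, num_labels)   # trained from scratch

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)    # (batch, seq_len, hidden_size)
        cls_vector = hidden[:, 0, :]        # C = final hidden state of the first token [CLS]
        return self.classifier(cls_vector)  # logits for the downstream task

# toy stand-in for the pre-trained Transformer encoder (NOT real BERT weights)
encoder = nn.Sequential(nn.Embedding(30522, 128))  # pretend: token ids -> hidden states
model = ClassifierOnCLS(encoder, hidden_size=128, num_labels=2)

# one fine-tuning step: ALL parameters (encoder + classifier) receive gradients
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
input_ids = torch.randint(0, 30522, (4, 16))       # fake batch of token ids
labels = torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(input_ids), labels)
loss.backward()
optimizer.step()
```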
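
Sketch for note 22 (span prediction). A numpy illustration of the start/end probabilities: the start probability of each token is a softmax of S·Ti over the paragraph, and the end probability uses an end vector E in the same way. The vectors below are random placeholders, not trained parameters.

```python
import numpy as np

def span_probabilities(T, S, E):
    """T: (seq_len, hidden) final hidden vectors; S, E: start and end vectors."""
    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()
    start_probs = softmax(T @ S)   # P_i = exp(S.T_i) / sum_j exp(S.T_j)
    end_probs = softmax(T @ E)     # same formula with the end vector E
    return start_probs, end_probs

rng = np.random.default_rng(0)
T = rng.normal(size=(10, 8))       # 10 paragraph tokens, hidden size 8
S, E = rng.normal(size=8), rng.normal(size=8)
start_probs, end_probs = span_probabilities(T, S, E)
print(start_probs.argmax(), end_probs.argmax())  # most likely start / end positions
```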
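
Sketch for note 31 (feature-based use of BERT). Instead of fine-tuning, the token representations from the top four hidden layers of the frozen pre-trained Transformer are concatenated and handed to a small task-specific model; the layer activations here are random stand-ins.

```python
import numpy as np

def top_four_concat(all_layer_states):
    """all_layer_states: list of (seq_len, hidden) arrays, one per encoder layer."""
    top_four = all_layer_states[-4:]           # last four hidden layers
    return np.concatenate(top_four, axis=-1)   # (seq_len, 4 * hidden)

# pretend activations from a 12-layer encoder, 7 tokens, hidden size 16
rng = np.random.default_rng(0)
layers = [rng.normal(size=(7, 16)) for _ in range(12)]
features = top_four_concat(layers)
print(features.shape)  # (7, 64): fixed features for, e.g., a small NER tagger
```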