This presentation reviews state-of-the-art NLP models, centered on the Transformer and BERT. Along the way it reviews many earlier models: Word2Vec, ELMo, GPT, etc.
Reference 1: Kim Dong Ha (https://www.youtube.com/watch?v=xhY7m8QVKjo)
Reference 2: Raimi Karim (https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3)
1.1 Attention Mechanism (Seq2Seq with Attention)
Here is a plain Seq2Seq model.
If the sequence is long, it shows very poor performance. But why?
The context vector has a fixed size, so we cannot pack all the information into it.
The network gets confused: "there is so much information in this small vector; which part is important?"
So instead, we will use the output of every encoder time step.
First, we collect the encoder hidden states.
Then we compute the dot product between each encoder hidden state and the decoder output (initially, the output at the BOS token) to measure how strongly the words are related.
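As a rough NumPy sketch of this scoring step (the sequence length and hidden size here are made-up toy values, not from the slides):

import numpy as np

# Four encoder hidden states (one per source word) and one decoder output.
encoder_states = np.random.randn(4, 8)    # toy shape: 4 time steps, hidden size 8
decoder_output = np.random.randn(8)       # e.g. the output at the BOS step

# Dot product of the decoder output with every encoder hidden state:
scores = encoder_states @ decoder_output  # shape (4,): one score per source word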
We call this framework Query, Key, Value: the decoder output acts as the query, and the encoder hidden states act as the keys (and, here, also the values).
(Figure: one Query is scored against four Keys.)
Then apply a softmax to the scores, so the dot products become weights in [0, 1] that sum to 1.
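Continuing the sketch above, the softmax step could look like this (subtracting the max before exponentiating is a standard numerical-stability trick, not something from the slides):

# Softmax: turn the raw scores into weights in [0, 1] that sum to 1.
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()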
Next, multiply each score with its value; in this basic setup, the values are the same encoder hidden states as the keys.
Finally, add up all the weighted values. We call the result the alignment (attention) vector.
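And the last two steps of the sketch, weighting the values and summing them into the attention vector:

values = encoder_states               # in this setup, keys and values coincide
weighted = weights[:, None] * values  # multiply each value by its weight
context = weighted.sum(axis=0)        # the align / attention vector, shape (8,)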
The alignment (attention) vector is then used as an input to the decoder.
This is the approach of "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2014), where the attention vector is used as a decoder input.
"Effective Approaches to Attention-based Neural Machine Translation" (Luong et al., 2015) uses the attention vector as an input in a similar way.
2.1 Word Embedding (Word Representation in ML)
How do we represent a word in a computer?
Sparse representation (one-hot) is the simplest option... but we want a dense representation!
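A tiny sketch of the sparse one-hot representation (the four-word vocabulary is made up):

import numpy as np

vocab = ["apple", "banana", "school", "home"]  # toy vocabulary
one_hot = np.eye(len(vocab))                   # each row is one word's vector
print(one_hot[vocab.index("apple")])           # [1. 0. 0. 0.]: mostly zeros,
                                               # and no notion of similarity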
Dense Representation (Word2Vec): CBOW and Skip-gram
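A minimal training sketch with gensim (assuming gensim >= 4; the two-sentence corpus is obviously a toy, and the sg flag switches between the two architectures):

from gensim.models import Word2Vec

sentences = [["i", "love", "an", "apple"],
             ["i", "love", "the", "apple"]]  # toy corpus

# sg=0: CBOW (predict the center word from its context)
# sg=1: Skip-gram (predict the context words from the center word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["apple"][:5])                 # a dense 50-dimensional vector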
Dense Representation (Improved Word2Vec): add subword information, as in FastText, so that "apple" is also represented by pieces such as "ap", "app", "appl", "apple".
Word Embedding with Pre-training
But these embeddings cannot capture the contextual meaning of a word.
I love an apple. It is more delicious than a banana.
I love the Apple. It is better than Samsung.
The same embedding vector for both... is that OK?
School, Home, Hospital, Church, Temple...
They will all have similar embedding vectors, because the model only considers whether words appear in the same context. The window [going, to, _____, I, am] appears around each of them:
4/28: Today, I am going to _____ , I am so happy.
4/29: Today, I am going to _____ , I am so sad...
4/30: Today, I am going to _____ , I am so excited!
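A quick sketch of why: the skip-gram context window around the blank is identical no matter which word fills it (toy sentence, window size 2):

sentence = ["today", "i", "am", "going", "to", "school", "i", "am", "so", "happy"]
center, window = 5, 2  # position of the blank, context size
context = sentence[center - window:center] + sentence[center + 1:center + window + 1]
print(context)         # ['going', 'to', 'i', 'am']: the same context
                       # whether the word is school, home, hospital, ...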
But think again... is "Church" really similar to "Temple"? Is "School" similar to "Home"? ...Really?
2.2 Embedding from Language Model (ELMo)
A language model assigns a probability to a sequence: it considers the previous words and predicts the next word.
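Formally, a language model factorizes the probability of a sequence with the chain rule:

$P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})$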
Word2Vec considers the neighboring words; a language model considers the previous words.
ELMo uses two language models, each a multi-layer RNN (LSTM): one forward and one backward, together called the biLM.
To embed the word "play", ELMo uses the output of each layer (shown inside the dotted rectangle in the original figure) as its raw material.
1. Concatenate the forward LM and backward LM outputs at each layer.
2. Multiply each layer's output by a weight.
3. Add every layer's weighted output together.
4. Scale by multiplying a constant.
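A minimal NumPy sketch of these four steps (all shapes and weight values here are made up; in ELMo the per-layer weights s_j and the scale gamma are learned per downstream task):

import numpy as np

# Toy setup: 3 layers (token embedding + 2 LSTM layers), hidden size 4.
fwd = [np.random.randn(4) for _ in range(3)]  # forward LM output per layer
bwd = [np.random.randn(4) for _ in range(3)]  # backward LM output per layer

h = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # 1. concatenate
s = np.array([0.2, 0.3, 0.5])                            # 2. per-layer weights (softmax-normalized, learned)
gamma = 1.0                                              # 4. task-specific scale (learned)
elmo = gamma * sum(w * layer for w, layer in zip(s, h))  # 3. weighted sum, then scale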
We can use the ELMo representation alongside the usual embedding for a downstream task. And now the model knows a word's contextual meaning.
But there is still a big problem: we still lack data specific to each particular task.
This is true across tasks:
Text Classification
Named Entity Recognition
Machine Translation
Question & Answering
2.3 Unsupervised Pre-train, Supervised Fine-Tune (GPT)
We want a universal NLU model that learns from unlabeled data, the way humans do. So GPT tries a semi-supervised approach, pre-training plus fine-tuning, to solve that problem (the lack of task-specific datasets).
Unsupervised Pre-train
Given an unsupervised corpus of tokens $U = \{u_1, u_2, \dots, u_n\}$, we maximize the standard LM objective:
$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$
where $k$ is the context window size and $\Theta$ are the neural network's parameters.
Inside the model, $W_e$ is the token embedding matrix, $W_p$ is the positional embedding matrix, and $P(u)$ is the model's output (the next-word prediction):
$h_0 = U W_e + W_p$
$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \text{for } l = 1, \dots, n$
$P(u) = \mathrm{softmax}(h_n W_e^{\top})$
Supervised Fine-Tune
Given a supervised corpus of tokens $X = x^1, x^2, \dots, x^m$ with label $y$: the inputs pass through the pre-trained model, and the final activation $h_l^m$ passes through a linear output layer once more to predict $y$:
$P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y)$
Auxiliary Loss
Keeping the LM objective as an auxiliary loss while fine-tuning on the supervised corpus $C$ helps generalization and speeds up convergence:
$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$
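As a toy illustration of combining the two objectives (the loss numbers are dummies; lambda = 0.5 is the weight reported in the GPT paper):

loss_task = 0.82  # L2: supervised fine-tuning loss (dummy value)
loss_lm = 3.10    # L1: auxiliary language-modeling loss (dummy value)
lam = 0.5         # lambda, the auxiliary-loss weight
loss_total = loss_task + lam * loss_lm  # L3(C) = L2(C) + lambda * L1(C)
print(loss_total)                       # 2.37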