This presentation reviews state-of-the-art NLP models, centered on the Transformer and BERT. Along the way it reviews many earlier models: Word2Vec, ELMo, GPT, etc.
Reference 1: Kim Dong Ha (https://www.youtube.com/watch?v=xhY7m8QVKjo)
Reference 2: Raimi Karim (https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3)
1.1 Attention Mechanism (Seq2Seq with Attention)
Here is a plain Seq2Seq model.
If the sequence is long, it shows very poor performance. But why?
The context vector has a fixed size, so we cannot pack all the information into it.
The network gets confused: "there is so much information in this small vector; which part is important?"
So instead, we will use the output of every encoder time step.
First, we collect the encoder hidden states.
Then we compute the dot product between each encoder hidden state and the decoder output (initially, the output at the BOS token) to measure how strongly the words are related.
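As a rough NumPy sketch of this scoring step (the sequence length and hidden size here are made-up toy values, not from the slides):

import numpy as np

# Four encoder hidden states (one per source word) and one decoder output.
encoder_states = np.random.randn(4, 8)    # toy shape: 4 time steps, hidden size 8
decoder_output = np.random.randn(8)       # e.g. the output at the BOS step

# Dot product of the decoder output with every encoder hidden state:
scores = encoder_states @ decoder_output  # shape (4,): one score per source word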
We call this framework Query, Key, Value: the decoder output acts as the query, and the encoder hidden states act as the keys (and, here, also the values).
(Figure: one Query is scored against four Keys.)
Then apply a softmax to the scores, so the dot products become weights in [0, 1] that sum to 1.
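Continuing the sketch above, the softmax step could look like this (subtracting the max before exponentiating is a standard numerical-stability trick, not something from the slides):

# Softmax: turn the raw scores into weights in [0, 1] that sum to 1.
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()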
Next, multiply each score with its value; in this basic setup, the values are the same encoder hidden states as the keys.
Finally, add up all the weighted values. We call the result the alignment (attention) vector.
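And the last two steps of the sketch, weighting the values and summing them into the attention vector:

values = encoder_states               # in this setup, keys and values coincide
weighted = weights[:, None] * values  # multiply each value by its weight
context = weighted.sum(axis=0)        # the align / attention vector, shape (8,)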
The alignment (attention) vector is then used as an input to the decoder.
This is the approach of "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2014), where the attention vector is used as a decoder input.
"Effective Approaches to Attention-based Neural Machine Translation" (Luong et al., 2015) uses the attention vector as an input in a similar way.
2.1 Word Embedding (Word Representation in ML)
How do we represent a word in a computer?
Sparse representation (one-hot) is the simplest option... but we want a dense representation!
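A tiny sketch of the sparse one-hot representation (the four-word vocabulary is made up):

import numpy as np

vocab = ["apple", "banana", "school", "home"]  # toy vocabulary
one_hot = np.eye(len(vocab))                   # each row is one word's vector
print(one_hot[vocab.index("apple")])           # [1. 0. 0. 0.]: mostly zeros,
                                               # and no notion of similarity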
Dense Representation (Word2Vec): CBOW and Skip-gram
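A minimal training sketch with gensim (assuming gensim >= 4; the two-sentence corpus is obviously a toy, and the sg flag switches between the two architectures):

from gensim.models import Word2Vec

sentences = [["i", "love", "an", "apple"],
             ["i", "love", "the", "apple"]]  # toy corpus

# sg=0: CBOW (predict the center word from its context)
# sg=1: Skip-gram (predict the context words from the center word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["apple"][:5])                 # a dense 50-dimensional vector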
Dense Representation (Improved Word2Vec): add subword information, as in FastText, so that "apple" is also represented by pieces such as "ap", "app", "appl", "apple".
Word Embedding with Pre-training
But these embeddings cannot capture the contextual meaning of a word.
I love an apple. It is more delicious than a banana.
I love the Apple. It is better than Samsung.
The same embedding vector for both... is that OK?
School, Home, Hospital, Church, Temple...
They will all have similar embedding vectors, because the model only considers whether words appear in the same context. The window [going, to, _____, I, am] appears around each of them:
4/28: Today, I am going to _____ , I am so happy.
4/29: Today, I am going to _____ , I am so sad...
4/30: Today, I am going to _____ , I am so excited!
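A quick sketch of why: the skip-gram context window around the blank is identical no matter which word fills it (toy sentence, window size 2):

sentence = ["today", "i", "am", "going", "to", "school", "i", "am", "so", "happy"]
center, window = 5, 2  # position of the blank, context size
context = sentence[center - window:center] + sentence[center + 1:center + window + 1]
print(context)         # ['going', 'to', 'i', 'am']: the same context
                       # whether the word is school, home, hospital, ...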
But think again... is "Church" really similar to "Temple"? Is "School" similar to "Home"? ...Really?
2.2 Embedding from Language Model (ELMo)
A language model assigns a probability to a sequence: it considers the previous words and predicts the next word.
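Formally, a language model factorizes the probability of a sequence with the chain rule:

$P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})$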
Word2Vec considers the neighboring words; a language model considers the previous words.
ELMo uses two language models, each a multi-layer RNN (LSTM): one forward and one backward, together called the biLM.
To embed the word "play", ELMo uses the output of each layer (shown inside the dotted rectangle in the original figure) as its raw material.
1. Concatenate the forward LM and backward LM outputs at each layer.
2. Multiply each layer's output by a weight.
3. Add every layer's weighted output together.
4. Scale by multiplying a constant.
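A minimal NumPy sketch of these four steps (all shapes and weight values here are made up; in ELMo the per-layer weights s_j and the scale gamma are learned per downstream task):

import numpy as np

# Toy setup: 3 layers (token embedding + 2 LSTM layers), hidden size 4.
fwd = [np.random.randn(4) for _ in range(3)]  # forward LM output per layer
bwd = [np.random.randn(4) for _ in range(3)]  # backward LM output per layer

h = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # 1. concatenate
s = np.array([0.2, 0.3, 0.5])                            # 2. per-layer weights (softmax-normalized, learned)
gamma = 1.0                                              # 4. task-specific scale (learned)
elmo = gamma * sum(w * layer for w, layer in zip(s, h))  # 3. weighted sum, then scale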
We can use the ELMo representation alongside the usual embedding for a downstream task. And now the model knows a word's contextual meaning.
But there is still a big problem: we still lack data specific to each particular task.
This is true across tasks:
Text Classification
Named Entity Recognition
Machine Translation
Question & Answering
2.3 Unsupervised Pre-train, Supervised Fine-Tune (GPT)
We want a universal NLU model that learns from unlabeled data, the way humans do. So GPT tries a semi-supervised approach, pre-training plus fine-tuning, to solve that problem (the lack of task-specific datasets).
Unsupervised Pre-train
Given an unsupervised corpus of tokens $U = \{u_1, u_2, \dots, u_n\}$, we maximize the standard LM objective:
$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$
where $k$ is the context window size and $\Theta$ are the neural network's parameters.
Inside the model, $W_e$ is the token embedding matrix, $W_p$ is the positional embedding matrix, and $P(u)$ is the model's output (the next-word prediction):
$h_0 = U W_e + W_p$
$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \text{for } l = 1, \dots, n$
$P(u) = \mathrm{softmax}(h_n W_e^{\top})$
Supervised Fine-Tune
Given a supervised corpus of tokens $X = x^1, x^2, \dots, x^m$ with label $y$: the inputs pass through the pre-trained model, and the final activation $h_l^m$ passes through a linear output layer once more to predict $y$:
$P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y)$
Auxiliary Loss
Keeping the LM objective as an auxiliary loss while fine-tuning on the supervised corpus $C$ helps generalization and speeds up convergence:
$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$
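As a toy illustration of combining the two objectives (the loss numbers are dummies; lambda = 0.5 is the weight reported in the GPT paper):

loss_task = 0.82  # L2: supervised fine-tuning loss (dummy value)
loss_lm = 3.10    # L1: auxiliary language-modeling loss (dummy value)
lam = 0.5         # lambda, the auxiliary-loss weight
loss_total = loss_task + lam * loss_lm  # L3(C) = L2(C) + lambda * L1(C)
print(loss_total)                       # 2.37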