Slides reviewing the paper:
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 5998-6008. 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
1. Paper Review
Attention Is All You Need
(Vaswani et al. 2017) [Arxiv pre-print link]
Strong reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Santiago Pascual de la Puente
June 07, 2018
TALP UPC, Barcelona
2. Table of contents
1. Introduction
2. The Transformer
A Myriad of Attentions
Point-Wise Feed Forward Networks
The Transformer Block
3. Interfacing Token Sequences
Embeddings
Positional Encoding
4. Results
5. Conclusions
4. Introduction
Recurrent neural networks (RNNs) and their cell variants are firmly
established as state of the art in sequence modeling and transduction
(e.g. machine translation).
In transduction we map a sequence X = {x_1, · · · , x_T} to another one
Y = {y_1, · · · , y_M}, where T and M can be different, x_t ∈ R^{d_e} and y_m ∈ R^{d_d}.
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
5. Introduction
1. The encoder RNN encodes the source symbols X = {x_1, · · · , x_T}
into useful abstractions that mix in contextual content →
H = {h_1, · · · , h_T}, where h_t = tanh(W x_t + U h_{t−1} + b).
2. The last encoder state h_T is typically taken as the summary of the
input, and it is injected into the decoder initial state: h_0^d = h_T.
3. The decoder RNN generates the target sequence one token at a time
(autoregressively) by feeding back its previous prediction y_{m−1} as
input, also conditioned on the encoder summary h_0^d.
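A minimal PyTorch sketch of this encoder-decoder recurrence (the toy dimensions, greedy decoding loop, and start-token id are illustrative assumptions, not the tutorial's exact code):

```python
import torch
import torch.nn as nn

d_e, d_h, vocab = 32, 64, 100  # assumed toy dimensions / vocabulary size

class EncoderRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(d_e, d_h, nonlinearity="tanh", batch_first=True)

    def forward(self, x):            # x: (batch, T, d_e)
        H, h_T = self.rnn(x)         # H = {h_1, ..., h_T}; h_T is the summary
        return H, h_T

class DecoderRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_e)
        self.rnn = nn.RNN(d_e, d_h, nonlinearity="tanh", batch_first=True)
        self.out = nn.Linear(d_h, vocab)

    def forward(self, y_prev, h_prev):   # one autoregressive step
        o, h = self.rnn(self.emb(y_prev), h_prev)
        return self.out(o), h

enc, dec = EncoderRNN(), DecoderRNN()
H, h = enc(torch.randn(1, 10, d_e))          # h_0^d = h_T
y = torch.zeros(1, 1, dtype=torch.long)      # assumed start token id 0
for _ in range(5):                           # feed back the previous prediction
    logits, h = dec(y[:, -1:], h)
    y = torch.cat([y, logits.argmax(-1)], dim=1)
```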
6. Introduction
Encoding a whole sentence into one vector would be super amazing, but it is
infeasible. In the real world we need a mechanism that gives the decoder
hints on where to look among the encoder states, weighting the source vectors
instead of just taking the last one → ATTENTION MECHANISM.
• c_m = Σ_{t=0}^{T−1} α_t^m · h_t
• Each c_m is a row (and an additional input to the decoder), and each
α_t^m is an orange square in the attention-matrix figure.
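A tiny sketch of this weighted sum (the dot-product scoring below is an assumption for brevity; other score functions, e.g. an MLP, are also used):

```python
import torch

T, d_h = 10, 64
H = torch.randn(T, d_h)                 # encoder states h_0 ... h_{T-1}
s_m = torch.randn(d_h)                  # current decoder state
alpha = torch.softmax(H @ s_m, dim=0)   # α_t^m: one weight per source step
c_m = (alpha.unsqueeze(1) * H).sum(0)   # c_m = Σ_t α_t^m · h_t
```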
7. Introduction
• RNNs factor computation along symbol time positions, generating
h_t out of h_{t−1} → cannot parallelize in training:
h_t = tanh(W x_t + U h_{t−1} + b)
• Attention is used with SOTA transduction RNNs → model
dependencies without regard to their distance in the input or output
sequences.
8. Introduction
• Let’s get rid of recurrence and rely entirely on attention to draw
global dependencies between input and output.
• The Transformer is born, significantly boosting parallelization and
reaching new SOTA in translation.
10. The Transformer
We will have a new encoder-decoder structure, without any recurrence:
only fully connected layers (independent at every time-step) and
self-attention to merge global info in the sequences.
• Encoder will map X = {x1, · · · , xT } to a sequence of continuous
representations Z = {z1, · · · , zT }.
• Given Z the decoder will generate Y = {y1, · · · , yN }
• Still auto-regressive! But no recurrent connections at all.
12. Attention Generic Formulation
• An attention function maps a query and a set of key-value pairs to an
output; query, keys, values, and output are all vectors:
o = f(q, k, v)
• Output is computed as a weighted sum of the values.
• Weight assigned to each value is computed by a compatibility
function of the query with the corresponding key.
o_i = Σ_{t=0}^{T−1} g(q_i, k_t) · v_t
14. Scaled Dot-Product Attention
• Input: queries and keys of dimension dk and values of dimension dv .
• Compute the dot products of the query with all keys, divide each by
√d_k and apply Softmax → obtain the weights on the values.
• FAST TRICK: compute the attention on a whole set of queries
simultaneously, packing them into matrices Q, K, V:
Attention(Q, K, V) = Softmax(QK^T / √d_k) V
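A direct transcription of this formula in PyTorch (the shapes below are assumptions; rows of Q, K, V are positions):

```python
import math
import torch

def sdp_attention(Q, K, V, mask=None):
    # scores: (..., T_q, T_k) — one row of compatibility scores per query
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

Q = torch.randn(2, 5, 64)     # (batch, T_q, d_k)
K = torch.randn(2, 7, 64)     # (batch, T_k, d_k)
V = torch.randn(2, 7, 32)     # (batch, T_k, d_v)
out = sdp_attention(Q, K, V)  # (2, 5, 32)
```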
15. The Fault In Our Scale
Wait... why do we divide the output of the matching function between
query and key by √d_k?
16. The Fault In Our Scale
Two most commonly used attention methods (to merge k and q):
• Additive: MLP with one hidden layer where vectors are
concatenated at input of MLP.
• Multiplicative: dot-product seen here → MUCH faster and more
space-efficient.
For small values of dk both behave similarly, but additive outperforms
dot-product for larger dk .
Suspicion: for large values of d_k, dot-products grow large in magnitude,
pushing the Softmax into regions with extremely small gradients.
Assume the components of q and k are independent random variables with
µ = 0 and σ = 1 ⇒ q · k = Σ_{i=1}^{d_k} q_i · k_i has µ = 0 and σ = √d_k.
We counteract this effect by scaling the dot products by 1/√d_k.
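A quick numerical check of this suspicion (a sketch; the sample size and d_k are arbitrary choices):

```python
import math
import torch

d_k = 512
q = torch.randn(100_000, d_k)      # components ~ N(0, 1)
k = torch.randn(100_000, d_k)
dots = (q * k).sum(dim=-1)         # 100k samples of q · k
print(dots.mean())                 # ≈ 0
print(dots.std(), math.sqrt(d_k))  # both ≈ 22.6
```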
17. Multi-Head Attention
Multi-head attention allows the model to jointly attend to information
from different representation subspaces at different positions. With a
single attention head, averaging inhibits this.
18. Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, · · · , head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}, W^O ∈ R^{h·d_v×d_model}
In this work h = 8 and d_k = d_v = d_model/h = 64.
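A sketch following these shapes. Packing all h per-head matrices W_i into one big linear layer per projection is a common implementation trick, not something the slides spell out; sdp_attention repeats the earlier scaled dot-product sketch so the block runs standalone:

```python
import math
import torch
import torch.nn as nn

def sdp_attention(Q, K, V, mask=None):
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)   # packs all W_i^Q
        self.W_k = nn.Linear(d_model, d_model)   # packs all W_i^K
        self.W_v = nn.Linear(d_model, d_model)   # packs all W_i^V
        self.W_o = nn.Linear(d_model, d_model)   # W^O

    def _split(self, x):   # (B, T, d_model) -> (B, h, T, d_k)
        B, T, _ = x.shape
        return x.view(B, T, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        Q, K, V = self._split(self.W_q(q)), self._split(self.W_k(k)), self._split(self.W_v(v))
        heads = sdp_attention(Q, K, V, mask)     # (B, h, T_q, d_k)
        B, _, T, _ = heads.shape
        return self.W_o(heads.transpose(1, 2).reshape(B, T, self.h * self.d_k))
```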
21. Multi-Head Attention
Transformer uses multi-head attention in three different ways:
1. Encoder-decoder attention layers: queries come from previous
decoder layer, and keys and values come from output of the encoder.
Every position in the decoder attends over all positions in the input
sequence. (Same type of attention as classical seq2seq).
2. Encoder contains self-attention layers: all keys, values and queries
come from same place, the previous encoder layer output. Thus
each position in the encoder can attend to all positions in the
encoder’s previous layer.
3. The decoder has the same self-attention mechanism. BUT!... we must
prevent leftward information flow (it must stay autoregressive).
22. Decoder Attention Mask
Prevent leftward information flow inside the scaled dot-product attention
by masking out (setting to −∞) all values at the input of the Softmax
which correspond to "illegal" connections.
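A sketch of this mask (the upper triangle marks the "illegal" future connections):

```python
import torch

T = 5
# True above the diagonal = connection to a future position (to be masked out)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = torch.randn(T, T)
weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
# row t now has non-zero weight only on positions <= t
```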
23. Point-Wise Feed Forward Networks
Simply an MLP applied at each time position with the same parameters:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
These can be seen as two 1D convolutions with kernel width 1. The
dimensionality of input and output is d_model = 512 and the inner layer has
dimensionality d_ff = 2048.
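This maps directly to two linear layers with a ReLU in between, applied identically at every position; a sketch:

```python
import torch
import torch.nn as nn

class PointWiseFFN(nn.Module):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, same parameters at every position
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):          # x: (batch, T, d_model)
        return self.w2(torch.relu(self.w1(x)))
```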
25. The Transformer Block
If we mix a spoonful of Multi-Head Attention, another of Point-Wise FFN,
a pinch of residual connections and a spoonful of Add & LayerNorm ops, we
obtain the Transformer block:
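A sketch of an encoder block built from this recipe, assuming the MultiHeadAttention and PointWiseFFN sketches above are in scope (post-norm ordering as in the paper; dropout omitted for brevity):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, h)   # sketched earlier
        self.ffn = PointWiseFFN(d_model, d_ff)       # sketched earlier
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # residual connection around each sub-layer, then Add & LayerNorm
        x = self.norm1(x + self.attn(x, x, x, mask))
        return self.norm2(x + self.ffn(x))
```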
26. The Transformer Block
We can see how N stacks of these blocks form the whole Transformer
END-TO-END network. Note the extra enc-dec-attention in the
decoder blocks.
28. Embeddings
As in seq2seq models, we use learned embeddings to convert input tokens
and output tokens to dense vectors of dimension d_model. There is also (of
course) an output linear transformation to go from d_model to the number of
classes, followed by a Softmax.
In the Transformer, all these 3 matrices are tied (the same parameters apply),
and in the embedding layers the weights are multiplied by √d_model.
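A sketch of the tying and the scaling (the vocabulary size is an arbitrary assumption):

```python
import math
import torch
import torch.nn as nn

d_model, vocab = 512, 32000              # vocab size assumed

emb = nn.Embedding(vocab, d_model)
proj = nn.Linear(d_model, vocab, bias=False)
proj.weight = emb.weight                 # tie output projection to embeddings

tokens = torch.randint(0, vocab, (1, 10))
x = emb(tokens) * math.sqrt(d_model)     # embedding weights scaled by √d_model
logits = proj(x)                         # (1, 10, vocab) → Softmax over classes
```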
32. Positional Encoding
• Are we processing sequences? YES.
• Are we taking care of this fact? NO.
So let’s work it out.
33. Positional Encoding
• In order for the model to make use of the order of the sequence, we
must inject some information about the relative or absolute position
of the tokens in the sequence.
• Add positional encodings to the embeddings, summing them up so
that the positional info is merged into the input.
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
Where i is the dimension and pos the position (time-step). Each
dimension corresponds to a sinusoid, with wavelengths forming a
geometric progression. The frequency and offset of the wave are different
for each dimension.
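A sketch generating this table of sinusoids (one row per position, one sinusoid per dimension):

```python
import math
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    i2 = torch.arange(0, d_model, 2, dtype=torch.float)          # even dims 2i
    div = torch.exp(-math.log(10000.0) * i2 / d_model)           # 10000^(-2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)    # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos * div)    # PE(pos, 2i+1)
    return pe

pe = positional_encoding(100, 512)  # same shape as the embeddings; summed in
```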
34. Positional Encoding
At every time-step we will have a combination of sinusoids telling us
where we are relative to the beginning (through the combination of phases).
Advantage of these codes: generalization to any sequence length at test time
(the cyclic nature of sinusoids rather than values growing indefinitely).
37. Results
• On the WMT 2014 English-to-German translation task, the big
transformer model (Transformer (big)) outperforms the best
previously reported models (including ensembles) by more than 2.0
BLEU! (new SOTA of 28.4).
• Training took 3.5 days on 8 P100 GPUs. Even their base model
surpasses all previously published models and ensembles, at a
fraction of the training cost of any of the competitive models.
• On the WMT 2014 English-to-French translation task, the big
model achieves a BLEU score of 41.0, outperforming all of the
previously published single models, at less than 1/4 the training cost.
43. Conclusions
• The Transformer is the first sequence transduction model based
entirely on attention (replacing the recurrent layers most commonly
used in encoder-decoder architectures with multi-headed
self-attention).
• For translation tasks, the Transformer can be trained significantly
faster than architectures based on recurrent or convolutional layers.
• New SOTA on WMT 2014 English-to-German and WMT 2014
English-to-French translation tasks.
• Code used to train and evaluate the original models is available at
https://github.com/tensorflow/tensor2tensor.