Paper Review
Attention Is All You Need
(Vaswani et al., 2017) [arXiv pre-print link]
Strong reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Santiago Pascual de la Puente
June 07, 2018
TALP UPC, Barcelona
Table of contents
1. Introduction
2. The Transformer
A Myriad of Attentions
Point-Wise Feed Forward Networks
The Transformer Block
3. Interfacing Token Sequences
Embeddings
Positional Encoding
4. Results
5. Conclusions
1/37
Introduction
Introduction
Recurrent neural networks (RNNs) and their cell variants are firmly
established as state of the art in sequence modeling and transduction
(e.g. machine translation).
In transduction we map a sequence X = {x_1, · · · , x_T } to another one
Y = {y_1, · · · , y_M }, where T and M can be different, x_t ∈ R^{d_e} and y_m ∈ R^{d_d}.
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
2/37
Introduction
1. The encoder RNN will encode the source symbols X = {x_1, · · · , x_T }
into useful abstractions that mix in contextual content →
H = {h_1, · · · , h_T }, where h_t = tanh(W x_t + U h_{t−1} + b).
2. The last encoder state h_T is typically taken as the summary of the
input, and it is injected into the decoder initial state h^d_0 = h_T.
3. The decoder RNN will generate the target sequence one token at a time
(autoregressively) by feeding back its previous prediction y_{m−1} as
input, also conditioned on the encoder summary h^d_0 (sketched in code below).
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
3/37
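To make the recurrent seq2seq baseline above concrete, here is a minimal PyTorch sketch of a GRU encoder whose last state initializes a GRU decoder run with teacher forcing. The module, vocabulary sizes and dimensions are illustrative assumptions, not the setup of any particular system.

```python
import torch
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    """Minimal RNN encoder-decoder sketch (illustrative sizes, no attention yet)."""
    def __init__(self, src_vocab=100, trg_vocab=90, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.trg_emb = nn.Embedding(trg_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, trg_vocab)

    def forward(self, src, trg_in):
        # Encode the source; h_T summarizes the whole input sequence.
        _, h_T = self.encoder(self.src_emb(src))          # (1, B, hidden)
        # The decoder starts from h^d_0 = h_T and, during training, is fed
        # the shifted target tokens (teacher forcing) instead of its own samples.
        dec_out, _ = self.decoder(self.trg_emb(trg_in), h_T)
        return self.out(dec_out)                          # (B, M, trg_vocab)

# Usage with random token ids: batch of 2, source length 7, target length 5.
model = ToySeq2Seq()
logits = model(torch.randint(0, 100, (2, 7)), torch.randint(0, 90, (2, 5)))
```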
Introduction
Encoding a whole sentence into one vector would be super amazing, but it is
infeasible. In the real world we need a mechanism that gives the decoder
hints on where to look in the encoder, weighting all the source vectors
instead of just taking the last one → ATTENTION MECHANISM.
• c_m = Σ_{t=0}^{T−1} α^m_t · h_t
• Each c_m is a row (and an additional input to the decoder), and
each α^m_t is an orange square.
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
4/37
Introduction
• RNNs factor computation along symbol time positions, generating
h_t out of h_{t−1} → cannot parallelize in training:
h_t = tanh(W x_t + U h_{t−1} + b)
• Attention is used with SOTA transduction RNNs → model
dependencies without regard to their distance in the input or output
sequences.
5/37
Introduction
• Let’s get rid of recurrence and rely entirely on attention to draw
global dependencies between input and output.
• The Transformer is born, significantly boosting parallelization and
reaching new SOTA in translation.
6/37
The Transformer
The Transformer
We will have a new encoder-decoder structure, without any recurrence:
only fully connected layers (independent at every time-step) and
self-attention to merge global info in the sequences.
• Encoder will map X = {x1, · · · , xT } to a sequence of continuous
representations Z = {z1, · · · , zT }.
• Given Z, the decoder will generate Y = {y_1, · · · , y_M }
• Still auto-regressive! But no recurrent connections at all.
7/37
The Transformer
8/37
Attention Generic Formulation
• Attention function maps a query and a set of key-value pairs to an
output: query, keys, values, and output are all vectors:
o = f (q, k, v)
• Output is computed as a weighted sum of the values.
• Weight assigned to each value is computed by a compatibility
function of the query with the corresponding key.
o = Σ_{t=0}^{T−1} g(q, k_t) · v_t
9/37
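A tiny sketch of this generic formulation, with the compatibility function g passed in as an argument. The softmax normalization of the weights and the dot-product choice for g in the usage example are assumptions made here for concreteness; the concrete choices appear on the next slides.

```python
import torch

def attend(q, K, V, compat):
    """o = sum_t g(q, k_t) * v_t, with the weights normalized over t."""
    scores = torch.stack([compat(q, k) for k in K])   # one score per key, shape (T,)
    weights = torch.softmax(scores, dim=0)            # normalize so they sum to 1
    return (weights.unsqueeze(1) * V).sum(dim=0)      # weighted sum of the values

# Example with dot-product compatibility (an additive MLP score would also fit).
T, d_k, d_v = 5, 8, 8
q, K, V = torch.randn(d_k), torch.randn(T, d_k), torch.randn(T, d_v)
o = attend(q, K, V, compat=lambda q, k: q @ k)        # output vector, shape (d_v,)
```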
Scaled Dot-Product Attention
• Input: queries and keys of dimension d_k and values of dimension d_v.
• Compute the dot products of the query with all keys, divide each by
√d_k and apply a Softmax → obtain the weights on the values.
10/37
Scaled Dot-Product Attention
• FAST TRICK: compute the attention on a set of queries simultaneously,
packing the queries, keys and values into matrices Q, K, V (see the sketch below).
Attention(Q, K, V) = Softmax(QK^T / √d_k) V
11/37
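A minimal batched sketch of the formula above. The tensor shapes and the optional Boolean mask argument (used later for the decoder) are implementation assumptions of this sketch.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., T_q, d_k), K: (..., T_k, d_k), V: (..., T_k, d_v).
    `mask` is optional and True/nonzero where attending is allowed.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # (..., T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ V                                     # (..., T_q, d_v)

# 4 queries attending over 6 key-value pairs (batch of 1).
Q, K, V = torch.randn(1, 4, 64), torch.randn(1, 6, 64), torch.randn(1, 6, 32)
out = scaled_dot_product_attention(Q, K, V)               # (1, 4, 32)
```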
The Fault In Our Scale
Wait... why do we divide the output of the matching function between the
query and the key by √d_k?
12/37
The Fault In Our Scale
The two most commonly used attention functions (to merge k and q):
• Additive: an MLP with one hidden layer, with the query and key
concatenated at its input.
• Multiplicative: the dot-product seen here → MUCH faster and more
space-efficient.
For small values of d_k both behave similarly, but additive attention
outperforms dot-product attention for larger d_k.
Suspicion: for large values of d_k, the dot products grow large in magnitude,
pushing the Softmax into regions with extremely small gradients.
Assume the components of q and k are independent random variables with
µ = 0 and σ = 1 ⇒ then q · k = Σ_{i=1}^{d_k} q_i · k_i has µ = 0 and σ = √d_k.
We counteract this effect by scaling by 1/√d_k (a quick numerical check follows below).
13/37
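A quick numerical check of the argument above (purely illustrative; the exact numbers depend on the random draws): the empirical standard deviation of q · k grows like √d_k, and dividing by √d_k brings it back to roughly 1.

```python
import torch

torch.manual_seed(0)
for d_k in (4, 64, 1024):
    q = torch.randn(100_000, d_k)              # components ~ N(0, 1)
    k = torch.randn(100_000, d_k)
    dots = (q * k).sum(dim=1)                  # 100k sampled dot products
    print(d_k, dots.std().item(), (dots / d_k ** 0.5).std().item())
```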
Multi-Head Attention
Multi-head attention allows the model to jointly attend to information
from different representation subspaces at different positions. With a
single attention head, averaging inhibits this.
14/37
Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, · · · , head_h) W^O
head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
W^Q_i ∈ R^{d_model×d_k}, W^K_i ∈ R^{d_model×d_k}, W^V_i ∈ R^{d_model×d_v}, W^O ∈ R^{h·d_v×d_model}
In this work h = 8 and d_k = d_v = d_model/h = 64.
15/37
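A compact sketch of the equations above, reusing the scaled_dot_product_attention function from the earlier sketch. Fusing the h per-head projections into single d_model × d_model linear layers is an implementation convenience assumed here, not something the slides prescribe.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """h parallel scaled dot-product heads with fused projection matrices."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # stacks all W^Q_i
        self.w_k = nn.Linear(d_model, d_model)   # stacks all W^K_i
        self.w_v = nn.Linear(d_model, d_model)   # stacks all W^V_i
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def _split(self, x):
        # (B, T, d_model) -> (B, h, T, d_k): one slice of the features per head.
        B, T, _ = x.shape
        return x.view(B, T, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        Q, K, V = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        heads = scaled_dot_product_attention(Q, K, V, mask)       # (B, h, T, d_k)
        B, _, T, _ = heads.shape
        concat = heads.transpose(1, 2).contiguous().view(B, T, self.h * self.d_k)
        return self.w_o(concat)                                   # (B, T, d_model)

# Self-attention: queries, keys and values all come from the same tensor.
x = torch.randn(2, 10, 512)
y = MultiHeadAttention()(x, x, x)    # (2, 10, 512)
```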
Multi-Head Attention
Transformer uses multi-head attention in three different ways:
1. Encoder-decoder attention layers: queries come from previous
decoder layer, and keys and values come from output of the encoder.
Every position in the decoder attends over all positions in the input
sequence. (Same type of attention as classical seq2seq).
16/37
2. Encoder contains self-attention layers: all keys, values and queries
come from same place, the previous encoder layer output. Thus
each position in the encoder can attend to all positions in the
encoder’s previous layer.
17/37
3. The decoder has the same self-attention mechanism. BUT!...
prevent leftward information flow (it must be autoregressive).
18/37
Decoder Attention Mask
Prevent leftward information flow inside the scaled dot-product attention
by masking out (setting to −∞) all values at the input of the Softmax
which correspond to "illegal" connections (sketched below).
19/37
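One common way to build such a mask is with a lower-triangular matrix, sketched below against the earlier scaled_dot_product_attention sketch (whose convention, assumed here, is that True means the connection is allowed).

```python
import torch

T = 5
# Position t may only attend to positions <= t (no leftward information flow).
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# In the earlier sketch, scores at disallowed (False) positions are set to -inf
# before the Softmax, so they end up with exactly zero attention weight.
x = torch.randn(1, T, 64)
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)    # (1, T, 64)
```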
Point-Wise Feed Forward Networks
Simply an MLP applied to each time position with the same parameters:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
These can be seen as two 1-D convolutions with kernel width 1. The
dimensionality of input and output is d_model = 512 and the inner layer has
dimensionality d_ff = 2048.
20/37
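A minimal sketch of the position-wise FFN above, with d_model and d_ff taken from the slide (dropout and other training details omitted).

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):                        # x: (B, T, d_model)
        return self.w2(torch.relu(self.w1(x)))   # (B, T, d_model)

# nn.Linear acts on the last dimension, so every time step is transformed
# independently with shared weights, just like a Conv1d with kernel width 1.
y = PositionwiseFFN()(torch.randn(2, 10, 512))   # (2, 10, 512)
```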
Point-Wise Feed Forward Networks
21/37
The Transformer Block
If we mix a spoon of Multi-Head Attention, another of Point-Wise FFN,
a pinch of res-connections and a spoon of Add&LayerNorm ops we obtain
the Transformer block:
22/37
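Combining the MultiHeadAttention and PositionwiseFFN sketches above with residual connections and LayerNorm gives an encoder block along these lines; a hedged sketch with the post-norm "Add & Norm" ordering and dropout omitted.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention + position-wise FFN, each wrapped in Add & LayerNorm."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, h)   # sketch defined earlier
        self.ffn = PositionwiseFFN(d_model, d_ff)    # sketch defined earlier
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.attn(x, x, x, mask))   # residual + Add & Norm
        x = self.norm2(x + self.ffn(x))                # residual + Add & Norm
        return x

# A full encoder stacks N such blocks; decoder blocks additionally have a
# masked self-attention and an encoder-decoder attention sub-layer.
y = EncoderBlock()(torch.randn(2, 10, 512))            # (2, 10, 512)
```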
The Transformer Block
We can see how N stacks of these blocks form the whole Transformer
END-TO-END network. Note the extra enc-dec-attention in the
decoder blocks.
23/37
Interfacing Token Sequences
Embeddings
As in seq2seq models, we use learned embeddings to convert the input tokens
and the output tokens to dense vectors of dimension d_model. There is also (of
course) an output linear transformation to go from d_model to the number of
classes, followed by a Softmax.
In the Transformer, all these 3 matrices are tied (the same parameters apply),
and in the embedding layers the weights are multiplied by √d_model (see the sketch below).
24/37
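A small sketch of the weight tying and the √d_model scaling described above, assuming a single shared vocabulary so that one embedding matrix can serve both the token lookup and the output projection.

```python
import math
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    """Token embedding scaled by sqrt(d_model); the pre-Softmax projection
    reuses the same weight matrix (weight tying)."""
    def __init__(self, vocab_size=1000, d_model=512):
        super().__init__()
        self.d_model = d_model
        self.emb = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.emb.weight           # same parameters in both places

    def embed(self, tokens):                         # (B, T) -> (B, T, d_model)
        return self.emb(tokens) * math.sqrt(self.d_model)

    def logits(self, hidden):                        # (B, T, d_model) -> (B, T, vocab)
        return self.proj(hidden)

te = TiedEmbedding()
h = te.embed(torch.randint(0, 1000, (2, 7)))
scores = te.logits(h)                                # feed to Softmax / cross-entropy
```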
Embeddings
25/37
Embeddings
26/37
Positional Encoding
• Are we processing sequences? YES.
• Are we taking care of this fact? NO.
So let’s work it out.
28/37
Positional Encoding
• In order for the model to make use of the order of the sequence, we
must inject some information about the relative or absolute position
of the tokens in the sequence.
• Add positional encodings to the embeddings, summing them up so that the
positional info is merged into the input (a code sketch follows below).
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where i is the dimension and pos is the position (time-step). Each
dimension corresponds to a sinusoid, with wavelengths forming a
geometric progression. The frequency and offset of the wave differ
for each dimension.
29/37
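A sketch of the sinusoidal encodings above; the maximum length and the final add-to-embeddings line are illustrative assumptions.

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # 2i = 0, 2, 4, ...
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d_model)      # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe                                                       # (max_len, d_model)

# The encoding is simply summed with the (scaled) token embeddings.
emb = torch.randn(2, 10, 512)                    # stand-in embeddings, (B, T, d_model)
x = emb + positional_encoding(10, 512)           # broadcasts over the batch dimension
```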
Positional Encoding
At every time-step we will have a combination of sinusoids (a combination of
phases) telling us where we are relative to the beginning.
Advantage of these codes: they can generalize to any sequence length at test
time (the sinusoids are cyclic rather than growing indefinitely).
30/37
Results
Results
31/37
Results
• On the WMT 2014 English-to-German translation task, the big
transformer model (Transformer (big)) outperforms the best
previously reported models (including ensembles) by more than 2.0
BLEU! (new SOTA of 28.4).
• Training took 3.5 days on 8 P100 GPUs. Even their base model
surpasses all previously published models and ensembles, at a
fraction of the training cost of any of the competitive models.
• On the WMT 2014 English-to-French translation task, the big
model achieves a BLEU score of 41.0, outperforming all of the
previously published single models, at less than 1/4 of the training cost.
32/37
Results
Enc Layer2
33/37
Results
Enc Layer6
34/37
Results
Dec Layer2
35/37
Results
Dec-SRC Layer2
36/37
Conclusions
Conclusions
• The Transformer is the first sequence transduction model based
entirely on attention (replacing the recurrent layers most commonly
used in encoder-decoder architectures with multi-headed
self-attention).
• For translation tasks, the Transformer can be trained significantly
faster than architectures based on recurrent or convolutional layers.
• New SOTA on the WMT 2014 English-to-German and WMT 2014
English-to-French translation tasks.
• The code used to train and evaluate the original models is available at
https://github.com/tensorflow/tensor2tensor.
37/37
Thanks!
@santty128
37/37