Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and a computational perspective.
6. Motivation: Vision
[Figure: an input image of size H x W x 3 is classified as "bird"]
The whole input volume is used to predict the output, despite the fact that not all pixels are equally important.
7. Motivation: Automatic Speech Recognition (ASR)
Figure: distill.pub
Chan, W., Jaitly, N., Le, Q., & Vinyals, O. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." ICASSP 2016.
Each transcribed letter corresponds to just a few phonemes of the input signal.
8. Motivation: Neural Machine Translation (NMT)
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The translated words are often related to a subset of the words from the source sentence.
(Edge thicknesses represent the attention weights found by the attention model.)
10. Seq2Seq with RNN
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
The Seq2Seq scheme can be implemented with RNNs (see the sketch below):
● trigger the output generation with an input <go> symbol.
● the predicted word at timestep t becomes the input at t+1.
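A minimal sketch of this decoding loop in PyTorch, with hypothetical vocabulary size, dimensions and token ids (GRUs instead of the LSTMs of Sutskever et al., for brevity):

```python
import torch
import torch.nn as nn

# Hypothetical sizes and token ids, for illustration only.
VOCAB, EMB, HID, GO, EOS = 1000, 64, 128, 1, 2

emb = nn.Embedding(VOCAB, EMB)
encoder = nn.GRU(EMB, HID, batch_first=True)
decoder = nn.GRUCell(EMB, HID)
out_proj = nn.Linear(HID, VOCAB)

def translate(src_ids, max_len=20):
    """Greedy Seq2Seq decoding: the final encoder state seeds the decoder,
    generation is triggered by <go>, and each predicted word becomes the
    input of the next timestep."""
    _, h = encoder(emb(src_ids))        # final encoder state, (1, 1, HID)
    h = h.squeeze(0)                    # (1, HID)
    token = torch.tensor([GO])          # trigger generation with <go>
    output = []
    for _ in range(max_len):
        h = decoder(emb(token), h)      # one decoder step
        token = out_proj(h).argmax(-1)  # predicted word at timestep t...
        if token.item() == EOS:
            break
        output.append(token.item())     # ...becomes the input at t+1
    return output

print(translate(torch.randint(3, VOCAB, (1, 5))))  # random 5-word "sentence"
```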
11. Seq2Seq with RNN
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
13. Seq2Seq for NMT
How does the length of the source sentence affect the size (d) of its encoding?
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
14. Seq2Seq for NMT
How does the length of the input sentence affect the size (d) of its encoded representation? It does not: the encoding is a fixed-size vector, however long the sentence.
Ray Mooney, ACL 2014 Workshop on Semantic Parsing:
"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"
15. Seq2Seq for NMT: Information bottleneck
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
17. Query, Keys & Values
Databases store information as pairs of keys and values (K, V).
Figure: Nikhil Shah, "Attention? An Other Perspective! [Part 1]" (2020)
Example: the key myfile.pdf retrieves its associated value, the file itself.
18. Query, Keys & Values
The (K, Q, V) terminology used to retrieve information from databases is adopted to formulate attention.
Figure: Nikhil Shah, "Attention? An Other Perspective! [Part 1]" (2020)
19. Query, Keys & Values
Attention is a mechanism to compute a context vector (c) for a query (Q) as a weighted sum of values (V).
Figure: Nikhil Shah, "Attention? An Other Perspective! [Part 1]" (2020)
20. Query, Keys & Values
Usually, both keys (K) and values (V) correspond to the encoder states, while the query (Q) is the decoder state.
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
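A minimal sketch of this computation under hypothetical dimensions, using a plain dot product as the similarity score (the scoring variants come later in the section):

```python
import torch

L, d = 4, 8                            # hypothetical: 4 encoder states of size d
K = V = torch.randn(L, d)              # keys and values: the encoder states
q = torch.randn(d)                     # query: the current decoder state

weights = torch.softmax(K @ q, dim=0)  # one similarity score per key
c = weights @ V                        # context vector c: weighted sum of values
print(c.shape)                         # torch.Size([8])
```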
36. Attention Mechanisms
Slide: Marta R. Costa-jussà
How to compute these attention scores?
● Additive attention
● Multiplicative attention
● Scaled dot product
38. Attention Mechanisms: Additive Attention
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." ICLR 2015.
The first attention mechanism used a one-hidden-layer feed-forward network to calculate the attention alignment:
39. Attention Mechanisms: Additive Attention
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." ICLR 2015.
The first attention mechanism used a one-hidden-layer feed-forward network to calculate the attention alignment:
[Figure: the encoder state h_j plays the role of the key (k) and the decoder state z_i that of the query (q), producing the attention weight a_j]
40. Attention Mechanisms: Additive Attention
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." ICLR 2015.
The original attention mechanism uses a one-hidden-layer feed-forward network to calculate the attention alignment:

a(q, k) = v_a^T tanh(W_a [q; k])

where [q; k] denotes the concatenation of query and key. It is often expressed as two separate linear transformations W_1 and W_2 for z_i and h_j, respectively:

a(q, k) = v_a^T tanh(W_1 q + W_2 k)
42. Attention Mechanisms: Multiplicative Attention
Luong, M.-T., Pham, H., & Manning, C. D. "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015.
A computationally efficient approach is to simply learn a matrix multiplication:

a(q, k) = q^T W k
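A minimal sketch of this score with hypothetical dimensions; W would be a learned parameter in a real model:

```python
import torch

d_q, d_k = 4, 6
W = torch.randn(d_q, d_k)         # learned parameter in practice
q, k = torch.randn(d_q), torch.randn(d_k)
score = q @ W @ k                 # a(q, k) = q^T W k, a scalar
print(score.item())
```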
43. Attention Mechanisms: Multiplicative Attention
Luong, M.-T., Pham, H., & Manning, C. D. "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015.
If the dimensions of q and k are equal (e.g. in Seq2Seq, by learning a linear projection), an even more computationally efficient approach is a dot product between query and key:

a(q, k) = q^T k
45. Attention Mechanisms: Scaled dot product
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. "Attention is all you need." NeurIPS 2017.
Multiplicative attention performs worse than additive attention for large dimensionality of the decoder state (d_k) unless a scaling factor is added:

a(q, k) = q^T k / √d_k
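A quick sketch comparing the plain and scaled dot-product scores (hypothetical dimension d_k):

```python
import math
import torch

d_k = 64
q, k = torch.randn(d_k), torch.randn(d_k)
dot = q @ k                        # a(q, k) = q^T k
scaled = dot / math.sqrt(d_k)      # a(q, k) = q^T k / sqrt(d_k)
print(dot.item(), scaled.item())   # scaling keeps scores in a softmax-friendly range
```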
46. Attention Mechanisms: Example
[Figure: the queries Q are compared against the keys K to compute the attention matrix A_t = softmax(Q K^T), which weighs the values V into the context vectors C = A_t V; here keys and values are the encoder states, of dimension d_h]
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
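A minimal sketch of this matrix-form computation in PyTorch, with hypothetical sizes for the number of keys/values, queries and the dimension d_h:

```python
import torch

L, T, d_h = 5, 3, 8                  # hypothetical: 5 keys/values, 3 queries
K = torch.randn(L, d_h)              # keys: e.g. the encoder states
V = K                                # values: here equal to the keys
Q = torch.randn(T, d_h)              # queries: e.g. the decoder states

A_t = torch.softmax(Q @ K.T / d_h ** 0.5, dim=-1)  # (T, L) attention weights
C = A_t @ V                          # (T, d_h) context vectors
print(A_t.sum(dim=-1))               # each row of A_t sums to 1
```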
49. Image Captioning without Attention
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. "Show and tell: A neural image caption generator." CVPR 2015.
[Figure: a CNN encodes the image into a single feature vector of dimension D, which initializes an LSTM that generates the caption word by word: "A bird flying ... <EOS>"]
Limitation: all output predictions are based on the final and static output of the encoder.
50. Image Captioning with Attention
[Figure: a CNN processes an input image of size H x W x 3]
Slide: Amaia Salvador
51. Attention for Image Captioning
[Figure: a CNN encodes the image (H x W x 3) into L visual embeddings. The first context vector c0 is the average of the L embeddings, and the first input word y0 is the <start> token. From them, the decoder state h0 produces L attention weights a1 and the predicted word y1.]
Slide: Amaia Salvador
52. Attention for Image Captioning
[Figure: the visual features weighted with the attention a1 give the next context vector c1. Together with y1, the word predicted in the previous timestep, it drives the next state h1, which produces the attention weights a2 and the word y2.]
Slide: Amaia Salvador
53. Attention for Image Captioning
[Figure: the same process repeats at every timestep: context c2 and word y2 feed the state h2, which produces the attention weights a3 and the word y3.]
Slide: Amaia Salvador
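A minimal sketch of this decoding loop, under assumptions not in the slides: hypothetical dimensions, a dot-product scoring function (Xu et al. use an MLP), and greedy word selection:

```python
import torch
import torch.nn as nn

L, d, VOCAB = 49, 128, 1000              # hypothetical: 7x7 CNN grid -> L = 49
emb = nn.Embedding(VOCAB, d)
cell = nn.LSTMCell(2 * d, d)             # input: [context; previous word]
out_proj = nn.Linear(d, VOCAB)

def decode(features, start_id=0, steps=5):
    """features: (L, d) visual embeddings produced by the CNN."""
    c = features.mean(dim=0)             # first context: average of L embeddings
    y = torch.tensor([start_id])         # first word: the <start> token
    h = s = torch.zeros(1, d)
    words = []
    for _ in range(steps):
        x = torch.cat([c.unsqueeze(0), emb(y)], dim=-1)
        h, s = cell(x, (h, s))
        a = torch.softmax(features @ h.squeeze(0), dim=0)  # L attention weights
        c = a @ features                 # features weighted with attention
        y = out_proj(h).argmax(-1)       # predicted word feeds the next timestep
        words.append(y.item())
    return words

print(decode(torch.randn(L, d)))
```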
54. Attention for Image Captioning
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
55. Attention for Image Captioning
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
56. Attention for Image Captioning
Side note: attention can be computed with the previous or the current hidden state.
[Figure: a variant of the previous scheme in which each state h_t first produces the attention weights a_t and the context c_t, which are then used to predict the word y_t at the same timestep; the first step again starts from the average of the visual features v and the <start> word y0.]
57. Exercise
Given a query vector and two keys:
● q = [0.3, 0.2, 0.1]
● k1 = [0.1, 0.3, 0.1]
● k2 = [0.6, 0.4, 0.2]
a) What are the attention weights a1 and a2 when computing the dot product?
b) Considering d_k = 3, what are the attention weights a1 and a2 when computing the scaled dot product?
c) Which key vector will receive more attention?
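The computation can be checked numerically with a few lines (a worked sketch; the answers can be read off the printed tensors):

```python
import torch

q  = torch.tensor([0.3, 0.2, 0.1])
k1 = torch.tensor([0.1, 0.3, 0.1])
k2 = torch.tensor([0.6, 0.4, 0.2])

scores = torch.stack([q @ k1, q @ k2])          # dot products: 0.10 and 0.28
print(torch.softmax(scores, dim=0))             # (a) weights without scaling
print(torch.softmax(scores / 3 ** 0.5, dim=0))  # (b) scaled by sqrt(d_k) = sqrt(3)
# (c) in both cases k2 receives the larger weight
```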
59. Learn more
● Tutorials
○ Sebastian Ruder, "Deep Learning for NLP Best Practices # Attention" (2017).
○ Chris Olah, Shan Carter, "Attention and Augmented Recurrent Neural Networks". Distill 2016.
○ Lilian Weng, "Attention? Attention!". Lil'Log 2017.
● Demos
○ François Fleuret (EPFL)
● Scientific publications
○ Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. "Transformers are RNNs: Fast autoregressive transformers with linear attention." ICML 2020.
○ Jayakumar, Siddhant M., Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. "Multiplicative Interactions and Where to Find Them." ICLR 2020. [tweet]
○ Woo, Sanghyun, Jongchan Park, Joon-Young Lee, and In So Kweon. "CBAM: Convolutional Block Attention Module." ECCV 2018.
○ Lu, Xiankai, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. "See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks." CVPR 2019.
○ Wang, Ziqin, Jun Xu, Li Liu, Fan Zhu, and Ling Shao. "RANet: Ranking Attention Network for Fast Video Object Segmentation." ICCV 2019.