Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and a computational perspective.
6. Motivation: Vision
[Figure: an input image of size H x W x 3 is classified as "bird"]
The whole input volume is used to predict the output, despite the fact that not all pixels are equally important.
7. Motivation: Automatic Speech Recognition (ASR)
Figure: distill.pub
Chan, W., Jaitly, N., Le, Q., & Vinyals, O. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." ICASSP 2016.
Each transcribed letter corresponds to just a few phonemes of the input signal.
8. Motivation: Neural Machine Translation (NMT)
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The translated words are often related to a subset of the words from the source sentence.
(Edge thicknesses represent the attention weights found by the attention model.)
10. Seq2Seq with RNN
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
The Seq2Seq scheme can be implemented with RNNs (see the sketch below):
● trigger the output generation with an input <go> symbol.
● the predicted word at timestep t becomes the input at t+1.
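A minimal sketch of this decoding loop in PyTorch, with hypothetical vocabulary size, dimensions and token ids (GRUs instead of the LSTMs of Sutskever et al., for brevity):

```python
import torch
import torch.nn as nn

# Hypothetical sizes and token ids, for illustration only.
VOCAB, EMB, HID, GO, EOS = 1000, 64, 128, 1, 2

emb = nn.Embedding(VOCAB, EMB)
encoder = nn.GRU(EMB, HID, batch_first=True)
decoder = nn.GRUCell(EMB, HID)
out_proj = nn.Linear(HID, VOCAB)

def translate(src_ids, max_len=20):
    """Greedy Seq2Seq decoding: the final encoder state seeds the decoder,
    generation is triggered by <go>, and each predicted word becomes the
    input of the next timestep."""
    _, h = encoder(emb(src_ids))        # final encoder state, (1, 1, HID)
    h = h.squeeze(0)                    # (1, HID)
    token = torch.tensor([GO])          # trigger generation with <go>
    output = []
    for _ in range(max_len):
        h = decoder(emb(token), h)      # one decoder step
        token = out_proj(h).argmax(-1)  # predicted word at timestep t...
        if token.item() == EOS:
            break
        output.append(token.item())     # ...becomes the input at t+1
    return output

print(translate(torch.randint(3, VOCAB, (1, 5))))  # random 5-word "sentence"
```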
11. Seq2Seq with RNN
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
13. Seq2Seq for NMT
How does the length of the source sentence affect the size (d) of its encoding?
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
14. Seq2Seq for NMT
How does the length of the input sentence affect the size (d) of its encoded representation? It does not: the encoding is a fixed-size vector, however long the sentence.
Ray Mooney, ACL 2014 Workshop on Semantic Parsing:
"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"
15. Seq2Seq for NMT: Information bottleneck
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
17. Query, Keys & Values
Databases store information as pairs of keys and values (K, V).
Figure: Nikhil Shah, "Attention? An Other Perspective! [Part 1]" (2020)
Example: the key myfile.pdf retrieves its associated value, the file itself.
18. Query, Keys & Values
The (K, Q, V) terminology used to retrieve information from databases is adopted to formulate attention.
Figure: Nikhil Shah, "Attention? An Other Perspective! [Part 1]" (2020)
19. Query, Keys & Values
Attention is a mechanism to compute a context vector (c) for a query (Q) as a weighted sum of values (V).
Figure: Nikhil Shah, "Attention? An Other Perspective! [Part 1]" (2020)
20. Query, Keys & Values
Usually, both keys (K) and values (V) correspond to the encoder states, while the query (Q) is the decoder state.
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
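A minimal sketch of this computation under hypothetical dimensions, using a plain dot product as the similarity score (the scoring variants come later in the section):

```python
import torch

L, d = 4, 8                            # hypothetical: 4 encoder states of size d
K = V = torch.randn(L, d)              # keys and values: the encoder states
q = torch.randn(d)                     # query: the current decoder state

weights = torch.softmax(K @ q, dim=0)  # one similarity score per key
c = weights @ V                        # context vector c: weighted sum of values
print(c.shape)                         # torch.Size([8])
```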
36. Attention Mechanisms
Slide: Marta R. Costa-jussà
How to compute these attention scores?
● Additive attention
● Multiplicative attention
● Scaled dot product
38. Attention Mechanisms: Additive Attention
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." ICLR 2015.
The first attention mechanism used a one-hidden-layer feed-forward network to calculate the attention alignment:
39. Attention Mechanisms: Additive Attention
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." ICLR 2015.
The first attention mechanism used a one-hidden-layer feed-forward network to calculate the attention alignment:
[Figure: the encoder state h_j plays the role of the key (k) and the decoder state z_i that of the query (q), producing the attention weight a_j]
40. Attention Mechanisms: Additive Attention
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." ICLR 2015.
The original attention mechanism uses a one-hidden-layer feed-forward network to calculate the attention alignment:

a(q, k) = v_a^T tanh(W_a [q; k])

where [q; k] denotes the concatenation of query and key. It is often expressed as two separate linear transformations W_1 and W_2 for z_i and h_j, respectively:

a(q, k) = v_a^T tanh(W_1 q + W_2 k)
42. Attention Mechanisms: Multiplicative Attention
Luong, M.-T., Pham, H., & Manning, C. D. "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015.
A computationally efficient approach is to simply learn a matrix multiplication:

a(q, k) = q^T W k
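A minimal sketch of this score with hypothetical dimensions; W would be a learned parameter in a real model:

```python
import torch

d_q, d_k = 4, 6
W = torch.randn(d_q, d_k)         # learned parameter in practice
q, k = torch.randn(d_q), torch.randn(d_k)
score = q @ W @ k                 # a(q, k) = q^T W k, a scalar
print(score.item())
```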
43. Attention Mechanisms: Multiplicative Attention
Luong, M.-T., Pham, H., & Manning, C. D. "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015.
If the dimensions of q and k are equal (e.g. in Seq2Seq, by learning a linear projection), an even more computationally efficient approach is a dot product between query and key:

a(q, k) = q^T k
45. Attention Mechanisms: Scaled dot product
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. "Attention is all you need." NeurIPS 2017.
Multiplicative attention performs worse than additive attention for large dimensionality of the decoder state (d_k) unless a scaling factor is added:

a(q, k) = q^T k / √d_k
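A quick sketch comparing the plain and scaled dot-product scores (hypothetical dimension d_k):

```python
import math
import torch

d_k = 64
q, k = torch.randn(d_k), torch.randn(d_k)
dot = q @ k                        # a(q, k) = q^T k
scaled = dot / math.sqrt(d_k)      # a(q, k) = q^T k / sqrt(d_k)
print(dot.item(), scaled.item())   # scaling keeps scores in a softmax-friendly range
```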
46. Attention Mechanisms: Example
[Figure: the queries Q are compared against the keys K to compute the attention matrix A_t = softmax(Q K^T), which weighs the values V into the context vectors C = A_t V; here keys and values are the encoder states, of dimension d_h]
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
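A minimal sketch of this matrix-form computation in PyTorch, with hypothetical sizes for the number of keys/values, queries and the dimension d_h:

```python
import torch

L, T, d_h = 5, 3, 8                  # hypothetical: 5 keys/values, 3 queries
K = torch.randn(L, d_h)              # keys: e.g. the encoder states
V = K                                # values: here equal to the keys
Q = torch.randn(T, d_h)              # queries: e.g. the decoder states

A_t = torch.softmax(Q @ K.T / d_h ** 0.5, dim=-1)  # (T, L) attention weights
C = A_t @ V                          # (T, d_h) context vectors
print(A_t.sum(dim=-1))               # each row of A_t sums to 1
```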
49. Image Captioning without Attention
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. "Show and tell: A neural image caption generator." CVPR 2015.
[Figure: a CNN encodes the image into a single feature vector of dimension D, which initializes an LSTM that generates the caption word by word: "A bird flying ... <EOS>"]
Limitation: all output predictions are based on the final and static output of the encoder.
50. Image Captioning with Attention
[Figure: a CNN processes an input image of size H x W x 3]
Slide: Amaia Salvador
51. Attention for Image Captioning
[Figure: a CNN encodes the image (H x W x 3) into L visual embeddings. The first context vector c0 is the average of the L embeddings, and the first input word y0 is the <start> token. From them, the decoder state h0 produces L attention weights a1 and the predicted word y1.]
Slide: Amaia Salvador
52. Attention for Image Captioning
[Figure: the visual features weighted with the attention a1 give the next context vector c1. Together with y1, the word predicted in the previous timestep, it drives the next state h1, which produces the attention weights a2 and the word y2.]
Slide: Amaia Salvador
53. Attention for Image Captioning
[Figure: the same process repeats at every timestep: context c2 and word y2 feed the state h2, which produces the attention weights a3 and the word y3.]
Slide: Amaia Salvador
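A minimal sketch of this decoding loop, under assumptions not in the slides: hypothetical dimensions, a dot-product scoring function (Xu et al. use an MLP), and greedy word selection:

```python
import torch
import torch.nn as nn

L, d, VOCAB = 49, 128, 1000              # hypothetical: 7x7 CNN grid -> L = 49
emb = nn.Embedding(VOCAB, d)
cell = nn.LSTMCell(2 * d, d)             # input: [context; previous word]
out_proj = nn.Linear(d, VOCAB)

def decode(features, start_id=0, steps=5):
    """features: (L, d) visual embeddings produced by the CNN."""
    c = features.mean(dim=0)             # first context: average of L embeddings
    y = torch.tensor([start_id])         # first word: the <start> token
    h = s = torch.zeros(1, d)
    words = []
    for _ in range(steps):
        x = torch.cat([c.unsqueeze(0), emb(y)], dim=-1)
        h, s = cell(x, (h, s))
        a = torch.softmax(features @ h.squeeze(0), dim=0)  # L attention weights
        c = a @ features                 # features weighted with attention
        y = out_proj(h).argmax(-1)       # predicted word feeds the next timestep
        words.append(y.item())
    return words

print(decode(torch.randn(L, d)))
```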
54. Attention for Image Captioning
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
55. Attention for Image Captioning
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
56. Attention for Image Captioning
Side note: attention can be computed with the previous or the current hidden state.
[Figure: a variant of the previous scheme in which each state h_t first produces the attention weights a_t and the context c_t, which are then used to predict the word y_t at the same timestep; the first step again starts from the average of the visual features v and the <start> word y0.]
57. Exercise
Given a query vector and two keys:
● q = [0.3, 0.2, 0.1]
● k1 = [0.1, 0.3, 0.1]
● k2 = [0.6, 0.4, 0.2]
a) What are the attention weights a1 and a2 when computing the dot product?
b) Considering d_k = 3, what are the attention weights a1 and a2 when computing the scaled dot product?
c) Which key vector will receive more attention?
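The computation can be checked numerically with a few lines (a worked sketch; the answers can be read off the printed tensors):

```python
import torch

q  = torch.tensor([0.3, 0.2, 0.1])
k1 = torch.tensor([0.1, 0.3, 0.1])
k2 = torch.tensor([0.6, 0.4, 0.2])

scores = torch.stack([q @ k1, q @ k2])          # dot products: 0.10 and 0.28
print(torch.softmax(scores, dim=0))             # (a) weights without scaling
print(torch.softmax(scores / 3 ** 0.5, dim=0))  # (b) scaled by sqrt(d_k) = sqrt(3)
# (c) in both cases k2 receives the larger weight
```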
59. Learn more
● Tutorials
○ Sebastian Ruder, "Deep Learning for NLP Best Practices # Attention" (2017).
○ Chris Olah, Shan Carter, "Attention and Augmented Recurrent Neural Networks". Distill 2016.
○ Lilian Weng, "Attention? Attention!". Lil'Log 2017.
● Demos
○ François Fleuret (EPFL)
● Scientific publications
○ Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. "Transformers are RNNs: Fast autoregressive transformers with linear attention." ICML 2020.
○ Jayakumar, Siddhant M., Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. "Multiplicative Interactions and Where to Find Them." ICLR 2020. [tweet]
○ Woo, Sanghyun, Jongchan Park, Joon-Young Lee, and In So Kweon. "CBAM: Convolutional Block Attention Module." ECCV 2018.
○ Lu, Xiankai, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. "See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks." CVPR 2019.
○ Wang, Ziqin, Jun Xu, Li Liu, Fan Zhu, and Ling Shao. "RANet: Ranking Attention Network for Fast Video Object Segmentation." ICCV 2019.