Deep neural networks have achieved outstanding results in applications such as vision, language, audio, speech, and reinforcement learning. These powerful function approximators typically require large amounts of data to be trained, which poses a challenge in the usual case where little labeled data is available. During the last years, multiple solutions have been proposed to alleviate this problem, based on the concept of self-supervised learning, which can be understood as a specific case of unsupervised learning. This talk will cover its basic principles and provide examples in the field of multimedia.
7. Slides @DocXavi
Representation + Learning pipeline
[Figure: raw data (e.g. images) → feature extraction (x1: weight, x2: age) → classifier/decisor → probabilities (CAT: 0.7, DOG: 0.3) → y = 'CAT']
Slide concept: Santiago Pascual (UPC 2019). Garfield and Odie are characters created by Jim Davis.
8.
Representation + Learning pipeline
[Figure: the same pipeline, asking whether a neural network could perform the feature-extraction step: "Shall we extract features now?"]
9.
End-to-end Representation Learning
[Figure: raw data (e.g. images) fed directly into a neural network, which outputs the class probabilities (CAT: 0.7, DOG: 0.3) and the decision y = 'CAT']
We CAN inject the raw data, and the features will be learned inside the NN: the end-to-end concept.
10.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012. 64,963 citations (June 2020).
11.
● 1,000 object classes (categories).
● Labeled images:
○ 1.2M train
○ 100k test
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. "ImageNet large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252. [web]
19.
Unsupervised learning
● It is closer to the way intelligent beings perceive the world.
● It can save a great deal of effort when building an AI agent, compared to a fully supervised approach.
● Vast amounts of unlabelled data are available.
24.
Self-supervised learning
Reference: Andrew Zisserman (PAISS 2018)
Self-supervised learning is a form of unsupervised learning where the data itself provides the supervision.
● A pretext (or surrogate) task must be invented by withholding a part of the unlabeled data (X) and training the NN to predict it.
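As a minimal sketch of this idea (not from the slides): a rotation-prediction pretext task, in the spirit of RotNet, withholds the rotation applied to each image and uses it as a free label. The data and the `make_rotation_pretext` helper here are purely illustrative.

```python
import numpy as np

def make_rotation_pretext(images, rng):
    """Create (rotated_image, rotation_label) pairs from unlabeled images.

    The label (0, 1, 2, 3 for 0/90/180/270 degrees) is derived from the
    data itself -- no human annotation is needed.
    """
    inputs, labels = [], []
    for img in images:
        k = rng.integers(0, 4)           # withhold the rotation applied
        inputs.append(np.rot90(img, k))  # rotate by k * 90 degrees
        labels.append(k)
    return np.stack(inputs), np.array(labels)

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))      # a toy batch of unlabeled images
x, y = make_rotation_pretext(images, rng)
```

Training a classifier on (x, y) forces the network to learn object structure, even though no label was ever provided by a human.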
25.
Self-supervised learning
● By defining a proxy loss between the prediction ŷ and the withheld part of the data, the NN learns representations, which should be valuable for the actual downstream task.
Representations learned without labels.
27.
Transfer Learning
Instead of training a deep network from scratch for your task...
[Figure: a target model trained directly on the target data and labels (e.g. PASCAL), with only a small amount of data/labels]
Slide credit: Kevin McGuinness
28.
Transfer Learning
Instead of training a deep network from scratch for your task:
● Take a network trained on a different source domain and/or task (e.g. ImageNet, where a large amount of data/labels is available).
● Adapt it for your target domain and/or task (e.g. PASCAL, where only a small amount of data/labels is available).
[Figure: the knowledge learned by the source model is transferred to the target model]
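These two steps can be sketched in a toy setting where the "source" network is reduced to a frozen feature extractor and only a fresh linear head is trained on the small target set (all names and data below are illustrative stand-ins, not a real pre-trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for layers pre-trained on the source domain: a frozen
# random projection followed by a ReLU.
W_frozen = rng.standard_normal((64, 16))

def extract_features(x):
    return np.maximum(x @ W_frozen, 0.0)      # frozen source features

# Small labeled target set, and a fresh linear head for the target task.
x_target = rng.standard_normal((32, 64))
y_target = rng.integers(0, 2, size=32)
W_head = np.zeros((16, 2))

def head_loss():
    logits = extract_features(x_target) @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(32), y_target]).mean()

for _ in range(200):                          # adapt: train only the head
    f = extract_features(x_target)
    logits = f @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(32), y_target] -= 1.0         # softmax cross-entropy gradient
    W_head -= 0.01 * (f.T @ p) / 32
```

With real networks the same pattern applies: keep the pre-trained layers, replace the classifier head, and optionally unfreeze lower layers once the head has converged.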
29.
Transfer Learning
Train a deep net on a "nearby" task using standard backprop, with a surrogate loss on surrogate data:
● Supervised learning (e.g. ImageNet).
● Self-supervised learning (e.g. a language model).
Cut off the top layer(s) of the network and replace them with a supervised objective for the target domain.
Fine-tune the network using backprop with real labels and real data from the target domain, until the validation loss starts to increase.
[Figure: conv1-conv3 and fc1 are kept; fc2 + softmax with the surrogate loss is replaced by my_fc2 + softmax with the real loss]
30.
Self-supervised + Transfer Learning
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin)
39.
Autoencoder (AE)
Dimensionality reduction: use the hidden layer as a feature extractor of any desired size.
Fig: "Deep Learning Tutorial", Stanford
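A minimal linear autoencoder sketch of this idea (illustrative, not the figure's network): the bottleneck projection `W_enc` is the learned feature extractor, trained purely from reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 20))        # raw data: 100 samples, 20-dim

# Bottleneck of size 5: the hidden layer is the learned feature extractor.
W_enc = 0.1 * rng.standard_normal((20, 5))
W_dec = 0.1 * rng.standard_normal((5, 20))

for _ in range(1000):                     # minimise reconstruction error
    h = x @ W_enc                         # hidden code (the features)
    err = h @ W_dec - x                   # reconstruction minus input
    g_dec = h.T @ err / len(x)            # gradient w.r.t. decoder
    g_enc = x.T @ (err @ W_dec.T) / len(x)  # gradient w.r.t. encoder
    W_dec -= 0.05 * g_dec
    W_enc -= 0.05 * g_enc

features = x @ W_enc                      # features of any desired size (here 5)
```

A linear autoencoder like this converges toward the principal subspace of the data; deep nonlinear autoencoders extend the same objective to richer features.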
40.
Denoising Autoencoder (DAE)
1. Corrupt the input signal, but predict the clean version at the output: the encoder (W1) maps the corrupted data to a hidden code h, the decoder (W2) maps h back to a reconstruction, and the loss is the reconstruction error with respect to the clean data.
2. Train for the final task with "few" labels.
Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. "Extracting and composing robust features with denoising autoencoders." ICML 2008.
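The corruption setup of step 1 can be sketched as follows (masking noise is one of several corruptions studied in the paper; the network itself is omitted here and `x_hat` is only a placeholder for its output):

```python
import numpy as np

rng = np.random.default_rng(0)
x_clean = rng.random((4, 16))             # clean input signals

# Corrupt the input (here: randomly zero out ~30% of the entries),
# but keep the *clean* signal as the reconstruction target.
mask = rng.random(x_clean.shape) > 0.3
x_corrupted = x_clean * mask

# The DAE maps x_corrupted -> x_hat; the loss always compares against
# the clean version, which forces robust, noise-invariant features.
x_hat = x_corrupted                       # placeholder for the network output
reconstruction_error = ((x_hat - x_clean) ** 2).mean()
```

Because the target is the clean signal, merely copying the input can never achieve zero loss, so the model must learn the structure of the data.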
43.
Eigen, David, Dilip Krishnan, and Rob Fergus. "Restoring an image taken through a window covered with dirt or rain." ICCV 2013.
44.
Masked Autoencoder (MAE)
Take a sequence of words as input, mask out 15% of the words, and ask the system to predict the missing words (or a distribution over words).
#BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019 (best paper award).
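A simplified sketch of the masking step (the real BERT recipe additionally replaces some selected tokens with random words or leaves them unchanged; this version always substitutes `[MASK]`):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style pretext: hide ~15% of the words and keep them as targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append("[MASK]")       # the network must predict this
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)          # no loss on unmasked positions
    return inputs, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(sentence)
```

The loss is computed only at the masked positions, so the supervision is the text itself.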
46.
Spatial verification
Predict the relative position between two image patches.
#RelPos Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context prediction." ICCV 2015.
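A toy version of the patch sampling (the actual method adds gaps and jitter between patches to prevent trivial low-level cues; the sizes and helper name here are illustrative):

```python
import numpy as np

# Eight possible positions of the neighbour patch relative to the centre.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def make_relative_position_sample(image, patch=32, seed=0):
    """Extract a centre patch and a random neighbour patch; the neighbour's
    position index (0-7) is the free classification label."""
    rng = np.random.default_rng(seed)
    label = int(rng.integers(len(OFFSETS)))
    dy, dx = OFFSETS[label]
    cy = cx = patch                       # top-left corner of the centre patch
    centre = image[cy:cy + patch, cx:cx + patch]
    neighbour = image[cy + dy * patch:cy + (dy + 1) * patch,
                      cx + dx * patch:cx + (dx + 1) * patch]
    return centre, neighbour, label

image = np.random.default_rng(0).random((96, 96))   # toy grayscale image
centre, neighbour, label = make_relative_position_sample(image)
```

Solving this task well requires recognising object layout, which is why the learned features transfer.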
47.
Spatial verification
Train a neural network to solve a jigsaw puzzle.
#Jigsaw Noroozi, M., & Favaro, P. "Unsupervised learning of visual representations by solving jigsaw puzzles." ECCV 2016.
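A toy sketch of how the jigsaw pretext labels are built (the paper uses 9 tiles and a fixed subset of the 9! permutations; 4 tiles are used here to keep the class count small):

```python
import itertools
import random

# The pretext "label" is which permutation was applied to the tiles.
PERMUTATIONS = list(itertools.permutations(range(4)))   # 4 tiles -> 24 classes

def make_jigsaw_sample(tiles, seed=0):
    """Shuffle image tiles; the permutation index is the training target."""
    rng = random.Random(seed)
    label = rng.randrange(len(PERMUTATIONS))
    shuffled = [tiles[i] for i in PERMUTATIONS[label]]
    return shuffled, label

tiles = ["top-left", "top-right", "bottom-left", "bottom-right"]
shuffled, label = make_jigsaw_sample(tiles)
```

The network must classify which of the permutations it is seeing, which is only possible by understanding how object parts fit together.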
50.
Temporal coherence
The temporal order of frames is exploited as the supervisory signal for learning.
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin)
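A sketch of how such order-verification samples can be generated (the paper's frame sampling differs in detail, e.g. it favours high-motion windows; the helper name and stand-in frames are illustrative):

```python
import random

def make_order_verification_sample(frames, seed=0):
    """Sample a frame triplet; label 1 if temporally ordered, 0 if shuffled."""
    rng = random.Random(seed)
    i, j, k = sorted(rng.sample(range(len(frames)), 3))
    if rng.random() < 0.5:
        return [frames[i], frames[j], frames[k]], 1   # correct order
    else:
        return [frames[i], frames[k], frames[j]], 0   # shuffled order

frames = list(range(30))                 # stand-in for video frames
triplet, label = make_order_verification_sample(frames)
```

The binary label comes entirely from the video itself: time provides the supervision.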
51.
Temporal sorting
Sort the sequence of frames.
Lee, Hsin-Ying, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. "Unsupervised representation learning by sorting sequences." ICCV 2017.
52.
Arrow of time
Predict whether the video moves forward or backward.
#T-CAM Wei, Donglai, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. "Learning and using the arrow of time." CVPR 2018.
54.
Cycle-consistency
Training: track backward and then forward with pixel embeddings, then train a NN to match the pairs of associated pixel embeddings.
#TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros. "Learning Correspondence from the Cycle-Consistency of Time." CVPR 2019. [code] [Slides by Carles Ventura]
55.
Cycle-consistency
Inference (video object segmentation): track pixels across video sequences.
#TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros. "Learning Correspondence from the Cycle-Consistency of Time." CVPR 2019. [code] [Slides by Carles Ventura]
58.
Next word prediction (language model)
A classic self-supervised task: predict the next word in a sequence.
Figure: TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal of Machine Learning Research 3 (2003): 1137-1155.
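The supervision comes for free from running text: every word is the target for its preceding context. A bigram counting model is the simplest possible sketch of this (illustrative only, far from the neural model in the paper):

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# Every (word, next word) pair in the corpus is a free training example.
next_counts = defaultdict(Counter)
for w, nxt in zip(corpus, corpus[1:]):
    next_counts[w][nxt] += 1

def predict_next(word):
    """Return the most frequently observed continuation of `word`."""
    return next_counts[word].most_common(1)[0][0]
```

A neural language model replaces the counts with a learned parametric distribution, but the self-supervised objective is the same.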
59.
Next sentence prediction (TRUE/FALSE)
BERT is also pre-trained on a binarized next-sentence prediction task: select sentence pairs A and B such that, for 50% of the data, B is the actual sentence following A, and for the remaining 50%, B is randomly selected from the corpus; the model learns to tell the two cases apart.
#BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019 (best paper award).
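A sketch of how such pairs can be built (in BERT the random sentence is drawn from a different document; here it is simply any non-next sentence from the same toy corpus):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (A, B, is_next) pairs: 50% true next sentence, 50% random."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        actual_next = sentences[i + 1]
        if rng.random() < 0.5:
            pairs.append((sentences[i], actual_next, True))
        else:
            b = rng.choice([s for s in sentences if s != actual_next])
            pairs.append((sentences[i], b, False))
    return pairs

corpus = [f"sentence {i}" for i in range(10)]
pairs = make_nsp_pairs(corpus)
```

The TRUE/FALSE label is again derived from the corpus itself, with no human annotation.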
60.
Video Frame Prediction
Learning video representations (features) by...
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML 2015. [Github]
61.
Video Frame Prediction
Learning video representations (features) by... (1) frame reconstruction (AE).
62.
Video Frame Prediction
Learning video representations (features) by... (2) frame prediction.
63.
Video Frame Prediction
Unsupervised learned features (lots of data) are fine-tuned for activity recognition (small data).
65.
Colorization (still image)
Surrogate task: colorize a grayscale image.
Zhang, R., Isola, P., & Efros, A. A. "Colorful image colorization." ECCV 2016.
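The pretext data construction is essentially free (sketch below; the actual paper works in Lab color space and predicts a distribution over quantized ab values, while a crude channel mean stands in for the grayscale conversion here):

```python
import numpy as np

rng = np.random.default_rng(0)
color_image = rng.random((32, 32, 3))     # any unlabeled RGB image

# Withhold the color: the grayscale version is the input, and the
# original colors are the prediction target -- supervision for free.
grayscale = color_image.mean(axis=2)      # crude luminance (sketch only)
x, target = grayscale, color_image
```

Predicting plausible colors requires recognising what is in the image (grass is green, sky is blue), which is exactly the kind of semantic knowledge worth transferring.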
66.
Pixel tracking
Pre-training: a NN is trained to point to the location in a reference frame with the most similar color...
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by colorizing videos." ECCV 2018. [blog]
67.
Pixel tracking
Pre-training: a NN is trained to point to the location in a reference frame with the most similar color... computing a loss with respect to the actual color.
68.
Pixel tracking
For each grayscale pixel to colorize, find the color of the most similar pixel embedding (representation) from the first frame.
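A hard nearest-neighbour sketch of that lookup (the paper uses a softmax-weighted copy over all reference pixels rather than an argmax, and the embeddings come from a CNN; random vectors stand in for them here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Embeddings for each pixel of a reference frame and a target frame.
ref_emb = rng.standard_normal((50, 8))    # 50 reference pixels, 8-dim each
ref_color = rng.random((50, 3))           # known colors in the reference frame
tgt_emb = rng.standard_normal((20, 8))    # grayscale pixels to colorize

# For each target pixel, copy the color of the reference pixel with the
# most similar embedding (highest dot-product similarity).
similarity = tgt_emb @ ref_emb.T          # shape (20, 50)
nearest = similarity.argmax(axis=1)
predicted_color = ref_color[nearest]
```

Because colors must be copied from the right place, minimising the colorization loss implicitly teaches the embeddings to track pixels.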
69.
Pixel tracking
Application: colorization from a reference frame.
[Figure: reference frame + grayscale video → colorized video]
73.
The Gelato Bet
"If, by the first day of autumn (Sept 23) of 2015, a
method will exist that can match or beat the
performance of R-CNN on Pascal VOC detection,
without the use of any extra, human annotations
(e.g. ImageNet) as pre-training, Mr. Malik promises
to buy Mr. Efros one (1) gelato (2 scoops: one
chocolate, one vanilla)."
Source: Ankesh Anand, "Contrastive Self-Supervised Learning" (2020)
74.
Contrastive Loss
Learn G_W(X) with pairs (X1, X2) which are similar (Y=0) or dissimilar (Y=1).
Hadsell, Raia, Sumit Chopra, and Yann LeCun. "Dimensionality reduction by learning an invariant mapping." CVPR 2006.
[Figure: a similar pair (X1, X2) and a dissimilar pair (X1, X2)]
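The loss itself, for a single pair of embeddings with margin m (following the form in Hadsell et al.: similar pairs are pulled together, dissimilar pairs are pushed apart only until they are m away):

```python
import numpy as np

def contrastive_loss(g1, g2, y, margin=1.0):
    """Contrastive loss on a pair of embeddings G_W(X1), G_W(X2).

    y = 0 for similar pairs (pulled together), y = 1 for dissimilar
    pairs (pushed apart until their distance exceeds `margin`).
    """
    d = np.linalg.norm(g1 - g2)
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, margin - d) ** 2

a = np.array([0.0, 0.0])
b = np.array([0.3, 0.4])                  # distance 0.5 from a

similar = contrastive_loss(a, b, y=0)     # 0.5 * 0.5^2 = 0.125
dissimilar = contrastive_loss(a, b, y=1)  # 0.5 * (1 - 0.5)^2 = 0.125
far = np.array([3.0, 4.0])                # distance 5 > margin
no_push = contrastive_loss(a, far, y=1)   # 0.0: already far enough apart
```

The margin term is what prevents the trivial solution of pushing all pairs infinitely far apart.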
75.
Momentum Contrast (MoCo)
Contrastive loss between:
● Encoded features of a query image.
● A running queue of encoded keys.
Losses over representations (instead of pixels) facilitate abstraction.
#MoCo He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. "Momentum Contrast for Unsupervised Visual Representation Learning." CVPR 2020. [code]
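A structural sketch of the MoCo training loop (toy linear encoders, with the backprop update of the query encoder replaced by a random drift just to show the momentum lag; only the queue and momentum mechanics follow the paper):

```python
import collections
import numpy as np

rng = np.random.default_rng(0)

theta_q = rng.standard_normal((16, 8))    # query encoder (toy linear map)
theta_k = theta_q.copy()                  # key encoder starts as a copy
m = 0.999                                 # momentum coefficient
queue = collections.deque(maxlen=4096)    # running queue of encoded keys

def encode(x, theta):
    z = x @ theta
    return z / np.linalg.norm(z)          # L2-normalised embedding

for _ in range(10):                       # toy training steps
    x_q = rng.standard_normal(16)               # query view of an image
    x_k = x_q + 0.1 * rng.standard_normal(16)   # key view (other augmentation)
    q, k = encode(x_q, theta_q), encode(x_k, theta_k)

    # Contrastive logits: the positive key first, then all queued negatives.
    logits = np.array([q @ k] + [q @ n for n in queue])

    # A real step would minimise InfoNCE over `logits` (target index 0) and
    # backprop into theta_q; a small random drift stands in for that here.
    theta_q -= 0.01 * rng.standard_normal(theta_q.shape)

    theta_k = m * theta_k + (1 - m) * theta_q   # momentum update: keys lag
    queue.append(k)                             # enqueue the newest key
```

The slowly-moving key encoder keeps the queued keys consistent, which is what lets MoCo use a very large set of negatives without a huge batch.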
76.
Momentum Contrast (MoCo)
Pretext task: treat each image instance as a distinct class of its own and train a classifier to distinguish whether two crops (views) belong to the same image.
Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. "Unsupervised feature learning via non-parametric instance discrimination." CVPR 2018.
77.
Transfer Learning from ImageNet labels (super. IN-1M)
[Figure: a source model trained on ImageNet source data and labels (large amount of data/labels); the learned knowledge is transferred to the target model, which is fine-tuned on the target data and labels (e.g. PASCAL, small amount of data/labels)]
78.
Transfer Learning without ImageNet labels (MoCo IN-1M)
[Figure: the same diagram, but the source model is pre-trained on ImageNet data only (large amount of data, no labels) before transferring to the target task (small amount of data/labels)]
Slide concept: Kevin McGuinness
79.
Transfer Learning (super. IN-1M vs MoCo IN-1M)
Transferring to VOC object detection: a Faster R-CNN detector (C4 backbone) is fine-tuned end-to-end.
82.
Correlated views by data augmentation
#SimCLR Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive learning of visual representations." arXiv preprint arXiv:2002.05709 (2020). [tweet]
Original image cc-by: Von.grzanka
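The two correlated views are simply two independent random augmentations of the same image (sketch below with toy crop/flip augmentations; SimCLR additionally uses color distortion and blur):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))              # toy grayscale image

def augment(img, rng):
    """Toy augmentation: random 24x24 crop plus random horizontal flip."""
    y, x = rng.integers(0, 8, size=2)
    crop = img[y:y + 24, x:x + 24]
    return crop[:, ::-1] if rng.random() < 0.5 else crop

# Two independently augmented views of the SAME image form a positive
# pair for the contrastive loss; views of other images act as negatives.
view_a = augment(image, rng)
view_b = augment(image, rng)
```

The encoder is trained to map both views to nearby embeddings, so the representation becomes invariant to the chosen augmentations.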