Deep neural networks have achieved outstanding results in applications such as vision, language, audio, speech, and reinforcement learning. These powerful function approximators typically require large amounts of data to be trained, which poses a challenge in the usual case where little labeled data is available. During the last years, multiple solutions have been proposed to alleviate this problem, based on the concept of self-supervised learning, which can be understood as a specific case of unsupervised learning. This talk will cover its basic principles and provide examples in the field of multimedia.
7. Slides @DocXavi
Representation + Learning pipeline
[Figure: raw data (e.g. images) → feature extraction (x1: weight, x2: age) → classifier/decisor → probabilities (CAT: 0.7, DOG: 0.3) → y = 'CAT']
Slide concept: Santiago Pascual (UPC 2019). Garfield and Odie are characters created by Jim Davis.
8.
Representation + Learning pipeline
[Figure: the same pipeline, asking whether a neural network could perform the feature-extraction step: "Shall we extract features now?"]
9.
End-to-end Representation Learning
[Figure: raw data (e.g. images) fed directly into a neural network, which outputs the class probabilities (CAT: 0.7, DOG: 0.3) and the decision y = 'CAT']
We CAN inject the raw data, and the features will be learned inside the NN: the end-to-end concept.
10.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012. 64,963 citations (June 2020).
11.
● 1,000 object classes (categories).
● Labeled images:
○ 1.2M train
○ 100k test
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. "ImageNet large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252. [web]
19.
Unsupervised learning
● It is closer to the way intelligent beings perceive the world.
● It can save a great deal of effort when building an AI agent, compared to a fully supervised approach.
● Vast amounts of unlabelled data are available.
24.
Self-supervised learning
Reference: Andrew Zisserman (PAISS 2018)
Self-supervised learning is a form of unsupervised learning where the data itself provides the supervision.
● A pretext (or surrogate) task must be invented by withholding a part of the unlabeled data (X) and training the NN to predict it.
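As a minimal sketch of this idea (not from the slides): a rotation-prediction pretext task, in the spirit of RotNet, withholds the rotation applied to each image and uses it as a free label. The data and the `make_rotation_pretext` helper here are purely illustrative.

```python
import numpy as np

def make_rotation_pretext(images, rng):
    """Create (rotated_image, rotation_label) pairs from unlabeled images.

    The label (0, 1, 2, 3 for 0/90/180/270 degrees) is derived from the
    data itself -- no human annotation is needed.
    """
    inputs, labels = [], []
    for img in images:
        k = rng.integers(0, 4)           # withhold the rotation applied
        inputs.append(np.rot90(img, k))  # rotate by k * 90 degrees
        labels.append(k)
    return np.stack(inputs), np.array(labels)

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))      # a toy batch of unlabeled images
x, y = make_rotation_pretext(images, rng)
```

Training a classifier on (x, y) forces the network to learn object structure, even though no label was ever provided by a human.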
25.
Self-supervised learning
● By defining a proxy loss between the prediction ŷ and the withheld part of the data, the NN learns representations, which should be valuable for the actual downstream task.
Representations learned without labels.
27.
Transfer Learning
Instead of training a deep network from scratch for your task...
[Figure: a target model trained directly on the target data and labels (e.g. PASCAL), with only a small amount of data/labels]
Slide credit: Kevin McGuinness
28.
Transfer Learning
Instead of training a deep network from scratch for your task:
● Take a network trained on a different source domain and/or task (e.g. ImageNet, where a large amount of data/labels is available).
● Adapt it for your target domain and/or task (e.g. PASCAL, where only a small amount of data/labels is available).
[Figure: the knowledge learned by the source model is transferred to the target model]
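These two steps can be sketched in a toy setting where the "source" network is reduced to a frozen feature extractor and only a fresh linear head is trained on the small target set (all names and data below are illustrative stand-ins, not a real pre-trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for layers pre-trained on the source domain: a frozen
# random projection followed by a ReLU.
W_frozen = rng.standard_normal((64, 16))

def extract_features(x):
    return np.maximum(x @ W_frozen, 0.0)      # frozen source features

# Small labeled target set, and a fresh linear head for the target task.
x_target = rng.standard_normal((32, 64))
y_target = rng.integers(0, 2, size=32)
W_head = np.zeros((16, 2))

def head_loss():
    logits = extract_features(x_target) @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(32), y_target]).mean()

for _ in range(200):                          # adapt: train only the head
    f = extract_features(x_target)
    logits = f @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(32), y_target] -= 1.0         # softmax cross-entropy gradient
    W_head -= 0.01 * (f.T @ p) / 32
```

With real networks the same pattern applies: keep the pre-trained layers, replace the classifier head, and optionally unfreeze lower layers once the head has converged.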
29.
Transfer Learning
Train a deep net on a "nearby" task using standard backprop, with a surrogate loss on surrogate data:
● Supervised learning (e.g. ImageNet).
● Self-supervised learning (e.g. a language model).
Cut off the top layer(s) of the network and replace them with a supervised objective for the target domain.
Fine-tune the network using backprop with real labels and real data from the target domain, until the validation loss starts to increase.
[Figure: conv1-conv3 and fc1 are kept; fc2 + softmax with the surrogate loss is replaced by my_fc2 + softmax with the real loss]
30.
Self-supervised + Transfer Learning
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin)
39.
Autoencoder (AE)
Dimensionality reduction: use the hidden layer as a feature extractor of any desired size.
Fig: "Deep Learning Tutorial", Stanford
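A minimal linear autoencoder sketch of this idea (illustrative, not the figure's network): the bottleneck projection `W_enc` is the learned feature extractor, trained purely from reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 20))        # raw data: 100 samples, 20-dim

# Bottleneck of size 5: the hidden layer is the learned feature extractor.
W_enc = 0.1 * rng.standard_normal((20, 5))
W_dec = 0.1 * rng.standard_normal((5, 20))

for _ in range(1000):                     # minimise reconstruction error
    h = x @ W_enc                         # hidden code (the features)
    err = h @ W_dec - x                   # reconstruction minus input
    g_dec = h.T @ err / len(x)            # gradient w.r.t. decoder
    g_enc = x.T @ (err @ W_dec.T) / len(x)  # gradient w.r.t. encoder
    W_dec -= 0.05 * g_dec
    W_enc -= 0.05 * g_enc

features = x @ W_enc                      # features of any desired size (here 5)
```

A linear autoencoder like this converges toward the principal subspace of the data; deep nonlinear autoencoders extend the same objective to richer features.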
40.
Denoising Autoencoder (DAE)
1. Corrupt the input signal, but predict the clean version at the output: the encoder (W1) maps the corrupted data to a hidden code h, the decoder (W2) maps h back to a reconstruction, and the loss is the reconstruction error with respect to the clean data.
2. Train for the final task with "few" labels.
Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. "Extracting and composing robust features with denoising autoencoders." ICML 2008.
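The corruption setup of step 1 can be sketched as follows (masking noise is one of several corruptions studied in the paper; the network itself is omitted here and `x_hat` is only a placeholder for its output):

```python
import numpy as np

rng = np.random.default_rng(0)
x_clean = rng.random((4, 16))             # clean input signals

# Corrupt the input (here: randomly zero out ~30% of the entries),
# but keep the *clean* signal as the reconstruction target.
mask = rng.random(x_clean.shape) > 0.3
x_corrupted = x_clean * mask

# The DAE maps x_corrupted -> x_hat; the loss always compares against
# the clean version, which forces robust, noise-invariant features.
x_hat = x_corrupted                       # placeholder for the network output
reconstruction_error = ((x_hat - x_clean) ** 2).mean()
```

Because the target is the clean signal, merely copying the input can never achieve zero loss, so the model must learn the structure of the data.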
43.
Eigen, David, Dilip Krishnan, and Rob Fergus. "Restoring an image taken through a window covered with dirt or rain." ICCV 2013.
44.
Masked Autoencoder (MAE)
Take a sequence of words as input, mask out 15% of the words, and ask the system to predict the missing words (or a distribution over words).
#BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019 (best paper award).
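A simplified sketch of the masking step (the real BERT recipe additionally replaces some selected tokens with random words or leaves them unchanged; this version always substitutes `[MASK]`):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style pretext: hide ~15% of the words and keep them as targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append("[MASK]")       # the network must predict this
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)          # no loss on unmasked positions
    return inputs, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(sentence)
```

The loss is computed only at the masked positions, so the supervision is the text itself.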
46.
Spatial verification
Predict the relative position between two image patches.
#RelPos Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context prediction." ICCV 2015.
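A toy version of the patch sampling (the actual method adds gaps and jitter between patches to prevent trivial low-level cues; the sizes and helper name here are illustrative):

```python
import numpy as np

# Eight possible positions of the neighbour patch relative to the centre.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def make_relative_position_sample(image, patch=32, seed=0):
    """Extract a centre patch and a random neighbour patch; the neighbour's
    position index (0-7) is the free classification label."""
    rng = np.random.default_rng(seed)
    label = int(rng.integers(len(OFFSETS)))
    dy, dx = OFFSETS[label]
    cy = cx = patch                       # top-left corner of the centre patch
    centre = image[cy:cy + patch, cx:cx + patch]
    neighbour = image[cy + dy * patch:cy + (dy + 1) * patch,
                      cx + dx * patch:cx + (dx + 1) * patch]
    return centre, neighbour, label

image = np.random.default_rng(0).random((96, 96))   # toy grayscale image
centre, neighbour, label = make_relative_position_sample(image)
```

Solving this task well requires recognising object layout, which is why the learned features transfer.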
47.
Spatial verification
Train a neural network to solve a jigsaw puzzle.
#Jigsaw Noroozi, M., & Favaro, P. "Unsupervised learning of visual representations by solving jigsaw puzzles." ECCV 2016.
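A toy sketch of how the jigsaw pretext labels are built (the paper uses 9 tiles and a fixed subset of the 9! permutations; 4 tiles are used here to keep the class count small):

```python
import itertools
import random

# The pretext "label" is which permutation was applied to the tiles.
PERMUTATIONS = list(itertools.permutations(range(4)))   # 4 tiles -> 24 classes

def make_jigsaw_sample(tiles, seed=0):
    """Shuffle image tiles; the permutation index is the training target."""
    rng = random.Random(seed)
    label = rng.randrange(len(PERMUTATIONS))
    shuffled = [tiles[i] for i in PERMUTATIONS[label]]
    return shuffled, label

tiles = ["top-left", "top-right", "bottom-left", "bottom-right"]
shuffled, label = make_jigsaw_sample(tiles)
```

The network must classify which of the permutations it is seeing, which is only possible by understanding how object parts fit together.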
50.
Temporal coherence
The temporal order of frames is exploited as the supervisory signal for learning.
#Shuffle&Learn Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016. [code] (Slides by Xunyu Lin)
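A sketch of how such order-verification samples can be generated (the paper's frame sampling differs in detail, e.g. it favours high-motion windows; the helper name and stand-in frames are illustrative):

```python
import random

def make_order_verification_sample(frames, seed=0):
    """Sample a frame triplet; label 1 if temporally ordered, 0 if shuffled."""
    rng = random.Random(seed)
    i, j, k = sorted(rng.sample(range(len(frames)), 3))
    if rng.random() < 0.5:
        return [frames[i], frames[j], frames[k]], 1   # correct order
    else:
        return [frames[i], frames[k], frames[j]], 0   # shuffled order

frames = list(range(30))                 # stand-in for video frames
triplet, label = make_order_verification_sample(frames)
```

The binary label comes entirely from the video itself: time provides the supervision.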
51.
Temporal sorting
Sort the sequence of frames.
Lee, Hsin-Ying, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. "Unsupervised representation learning by sorting sequences." ICCV 2017.
52.
Arrow of time
Predict whether the video moves forward or backward.
#T-CAM Wei, Donglai, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. "Learning and using the arrow of time." CVPR 2018.
54.
Cycle-consistency
Training: track backward and then forward with pixel embeddings, then train a NN to match the pairs of associated pixel embeddings.
#TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros. "Learning Correspondence from the Cycle-Consistency of Time." CVPR 2019. [code] [Slides by Carles Ventura]
55.
Cycle-consistency
Inference (video object segmentation): track pixels across video sequences.
#TimeCycle Xiaolong Wang, Allan Jabri, Alexei A. Efros. "Learning Correspondence from the Cycle-Consistency of Time." CVPR 2019. [code] [Slides by Carles Ventura]
58.
Next word prediction (language model)
A classic self-supervised task: predict the next word in a sequence.
Figure: TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." Journal of Machine Learning Research 3 (2003): 1137-1155.
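The supervision comes for free from running text: every word is the target for its preceding context. A bigram counting model is the simplest possible sketch of this (illustrative only, far from the neural model in the paper):

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# Every (word, next word) pair in the corpus is a free training example.
next_counts = defaultdict(Counter)
for w, nxt in zip(corpus, corpus[1:]):
    next_counts[w][nxt] += 1

def predict_next(word):
    """Return the most frequently observed continuation of `word`."""
    return next_counts[word].most_common(1)[0][0]
```

A neural language model replaces the counts with a learned parametric distribution, but the self-supervised objective is the same.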
59.
Next sentence prediction (TRUE/FALSE)
BERT is also pre-trained on a binarized next-sentence prediction task: select sentence pairs A and B such that, for 50% of the data, B is the actual sentence following A, and for the remaining 50%, B is randomly selected from the corpus; the model learns to tell the two cases apart.
#BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019 (best paper award).
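A sketch of how such pairs can be built (in BERT the random sentence is drawn from a different document; here it is simply any non-next sentence from the same toy corpus):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (A, B, is_next) pairs: 50% true next sentence, 50% random."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        actual_next = sentences[i + 1]
        if rng.random() < 0.5:
            pairs.append((sentences[i], actual_next, True))
        else:
            b = rng.choice([s for s in sentences if s != actual_next])
            pairs.append((sentences[i], b, False))
    return pairs

corpus = [f"sentence {i}" for i in range(10)]
pairs = make_nsp_pairs(corpus)
```

The TRUE/FALSE label is again derived from the corpus itself, with no human annotation.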
60.
Video Frame Prediction
Learning video representations (features) by...
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML 2015. [Github]
61.
Video Frame Prediction
Learning video representations (features) by... (1) frame reconstruction (AE).
62.
Video Frame Prediction
Learning video representations (features) by... (2) frame prediction.
63.
Video Frame Prediction
Unsupervised learned features (lots of data) are fine-tuned for activity recognition (small data).
65.
Colorization (still image)
Surrogate task: colorize a grayscale image.
Zhang, R., Isola, P., & Efros, A. A. "Colorful image colorization." ECCV 2016.
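The pretext data construction is essentially free (sketch below; the actual paper works in Lab color space and predicts a distribution over quantized ab values, while a crude channel mean stands in for the grayscale conversion here):

```python
import numpy as np

rng = np.random.default_rng(0)
color_image = rng.random((32, 32, 3))     # any unlabeled RGB image

# Withhold the color: the grayscale version is the input, and the
# original colors are the prediction target -- supervision for free.
grayscale = color_image.mean(axis=2)      # crude luminance (sketch only)
x, target = grayscale, color_image
```

Predicting plausible colors requires recognising what is in the image (grass is green, sky is blue), which is exactly the kind of semantic knowledge worth transferring.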
66.
Pixel tracking
Pre-training: a NN is trained to point to the location in a reference frame with the most similar color...
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by colorizing videos." ECCV 2018. [blog]
67.
Pixel tracking
Pre-training: a NN is trained to point to the location in a reference frame with the most similar color... computing a loss with respect to the actual color.
68.
Pixel tracking
For each grayscale pixel to colorize, find the color of the most similar pixel embedding (representation) from the first frame.
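A hard nearest-neighbour sketch of that lookup (the paper uses a softmax-weighted copy over all reference pixels rather than an argmax, and the embeddings come from a CNN; random vectors stand in for them here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Embeddings for each pixel of a reference frame and a target frame.
ref_emb = rng.standard_normal((50, 8))    # 50 reference pixels, 8-dim each
ref_color = rng.random((50, 3))           # known colors in the reference frame
tgt_emb = rng.standard_normal((20, 8))    # grayscale pixels to colorize

# For each target pixel, copy the color of the reference pixel with the
# most similar embedding (highest dot-product similarity).
similarity = tgt_emb @ ref_emb.T          # shape (20, 50)
nearest = similarity.argmax(axis=1)
predicted_color = ref_color[nearest]
```

Because colors must be copied from the right place, minimising the colorization loss implicitly teaches the embeddings to track pixels.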
69.
Pixel tracking
Application: colorization from a reference frame.
[Figure: reference frame + grayscale video → colorized video]
73.
The Gelato Bet
"If, by the first day of autumn (Sept 23) of 2015, a
method will exist that can match or beat the
performance of R-CNN on Pascal VOC detection,
without the use of any extra, human annotations
(e.g. ImageNet) as pre-training, Mr. Malik promises
to buy Mr. Efros one (1) gelato (2 scoops: one
chocolate, one vanilla)."
Source: Ankesh Anand, "Contrastive Self-Supervised Learning" (2020)
74.
Contrastive Loss
Learn G_W(X) with pairs (X1, X2) which are similar (Y=0) or dissimilar (Y=1).
Hadsell, Raia, Sumit Chopra, and Yann LeCun. "Dimensionality reduction by learning an invariant mapping." CVPR 2006.
[Figure: a similar pair (X1, X2) and a dissimilar pair (X1, X2)]
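The loss itself, for a single pair of embeddings with margin m (following the form in Hadsell et al.: similar pairs are pulled together, dissimilar pairs are pushed apart only until they are m away):

```python
import numpy as np

def contrastive_loss(g1, g2, y, margin=1.0):
    """Contrastive loss on a pair of embeddings G_W(X1), G_W(X2).

    y = 0 for similar pairs (pulled together), y = 1 for dissimilar
    pairs (pushed apart until their distance exceeds `margin`).
    """
    d = np.linalg.norm(g1 - g2)
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, margin - d) ** 2

a = np.array([0.0, 0.0])
b = np.array([0.3, 0.4])                  # distance 0.5 from a

similar = contrastive_loss(a, b, y=0)     # 0.5 * 0.5^2 = 0.125
dissimilar = contrastive_loss(a, b, y=1)  # 0.5 * (1 - 0.5)^2 = 0.125
far = np.array([3.0, 4.0])                # distance 5 > margin
no_push = contrastive_loss(a, far, y=1)   # 0.0: already far enough apart
```

The margin term is what prevents the trivial solution of pushing all pairs infinitely far apart.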
75.
Momentum Contrast (MoCo)
Contrastive loss between:
● Encoded features of a query image.
● A running queue of encoded keys.
Losses over representations (instead of pixels) facilitate abstraction.
#MoCo He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. "Momentum Contrast for Unsupervised Visual Representation Learning." CVPR 2020. [code]
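A structural sketch of the MoCo training loop (toy linear encoders, with the backprop update of the query encoder replaced by a random drift just to show the momentum lag; only the queue and momentum mechanics follow the paper):

```python
import collections
import numpy as np

rng = np.random.default_rng(0)

theta_q = rng.standard_normal((16, 8))    # query encoder (toy linear map)
theta_k = theta_q.copy()                  # key encoder starts as a copy
m = 0.999                                 # momentum coefficient
queue = collections.deque(maxlen=4096)    # running queue of encoded keys

def encode(x, theta):
    z = x @ theta
    return z / np.linalg.norm(z)          # L2-normalised embedding

for _ in range(10):                       # toy training steps
    x_q = rng.standard_normal(16)               # query view of an image
    x_k = x_q + 0.1 * rng.standard_normal(16)   # key view (other augmentation)
    q, k = encode(x_q, theta_q), encode(x_k, theta_k)

    # Contrastive logits: the positive key first, then all queued negatives.
    logits = np.array([q @ k] + [q @ n for n in queue])

    # A real step would minimise InfoNCE over `logits` (target index 0) and
    # backprop into theta_q; a small random drift stands in for that here.
    theta_q -= 0.01 * rng.standard_normal(theta_q.shape)

    theta_k = m * theta_k + (1 - m) * theta_q   # momentum update: keys lag
    queue.append(k)                             # enqueue the newest key
```

The slowly-moving key encoder keeps the queued keys consistent, which is what lets MoCo use a very large set of negatives without a huge batch.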
76.
Momentum Contrast (MoCo)
Pretext task: treat each image instance as a distinct class of its own and train a classifier to distinguish whether two crops (views) belong to the same image.
Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. "Unsupervised feature learning via non-parametric instance discrimination." CVPR 2018.
77.
Transfer Learning from ImageNet labels (super. IN-1M)
[Figure: a source model trained on ImageNet source data and labels (large amount of data/labels); the learned knowledge is transferred to the target model, which is fine-tuned on the target data and labels (e.g. PASCAL, small amount of data/labels)]
78.
Transfer Learning without ImageNet labels (MoCo IN-1M)
[Figure: the same diagram, but the source model is pre-trained on ImageNet data only (large amount of data, no labels) before transferring to the target task (small amount of data/labels)]
Slide concept: Kevin McGuinness
79.
Transfer Learning (super. IN-1M vs MoCo IN-1M)
Transferring to VOC object detection: a Faster R-CNN detector (C4 backbone) is fine-tuned end-to-end.
82.
Correlated views by data augmentation
#SimCLR Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. "A simple framework for contrastive learning of visual representations." arXiv preprint arXiv:2002.05709 (2020). [tweet]
Original image cc-by: Von.grzanka
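The two correlated views are simply two independent random augmentations of the same image (sketch below with toy crop/flip augmentations; SimCLR additionally uses color distortion and blur):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))              # toy grayscale image

def augment(img, rng):
    """Toy augmentation: random 24x24 crop plus random horizontal flip."""
    y, x = rng.integers(0, 8, size=2)
    crop = img[y:y + 24, x:x + 24]
    return crop[:, ::-1] if rng.random() < 0.5 else crop

# Two independently augmented views of the SAME image form a positive
# pair for the contrastive loss; views of other images act as negatives.
view_a = augment(image, rng)
view_b = augment(image, rng)
```

The encoder is trained to map both views to nearby embeddings, so the representation becomes invariant to the chosen augmentations.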