Simple Framework for Contrastive Learning Visual Representations

A Simple Framework for Contrastive
Learning ofVisual Representations
Ting Chen, et al., “A Simple Framework for Contrastive Learning of Visual Representations”
8th March, 2020
PR12 Paper Review
JinWon Lee
Samsung Electronics

References
• The Illustrated SimCLR Framework
 https://amitness.com/2020/03/illustrated-simclr/
• Exploring SimCLR: A Simple Framework for Contrastive Learning of
Visual Representations
 https://towardsdatascience.com/exploring-simclr-a-simple-framework-for-
contrastive-learning-of-visual-representations-158c30601e7e
• SimCLR: Contrastive Learning ofVisual Representations
 https://medium.com/@nainaakash012/simclr-contrastive-learning-of-visual-
representations-52ecf1ac11fa

Introduction
• Learning effective visual representations without human supervision
is a long-standing problem.
• Most mainstream approaches fall into one of two classes: generative
or discriminative.
 Generative approaches – pixel level generation is computationally expensive
and may not be necessary for representation learning.
 Discriminative approaches learn representations using objective function like
supervised learning but pretext tasks have relied on somewhat ad-hoc
heuristics, which limits the generality of learn representations.

Contrastive Learning – Intuition

Contrastive Learning
• Contrastive methods aim to learn representations by enforcing
similar elements to be equal and dissimilar elements to be different.

Contrastive Learning – Data
• Example pairs of images which are similar and images which are
different are required for training a model
Images from “The Illustrated SimCLR Framework”

Supervised & Self-supervisedApproach

Contrastive Learning – Representstions

Contrastive Learning – Similarity Metric

Contrastive Learning – Noise Contrastive
Estimator Loss
• x+ is a positive example and x- is a negative example
• sim(.) is a similarity function
• Note that each positive pair (x,x+) we have a set of K negatives

SimCLR – Overview
• A stochastic data augmentation module that transforms any given
data example randomly resulting in two correlated views of the same
example.
 Random crop and resize(with random flip), color distortions, and Gaussian
blur
• ResNet50 is adopted as a encoder, and the output vector is from GAP
layer. (2048-dimension)
• Two layer MLP is used in projection head. (128-dimensional latent
space)
• No explicit negative sampling. 2(N-1) augmented examples within a
minibatch are used for negative samples. (N is a batch size)
• Cosine similarity function is a used similarity metric.
• Normalized temperature-scaled cross entropy(NT-Xent) loss is used.

SimCLR - Overview
• Training
 Batch size : 256~8192
 A batch size of 8192 gives 16382 negative examples per positive pair from
both augmentation views.
 To stabilize the training, LARS optimizer is used.
 Aggregating BN mean and variance over all devices during training.
 With 128TPU v3 cores, it takes ~1.5 hours to train ResNet-50 with a batch
size of 4096 for 100 epochs
• Dataset – ImageNet 2012 dataset
• To evaluate the learned representations, linear evaluation protocol is
used.

Step by Step Example

Data Augmentation for Contrastive
Representation Learning
• Many existing approaches define contrastive prediction tasks by
changing architecture.
• The authors use only simple data augmentation methods, this simple
design choice conveniently decouples the predictive task from other
components such as the NN architecture.

Data Augmentation for Contrastive
Representation Learning
Augmentations in the red boxes are used

Linear Evaluation under Individual or
Composition of Data Augmentation

Evaluation of Data Augmentation
• Asymmetric data transformation method
is used.
 Only one branch of the frame work is applied
the target transformation(s).
• No single transformation suffices to learn
good representations, even though the
model can almost perfectly identify the
positive pairs.
• Random cropping and random color
distortion stands out
 When using only random cropping as data
augmentation is that most patches from an
image share a similar color distortion

Contrastive Learning Needs Stronger Data
Augmentation
• Stronger color augmentation substantially improves the linear
evaluation of the learned unsupervised models.
• A sophisticated augmentation policy(such as AutoAugment) does not
work better than simple cropping + (stronger) color distortion
• Unsupervised contrastive learning benefits from stronger (color) data
augmentation than supervised learning.
• Data augmentation that does not yield accuracy benefits for supervised
learning can still help considerably with contrastive learning.

Unsupervised Contrastive Learning Benefits from
Bigger Models
• Unsupervised learning benefits more from bigger models than its
supervised counter part.

Nonlinear Projection Head
• Nonlinear projection is better than a linear projection(+3%) and
much better than no projection(>10%)

Nonlinear Projection Head
• The hidden layer before the projection head is a better
representation than the layer after.
• The importance of using the representation before the nonlinear
projection is due to loss of information induced by the contrastive
loss. In particular z = g(h) is trained to be invariant to data
transformation.

Loss Function
• l2 normalization along with temperature effectively weights
different examples, and an appropriate temperature can help the
model learn from hard negatives.
• Unlike cross-entropy, other objective functions do not weigh the
negatives by their relative hardness.

Larger Batch Size and LongerTraining
• When the training epochs is small, larger batch size have a significant
advantage.
• Larger batch sizes provide more negative examples, facilitating
convergence, and training longer also provides more negative examples,
improving the results.

Comparison with SOTA – Linear Evaluation

Comparison with SOTA – Semi-supervised
Learning

Appendix – Effects of LongerTraining for
Supervised Learning
• There is no significant benefit from training supervised models
longer on ImageNet.
• Stronger data augmentation slightly improves the accuracy of
ResNet-50 (4x) but does not help on ResNet-50.

Appendix
– CIFAR-10 Dataset Results

Conclusion
• SimCLR differs from standard supervised learning on ImageNet only
in the choice of data augmentation, the use of nonlinear head at the
end of the network, and the loss function.
• Composition of data augmentations plays a critical role in defining
effective predictive tasks.
• Nonlinear transformation between the representation and the
contrastive loss substantially improves the quality of the learned
representations.
• Contrastive learning benefits from larger batch sizes and more
training steps compared to supervised learning.
• SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative
improvement over previous state-of-the-art, matching the
performance of a supervised ResNet-50.

Simple Framework for Contrastive Learning Visual Representations

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Simple Framework for Contrastive Learning Visual Representations

Similar to Simple Framework for Contrastive Learning Visual Representations (20)

More from Jinwon Lee

More from Jinwon Lee (20)

Recently uploaded

Recently uploaded (20)

Simple Framework for Contrastive Learning Visual Representations