Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonDL 2020

Image Segmentation with Deep Learning
Xavier Giro-i-Nieto
UPC & BSC Barcelona
Carles Ventura
UOC Barcelona

Xavier Giro-i-Nieto
Associate Professor at Universitat Politècnica de Catalunya (UPC) in Barcelona, Catalonia.
IDEAI Center for Intelligent Data Science & Artificial Intelligence
@DocXavi
xavier.giro@upc.edu

Deep Learning Barcelona Symposium
https://sites.google.com/view/dlbcn2018/home
https://sites.google.com/view/dlbcn2019/home

Deep Learning @ UPC TelecomBCN
Foundations
● MSc course [2017] [2018] [2019]
● BSc course [2018] [2019] [2020]
Multimedia Applications
● Vision: [2016] [2017] [2018] [2019]
● Language & Speech: [2017] [2018] [2019]
Reinforcement Learning
● [2020 Spring] [2020 Autumn]

Online Postgraduate Course
4th (face-to-face) & 5th edition (online) start November 2020. Sign up here.
Àgata Lapedriza (UOC)
Xavier Giró (UPC-BSC)
Xavier Suau (Apple)
Marta Ruiz (UPC)
Carles Ventura (UOC)
Jordi Pons (Dolby)
Jordi Torres (BSC)
Elisenda Bou (Vilynx)
Daniel Fojo (Glovo)

Acknowledgements
Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
[DLCV 2016]
Verónica Vilaplana
veronica.vilaplana@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
[DLCV 2017]
Míriam Bellver
miriam.bellver@bsc.edu
PhD Candidate
Barcelona Supercomputing Center
[DLCV 2018] [DLCV 2018]

From image to pixel classification (segmentation)
Image Classification: “chair”, “bin”
Object Detection: “chair”, “bin”
Image Segmentation: “chair”, “bin”
Slide inspired by cs231n lecture from Stanford University.

Segmentation
Segmentation: Define the accurate boundaries of all objects in an image by predicting a class label for each pixel.
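As a rough illustration of what "a class label for each pixel" means, the sketch below turns per-pixel class scores into a class map with an argmax. The score values, the tiny 2x3 image and the class names are made up for this example and are not taken from the slides.

```python
import numpy as np

# Hypothetical per-pixel class scores for a 2x3 image and 3 classes
# (shape: classes x height x width). Values are invented for illustration.
scores = np.array([
    [[0.90, 0.2, 0.1], [0.8, 0.1, 0.3]],   # class 0, e.g. background
    [[0.05, 0.7, 0.2], [0.1, 0.6, 0.1]],   # class 1, e.g. chair
    [[0.05, 0.1, 0.7], [0.1, 0.3, 0.6]],   # class 2, e.g. bin
])

# The class map keeps, for each pixel, the index of the highest-scoring class.
class_map = scores.argmax(axis=0)
print(class_map)
# [[0 1 2]
#  [0 1 2]]
```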

Segmentation Applications
● Autonomous driving
● Medical imaging (image source: DRIVE, Digital Retinal Images for Vessel Extraction)
● Robotic applications
● Scene understanding

Outline
From Global to Local-scale Image Classification
Semantic Segmentation
● Deconvolution (or transposed convolution)
● Dilated Convolution
● Skip Connections
Instance Segmentation
● Proposal-Based
● Recurrent
● Instance Embedding
Panoptic Segmentation

From Image to Pixel Classification (Segmentation)
Figure: Jeremy Jordan (2018)

Slide: CS231n (Stanford University)
Naive approach: Train a sliding window classifier.
Extract patch → Run through a CNN → Classify center pixel (“COW”) → Repeat for every pixel
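The sliding window idea can be sketched as follows. The `classify_patch` function is a hypothetical stand-in for the CNN (it only thresholds the mean brightness), and the patch size and toy image are assumptions for this example. The point is the loop structure: one classifier call per pixel.

```python
import numpy as np

def classify_patch(patch):
    # Hypothetical stand-in for a CNN: label 1 if the patch is bright on average.
    return int(patch.mean() > 0.5)

def sliding_window_segmentation(image, patch_size=3):
    half = patch_size // 2
    # Pad so that every pixel can be the center of a full patch.
    padded = np.pad(image, half, mode="edge")
    height, width = image.shape
    labels = np.zeros((height, width), dtype=int)
    calls = 0
    for row in range(height):
        for col in range(width):
            patch = padded[row:row + patch_size, col:col + patch_size]
            labels[row, col] = classify_patch(patch)  # label of the center pixel
            calls += 1
    return labels, calls

image = np.zeros((4, 6))
image[:, 3:] = 1.0  # right half of the toy image is bright
labels, calls = sliding_window_segmentation(image)
print(calls)   # 24 classifier calls for a 4x6 image
print(labels)
```

Even on this toy image the classifier runs 24 times, and neighbouring patches overlap almost entirely, which is the redundancy a fully convolutional formulation avoids.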


From Global to Local-scale Image Classification
Convolutionize: Run “fully convolutional” network to get all pixels at once.

Slide concept: CS231n (Stanford University)

Convolutionize: Formulate each neuron in a fully connected (FC) layer as a convolutional filter (kernel) of a convolutional layer:
● Input: 3x2x2 tensor (RGB image of 2x2)
● 2 fully connected neurons: 3x2x2 * 2 weights
● 2 convolutional filters of 3 x 2 x 2 (same size as input tensor): 3x2x2 * 2 weights
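The equivalence on this slide can be checked numerically. The sketch below uses random weights (an assumption, any values would do) and shows that 2 fully connected neurons on a flattened 3x2x2 input give the same outputs as 2 convolutional filters of size 3x2x2 applied at the single valid position.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 2, 2))         # input tensor: 3 channels, 2x2 pixels
fc_weights = rng.standard_normal((2, 12))  # 2 neurons, 3*2*2 = 12 weights each

# Fully connected view: flatten the input, one dot product per neuron.
fc_out = fc_weights @ x.reshape(-1)

# Convolutional view: reshape each neuron's weights into a 3x2x2 filter.
# The filter covers the whole input, so there is a single output position.
conv_filters = fc_weights.reshape(2, 3, 2, 2)
conv_out = np.array([(f * x).sum() for f in conv_filters])

print(np.allclose(fc_out, conv_out))  # True
print(fc_weights.size)                # 24 weights in both formulations
```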

A model trained for image classification on low-resolution images can provide local responses when fed with high-resolution images.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR
2015. (original figure has been modified)
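A minimal sketch of this effect, assuming a single "convolutionized" layer with 2 output classes and random weights: on a training-size 2x2 input it yields one prediction per class, while on a larger 5x5 input the same filters yield a 4x4 map of local responses. The helper `conv_valid` is written for this example and is not from the slides.

```python
import numpy as np

def conv_valid(x, filters):
    """Slide each filter over x (channels x H x W), stride 1, no padding."""
    n_filters, _, kh, kw = filters.shape
    _, height, width = x.shape
    out = np.zeros((n_filters, height - kh + 1, width - kw + 1))
    for k in range(n_filters):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = (filters[k] * x[:, i:i + kh, j:j + kw]).sum()
    return out

rng = np.random.default_rng(0)
filters = rng.standard_normal((2, 3, 2, 2))  # convolutionized FC layer, 2 outputs

small = rng.standard_normal((3, 2, 2))  # training-size input
large = rng.standard_normal((3, 5, 5))  # larger test-time input

small_out = conv_valid(small, filters)
large_out = conv_valid(large, filters)
print(small_out.shape)  # (2, 1, 1): one global prediction per class
print(large_out.shape)  # (2, 4, 4): a map of local predictions per class
```

Each entry of the larger output equals what the original FC layer would produce on the corresponding 2x2 crop, which is why the output can be read as a heatmap.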


The FC to Conv redefinition allows generating heatmaps of the class prediction over the input images.
Campos, V., Jou, B., & Giro-i-Nieto, X. (2017). From Pixels to Sentiment: Fine-tuning CNNs for Visual Sentiment Prediction. Image and Vision Computing.

Limitation: Pooling layers in the CNN will decrease the spatial definition of the output.
Figure: Alicja Kwasniewska (ISSonDL 2020)


Outline
● Dilated Convolutions
● Proposal-Based
● Recurrent

Semantic Segmentation
● Label every pixel!
● Don’t differentiate instances (cows)
● Classic computer vision problem

Instance Segmentation
● Detect instances, give category, label pixels
● “simultaneous detection and segmentation” (SDS)
● Labels are class-aware and instance-aware

Outline
Instance Segmentation Methods
● Proposal-Based
● Recurrent

Slide Credit: https://www.jeremyjordan.me/semantic-segmentation/

Limitation of convolutionizing CNNs for image classification:
Pooling layers in the CNN will decrease the spatial definition of the output.
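To get a feel for how much resolution is lost, the sketch below halves the spatial size once per pooling layer. The 224-pixel input and the five stride-2 pooling layers are assumptions (typical of a VGG-like classifier), not values stated on the slide.

```python
def output_size_after_pooling(input_size, num_pool_layers, pool_stride=2):
    """Spatial size left after a chain of stride-`pool_stride` pooling layers."""
    size = input_size
    for _ in range(num_pool_layers):
        size = size // pool_stride
    return size

# Assumed example: 224x224 input, five 2x2 max-pooling layers with stride 2.
coarse = output_size_after_pooling(224, 5)
print(coarse)         # 7
print(224 // coarse)  # each output cell summarizes a 32x32 block of input pixels
```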

Learnable upsampling

Learnable Upsample: Transposed Convolution
Slide: Alicja Kwasniewska (ISSonDL 2020)

Reminder: Convolutional Layer
Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
Slide credit: CS231n (Stanford University)
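The sizes in this reminder follow from the standard output-size formula for a convolution, out = floor((in + 2*pad - kernel) / stride) + 1. A small sketch (the function name is chosen for this example):

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# 3 x 3 convolution, stride 1, pad 1 on a 4 x 4 input keeps the size at 4 x 4.
print(conv_output_size(4, kernel_size=3, stride=1, padding=1))  # 4

# The same filter with stride 2 would halve the resolution instead.
print(conv_output_size(4, kernel_size=3, stride=2, padding=1))  # 2
```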

Dot product between filter and input

Learnable upsampling with Transposed Convolutions
3 x 3 “deconvolution”, stride 2 pad 1

Input gives weight for filter values.
Sum where output overlaps.
Slide Credit: CS231n
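The two rules above ("input gives weight for filter values" and "sum where output overlaps") can be written as a minimal single-channel sketch. The 2x2 input and the all-ones 3x3 filter are assumptions picked to make the overlaps easy to follow; a real layer learns the filter values.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2, padding=1):
    """Single-channel transposed convolution, without output padding."""
    in_h, in_w = x.shape
    k = kernel.shape[0]
    # Canvas large enough for every scaled filter copy, before cropping.
    full_h = (in_h - 1) * stride + k
    full_w = (in_w - 1) * stride + k
    canvas = np.zeros((full_h, full_w))
    for i in range(in_h):
        for j in range(in_w):
            # Each input value scales a copy of the filter; the copies are
            # placed `stride` pixels apart and summed where they overlap.
            canvas[i * stride:i * stride + k,
                   j * stride:j * stride + k] += x[i, j] * kernel
    # Padding in a transposed convolution crops the border of the result.
    return canvas[padding:full_h - padding, padding:full_w - padding]

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.ones((3, 3))
y = transposed_conv2d(x, kernel, stride=2, padding=1)
print(y)
# [[ 1.  3.  2.]
#  [ 4. 10.  6.]
#  [ 3.  7.  4.]]
```

The center value 10 is the sum of all four inputs because all four filter copies overlap there. With these settings a 2x2 input gives a 3x3 output; deep learning frameworks usually offer an extra output-padding option to reach exactly twice the input size.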

Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. ICCV 2015.
“Regular” VGG “Upside down” VGG

Limitation of upsampling from deep CNN layers: Deeper layers are specialized for higher-level semantic tasks, not for capturing the fine-grained details required for segmentation.
Highest activations along CNN depth
Learnable Upsample

Skip Connections
Solution: Combine predictions from features at different depths (“skip connections”).
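A minimal sketch of combining predictions from two depths, assuming the combination is an element-wise sum of class scores after upsampling the coarse map (as in FCN; U-Net instead concatenates feature maps). The score values and the nearest-neighbor upsampling are assumptions for illustration.

```python
import numpy as np

def upsample_nearest(scores, factor):
    """Nearest-neighbor upsampling of a (classes x H x W) score map."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)

# Hypothetical class scores (2 classes) predicted at two depths.
deep_scores = np.array([[[1.0, 0.0]],
                        [[0.0, 1.0]]])   # shape (2, 1, 2): coarse but semantic
shallow_scores = np.zeros((2, 2, 4))     # shape (2, 2, 4): fine resolution
shallow_scores[1, 0, 1] = 2.0            # local evidence for class 1 at one pixel

# Skip connection: bring the deep prediction to the shallow resolution and add.
fused = upsample_nearest(deep_scores, 2) + shallow_scores
class_map = fused.argmax(axis=0)
print(class_map)
# [[0 1 1 1]
#  [0 0 1 1]]
```

Without the shallow scores the boundary between the two classes would be a straight vertical line; the fine-resolution evidence moves it by one pixel in the first row.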

Skip connections to intermediate layers
#U-Net Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." MICCAI 2015

Receptive Field
Receptive field: Part of the input data that is visible to a neuron.
It increases as we stack more convolutional layers (i.e. neurons in deeper layers
have larger receptive fields).
André Araujo, Wade Norris, Jack Sim, “Computing Receptive Fields of Convolutional Neural Networks”. Distill.pub
2019.
Problem: Receptive field may be limited, and pixel-wise predictions at
the deepest layer may not be aware of the whole image.
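The growth of the receptive field can be computed layer by layer. The sketch below assumes a plain chain of layers described only by kernel size and stride; the example architectures are invented for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered input to output."""
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between neighbouring units of a layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1)]))                  # 3
print(receptive_field([(3, 1)] * 3))              # 7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

Three stacked 3x3 convolutions see only 7x7 input pixels, which illustrates why a prediction at a deep layer may still not be aware of the whole image.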

Receptive Field: Dilated (atrous) convolutions
Slide: Alicja Kwasniewska (ISSonDL 2020)

Dilated Convolutions
● By adding more layers with exponentially increasing dilation rates:
○ The receptive field grows exponentially.
○ The number of learnable parameters (filter weights) grows linearly.
Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. ICLR 2016.
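The two growth rates can be compared with a short calculation. The sketch assumes single-channel 3x3 filters with stride 1 and dilation rates 1, 2, 4, ... (doubling per layer); the parameter count ignores biases and channels.

```python
def stacked_receptive_field(dilations, kernel_size=3):
    """Receptive field of stacked stride-1 convolutions with given dilations."""
    rf = 1
    for dilation in dilations:
        rf += (kernel_size - 1) * dilation
    return rf

for n_layers in range(1, 5):
    plain = stacked_receptive_field([1] * n_layers)
    dilated = stacked_receptive_field([2 ** i for i in range(n_layers)])
    params = n_layers * 3 * 3  # 9 weights per layer in both cases
    print(n_layers, plain, dilated, params)
# 1 3 3 9
# 2 5 7 18
# 3 7 15 27
# 4 9 31 36
```

With the same number of weights, the dilated stack reaches a 31x31 receptive field after four layers, against 9x9 for the undilated one.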

Dilated Convolutions
Source: https://github.com/vdumoulin/conv_arithmetic

Dilated Convolutions + Spatial Pyramid Pooling (SPP)
#SPP He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual
recognition. TPAMI 2015.
#PSPNet Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. CVPR 2017.

State-of-the-art models
● DeepLab v3+: Atrous Convolutions + Spatial Pyramid Pooling + Encoder-Decoder
#DeepLabv3+ Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. ECCV 2018

Outline
● Proposal-Based
● Recurrent

Proposal-based
Typical object detection/segmentation pipelines:
Object proposal → Refinement and Classification
Detections: Dog 0.85, Cat 0.80, Dog 0.75, Cat 0.90

NMS: Non-Maximum Suppression
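A minimal sketch of greedy, per-class NMS applied to the four detections from the pipeline figure. The scores (0.85, 0.80, 0.75, 0.90) come from the slide; the box coordinates and the 0.5 IoU threshold are assumptions made up for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.5):
    """detections: list of (label, score, box). Greedy NMS within each class."""
    kept = []
    # Visit detections from highest to lowest score.
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        label, _, box = det
        # Keep a detection unless a kept one of the same class overlaps too much.
        if all(k[0] != label or iou(k[2], box) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

detections = [
    ("Dog", 0.85, (10, 10, 60, 60)),     # hypothetical boxes
    ("Dog", 0.75, (12, 12, 62, 62)),
    ("Cat", 0.80, (100, 20, 150, 70)),
    ("Cat", 0.90, (102, 22, 152, 72)),
]
kept = nms(detections)
print([(label, score) for label, score, _ in kept])
# [('Cat', 0.9), ('Dog', 0.85)]
```

Only the highest-scoring box per object survives. Two distinct objects with nearly identical boxes would be merged in the same way, which is one of the limitations of proposal-based models.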

Binary Map (one per remaining detection)

Proposal-based
Similar to R-CNN, but with segment proposals
● External Segment proposals
● Mask out background with mean image
Hariharan et al. Simultaneous Detection and Segmentation. ECCV 2014
Slide Credit: CS231n

Proposal based: Detection - Faster R-CNN
Learn proposals end-to-end sharing parameters with the classification network
[Figure: Conv layers → Conv5_3 → Region Proposal Network → RPN Proposals → RoI Pooling → FC6 → FC7 → FC8 → Class probabilities]
Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015

Proposal-based Instance Segmentation: Mask R-CNN
Extends Faster R-CNN to pixel-level segmentation by predicting masks in parallel with class labels.
He et al. Mask R-CNN. ICCV 2017

Mask R-CNN
Object Detection Object Detection and Segmentation

Mask R-CNN: RoI Align
RoI Pool from Fast R-CNN:
● Hi-res input image: 3 x 800 x 600 with region proposal
● Convolution and Pooling → Hi-res conv features: C x H x W with region proposal
● Max-pool within each grid cell → RoI conv features: C x h x w for region proposal
● Fully-connected layers expect low-res conv features: C x h x w
Problem: x/16 & rounding → misalignment! + not differentiable
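The misalignment can be illustrated with a small calculation. The sketch below contrasts quantizing a proposal coordinate (RoI Pool style) with reading the feature map at the exact fractional location through bilinear interpolation, which is the idea behind RoI Align. The coordinate 200 and the toy feature map are assumptions for this example.

```python
import math

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list `feature` at continuous location (y, x)."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * top + wy * bottom

stride = 16    # total downsampling of the backbone (the "x/16" on the slide)
x_image = 200  # hypothetical proposal coordinate in image pixels

x_exact = x_image / stride           # 12.5 on the feature map
x_quantized = math.floor(x_exact)    # RoI Pool style quantization -> 12
misalignment_px = (x_exact - x_quantized) * stride
print(misalignment_px)               # 8.0 image pixels of misalignment

# Toy feature map whose value equals its column index.
feature = [[float(col) for col in range(16)] for _ in range(2)]
quantized_value = feature[0][x_quantized]             # 12.0
aligned_value = bilinear_sample(feature, 0.0, x_exact)  # 12.5
print(quantized_value, aligned_value)
```

The interpolated value varies smoothly with the proposal coordinate, whereas the quantized lookup jumps from one cell to the next.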

Limitations of Proposal-based models
1. Two objects might share the same bounding box: only one will be kept after the NMS step.
2. The choice of NMS threshold is application dependent.
3. The same pixel can be assigned to multiple instances.
4. The number of predictions is limited by the number of proposals.

Single-shot Instance Segmentation
● Improving RetinaNet (single-shot object detector) in three ways:
○ Integrating instance mask prediction
○ Making the loss function adaptive and more stable
○ Including hard examples in training
#RetinaMask Fu et al. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. ArXiv 2019

CNN → Cat
A. Krizhevsky, I. Sutskever, G. E. Hinton. “ImageNet classification with deep convolutional neural networks”. NIPS 2012

[Figure: CNN and RNN blocks; output labels: Cat, Grass, Stone]

Recurrent Instance Segmentation
Sequential mask generation
Romera-Paredes & Torr. Recurrent Instance Segmentation. ECCV 2016

Salvador, A., Bellver, M., Campos, V., Baradad, M., Marqués, F., Torres, J., & Giro-i-Nieto, X. (2018). From Pixels to Object Sequences: Recurrent Semantic Instance Segmentation.

#RVOS Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto.
“RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
Recurrence in time (frame sequence) and in space (object sequence)

Outline
Segmentation Datasets
● Proposal-Based
● Recurrent
● DETR

Semantic + Instance = Panoptic Segmentation
#PS Kirillov, A., He, K., Girshick, R., Rother, C., & Dollár, P. (2019). Panoptic segmentation. CVPR 2019.
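A minimal sketch of how a semantic map and an instance map can be combined into a single panoptic map. The toy maps, the class ids and the `class_id * 1000 + instance_id` encoding are assumptions chosen for illustration, not the format prescribed by the paper.

```python
import numpy as np

# Hypothetical semantic map: 0 = sky (stuff), 1 = road (stuff), 2 = car (thing).
semantic = np.array([
    [0, 0, 0, 0],
    [2, 2, 1, 2],
    [2, 2, 1, 2],
])
# Hypothetical instance map: 0 = no instance, 1..N = instance ids.
instance = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 2],
    [1, 1, 0, 2],
])

# Every pixel gets one class and, for "things", one instance id.
panoptic = semantic * 1000 + instance
segments = np.unique(panoptic)
print(segments)  # [   0 1000 2001 2002]
```

The result has four segments: sky, road, and two separate cars. Stuff classes keep instance id 0, while each car gets its own id.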

Panoptic Segmentation: methods
● UPSNet: A Unified Panoptic Segmentation Network
Mask R-CNN design
#UPSNET Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., & Urtasun, R. (2019). Upsnet: A unified panoptic segmentation
network. CVPR 2019.


Summary
Semantic Segmentation Methods
Instance Segmentation Methods
● Proposal-Based
● Recurrent

Latest advances
● Bolya et al. YOLACT: Real-time Instance Segmentation. ICCV 2019
● #Axial-DeepLab Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., & Chen, L. C. (2020). Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020.
● #SOLO Wang, X., Kong, T., Shen, C., Jiang, Y., & Li, L. (2019). SOLO: Segmenting objects by locations. ECCV 2020.
● Fast Semantic Segmentation with MobileNet in PyTorch.

Pascal Visual Object Classes
● 20 categories
● +10,000 images
● Semantic segmentation GT
● Instance segmentation GT
Pascal Context
● 540 categories
● +10,000 images
● Dense annotations
● Objects + stuff

COCO Common Objects in Context
● Real indoor & outdoor scenes
● 80 categories
● +300,000 images
● 2M instances
● Partial annotations
● Objects, but no stuff
ADE20K
● Real general scenes
● +150 categories
● +22,000 images
● Instance + parts segmentation GT
● Objects and stuff

Open Images V6
● 350 categories
● +950,000 images
● 2,700,000 instance segmentations
● Objects

LVIS
● 1,000 categories
● 164,000 images
● 2,200,000 instance segmentations
● 11.2 object instances from 3.4 categories on average per image (more complex images than Open Images and MS COCO)
● Objects

CityScapes
● Real driving scenes
● 30 categories
● +25,000 images
● 20,000 partial annotations
● 5,000 dense annotations
● Depth, GPS and other metadata
Mapillary Vistas Dataset
● Real driving scenes covering 6 continents with a variety of weather/season/time of day/camera/viewpoint
● 152 categories
● 25,000 images
● Instance + parts segmentation GT

Hands on
Carles Ventura
cventuraroy@uoc.edu
Lecturer
Universitat Oberta de Catalunya
