Recurrent Instance Segmentation (UPC Reading Group)

Slides by Manel Baradad
Computer Vision Reading Group, UPC
9th September, 2016
Bernardino Romera-Paredes, Philip H. S. Torr
[arxiv] (25 Nov 2015) - ECCV 2016

Contents
● Introduction
● Structure
○ FCN
○ ConvLSTM
○ Spatial Inhibition module
○ Post processing
● Loss function
● Experiments
○ Multiple Person Segmentation
○ Plants Leaf Segmentation
● Conclusions

Introduction
● Detecting and delineating each distinct object of a
specific class appearing in an image
● Contributions:
○ End-to-end approach for semantic instance
segmentation
○ Derivation of a loss function for this problem
● Two particular classes tested:
○ Multiple Person Segmentation
○ Plants Leaf Segmentation and Counting
● It is not an attention-based model, although its goal is similar: to attend
to specific regions of the image

Structure
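[Figure: the recurrent architecture unrolled over steps; hidden states h1, h2, ..., ht produce masks Ŷ1, Ŷ2 and scores s1, s2]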

Structure
Stopping condition:
[Figure: unrolled architecture extended to step n + 1, with outputs Ŷn+1 and sn+1 and hidden state hn+1, illustrating the stopping condition]

Fully convolutional network
● Objective: obtain features that serve as the input of the ConvLSTM
● The article builds upon existing FCNs that perform well on the semantic segmentation task
● A specific FCN is used for each of the two experiments (explained later)
Example: FCN-8s for Multiple Person Segmentation

ConvLSTM
LSTM: Recurrent structure
● Ability to produce sequential output
● Provides memory
○ Implicitly models occlusion: non-occluded instances are segmented first, and the
state keeps track of the regions of the image that have already been segmented
○ Can consider potential relationships between different instances in the image
(e.g. instances that are always, or never, found together)

ConvLSTM
ConvLSTM: a “standard” LSTM in which the fully connected layers are replaced by
convolutional layers
[Figure: LSTM cell with states ct-1, ht-1, ct, ht and gates ft, it, ot. Extracted from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
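For reference, the gate equations of a ConvLSTM can be sketched as follows, using the common LSTM formulation without peephole connections. This is an assumed formulation, not a transcription of the slide: the weight and bias names W and b are notation introduced here, x_t denotes the cell input (assumed to be the FCN features), * is convolution and ⊙ is the element-wise product.

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Replacing every * with a matrix product recovers the fully connected LSTM.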

ConvLSTM
Why convolutional instead of fully connected layers in the LSTM?
● Same advantages that conv layers have over FC layers
○ Suitable for learning filters
○ Useful for inputs whose statistics are spatially invariant, such as images
○ Require fewer parameters, and therefore less memory
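To make the memory argument concrete, the sketch below counts the weights of a single state-to-state transition in both designs. The feature map size (32 x 32), the number of channels (30) and the kernel size (3 x 3) are illustrative assumptions, not values taken from the slides.

```python
def fc_gate_params(height, width, channels):
    """Weights of a fully connected map from the flattened state to itself."""
    units = height * width * channels
    return units * units + units  # weight matrix + biases


def conv_gate_params(channels, kernel_size):
    """Weights of a convolution mapping `channels` maps to `channels` maps."""
    return channels * channels * kernel_size * kernel_size + channels


fc = fc_gate_params(32, 32, 30)
conv = conv_gate_params(30, 3)
print(f"fully connected: {fc:,} parameters")
print(f"convolutional:   {conv:,} parameters")
```

The convolutional count does not depend on the spatial size of the state, which is why it stays small.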

ConvLSTM
k(ht): sum of the absolute values across channels of ht
Ŷt: predicted mask
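As a small illustration of the visualization quantity k(ht), the sketch below collapses a multi-channel hidden state into a single map. The nested-list layout [channel][row][column] and the toy values are assumptions made for the example.

```python
def channel_abs_sum(hidden):
    """k(h_t): per-pixel sum of absolute values across the channels of h_t.

    `hidden` is a list of channels, each a list of rows (assumed layout).
    """
    rows, cols = len(hidden[0]), len(hidden[0][0])
    return [
        [sum(abs(channel[r][c]) for channel in hidden) for c in range(cols)]
        for r in range(rows)
    ]


# Toy hidden state with 2 channels of size 2x2 (made-up values).
h_t = [
    [[0.5, -1.0], [0.0, 2.0]],
    [[-0.5, 1.0], [3.0, -2.0]],
]
k = channel_abs_sum(h_t)
print(k)
```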

Spatial inhibition module
[Figure: the module produces two outputs, region proposals and scores]

Region proposals
Discriminate only one instance: convolution + log-softmax (values in (-∞, 0])
Adapt the output to a binary mask (values between 0 and 1)
At inference time, a pixel is assigned to an instance if its predicted value is
higher than 0.5 (though values are usually saturated, very close to 0 or 1)
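The steps above can be sketched in a few lines. The example takes the convolution output as given, applies a log-softmax over all pixel positions, and then thresholds at 0.5. How the log-softmax output is adapted to a mask is an assumption here: the sketch adds a bias (hypothetical value 4.0) and applies a sigmoid. All input values are made up.

```python
import math


def spatial_log_softmax(activations):
    """Log-softmax over all pixel positions of a 2-D map."""
    flat = [a for row in activations for a in row]
    peak = max(flat)
    log_norm = peak + math.log(sum(math.exp(a - peak) for a in flat))
    return [[a - log_norm for a in row] for row in activations]


def to_mask(log_probs, bias, threshold=0.5):
    """Shift by a bias (assumed learned), squash with a sigmoid, then threshold."""
    soft = [[1.0 / (1.0 + math.exp(-(v + bias))) for v in row] for row in log_probs]
    hard = [[1 if p > threshold else 0 for p in row] for row in soft]
    return soft, hard


# Made-up convolution output: one compact region stands out strongly.
conv_out = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 9.0],
    [0.0, 0.0, 0.0],
]
log_probs = spatial_log_softmax(conv_out)
soft, hard = to_mask(log_probs, bias=4.0)
for row in hard:
    print(row)
```

Because the softmax normalizes over the whole image, pixels compete with each other, which is what pushes the module towards a single region.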

Scores
Really simple: the “intelligence” of the scoring must be learned by the preceding
stages and encoded in the hidden state
Look at the values of the hidden state and apply a linear function

Post Processing
Results are further improved using a Conditional Random Field:
● Refines the regions, as the ConvLSTM operates on a low resolution
representation of the image
● Applied outside of the trainable modules
[Figure: results of RIS (Recurrent Instance Segmentation) alone and of RIS + CRF post processing]

Loss Function
● End-to-end training

Loss Function
1-Compute the intersection over union (IoU) for all
pairs of predicted/GT masks (Ŷt, Yt)
[Figure: matrix of pairwise IoU values between predicted and GT masks, with example entries 0.9, 0, 0, 0.1, 0.8, 0.1, ...]
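A minimal sketch of this first step is given below. It assumes a relaxed IoU that also accepts soft predictions in [0, 1], computed as the inner product of the two masks divided by the sum of their totals minus that inner product. The function names and the 2x2 masks are made up for the example.

```python
def soft_iou(pred, target):
    """Relaxed IoU between a soft mask in [0, 1] and a binary mask."""
    p = [v for row in pred for v in row]
    y = [v for row in target for v in row]
    inter = sum(a * b for a, b in zip(p, y))
    union = sum(p) + sum(y) - inter
    return inter / union if union > 0 else 0.0


def iou_matrix(preds, targets):
    """Entry [i][j] is the IoU of predicted mask i with ground-truth mask j."""
    return [[soft_iou(p, y) for y in targets] for p in preds]


# Made-up 2x2 masks.
preds = [[[1, 1], [0, 0]], [[0, 0], [0, 1]]]
targets = [[[0, 0], [1, 1]], [[1, 1], [0, 0]]]
matrix = iou_matrix(preds, targets)
print(matrix)
```

For binary inputs this reduces to the usual intersection over union.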

Loss Function
2-Find best matching:
Simply find the maximum weight bipartite matching
(the weights being the pairwise IoU values)
Easily solved using the Hungarian algorithm
(polynomial time)
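To illustrate what the matching step computes, the sketch below finds the maximum weight matching by exhaustive search over assignments. This is deliberately not the Hungarian algorithm: brute force is only practical for a handful of instances, and in practice a solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be used instead. The IoU matrix is made up.

```python
from itertools import permutations


def best_matching(weights):
    """Maximum weight matching by exhaustive search.

    `weights[i][j]` is the IoU of prediction i with ground truth j.
    Assumes at least as many predictions as ground-truth masks.
    Returns (total weight, list of (prediction, ground truth) pairs).
    """
    n_pred, n_gt = len(weights), len(weights[0])
    best_total, best_pairs = -1.0, []
    for chosen in permutations(range(n_pred), n_gt):
        # chosen[j] is the prediction assigned to ground truth j.
        total = sum(weights[i][j] for j, i in enumerate(chosen))
        if total > best_total:
            best_total = total
            best_pairs = [(i, j) for j, i in enumerate(chosen)]
    return best_total, best_pairs


# Made-up IoU matrix: 3 predictions (rows) x 2 ground-truth masks (columns).
iou = [
    [0.9, 0.0],
    [0.1, 0.8],
    [0.0, 0.1],
]
total, pairs = best_matching(iou)
print(total, pairs)
```

Here the third prediction is left unmatched, which is the case the score term of the loss has to handle.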

Loss Function
Loss: minus the sum of the IoU values over
the best matching

Loss Function
3-Also take into account the scores
[Figure: predicted scores s1, s2, s3, s4, s5]
Where the score term is the binary cross entropy, and [·] is the Iverson bracket,
which is 1 if the condition is true and 0 otherwise
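The equations on this slide did not survive extraction. A hedged reconstruction, consistent with the terms named above, is given below. The notation is an assumption: n ground-truth masks Y_t, n̂ predicted masks Ŷ with scores s, a binary assignment matrix δ ranging over the set S of valid matchings, and a weighting factor λ.

```latex
\ell(\hat{Y}, s, Y) = \min_{\delta \in \mathcal{S}}
  \; - \sum_{\hat{t}=1}^{\hat{n}} \sum_{t=1}^{n}
       f_{\mathrm{IoU}}(\hat{Y}_{\hat{t}}, Y_t)\, \delta_{\hat{t},t}
  \; + \; \lambda \sum_{t=1}^{\hat{n}} f_{\mathrm{BCE}}\big([t \le n],\, s_t\big)

f_{\mathrm{BCE}}(a, b) = -\big(a \log b + (1 - a) \log(1 - b)\big)
```

Under this reading, the Iverson bracket [t ≤ n] sets the target of the first n scores to 1 and of any further scores to 0.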

Loss Function
3-Also take into account the scores
Simply:
● For matched instances, the term is -log(st)
● For unmatched instances, the term is -log(1 - st), e.g. -log(1 - s5)
Both are > 0, and we want them small

Loss Function
4-Add everything together
The IoU term is > 0, and we want it big; the score term is > 0, and we want it small
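Putting the two terms together can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the matching is found by brute force instead of the Hungarian algorithm, the score targets are 1 for the first n scores and 0 afterwards (n being the number of ground-truth masks), and the function name `ris_style_loss`, the weight `lam` and all numbers are made up.

```python
import math
from itertools import permutations


def bce(target, score, eps=1e-12):
    """Binary cross entropy between a 0/1 target and a score in (0, 1)."""
    score = min(max(score, eps), 1.0 - eps)
    return -(target * math.log(score) + (1 - target) * math.log(1.0 - score))


def ris_style_loss(iou, scores, lam=1.0):
    """-(IoU of the best matching) + lam * (cross entropy of the scores).

    `iou[t_hat][t]` holds pairwise IoUs; rows are predictions, columns are
    ground-truth masks. Assumes len(iou) >= number of ground-truth masks.
    """
    n_pred, n_gt = len(iou), len(iou[0])
    best_iou = max(
        sum(iou[i][j] for j, i in enumerate(chosen))
        for chosen in permutations(range(n_pred), n_gt)
    )
    # Iverson bracket [t <= n]: the first n_gt scores should be 1, the rest 0.
    score_term = sum(bce(1 if t < n_gt else 0, s) for t, s in enumerate(scores))
    return -best_iou + lam * score_term


iou = [[0.9, 0.0], [0.1, 0.8], [0.0, 0.1]]  # made-up values
good = ris_style_loss(iou, scores=[0.95, 0.9, 0.05])  # low score for the extra mask
bad = ris_style_loss(iou, scores=[0.95, 0.9, 0.95])   # confident extra mask
print(good, bad)
```

A confident score on a surplus prediction raises the loss, so the network is pushed to lower its score once all instances have been found.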

Loss Function
4-Add everything together
● For each iteration:
○ Forward propagate
○ Find the optimal matching
○ With the matching fixed, backpropagate the gradients of the loss function
● The minimization over matchings is ignored when backpropagating: the matching
found in the forward pass is treated as a constant

Experiments: Multiple Person Segmentation
Integrates the model into the FCN-8s network developed in Long, J., Shelhamer, E.,
Darrell, T.: Fully convolutional networks for semantic segmentation
The ConvLSTM is introduced before the upsampling layer

Multiple Person Segmentation
Trained using the MSCOCO dataset and the training images of the Pascal VOC
2012 dataset
1. Fix the weights of the FCN-8s except for the last layer, and learn the
parameters of that last layer, together with the ConvLSTM and the spatial
inhibition module
2. Fine-tune the whole network

Multiple Person Segmentation
The FCN features show a “mix” between semantic and instance segmentation

Experiments: Plants Leaf Segmentation
The fully convolutional network is learned from scratch: 5 convolutional layers
+ ReLU.
Computer Vision Problems in Plant Phenotyping (CVPPP) dataset: 161 images
Low SBD because of the low resolution (though the difference in count is good)
SBD (Symmetric Best Dice) measures the accuracy of the segmentation of the instances

Plants Leaf Segmentation
There are systems that currently perform better (better resolution…):
Mengye Ren, Richard S. Zemel: End-to-End Instance Segmentation and Counting with Recurrent
Attention (30th May 2016). The article studied in this presentation was published on 25th Nov 2015.
[Figure: qualitative comparison of RIS+CRF and Ren & Zemel]

Conclusions
● The model integrates in a single pipeline all the required functions to segment
instances, and their parameters are jointly learned end-to-end
● The model uses a recurrent structure that is able to track visited areas in the image
as well as to handle occlusion among instances, similarly to humans
● The defined loss function accurately represents the instance segmentation
objective
● The experiments show promising performance

Appendix
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation
