This document discusses adapting an AI object identification (AOI) system to changes in domains. It proposes using attention and domain adaptation techniques. Specifically:
1) AOI is like fine-grained recognition, which can benefit from attention models that focus on discriminative regions.
2) Domain shift between different sensors/viewpoints can degrade performance, but techniques such as attention models and domain adversarial learning can help address it.
3) The paper proposes a method for unsupervised cross-city adaptation of road scene segmenters using global and class-wise domain alignment with an attention-based static object prior. This achieves state-of-the-art performance adapting models between cities.
3. Vision Science Lab (VSLab)
Research topics in computer vision & machine learning: analyzing street views, understanding personal videos, 3D & robot vision, human sensing, wearable camera applications, Make3D.
4. Challenges
p AOI is similar to fine-grained recognition.
"What kind of bird?" Attention should help.
(image source: http://yassersouri.github.io/pages/fast-bird-part.html)
p How to adapt to changes (e.g., due to different sensors/viewpoints)?
Domain shift. Domain adaptation should help.
(image source: http://vision.cs.uml.edu/adaptation.html)
11. Motivation
p Goal: use domain adaptation to mitigate the effect of domain shift.
p Approaches:
n Supervised fine-tuning: CAN access labels on the target domain.
• Straightforward, but time-consuming and expensive: pixel labeling of one Cityscapes image takes 90 minutes on average [4].
n Unsupervised adaptation: CANNOT access labels on the target domain.
• More challenging, but low cost. Practical in real life!
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR, IEEE, 2016.
12. Data Collection
p Use the Google Street View API to download images of different cities.
n Randomly sample locations in each city to ensure sufficient variation in visual appearance.
p Use the Time-Machine feature to collect image pairs at the same location but at different times.
[Figure: unlabeled image pairs from Tokyo, Rome, Rio, and Taipei, captured at the same locations (A, B) at different times (T1, T2).]
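As an illustrative sketch of the collection step (not the authors' pipeline), a location can be randomly sampled inside a city's bounding box and turned into a Street View Static API request. `API_KEY` is a placeholder, and the Taipei bounding box coordinates are only illustrative:

```python
import random
from urllib.parse import urlencode

BASE = "https://maps.googleapis.com/maps/api/streetview"

def streetview_url(lat, lng, api_key, size="640x640"):
    """Build a Street View Static API request URL for one location."""
    params = {"size": size, "location": f"{lat},{lng}", "key": api_key}
    return f"{BASE}?{urlencode(params)}"

# Randomly sample a location inside a rough bounding box around Taipei
# (coordinates are illustrative, not from the paper).
lat = random.uniform(24.96, 25.21)
lng = random.uniform(121.45, 121.67)
url = streetview_url(lat, lng, "API_KEY")
```

Repeating this sampling per city yields images with sufficient appearance variation; the Time-Machine pairs additionally require querying the same location at two capture dates.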
13. Our Dataset
p We propose a new dataset of complex road scenes, with:
n Diverse appearance: includes 4 different cities across continents.
n Temporal information: each city includes 1,600 image pairs, which provide helpful supervision without any human interaction.
n Dense pixel annotations: each city includes 100 high-quality annotated images.
Please visit : https://yihsinchen.github.io/segmentation_adaptation/
17. Global Domain Alignment
p Our objective is to minimize the following adversarial loss by iteratively updating the domain classifier D and the feature extractor F:

L_global(I_S, I_T) = -(1/N) Σ_{n=1}^{N} [ log p_n(I_S) + log(1 - p_n(I_T)) ]

• I_S and I_T: the images from the source and target domains, respectively.
• N: the number of grids in each feature map.
• F(I_S) and F(I_T): the feature maps of the source- and target-domain images.
• p_n(x) = σ(D(F(x))_n): the probability that grid n of image x belongs to the source domain, where σ is the sigmoid function.
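As a minimal NumPy sketch of this objective (function names and the gradient-reversal framing are illustrative, not the authors' code): the domain classifier minimizes a binary cross-entropy over all grids, while the feature extractor is updated against the inverted objective.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_adversarial_loss(logits_src, logits_tgt):
    """Binary cross-entropy over all grids: the domain classifier D
    should output 1 for source grids and 0 for target grids.
    logits_* : (N,) array of D's raw outputs, one per grid n."""
    p_src = sigmoid(logits_src)   # p_n(I_S): P(grid n is from source)
    p_tgt = sigmoid(logits_tgt)   # p_n(I_T)
    loss_d = -(np.log(p_src).mean() + np.log(1.0 - p_tgt).mean())
    # The feature extractor F is trained on the inverted objective
    # (domain confusion), pushing p_tgt toward 1 and p_src toward 0.
    loss_f = -(np.log(1.0 - p_src).mean() + np.log(p_tgt).mean())
    return loss_d, loss_f
```

A well-separated classifier (confident logits of the correct sign) drives `loss_d` toward 0, while at chance (logits of 0) it sits at 2·log 2; alternating the two updates is the iterative scheme the slide describes.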
19. Class-wise Domain Alignment
p Let each class perform domain adversarial learning individually.
p But we must first address some problems:
n Under the unsupervised setting, we don't have any labels on the target domain to link with the source domain.
• Can't do domain adversarial learning against the source domain.
n In global domain adaptation, we define each grid n in the feature space as one instance.
• Can't directly use the labels, which live in the image (pixel) space.
[Figure: a pseudo label in pixel space and a grid-level soft label in feature space, related by up-sampling between the network prediction and the input image.]
20. Class-wise Domain Alignment --- Grid-Level Soft Label
p (In source domain)
n Calculate the grid-wise soft label Φ_n^c(I_S) as the probability of grid n belonging to class c:

Φ_n^c(I_S) = (1/|R(n)|) Σ_{i ∈ R(n)} 1[ y_i(I_S) = c ]

• i: the pixel index in image space.
• n: the grid index in feature space.
• R(n): the set of pixels that correspond to grid n.
• y_i(I_S): the ground-truth label of pixel i.
[Figure: pixel-level ground truth vs. grid-level soft label.]
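The grid-level soft label above can be sketched in NumPy (the function name and explicit loops are illustrative; each grid's soft label is just the fraction of its pixels carrying each ground-truth class):

```python
import numpy as np

def grid_soft_label(y, grid, n_classes):
    """y: (H, W) int ground-truth labels; grid: (H, W) int map assigning
    each pixel i to its grid n (i.e., i in R(n) iff grid[i] == n).
    Returns (N, C) soft labels with
    phi[n, c] = |{i in R(n) : y_i = c}| / |R(n)|."""
    n_grids = int(grid.max()) + 1
    phi = np.zeros((n_grids, n_classes))
    for n in range(n_grids):
        labels = y[grid == n]          # pixels belonging to grid n
        for c in range(n_classes):
            phi[n, c] = np.mean(labels == c)
    return phi
```

Each row of `phi` sums to 1, so it can be read as the probability that grid n belongs to each class.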
21. Class-wise Domain Alignment --- Pseudo Label
p (In target domain)
n Calculate the target-domain grid-wise soft pseudo label Φ_n^c(I_T) as the probability of grid n belonging to class c:

Φ_n^c(I_T) = (1/|R(n)|) Σ_{i ∈ R(n)} φ_i^c(I_T)

• i: the pixel index in image space.
• n: the grid index in feature space.
• R(n): the set of pixels that correspond to grid n.
• φ_i^c(I_T): the pixel-wise soft pseudo label of pixel i corresponding to class c.
[Figure: pixel-level pseudo label vs. grid-level soft label.]
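The target-domain version replaces the ground-truth indicator with the network's pixel-wise soft prediction; a NumPy sketch (function name illustrative):

```python
import numpy as np

def grid_soft_pseudo_label(prob, grid):
    """prob: (H, W, C) pixel-wise soft pseudo labels phi_i^c(I_T)
    (e.g., the segmenter's softmax output); grid: (H, W) int map
    assigning each pixel to its grid n. Returns (N, C) with
    Phi[n, c] = (1/|R(n)|) * sum over i in R(n) of phi_i^c."""
    n_grids = int(grid.max()) + 1
    n_classes = prob.shape[-1]
    Phi = np.zeros((n_grids, n_classes))
    for n in range(n_grids):
        Phi[n] = prob[grid == n].mean(axis=0)  # average over R(n)
    return Phi
```

Because each pixel's soft label sums to 1, each row of `Phi` also sums to 1, matching the source-domain grid-level soft label in form.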
22. Class-wise Domain Alignment
p Thanks to the pseudo labels and soft labels, we can "link" each class between the source and target domains.
p The same adversarial learning framework can then be applied per class.
[Figure: class-wise links (e.g., road, car) between source-domain ground truth and target-domain pseudo labels, with a probability bar from low to high.]
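One way to sketch this per-class linking (an assumption about the framework, not the authors' exact formulation): give each class its own domain classifier and weight each grid's contribution to class c's adversarial loss by its grid-level soft (pseudo) label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classwise_adversarial_loss(phi_src, phi_tgt, logits_src, logits_tgt):
    """phi_*   : (N, C) grid-level soft labels (source) / soft pseudo
                 labels (target), linking grids to classes.
    logits_* : (N, C) raw outputs of C per-class domain classifiers.
    Each grid n contributes to class c's loss in proportion to phi[n, c]."""
    p_src = sigmoid(logits_src)   # P(source) per grid, per class
    p_tgt = sigmoid(logits_tgt)
    eps = 1e-8
    loss_src = -(phi_src * np.log(p_src + eps)).sum(0) / (phi_src.sum(0) + eps)
    loss_tgt = -(phi_tgt * np.log(1.0 - p_tgt + eps)).sum(0) / (phi_tgt.sum(0) + eps)
    return loss_src + loss_tgt    # shape (C,): one adversarial loss per class
```

Grids dominated by "road" then mostly drive the road classifier, and likewise for "car", which is exactly the class-wise link the figure depicts.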
27. Class-wise Domain Alignment --- Static-Object Prior
p Use the static-object prior to refine the pseudo labels.
p For each pixel that belongs to the static-object prior, we suppress its probability of corresponding to non-static objects.
• P_static(I_T): the set of pixels belonging to the static-object prior.
• C_static: the set of static-object classes.
• Static objects: building, road, sidewalk, etc.
• Non-static objects: person, car, motorbike, etc.
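A minimal NumPy sketch of this suppression step (the function name and the renormalization are assumptions; the prior mask and class indices are placeholders for P_static(I_T) and C_static):

```python
import numpy as np

def refine_with_static_prior(prob, static_mask, static_classes):
    """prob: (H, W, C) pixel-wise soft pseudo labels; static_mask: (H, W)
    bool, True on pixels in the static-object prior P_static(I_T);
    static_classes: indices in C_static (e.g., road, building, sidewalk).
    Zeros out non-static class probabilities at prior pixels, then
    renormalizes so each pixel's label still sums to 1."""
    out = prob.astype(float).copy()
    n_classes = out.shape[-1]
    for c in range(n_classes):
        if c not in static_classes:
            out[..., c][static_mask] = 0.0   # suppress non-static classes
    total = out.sum(axis=-1, keepdims=True)
    total[total == 0.0] = 1.0                # guard against all-zero pixels
    return out / total
```

Pixels outside the prior are left untouched, so the refinement only sharpens the pseudo labels where the static-object evidence applies.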
35. Recap
p AOI is similar to fine-grained recognition.
"What kind of bird?" Attention should help.
(image source: http://yassersouri.github.io/pages/fast-bird-part.html)
p How to adapt to changes (e.g., due to different sensors/viewpoints)?
Domain shift. Domain adaptation should help.
(image source: http://vision.cs.uml.edu/adaptation.html)