Auro tripathy - Localizing with CNNs

How CNNs Localize Objects with Increasing Precision and Speed
Auro Tripathy
auro@shatterline.com
May 2017
How do I
fine-tune the
bounding box?
What
Class?

Outline
•  Terms, concepts, and metrics for detection algorithms
•  Two-stage detectors
    •  Region-based Convolutional Neural Networks (R-CNN)
    •  Fast R-CNN
    •  Faster R-CNN
•  Unified (single-shot) detectors
    •  You Only Look Once (YOLO)
    •  Single-Shot Detector (SSD)

What is to Classification as Where is to Detection
“We’re in the midst of an Object Detection Renaissance”
– Ross Girshick
What?
✓  Person, Probability=0.7
✓  Dog, Probability=0.8
✓  Horse, Probability=0.8
What & Where?
✓  Person, Location=(x1, y1, w1, h1), Confidence=90%
✓  Dog, Location=(x2, y2, w2, h2), Confidence=80%
✓  Horse, Location=(x3, y3, w3, h3), Confidence=90%

CNN-Based Detection Performance at a Glance
Two-Stage Techniques versus Single-Shot Techniques
[Figure: scatter plot of mean Average Precision (mAP) on VOC versus frames per second (fps); each point is labeled (fps, mAP)]
Deformable Parts Model: (0.5, 34.3)
R-CNN: (0.02, 58.5)
Fast R-CNN: (0.4, 70)
Faster R-CNN: (7, 73.2)
YOLO: (21, 63.2)
SSD300X300: (58, 77)
SSD512X512: (19, 80)

What’s Mean Average Precision (mAP)?
$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
where TP, FP, and FN are the counts of true positives, false positives, and false negatives
High precision relates to low false-positives
High recall relates to low false-negatives
1. Compute the Average Precision (AP) of each class in your test set
2. Then take the mean of these per-class average precisions to get the mean Average Precision (mAP)
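The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the official PASCAL VOC evaluation code: the function names are hypothetical, AP is taken as the uninterpolated area under the precision-recall curve (VOC uses an interpolated variant), and each detection is assumed to be already marked as a true or false positive.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(scores, is_true_positive, num_ground_truth):
    """Uninterpolated AP for one class (assumption: area under the PR curve)."""
    # Rank detections from most to least confident.
    ranked = sorted(zip(scores, is_true_positive), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        # Accumulate precision weighted by the increase in recall.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Step 2: the mean of the per-class average precisions."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

For example, `mean_average_precision({"dog": 1.0, "horse": 0.5})` averages two per-class AP values into a single mAP.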

Region-based CNN (R-CNN) Kick-started Detection

Region-Based CNN (R-CNN)
Rich feature hierarchies for accurate object detection and semantic segmentation Tech report (v5)
[Figure: R-CNN pipeline]
1.  Image
2.  Region Proposal Generator (2000 Regions)
3.  CNN - Feature Extractor Per Region
4.  CNN Output - Feature Vector
5.  Linear SVM Classifier for Region (Airplane: No, ..., Dog: Yes, ..., TV Monitor: No) and Bounding Box Regressor

Using CNNs Broke New Ground
The Downside – High Workloads for Train/Test
•  Training is a three-stage disjoint pipeline
    1.  Fine-tune a CNN on region proposals using log loss
    2.  Fit SVMs (acting as object detectors) to the CNN features, replacing Softmax
    3.  Learn to regress bounding boxes with squared loss (L2)
•  External Region Proposal Algorithm
•  No sharing of computation between the 2000 region proposals
•  Volume of data mandates that intermediate stages are stored on disk
http://videolectures.net/iccv2015_girshick_fast_r_cnn/

What’s Bounding-Box Regression?
Learn Transformation W that Maps Proposal P to Ground Truth G
[Figure: proposal box P mapped by d(P) onto ground truth box G]
$$d_\star(P) = W_\star^{T} \phi_5(P)$$
where $\star$ is one of x, y, w, h and $\phi_5$ are the Pool5 features
Transformation d(P) is parameterized into four functions:
dx(P), dy(P), dw(P), dh(P)
dx(P), dy(P) are linear translations of the center of P’s bounding box
dw(P), dh(P) are log-space translations of the width & height of P
We learn W by minimizing a standard least squares problem
with Ridge Regression regularization
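As a sketch of the parameterization above, the following Python computes the regression targets that map a proposal P onto a ground truth box G, and applies a predicted transformation d(P) back to P. It assumes boxes are given as (center x, center y, width, height) and follows the formulas of the R-CNN tech report; the function names are hypothetical, and learning W itself is not shown.

```python
import math


def regression_targets(P, G):
    """Targets (tx, ty, tw, th) that W is trained to predict for proposal P."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    tx = (gx - px) / pw        # linear translation of the center, x
    ty = (gy - py) / ph        # linear translation of the center, y
    tw = math.log(gw / pw)     # log-space change of width
    th = math.log(gh / ph)     # log-space change of height
    return tx, ty, tw, th


def apply_transform(P, d):
    """Apply d = (dx, dy, dw, dh) to proposal P to get the refined box."""
    px, py, pw, ph = P
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))
```

Applying the exact targets to P recovers G, which is a quick sanity check on the two functions.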

Learn to Only Regress Proposals that are “Nearby”
to Ground Truth with Intersection over Union
[Figure: three example box pairs at IoU Threshold = 0.9, IoU Threshold = 0.7, and IoU Threshold = 0.6]
A proposal is used for training only if the Intersection over Union (IoU) between the
proposal and the ground truth box is greater than a threshold
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/object_localization_and_detection.html
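A minimal sketch of the IoU test described above, assuming boxes are given as corner coordinates (x1, y1, x2, y2). The helper names and the default threshold of 0.6 are illustrative choices, not values prescribed by the slide.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nearby_proposals(proposals, ground_truth, threshold=0.6):
    """Keep only the proposals whose IoU with the ground truth exceeds the threshold."""
    return [p for p in proposals if iou(p, ground_truth) > threshold]
```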

Fast R-CNN Improved Detection w/ Single-Stage Training

Fast R-CNN
•  CNN over entire image instead of over a region proposal
•  Shares convolution layers
•  Continues to use external region proposals
•  Projects region proposals on top of Conv5 of VGG16
•  Simultaneously predicts
    •  Classes and
    •  Bounding Boxes via joint training
Network is designed with a classification “head” and a regression “head”
[Figure: the two heads answer “What Class?” and “How do I fine-tune the bounding box?”]
https://clipartfest.com/download/fb2cd25bdefb07cc8eb8cd28091ab62ea3519461.html

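Fast R-CNN
[Figure: Fast R-CNN architecture]
1.  Conv1 through Conv5 over the whole image
2.  Region Proposal Generator (2000 Regions)
3.  RoI Projection (for each Region)
4.  RoI Pooling Layer
5.  Fully Connected (FC6 + FC7), giving a 1024 RoI Feature Vector
6.  FC layers output the Class Probability and the Bounding Box Prediction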

Fast R-CNN – Forward and Back-Prop Paths Using Multi-Class Loss
[Figure: Fast R-CNN architecture with the Forward Path and Back-Prop Path marked]
•  Conv1 through Conv5, with RoI Projection (for each Region) from the Region Proposal Generator (2000 Regions)
•  FC compatible RoI Pooling Layer, followed by FC layers
•  Linear + Softmax outputs the Class Probability
•  Linear outputs feed the Bounding Box Regressor
•  Loss: Log Loss + Smooth L1 Loss
https://andrewliao11.github.io/object_detection/faster_rcnn/

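Multiclass Loss = Log Loss + Smooth L1 Loss
$$\text{Loss}_{\text{multiclass}} = \text{Loss}_{\text{classification}} + \lambda \cdot \text{Loss}_{\text{bounding box regression}}$$
$$\text{Loss}_{\text{multiclass}} = -\log(p_u) + \lambda \sum_{i \in \{x, y, w, h\}} \text{Smooth-L1}(t_i - v_i)$$
where $p_u$ is the predicted probability for the true class u, $t_i$ are the predicted offsets, and $v_i$ are the ground truth regression targets
$$\text{Smooth-L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
Smooth-L1 Loss is less sensitive to outliers than L2 Loss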
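The loss above can be written out directly. The sketch below is for a single RoI, with hypothetical function and parameter names; it omits the detail that background RoIs carry no box-regression term, and `lam` stands for the weight λ.

```python
import math


def smooth_l1(x):
    """0.5 * x^2 when |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def multiclass_loss(class_probs, true_class, pred_offsets, target_offsets, lam=1.0):
    """Log loss on the true class plus lam times the summed Smooth-L1 box loss."""
    # Classification term: negative log of the probability given to the true class.
    classification_loss = -math.log(class_probs[true_class])
    # Regression term: Smooth-L1 over the four (x, y, w, h) offset errors.
    box_loss = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    return classification_loss + lam * box_loss
```

Because Smooth-L1 grows linearly for large errors, a single badly wrong offset contributes far less than it would under a squared (L2) loss.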

Introduce Region-of-Interest (RoI) Pooling Layer
For Compatibility with the Fully-Connected Layer Above
•  RoI is a rectangular window into the feature map (r, c, h, w)
•  H x W grid of sub-windows
    •  (e.g., 7 x 7)
•  Each sub-window is approximately h/H x w/W
•  Max-pool the values in each sub-window into the corresponding output grid cell
Back-Propagation routes derivatives through RoI Layer
[Figure: an h x w RoI with top-left corner (r, c), divided into sub-windows of size h/H x w/W]
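A minimal pure-Python sketch of the RoI max-pooling step described above, for a single-channel feature map stored as a list of rows. The function name is hypothetical, and the way sub-window boundaries are rounded (floor for the start, ceiling for the end) is one reasonable assumption; real implementations differ in this detail.

```python
def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool the RoI (r, c, h, w) into a fixed out_h x out_w grid."""
    r, c, h, w = roi  # top-left corner (r, c), height h, width w

    def ceil_div(a, b):
        return -(-a // b)

    pooled = []
    for i in range(out_h):
        # Rows covered by sub-window i, roughly h / out_h rows each.
        row_start = r + (i * h) // out_h
        row_end = r + ceil_div((i + 1) * h, out_h)
        pooled_row = []
        for j in range(out_w):
            # Columns covered by sub-window j, roughly w / out_w columns each.
            col_start = c + (j * w) // out_w
            col_end = c + ceil_div((j + 1) * w, out_w)
            pooled_row.append(max(feature_map[y][x]
                                  for y in range(row_start, row_end)
                                  for x in range(col_start, col_end)))
        pooled.append(pooled_row)
    return pooled
```

Whatever the RoI size, the output is always out_h x out_w, which is what makes it compatible with the fully-connected layer that follows.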

Benefits of Fast R-CNN over R-CNN
•  Higher mAP over R-CNN
•  Training is single-stage using a multi-class loss
•  Training can update all network layers
•  No disk storage is required for feature-caching

Faster R-CNN Subsumes Region Proposals

Faster R-CNN
•  Replace the use of external object proposals with a Region Proposal Network (RPN)
•  The RPN reuses the CNN for object proposals!
•  RPN shares convolutions with the detection side of the network
•  Big benefit: the marginal cost of computing proposals becomes small
•  Reuse the previously covered Fast R-CNN for detection!
•  Training regime alternates between
    •  First, fine-tuning for the region proposal task
    •  Then, fine-tuning for object detection, keeping the proposals fixed

Novel “Anchor” Boxes Serve as References at Multiple Scales and Aspect Ratios
[Figure: three ways to handle multiple scales and sizes]
•  Scaled images: pyramids of feature maps are built & the classifier is run at all scales
•  Multiple filters: pyramids of filters of multiple scales and sizes are run on the feature map
•  ✓ New: pyramids of reference boxes in the regression functions; anchors = references at multiple scales and aspect ratios

Region Proposal Network
Training Classifies Objectness & Regresses Bounding Boxes
[Figure: a sliding window over the Conv5 feature map (Conv1 through Conv5) feeds two output layers]
•  k = 9 “anchor” boxes at each sliding-window location to address
    •  Three scales (128, 256, 512)
    •  Three aspect ratios (2:1, 1:1, 1:2)
•  256-dimensional vector at each sliding-window location, shared by its k anchors
•  “Objectness” Score: 2 * k = 18 Class Scores (object or background)
•  Bounding Box Regression: 4 * k = 36 Box Proposal values (x, y, w, h)
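To make the k = 9 anchors concrete, here is a small sketch that enumerates them for one sliding-window location. It assumes each aspect ratio is width:height and that every anchor keeps an area of scale squared; the function name and the (x, y, w, h) center-based layout are illustrative choices.

```python
import math


def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(2.0, 1.0, 0.5)):
    """Return len(scales) * len(ratios) anchors as (x, y, w, h) tuples."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio = width / height (assumption)
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            anchors.append((center_x, center_y, w, h))
    return anchors


anchors = make_anchors(0, 0)
k = len(anchors)
num_class_scores = 2 * k   # object or background, per anchor
num_box_values = 4 * k     # (x, y, w, h), per anchor
```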

Step 1 – Train RPN initialized w/ImageNet to Output Region Proposals
•  The RPN is fine-tuned end-to-end w/ImageNet weights
•  Output: RPN Proposals
[Figure: network diagram showing Conv1 through Conv5, the RPN Layers, and the FC, Linear, Softmax, and Bounding Box Regressor layers]
https://andrewliao11.github.io/object_detection/faster_rcnn/

Step 2 – Train Fast R-CNN with Learnt Region Proposals
•  Input: RPN Proposals learned in Step 1
•  Fast R-CNN is fine-tuned end-to-end w/ImageNet weights
•  Output: Object Class Probabilities
[Figure: network diagram showing Conv1 through Conv5, the RPN Layers, and the FC, Linear, Softmax, and Bounding Box Regressor layers]

Step 3 – Initialize RPN from Model Trained in Step 2 & Train RPN Again
•  Share the weights from Step 2 but lock them (prevent updates)
•  Output: RPN Proposals
[Figure: network diagram showing Conv1 through Conv5, the RPN Layers, and the FC, Linear, Softmax, and Bounding Box Regressor layers]

Step 4 – Fine Tune FC Layers of Fast R-CNN Using the Shared Convolution Weights from Step 3
•  Input: RPN Proposals learned in Step 3
•  Share the weights from Step 3 but prevent updates
•  Fine-tune the unique layers of Fast R-CNN
•  Output: Object Class Probabilities
[Figure: network diagram showing Conv1 through Conv5, the RPN Layers, and the FC, Linear, Softmax, and Bounding Box Regressor layers]

You Only Look Once (YOLO) Uses One Network, Runs Fast

You-Only-Look-Once (YOLO)
Do Away with Dual Networks (RPN + Classifier), Use a Single Network
•  Divide the image into an S x S grid of cells (S = 7)
•  Within each cell, predict
    1.  B = 2 Bounding Boxes
    2.  C = 20 Class Probabilities
•  Each Bounding Box predicts 5 parameters
    •  x, y, width, height, confidence
    •  x, y is the center of the box relative to the grid cell
•  Class probabilities are conditional (conditioned on the grid cell containing an object)
•  Output of Network:
    •  S * S * (5 * B + C)
    •  7 * 7 * (5 * 2 + 20) = 1470 values
[Figure: per-cell predictions, Bounding Box + Confidence and Class Probability]
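The output-size arithmetic above, together with one way of reading a single grid cell's values, can be sketched as follows. The memory layout (B boxes first, then C class probabilities) and the function names are assumptions made for illustration; actual implementations may order the values differently.

```python
def yolo_output_size(S=7, B=2, C=20):
    """Number of values predicted by the network: S * S * (5 * B + C)."""
    return S * S * (5 * B + C)


def split_cell(cell_values, B=2, C=20):
    """Split one cell's 5 * B + C values into B boxes and C class probabilities."""
    # Each box is (x, y, width, height, confidence).
    boxes = [tuple(cell_values[5 * b: 5 * b + 5]) for b in range(B)]
    class_probs = list(cell_values[5 * B: 5 * B + C])
    return boxes, class_probs


def class_scores(box, class_probs):
    """Per-class score for a box: box confidence times conditional class probability."""
    confidence = box[4]
    return [confidence * p for p in class_probs]
```

Multiplying the box confidence by the conditional class probabilities gives a class-specific score per box, which is what gets thresholded at test time.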

YOLO – Very Fast Direct Prediction Using a CNN
[Figure: YOLO network architecture; feature-map depths shown in the figure are 192, 256, 512, and 1024]
Input: 448 * 448 * 3
1.  Convs, 7x7x64-s-2; MaxPool, 2x2-s-2; feature map 112 * 112
2.  Convs, 3x3x192; MaxPool, 2x2-s-2; feature map 56 * 56
3.  Convs, 1x1x128, 3x3x256, 1x1x256, 3x3x512; MaxPool, 2x2-s-2; feature map 28 * 28
4.  Convs, (1x1x256, 3x3x512) x 4, 1x1x512, 3x3x1024; MaxPool, 2x2-s-2; feature map 14 * 14
5.  Convs, (1x1x512, 3x3x1024) x 2, 1x1x512, 3x3x1024, 3x3x1024-s-2; feature map 7 * 7 * 1024
6.  Convs, 3x3x1024, 3x3x1024; feature map 7 * 7 * 1024
7.  Fully Connected Layer; 4096 values
8.  Fully Connected Layer; 7 * 7 * (5 * 2 + 20) values
Output
S * S * (5 * B + C)
7 * 7 * (5 * 2 + 20) = 1470 values

YOLO’s 1X1 Convolutions Reduce Parameters, Run Fast
Simple Example Shows Parameters Reduced from 4860 to 1440
[Figure: an h x w Input Feature Map with 18 channels mapped to an h x w Output Feature Map with 30 channels, without and with a 1x1 Kernel in between]
•  3x3 Kernel applied directly, 18 channels to 30 channels:
    Parameter Size = 18 x (3 x 3) x 30 = 4860
•  1x1 Kernel first, 18 channels to 5 channels:
    Parameter Size = 18 x (1 x 1) x 5 = 90
•  3x3 Kernel next, 5 channels to 30 channels:
    Parameter Size = 5 x (3 x 3) x 30 = 1350
•  Total Parameter Size = 90 + 1350 = 1440
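The parameter counts in this example can be checked with a few lines of Python. The helper name is hypothetical, and biases are ignored, as they are on the slide.

```python
def conv_params(in_channels, kernel_h, kernel_w, out_channels):
    """Number of weights in a convolution layer (biases ignored)."""
    return in_channels * kernel_h * kernel_w * out_channels


# 3x3 convolution applied directly: 18 channels in, 30 channels out.
direct = conv_params(18, 3, 3, 30)

# 1x1 convolution reduces 18 channels to 5, then a 3x3 convolution expands to 30.
reduce_1x1 = conv_params(18, 1, 1, 5)
expand_3x3 = conv_params(5, 3, 3, 30)
bottleneck = reduce_1x1 + expand_3x3
```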

Non-Maximal Suppression via Intersection-Over-Union
•  Confidence score is the Intersection-over-Union (IoU) between
    •  Predicted Box
    •  Ground Truth
$$\text{IoU} = \frac{\text{Intersection Area}}{\text{Union Area}}$$
[Figure: overlapping Predicted Box and Ground Truth boxes]
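A minimal greedy non-maximal suppression sketch built on IoU. It assumes boxes are (x1, y1, x2, y2) corners with one score per box; the function names and the 0.5 overlap threshold are illustrative, not taken from the slide.

```python
def iou(box_a, box_b):
    """Intersection Area / Union Area for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For instance, two heavily overlapping boxes around the same object collapse to the higher-scoring one, while a distant box is left untouched.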

Limitations of YOLO
•  “[YOLO] struggles with small objects that appear in groups, such as flocks of birds.”
•  “[YOLO] struggles to generalize to objects in new or unusual aspect ratios or configurations.”
•  “YOLO struggles to localize objects correctly.”
You Only Look Once: Unified, Real-Time Object Detection Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

YOLOv2 catches up to SSD
Provides Tradeoffs Between Speed and Accuracy
[Figure: scatter plot of mAP on VOC versus fps; each point is labeled (fps, mAP)]
YOLOv1 448x448: (21, 63.2)
SSD512X512: (19, 80)
YOLOv2 288x288: (91, 69)
YOLOv2 352x352: (81, 73.7)
YOLOv2 416x416: (67, 76.8)
YOLOv2 480x480: (59, 77.8)
YOLOv2 544x544: (40, 78.6)
YOLO9000: Better, Faster, Stronger Joseph Redmon, Ali Farhadi University of Washington, Allen Institute for AI

Single-Shot Detector (SSD), Faster than YOLO and as Accurate as Faster R-CNN

Uses Default Boxes at Multiple Aspect Ratios & Scales
•  Use six default boxes at each feature cell
•  Similar to anchor boxes in Faster R-CNN
•  Six default box shapes
    •  Five boxes with aspect ratios { 1, 2, 3, 1/2, 1/3 } + 1 additional box with aspect ratio 1 at a different scale
In a convolutional fashion, we evaluate six default boxes of different aspect ratios at each location in two feature maps with different scales (e.g. 8 × 8 and 4 × 4)
[Figure: Default boxes on an 8x8 Feature Map and a 4x4 Feature Map]
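As a sketch of how the six default boxes at one feature cell could be generated, the code below follows the width and height formulas of the SSD paper: width = s * sqrt(a), height = s / sqrt(a), plus one extra square box at scale sqrt(s * s_next). The function name and the example scale values 0.2 and 0.34 are illustrative assumptions.

```python
import math


def default_boxes(cx, cy, scale, next_scale, ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """Six default boxes (cx, cy, w, h) centered on one feature-map cell."""
    # Five boxes: one per aspect ratio a, all at the layer's scale.
    boxes = [(cx, cy, scale * math.sqrt(a), scale / math.sqrt(a)) for a in ratios]
    # Sixth box: aspect ratio 1 at a scale between this layer's and the next layer's.
    extra = math.sqrt(scale * next_scale)
    boxes.append((cx, cy, extra, extra))
    return boxes


boxes = default_boxes(0.5, 0.5, scale=0.2, next_scale=0.34)
```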

Single-Shot Detector Uses Feature Maps at Different Scales and Concatenates Them All at the Last Layer
“…, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales.”
[Figure: a 19x19 feature map and, after a Stride=2 Convolution, a 10x10 feature map; each feeds Multiclass Scores and Bounding Box Regression, with the Forward Path and Back-Prop Path marked]

SSD – Six Progressively Smaller Layers Concatenated
[Figure: SSD architecture]
Input: 300 x 300 x 3, passed through VGG-16 thru Pool5 Layer
•  Conv4_3: 38 x 38 x 512, Default Boxes: 3*
•  Conv6 (FC): 19 x 19 x 1024
•  Conv7 (FC): 19 x 19 x 1024, Default Boxes: 6, Detections/Class = (19 * 19 * 6)
•  Conv8_2: 10 x 10 x 512, Default Boxes: 6
•  Conv9_2: 5 x 5 x 256, Default Boxes: 6
•  Conv10_2: 3 x 3 x 256, Default Boxes: 6
•  Pool 11: 1 x 1 x 256, Default Boxes: 6
* 3 Boxes to reduce computation
Concatenate Detections, Total Detections/Class: 7308
Non Maximum Suppression
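The total of 7308 detections per class can be reproduced by summing (feature map height * width * default boxes) over the six prediction layers. The assignment of 3 default boxes to Conv4_3 and 6 to every other layer is inferred here because it is the assignment that yields the stated total; the variable and function names are hypothetical.

```python
# (layer name, feature-map side length, default boxes per location)
PREDICTION_LAYERS = [
    ("Conv4_3", 38, 3),   # 3 boxes to reduce computation
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 6),
    ("Pool11", 1, 6),
]


def detections_per_class(layers):
    """Sum of side * side * boxes over all prediction layers."""
    return sum(side * side * boxes for _, side, boxes in layers)


total = detections_per_class(PREDICTION_LAYERS)
```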

SSD has Many Tools that Progressively Improve mAP
•  Data augmentation adds 6.7% mAP
    •  Scaling and cropping
•  Additionally, using lower feature maps (Conv4_3) for prediction adds 4% mAP
•  Use a variety of default box shapes
    •  Similar to Faster R-CNN anchor boxes
    •  { 1, 2, 3, 1/2, 1/3 } aspect ratio boxes + 1 additional box with aspect ratio 1
    •  The {2, 1/2, 3, 1/3} aspect ratios contribute 2.9% mAP
•  Use the atrous algorithm of VGG16 (adds 0.7% mAP)
•  Use Hard Negative Mining to balance the ratio of positive samples to negative samples

Summary
•  Single-shot methods are faster than two-stage methods
•  Single-shot mAP is comparable to that of Faster R-CNN, the best two-stage method
•  SSD is faster than YOLO, and just as accurate as Faster R-CNN
•  YOLOv2 provides tradeoffs between speed and accuracy
•  The building blocks of detection algorithms presented here can lead to higher precision and recall, i.e., more innovations to come

Links to Seminal Resources
Technique: Resource
R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation Tech report (v5)
Fast R-CNN: Fast R-CNN
Faster R-CNN: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
YOLO: You Only Look Once: Unified, Real-Time Object Detection
YOLOv2: YOLO9000: Better, Faster, Stronger
SSD: SSD: Single Shot MultiBox Detector
