This document summarizes a thesis proposal on using deep learning for articulated human pose estimation. The proposed method uses a deep convolutional neural network (DCNN) as a front-end to extract local appearance features of body parts, combined with message passing layers to model spatial relationships between parts through pairwise constraints. This global pose model is trained end-to-end using a max-sum algorithm to maximize consistency across the entire human pose. Experimental results on standard pose estimation datasets demonstrate state-of-the-art performance.
1. Deep Learning for Articulated Human Pose Estimation
Thesis Proposal Defence
Wei Yang
wyang@ee.cuhk.edu.hk
Supervisors: Prof. Xiaogang Wang and Prof. Wanli Ouyang
May 25, 2016
6. Traditional Methods
• Pictorial structures [Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005]
  • Unary templates
  • Pairwise springs
  • Unable to handle large variations (e.g., foreshortening)
• Mixture of mini-parts [Yang & Ramanan 2011]
  • Mixtures of each part
  • Unary template for each mixture type
  • Pairwise springs between mixture types of two parts
8. Deep Learning Based Methods
• (x, y) Coordinate Regression [Toshev & Szegedy, CVPR'14]
  • Holistic view
  • Mapping from images to coordinates is too difficult to learn
  • Inaccurate in the high-precision region
• Heatmap Regression [Tompson et al., CVPR'15]
  • Learning better representations
  • Geometric constraints among body parts are missing in training the DCNN
9. Motivation: Global Pose Consistency Helps in Learning Better Representations
• Local evidence is weak
• Global consistency helps training
[Figure: a CNN with spatial constraints, showing forward and backward passes]
10. Difficulties in Modeling Spatial Constraints
[Figure: part score maps (e.g., face, shoulder) convolved with spatial kernels such as face-to-shoulder (s | f) and shoulder-to-shoulder (s | s)]
• Weakly spatial histogram over body part locations
  • Less effective for large variations
• Learned by convolutional kernels
  • Parameter space is too large, hence difficult to learn
Tompson, Jonathan J., et al. "Joint training of a convolutional network and a graphical model for human pose estimation." NIPS 2014.
11. Graph Models
𝐺 = (𝑉, 𝐸)
Vertices
• Locations and mixture types of body parts
• Modeled by a front-end CNN
Edges
• Pairwise spatial relationships between body parts
• Modeled by message passing layers
32. Evaluation Metrics
Percentage of Correct Parts (PCP)
• Measures correctly localized body parts
• A candidate body part is treated as correct if its segment endpoints lie within 50% of the ground-truth segment length of the annotated endpoints
• Penalizes short limbs
Percentage of Detected Joints (PDJ)
• Measures correctly localized joints, invariant to scale
• A curve is computed by varying the localization precision threshold, which is normalized by scale, defined as the distance between the left shoulder and right hip
41. Unary Term vs. Full Model
Strict PCP on the LSP dataset (VGG-LG):

             Unary   Full Model
Torso         83.4      96.5
Head          69.0      83.1
Upper Arms    53.5      78.8
Lower Arms    34.9      66.7
Upper Legs    72.2      88.7
Lower Legs    63.5      81.7
Mean          60.1      81.1
42. Tree-Structured Model vs. Loopy Model
PCP on the LSP dataset:

             Tree Model   Loopy Model
Torso           96.2         96.5
Head            83.4         83.1
Upper Arms      78.7         78.8
Lower Arms      65.8         66.7
Upper Legs      87.9         88.7
Lower Legs      81.1         81.7
Mean            80.7         81.1
43. Future Work
• Deep Residual Learning for Human Pose Estimation
• Image-Dependent Graph Structure Learning
45. Residual Learning: Intuition
• A deeper model should not have higher training error than its shallower counterpart.
• One solution: identity mapping
46. Plain Network
• 𝐻(𝐱) is the underlying mapping
• Expect two stacked layers to approximate 𝐻(𝐱)
[Figure: 𝐱 → weight layer → ReLU → weight layer → ReLU → 𝐻(𝐱)]
47. Residual Learning
• Explicitly fit a residual mapping 𝐹(𝐱) = 𝐻(𝐱) − 𝐱, so that 𝐻(𝐱) = 𝐹(𝐱) + 𝐱
• Insight: finding the optimum around zero is easier!
[Figure: 𝐱 → weight layer → ReLU → weight layer → 𝐹(𝐱), with an identity shortcut added to give 𝐻(𝐱) = 𝐹(𝐱) + 𝐱]
51. Thank you.
Deep Learning for Articulated Human Pose Estimation
Wei Yang
wyang@ee.cuhk.edu.hk
Supervisors: Prof. Xiaogang Wang and Prof. Wanli Ouyang
Committee
Prof. Xiaogang Wang (EE)
Prof. Wai-kuen Cham (EE)
Prof. Dahua Lin (IE)
52. Appendix: Number of Message Passing Layers
PCP with different numbers of message passing layers:

             1st Layer   2nd Layer   3rd Layer
Upper Arms      78.4        78.2        78.8
Lower Arms      66.3        66.3        66.7
Upper Legs      87.9        88.3        88.7
Lower Legs      80.7        81.2        81.7
Mean            80.7        80.9        81.1
53. Appendix: Independent Training vs. Joint Training
PCP on the LSP dataset:

             Independent   Joint
Torso            93.0       95.0
Head             82.1       83.5
Upper Arms       70.6       75.0
Lower Arms       55.4       61.9
Upper Legs       82.1       86.9
Lower Legs       75.3       79.8
Mean             74.2       78.6
Good afternoon. Welcome to my thesis proposal defence.
I’m Wei Yang from the IVP group. The title of this talk is deep learning for articulated human pose estimation.
So the first question is: what is articulated human pose estimation?
Given an image or a video, the goal of articulated pose estimation is to recover the joint positions of articulated limbs of human body, as shown in this image.
The applications of articulated human pose estimation are very broad. From recognizing activities to interactive gaming systems, and from creating movies to clothing recognition, human pose is very useful information that helps solve these problems or makes them easier.
However, the pose estimation problem itself is not a trivial task. Human limbs are highly articulated and flexible, hence a person can appear with a wide variety of poses and body shapes.
Meanwhile, different viewpoints lead to different body shapes or foreshortening, and varied clothing leads to varied appearances of the human body. All these factors make the problem more difficult.
To solve the problem, earlier methods adopt part based models, which divide the human body into a set of body parts, such as the head, torso, arms, and legs. In 3D space, these parts can be modeled as cylinders.
Later works, such as pictorial structures, use two-dimensional part templates and encode the spatial relationships among different body parts using springs (the edges). However, capturing the whole range of appearances using pictorial structures is still quite difficult.
Take this picture as an example. A big problem is that even projections of a simple cylinder into 2D yield many different appearances, so one usually has to explicitly evaluate many different possible in-plane orientations and foreshortenings in order to find a good match for a part template.
To better handle the large variations, the mixture of mini-parts model has been proposed. Each part is clustered into several mixtures according to its appearance, and each mixture has its own unary template for detection. For example, in this image the mini-parts are tuned to represent near-vertical and near-horizontal limbs to approximate the transformations.
In implementation, the mixture of parts is obtained by clustering the relative locations of two neighboring body parts. We can see that the samples from the same cluster share similar visual appearance.
Recently, state-of-the-art performance on pose estimation has been achieved by deep learning methods.
DeepPose [26] estimates the (x, y) locations of the body parts with a regressor in a holistic manner. The regressor is based on deep convolutional neural networks, and its expressive power is strong. However, the mapping from raw images to (x, y) coordinates is too difficult to learn, hence this method suffers from inaccuracy in the high-precision region.
CNN-based heatmap regression models have shown the potential of learning better representations. However, geometric constraints between body parts are usually missing in the training stage. As a consequence, these kinds of methods may produce imperfect heat maps during training.
For example, these methods may produce high-response regions for the heads of unannotated people, and this error will be backpropagated to update the model parameters, which is inappropriate.
Since the local evidence is weak, we should consider the global consistency of the whole human body. This could be done by considering the geometric relationships between body parts during the training stage.
A natural way to model spatial constraints is to use convolutions. Once the spatial kernels have been learned, one can use them to enforce global pose consistency. These kernels can be calculated by creating a histogram of joint a's locations over the training set, given that the adjacent joint b is located at the kernel center. They can also be learned using the standard backpropagation algorithm. However, there are two limitations to this method.
First, these kernels have difficulty handling large variations, especially for highly articulated parts such as arms and legs. Second, the kernels should be large enough to cover sufficient context, hence the parameter space is very large and the parameters are difficult to learn.
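The spatial-kernel idea just described can be made concrete with a toy sketch. This is not the proposed method, only an illustration of how convolving one part's heatmap with a spatial prior kernel produces a consistency map for a neighboring part; the part names, map sizes, and offsets are all made up.

```python
import numpy as np

def conv2d_same(heatmap, kernel):
    """Naive 'same' 2-D convolution with zero padding; fine for illustration."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(heatmap, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]          # convolution flips the kernel
    out = np.zeros_like(heatmap, dtype=float)
    for y in range(heatmap.shape[0]):
        for x in range(heatmap.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * flipped)
    return out

# Toy spatial prior: assume the shoulder tends to sit 2 pixels below the face.
face_map = np.zeros((9, 9))
face_map[3, 4] = 1.0                      # strong face evidence at (row 3, col 4)
prior = np.zeros((5, 5))
prior[4, 2] = 1.0                         # offset (+2 rows, 0 cols) from kernel center

shoulder_prior = conv2d_same(face_map, prior)
peak = np.unravel_index(shoulder_prior.argmax(), shoulder_prior.shape)
# The peak lands 2 rows below the face response, as the prior encodes.
```

A kernel like `prior` would have to be very large to cover all plausible face-to-shoulder offsets, which is exactly the parameter-space problem noted above.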
In this proposal, we propose to incorporate the CNN and the expressive mixture of parts model into an end-to-end framework. This enables us to predict the body part locations with the consideration of global pose configurations during the training stage.
We formulate the human pose estimation problem using a graph model G = (V, E). V denotes the vertices, which specify the positions and mixture types of body parts; the vertices are modeled by a front-end CNN in our framework. The edges model the pairwise spatial relationships between body parts: a node sends a message to each of its neighbors and receives messages from each neighbor (indicated by arrows).
Here is an illustration of the proposed framework.
It can be viewed as two components: a front-end DCNN for learning feature representations of body parts, followed by a softmax layer and a logarithm layer. The second component is the message passing layers, which conduct inference and learning on the mixture of parts with deformation constraints between parts. Specifically, each message passing layer performs one iteration of the message passing algorithm in a forward pass. Finally, the final score map of each body part is computed by taking the maximum value over mixture types.
Given an image I, the full score of a pose configuration is given by the following equation, where l_i is the (x, y) location of part i and t_i is its mixture type.
The full score consists of a unary term and a pairwise term. The unary term models the part appearance and is denoted by phi; its parameter theta is learned by the front-end CNN followed by a softmax layer and a logarithm layer.
The pairwise terms model the spatial relationships between body parts. We use standard quadratic deformation constraints to model this term, which will be discussed later.
We will first discuss the front-end CNN of our framework. It is a fully convolutional network. Given an input image, the outputs of the network are score maps for the mixture types. Note that the front-end CNN does not take global pose consistency into consideration, hence the unary term may contain many false positives.
The mathematical formulation of the unary term is written as this equation. F denotes the raw score of each mixture type predicted by the front-end CNN. The following softmax layer computes the normalized score of each mixture type, and the logarithm layer then transforms the normalized score into log space.
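The softmax and logarithm layers together compute a per-pixel log-softmax over the mixture-type channels. Here is a minimal numpy sketch of that computation; the array shapes and the choice of 4 mixtures plus a background class are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def log_softmax_scores(raw_scores):
    """Turn raw per-mixture score maps F (shape: classes x H x W) into
    per-pixel log-probabilities, as the softmax + logarithm layers do.
    Numerically stable: subtract the per-pixel max before exponentiating."""
    shifted = raw_scores - raw_scores.max(axis=0, keepdims=True)
    log_z = np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    return shifted - log_z

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 4, 4))   # e.g., 4 mixture types + background, on a 4x4 map
U = log_softmax_scores(F)        # unary log-scores fed to the message passing layers
```

At every pixel the exponentiated scores sum to one, so `U` is a valid log-probability map over the mixture types.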
To make training easier and faster, we first pretrain the front-end CNN with image patches. Suppose we have P parts, and each part is clustered into K mixture types. Then an arbitrary image patch is either background or belongs to one of the P×K classes, so given a training image patch, the network predicts a label out of P×K + 1 classes. As mentioned before, the mixtures are obtained by clustering the relative locations of neighboring body parts.
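The clustering step for deriving mixture types might be sketched as follows. This is a toy illustration with made-up offsets and a minimal k-means, not the thesis's actual pipeline (which would use many more samples and clusters): each (dx, dy) offset of a part relative to its neighbor is assigned to a cluster, and the cluster index becomes the part's mixture type.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means on 2-D points; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):        # keep old center if cluster is empty
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Hypothetical relative offsets (dx, dy) of an elbow w.r.t. its shoulder:
offsets = np.array([[10.0,  2.0], [11.0,  1.0], [ 9.0,  3.0],   # near-horizontal arm
                    [ 1.0, 12.0], [ 2.0, 11.0], [ 0.0, 13.0]])  # near-vertical arm
centers, types = kmeans(offsets, k=2)
# Samples with the same mixture type share a similar limb orientation.
```

The resulting `types` labels play the role of the K mixture types per part used as pretraining classes.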
The second term consists of a deformation model that evaluates the relative locations of pairs of parts. We write psi for the squared offset between two part locations, and we write beta for the parameters of a spring that favors certain offsets over others. Beta encodes both the rest position and rigidity of the spring. In a Gaussian model, this would be the mean and covariance.
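Under the conventions just described, the deformation term can be sketched numerically. The sign convention (a non-negative penalty subtracted from the score) and the concrete beta values below are illustrative assumptions; psi stacks the offset and its square per axis, and beta plays the role of the spring's rest position having been folded into `rest_offset` plus a rigidity weight.

```python
import numpy as np

def deformation_score(l_i, l_j, rest_offset, beta):
    """Quadratic deformation score between parts i and j.
    psi = (dx, dx^2, dy, dy^2) of the offset relative to the rest position;
    beta >= 0 weights the spring's rigidity, and the penalty is subtracted."""
    dx = (l_j[0] - l_i[0]) - rest_offset[0]
    dy = (l_j[1] - l_i[1]) - rest_offset[1]
    psi = np.array([dx, dx * dx, dy, dy * dy])
    return -float(beta @ psi)

beta = np.array([0.0, 0.5, 0.0, 0.5])      # purely quadratic, symmetric spring
rest = (0.0, 10.0)                         # e.g., shoulder sits 10 px below face
perfect   = deformation_score((5, 5), (5, 15), rest, beta)  # exactly at rest
stretched = deformation_score((5, 5), (5, 19), rest, beta)  # stretched by 4 px
```

A configuration at the rest position scores 0, while stretching the spring by 4 px costs 0.5 * 4^2 = 8, so the pairwise term favors geometrically plausible offsets.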
We employ the max-sum algorithm to infer the best configuration in the graphical model. Although the max-sum algorithm is only an approximation and convergence cannot be guaranteed on loopy structures, it still provides excellent experimental results.
At each iteration, a vertex sends a message to its neighbors and receives messages from its neighbors. We denote m_ij(l_j, t_j) as the message sent from part i to part j, and u_i(l_i, t_i) as the belief of part i; the max-sum algorithm then updates the messages and beliefs by these two equations.
This process iterates several times until convergence, and then we obtain the max-sum assignment by computing the argmax of u_i.
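The message and belief updates above can be illustrated on a toy discrete chain. This is a made-up example (mixture types omitted, three hypothetical parts over four candidate locations, hand-picked scores), showing how a message m_ij(l_j) = max over l_i of [belief_i(l_i) + pairwise(l_i, l_j)] lets strong head and elbow evidence disambiguate a locally weak shoulder.

```python
import numpy as np

def pass_message(belief_i, pairwise):
    """Message from part i to part j over discrete locations:
    m_ij(l_j) = max_{l_i} [ belief_i(l_i) + pairwise(l_i, l_j) ]."""
    return (belief_i[:, None] + pairwise).max(axis=0)

# Toy chain head -- shoulder -- elbow over 4 candidate locations each.
phi = {                                   # unary (appearance) scores per location
    "head":     np.array([2.0, 0.1, 0.1, 0.1]),
    "shoulder": np.array([0.1, 0.5, 0.4, 0.1]),   # locally ambiguous
    "elbow":    np.array([0.1, 0.1, 1.5, 0.1]),
}
# Pairwise scores favor the child sitting one location after the parent:
pair = np.array([[-abs(b - a - 1.0) for b in range(4)] for a in range(4)])

# One sweep of max-sum from the leaves into the shoulder:
u_shoulder = (phi["shoulder"]
              + pass_message(phi["head"], pair)      # head -> shoulder
              + pass_message(phi["elbow"], pair.T))  # elbow -> shoulder
best = int(u_shoulder.argmax())
# Location 1 wins: it agrees with both the head (at 0) and the elbow (at 2).
```

Stacking more such sweeps corresponds to adding message passing layers, which is why farther parts can influence a joint as the layer count grows.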
Here are two examples demonstrating the results produced by different message passing layers. We can see that the results get better as we increase the number of message passing layers. This phenomenon is not difficult to understand: intuitively, a part can receive messages from farther parts as the number of message passing layers increases, which may result in better pose estimations.
We demonstrate the effectiveness of the proposed method on three widely used public datasets. The first is the LSP dataset, namely the Leeds Sports Pose dataset. It consists of 1000 training images and 1000 testing images from sports activities with challenging articulations.
The second dataset is the Frames Labeled in Cinema (FLIC) dataset. This dataset is collected from popular Hollywood movies with diverse appearances and poses. Each person is annotated with 10 upper-body joints. It consists of about 4000 training and 1016 testing images.
The third dataset is the Image Parse dataset, which contains diverse activities. We did not train on this dataset; it is only used for cross-dataset validation to evaluate the generalization ability of the proposed method.
We adopt two widely used evaluation metrics for evaluation.
The first is the Percentage of Correct Parts (PCP). It measures the rate of correctly detected limbs: a limb is considered correctly detected if the distances between the detected limb endpoints and the ground-truth limb endpoints are within half of the limb length.
However, this metric penalizes very short limbs, hence we adopt the Percentage of Detected Joints (PDJ) as a complementary evaluation metric. It measures the rate of correctly localized joints and is invariant to scale; it computes a curve by varying the localization precision threshold.
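The PCP rule for a single limb can be written down directly. This is a sketch of the standard definition as stated above; the coordinates and the 0.5 threshold are the usual convention, while the function name and the example points are made up.

```python
import numpy as np

def pcp_correct(pred_a, pred_b, gt_a, gt_b, alpha=0.5):
    """A limb is correct if both predicted endpoints lie within
    alpha * (ground-truth limb length) of their annotated endpoints."""
    limb_len = np.linalg.norm(np.subtract(gt_a, gt_b))
    err_a = np.linalg.norm(np.subtract(pred_a, gt_a))
    err_b = np.linalg.norm(np.subtract(pred_b, gt_b))
    return bool(err_a <= alpha * limb_len and err_b <= alpha * limb_len)

# Ground-truth lower arm from elbow (0, 0) to wrist (0, 20): length 20, tolerance 10.
ok  = pcp_correct((0, 3), (0, 18), (0, 0), (0, 20))   # both endpoint errors <= 10
bad = pcp_correct((0, 3), (0, 33), (0, 0), (0, 20))   # wrist error is 13 > 10
```

The tolerance scales with the limb's own length, which is exactly why very short limbs are penalized: their allowed error shrinks with them.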
Some results on the LSP dataset are visualized in this slide. The proposed method is robust to highly articulated poses with varying orientations, foreshortening, cluttered backgrounds, occlusion, and overlapping people.
We report the PCP results on the LSP dataset for six limbs: torso, head, upper arms, lower arms, upper legs, and lower legs. The cyan bars denote our method. We can see that our method achieves the highest PCP on average and on most limbs compared with previous methods. We also find that the most difficult body parts are the lower arms, because they exhibit the largest articulations.
We also show the PDJ curves on the LSP dataset for four body joints, namely the elbows, wrists, knees, and ankles. The red curve denotes our method. Comparing the PDJ values at the threshold 0.2, our method outperforms previous methods by a large margin on all body parts except the ankles.
In this slide, we show some sample results on the FLIC dataset. Compared with previous methods, our method is robust to large appearance variations and overlapping people. For example, existing methods have difficulty accurately locating the body parts of the man in the costume, while our method is able to handle this case.
From the PDJ curves, we can also see that our method improves on previous methods.
To demonstrate generalization ability, we directly used the full-body model trained on the LSP dataset to predict poses on the test images of the Image Parse dataset. The visualized results are quite satisfactory. The PCP results are also reported: the proposed method achieves better or comparable results to the state-of-the-art methods. Note that most previous methods used training data from the Image Parse dataset to train their models.
Some failure cases are shown. Our method may produce wrong estimations due to significant occlusions, ambiguous background, or heavily overlapping persons.
To evaluate the improvement brought by spatial constraints and joint learning, we compare the unary term with the full model. We find that the spatial constraints and joint learning boost the performance by about 20 percentage points.
Our framework is flexible for both tree-structured and loopy graph models. Following previous work, we add symmetry constraints between the left and right knees, and we find that this constraint is very helpful for reducing the double-counting problem in the legs.
In future work, we plan to extend the proposed framework in two directions. First, we could use a deeper and more powerful network architecture to boost performance. Second, the graph structure is currently hand-crafted and may not be optimal for every image, so we want to learn the graph structure.
The depth of networks has grown rapidly in recent years, and generally we find that the deeper the network, the better the performance. But is there a limit? Through experiments, people have found that a deeper network may produce higher training error than its shallower counterpart.
There are several reasons. The first is the notorious gradient vanishing or exploding problem. Moreover, current solvers such as stochastic gradient descent have difficulty finding the optimal mappings in very deep networks.
However, we find that a deeper model should not have higher training error than its shallower counterpart. For example, if the stacked layers are identity mappings, the training error will not increase no matter how many layers are stacked. This is the basic idea of residual learning.
Let's call the conventional network the plain network, with H(x) as the underlying mapping. We hope to approximate the underlying mapping H(x) by stacking two layers, and we know this is difficult.
But how about learning the residual between H(x) and x? Finding the optimum around zero is much easier, hence we can fit a residual mapping explicitly. One building block looks like this.
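One building block of the form H(x) = F(x) + x can be sketched in a few lines of numpy. This is a minimal forward-pass illustration (fully connected layers standing in for the weight layers, no convolutions, no training), showing the key property: if the residual branch F learns zeros, the block reduces to an identity mapping, so depth cannot increase the training error.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """One building block: H(x) = F(x) + x, where F is two weight
    layers with a ReLU in between, and the identity shortcut skips them."""
    f = relu(x @ w1) @ w2       # residual branch F(x)
    return relu(f + x)          # add the shortcut, then the output ReLU

# If the residual branch is zero, the block passes its input through unchanged:
x = np.array([[1.0, 2.0, 3.0]])
w_zero = np.zeros((3, 3))
out = residual_block(x, w_zero, w_zero)   # identical to x
```

Since the shortcut carries x unchanged, the layers only need to find a small correction around zero, which matches the intuition on the slide.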
We stack many building blocks to build a very deep network for pose estimation, called ResNet. It achieves better results than the VGG network, and we will investigate more variants of ResNet to better fit the pose estimation problem.
In the literature, the graph structure for modeling the relationships among body parts is usually designed manually [60, 5]. However, no theoretical analysis shows how to build the connections among body parts, or which graph structure is optimal. Some efforts have been made on learning graph structures [55] from data. However, the graph structure is fixed once it has been learned and lacks the flexibility to handle large variations.
As mentioned before, previous work uses convolutional kernels to learn the geometric relationships between parts. This process can be formulated by this equation: it approximates message passing from one score map to another using a convolution layer, as illustrated in the figure.
In previous work, this kind of convolution layer is either fully connected or connected by hand-crafted graph structures, and lacks the flexibility to handle large variations.
We propose to adjust the graph structure according to the image by incorporating gates to control the message passing.