This document summarizes a thesis proposal on using deep learning for articulated human pose estimation. The proposed method uses a deep convolutional neural network (DCNN) as a front-end to extract local appearance features of body parts, combined with message passing layers to model spatial relationships between parts through pairwise constraints. This global pose model is trained end-to-end using a max-sum algorithm to maximize consistency across the entire human pose. Experimental results on standard pose estimation datasets demonstrate state-of-the-art performance.
1. Deep Learning for Articulated Human Pose Estimation
Thesis Proposal Defence
Wei Yang
wyang@ee.cuhk.edu.hk
Supervisors: Prof. Xiaogang Wang and Prof. Wanli Ouyang
May 25, 2016
6. Traditional Methods
• Pictorial structures [Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005]
  • Unary templates
  • Pairwise springs
  • Unable to handle large variations (e.g., foreshortening)
• Mixture of mini-parts [Yang & Ramanan 2011]
  • Mixtures of each part
  • Unary template for each mixture type
  • Pairwise springs between mixture types of two parts
8. Deep Learning Based Methods
• (x, y) Coordinate Regression [Toshev & Szegedy, CVPR'14]
  • Holistic view
  • Mapping from images to coordinates is too difficult to learn
  • Inaccurate in the high-precision region
• Heatmap Regression [Tompson et al., CVPR'15]
  • Learning better representations
  • Geometric constraints among body parts are missing in training the DCNN
9. Motivation: Global Pose Consistency Helps in Learning Better Representations
• Local evidence is weak
• Global consistency helps training
[Figure: a CNN with spatial constraints, showing forward and backward passes]
10. Difficulties in Modeling Spatial Constraints
[Figure: part score maps (e.g., face, shoulder) convolved with spatial kernels such as face-to-shoulder (s | f) and shoulder-to-shoulder (s | s)]
• Weakly spatial histogram over body part locations
  • Less effective for large variations
• Learned by convolutional kernels
  • Parameter space is too large, hence difficult to learn
Tompson, Jonathan J., et al. "Joint training of a convolutional network and a graphical model for human pose estimation." NIPS 2014.
11. Graph Models
𝐺 = (𝑉, 𝐸)
Vertices
• Locations and mixture types of body parts
• Modeled by a front-end CNN
Edges
• Pairwise spatial relationships between body parts
• Modeled by message passing layers
32. Evaluation Metrics
Percentage of Correct Parts (PCP)
• Measures correctly localized body parts
• A candidate body part is treated as correct if its segment endpoints lie within 50% of the ground-truth segment length of the annotated endpoints
• Penalizes short limbs
Percentage of Detected Joints (PDJ)
• Measures correctly localized joints, invariant to scale
• A curve is computed by varying the localization precision threshold, which is normalized by scale, defined as the distance between the left shoulder and right hip
41. Unary Term vs. Full Model
Strict PCP on the LSP dataset (VGG-LG):

             Unary   Full Model
Torso         83.4      96.5
Head          69.0      83.1
Upper Arms    53.5      78.8
Lower Arms    34.9      66.7
Upper Legs    72.2      88.7
Lower Legs    63.5      81.7
Mean          60.1      81.1
42. Tree-Structured Model vs. Loopy Model
PCP on the LSP dataset:

             Tree Model   Loopy Model
Torso           96.2         96.5
Head            83.4         83.1
Upper Arms      78.7         78.8
Lower Arms      65.8         66.7
Upper Legs      87.9         88.7
Lower Legs      81.1         81.7
Mean            80.7         81.1
43. Future Work
• Deep Residual Learning for Human Pose Estimation
• Image-Dependent Graph Structure Learning
45. Residual Learning: Intuition
• A deeper model should not have higher training error than its shallower counterpart.
• One solution: identity mapping
46. Plain Network
• 𝐻(𝐱) is the underlying mapping
• Expect two stacked layers to approximate 𝐻(𝐱)
[Figure: 𝐱 → weight layer → ReLU → weight layer → ReLU → 𝐻(𝐱)]
47. Residual Learning
• Explicitly fit a residual mapping 𝐹(𝐱) = 𝐻(𝐱) − 𝐱, so that 𝐻(𝐱) = 𝐹(𝐱) + 𝐱
• Insight: finding the optimum around zero is easier!
[Figure: 𝐱 → weight layer → ReLU → weight layer → 𝐹(𝐱), with an identity shortcut added to give 𝐻(𝐱) = 𝐹(𝐱) + 𝐱]
51. Thank you.
Deep Learning for Articulated Human Pose Estimation
Wei Yang
wyang@ee.cuhk.edu.hk
Supervisors: Prof. Xiaogang Wang and Prof. Wanli Ouyang
Committee
Prof. Xiaogang Wang (EE)
Prof. Wai-kuen Cham (EE)
Prof. Dahua Lin (IE)
52. Appendix: Number of Message Passing Layers
PCP with different numbers of message passing layers:

             1st Layer   2nd Layer   3rd Layer
Upper Arms      78.4        78.2        78.8
Lower Arms      66.3        66.3        66.7
Upper Legs      87.9        88.3        88.7
Lower Legs      80.7        81.2        81.7
Mean            80.7        80.9        81.1
53. Appendix: Independent Training vs. Joint Training
PCP on the LSP dataset:

             Independent   Joint
Torso            93.0       95.0
Head             82.1       83.5
Upper Arms       70.6       75.0
Lower Arms       55.4       61.9
Upper Legs       82.1       86.9
Lower Legs       75.3       79.8
Mean             74.2       78.6
Good afternoon. Welcome to my thesis proposal defence.
I’m Wei Yang from the IVP group. The title of this talk is deep learning for articulated human pose estimation.
So the first question is: what is articulated human pose estimation?
Given an image or a video, the goal of articulated pose estimation is to recover the joint positions of articulated limbs of human body, as shown in this image.
The applications of articulated human pose estimation are very broad. From recognizing activities to interactive gaming systems, and from creating movies to clothing recognition, human pose is very useful information that helps solve these problems or makes them easier.
However, the pose estimation problem itself is not a trivial task. Human limbs are highly articulated and flexible, hence a person can appear with a wide variety of poses and body shapes.
Meanwhile, different viewpoints lead to different body shapes or foreshortening, and varied clothing leads to varied appearances of the human body. All these factors make the problem more difficult.
To solve the problem, earlier methods adopt part based models, which divide the human body into a set of body parts, such as the head, torso, arms, and legs. In 3D space, these parts can be modeled as cylinders.
Later works, such as pictorial structures, use two-dimensional part templates and encode the spatial relationships among different body parts using springs (the edges). However, capturing the whole range of appearances using pictorial structures is still quite difficult.
Take this picture as an example. A big problem is that even projections of a simple cylinder into 2D yield many different appearances, so one usually has to explicitly evaluate many different possible in-plane orientations and foreshortenings in order to find a good match for a part template.
To better handle the large variations, the mixture of mini-parts model has been proposed. Each part is clustered into several mixtures according to its appearance, and each mixture has its own unary template for detection. For example, in this image the mini-parts are tuned to represent near-vertical and near-horizontal limbs to approximate the transformations.
In implementation, the mixture of parts is obtained by clustering the relative locations of two neighboring body parts. We can see that the samples from the same cluster share similar visual appearance.
Recently, state-of-the-art performance on pose estimation has been achieved by deep learning methods.
DeepPose [26] estimates the (x, y) locations of the body parts with a regressor in a holistic manner. The regressor is based on deep convolutional neural networks, and its expressive power is strong. However, the mapping from raw images to (x, y) coordinates is too difficult to learn, hence this method suffers from inaccuracy in the high-precision region.
CNN-based heatmap regression models have shown the potential of learning better representations. However, geometric constraints between body parts are usually missing in the training stage. As a consequence, these kinds of methods may produce imperfect heat maps during training.
For example, these methods may produce high-response regions for the heads of unannotated people, and this error will be backpropagated to update the model parameters, which is inappropriate.
Since the local evidence is weak, we should consider the global consistency of the whole human body. This could be done by considering the geometric relationships between body parts during the training stage.
A natural way to model spatial constraints is to use convolutions. Once the spatial kernels have been learned, one can use them to enforce global pose consistency. These kernels can be calculated by creating a histogram of joint a's locations over the training set, given that the adjacent joint b is located at the kernel center. They can also be learned using the standard backpropagation algorithm. However, there are two limitations to this method.
First, these kernels have difficulty handling large variations, especially for highly articulated parts such as arms and legs. Second, the kernels should be large enough to cover sufficient context, hence the parameter space is very large and the parameters are difficult to learn.
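The spatial-kernel idea just described can be made concrete with a toy sketch. This is not the proposed method, only an illustration of how convolving one part's heatmap with a spatial prior kernel produces a consistency map for a neighboring part; the part names, map sizes, and offsets are all made up.

```python
import numpy as np

def conv2d_same(heatmap, kernel):
    """Naive 'same' 2-D convolution with zero padding; fine for illustration."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(heatmap, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]          # convolution flips the kernel
    out = np.zeros_like(heatmap, dtype=float)
    for y in range(heatmap.shape[0]):
        for x in range(heatmap.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * flipped)
    return out

# Toy spatial prior: assume the shoulder tends to sit 2 pixels below the face.
face_map = np.zeros((9, 9))
face_map[3, 4] = 1.0                      # strong face evidence at (row 3, col 4)
prior = np.zeros((5, 5))
prior[4, 2] = 1.0                         # offset (+2 rows, 0 cols) from kernel center

shoulder_prior = conv2d_same(face_map, prior)
peak = np.unravel_index(shoulder_prior.argmax(), shoulder_prior.shape)
# The peak lands 2 rows below the face response, as the prior encodes.
```

A kernel like `prior` would have to be very large to cover all plausible face-to-shoulder offsets, which is exactly the parameter-space problem noted above.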
In this proposal, we propose to incorporate the CNN and the expressive mixture of parts model into an end-to-end framework. This enables us to predict the body part locations with the consideration of global pose configurations during the training stage.
We formulate the human pose estimation problem using a graph model G = (V, E). V denotes the vertices, which specify the positions and mixture types of body parts; the vertices are modeled by a front-end CNN in our framework. The edges model the pairwise spatial relationships between body parts: a node sends a message to each of its neighbors and receives messages from each neighbor (indicated by arrows).
Here is an illustration of the proposed framework.
It can be viewed as two components: a front-end DCNN for learning feature representations of body parts, followed by a softmax layer and a logarithm layer. The second component is the message passing layers, which conduct inference and learning on the mixture of parts with deformation constraints between parts. Specifically, each message passing layer performs one iteration of the message passing algorithm in a forward pass. Finally, the final score map of each body part is computed by taking the maximum value over mixture types.
Given an image I, the full score of a pose configuration is given by the following equation, where l_i is the (x, y) location of part i and t_i is its mixture type.
The full score consists of a unary term and a pairwise term. The unary term models the part appearance and is denoted by phi; its parameter theta is learned by the front-end CNN followed by a softmax layer and a logarithm layer.
The pairwise terms model the spatial relationships between body parts. We use standard quadratic deformation constraints to model this term, which will be discussed later.
We will first discuss the front-end CNN of our framework. It is a fully convolutional network. Given an input image, the outputs of the network are score maps for the mixture types. Note that the front-end CNN does not take global pose consistency into consideration, hence the unary term may contain many false positives.
The mathematical formulation of the unary term is written as this equation. F denotes the raw score of each mixture type predicted by the front-end CNN. The following softmax layer computes the normalized score of each mixture type, and the logarithm layer then transforms the normalized score into log space.
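The softmax and logarithm layers together compute a per-pixel log-softmax over the mixture-type channels. Here is a minimal numpy sketch of that computation; the array shapes and the choice of 4 mixtures plus a background class are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def log_softmax_scores(raw_scores):
    """Turn raw per-mixture score maps F (shape: classes x H x W) into
    per-pixel log-probabilities, as the softmax + logarithm layers do.
    Numerically stable: subtract the per-pixel max before exponentiating."""
    shifted = raw_scores - raw_scores.max(axis=0, keepdims=True)
    log_z = np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    return shifted - log_z

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 4, 4))   # e.g., 4 mixture types + background, on a 4x4 map
U = log_softmax_scores(F)        # unary log-scores fed to the message passing layers
```

At every pixel the exponentiated scores sum to one, so `U` is a valid log-probability map over the mixture types.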
To make training easier and faster, we first pretrain the front-end CNN with image patches. Suppose we have P parts, and each part is clustered into K mixture types. Then an arbitrary image patch is either background or belongs to one of the P×K classes, so given a training image patch, the network predicts a label out of P×K + 1 classes. As mentioned before, the mixtures are obtained by clustering the relative locations of neighboring body parts.
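The clustering step for deriving mixture types might be sketched as follows. This is a toy illustration with made-up offsets and a minimal k-means, not the thesis's actual pipeline (which would use many more samples and clusters): each (dx, dy) offset of a part relative to its neighbor is assigned to a cluster, and the cluster index becomes the part's mixture type.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means on 2-D points; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):        # keep old center if cluster is empty
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Hypothetical relative offsets (dx, dy) of an elbow w.r.t. its shoulder:
offsets = np.array([[10.0,  2.0], [11.0,  1.0], [ 9.0,  3.0],   # near-horizontal arm
                    [ 1.0, 12.0], [ 2.0, 11.0], [ 0.0, 13.0]])  # near-vertical arm
centers, types = kmeans(offsets, k=2)
# Samples with the same mixture type share a similar limb orientation.
```

The resulting `types` labels play the role of the K mixture types per part used as pretraining classes.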
The second term consists of a deformation model that evaluates the relative locations of pairs of parts. We write psi for the squared offset between two part locations, and we write beta for the parameters of a spring that favors certain offsets over others. Beta encodes both the rest position and rigidity of the spring. In a Gaussian model, this would be the mean and covariance.
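Under the conventions just described, the deformation term can be sketched numerically. The sign convention (a non-negative penalty subtracted from the score) and the concrete beta values below are illustrative assumptions; psi stacks the offset and its square per axis, and beta plays the role of the spring's rest position having been folded into `rest_offset` plus a rigidity weight.

```python
import numpy as np

def deformation_score(l_i, l_j, rest_offset, beta):
    """Quadratic deformation score between parts i and j.
    psi = (dx, dx^2, dy, dy^2) of the offset relative to the rest position;
    beta >= 0 weights the spring's rigidity, and the penalty is subtracted."""
    dx = (l_j[0] - l_i[0]) - rest_offset[0]
    dy = (l_j[1] - l_i[1]) - rest_offset[1]
    psi = np.array([dx, dx * dx, dy, dy * dy])
    return -float(beta @ psi)

beta = np.array([0.0, 0.5, 0.0, 0.5])      # purely quadratic, symmetric spring
rest = (0.0, 10.0)                         # e.g., shoulder sits 10 px below face
perfect   = deformation_score((5, 5), (5, 15), rest, beta)  # exactly at rest
stretched = deformation_score((5, 5), (5, 19), rest, beta)  # stretched by 4 px
```

A configuration at the rest position scores 0, while stretching the spring by 4 px costs 0.5 * 4^2 = 8, so the pairwise term favors geometrically plausible offsets.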
We employ the max-sum algorithm to infer the best configuration in the graphical model. Although the max-sum algorithm is only an approximation and convergence cannot be guaranteed on loopy structures, it still provides excellent experimental results.
At each iteration, a vertex sends a message to its neighbors and receives messages from its neighbors. We denote m_ij(l_j, t_j) as the message sent from part i to part j, and u_i(l_i, t_i) as the belief of part i; the max-sum algorithm then updates the messages and beliefs by these two equations.
This process iterates several times until convergence, and then we obtain the max-sum assignment by computing the argmax of u_i.
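The message and belief updates above can be illustrated on a toy discrete chain. This is a made-up example (mixture types omitted, three hypothetical parts over four candidate locations, hand-picked scores), showing how a message m_ij(l_j) = max over l_i of [belief_i(l_i) + pairwise(l_i, l_j)] lets strong head and elbow evidence disambiguate a locally weak shoulder.

```python
import numpy as np

def pass_message(belief_i, pairwise):
    """Message from part i to part j over discrete locations:
    m_ij(l_j) = max_{l_i} [ belief_i(l_i) + pairwise(l_i, l_j) ]."""
    return (belief_i[:, None] + pairwise).max(axis=0)

# Toy chain head -- shoulder -- elbow over 4 candidate locations each.
phi = {                                   # unary (appearance) scores per location
    "head":     np.array([2.0, 0.1, 0.1, 0.1]),
    "shoulder": np.array([0.1, 0.5, 0.4, 0.1]),   # locally ambiguous
    "elbow":    np.array([0.1, 0.1, 1.5, 0.1]),
}
# Pairwise scores favor the child sitting one location after the parent:
pair = np.array([[-abs(b - a - 1.0) for b in range(4)] for a in range(4)])

# One sweep of max-sum from the leaves into the shoulder:
u_shoulder = (phi["shoulder"]
              + pass_message(phi["head"], pair)      # head -> shoulder
              + pass_message(phi["elbow"], pair.T))  # elbow -> shoulder
best = int(u_shoulder.argmax())
# Location 1 wins: it agrees with both the head (at 0) and the elbow (at 2).
```

Stacking more such sweeps corresponds to adding message passing layers, which is why farther parts can influence a joint as the layer count grows.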
Here are two examples demonstrating the results produced by different message passing layers. We can see that the results get better as we increase the number of message passing layers. This phenomenon is not difficult to understand: intuitively, a part can receive messages from farther parts as the number of message passing layers increases, which may result in better pose estimations.
We demonstrate the effectiveness of the proposed method on three widely used public datasets. The first is the LSP dataset, namely the Leeds Sports Pose dataset. It consists of 1000 training images and 1000 testing images from sports activities with challenging articulations.
The second dataset is the Frames Labeled in Cinema (FLIC) dataset. This dataset is collected from popular Hollywood movies with diverse appearances and poses. Each person is annotated with 10 upper-body joints. It consists of about 4000 training and 1016 testing images.
The third dataset is the Image Parse dataset, which contains diverse activities. We did not train on this dataset; it is only used for cross-dataset validation to evaluate the generalization ability of the proposed method.
We adopt two widely used evaluation metrics for evaluation.
The first is the Percentage of Correct Parts (PCP). It measures the rate of correctly detected limbs: a limb is considered correctly detected if the distances between the detected limb endpoints and the ground-truth limb endpoints are within half of the limb length.
However, this metric penalizes very short limbs, hence we adopt the Percentage of Detected Joints (PDJ) as a complementary evaluation metric. It measures the rate of correctly localized joints and is invariant to scale; it computes a curve by varying the localization precision threshold.
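The PCP rule for a single limb can be written down directly. This is a sketch of the standard definition as stated above; the coordinates and the 0.5 threshold are the usual convention, while the function name and the example points are made up.

```python
import numpy as np

def pcp_correct(pred_a, pred_b, gt_a, gt_b, alpha=0.5):
    """A limb is correct if both predicted endpoints lie within
    alpha * (ground-truth limb length) of their annotated endpoints."""
    limb_len = np.linalg.norm(np.subtract(gt_a, gt_b))
    err_a = np.linalg.norm(np.subtract(pred_a, gt_a))
    err_b = np.linalg.norm(np.subtract(pred_b, gt_b))
    return bool(err_a <= alpha * limb_len and err_b <= alpha * limb_len)

# Ground-truth lower arm from elbow (0, 0) to wrist (0, 20): length 20, tolerance 10.
ok  = pcp_correct((0, 3), (0, 18), (0, 0), (0, 20))   # both endpoint errors <= 10
bad = pcp_correct((0, 3), (0, 33), (0, 0), (0, 20))   # wrist error is 13 > 10
```

The tolerance scales with the limb's own length, which is exactly why very short limbs are penalized: their allowed error shrinks with them.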
Some results on the LSP dataset are visualized in this slide. The proposed method is robust to highly articulated poses with varying orientations, foreshortening, cluttered backgrounds, occlusion, and overlapping people.
We report the PCP results on the LSP dataset for six limbs: torso, head, upper arms, lower arms, upper legs, and lower legs. The cyan bars denote our method. We can see that our method achieves the highest PCP on average and on most limbs compared with previous methods. We also find that the most difficult body parts are the lower arms, because they exhibit the largest articulations.
We also show the PDJ curves on the LSP dataset for four body joints, namely the elbows, wrists, knees, and ankles. The red curve denotes our method. Comparing the PDJ values at the threshold 0.2, our method outperforms previous methods by a large margin on all body parts except the ankles.
In this slide, we show some sample results on the FLIC dataset. Compared with previous methods, our method is robust to large appearance variations and overlapping people. For example, existing methods have difficulty accurately locating the body parts of the man in the costume, while our method is able to handle this case.
From the PDJ curves, we can also see that our method improves on previous methods.
To demonstrate generalization ability, we directly used the full-body model trained on the LSP dataset to predict poses on the test images of the Image Parse dataset. The visualized results are quite satisfactory. The PCP results are also reported: the proposed method achieves better or comparable results to the state-of-the-art methods. Note that most previous methods used training data from the Image Parse dataset to train their models.
Some failure cases are shown. Our method may produce wrong estimations due to significant occlusions, ambiguous background, or heavily overlapping persons.
To evaluate the improvement brought by spatial constraints and joint learning, we compare the unary term with the full model. We find that the spatial constraints and joint learning boost the performance by about 20 percentage points.
Our framework is flexible for both tree-structured and loopy graph models. Following previous work, we add symmetry constraints between the left and right knees, and we find that this constraint is very helpful for reducing the double-counting problem in the legs.
In future work, we plan to extend the proposed framework in two directions. First, we could use a deeper and more powerful network architecture to boost performance. Second, the graph structure is currently hand-crafted and may not be optimal for every image, so we want to learn the graph structure.
The depth of networks has grown rapidly in recent years, and generally we find that the deeper the network, the better the performance. But is there a limit? Through experiments, people have found that a deeper network may produce higher training error than its shallower counterpart.
There are several reasons. The first is the notorious gradient vanishing or exploding problem. Moreover, current solvers such as stochastic gradient descent have difficulty finding the optimal mappings in very deep networks.
However, we find that a deeper model should not have higher training error than its shallower counterpart. For example, if the stacked layers are identity mappings, the training error will not increase no matter how many layers are stacked. This is the basic idea of residual learning.
Let's call the conventional network the plain network, with H(x) as the underlying mapping. We hope to approximate the underlying mapping H(x) by stacking two layers, and we know this is difficult.
But how about learning the residual between H(x) and x? Finding the optimum around zero is much easier, hence we can fit a residual mapping explicitly. One building block looks like this.
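One building block of the form H(x) = F(x) + x can be sketched in a few lines of numpy. This is a minimal forward-pass illustration (fully connected layers standing in for the weight layers, no convolutions, no training), showing the key property: if the residual branch F learns zeros, the block reduces to an identity mapping, so depth cannot increase the training error.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """One building block: H(x) = F(x) + x, where F is two weight
    layers with a ReLU in between, and the identity shortcut skips them."""
    f = relu(x @ w1) @ w2       # residual branch F(x)
    return relu(f + x)          # add the shortcut, then the output ReLU

# If the residual branch is zero, the block passes its input through unchanged:
x = np.array([[1.0, 2.0, 3.0]])
w_zero = np.zeros((3, 3))
out = residual_block(x, w_zero, w_zero)   # identical to x
```

Since the shortcut carries x unchanged, the layers only need to find a small correction around zero, which matches the intuition on the slide.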
We stack many building blocks to build a very deep network for pose estimation, called ResNet. It achieves better results than the VGG network, and we will investigate more variants of ResNet to better fit the pose estimation problem.
In the literature, the graph structure for modeling the relationships among body parts is usually designed manually [60, 5]. However, no theoretical analysis shows how to build the connections among body parts, or which graph structure is optimal. Some efforts have been made on learning graph structures [55] from data. However, the graph structure is fixed once it has been learned and lacks the flexibility to handle large variations.
As mentioned before, previous work uses convolutional kernels to learn the geometric relationships between parts. This process can be formulated by this equation: it approximates message passing from one score map to another using a convolution layer, as illustrated in the figure.
In previous work, this kind of convolution layer is either fully connected or connected by hand-crafted graph structures, and lacks the flexibility to handle large variations.
We propose to adjust the graph structure according to the image by incorporating gates to control the message passing.