Human Pose Estimation by Deep Learning
1. Human Pose Estimation by Deep Learning
Wei Yang
Supervisor: Prof. WANG Xiaogang, Prof. OUYANG Wanli
IVP Lab, CUHK
September 11, 2015
2. Outline
• Introduction
• Traditional Approaches
• Deep Learning Methods
– Global view (holistic view)
– Local appearance
– Combination of local appearance and global view
– Others
3. Introduction
• What is articulated body pose estimation?
“recovers the pose of an articulated body, which consists of joints and rigid parts
using image-based observations.”
6. Traditional Approaches
Fischler & Elschlager 1973
Felzenszwalb & Huttenlocher 2005
Pictorial Structure
• Unary Templates
• Pairwise Springs
Yang & Ramanan 2011
Mixtures of “mini-parts”
• Mixture of part 𝑖
• Unary template for part 𝑖 with mixture 𝑚𝑖
• Pairwise springs between part 𝑖 with
mixture 𝑚𝑖 and part 𝑗 with mixture 𝑚𝑗
[Figure: example of mini-parts (head, torso, leg): near-vertical and near-horizontal limbs]
7. Deep Learning for Pose Estimation
• Holistic View
–e.g., joints position regression
• Local View
–e.g., body parts detection
• Combining global and local information
–e.g., body parts detection + joints position regression
• Others
–e.g., motion features, pose estimation in videos
9. Holistic Reasoning
• Why holistic reasoning?
– Besides extreme variability in articulations, many of the joints are barely visible
10. DeepPose: A CNN Regressor
• Network architecture: AlexNet
– Krizhevsky, Sutskever, and Hinton, NIPS 2012 (ImageNet)
– The first time a deep model was shown to be effective on a large-scale computer vision task
[Toshev & Szegedy, CVPR 2014]
12. Cascade of Pose Regressors
• The pose estimation results of the initial stage are very coarse:
– Due to its fixed input size of 220 × 220, the network has limited capacity to look at detail
• Train a cascade of pose regressors for more precise joint localization
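The cascade idea can be sketched as follows. This is only an illustration under stated assumptions: `toy_regressor` is a hypothetical stand-in for a trained stage regressor (it simply points at the brightest pixel), and `crop_patch`, `cascade_refine` and the patch size of 21 are names and values chosen for this sketch, not taken from the paper.

```python
import numpy as np

def crop_patch(image, center, size):
    """Crop a size x size patch centered at (x, y); pixels outside the image are zero."""
    half = size // 2
    h, w = image.shape
    x = int(np.clip(round(center[0]), 0, w - 1))
    y = int(np.clip(round(center[1]), 0, h - 1))
    padded = np.pad(image, half, mode="constant")
    # Original pixel (y, x) sits at (y + half, x + half) in the padded image.
    return padded[y:y + size, x:x + size]

def toy_regressor(patch):
    """Hypothetical stage regressor: offset from the patch center to the brightest pixel."""
    half = patch.shape[0] // 2
    iy, ix = np.unravel_index(np.argmax(patch), patch.shape)
    return ix - half, iy - half

def cascade_refine(image, joints, regressors, patch_size=21):
    """Refine joint estimates stage by stage, each stage looking at a crop around the estimate."""
    joints = np.asarray(joints, dtype=float)
    for regress in regressors:  # one regressor per cascade stage
        refined = []
        for x, y in joints:
            patch = crop_patch(image, (x, y), patch_size)
            dx, dy = regress(patch)
            refined.append((round(x) + dx, round(y) + dy))
        joints = np.array(refined, dtype=float)
    return joints

image = np.zeros((100, 100))
image[40, 30] = 1.0          # true joint at x=30, y=40
coarse = [(27.0, 44.0)]      # coarse estimate from the initial stage
refined = cascade_refine(image, coarse, [toy_regressor, toy_regressor])
```

Each stage only sees a crop around the current estimate, which is why it can work at a higher effective resolution than the initial full-image regressor.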
17. Motivation
• Local image patches are able to capture:
– Part presence
– Pairwise part spatial relationships
Number of mixture types for each pair: 6
• 1 neighbor: number of relationships = 6^1 = 6
• 2 neighbors: number of relationships = 6^2 = 36
[Figure labels: lower arm, upper arm]
[Chen & Yuille NIPS 2014]
18. Tree-structured Relational Graph
• $T = (V, E)$
– $V$: body parts
– $E$: pairwise relationships between parts
• $\mathbf{p} = \{p_i\} = \{(x_i, y_i)\}$
– $p_i$: pixel location of part $i$
• $\mathbf{t} = \{t_{ij}, t_{ji} \mid (i, j) \in E\}$
– Pairwise relationship types
– Defined by relative position
– $t_{ij} \in \{1, \dots, T_{ij}\}$
– In experiments: 13 types for each pair
19. Formulation
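$F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta) = \sum_{i \in V} \omega_i \cdot A_i(p_i \mid I; \theta) + \sum_{(i,j) \in E} \left[ \omega_{ij} \cdot R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta) + \boldsymbol{\omega}_{ij}^{t_{ij}} \cdot \psi(p_j - p_i - r_{ij}^{t_{ij}}) \right]$
– Part presence: $\omega_i \cdot A_i(p_i \mid I; \theta)$
– Pairwise relationship: $\omega_{ij} \cdot R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta)$
– Pairwise deformation: $\boldsymbol{\omega}_{ij}^{t_{ij}} \cdot \psi(\cdot)$, where $\psi$ is the deformation feature of the offset between the two parts (written here as in Chen & Yuille) and $r_{ij}^{t_{ij}}$ is the mean relative position for type $t_{ij}$
Inference: $(\mathbf{p}^*, \mathbf{t}^*) = \arg\max_{\mathbf{p}, \mathbf{t}} F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta)$
• Tree structure
• Can be solved efficiently by dynamic programming
$\omega_i$, $\omega_{ij}$, $\boldsymbol{\omega}_{ij}^{t_{ij}}$ are learned by a latent structural SVM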
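Because the relational graph is a tree, the arg max over all parts can be computed exactly by max-sum dynamic programming. The sketch below is a generic illustration, not the authors' code: each part has K candidate states (a state can be thought of as a location combined with a relationship type), `unary` and `pairwise` are assumed to already contain the weighted scores, and `tree_max_sum` is a name chosen here.

```python
import numpy as np

def tree_max_sum(unary, pairwise, parent):
    """Exact arg max of sum_i unary[i, s_i] + sum_(i,j) pairwise[(i, j)][s_i, s_j] on a tree.

    unary: (P, K) scores; pairwise[(i, j)]: (K, K) scores with rows indexed by the
    parent state; parent[j] is the parent of node j (-1 for the root, node 0).
    Assumes nodes are ordered so that parent[j] < j.
    """
    P, K = unary.shape
    msg = unary.astype(float).copy()       # best score of the subtree rooted at each node
    best_child_state = {}
    for j in range(P - 1, 0, -1):          # pass messages from the leaves to the root
        i = parent[j]
        total = pairwise[(i, j)] + msg[j][None, :]
        best_child_state[j] = total.argmax(axis=1)
        msg[i] += total.max(axis=1)
    states = np.zeros(P, dtype=int)
    states[0] = int(msg[0].argmax())
    for j in range(1, P):                  # backtrack from the root to the leaves
        states[j] = best_child_state[j][states[parent[j]]]
    return states, float(msg[0].max())

rng = np.random.default_rng(0)
parent = [-1, 0, 1, 1]                     # a small tree: 0 -> 1 -> {2, 3}
unary = rng.normal(size=(4, 3))
pairwise = {(parent[j], j): rng.normal(size=(3, 3)) for j in range(1, 4)}
states, best = tree_max_sum(unary, pairwise, parent)
```

The cost is linear in the number of parts and quadratic in K per edge, instead of exponential in the number of parts for exhaustive search.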
20. Learning DCNN parameters 𝜃
Derive the type label for each patch
• Use the relative position $d_{ij}$ to represent the pairwise relations
• Cluster the relative positions over the whole training set $\{d_{ij}^{n}\}_{n=1}^{N}$
• Type label $t_{ij}^{n}$: cluster index
• Mean relative position $r_{ij}^{t_{ij}}$: cluster center
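A minimal sketch of deriving type labels by clustering is given below. It assumes plain K-means with a deterministic farthest-point initialization (the slide only says "cluster"), uses synthetic relative positions, and the function name `kmeans` is chosen for this example.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain K-means; returns (labels, centers). Farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[int(d.argmax())])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):        # keep the old center if a cluster is empty
                centers[c] = points[labels == c].mean(axis=0)
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centers

rng = np.random.default_rng(0)
# Synthetic relative positions d_ij = p_j - p_i with two typical configurations.
d_up = rng.normal(loc=(0.0, -20.0), scale=1.0, size=(50, 2))
d_right = rng.normal(loc=(20.0, 0.0), scale=1.0, size=(50, 2))
d = np.vstack([d_up, d_right])

labels, centers = kmeans(d, k=2)   # labels play the role of t_ij, centers of r_ij
```

Here each cluster index serves as a type label for the corresponding training sample, and each cluster center as the mean relative position for that type.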
21. Casting Full Connections into Convolutions
[Figure: part presence map and pairwise relationship map for the elbow]
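The equivalence behind this slide can be checked numerically: a fully connected layer applied to the feature vector at one location gives the same scores as a 1 × 1 convolution applied to the whole feature map. The sketch below uses random weights and hypothetical sizes; it illustrates the general trick rather than the network in the paper.

```python
import numpy as np

def fully_connected(x, W, b):
    """x: (C,) feature vector; W: (O, C); b: (O,)."""
    return W @ x + b

def conv1x1(fmap, W, b):
    """fmap: (C, H, W) feature map. A 1x1 convolution applies W at every location."""
    return np.einsum("oc,chw->ohw", W, fmap) + b[:, None, None]

rng = np.random.default_rng(0)
C, O = 8, 5                              # hypothetical channel and output sizes
W = rng.normal(size=(O, C))
b = rng.normal(size=O)
fmap = rng.normal(size=(C, 6, 7))        # features of an image larger than one patch

scores = conv1x1(fmap, W, b)             # (O, 6, 7): one score map per output type
```

Because the same weights slide over every location, the converted network accepts inputs of arbitrary size and produces a dense score map in a single forward pass.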
22. PCP and PDJ on LSP dataset and FLIC dataset
Dataset  Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
LSP      DCNN           92.5   85.1  82.7   76.3   70.2   55.9   74.8
LSP      Ouyang et al.  85.8   83.1  76.5   72.2   63.3   46.6   68.6
[Figure panels: LSP, FLIC]
23. Combining Local Appearance and Holistic View
Dual-Source Deep Neural Networks for Human Pose
Estimation
24. Dual-Source CNN
• Integrate both the local part appearance and the holistic view
of each local part for more accurate human pose estimation
• Each input is an image pair
– Part patches
– Body patches
25. Part patches: incorporate local appearance
• Generated by region proposals with some
restrictions
– Not too small (at least contain a body part)
– Not too big (may contain too many body parts and
lacks sufficient resolution)
• All classes of joints are covered by a similar
number of part patches
• During testing, part patches are selected
from multi-scale sliding windows
26. Body patches: holistic view
• Also from region proposals
– Must cover all body parts
– In testing stage, the body patch can be generated by human detection
• For DS-CNN, each training sample is made up of 3
components
– A part patch
– A body patch
– Binary mask specifying the location of the part patch in body patch
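One way to build the third component is sketched below: a binary mask marks where the part patch lies inside the body patch, and it is appended to the body patch as an extra channel. The function name `make_body_input`, the box convention and the patch sizes are assumptions for illustration; resizing to the network input size is omitted.

```python
import numpy as np

def make_body_input(body_patch, part_box):
    """Append a binary part-location mask to the body patch as a fourth channel.

    body_patch: (H, W, 3) array.
    part_box: (x0, y0, x1, y1) in body-patch pixel coordinates, x1 and y1 exclusive.
    """
    h, w, _ = body_patch.shape
    x0, y0, x1, y1 = part_box
    mask = np.zeros((h, w), dtype=body_patch.dtype)
    mask[max(y0, 0):min(y1, h), max(x0, 0):min(x1, w)] = 1
    return np.concatenate([body_patch, mask[:, :, None]], axis=2)

body = np.random.default_rng(0).random((64, 48, 3))   # synthetic body patch
x = make_body_input(body, (10, 20, 26, 36))           # 16 x 16 part patch location
```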
27. Training of the DS-CNN
[Figure: two-branch network with shared weights; classification branch (softmax) and regression branch (L2 distance)]
28. Testing
• Part heat map
– Same size as the input image
– Uniformly distributed probability within each sliding window
– Sum and average over all pixels
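The heat map construction can be sketched as below. This follows one reading of the slide: every pixel inside a sliding window receives that window's predicted probability, and each pixel's value is the average over the windows that cover it. The window coordinates and probabilities are made up for the example.

```python
import numpy as np

def part_heat_map(image_shape, windows, probs):
    """Average the per-window probabilities at every pixel.

    windows: list of (x0, y0, x1, y1) boxes, x1 and y1 exclusive.
    probs: predicted probability of the part for each window.
    """
    h, w = image_shape
    total = np.zeros((h, w))
    count = np.zeros((h, w))
    for (x0, y0, x1, y1), p in zip(windows, probs):
        total[y0:y1, x0:x1] += p      # same probability for all pixels in the window
        count[y0:y1, x0:x1] += 1
    # Pixels covered by no window keep a value of zero.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

windows = [(0, 0, 4, 4), (2, 2, 6, 6)]
probs = [0.9, 0.1]
heat = part_heat_map((8, 8), windows, probs)
```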
29. Testing
• Final pose estimation
– Weighted average of predicted joint locations within part patches with high
responses.
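A sketch of the final estimate is shown below, assuming that "high response" means a probability above a threshold and that the probabilities themselves serve as weights. The threshold value and the function name `estimate_joint` are assumptions of this example.

```python
import numpy as np

def estimate_joint(locations, probs, threshold=0.5):
    """Weighted average of the joint locations predicted within high-response part patches."""
    locations = np.asarray(locations, dtype=float)
    probs = np.asarray(probs, dtype=float)
    keep = probs >= threshold
    if not np.any(keep):                 # fall back to the single best patch
        keep = probs == probs.max()
    w = probs[keep]
    return (locations[keep] * w[:, None]).sum(axis=0) / w.sum()

locs = [(10.0, 10.0), (20.0, 20.0), (100.0, 100.0)]   # per-patch joint predictions
probs = [0.75, 0.25, 0.05]                             # per-patch part probabilities
joint = estimate_joint(locs, probs, threshold=0.2)
```

The low-probability outlier at (100, 100) is discarded before averaging, so it does not pull the estimate away from the consistent predictions.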
31. Other Methods & Applications
• MoDeep: A Deep Learning Framework Using Motion
Features for Human Pose Estimation
• Flowing ConvNets for Human Pose Estimation in Videos
32. Using Motion Features for Human Pose Estimation
• Motion is a powerful visual cue that alone can be used to
extract high-level information, including articulated pose.
Image credit: Large displacement optical flow: descriptor matching in variational motion estimation
Thomas Brox, J. Malik. IEEE TPAMI, 33(3): 500-513, 2011
33. MoDeep: Using Motion Features for Human Pose
Estimation
• Extends the Frames Labeled In Cinema (FLIC) dataset with
additional motion features
[Figure: average of frame pair; optical flow]
MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation.
Jain et al., ACCV 2014
35. Simple Spatial Model
• FLIC: multiple people with only one annotated person
• Testing: incorporate annotated torso position with simple
spatial model
[Figure: predicted left shoulder; spatial mask of left shoulder; result]
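The effect of the spatial mask can be illustrated with a toy example. The slides do not specify how the mask is built, so the rectangular mask, the offsets relative to the torso center and the name `spatial_mask` below are all hypothetical.

```python
import numpy as np

def spatial_mask(shape, torso_center, offset_box):
    """Binary mask that allows a part only inside a box placed relative to the torso center.

    offset_box: (dx0, dy0, dx1, dy1) offsets from the torso center (hypothetical values).
    """
    h, w = shape
    cx, cy = torso_center
    dx0, dy0, dx1, dy1 = offset_box
    mask = np.zeros(shape)
    mask[max(cy + dy0, 0):min(cy + dy1, h), max(cx + dx0, 0):min(cx + dx1, w)] = 1.0
    return mask

heat = np.zeros((60, 80))
heat[20, 70] = 0.9    # strong response on another, unannotated person
heat[22, 30] = 0.6    # response near the annotated torso

mask = spatial_mask(heat.shape, torso_center=(35, 35), offset_box=(-20, -25, 5, 0))
masked = heat * mask
y, x = np.unravel_index(np.argmax(masked), masked.shape)
```

Without the mask the arg max would land on the other person; with it, the response close to the annotated torso wins.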
37. Flowing ConvNets for Human Pose Estimation in Videos
• A CNN can benefit from temporal context by combining
information across multiple frames using optical flow.
38. Spatial ConvNet
Why regress heatmaps instead of
joint coordinates?
• A heatmap output can be multi-modal
• Regressing coordinates directly is a highly
non-linear mapping that is more difficult
to learn
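The difference between the two targets can be made concrete with a small sketch. It assumes a Gaussian bump as the heatmap target and arg max decoding, which are common choices but are not stated on the slide; `gaussian_heatmap` and `decode` are names chosen here.

```python
import numpy as np

def gaussian_heatmap(shape, joint, sigma=2.0):
    """Heatmap target: a Gaussian bump centered at the joint location (x, y)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = joint
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def decode(heatmap):
    """Read the joint location back as the arg max of the heatmap."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(x), int(y)

target = gaussian_heatmap((64, 64), joint=(20, 45))

# A heatmap can keep two hypotheses alive at once, which a single (x, y) output cannot.
ambiguous = (0.6 * gaussian_heatmap((64, 64), (10, 10))
             + 0.4 * gaussian_heatmap((64, 64), (50, 50)))
```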
39. Warping neighbouring heatmaps for improving pose
estimates
• Heatmaps from frames (t − n) and (t + n) warped to frame t
using tracks from optical flow (green & blue lines) can help
refine the wrongly estimated part location
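The warping step can be sketched as follows, under simplifying assumptions: the flow is dense and gives, for every pixel of frame t, its displacement (dx, dy) into the neighbouring frame; warping uses nearest-neighbour lookup; and the learned 1 × 1 convolution that combines the heatmaps is replaced by fixed weights. All names and numbers are illustrative.

```python
import numpy as np

def warp_heatmap(heat, flow):
    """Warp a neighbouring frame's heatmap into frame t.

    heat: (H, W) heatmap of the neighbouring frame.
    flow: (H, W, 2) displacement (dx, dy) from each pixel of frame t to that frame.
    """
    h, w = heat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return heat[src_y, src_x]

def combine(heatmaps, weights):
    """Per-pixel weighted sum across frames (what a 1x1 convolution over frames computes)."""
    return np.tensordot(np.asarray(weights, dtype=float), np.stack(heatmaps), axes=1)

heat_t = np.zeros((20, 30))
heat_t[3, 25] = 0.5        # wrong peak in frame t
heat_t[9, 10] = 0.4        # weaker response at the true location
heat_prev = np.zeros((20, 30))
heat_prev[8, 12] = 1.0     # confident detection in the neighbouring frame
flow = np.zeros((20, 30, 2))
flow[..., 0], flow[..., 1] = 2.0, -1.0   # uniform motion, for simplicity

warped = warp_heatmap(heat_prev, flow)
pooled = combine([heat_t, warped], [0.5, 0.5])
y, x = np.unravel_index(np.argmax(pooled), pooled.shape)
```

After warping, the confident detection from the neighbouring frame lines up with the true location in frame t and outweighs the spurious peak.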
41. Ongoing Project
• End-to-end pose estimation
– Joint learning of pose features and pose configurations
– Allows local appearance to be fine-tuned by pose configuration
[Figure labels: unary response, pairwise relationships, …]
43. Preliminary Results (PCP on LSP)
• Future work
– Pose relational graph learning
– Multi-task learning
• Human detection
• Human segmentation
– Combining global information
Head  Torso  U.arms  L.arms  U.legs  L.legs  Mean
84.7  91     68.7    53.6    80.7    73.3    72.82
44. Recent developments
• DeepPose: Human pose estimation via deep neural networks
– A Toshev, C Szegedy – CVPR, 2014
• Joint training of a convolutional network and a graphical model for human pose estimation
– JJ Tompson, A Jain, Y LeCun, C Bregler – NIPS, 2014
• Human Pose Estimation with Iterative Error Feedback
– Carreira, Joao, et al. arXiv preprint arXiv:1507.06550 (2015).
• Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation
– S Li, W Zhang, AB Chan - arXiv preprint arXiv:1508.06708, 2015
• Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network
– S Li, ZQ Liu, AB Chan – CVPR Workshop, 2014
• Flowing ConvNets for Human Pose Estimation in Videos
– T Pfister, J Charles, A Zisserman - ICCV, 2015
• R-CNNs for Pose Estimation and Action Detection
– G Gkioxari, B Hariharan, R Girshick, J Malik - arXiv preprint arXiv:1406.5212, 2014
• MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
– A Jain, J Tompson, Y LeCun, C Bregler -ACCV 2014
• Efficient object localization using convolutional networks
– J Tompson, R Goroshin, A Jain, Y LeCun, C Bregler – CVPR, 2015
• Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation
– Xiaochuan Fan, Kang Zheng, Yuewei Lin, Song Wang, CVPR 2015
• Parsing Occluded People by Flexible Compositions
– Xianjie Chen, Alan L. Yuille. CVPR 2015
• Articulated pose estimation by a graphical model with image dependent pairwise relations
– X Chen, AL Yuille –NIPS, 2014
• …
45. Thank you
46. Evaluation Metrics
• Percentage of Correct Parts (PCP)
– measures the percentage of correctly localized body parts.
– A candidate body part is treated as correct if its segment endpoints lie within
50% of the length of the ground-truth annotated endpoints.
• Percentage of Detected Joints (PDJ)
– measures the performance using a curve of the percentage of correctly localized
joints by varying localization precision threshold, which is normalized by the
scale defined as distance between left shoulder and right hip
– invariant to scale
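Both metrics follow directly from the definitions above. The sketch below uses the strict reading of PCP (both endpoints must be within half the ground-truth part length) and takes the torso size as a given number; array layouts and function names are assumptions of this example.

```python
import numpy as np

def pcp(pred_parts, gt_parts, alpha=0.5):
    """Percentage of Correct Parts.

    parts: (N, 2, 2) arrays holding the two (x, y) endpoints of each part.
    A part is correct if both endpoints lie within alpha * ground-truth part length.
    """
    pred = np.asarray(pred_parts, dtype=float)
    gt = np.asarray(gt_parts, dtype=float)
    length = np.linalg.norm(gt[:, 0] - gt[:, 1], axis=1)
    err = np.linalg.norm(pred - gt, axis=2)           # (N, 2) endpoint errors
    return np.all(err <= alpha * length[:, None], axis=1).mean()

def pdj(pred_joints, gt_joints, torso_size, threshold):
    """Percentage of Detected Joints at a threshold given as a fraction of the torso size."""
    err = np.linalg.norm(np.asarray(pred_joints, dtype=float)
                         - np.asarray(gt_joints, dtype=float), axis=1)
    return (err <= threshold * torso_size).mean()

gt_parts = [[(0, 0), (0, 10)], [(0, 0), (0, 10)]]
pred_parts = [[(0, 4), (0, 12)], [(0, 6), (0, 10)]]   # second part: one endpoint too far
pcp_value = pcp(pred_parts, gt_parts)

gt_joints = [(0, 0), (10, 0)]
pred_joints = [(1, 0), (13, 0)]
pdj_value = pdj(pred_joints, gt_joints, torso_size=10.0, threshold=0.2)
```

Sweeping `threshold` over a range of values and plotting `pdj` against it gives the PDJ curve described above.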
Editor's Notes
Good afternoon everyone. Welcome to the first IVP seminar of this term.
I’m YANG Wei. In the last two seminars, Xingyu and Chu Xiao gave us a comprehensive overview of object detection as well as traditional human pose estimation approaches. In this talk, I will continue the discussion with recent developments in human pose estimation based on deep learning methods. I hope you can benefit from these methods.
First, we will briefly review the problem of human pose estimation
Meanwhile, we will go over the traditional approaches for pose estimation, which have been discussed in the seminar given by Chu Xiao.
Then we will spend most of the time discussing several important approaches based on deep learning techniques, from both global view and local view.
According to Wikipedia, the goal of articulated pose estimation is to recover the joint positions of articulated limbs, as we show here for a man playing baseball.
There are lots of applications where being able to estimate human pose is useful. For example, pose estimation is helpful for recognizing actions. It also helps to parse clothing in fashion photographs. Recently, pose estimation has been successfully applied in human tracking and gaming systems.
However, in unconstrained images, human pose estimation can be a very hard problem because people can appear with a variety of poses, clothing, and body shapes. In the slides, you can see some very interesting and unusual examples that demonstrate how flexible the human pose is.
Traditional approaches for human pose estimation model the human as a set of parts, such as a head, torso, arm, and leg part. In 3D, these parts can be modeled as cylinders.
Pictorial structures use 2D part models, where geometric relations between parts are encoded by springs.
However, capturing the whole range of appearances using pictorial structures is still quite difficult. A big problem is that even projections of a simple cylinder into 2D yield many different appearances. So one usually has to explicitly evaluate many different possible in-plane orientations and foreshortenings in order to find a good match for a part template.
Yang and Ramanan propose mini-parts to approximate these transformations. In this case, the mini-parts are tuned to represent near-vertical and near-horizontal limbs.
With the rapid development of deep learning, several pose estimation methods based on deep learning techniques have been proposed in the last two years.
Some are based on a holistic (global) view, e.g., directly regressing body joint locations. Some are based on local appearance. Others combine the global and local views in a unified framework and achieve state-of-the-art results.
Finally, we will also discuss some pioneering work on pose estimation in videos.
For example, in the left image, we can guess the location of the right arm only because we see the rest of the pose and anticipate the activity of the person.
Similarly, in the right image, the left half body of the person is not visible at all. Since Deep Neural Networks can model very complex relationships, the authors believe that DNN can provide a holistic reasoning.
The initial stage of DeepPose is quite straightforward. It trains a DNN to regress the locations of all the body joints given an input image.
DeepPose adopts AlexNet as the basic network structure. This structure was proposed in 2012. It won the ImageNet competition by a large margin, and it was the first time that a deep model was shown to be effective on a large-scale computer vision task.
These are the visualized results on the LSP dataset. We can see that this method has limitations in regions that require high precision, such as the lower arms and lower legs.
It is worth mentioning that this method is very fast, since predictions can be obtained by batch forward propagation.
The pose estimation results from the initial stage are very coarse, especially in regions that require high precision.
One possible reason is that, because the input size is fixed at 220 by 220, the network has limited capacity to look at details.
To refine the coarse regression results, the authors further train a cascade of pose regressors for more precise joint localization.
Given the predicted joint locations from the previous stage, we first crop image patches centered at the predicted locations, and then train a DNN-based regressor to refine the respective locations.
This process can be repeated several times. It helps refine the coarse predictions because the network can see higher-resolution regions.
The ground truths are in green and predicted poses are in red. We can see that the initial stage is usually successful at estimating roughly correct pose. However, the results are not precise enough. After one stage of refinement, the results are much more accurate.
We observe that local image patches are not only able to capture part presence, but also able to reason about pairwise spatial relationships.
For example, the patch centered at the wrist can predict the relative position of the elbow; the patch centered at the elbow can reliably predict the positions of the shoulder and wrist.
We use a mixture model to define different types of spatial relationships. The right panel shows typical spatial relationships the wrist can have with its neighbor, the elbow.
The left panel shows the typical spatial relationships the elbow can have with its two neighbors, the shoulder and the wrist.
Based on this observation, we can define human pose as a tree-structured graph, where each node denotes the position of a part and the edges denote the pairwise spatial relationships.
We define the score function of part locations p and pairwise relation types t. It is computed by summing the unary appearance term and the pairwise relationship term. The unary term is the part presence map, indicating the probability that part i appears at each location of the image. The pairwise term consists of two parts. The first part is the pairwise relationship map, and the second part is the deformation cost. Theta denotes the parameters learned by the CNN.
Inference finds the positions and mixture types that maximize this score. As the relational graph is a tree, inference can be solved efficiently by dynamic programming.
Here we talk about how to learn theta. Given an image, we want to produce a score map that indicates the probability of each specific type. This is done by learning a multi-class classifier on local image patches. First, we need to derive the type label for each patch.
Then we use two convolutional layers with 1 by 1 kernels to replace the original fully connected layers. The network then becomes fully convolutional: it can be applied to an input image of arbitrary size, and the output is the score map for each type, as we want.
Then we can easily compute the part presence map and pairwise relationship maps, as this figure illustrates.
For example, to compute the part presence map of the elbow, we just add together all the score maps associated with the elbow-to-shoulder and elbow-to-wrist types.
To compute the pairwise relationship maps, we need to perform marginalization.
Here are
As we discussed before, both global and local methods have merits and drawbacks for human pose estimation. Hence, at this year's CVPR, a paper combining both local appearance and the holistic view was proposed.
In this paper, the authors train a network with dual sources, which is to say that each input is an image pair. One image is the part patch, which incorporates local appearance information. The other is the body patch, which incorporates the global context information. The authors hope that this combination results in more accurate human pose estimation.
The authors first use objectness methods to generate many category-independent object proposals, shown as the boxes in the image. Then the part patches are selected under some restrictions. First, the region cannot be too small: it must contain a whole body part. Second, the proposed region cannot be too big either. Because all patches are warped to the same size as the network input, regions that are too large lack sufficient resolution.
Moreover, for efficient training, all classes of joints are covered by a similar number of part patches.
During testing, part patches are selected from multi-scale sliding windows.
Body patches are also selected from region proposals. The region must cover all the body parts. In the testing stage, these regions can be generated by human detection.
The binary mask is concatenated with the body patch as an additional alpha channel.
During training, both the part patch and the body patch are fed into a two-branch CNN. The local part branch predicts the label of the part patch. This is a classification problem, trained with the softmax loss function.
The global branch predicts the x, y coordinates given the body patch and the corresponding part mask. This is a regression problem, trained with the Euclidean loss function.
Note that the structures of the two branches are the same, hence the weights are shared except for the last layer.
In the test stage, a heat map is generated for each part. The heat map has the same size as the input image. First, the part patches are obtained by the sliding-window method. Then the trained network predicts the probability of each label for each part patch. The pixels within a patch share the same probability. Finally, we sum and average over all pixels to get the final heat map.
While the heat map provides a rough estimate of the joint location, it is insufficient to accurately localize the body joints. Remember that the global branch predicts the accurate joint location within a given patch. Hence, for a specific part, we select the part patches with high probability and compute the weighted average of the predicted joint locations to get the final joint location.
Here is the PCP value on the LSP dataset. We can see that this method improves the performance by a large margin.
OK. After discussing methods from the local and global views, let's discuss some applications of pose estimation in videos.
We all know that motion is a powerful….
This figure illustrates the optical flow. The left side is the average of two adjacent frames. The right side is the estimated optical flow. We can see that the motion feature greatly suppresses the background, which would be a great help for pose estimation.
Here, a method called MoDeep tries to incorporate motion features to improve human pose estimation.
This method extends the FLIC dataset with additional motion features, as shown in the figure.
Then it trains a multi-resolution convolutional network to predict the heat maps for each body part, with the additional motion features as input.
FLIC is a dataset with multiple people in an image, but only one person is annotated. In the testing stage, the torso box can be used to help determine which person's pose is to be estimated. This method computes a spatial mask for each part with respect to the torso box. This mask is helpful for suppressing false positives.
Here are some experimental results. The first row shows the estimated poses without motion features, and the second row shows those with motion features.
We can see that motion can greatly improve the results under occlusion, cluttered backgrounds, and motion blur.
A very similar work also uses optical flow to track human pose in videos. This work was published at this year's ICCV. It first uses a CNN to predict heatmaps for each body part in each frame. Then, for the t-th frame, it computes the optical flow of frames t-n to t+n with respect to the t-th frame. The heatmaps are then warped to the t-th frame according to the optical flow. Finally, the authors use a 1 by 1 convolutional layer to combine all the heatmaps.
Here is an illustration of the network producing part heat maps. The authors discussed why….
Finally, I want to give a brief introduction to my ongoing project. As we have discussed before, most pose estimation frameworks are not end-to-end. They often learn pose features first, and then fix the features to optimize a relationship model.
In my work, I design an end-to-end pose estimation framework. It can be viewed as a feature extraction part plus a deformable part model. However, we plug the deformable part model into the network, so the parameters of both parts can be learned jointly.
Here is an illustration of the deformable part model. Since the relational graphs of human pose are often tree-structured, we can use message passing for efficient inference. This is very similar to a traditional recurrent neural network. Here, each time step denotes a part. The message is passed from the leaves to the root. The deformation weights are shared across different parts. We can use backpropagation to learn the deformation weights and the weights of the convolutional and fully connected layers jointly.
Preliminary experiments show that the proposed method outperforms most of the traditional approaches. However, it is still not better than recent deep learning methods. In the future, we plan to learn the pose relational graph from the dataset. Meanwhile, pose estimation may benefit from related tasks such as human detection and human segmentation.
Finally, we need to figure out how to combine global information into this framework.