Human Pose Estimation by Deep Learning
Wei Yang
Supervisor: Prof. WANG Xiaogang, Prof. OUYANG Wanli
IVP Lab, CUHK
September 11, 2015
Outline
• Introduction
• Traditional Approaches
• Deep Learning Methods
– Global view (holistic view)
– Local appearance
– Combination of local appearance and global view
– Others
2015/9/11 2
Introduction
• What is articulated body pose estimation?
“recovers the pose of an articulated body, which consists of joints and rigid parts
using image-based observations.”
Applications
• Action recognition
• Clothing parsing
• Gaming
• Human tracking
Challenges
Traditional Approaches
Fischler & Elschlager 1973
Felzenszwalb & Huttenlocher 2005
Pictorial Structure
• Unary Templates
• Pairwise Springs
Yang & Ramanan 2011
Mixtures of “mini-parts”
• Mixture of part $i$
• Unary template for part $i$ with mixture $m_i$
• Pairwise springs between part $i$ with mixture $m_i$ and part $j$ with mixture $m_j$
[Figure labels: head; torso; leg]
Example of mini-parts: near-vertical and near-horizontal limbs
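As a rough illustration of the pictorial-structure idea above (unary templates plus pairwise springs), the sketch below scores one candidate pose. The part names, the quadratic spring form, and every number are illustrative assumptions, not values taken from the cited papers.

```python
# Toy pictorial-structure score: unary terms plus quadratic "spring" penalties.
# All part names, offsets, and weights below are made up for illustration.

def pose_score(locations, unary, springs):
    """locations: {part: (x, y)}; unary: {part: appearance score at that location};
    springs: list of (parent, child, rest_dx, rest_dy, stiffness)."""
    score = sum(unary[part] for part in locations)
    for parent, child, rest_dx, rest_dy, stiffness in springs:
        px, py = locations[parent]
        cx, cy = locations[child]
        dx = (cx - px) - rest_dx
        dy = (cy - py) - rest_dy
        score -= stiffness * (dx * dx + dy * dy)  # a stretched spring lowers the score
    return score


locations = {"head": (50, 20), "torso": (50, 60), "leg": (52, 110)}
unary = {"head": 1.0, "torso": 2.0, "leg": 0.5}
springs = [("torso", "head", 0, -40, 0.01), ("torso", "leg", 0, 50, 0.01)]
print(pose_score(locations, unary, springs))
```

A mixture-of-parts model in the spirit of Yang & Ramanan would additionally index the unary scores and spring parameters by a per-part mixture type; that extension is omitted here.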
Deep Learning for Pose Estimation
• Holistic View
–e.g., joints position regression
• Local View
–e.g., body parts detection
• Combining global and local information
–e.g., body parts detection + joints position regression
• Others
–e.g., motion features, pose estimation in videos
Holistic View
DeepPose: Human Pose Estimation via Deep Neural
Networks
Holistic Reasoning
• Why holistic reasoning?
– Besides extreme variability in articulations, many of the joints are barely visible
DeepPose: A CNN Regressor
• Network architecture: AlexNet
– Krizhevsky, Sutskever, and Hinton, NIPS 2012 (ImageNet)
– The first time a deep model was shown to be effective at large scale
[Toshev & Szegedy, CVPR 2014]
Results on LSP (Leeds Sports Pose) dataset
Cascade of Pose Regressors
• The initial pose estimates are very coarse:
– Because the input size is fixed at 220 × 220, the network has limited capacity to look at detail
• Train a cascade of pose regressors for more precise joint localization
Cascade of Pose Regressors
Refined pose estimation
Percentage of Correct Parts (PCP) on LSP dataset
Local Appearance Method
Articulated Pose Estimation by a Graphical Model
with Image Dependent Pairwise Relations
Motivation
• Local image patches are able to capture:
– Part presence
– Pairwise part spatial relationships
Number of mixture types for each pair: 6
Neighbors: 1, # of relationships: $6^1 = 6$
Neighbors: 2, # of relationships: $6^2 = 36$
[Figure labels: lower arm; upper arm]
[Chen & Yuille NIPS 2014]
Tree-structured Relational Graph
• $T = (V, E)$
– $V$: body parts
– $E$: pairwise relationships between parts
• $\mathbf{p} = \{p_i\}$, with $p_i = (x_i, y_i)$
– $p_i$: pixel location of part $i$
• $\mathbf{t} = \{t_{ij}, t_{ji} \mid (i, j) \in E\}$
– Pairwise relationship
– Defined by relative position
– $t_{ij} \in \{1, \dots, T_{ij}\}$
– In experiments: 13 types for each pair $(i, j) \in E$
Formulation
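$$F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta) = \sum_{i \in V} \omega_i \cdot \underbrace{A_i(p_i \mid I; \theta)}_{\text{part presence}} + \sum_{(i,j) \in E} \Big[ \underbrace{\boldsymbol{\omega}_{ij}^{t_{ij}} \cdot (\text{deformation feature})}_{\text{pairwise deformation}} + \omega_{ij} \cdot \underbrace{R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta)}_{\text{pairwise relationship}} \Big]$$
Inference: $(\mathbf{p}^*, \mathbf{t}^*) = \arg\max_{\mathbf{p}, \mathbf{t}} F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta)$
• Tree structure
• Can be solved efficiently by dynamic programming
• $\omega_i$, $\omega_{ij}$, $\boldsymbol{\omega}_{ij}^{t_{ij}}$ are learned by a latent structural SVM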
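To make the dynamic-programming remark concrete, here is a minimal max-sum sketch on a chain of parts, which is a special case of a tree. The unary scores, the pairwise function, and the three candidate locations per part are all made-up assumptions; the model in the talk also maximizes over relationship types, which this sketch leaves out.

```python
# Toy max-sum dynamic programming on a chain of parts (a special case of a tree).
# unary[i][s]: score of placing part i at candidate location s (made-up numbers).
# pair(i, q, s): pairwise score between part i-1 at location q and part i at s.

def chain_argmax(unary, pair):
    best = [list(unary[0])]   # best[i][s]: best score of parts 0..i with part i at s
    back = []                 # back[i-1][s]: best location of part i-1 given part i at s
    for i in range(1, len(unary)):
        row, ptr = [], []
        for s, u in enumerate(unary[i]):
            cands = [best[i - 1][q] + pair(i, q, s) for q in range(len(unary[i - 1]))]
            q_best = max(range(len(cands)), key=cands.__getitem__)
            row.append(u + cands[q_best])
            ptr.append(q_best)
        best.append(row)
        back.append(ptr)
    # Backtrack from the best final location.
    s = max(range(len(best[-1])), key=best[-1].__getitem__)
    total, path = best[-1][s], [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return total, path[::-1]


def pair(i, q, s):
    return -0.3 * abs(s - q)  # assumed penalty on jumps between neighbouring parts


unary = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.6], [0.0, 0.0, 1.0]]
total, path = chain_argmax(unary, pair)
print(total, path)
```

The cost is linear in the number of parts and quadratic in the number of candidate locations per part, which is why a tree-structured graph keeps exact inference tractable.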
Learning DCNN parameters 𝜃
Derive the type label for each patch
• Use the relative position $d_{ij}$ to represent the pairwise relations
• Cluster the relative positions over the whole training set $\{d_{ij}^{n}\}_{n=1}^{N}$
• Type label $t_{ij}^{n}$: cluster index
• Mean relative position $r_{ij}^{t_{ij}}$: cluster center
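The clustering step above can be sketched with a small k-means loop. This is only an illustration under assumptions: the slide does not name the clustering algorithm, and the offsets, the two initial centers, and the function name `cluster_types` are invented for the example (the talk reports 13 types per pair in experiments).

```python
# Toy k-means over relative positions d_ij = p_j - p_i (made-up data).
# The cluster index plays the role of the type label, and the cluster
# centre plays the role of the mean relative position.

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def cluster_types(offsets, centers, iterations=10):
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centre gives the type label.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(d, centers[k]))
                  for d in offsets]
        # Update step: each centre becomes the mean of its members.
        updated = []
        for k, old in enumerate(centers):
            members = [d for d, label in zip(offsets, labels) if label == k]
            if members:
                old = (sum(m[0] for m in members) / len(members),
                       sum(m[1] for m in members) / len(members))
            updated.append(old)
        centers = updated
    return labels, centers


offsets = [(10, 0), (12, 1), (11, -1), (0, 10), (1, 12), (-1, 11)]
labels, centers = cluster_types(offsets, centers=[(10, 0), (0, 10)])
print(labels, centers)
```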
Casting Full Connections into Convolutions
[Figure labels: elbow; part presence map; pairwise relationship map]
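The idea of casting a fully connected layer into a convolution can be checked on a toy case: a fully connected unit trained on k × k patches gives the same value at every window as the same weights applied as a k × k kernel slid over the whole image. The sketch below uses a single unit, a 3 × 3 image, and made-up weights; none of these come from the talk.

```python
# One fully connected unit on a k x k patch versus the same weights used
# as a k x k kernel over the full image. Weights and image are made up.

def fc_score(patch, weights, bias):
    """Fully connected unit applied to one flattened k x k patch."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def conv_map(image, weights, bias, k):
    """Same weights reshaped to a k x k kernel and slid over the whole image."""
    kernel = [weights[r * k:(r + 1) * k] for r in range(k)]
    h, w = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[y + a][x + b]
                 for a in range(k) for b in range(k)) + bias
             for x in range(w - k + 1)]
            for y in range(h - k + 1)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
weights, bias, k = [1, 0, -1, 2], 0.5, 2
dense = conv_map(image, weights, bias, k)
print(dense)
```

Evaluating the kernel once over the image shares computation between overlapping windows, which is the practical reason for producing dense part-presence and pairwise-relationship maps this way.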
PCP and PDJ on LSP dataset and FLIC dataset
Dataset  Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
LSP      DCNN           92.5   85.1  82.7   76.3   70.2   55.9   74.8
LSP      Ouyang et al.  85.8   83.1  76.5   72.2   63.3   46.6   68.6
[Figure panels: LSP; FLIC]
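The Mean PCP column is consistent with an average over ten body parts in which each limb type counts twice (left and right) and torso and head count once. That weighting is inferred from the numbers, not stated on the slide; the short check below reproduces both reported means under that assumption.

```python
# Per-part PCP values copied from the table above.
rows = {
    "DCNN": {"Torso": 92.5, "Head": 85.1, "U.Leg": 82.7,
             "L.Leg": 76.3, "U.Arm": 70.2, "L.Arm": 55.9},
    "Ouyang et al.": {"Torso": 85.8, "Head": 83.1, "U.Leg": 76.5,
                      "L.Leg": 72.2, "U.Arm": 63.3, "L.Arm": 46.6},
}

# Assumption: limbs are counted twice (left and right), torso and head once.
COUNTS = {"Torso": 1, "Head": 1, "U.Leg": 2, "L.Leg": 2, "U.Arm": 2, "L.Arm": 2}


def mean_pcp(parts):
    total = sum(COUNTS[name] * value for name, value in parts.items())
    return total / sum(COUNTS.values())


for method, parts in rows.items():
    print(method, round(mean_pcp(parts), 1))
```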
Combining Local Appearance and Holistic View
Dual-Source Deep Neural Networks for Human Pose
Estimation
Dual-Source CNN
• Integrate both the local part appearance and the holistic view
of each local part for more accurate human pose estimation
• Each input is an image pair
– Part patches
– Body patches
Part patches: incorporate local appearance
• Generated by region proposals with some restrictions
– Not too small (must contain at least one body part)
– Not too big (may contain too many body parts and lack sufficient resolution)
• All classes of joints are covered by a similar number of part patches
• During testing, part patches are selected from multi-scale sliding windows
Body patches: holistic view
• Also from region proposals
– Must cover all body parts
– In the testing stage, the body patch can be generated by human detection
• For DS-CNN, each training sample is made up of 3 components
– A part patch
– A body patch
– Binary mask specifying the location of the part patch in body patch
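One possible way to build the third component, the binary mask, is sketched below. The box convention (end-exclusive pixel coordinates), the mask resolution, and the function name `binary_mask` are assumptions for illustration; the slide does not specify how the mask is rasterized.

```python
# Hypothetical helper: rasterise the part patch inside the body patch.
# Boxes are (x0, y0, x1, y1) in image pixels, end-exclusive (assumed convention).

def binary_mask(body_box, part_box, size):
    """Return a size x size grid over the body patch with 1 inside the part patch."""
    bx0, by0, bx1, by1 = body_box
    px0, py0, px1, py1 = part_box
    width, height = bx1 - bx0, by1 - by0
    # Map part-box edges into mask cells using integer floor / ceiling division.
    col0 = max(0, (px0 - bx0) * size // width)
    col1 = min(size, -(-(px1 - bx0) * size // width))
    row0 = max(0, (py0 - by0) * size // height)
    row1 = min(size, -(-(py1 - by0) * size // height))
    return [[1 if row0 <= r < row1 and col0 <= c < col1 else 0
             for c in range(size)]
            for r in range(size)]


mask = binary_mask(body_box=(0, 0, 100, 200), part_box=(25, 50, 50, 100), size=4)
for row in mask:
    print(row)
```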
Training of the DS-CNN
[Figure labels: shared weights; classification (softmax); regression (L2 distance)]
Testing
• Part heat map
– Same size as the input image
– Uniformly distributed probability for each sliding window
– Sum and average over all pixels
Testing
• Final pose estimation
– Weighted average of predicted joint locations within part patches with high
responses.
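The two testing steps can be sketched as follows. This is one reading of the slides, with several assumptions: each window spreads its class probability uniformly over its pixels, overlapping contributions are averaged per pixel, and "high responses" is modelled with a hypothetical `min_response` threshold. The windows and predictions are made up.

```python
# Sketch of the testing stage under the assumptions stated above.

def part_heatmap(height, width, windows):
    """windows: list of (x0, y0, x1, y1, prob), end-exclusive boxes with the
    classifier probability of one joint class for that sliding window."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1, prob in windows:
        share = prob / ((x1 - x0) * (y1 - y0))  # spread uniformly over the window
        for y in range(y0, y1):
            for x in range(x0, x1):
                total[y][x] += share
                count[y][x] += 1
    return [[total[y][x] / count[y][x] if count[y][x] else 0.0
             for x in range(width)]
            for y in range(height)]


def final_joint(predictions, min_response):
    """predictions: list of (x, y, response) joint locations regressed from
    part patches. Returns the response-weighted mean of the confident ones."""
    kept = [(x, y, r) for x, y, r in predictions if r >= min_response]
    if not kept:
        return None
    weight = sum(r for _, _, r in kept)
    return (sum(x * r for x, _, r in kept) / weight,
            sum(y * r for _, y, r in kept) / weight)


heat = part_heatmap(2, 3, [(0, 0, 2, 2, 0.8), (1, 0, 3, 2, 0.4)])
joint = final_joint([(10, 20, 0.9), (14, 24, 0.9), (100, 100, 0.1)], min_response=0.5)
print(heat[0], joint)
```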
Results: PCP on LSP
Other Methods & Applications
• MoDeep: A Deep Learning Framework Using Motion
Features for Human Pose Estimation
• Flowing ConvNets for Human Pose Estimation in Videos
Using Motion Features for Human Pose Estimation
• Motion is a powerful visual cue that alone can be used to extract high-level information, including articulated pose.
Image credit: Large displacement optical flow: descriptor matching in variational motion estimation
Thomas Brox, J. Malik. IEEE TPAMI, 33(3): 500-513, 2011
MoDeep: Using Motion Features for Human Pose Estimation
• Extends the Frames Labeled In Cinema (FLIC) dataset with additional motion features
MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation.
Jain et al., ACCV 2014
[Figure labels: average of frame pair; optical flow]
Multi-resolution efficient sliding window model
Simple Spatial Model
• FLIC: multiple people with only one annotated person
• Testing: incorporate annotated torso position with simple
spatial model
[Figure panels: predicted left shoulder; spatial mask of left shoulder; result]
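The effect of the spatial model can be illustrated by masking a joint heatmap around the annotated torso so that responses from other people are suppressed. The hard circular mask, the radius, and the tiny heatmap are assumptions made for this sketch; the slide only says that a simple spatial model uses the annotated torso position.

```python
# Toy spatial mask: keep heatmap responses near the annotated torso only.
# A hard disc of fixed radius is an assumed stand-in for the spatial model.

def apply_spatial_mask(heatmap, torso, radius):
    tx, ty = torso
    return [[v if (x - tx) ** 2 + (y - ty) ** 2 <= radius ** 2 else 0.0
             for x, v in enumerate(row)]
            for y, row in enumerate(heatmap)]


def argmax2d(heatmap):
    """Return (x, y) of the strongest response."""
    _, x, y = max((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return x, y


# Strong response at (5, 0) from another person; weaker one at (1, 1)
# from the annotated person whose torso is at (1, 2).
heatmap = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.7, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
masked = apply_spatial_mask(heatmap, torso=(1, 2), radius=2)
print(argmax2d(heatmap), argmax2d(masked))
```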
Experiment results
[Figure rows: without motion feature; with motion feature. Columns: occlusion; cluttered background; motion blur]
Flowing ConvNets for Human Pose Estimation in Videos
• A CNN can benefit from temporal context by combining information across multiple frames using optical flow.
Spatial ConvNet
Why regress heatmaps instead of joint coordinates?
• The network output can be multi-modal
• Regressing coordinates directly is a highly non-linear mapping that is more difficult to learn
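A small sketch of the heatmap representation: a joint is encoded as a Gaussian bump, decoded by taking the location of the maximum, and two candidate locations can coexist in one map, which a single coordinate pair cannot express. The Gaussian target, `sigma`, and the map size are assumptions for illustration.

```python
import math

# Heatmap encoding and decoding of a joint location (illustrative only).

def gaussian_heatmap(height, width, cx, cy, sigma=1.5):
    """Target heatmap with a Gaussian bump centred on the joint at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def decode(heatmap):
    """Return (x, y) of the maximum response."""
    _, x, y = max((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return x, y


single = gaussian_heatmap(8, 10, cx=6, cy=3)
print(decode(single))

# A multi-modal map: a confident mode at (2, 2) and a weaker one at (7, 5).
first = gaussian_heatmap(8, 10, cx=2, cy=2)
second = gaussian_heatmap(8, 10, cx=7, cy=5)
multi = [[max(a, 0.6 * b) for a, b in zip(r1, r2)] for r1, r2 in zip(first, second)]
print(decode(multi), multi[5][7])
```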
Warping neighbouring heatmaps for improving pose
estimates
• Heatmaps from frames (t − n) and (t + n) warped to frame t
using tracks from optical flow (green & blue lines) can help
refine the wrongly estimated part location
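The warping step can be sketched with nearest-neighbour backward warping and a plain average. Both choices are simplifying assumptions: integer flow vectors replace real dense optical flow, and the uniform average stands in for whatever pooling the method actually uses. The 1 × 4 heatmaps are made up.

```python
# Warp a neighbouring frame's heatmap to frame t and pool it with frame t's own.

def warp(heatmap, flow):
    """flow[y][x] = (dx, dy): where pixel (x, y) of frame t appears in the
    neighbouring frame (integer offsets assumed for simplicity)."""
    h, w = len(heatmap), len(heatmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = heatmap[sy][sx]
    return out


def pool(heatmaps):
    """Uniform average of aligned heatmaps (assumed pooling rule)."""
    n = len(heatmaps)
    h, w = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(hm[y][x] for hm in heatmaps) / n for x in range(w)] for y in range(h)]


heat_t = [[0.0, 0.6, 0.5, 0.0]]       # frame t: peak at x=1 (wrong), true joint at x=2
heat_next = [[0.0, 0.0, 0.0, 1.0]]    # frame t+n: joint clearly at x=3
flow = [[(1, 0)] * 4]                 # everything moved one pixel to the right
warped = warp(heat_next, flow)
pooled = pool([heat_t, warped])
print(warped, pooled)
```

After pooling, the maximum moves from x=1 to x=2, which mirrors the slide's point that neighbouring frames can correct a wrongly estimated part location.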
Results
Ongoing Project
• End-to-end pose estimation
– Joint learning of pose features and pose configurations
– Allow local appearance to be fine-tuned by pose configuration
[Figure labels: unary response; pairwise relationships]
Ongoing Project
Add constraints between body parts in a network
[Figure: network diagram. Labels: pairwise relationships; distance transform; unary response. Symbols: $x_{t-1}, x_t, x_{t+1}$; $z_{t-1}, z_t, z_{t+1}$; $w_{dt}$; $w_m$; parts $p-1$, $p-2$, $p-3$]
Preliminary Results (PCP on LSP)
Head  Torso  U.arms  L.arms  U.legs  L.legs  Mean
84.7  91     68.7    53.6    80.7    73.3    72.82
• Future work
– Pose relational graph learning
– Multi-task learning
  • Human detection
  • Human segmentation
– Combining global information
Recent developments
• DeepPose: Human Pose Estimation via Deep Neural Networks
– A. Toshev, C. Szegedy. CVPR, 2014
• Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation
– J. J. Tompson, A. Jain, Y. LeCun, C. Bregler. NIPS, 2014
• Human Pose Estimation with Iterative Error Feedback
– J. Carreira et al. arXiv preprint arXiv:1507.06550, 2015
• Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation
– S. Li, W. Zhang, A. B. Chan. arXiv preprint arXiv:1508.06708, 2015
• Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network
– S. Li, Z. Q. Liu, A. B. Chan. CVPR Workshop, 2014
• Flowing ConvNets for Human Pose Estimation in Videos
– T. Pfister, J. Charles, A. Zisserman. ICCV, 2015
• R-CNNs for Pose Estimation and Action Detection
– G. Gkioxari, B. Hariharan, R. Girshick, J. Malik. arXiv preprint arXiv:1406.5212, 2014
• MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
– A. Jain, J. Tompson, Y. LeCun, C. Bregler. ACCV, 2014
• Efficient Object Localization Using Convolutional Networks
– J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler. CVPR, 2015
• Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation
– X. Fan, K. Zheng, Y. Lin, S. Wang. CVPR, 2015
• Parsing Occluded People by Flexible Compositions
– X. Chen, A. L. Yuille. CVPR, 2015
• Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
– X. Chen, A. L. Yuille. NIPS, 2014
• …
Thank you
Evaluation Metrics
• Percentage of Correct Parts (PCP)
– measures the percentage of correctly localized body parts.
– A candidate body part is treated as correct if both of its segment endpoints lie within 50% of the ground-truth segment length from their annotated locations.
• Percentage of Detected Joints (PDJ)
– measures performance as a curve: the percentage of correctly localized joints as the localization precision threshold varies. The threshold is normalized by the scale, defined as the distance between the left shoulder and the right hip
– invariant to scale
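Both metrics can be written down in a few lines. The sketch below follows the definitions above for a single part and a single image; the data layout, the function names, and the example coordinates are assumptions, and published evaluations differ in details (for example, some PCP variants average the two endpoint errors instead of testing each endpoint).

```python
import math

# Minimal PCP / PDJ checks following the definitions above (illustrative data).

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def part_is_correct(pred, truth, alpha=0.5):
    """PCP test for one part. pred, truth: ((x0, y0), (x1, y1)) segment endpoints."""
    limit = alpha * dist(truth[0], truth[1])
    return dist(pred[0], truth[0]) <= limit and dist(pred[1], truth[1]) <= limit


def pdj(pred_joints, true_joints, left_shoulder, right_hip, fraction):
    """Share of joints within fraction * scale of the ground truth, where
    scale is the left-shoulder to right-hip distance."""
    limit = fraction * dist(left_shoulder, right_hip)
    hits = sum(dist(p, t) <= limit for p, t in zip(pred_joints, true_joints))
    return hits / len(true_joints)


truth = ((0, 0), (0, 10))
print(part_is_correct(((3, 0), (0, 14)), truth))   # endpoint errors 3 and 4, limit 5
print(part_is_correct(((6, 0), (0, 10)), truth))   # first endpoint error 6 exceeds 5

pred_joints = [(0, 0.5), (10, 10)]
true_joints = [(0, 0), (10, 12)]
# Sweeping the fraction traces out the PDJ curve.
curve = [pdj(pred_joints, true_joints, (0, 0), (3, 4), f) for f in (0.2, 0.5)]
print(curve)
```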

Deep Visual Saliency - Kevin McGuinness - UPC Barcelona 2017
 
CERTH/CEA LIST at MediaEval Placing Task 2015
CERTH/CEA LIST at MediaEval Placing Task 2015CERTH/CEA LIST at MediaEval Placing Task 2015
CERTH/CEA LIST at MediaEval Placing Task 2015
 
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...
 

Recently uploaded

GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...HafsaHussainp
 
linear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annovalinear Regression, multiple Regression and Annova
linear Regression, multiple Regression and AnnovaMansi Rastogi
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
Explainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosExplainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosZachary Labe
 
Oxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxOxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxfarhanvvdk
 
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书zdzoqco
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxzaydmeerab121
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxGiDMOh
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPRPirithiRaju
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPirithiRaju
 
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Christina Parmionova
 
Immunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptImmunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptAmirRaziq1
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGiovaniTrinidad
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clonechaudhary charan shingh university
 
Quarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsQuarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsCharlene Llagas
 
How we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxHow we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxJosielynTars
 
The Sensory Organs, Anatomy and Function
The Sensory Organs, Anatomy and FunctionThe Sensory Organs, Anatomy and Function
The Sensory Organs, Anatomy and FunctionJadeNovelo1
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests GlycosidesNandakishor Bhaurao Deshmukh
 

Recently uploaded (20)

GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
 
linear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annovalinear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annova
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
Explainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosExplainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenarios
 
Oxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxOxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptx
 
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptx
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptx
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPR
 
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
 
Immunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptImmunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.ppt
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptx
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clone
 
Quarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsQuarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and Functions
 
How we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxHow we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptx
 
The Sensory Organs, Anatomy and Function
The Sensory Organs, Anatomy and FunctionThe Sensory Organs, Anatomy and Function
The Sensory Organs, Anatomy and Function
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
 

Human Pose Estimation by Deep Learning

  • 1. Human Pose Estimation by Deep Learning Wei Yang Supervisor: Prof. WANG Xiaogang, Prof. OUYANG Wanli IVP Lab, CUHK September 11, 2015
  • 2. Outline • Introduction • Traditional Approaches • Deep Learning Methods – Global view (holistic view) – Local appearance – Combination of local appearance and global view – Others
  • 3. Introduction • What is articulated body pose estimation? “recovers the pose of an articulated body, which consists of joints and rigid parts using image-based observations.”
  • 4. Applications • Action recognition • Clothing parsing • Gaming • Human tracking
  • 6. Traditional Approaches • Pictorial Structure (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005) – Unary templates – Pairwise springs • Mixtures of “mini-parts” (Yang & Ramanan 2011) – Mixture of part $i$ – Unary template for part $i$ with mixture $m_i$ – Pairwise springs between part $i$ with mixture $m_i$ and part $j$ with mixture $m_j$ • Figure labels: head, torso, leg. Example of mini-parts: near-vertical and near-horizontal limbs
  • 7. Deep Learning for Pose Estimation • Holistic View – e.g., joints position regression • Local View – e.g., body parts detection • Combining global and local information – e.g., body parts detection + joints position regression • Others – e.g., motion features, pose estimation in videos
  • 8. Holistic View • DeepPose: Human Pose Estimation via Deep Neural Networks
  • 9. Holistic Reasoning • Why holistic reasoning? – Besides extreme variability in articulations, many of the joints are barely visible
  • 10. DeepPose: A CNN Regressor [Toshev & Szegedy, CVPR 2014] • Network architecture: AlexNet – Krizhevsky, Sutskever, and Hinton, NIPS 2012 (ImageNet) – The first time a deep model was shown to be effective at large scale
  • 11. Results on LSP (Leeds Sports Pose) dataset
  • 12. Cascade of Pose Regressors • The pose estimation results are very coarse: – due to its fixed input size of 220 × 220, the network has limited capacity to look at detail – Train a cascade of pose regressors for more precise joint localization
  • 13. Cascade of Pose Regressors
  • 15. Percentage of Correct Parts (PCP) on LSP dataset
  • 16. Local Appearance Method • Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
  • 17. Motivation [Chen & Yuille, NIPS 2014] • Local image patches are able to capture: – Part presence – Pairwise part spatial relationships • Number of mixture types for each pair: 6 – 1 neighbor: number of relationships $6^1 = 6$ – 2 neighbors: number of relationships $6^2 = 36$ • Figure labels: lower arm, upper arm
  • 18. Tree-structured Relational Graph • $T = (V, E)$ – $V$: body parts – $E$: pairwise relationships between parts • $\mathbf{p} = \{p_i\}$, $p_i = (x_i, y_i)$ – $p_i$: pixel location of part $i$ • $\mathbf{t} = \{t_{ij}, t_{ji} \mid (i, j) \in E\}$ – Pairwise relationship – Defined by relative position – $t_{ij} \in \{1, \dots, T_{ij}\}$ – In experiments: 13 types for each pair $(i, j) \in E$
  • 19. Formulation • Score function: $F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta) = \sum_{i \in V} A_i(p_i \mid I; \theta) + \sum_{(i,j) \in E} R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta)$ – $A_i$: part presence, weighted by $\omega_i$ – $R$: pairwise relationship, weighted by $\omega_{ij}$, plus pairwise deformation, weighted by $\boldsymbol{\omega}_{ij}^{t_{ij}}$ • Inference: $(\mathbf{p}^*, \mathbf{t}^*) = \arg\max_{\mathbf{p}, \mathbf{t}} F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta)$ – Tree structure – Can be solved efficiently by dynamic programming • $\omega_i$, $\omega_{ij}$, $\boldsymbol{\omega}_{ij}^{t_{ij}}$ are learned by latent structural SVM
  • 20. Learning DCNN parameters $\theta$ • Derive the type label for each patch – Use the relative position $d_{ij}$ to represent the pairwise relations – Cluster the relative positions over the whole training set $\{d_{ij}^n\}_{n=1}^N$ – Type label $t_{ij}^n$: cluster index – Mean relative position $r_{ij}^{t_{ij}}$: cluster center
  • 21. Casting Full Connections into Convolutions • Figure labels: elbow, part presence map, pairwise relationship map
  • 22. PCP and PDJ on LSP dataset and FLIC dataset
      PCP on LSP:
      Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean
      DCNN            92.5  85.1   82.7   76.3   70.2   55.9  74.8
      Ouyang et al.   85.8  83.1   76.5   72.2   63.3   46.6  68.6
      Figure panels: LSP, FLIC
  • 23. Combining Local Appearance and Holistic View • Dual-Source Deep Neural Networks for Human Pose Estimation
  • 24. Dual-Source CNN • Integrate both the local part appearance and the holistic view of each local part for more accurate human pose estimation • Each input is an image pair – Part patches – Body patches
  • 25. Part patches: incorporate local appearance • Generated by region proposals with some restrictions – Not too small (must contain at least one body part) – Not too big (may contain too many body parts and lack sufficient resolution) • All classes of joints are covered by a similar number of part patches • During testing, part patches are selected from multi-scale sliding windows
  • 26. Body patches: holistic view • Also from region proposals – Must cover all body parts – In the testing stage, the body patch can be generated by human detection • For DS-CNN, each training sample is made up of 3 components – A part patch – A body patch – A binary mask specifying the location of the part patch in the body patch
  • 27. Training of the DS-CNN • Shared weights • Classification (softmax) • Regression (L2 distance)
  • 28. Testing • Part heat map – Same size as the input image – Uniformly distributed probability within each sliding window – Sum and average over all pixels
  • 29. Testing • Final pose estimation – Weighted average of predicted joint locations within part patches with high responses.
  • 30. Results: PCP on LSP
  • 31. Other Methods & Applications • MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation • Flowing ConvNets for Human Pose Estimation in Videos
  • 32. Using Motion Features for Human Pose Estimation • Motion is a powerful visual cue that alone can be used to extract high-level information, including articulated pose. • Image credit: T. Brox, J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE TPAMI, 33(3): 500-513, 2011
  • 33. MoDeep: Using Motion Features for Human Pose Estimation • Extended Frames Labeled In Cinema (FLIC) dataset with additional motion features • Figure labels: average of frame pair, optical flow • Source: MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. Jain et al., ACCV 2014
  • 34. Multi-resolution efficient sliding window model
  • 35. Simple Spatial Model • FLIC: multiple people with only one annotated person • Testing: incorporate annotated torso position with simple spatial model • Figure labels: predicted left shoulder, spatial mask of left shoulder, result
  • 36. Experiment results • Figure rows: without motion feature, with motion feature • Figure labels: occlusion, cluttered background, motion blur
  • 37. Flowing ConvNets for Human Pose Estimation in Videos • A CNN can benefit from temporal context by combining information across multiple frames using optical flow.
  • 38. Spatial ConvNet • Why regress heatmaps instead of joint coordinates? – The network output can be multi-modal – Regressing coordinates directly is a highly non-linear mapping that is more difficult to learn
  • 39. Warping neighbouring heatmaps for improving pose estimates • Heatmaps from frames (t − n) and (t + n) warped to frame t using tracks from optical flow (green & blue lines) can help refine the wrongly estimated part location
  • 41. Ongoing Project • End-to-end pose estimation – Joint learning of pose features and pose configurations – Allow local appearance to be fine-tuned by pose configuration • Figure labels: unary response, pairwise relationships
  • 42. Ongoing Project • Add constraints between body parts in a network • Figure labels: unary response, distance transform, pairwise relationships; symbols $x_t$, $z_t$, $w_{dt}$, $w_m$
  • 43. Preliminary Results (PCP on LSP)
      Head  Torso  U.arms  L.arms  U.legs  L.legs  Mean
      84.7  91     68.7    53.6    80.7    73.3    72.82
      • Future work – Pose relational graph learning – Multi-task learning (human detection, human segmentation) – Combining global information
  • 44. Recent developments
      – A. Toshev, C. Szegedy. DeepPose: Human Pose Estimation via Deep Neural Networks. CVPR, 2014.
      – J. J. Tompson, A. Jain, Y. LeCun, C. Bregler. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation. NIPS, 2014.
      – J. Carreira et al. Human Pose Estimation with Iterative Error Feedback. arXiv preprint arXiv:1507.06550, 2015.
      – S. Li, W. Zhang, A. B. Chan. Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation. arXiv preprint arXiv:1508.06708, 2015.
      – S. Li, Z. Q. Liu, A. B. Chan. Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network. CVPR Workshop, 2014.
      – T. Pfister, J. Charles, A. Zisserman. Flowing ConvNets for Human Pose Estimation in Videos. ICCV, 2015.
      – G. Gkioxari, B. Hariharan, R. Girshick, J. Malik. R-CNNs for Pose Estimation and Action Detection. arXiv preprint arXiv:1406.5212, 2014.
      – A. Jain, J. Tompson, Y. LeCun, C. Bregler. MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. ACCV, 2014.
      – J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler. Efficient Object Localization Using Convolutional Networks. CVPR, 2015.
      – X. Fan, K. Zheng, Y. Lin, S. Wang. Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation. CVPR, 2015.
      – X. Chen, A. L. Yuille. Parsing Occluded People by Flexible Compositions. CVPR, 2015.
      – X. Chen, A. L. Yuille. Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations. NIPS, 2014.
      – …
  • 45. Thank you Human Pose Estimation by Deep Learning Wei Yang IVP Lab, CUHK September 11, 2015
  • 46. Evaluation Metrics • Percentage of Correct Parts (PCP) – Measures the percentage of correctly localized body parts. – A candidate body part is treated as correct if its segment endpoints lie within 50% of the ground-truth segment length from the annotated endpoints. • Percentage of Detected Joints (PDJ) – Measures performance as a curve of the percentage of correctly localized joints over a varying localization precision threshold, which is normalized by a scale defined as the distance between the left shoulder and the right hip – Invariant to scale
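
The two metrics on the last slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used for the reported numbers: the function names `pcp` and `pdj`, the limb and joint data layout, and the toy coordinates are all hypothetical, and the PCP variant shown is the strict one in which both endpoints must pass the test.

```python
import math


def pcp(pred_limbs, gt_limbs, alpha=0.5):
    """Strict PCP: a limb counts as correct when BOTH predicted endpoints lie
    within alpha * (ground-truth limb length) of their annotated endpoints.
    Each limb is a pair of (x, y) endpoints (assumed layout)."""
    correct = 0
    for (pred_a, pred_b), (gt_a, gt_b) in zip(pred_limbs, gt_limbs):
        limit = alpha * math.dist(gt_a, gt_b)
        if math.dist(pred_a, gt_a) <= limit and math.dist(pred_b, gt_b) <= limit:
            correct += 1
    return correct / len(gt_limbs)


def pdj(pred_joints, gt_joints, left_shoulder, right_hip, thresholds):
    """PDJ curve: for each threshold, the fraction of joints whose error,
    normalized by the torso diameter (left shoulder to right hip), is at
    most that threshold."""
    torso = math.dist(left_shoulder, right_hip)
    errors = [math.dist(p, g) / torso for p, g in zip(pred_joints, gt_joints)]
    return [sum(e <= t for e in errors) / len(errors) for t in thresholds]


# Toy example (made-up coordinates): two limbs of length 10 each.
gt_limbs = [((0, 0), (0, 10)), ((0, 10), (0, 20))]
pred_limbs = [((0, 1), (0, 9)), ((0, 10), (6, 20))]
print(pcp(pred_limbs, gt_limbs))  # first limb passes, second misses one endpoint

gt_joints = [(0, 0), (10, 0)]
pred_joints = [(1, 0), (10, 4)]
print(pdj(pred_joints, gt_joints, (0, 0), (0, 10), [0.05, 0.2, 0.5]))
```

Because PDJ divides every error by the torso diameter, rescaling the image and the annotations together leaves the curve unchanged, which is the scale invariance the slide refers to.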

Editor's Notes

  1. Good afternoon everyone. Welcome to the first IVP seminar of this term. I’m YANG Wei. In the last two seminars, Xingyu and Chu Xiao gave us a comprehensive overview of object detection as well as traditional human pose estimation approaches. In this talk, I will continue the discussion with recent developments in human pose estimation based on powerful deep learning methods. I hope you can benefit from these methods.
  2. First, we will briefly review the problem of human pose estimation. We will also go over the traditional approaches for pose estimation, which were discussed in the seminar given by Chu Xiao. Then we will spend most of the time discussing several important approaches based on deep learning techniques, from both the global view and the local view.
  3. According to Wikipedia, the goal of articulated pose estimation is to recover the joint positions of articulated limbs, as we show here for a man playing baseball.
  4. There are lots of applications where being able to estimate human pose is useful. For example, pose estimation is helpful for recognizing actions. It also helps to parse clothing in fashion photographs. Recently, pose estimation has been successfully applied in human tracking and gaming systems.
  5. However, in unconstrained images, human pose estimation can be a very hard problem because people can appear with a variety of poses, clothing, and body shapes. In the slides, you can see some very interesting and unusual examples that demonstrate how flexible the human pose is.
  6. Traditional approaches for human pose estimation model the human as a set of parts, such as a head, torso, arm, and leg part. In 3D, these parts can be modeled as cylinders. Pictorial structures use 2D part models, where geometric relations between parts are encoded by springs. However, capturing the whole range of appearances using pictorial structures is still quite difficult. A big problem is that even projections of a simple cylinder into 2D yield many different appearances. So one usually has to explicitly evaluate many different possible in-plane orientations and foreshortenings in order to find a good match for a part template. Yang and Ramanan propose mini-parts to approximate these transformations. In this case the mini-parts are tuned to represent near-vertical and near-horizontal limbs.
  7. With the fast development of deep learning, several pose estimation methods based on deep learning techniques have been proposed in the last two years. Some are based on the holistic view (global view), e.g., directly regressing body joint locations. Some are based on local appearance. Some combine the global view and the local view in a unified framework and achieve state-of-the-art results. Finally, we will also discuss some pioneering works on pose estimation in videos.
  8. For example, in the left image, we can guess the location of the right arm only because we see the rest of the pose and anticipate the activity of the person. Similarly, in the right image, the left half of the person's body is not visible at all. Since deep neural networks (DNNs) can model very complex relationships, the authors believe that a DNN can provide holistic reasoning.
  9. The initial stage of DeepPose is quite straightforward. It trains a DNN to regress the locations of all the body joints given an input image. DeepPose adopts AlexNet as the basic network structure. This structure was proposed in 2012. It won the ImageNet competition by a large margin, and it was the first time that a deep model was shown to be effective on a large-scale computer vision task.
  10. These are the visualized results on the LSP dataset. We can see that this method has limitations in high-precision regions, such as lower arms and lower legs. It is worth mentioning that this method is very fast, since predictions can be obtained by batch forward propagation.
  11. The pose estimation results from the initial stage are very coarse, especially in high-precision regions. One possible reason is that the input size is fixed at 220 by 220, so the network has limited capacity to look at details. To refine the coarse regression results, the authors further train a cascade of pose regressors for more precise joint localization.
  12. Given the predicted joint locations from the last stage, we first crop image patches centered at the predicted locations, and then train a DNN-based regressor to refine the respective locations. This process can be repeated several times. It helps to refine the coarse predictions because the network can see higher-resolution regions.
  13. The ground truths are in green and the predicted poses are in red. We can see that the initial stage is usually successful at estimating a roughly correct pose. However, the results are not precise enough. After one stage of refinement, the results are much more accurate.
  14. We observe that local image patches are not only able to capture part presence, but also able to reason about pairwise spatial relationships. For example, the patch centered at the wrist can predict the relative position of the elbow; the patch centered at the elbow can reliably predict the positions of the shoulder and the wrist. We use a mixture model to define different types of spatial relationships. The right panel shows typical spatial relationships the wrist can have with its neighbor, the elbow. The left panel shows the typical spatial relationships the elbow can have with its two neighbors, the shoulder and the wrist.
  15. Based on this observation, we can define human pose as a tree-structured graph, where each node denotes the position of a part and the edges denote the pairwise spatial relationships.
  16. We define the score function of part locations p and pairwise relation types t. It is computed by summing the unary appearance term and the pairwise relationship term. The unary term is the part presence map, indicating the probability that part i appears at each location of the image. The pairwise term consists of two parts. The first part is the pairwise relationship map, and the second part is the deformation cost. Theta denotes the parameters learned by the CNN. Inference finds the positions and mixture types that maximize this score. As the relational graph is tree-structured, it can be solved efficiently by dynamic programming.
  17. Here we talk about how to learn theta. Given an image, we want to produce a score map indicating the probability of each specific type. This is done by learning a multi-class classifier on local image patches. First we need to derive the type label for each patch.
  18. Then we use two convolutional layers with 1 by 1 kernels to replace the original fully connected layers. The network then becomes a fully convolutional network that can perform convolutions on an input image of arbitrary size, and the output is the score map for each type, as we want. We can then easily compute the part presence map and the pairwise relationship maps, as this figure illustrates. For example, to compute the part presence map of the elbow, we just add together all the score maps associated with elbow-to-shoulder and elbow-to-wrist. To compute the pairwise relationship maps, we need to perform marginalization.
  19. Here are
  20. As we discussed before, both global and local methods have merits and drawbacks for human estimation. Hence in this years CVPR, a paper combining both local appearance and holistic view is proposed.
  21. In this paper, the authors train a network by dual-sources. Which is to say that each input is an image pair. One image is the body patch, which incorporate local appearance information. One image is the full image, which incorporate the global context information. The authors hope that this combination would result in more accurate human pose estimation.
  22. The authors first use the objectiveness methods to propose a lot of category-independent object proposals, as shown in the boxes in the image. Then the part patches are selected by some restrictions. First the region cannot be too small, it must contain a whole body part. Second, the proposed region cannot be too big either. Because all patches will be warped to the same size as the input of the network, too large regions lacks sufficient resolution. Moreover, for efficient training, all classes of joints are covered by similar number of part patches During testing, part patches are selected from multi-scale sliding windows.
  23. Body patches are also selected from region proposals. The region must cover all the body parts. In testing stage, these regions can be generated by human detection. The binary mask is concatenated with the body patch as an additional alpha channel.
  24. During training, both part patch and body patch are fed into a two branch CNN. The local part branch is to predict the label of the part patch. This is a classification problem, and is trained by using softmax loss function. The global branch is to predict the x, y coordinate given the body patch and the corresponding part mask. This is a regression problem and is trained by using the Euclidean loss function. Note that the structures of the two branch are the same, hence the weights are shared except for the last layer.
  25. In test stage, a heap map is generated for each part. The heap map has the same size of the input image. First, the part patches are obtained by sliding window method. Then use the trained network to predict the probability of a each label for each part. The pixels within the patch have the same probability. Finally, sum and average over all pixels to get the final heat map.
  26. While the heat map provides a rough estimation of the joint location, it is insufficient to accurately localize the body joints. Remember that the global branch predicts the accurate joint location within a given patch. Hence for a specific part, we select part patches with high probability. And compute the weighted sum of the predicted joint locations to get the final joint location.
  27. Here are the PCP values on the LSP dataset. We can see that this method improves the performance by a large margin.
  28. OK. After discussing methods from the local and global views, let's discuss some applications of pose estimation in videos.
  29. We all know that motion is a powerful…. This figure illustrates optical flow. The left side is the average of two adjacent frames. The right side is the estimated optical flow. We can see that the background is greatly suppressed by the motion feature, which is a great help for pose estimation.
  30. Here, a method called MoDeep tries to incorporate motion features to improve human pose estimation. This method extends the FLIC dataset with additional motion features, as shown in the figure.
  31. It then trains a multi-resolution convolutional network to predict a heat map for each body part, with the additional motion features as input.
  32. FLIC is a dataset in which an image may contain multiple people, but only one person is annotated. At test time, the torso box can be used to determine whose pose should be estimated. This method computes a spatial mask for each part with respect to the torso box. This mask helps suppress false positives.
  33. Here are some experimental results. The first row shows the estimated poses without motion features, and the second row shows them with motion features. We can see that motion greatly improves the results under occlusion, cluttered backgrounds, and motion blur.
  34. A very similar work also uses optical flow to track human pose in videos. This work was published at this year's ICCV. It first uses a CNN to predict a heat map for each body part in each frame. Then, for the t-th frame, it computes the optical flow from frames t-n to t+n with respect to the t-th frame. The heat maps are warped to the t-th frame according to the optical flow. Finally, the authors use a 1×1 convolutional layer to combine all the heat maps.
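The warp-and-combine idea can be sketched as below for one body part. The flow convention (each pixel of frame t stores the displacement `(dx, dy)` to its match in the neighboring frame), the nearest-neighbor lookup, and the hand-set weights are all assumptions for illustration. In the actual method the 1×1 convolution weights are learned.

```python
def warp_heat_map(heat, flow):
    """Warp a neighbor frame's heat map onto frame t (nearest neighbor).

    flow[y][x] = (dx, dy): assumed displacement from pixel (x, y) in
    frame t to its corresponding pixel in the neighbor frame.
    """
    h, w = len(heat), len(heat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = int(round(x + dx)), int(round(y + dy))
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = heat[sy][sx]
    return out

def combine_1x1(warped_maps, weights, bias=0.0):
    """A 1x1 convolution over frames: a per-pixel weighted sum of the maps."""
    h, w = len(warped_maps[0]), len(warped_maps[0][0])
    return [[bias + sum(wt * m[y][x] for wt, m in zip(weights, warped_maps))
             for x in range(w)] for y in range(h)]
```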
  35. Here is an illustration of the network producing part heat maps. The authors discussed why….
  36. Finally, I want to give a brief introduction to my ongoing project. As we discussed before, most pose estimation frameworks are not end-to-end. They often learn pose features first, and then fix the features to optimize a relationship model. In my work, I design an end-to-end pose estimation framework. It can be viewed as a feature extraction part plus a deformable part model. However, we plug the deformable part model into the network, so the parameters of both parts can be learned jointly.
  37. Here is an illustration of the deformable part model. Since the relation graphs of human pose often have a tree structure, we can use message passing for efficient inference. This is very similar to a traditional recurrent neural network, where each time step denotes a part. Messages are passed from the leaves to the root. The deformation weights are shared across different parts. We can use backpropagation to learn the deformation weights jointly with the weights of the convolutional and fully connected layers.
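As a toy illustration of leaf-to-root message passing on a tree, the sketch below runs max-sum inference with one shared deformation weight. The 1-D part positions, the quadratic deformation cost, and all names are assumptions chosen to keep the example small. This is not the actual model in the project.

```python
def root_scores(unary, children, root, deform_w):
    """Max-sum message passing from the leaves to the root of a part tree.

    unary[part][pos]: appearance score of a part at a 1-D position.
    children[part]: list of child parts.
    deform_w: single deformation weight shared by all parent-child pairs.
    """
    def belief(part):
        # A part's own score plus the messages from all of its children.
        total = list(unary[part])
        for child in children.get(part, []):
            msg = message(child)
            total = [t + m for t, m in zip(total, msg)]
        return total

    def message(child):
        # For each parent position p, the best child placement c,
        # penalized by the squared displacement between the two.
        b = belief(child)
        n = len(b)
        return [max(b[c] - deform_w * (c - p) ** 2 for c in range(n))
                for p in range(n)]

    return belief(root)
```

Because every step is a sum or a max, gradients can flow through it, which is what allows the deformation weight to be trained together with the feature layers.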
  38. Preliminary experiments show that the proposed method outperforms most of the traditional approaches. However, it is still not better than recent deep learning methods. In the future, we plan to learn the pose relational graph from the dataset. Meanwhile, pose estimation may benefit from related tasks such as human detection and human segmentation. Finally, we need to figure out how to incorporate global information into this framework.