Object Pose Estimation from
RGB Images by Deep Learning
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline (obsolete)
• PoseCNN: A CNN for 6D Object Pose Estimation in Cluttered Scenes
• BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of
Challenging Objects without Using Depth
• SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again
• Real-Time Seamless Single Shot 6D Object Pose Prediction
• Implicit 3D Orientation Learning for 6D Object Detection from RGB Images
• Vehicle Detection and Pose Estimation for Autonomous Driving (Thesis)
• A mixed classification-regression framework for pose estimation from images
• Classification and Pose Estimation of Vehicles in Videos by 3D Modeling within Discrete-
Continuous Optimization
• Improved Object Detection and Pose Using Part-Based Models
• Object Detection and Viewpoint Estimation with a Deformable 3D Model
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
• Estimating the 6D pose of known objects is important for robots to interact with the real world.
• The problem is challenging due to the variety of objects as well as the complexity of a scene
caused by clutter and occlusions between objects.
• This work introduces PoseCNN for 6D pose estimation, which estimates the 3D translation of
an object by localizing its center in the image and predicting its distance from the camera.
• The 3D rotation of the object is estimated by regressing to a quaternion representation.
• It also introduces a loss function that enables PoseCNN to handle symmetric objects.
• It builds a large scale video dataset for 6D object pose estimation, called YCB-Video dataset.
• This dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92
videos with 133,827 frames.
• The code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
The PoseCNN is trained to perform three tasks: semantic labeling, 3D translation
estimation, and 3D rotation regression.
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
• The PoseCNN network contains two stages.
• The 1st stage consists of 13 conv. layers and 4 max-pooling layers, which extract feature
maps at different resolutions from the input image. This stage is the network backbone,
since the extracted features are shared across all the tasks performed by the network.
• The 2nd stage consists of an embedding step that embeds the high-dimensional feature
maps generated by the first stage into low-dimensional, task-specific features. Then the
network performs 3 different tasks that lead to the 6D pose estimation, i.e., semantic
labeling (a variation of FCN), 3D translation estimation, and 3D rotation regression.
• It estimates the 3D translation by localizing the 2D object center in the image and estimating
the object's distance from the camera.
• The network regresses a center direction for each pixel in the image, and a Hough voting
layer finds the 2D center of each object (see the voting sketch after this list).
• Using the object bounding boxes predicted from the Hough voting layer, it utilizes two RoI
pooling layers to “crop and pool” the visual features generated by the 1st stage of the network
for the 3D rotation regression.
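To make the voting step concrete, here is a minimal NumPy sketch of pixel-wise center voting in the spirit of PoseCNN's Hough voting layer; the function name, the ray-marching parameters, and the simple argmax readout are illustrative assumptions, not the actual layer.

```python
import numpy as np

def hough_vote_center(directions, mask, step=1.0, n_steps=300):
    """Sketch of pixel-wise center voting (not the actual PoseCNN Hough layer).

    directions: (H, W, 2) unit vectors predicted at each pixel, pointing toward the object center.
    mask: (H, W) boolean semantic mask for one object class.
    Returns the (x, y) image location receiving the most votes.
    """
    H, W, _ = directions.shape
    votes = np.zeros((H, W), dtype=np.int32)
    ys, xs = np.nonzero(mask)
    for x, y, (dx, dy) in zip(xs, ys, directions[ys, xs]):
        # March along the predicted ray and vote at every pixel it crosses.
        for t in np.arange(0.0, n_steps * step, step):
            vx, vy = int(round(x + t * dx)), int(round(y + t * dy))
            if 0 <= vx < W and 0 <= vy < H:
                votes[vy, vx] += 1
    cy, cx = np.unravel_index(np.argmax(votes), votes.shape)
    return cx, cy
```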
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
Architecture of PoseCNN
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
The 3D translation can be estimated by localizing the
2D center of the object and estimating the 3D center
distance from the camera.
Each pixel casts votes for 2-D image
locations along the ray predicted
from the network.
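Given the voted 2D center and a predicted distance, the 3D translation follows from standard pinhole back-projection. A small sketch, assuming the distance output is the depth Tz along the optical axis (as in PoseCNN):

```python
import numpy as np

def translation_from_center(center_px, Tz, K):
    """Back-project the voted 2D object center to the 3D translation.

    center_px: (cx, cy) center in pixels; Tz: predicted depth along the optical axis;
    K: 3x3 camera intrinsic matrix. T = Tz * K^-1 [cx, cy, 1]^T.
    """
    cx, cy = center_px
    ray = np.linalg.inv(K) @ np.array([cx, cy, 1.0])  # viewing ray through the center (z-component = 1)
    return Tz * ray
```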
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
• This paper introduces a method for 3D object detection and pose estimation from color
images only.
• It first uses segmentation to detect the objects of interest in 2D even in presence of partial
occlusions and cluttered background.
• In contrast with recent patch-based methods, it relies on a “holistic” approach: it applies to
the detected objects a CNN trained to predict their 3D poses in the form of 2D projections
of the corners of their 3D bounding boxes (see the PnP sketch after this list).
• This, however, is not sufficient for handling objects from the recent T-LESS dataset: These
objects exhibit an axis of rotational symmetry, and the similarity of two images of such an
object under two different poses makes training the CNN challenging.
• It solves this problem by restricting the range of poses used for training, and by introducing
a classifier to identify the range of a pose at run-time before estimating it.
• It also uses an optional additional step that refines the predicted poses.
• The full approach is also scalable, as a single network can be trained for multiple objects
simultaneously.
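The corner-projection representation reduces pose recovery to a classical PnP problem. A minimal sketch using OpenCV's solver; the EPnP flag and float32 casts are implementation choices, not taken from the paper.

```python
import numpy as np
import cv2

def pose_from_bbox_corners(corners_2d, corners_3d, K):
    """Recover a 6D pose from the predicted 2D projections of the 3D bounding-box corners.

    corners_2d: (8, 2) CNN-predicted image projections; corners_3d: (8, 3) bounding-box corners
    in the object frame; K: 3x3 intrinsics. BB8 solves exactly this kind of 2D-3D
    correspondence problem with a PnP algorithm.
    """
    ok, rvec, tvec = cv2.solvePnP(corners_3d.astype(np.float32),
                                  corners_2d.astype(np.float32),
                                  K.astype(np.float32), None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix and translation vector of the object
    return R, tvec
```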
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
Localization: (a) The input image is resized to 512 × 384 and split into regions of size 128 × 128. (b)
Each region is first segmented into a binary mask of 8 × 8 for each possible object o. (c) Only the
largest component is kept if several components are present; the active locations are then segmented
more finely. (d) The centroid of the final segmentation is used as the 2D object center.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
• It first finds the objects in 2D, then obtains a first estimate of the 3D poses, including for
objects with a rotational symmetry, and finally refines the initial pose estimates.
• It identifies the 2D centers of the objects of interest in the input images.
• It could use a standard 2D object detector, but it develops an approach based on
segmentation that resulted in better performance as it can provide accurate locations even
under partial occlusions.
• It predicts the 3D pose of an object by applying a Deep Network to an image window
centered on the 2D object center.
• As for the segmentation, it uses VGG as a basis for this network, which allows it to handle all
the objects of the target dataset with a single network.
• For an object with an angle of symmetry α, it can therefore restrict the poses used for
training to the poses where the angle of rotation around the symmetry axis is within the
range [0; α], to avoid the ambiguity between images.
• Denote by β the rotation angle, and introduce the intervals r1 = [0, α/2) and r2 = [α/2, α).
• To avoid ambiguity, restrict β to be in r1 for the training images used in the optimization.
• It introduces a CNN classifier to predict at run-time whether β is in r1 or r2 (see the sketch after this list).
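A sketch of the run-time handling of a rotationally symmetric object with α = 180°, assuming hypothetical callables `range_classifier` (returns True when β is in r2) and `corner_net` (trained only on poses with β in r1); this mirrors the image/projection flipping described above rather than the authors' exact implementation.

```python
import numpy as np

def predict_corners_with_symmetry(img, corner_net, range_classifier):
    """Predict 2D bounding-box corner projections for an object with 180-degree symmetry."""
    W = img.shape[1]
    in_second_range = range_classifier(img)             # is the rotation angle in r2?
    inp = img[:, ::-1] if in_second_range else img      # mirror so the pose falls in the trained range r1
    corners = np.asarray(corner_net(inp), dtype=np.float64)  # (8, 2) predicted projections
    if in_second_range:
        corners[:, 0] = (W - 1) - corners[:, 0]          # mirror the predicted projections back
    return corners
```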
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
Objects with symmetry of rotation: object #5 of T-LESS has an angle of symmetry α of 180◦, ignoring
the small screw and electrical contact. If restricting the range of poses in the training set
between 0◦ (a) and 180◦ (b), pose estimation still fails for test samples with an angle of rotation close
to 0◦ modulo 180◦ (c). The solution is to restrict the range during training to be between 0◦ and 90◦. It
uses a classifier to detect if the pose in an input image is between 90◦ and 180◦. If this is the case (d),
it mirrors the input image (e), and mirrors back the predicted projections for the corners (f).
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
Refining the pose. Given a first pose estimate,
shown by the blue bounding box (a), it generates a
binary mask (b) or a color rendering (c) of the object.
Given the input image and this mask or rendering, it
can predict an update that improves the object pose,
shown by the red bounding box (d).
Two generated training images for different
objects from the LINEMOD dataset. The object
is shifted from the center to handle the
inaccuracy of the detection method, and the
background is random to make sure that the
network cannot exploit the context specific to
the dataset.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
First row: LINEMOD dataset; Second row:
Occlusion dataset; Third row: T-LESS dataset
(for objects of revolution, we represent the
pose with a cylinder rather than a box); Last
row: Some failure cases.
Outline
• PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation (2018.12)
• SilhoNet: An RGB Method for 6D Object Pose Estimation (2019.6)
• Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation (2019.8)
• DeepIM: Deep Iterative Matching for 6D Pose Estimation (2019.10)
• Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction (2019.10)
• CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF
Object Pose Estimation (ICCV, 2019)
• DPOD: 6D Pose Object Detector and Refiner (ICCV, 2019)
• ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation (2019.12)
• LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object
Pose Estimation (2019.12)
• HybridPose: 6D Object Pose Estimation under Hybrid Representations (2020.1)
• 6DoF Object Pose Estimation via Differentiable Proxy Voting Loss (2020.2)
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
• This paper addresses the challenge of 6DoF pose estimation from a single RGB image under
severe occlusion or truncation.
• Many recent works have shown that a two-stage approach, which first detects keypoints and
then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable
performance.
• However, most of these methods only localize a set of sparse keypoints by regressing their
image coordinates or heatmaps, which are sensitive to occlusion and truncation.
• It introduces a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to
the keypoints and use these vectors to vote for keypoint locations using RANSAC (see the voting sketch after this list).
• This creates a flexible representation for localizing occluded or truncated keypoints.
• Another important feature of this representation is that it provides uncertainties of keypoint
locations that can be further leveraged by the PnP solver.
• Experiments show that the proposed approach outperforms the state of the art on the
LINEMOD, Occlusion LINEMOD, and YCB-Video datasets by a large margin, while being
efficient for real-time pose estimation.
• The code will be available at https://zju-3dv.github.io/pvnet/.
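A simplified NumPy sketch of the RANSAC-style voting for a single keypoint: hypotheses come from intersecting pairs of pixel rays and are scored by how many pixels' predicted directions agree with them. The real PVNet additionally weights hypotheses and fits a spatial covariance for the PnP solver; the counts and threshold below are illustrative.

```python
import numpy as np

def ransac_keypoint(pixels, dirs, n_hyp=256, thresh=0.99):
    """Vote for one keypoint location from pixel-wise direction predictions.

    pixels: (N, 2) foreground pixel coordinates; dirs: (N, 2) predicted unit vectors pointing
    from each pixel toward the keypoint. Returns the best-supported hypothesis.
    """
    rng = np.random.default_rng(0)
    best_kp, best_votes = None, -1
    for _ in range(n_hyp):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect the rays p_i + t*d_i and p_j + s*d_j (2x2 linear system).
        A = np.stack([dirs[i], -dirs[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-6:
            continue
        t, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        kp = pixels[i] + t * dirs[i]
        # A pixel votes for the hypothesis if its direction agrees with the direction to kp.
        to_kp = kp - pixels
        to_kp /= (np.linalg.norm(to_kp, axis=1, keepdims=True) + 1e-9)
        votes = np.sum(np.sum(to_kp * dirs, axis=1) > thresh)
        if votes > best_votes:
            best_kp, best_votes = kp, votes
    return best_kp
```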
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
The 6D pose estimation problem is formulated as a Perspective-n-Point (PnP) problem, which
requires correspondences between 2D and 3D keypoints.
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
Overview of the keypoint localization: (a) An image. (b) The architecture of PVNet. (c) Pixel-wise unit vectors
pointing to the object keypoints. (d) Semantic labels. (e) Hypotheses of the keypoint locations generated by
voting. (f) Probability distributions of the keypoint locations estimated from hypotheses.
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
SilhoNet: An RGB Method for 6D Object Pose Estimation
• Autonomous robot manipulation involves estimating the translation and orientation of the
object to be manipulated as a 6-degree-of-freedom (6D) pose.
• Methods using RGB-D data have shown great success in solving this problem.
• However, there are situations where cost constraints or the working environment may limit
the use of RGB-D sensors.
• When limited to mono camera data only, the problem of pose estimation is very challenging.
• Knowing how the target object is occluded in the image is important for certain applications,
such as AR, where it is desirable to project over only the visible portion of an object.
• This work introduces SilhoNet, an RGB-based deep learning method that predicts the 6D object
pose from monocular images.
• It uses a CNN pipeline that takes in ROI proposals to predict an intermediate silhouette
representation for objects with an associated occlusion mask and a 3D translation vector.
• The 3D orientation is then regressed from the predicted silhouettes.
SilhoNet: An RGB Method for 6D Object Pose Estimation
The SilhoNet pipeline for
silhouette prediction and
6D object pose estimation
SilhoNet: An RGB Method for 6D Object Pose Estimation
• The 3D orientation is predicted from an intermediate un-occluded silhouette representation.
• The method also predicts an occlusion mask which can be used to determine which parts of
the object model are visible in the image.
• The method operates in two stages, first predicting an intermediate silhouette
representation and occlusion mask of an object along with a vector describing the 3D
translation and then regressing the 3D orientation quaternion from the predicted silhouette.
• The input to the network is an RGB image with ROI proposals for detected objects and the
associated class labels.
• The 1st stage uses a VGG16 backbone with deconvolution layers at the end to produce a
feature map from the RGB input image. (This is the same as used in PoseCNN)
• Extracted features from the input image are concatenated with features from a set of
rendered object viewpoints and then passed through 3 network branches, two of which have
identical structure to predict a full un-occluded silhouette and an occlusion mask.
• The 3rd branch predicts a 3D vector encoding the object center in pixel coordinates and the
range of the object center from the camera.
• The 2nd stage of the network passes the predicted silhouette through a ResNet-18
architecture with two FCLs at the end to output an L2-normalized quaternion, representing
the 3D orientation.
SilhoNet: An RGB Method for 6D Object Pose Estimation
• A Faster R-CNN from Tensorpack is trained on the YCB-Video dataset to predict ROI proposals;
• For each class, it renders a set of 12 viewpoints of the object model at size 224×224;
• The 1st stage of the network predicts an intermediate silhouette representation of the object
as a 64×64 binary mask.
• This silhouette represents the full un-occluded visual hull of the object as though it were
rendered with the same 3D orientation but centered in the frame.
• The size of the silhouette in the frame is invariant to the scale of the object in the image and
is determined by a fixed distance of the object from the camera at which the silhouette
appears to be rendered.
• This distance is chosen for each object so that the silhouette just fits within the frame for any
3D orientation.
• Given the smallest field of view of the camera, A, and the object size as width, height, and
depth (w, h, d), the render distance r is chosen so that the object's bounding volume fits within the field of view.
SilhoNet: An RGB Method for 6D Object Pose Estimation
• If an object at a given range is shifted along the arc with the camera center as the focus, the
Z coordinate will change while the object appearance in the shifted ROI will be unchanged.
• By predicting the object range rather than directly regressing the Z coordinate, this method
does not suffer from ambiguities and can recover the Z coordinate with good accuracy.
• Given the camera focal length f, the pixel coordinates of the object center (px, py) with
respect to the image center, and the range r of the object center from the camera center,
similar triangles can be used to recover the 3D object translation (X, Y, Z) (see the sketch
after this list).
• Given an ROI with lower x and y coordinate bounds (bx, by), the coordinates of the image
principal point (cx, cy), and the predicted normalized output from the network (nx, ny), the
object center pixel coordinates (px, py) are recovered by mapping the normalized prediction back into image coordinates.
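A sketch of the similar-triangles recovery of (X, Y, Z) from the predicted range r and the center offset (px, py): the viewing ray through the center has direction (px/f, py/f, 1), and scaling its unit vector by r yields the translation. This reproduces the geometric relation described above; the paper's exact equations are expressed in the same quantities.

```python
import numpy as np

def translation_from_range(px, py, r, f):
    """Recover the 3D translation from the object-center pixel offset and the predicted range.

    (px, py): center in pixels relative to the principal point; r: distance of the object center
    from the camera center; f: focal length in pixels.
    """
    ray = np.array([px / f, py / f, 1.0])   # direction of the viewing ray through the center
    t = r * ray / np.linalg.norm(ray)       # scale the unit ray by the predicted range
    return t                                 # (X, Y, Z); Z = r / sqrt(1 + (px/f)^2 + (py/f)^2)
```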
SilhoNet: An RGB Method for 6D Object Pose Estimation
• The network predicts the apparent orientation as though the ROI were extracted from the
center of the image.
• Given the predicted object translation, the true orientation is recovered by applying a pitch
δθ and roll δφ adjustment, computed from the viewing-ray angles of the object center, to the predicted orientation.
• The network’s 2nd stage takes in the predicted silhouette probability maps, thresholded at
some value into binary masks, and outputs a quaternion prediction for the object orientation.
• This stage is composed of a ResNet-18 backbone, with the layers from the average pooling and
below replaced with two fully connected layers.
• It constructs a transform matrix T with a z translation equal to the render distance r for the
corresponding object class and the x and y translation components set to 0.
• The rotation is formed from the predicted apparent orientation. With this transform and the
camera intrinsics K, each vertex of the object model can be projected onto the occlusion mask,
which is scaled up to fit the minimum dimension of the input image.
SilhoNet: An RGB Method for 6D Object Pose Estimation
Example prediction of occluded and un-occluded silhouettes from a test image
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
• Estimating the 6D pose of objects using only RGB images remains challenging because of
problems such as occlusion and symmetries.
• It is also difficult to construct 3D models with precise texture without expert knowledge or
specialized scanning devices.
• To address these problems, it proposes a pose estimation method, Pix2Pose, that predicts
the 3D coordinates of each object pixel without textured models.
• An auto-encoder architecture is designed to estimate the 3D coordinates and errors per pixel.
• These pixel-wise predictions are then used in multiple stages to form 2D-3D
correspondences to directly compute poses with the PnP algorithm with RANSAC iterations (see the sketch after this list).
• This method is robust to occlusion by leveraging recent achievements in generative
adversarial training to precisely recover occluded parts.
• Furthermore, a loss function, the transformer loss, is proposed to handle symmetric objects
by guiding predictions to the closest symmetric pose.
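The last step of the pipeline is an ordinary PnP + RANSAC solve over the predicted per-pixel coordinates. A minimal sketch, assuming the coordinate map has already been de-normalized to metric object coordinates and that unreliable pixels are dropped with an (illustrative) error threshold.

```python
import numpy as np
import cv2

def pose_from_coordinate_map(coord_map, error_map, mask, K, err_thresh=0.1):
    """Build 2D-3D correspondences from per-pixel 3D coordinate predictions and solve PnP.

    coord_map: (H, W, 3) predicted object-frame coordinates; error_map: (H, W) predicted
    per-pixel error; mask: (H, W) boolean valid-object mask; K: 3x3 intrinsics.
    """
    ys, xs = np.nonzero(mask & (error_map < err_thresh))
    pts_2d = np.stack([xs, ys], axis=1).astype(np.float32)
    pts_3d = coord_map[ys, xs].astype(np.float32)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K.astype(np.float32), None,
                                                 iterationsCount=100, reprojectionError=3.0)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec, inliers
```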
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
An overview of the architecture of Pix2Pose and the training pipeline.
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
• Pix2Pose predicts the 3D coordinates of individual pixels using a cropped region for an object.
• The robust estimation is established by recovering 3D coordinates of occluded parts and
using all pixels of an object for pose prediction.
• A single network is trained and used for each object class.
• The texture of a 3D model is not necessary for training and inference.
• The network input is a cropped image Is obtained using the bounding box of a detected object class.
• The network outputs are normalized 3D coordinates of each pixel in the object coordinate system,
together with estimated errors of each prediction, from the Pix2Pose network.
• The target output includes coordinate predictions of occluded parts, which makes the
prediction more robust to partial occlusion.
• Since a coordinate consists of three values similar to RGB values in an image, the output can
be regarded as a color image.
• Therefore, the ground truth output is easily derived by rendering the colored coordinate
model in the ground truth pose.
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
An example of the pose estimation process. An image and 2D detection results are the input. In the 1st stage, the
predicted results are used to specify important pixels and adjust bounding boxes while removing backgrounds and
uncertain pixels. In the 2nd stage, pixels with valid coordinate values and small error predictions are used to
estimate poses using the PnP algorithm with RANSAC. Green and blue lines in the result represent 3D bounding
boxes of objects in ground truth poses and estimated poses.
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
Accurate 6D Object Pose Estimation by Pose
Conditioned Mesh Reconstruction
• Current 6D object pose methods consist of deep CNN models fully optimized for a single
object but with their architecture standardized across objects with different shapes.
• This work explicitly exploits each object's distinct topological information, i.e., dense 3D meshes,
in the pose estimation model, prior to any post-processing refinement stage.
• In order to achieve this, it proposes a learning framework in which a Graph Convolutional
Neural Network reconstructs a pose conditioned 3D mesh of the object.
• An estimate of the allocentric orientation is recovered by computing, in a differentiable
manner, the Procrustes alignment between the canonical and reconstructed dense 3D meshes (see the sketch after this list).
• 6D egocentric pose is lifted using additional mask and 2D centroid projection estimations.
• This method is capable of self-validating its pose estimation by measuring the quality of the
reconstructed mesh, which is invaluable in real-life applications.
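The orientation-from-mesh step is an orthogonal Procrustes (Kabsch) alignment, which is differentiable through the SVD. A minimal PyTorch sketch over corresponding vertex sets; the paper's full pipeline also handles centering/translation and operates on its reconstructed dense meshes.

```python
import torch

def procrustes_rotation(canonical, reconstructed):
    """Differentiable orthogonal Procrustes alignment between two corresponding vertex sets.

    canonical, reconstructed: (N, 3) tensors of corresponding mesh vertices. Returns the
    rotation R that best maps the centered canonical mesh onto the centered reconstruction.
    """
    a = canonical - canonical.mean(dim=0, keepdim=True)       # center both vertex sets
    b = reconstructed - reconstructed.mean(dim=0, keepdim=True)
    H = a.t() @ b                                              # 3x3 cross-covariance
    U, S, Vh = torch.linalg.svd(H)
    d = torch.sign(torch.det(Vh.t() @ U.t()))                  # fix a possible reflection
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    return Vh.t() @ D @ U.t()                                  # rotation with det(R) = +1
```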
Accurate 6D Object Pose Estimation by Pose
Conditioned Mesh Reconstruction
The pipeline where it fully exploits the object shape topology both in 2D and 3D for 6D pose estimation
Accurate 6D Object Pose Estimation by Pose
Conditioned Mesh Reconstruction
• Given a monocular RGB input image, the goal is to estimate full 6D pose of a rigid object.
• It aims to design a distinct per object architecture in an automated manner by taking full
advantage of prior information of the object.
• The reconstruction stage combines the use of the object’s known topology with encoded
pose information extracted from the image.
• The estimated mesh info is used to recover the allocentric orientation of the target object.
• Egocentric orientation can be recovered and lifted to 6D by adopting different approaches.
• It uses a pretrained Faster R-CNN based 2D object detector and fine-tunes the model on
training data in order to detect an object in 2D space.
• The detector is used to crop an object ROI for further processing; the ROI is used at high
resolution to extract fine details of the object's appearance in the next stages of the pipeline.
• This ad hoc detector is trained independently.
Accurate 6D Object Pose Estimation by Pose
Conditioned Mesh Reconstruction
Qualitative results obtained with this method
DeepIM: Deep Iterative Matching for 6D Pose Estimation
• While several recent techniques have used depth cameras for object pose estimation, such
cameras have limitations with respect to frame rate, field of view, resolution, and depth
range, making it very difficult to detect small, thin, transparent, or fast moving objects.
• Estimating 6D poses of objects from images is an important problem in various applications
such as robot manipulation and virtual reality.
• While direct regression of images to object poses has limited accuracy, matching rendered
images of an object against the input image can produce accurate results.
• This work proposes a deep neural network for 6D pose matching named DeepIM.
• Given an initial pose estimation, this network is able to iteratively refine the pose by
matching the rendered image against the observed image.
• The network is trained to predict a relative pose transformation using a disentangled
representation of 3D location and 3D orientation and an iterative training process.
• DeepIM is able to match previously unseen objects.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
DeepIM, a deep iterative matching network for 6D object pose estimation. The network is trained to predict a
relative SE(3) transformation that can be applied to an initial pose estimation for iterative pose refinement.
Given a 6D pose estimation of an object, either from PoseCNN or the refined pose from the previous iteration, along
with the 3D model of the object, it generates the rendered image showing the appearance of the target object
under this rough pose estimation. From the pair of rendered and observed images, the network
predicts a relative transformation which can be applied to refine the input pose (a sketch of this loop follows).
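The render-and-compare loop itself is simple; a sketch under the assumption that `deepim`, `renderer`, and 4x4 pose matrices are available as shown (names are illustrative). Note that the actual DeepIM update uses a disentangled rotation/translation representation rather than a plain SE(3) matrix product.

```python
def refine_pose(observed_img, initial_pose, model_3d, deepim, renderer, n_iters=4):
    """Iterative matching: render at the current estimate, predict a relative update, compose, repeat.

    initial_pose: 4x4 object pose (e.g., from PoseCNN); deepim maps (observed, rendered) images
    to a relative 4x4 transform; renderer renders the 3D model under a given pose.
    """
    pose = initial_pose
    for _ in range(n_iters):
        rendered = renderer(model_3d, pose)       # object appearance under the current rough estimate
        delta = deepim(observed_img, rendered)    # predicted relative transformation
        pose = delta @ pose                       # apply the update (simplified composition)
    return pose
```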
DeepIM: Deep Iterative Matching for 6D Pose Estimation
• The observed image, the rendered image, and the two masks, are concatenated into an 8-
channel tensor input to the network (3 channels for observed/rendered image, 1 channel for
each mask).
• It uses the FlowNetSimple architecture as the backbone network, which is trained to predict
optical flow between two images.
• It tried using the VGG16 image classification network as the backbone network, but the
results were very poor, confirming the intuition that a representation related to optical flow
is very useful for pose matching.
• The pose estimation branch takes the feature map after 10 convolution layers from
FlowNetSimple as input.
• It contains two fully-connected layers each with dimension 256, followed by two additional
fully-connected layers for predicting the quaternion of the 3D rotation and the 3D
translation, respectively.
• During training, two auxiliary branches are added to regularize the feature representation of the
network and increase training stability and performance.
• One branch is trained to predict optical flow between the rendered and the observed image, and
the other branch to predict the foreground mask of the object in the observed image.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
DeepIM uses a FlowNetSimple backbone to predict a relative SE(3) transformation that matches the
observed and rendered images of an object. Taking the observed image, the rendered image, and their
corresponding masks as input, the conv. layers output a feature map which is then forwarded
through several FCLs to predict the translation and rotation. The same feature map, combined with
feature maps from the previous layers, is also used to predict the optical flow and foreground mask during training.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
Pose refinement results on the Occlusion LINEMOD dataset
CDPN: Coordinates-Based Disentangled Pose Network for
Real-Time RGB-Based 6-DoF Object Pose Estimation
• 6-DoF object pose estimation from a single RGB image is a fundamental and long-standing
problem in computer vision.
• Current leading approaches solve it by training deep networks to either regress both
rotation and translation from image directly or to construct 2D-3D correspondences and
further solve them via PnP indirectly.
• It argues that rotation and translation should be treated differently because of their significant differences.
This work proposes a novel 6-DoF pose estimation approach: the Coordinates-based
Disentangled Pose Network (CDPN), which disentangles the pose to predict rotation and
translation separately to achieve highly accurate and robust pose estimation.
• This method is flexible and efficient, and can deal with texture-less and occluded objects.
• This approach exceeds the state-of-the-art RGB-based methods on commonly used metrics.
CDPN: Coordinates-Based Disentangled Pose Network for
Real-Time RGB-Based 6-DoF Object Pose Estimation
Given an input image, it first zooms in on the target object, and then the rotation and translation are
disentangled for estimation. Concretely, the rotation is solved by PnP from the predicted 3D coordinates,
while the translation is estimated directly from the image.
CDPN: Coordinates-Based Disentangled Pose Network for
Real-Time RGB-Based 6-DoF Object Pose Estimation
• First, a fast, lightweight detector (e.g. tiny YOLOv3) is employed for coarse detection;
• Second, a fixed size segmentation is implemented to extract the object pixels.
• For detection, the pose estimation system can tolerate detection errors to a large extent
owing to the Dynamic Zoom-In (DZI), so a fast but less-precise detector is enough.
• For segmentation, it is merged into the coordinate regression to keep it light and fast.
• This two-step pipeline can efficiently extract the exact object region in various situations.
• In terms of translation, to achieve more robust and accurate estimation, it predicts the translation from
the image instead of from 2D-3D correspondences, to avoid the influence of scale errors in
the predicted 3D coordinates.
• Instead of regressing translation from the whole image, a Scale-Invariant Translation
Estimation (SITE) method estimates it from the detected object region.
• In this way, the disentangled processes regarding rotation and translation are unified into a
single network, namely Coordinates-based Disentangled Pose Network (CDPN).
CDPN: Coordinates-Based Disentangled Pose Network for
Real-Time RGB-Based 6-DoF Object Pose Estimation
Qualitative results for 6-DoF pose estimation and 3D coordinates regression.
DPOD: 6D Pose Object Detector and Refiner
• This paper presents a deep learning method for 3D object detection and 6D pose estimation
from RGB images only.
• This method, named DPOD (Dense Pose Object Detector), estimates dense multi-class
2D-3D correspondence maps between an input image and available 3D models.
• Given the correspondences, a 6DoF pose is computed via PnP and RANSAC.
• An additional RGB pose refinement of the initial pose estimates is performed using a custom
deep learning-based refinement scheme.
• Unlike other methods that mainly use real data for training and do not train on synthetic
renderings, it performs evaluation on both synthetic and real training data, demonstrating
superior results before and after refinement when compared to all recent detectors.
• While being precise, the presented approach is still real-time capable.
DPOD: 6D Pose Object Detector and Refiner
Given an input RGB image, the correspondence block, featuring an encoder-decoder neural network, regresses the
object ID mask and the correspondence map. The latter one provides us with explicit 2D-3D correspondences,
whereas the ID mask estimates which correspondences should be taken for each detected object. The respective
6D poses are then efficiently computed by the pose block based on PnP + RANSAC.
DPOD: 6D Pose Object Detector and Refiner
• The inference pipeline is divided into two blocks: the correspondence and the pose block.
• The correspondence block consists of an encoder-decoder CNN with three decoder heads
which regress the ID mask and dense 2D-3D correspondence map from an RGB image of
size 320×240×3.
• The encoder part is based on a 12-layer ResNet-like architecture featuring residual layers
that allow for faster convergence.
• The decoders upsample the feature map to its original size using a stack of bilinear
interpolations followed by convolutional layers.
• The pose block is responsible for the pose prediction: Given the estimated ID mask, it
observes which objects were detected in the image and their 2D locations, whereas the
correspondence map maps each 2D point to a coordinate on an actual 3D model.
• The 6D pose is then estimated using the Perspective-n-Point (PnP) pose estimation method,
which estimates the camera pose given correspondences and the intrinsic parameters of the
camera (see the sketch after this list).
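A sketch of the pose block, assuming a hypothetical lookup `uv_to_xyz` that maps the 2-channel correspondence values back to 3D model coordinates; the selection by ID mask and the PnP + RANSAC solve follow the description above.

```python
import numpy as np
import cv2

def dpod_style_pose(id_mask, corr_map, uv_to_xyz, obj_id, K):
    """Select one object's pixels via the ID mask, build 2D-3D correspondences, and solve PnP.

    id_mask: (H, W) predicted object IDs; corr_map: (H, W, 2) predicted correspondence (texture)
    values; uv_to_xyz: hypothetical lookup taking (N, 2) correspondence values to (N, 3) model
    coordinates; obj_id: the object of interest; K: 3x3 intrinsics.
    """
    ys, xs = np.nonzero(id_mask == obj_id)
    pts_2d = np.stack([xs, ys], axis=1).astype(np.float32)
    pts_3d = uv_to_xyz(corr_map[ys, xs]).astype(np.float32)   # 2D-3D correspondences
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K.astype(np.float32), None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```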
DPOD: 6D Pose Object Detector and Refiner
Correspondence model: Given a 3D
model of interest (1), it applies a 2-channel
correspondence texture (2) to
it. The resulting correspondence
model (3) is then used to generate GT
maps and estimate poses.
Refinement architecture: The network predicts a refined pose
given an initial pose proposal. Crops of the real image and the
rendering are fed into two parallel branches. The difference of
the computed feature tensors is used to estimate the refined pose.
To learn dense 2D-3D correspondences,
each model of the dataset is textured
with a correspondence map.
DPOD: 6D Pose Object Detector and Refiner
Qualitative results: Poses predicted with the
approach on (a) the LineMOD dataset and
(b) the OCCLUSION dataset.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
• Feature-based and template-based methods were popular for 6D object pose estimation.
• Feature-based methods rely on distinguishable features and perform badly for texture-poor
objects.
• Template-based methods do not work well if objects are partially occluded.
• With deep learning methods showing success for different image-related problem settings,
models inspired or extending these have been used increasingly.
• Symmetric objects pose a particular challenge for orientation estimation, because multiple
solutions or manifolds of solutions exist.
• This work introduces ConvPoseCNN, a fully convolutional architecture that avoids cutting out
individual objects.
• It puts forward pixel-wise, dense prediction of both the translation and orientation components of the
object pose, where the dense orientation is represented in quaternion form.
• It presents different approaches for aggregating the dense orientation predictions, including
averaging and clustering schemes (a quaternion-averaging sketch follows this list).
• The dense orientation prediction implicitly learns to attend to occlusion-free, and feature-rich
object regions.
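One standard way to aggregate dense quaternion predictions is the eigenvector (outer-product) average, which is insensitive to the q/-q sign ambiguity. A sketch of such a weighted average; the paper additionally evaluates clustering schemes and different pixel weightings.

```python
import numpy as np

def average_quaternions(quats, weights=None):
    """Aggregate pixel-wise quaternion predictions for one object into a single orientation.

    quats: (N, 4) per-pixel quaternions; weights: optional (N,) confidences (e.g. segmentation
    scores). The average is the principal eigenvector of the weighted outer-product matrix.
    """
    q = quats / np.linalg.norm(quats, axis=1, keepdims=True)
    w = np.ones(len(q)) if weights is None else weights
    M = (q * w[:, None]).T @ q                 # 4x4 accumulator, invariant to per-sample sign flips
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -1]                      # unit quaternion with the largest eigenvalue
```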
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
Dense Prediction of 6D pose parameters inside ConvPoseCNN. The dense
predictions are aggregated on the object level to form 6D pose outputs.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
ConvPoseCNN
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
• The ConvPoseCNN architecture is derived from PoseCNN, which predicts, starting from RGB
images, 6D poses for each object in the image.
• The network starts with the convolutional backbone of VGG16 that extracts features.
• These are subsequently processed in three branches: The fully-convolutional segmentation
branch that predicts a pixel-wise semantic segmentation, the fully-convolutional vertex branch,
which predicts a pixel-wise estimation of the center direction and center depth, and the
quaternion estimation branch.
• The segmentation and vertex branch results are combined to vote for object centers in a Hough
transform layer.
• The Hough layer also predicts bounding boxes for the detected objects.
• PoseCNN then uses these bounding boxes to crop and pool the extracted features which are
then fed into a fully-connected neural network architecture.
• This fully-connected part predicts an orientation quaternion for each bounding box.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
Qualitative results from ConvPoseCNN L2 on the YCB-Video test set. Top: (orange) ground truth
bounding boxes, (green) 6D pose prediction. Middle: Angular error of the dense quaternion
prediction w.r.t. ground truth. Bottom: Quaternion prediction norm before normalization.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
• Current 6D object pose estimation methods usually require a 3D model for each object.
• These methods also require additional training in order to incorporate new objects.
• As a result, they are difficult to scale to a large number of objects and cannot be directly applied
to unseen objects.
• This work proposes a framework for 6D pose estimation of unseen objects.
• It designs an end-to-end neural network that reconstructs a latent 3D representation of an
object using a small number of reference views of the object.
• Using the learned 3D representation, the network is able to render the object from arbitrary views.
• Using this neural renderer, it directly optimizes for the pose given an input image.
• By training the network with a large number of 3D shapes for reconstruction and rendering, this
network generalizes well to unseen objects.
• A dataset for unseen object pose estimation–MOPED (Model-free Object Pose Estimation
Dataset) is presented.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
This is the end-to-end differentiable modeling and rendering pipeline to
perform pose estimation using simple gradient updates.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
• Given a set of N reference images with associated object poses and object segmentation
masks, it seeks to construct a representation of the object which can be rendered with
arbitrary camera parameters.
• It represents the object as a latent 3D voxel grid, directly manipulated using standard 3D
transformations–naturally accommodating the requirement of novel view rendering.
• There are two main components to the reconstruction pipeline: 1) Modeling the object by
predicting per-view feature volumes and fusing them into a single canonical latent
representation; 2) Rendering the latent representation to depth and color images.
• The modeling step is inspired by space carving in that the network takes observations from
multiple views and leverages multi-view consistency to build a canonical representation.
• The rendering module takes the fused object volume and renders it given arbitrary camera
parameters.
• It does so by first rendering depth and then using an image-based rendering approach to
produce a color image, preserving high-frequency details through a neural network.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
A high-level overview of this architecture. 1) This modeling network takes an image and mask and predicts a
feature volume for each input view. The predicted feature volumes are then fused into a single canonical
latent object by the fusion module. 2) Given the latent object, the rendering network produces a depth map
and a mask for any output camera.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
• HybridPose, a 6D object pose estimation approach, utilizes a hybrid intermediate
representation to express different geometric information in the input image, including
keypoints, edge vectors, and symmetry correspondences.
• Compared to a unitary representation, the hybrid representation allows pose regression to
exploit more, and more diverse, features when one type of predicted representation is inaccurate
(e.g., because of occlusion).
• HybridPose leverages a robust regression module to filter out outliers in predicted
intermediate representation.
• All intermediate representations can be predicted by the same simple neural network
without sacrificing the overall performance.
• Compared to state-of-the-art pose estimation approaches, HybridPose is comparable in running time
and is significantly more accurate.
• The HybridPose code is available at https://github.com/chensong1995/HybridPose.
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
HybridPose predicts keypoints, edge vectors, and symmetry correspondences. (a) input RGB image. (b) red
markers denote predicted 2D keypoints. (c) edge vectors are defined by a fully-connected graph among all
keypoints. (d) symmetry correspondences connect each 2D pixel on the object to its symmetric counterpart.
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
• The input to HybridPose is an image containing an object in a known class, taken by a
pinhole camera with known intrinsic parameters.
• It assumes that the class of objects has a canonical coordinate system (i.e., the 3D point
cloud), under which HybridPose outputs the 6D camera pose of the imaged object.
• HybridPose consists of a prediction module and a pose regression module.
• HybridPose utilizes three prediction networks to estimate a set of keypoints, a set of edges
between keypoints and a set of symmetry correspondences between image pixels.
• The keypoint network employs an off-the-shelf prediction network PVNet;
• The edge network predicts edge vectors along a pre-defined graph of keypoints, which
stabilizes pose regression when keypoints are cluttered in the input image;
• The symmetry network predicts symmetry correspondences that reflect the underlying
(partial) reflection symmetry (an extension of FlowNet 2.0).
• The pose regression module optimizes the object pose to fit the output of the three
prediction networks (similar to the P3P solver, following the EPnP framework).
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
HybridPose consists of intermediate representation prediction networks and a pose regression module. The
prediction networks take an image as input, and output predicted keypoints, edge vectors, and symmetry
correspondences. The pose regression module consists of an initialization sub-module and a refinement sub-module.
The initialization sub-module solves a linear system with predicted intermediate representations to obtain an initial
pose. The refinement sub-module utilizes GM robust norm in the optimization to obtain the final pose prediction.
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
HybridPose handles situations where the object has no occlusion (a, d, f, h),
light occlusion (b, c), and severe occlusion (e, g).
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
• Estimating a 6DOF object pose from a single image is very challenging due to occlusions or
texture-less appearances.
• Vector-field based keypoint voting has demonstrated its effectiveness and superiority on
tackling those issues.
• However, direct regression of vector-fields neglects that the distances between pixels and
keypoints also affect the deviations of hypotheses dramatically.
• In other words, small errors in direction vectors may generate severely deviated hypotheses
when pixels are far away from a keypoint.
• This paper aims to reduce such errors by incorporating the distances between pixels and
keypoints into the objective.
• To this end, it develops a differentiable proxy voting loss (DPVL) which mimics the
hypothesis selection in the voting procedure.
• By exploiting the voting loss, it can train the network in an end-to-end manner.
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
Differentiable Proxy Voting Loss (DPVL)
illustration. Provided that the estimation
errors of direction vectors are the same
(e.g., α), the distance between a pixel
and a keypoint affects the closeness
between a hypothesis and the keypoint.
DPVL minimizes the distance d⋆ between
a proxy hypothesis fk(p⋆ ) and a keypoint
ki to achieve accurate hypotheses for
keypoint voting.
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
• This work focuses on obtaining accurate initial pose estimation.
• In particular, this method is designed to localize and estimate the orientations and
translations of an object accurately without any refinement.
• The object pose is represented by a rigid transformation from the object coordinate system
to the camera coordinate system.
• Since voting based methods have demonstrated their robustness to occlusions and view
changes, here it follows the voting based pose estimation pipeline.
• Specifically, this method first votes for the 2D positions of the object keypoints from the vector-
fields and then estimates the 6DOF pose by solving a PnP problem.
• Prior works regress pixel-wise vector-fields by an l1 loss.
• However, small errors in the vector-fields may lead to large deviation errors of hypotheses
because the loss does not take the distance between a pixel and a keypoint into account.
• Therefore, it presents a differentiable proxy voting loss (DPVL) to reduce such errors by
mimicking the hypothesis selection in the voting procedure (see the sketch after this list).
• Furthermore, benefiting from DPVL, the network is able to converge much faster.
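A PyTorch sketch of a proxy voting loss for one keypoint: it penalizes the perpendicular distance from the ground-truth keypoint to the ray cast from each pixel along its predicted direction, so distant pixels are penalized more for the same angular error. The smooth-L1 wrapper and normalization details are assumptions rather than the paper's exact formulation.

```python
import torch

def proxy_voting_loss(pixels, dirs, keypoint):
    """Proxy voting loss for one keypoint (assumption-level re-implementation).

    pixels: (N, 2) foreground pixel coordinates; dirs: (N, 2) predicted direction vectors;
    keypoint: (2,) ground-truth keypoint location.
    """
    d = dirs / (dirs.norm(dim=1, keepdim=True) + 1e-9)   # unit directions
    rel = keypoint.unsqueeze(0) - pixels                  # vectors from pixels to the keypoint
    # 2D cross-product magnitude = distance from the keypoint to the line through the pixel.
    dist = (d[:, 0] * rel[:, 1] - d[:, 1] * rel[:, 0]).abs()
    return torch.nn.functional.smooth_l1_loss(dist, torch.zeros_like(dist))
```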
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
The system pipeline
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
Qualitative results of pose estimation on the LINEMOD dataset
Qualitative results on the Occlusion LINEMOD dataset
Pose estimation from RGB images by deep learning

More Related Content

What's hot

What's hot (20)

image classification
image classificationimage classification
image classification
 
fusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIfusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving II
 
Object Recognition
Object RecognitionObject Recognition
Object Recognition
 
Image Processing Basics
Image Processing BasicsImage Processing Basics
Image Processing Basics
 
03 digital image fundamentals DIP
03 digital image fundamentals DIP03 digital image fundamentals DIP
03 digital image fundamentals DIP
 
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
 
Object recognition
Object recognitionObject recognition
Object recognition
 
Depth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors IIDepth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors II
 
Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)
 
Deep VO and SLAM
Deep VO and SLAMDeep VO and SLAM
Deep VO and SLAM
 
Deep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal DataDeep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal Data
 
Image segmentation with deep learning
Image segmentation with deep learningImage segmentation with deep learning
Image segmentation with deep learning
 
04 image enhancement edge detection
04 image enhancement edge detection04 image enhancement edge detection
04 image enhancement edge detection
 
Introduction to object detection
Introduction to object detectionIntroduction to object detection
Introduction to object detection
 
3D visualisation of medical images
3D visualisation of medical images3D visualisation of medical images
3D visualisation of medical images
 
Neural Radiance Fields & Neural Rendering.pdf
Neural Radiance Fields & Neural Rendering.pdfNeural Radiance Fields & Neural Rendering.pdf
Neural Radiance Fields & Neural Rendering.pdf
 
Image enhancement
Image enhancementImage enhancement
Image enhancement
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
 
Enhancement in Digital Image Processing
Enhancement in Digital Image ProcessingEnhancement in Digital Image Processing
Enhancement in Digital Image Processing
 
Object Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning FrameworkObject Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning Framework
 

Similar to Pose estimation from RGB images by deep learning

10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
mokamojah
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
c.choi
 

Similar to Pose estimation from RGB images by deep learning (20)

3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving
 
3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III
 
fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving I
 
LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V
 
[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp
 
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
 
ei2106-submit-opt-415
ei2106-submit-opt-415ei2106-submit-opt-415
ei2106-submit-opt-415
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
 
sduGroupEvent
sduGroupEventsduGroupEvent
sduGroupEvent
 
cvpresentation-190812154654 (1).pptx
cvpresentation-190812154654 (1).pptxcvpresentation-190812154654 (1).pptx
cvpresentation-190812154654 (1).pptx
 
ppt 20BET1024.pptx
ppt 20BET1024.pptxppt 20BET1024.pptx
ppt 20BET1024.pptx
 
Computer Vision - Real Time Face Recognition using Open CV and Python
Computer Vision - Real Time Face Recognition using Open CV and PythonComputer Vision - Real Time Face Recognition using Open CV and Python
Computer Vision - Real Time Face Recognition using Open CV and Python
 
998-isvc16
998-isvc16998-isvc16
998-isvc16
 
Scrdet++ analysis
Scrdet++ analysisScrdet++ analysis
Scrdet++ analysis
 
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
 

More from Yu Huang

More from Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 

PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
The 3D translation can be estimated by localizing the 2D center of the object and estimating the distance of the 3D center from the camera. Each pixel casts votes for 2D image locations along the ray predicted by the network.
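To make the voting step concrete, here is a minimal, naive sketch of dense center voting in Python. It assumes a per-pixel unit direction field and a binary class mask as inputs and simply walks each predicted ray through an accumulator; the actual Hough voting layer in PoseCNN is implemented differently (and also aggregates the predicted depth), so this is only an illustration.

import numpy as np

def vote_object_center(center_dirs, mask, n_steps=400):
    # center_dirs: (H, W, 2) unit vectors (dx, dy) pointing toward the object center
    # mask: (H, W) boolean semantic mask for one object class
    H, W = mask.shape
    acc = np.zeros((H, W), dtype=np.int32)
    ys, xs = np.nonzero(mask)
    for x, y, (dx, dy) in zip(xs, ys, center_dirs[ys, xs]):
        # walk along the predicted ray and vote for every pixel it crosses
        for s in range(1, n_steps):
            u, v = int(round(x + s * dx)), int(round(y + s * dy))
            if not (0 <= u < W and 0 <= v < H):
                break
            acc[v, u] += 1
    cy, cx = np.unravel_index(acc.argmax(), acc.shape)  # pixel with the most votes
    return (cx, cy), acc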
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
• This paper introduces a method for 3D object detection and pose estimation from color images only.
• It first uses segmentation to detect the objects of interest in 2D, even in the presence of partial occlusion and cluttered backgrounds.
• In contrast with recent patch-based methods, it relies on a “holistic” approach: a CNN is applied to the detected objects to predict their 3D poses in the form of 2D projections of the corners of their 3D bounding boxes.
• This, however, is not sufficient for handling objects from the recent T-LESS dataset: these objects exhibit an axis of rotational symmetry, and the similarity of two images of such an object under two different poses makes training the CNN challenging.
• It solves this problem by restricting the range of poses used for training, and by introducing a classifier to identify the range of a pose at run-time before estimating it.
• It also uses an optional additional step that refines the predicted poses.
• The full approach is scalable, as a single network can be trained for multiple objects simultaneously.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
Localization: (a) The input image is resized to 512 × 384 and split into regions of size 128 × 128. (b) Each region is first segmented into a binary mask of 8 × 8 for each possible object o. (c) Only the largest component is kept if several are present; the active locations are then segmented more finely. (d) The centroid of the final segmentation is used as the 2D object center.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
• It first finds the objects in 2D, obtains an initial estimate of their 3D poses (including objects with a rotational symmetry), and finally refines the initial pose estimates.
• It identifies the 2D centers of the objects of interest in the input images.
• It could use a standard 2D object detector, but instead develops an approach based on segmentation that results in better performance, as it can provide accurate locations even under partial occlusion.
• It predicts the 3D pose of an object by applying a deep network to an image window centered on the 2D object center.
• As for the segmentation, it uses VGG as the basis of this network, which makes it possible to handle all the objects of the target dataset with a single network.
• For an object with an angle of symmetry α, it can therefore restrict the poses used for training to those where the angle of rotation around the symmetry axis is within the range [0, α], avoiding the ambiguity between images.
• Denote by β the rotation angle, and introduce the intervals r1 = [0, α/2) and r2 = [α/2, α).
• To avoid ambiguity, β is restricted to r1 for the training images used in the optimization.
• A CNN classifier is introduced to predict at run-time whether β is in r1 or r2.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
Objects with symmetry of rotation: object #5 of T-LESS has an angle of symmetry α of 180°, if ignoring the small screw and electrical contact. If restricting the range of poses in the training set between 0° (a) and 180° (b), pose estimation still fails for test samples with an angle of rotation close to 0° modulo 180° (c). The solution is to restrict the range during training to be between 0° and 90°. A classifier detects whether the pose in an input image is between 90° and 180°. If this is the case (d), the input image is mirrored (e), and the predicted projections of the corners are mirrored back (f).
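The run-time handling of the restricted rotation range can be summarized in a few lines. The sketch below is an illustration only: corner_regressor and half_range_classifier are hypothetical stand-ins for the trained BB8 networks, and the crop is assumed to be an (H, W, 3) array centered on the detected object.

import numpy as np

def predict_corners_with_symmetry(crop, corner_regressor, half_range_classifier):
    # half_range_classifier: returns True if the rotation angle beta around the
    # symmetry axis falls in the second half-range r2 = [alpha/2, alpha).
    in_r2 = half_range_classifier(crop)
    inp = crop[:, ::-1, :] if in_r2 else crop             # mirror the crop horizontally
    corners = corner_regressor(inp)                        # (8, 2) projected box corners
    if in_r2:
        corners = corners.copy()
        corners[:, 0] = inp.shape[1] - 1 - corners[:, 0]   # mirror the x-coordinates back
    return corners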
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
Refining the pose: given a first pose estimate, shown by the blue bounding box (a), a binary mask (b) or a color rendering (c) of the object is generated. Given the input image and this mask or rendering, an update that improves the object pose can be predicted, shown by the red bounding box (d).
Two generated training images for different objects from the LINEMOD dataset: the object is shifted from the center to handle the inaccuracy of the detection method, and the background is random so that the network cannot exploit the context specific to the dataset.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
First row: LINEMOD dataset. Second row: Occlusion dataset. Third row: T-LESS dataset (for objects of revolution, the pose is represented with a cylinder rather than a box). Last row: some failure cases.
Outline
• PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation (2018.12)
• SilhoNet: An RGB Method for 6D Object Pose Estimation (2019.6)
• Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation (2019.8)
• DeepIM: Deep Iterative Matching for 6D Pose Estimation (2019.10)
• Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction (2019.10)
• CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation (ICCV, 2019)
• DPOD: 6D Pose Object Detector and Refiner (ICCV, 2019)
• ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation (2019.12)
• LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation (2019.12)
• HybridPose: 6D Object Pose Estimation under Hybrid Representations (2020.1)
• 6DoF Object Pose Estimation via Differentiable Proxy Voting Loss (2020.2)
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
• This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation.
• Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance.
• However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation.
• It introduces a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and uses these vectors to vote for keypoint locations using RANSAC (a voting sketch follows below).
• This creates a flexible representation for localizing occluded or truncated keypoints.
• Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver.
• Experiments show that the approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation.
• The code is available at https://zju-3dv.github.io/pvnet/.
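A minimal RANSAC-style illustration of the voting idea, not PVNet's implementation: pairs of object pixels are sampled, the intersection of their predicted rays gives a keypoint hypothesis, and hypotheses are scored by how many pixels agree with them. PVNet additionally fits a mean and covariance over the weighted hypotheses to obtain the uncertainty used by the PnP solver; that step is omitted here.

import numpy as np

def intersect_rays(p1, d1, p2, d2):
    # Solve p1 + t1*d1 = p2 + t2*d2 for the 2D intersection point (None if near-parallel).
    A = np.array([[d1[0], -d2[0]], [d1[1], -d2[1]]])
    if abs(np.linalg.det(A)) < 1e-8:
        return None
    t1, _ = np.linalg.solve(A, p2 - p1)
    return p1 + t1 * d1

def vote_keypoint(vectors, mask, n_hyp=128, cos_thresh=0.99, seed=0):
    # vectors: (H, W, 2) predicted unit vectors toward one keypoint; mask: (H, W) object pixels
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    pix = np.stack([xs, ys], axis=1).astype(np.float64)
    dirs = vectors[ys, xs].astype(np.float64)
    best, best_score = None, -1
    for _ in range(n_hyp):
        i, j = rng.choice(len(pix), size=2, replace=False)
        h = intersect_rays(pix[i], dirs[i], pix[j], dirs[j])
        if h is None:
            continue
        to_h = h - pix
        to_h /= np.linalg.norm(to_h, axis=1, keepdims=True) + 1e-8
        score = int(((to_h * dirs).sum(axis=1) > cos_thresh).sum())  # inlier count
        if score > best_score:
            best, best_score = h, score
    return best, best_score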
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
The 6D pose estimation problem is formulated as a Perspective-n-Point (PnP) problem, which requires correspondences between 2D and 3D keypoints.
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
Overview of the keypoint localization: (a) an image; (b) the architecture of PVNet; (c) pixel-wise unit vectors pointing to the object keypoints; (d) semantic labels; (e) hypotheses of the keypoint locations generated by voting; (f) probability distributions of the keypoint locations estimated from the hypotheses.
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
SilhoNet: An RGB Method for 6D Object Pose Estimation
• Autonomous robot manipulation involves estimating the translation and orientation of the object to be manipulated as a 6-degree-of-freedom (6D) pose.
• Methods using RGB-D data have shown great success in solving this problem.
• However, there are situations where cost constraints or the working environment may limit the use of RGB-D sensors.
• When limited to monocular camera data only, the pose estimation problem is very challenging.
• Knowing how the target object is occluded in the image is important for certain applications, such as AR, where it is desirable to project over only the visible portion of an object.
• This work introduces SilhoNet, an RGB-based deep learning method that predicts 6D object pose from monocular images.
• It uses a CNN pipeline that takes ROI proposals as input to predict an intermediate silhouette representation for objects, with an associated occlusion mask and a 3D translation vector.
• The 3D orientation is then regressed from the predicted silhouettes.
SilhoNet: An RGB Method for 6D Object Pose Estimation
The SilhoNet pipeline for silhouette prediction and 6D object pose estimation
SilhoNet: An RGB Method for 6D Object Pose Estimation
• The 3D orientation is predicted from an intermediate un-occluded silhouette representation.
• The method also predicts an occlusion mask, which can be used to determine which parts of the object model are visible in the image.
• The method operates in two stages: it first predicts an intermediate silhouette representation and occlusion mask of an object, along with a vector describing the 3D translation, and then regresses the 3D orientation quaternion from the predicted silhouette.
• The input to the network is an RGB image with ROI proposals for detected objects and the associated class labels.
• The 1st stage uses a VGG16 backbone with deconvolution layers at the end to produce a feature map from the RGB input image (the same backbone as used in PoseCNN).
• Extracted features from the input image are concatenated with features from a set of rendered object viewpoints and then passed through 3 network branches, two of which have identical structure and predict a full un-occluded silhouette and an occlusion mask.
• The 3rd branch predicts a 3D vector encoding the object center in pixel coordinates and the range of the object center from the camera.
• The 2nd stage of the network passes the predicted silhouette through a ResNet-18 architecture with two fully connected layers (FCLs) at the end to output an L2-normalized quaternion, representing the 3D orientation.
SilhoNet: An RGB Method for 6D Object Pose Estimation
• A Faster R-CNN from Tensorpack, trained on the YCB-Video dataset, predicts the ROI proposals.
• For each class, a set of 12 viewpoints of size 224×224 is rendered from the object model.
• The 1st stage of the network predicts an intermediate silhouette representation of the object as a 64×64 binary mask.
• This silhouette represents the full un-occluded visual hull of the object, as though it were rendered with the same 3D orientation but centered in the frame.
• The size of the silhouette in the frame is invariant to the scale of the object in the image and is determined by a fixed distance of the object from the camera at which the silhouette appears to be rendered.
• This distance is chosen for each object so that the silhouette just fits within the frame for any 3D orientation.
• Given the smallest field of view of the camera A and the object size as width, height and depth (w, h, d), the render distance r follows from this constraint.
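The slide gives the render distance as an equation that did not survive extraction here. Under the stated "just fits for any orientation" constraint, assuming the silhouette must enclose the object's bounding sphere (diameter sqrt(w² + h² + d²)) within the smallest field of view A, a natural reconstruction is

r = \frac{\sqrt{w^2 + h^2 + d^2}}{2\,\tan(A/2)}

i.e., the distance at which the bounding sphere exactly subtends the angle A; the paper's exact expression may include an additional margin factor.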
SilhoNet: An RGB Method for 6D Object Pose Estimation
• If an object at a given range is shifted along an arc with the camera center as the focus, the Z coordinate will change while the object appearance in the shifted ROI remains unchanged.
• By predicting the object range rather than directly regressing the Z coordinate, the method does not suffer from such ambiguities and can recover the Z coordinate with good accuracy.
• Given the camera focal length f, the pixel coordinates of the object center (px, py) with respect to the image center, and the range r of the object center from the camera center, similar triangles show that the 3D object translation (X, Y, Z) can be recovered from these quantities.
• Given a ROI with lower x and y coordinate bounds (bx, by), the coordinates of the image principal point (cx, cy) and the predicted normalized output from the network (nx, ny), the object center pixel coordinates (px, py) are recovered from these values.
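The recovery equations on this slide are images and are not reproduced in the text above. The range-to-translation step can be reconstructed from the stated similar-triangles argument: the object center lies on the ray through pixel (px, py) at distance r from the camera center, which the following sketch implements. The ROI-to-pixel-coordinate step depends on how (nx, ny) is normalized, so it is not reproduced here.

import numpy as np

def translation_from_range(px, py, r, f):
    # px, py: object-center pixel coordinates relative to the principal point
    # r: predicted range (distance) of the object center from the camera center
    # f: focal length in pixels
    # The center lies along the ray (px, py, f); scaling that ray to length r gives (X, Y, Z).
    ray = np.array([px, py, f], dtype=np.float64)
    return r * ray / np.linalg.norm(ray)

For example, a center predicted 100 px to the right of the principal point with f = 600 and r = 1.0 m yields Z ≈ 0.986 m rather than 1.0 m, which is exactly the ambiguity the range parameterization avoids.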
SilhoNet: An RGB Method for 6D Object Pose Estimation
• The network predicts the apparent orientation as though the ROI were extracted from the center of the image.
• Given the predicted object translation, the true orientation is recovered by applying a pitch (δθ) and roll (δφ) adjustment to the predicted orientation.
• The network's 2nd stage takes in the predicted silhouette probability maps, thresholded at some value into binary masks, and outputs a quaternion prediction for the object orientation.
• This stage is composed of a ResNet-18 backbone, with the layers from the average pooling downward replaced by two fully connected layers.
• A transform matrix T is constructed with a z translation equal to the render distance r for the corresponding object class and the x and y translation components set to 0. The rotation is formed from the predicted apparent orientation.
• With this transform and the camera intrinsics K, each vertex of the object model can be projected onto the occlusion mask, which is scaled up to fit the minimum dimension of the input image.
SilhoNet: An RGB Method for 6D Object Pose Estimation
Example prediction of occluded and un-occluded silhouettes from a test image
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
• Estimating the 6D pose of objects using only RGB images remains challenging because of problems such as occlusion and symmetries.
• It is also difficult to construct 3D models with precise texture without expert knowledge or specialized scanning devices.
• To address these problems, this work proposes a pose estimation method, Pix2Pose, that predicts the 3D coordinates of each object pixel without textured models.
• An auto-encoder architecture is designed to estimate the 3D coordinates and errors per pixel.
• These pixel-wise predictions are then used in multiple stages to form 2D-3D correspondences, from which poses are directly computed with the PnP algorithm with RANSAC iterations.
• This method is robust to occlusion by leveraging recent achievements in generative adversarial training to precisely recover occluded parts.
• Furthermore, a loss function, the transformer loss, is proposed to handle symmetric objects by guiding predictions to the closest symmetric pose.
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
An overview of the architecture of Pix2Pose and the training pipeline.
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
• Pix2Pose predicts the 3D coordinates of individual pixels using a cropped region for an object.
• Robust estimation is achieved by recovering the 3D coordinates of occluded parts and using all pixels of an object for pose prediction.
• A single network is trained and used for each object class.
• The texture of a 3D model is not necessary for training and inference.
• The network input is a cropped image Is obtained using a bounding box of a detected object class.
• The network outputs are normalized 3D coordinates of each pixel in the object coordinate frame and estimated errors of each prediction from the Pix2Pose network.
• The target output includes coordinate predictions of occluded parts, which makes the prediction more robust to partial occlusion.
• Since a coordinate consists of three values, similar to RGB values in an image, the output can be regarded as a color image.
• Therefore, the ground truth output is easily derived by rendering the colored coordinate model in the ground truth pose.
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
An example of the pose estimation process. An image and 2D detection results are the input. In the 1st stage, the predicted results are used to specify important pixels and adjust bounding boxes while removing backgrounds and uncertain pixels. In the 2nd stage, pixels with valid coordinate values and small error predictions are used to estimate poses using the PnP algorithm with RANSAC. Green and blue lines in the result represent 3D bounding boxes of objects in ground truth poses and estimated poses.
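The 2nd stage reduces to building 2D-3D correspondences from the predicted coordinate and error maps and handing them to PnP with RANSAC. Below is a minimal sketch of that step using OpenCV; the validity and error thresholds are illustrative placeholders, and the coordinate map is assumed to be already de-normalized to the object frame.

import numpy as np
import cv2

def pose_from_coordinate_map(coord_map, error_map, K, err_thresh=0.1):
    # coord_map: (H, W, 3) predicted per-pixel 3D coordinates in the object frame
    # error_map: (H, W) predicted per-pixel error; K: 3x3 camera intrinsics
    valid = (np.linalg.norm(coord_map, axis=2) > 1e-3) & (error_map < err_thresh)
    ys, xs = np.nonzero(valid)
    if len(xs) < 6:
        return None  # not enough correspondences
    pts2d = np.stack([xs, ys], axis=1).astype(np.float64)
    pts3d = coord_map[ys, xs].astype(np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, None, reprojectionError=3.0, iterationsCount=200)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix from the Rodrigues vector
    return R, tvec, inliers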
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
• Current 6D object pose methods consist of deep CNN models fully optimized for a single object but with the architecture standardized across objects with different shapes.
• This work explicitly exploits each object's distinct topological information, i.e., its dense 3D mesh, in the pose estimation model, prior to any post-processing refinement stage.
• In order to achieve this, it proposes a learning framework in which a graph convolutional neural network reconstructs a pose-conditioned 3D mesh of the object.
• An estimate of the allocentric orientation is recovered by computing, in a differentiable manner, the Procrustes alignment between the canonical and reconstructed dense 3D meshes.
• The 6D egocentric pose is lifted using additional mask and 2D centroid projection estimations.
• This method is capable of self-validating its pose estimation by measuring the quality of the reconstructed mesh, which is invaluable in real-life applications.
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
The pipeline, which fully exploits the object shape topology both in 2D and 3D for 6D pose estimation
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
• Given a monocular RGB input image, the goal is to estimate the full 6D pose of a rigid object.
• It aims to design a distinct per-object architecture in an automated manner by taking full advantage of prior information about the object.
• The reconstruction stage combines the use of the object's known topology with encoded pose information extracted from the image.
• The estimated mesh information is used to recover the allocentric orientation of the target object.
• The egocentric orientation can then be recovered and lifted to 6D by adopting different approaches.
• It uses a pretrained Faster R-CNN based 2D object detector and fine-tunes the model on the training data in order to detect an object in 2D space.
• The detector is used to crop an object ROI for further processing, which is used at a high resolution to extract fine details of the object appearance in the next stages of the pipeline.
• This ad hoc detector is trained independently.
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
Qualitative results obtained with this method
DeepIM: Deep Iterative Matching for 6D Pose Estimation
• While several recent techniques have used depth cameras for object pose estimation, such cameras have limitations with respect to frame rate, field of view, resolution, and depth range, making it very difficult to detect small, thin, transparent, or fast-moving objects.
• Estimating the 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality.
• While direct regression from images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results.
• This work proposes a deep neural network for 6D pose matching named DeepIM.
• Given an initial pose estimate, the network is able to iteratively refine the pose by matching the rendered image against the observed image.
• The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process.
• DeepIM is able to match previously unseen objects.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
DeepIM is a deep iterative matching network for 6D object pose estimation. The network is trained to predict a relative SE(3) transformation that can be applied to an initial pose estimate for iterative pose refinement. Given a 6D pose estimate of an object, either from PoseCNN or the refined pose from the previous iteration, along with the 3D model of the object, a rendered image is generated showing the appearance of the target object under this rough pose estimate. With the pair of rendered and observed images, the network predicts a relative transformation which can be applied to refine the input pose.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
• The observed image, the rendered image, and the two masks are concatenated into an 8-channel tensor input to the network (3 channels each for the observed and rendered images, 1 channel for each mask).
• It uses the FlowNetSimple architecture as the backbone network, which is trained to predict optical flow between two images.
• Using the VGG16 image classification network as the backbone was also tried, but the results were very poor, confirming the intuition that a representation related to optical flow is very useful for pose matching.
• The pose estimation branch takes the feature map after 10 convolution layers of FlowNetSimple as input.
• It contains two fully connected layers, each with dimension 256, followed by two additional fully connected layers for predicting the quaternion of the 3D rotation and the 3D translation, respectively.
• During training, two auxiliary branches are used to regularize the feature representation of the network and increase training stability and performance.
• One branch is trained to predict the optical flow between the rendered and observed images, and the other to predict the foreground mask of the object in the observed image.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
DeepIM uses a FlowNetSimple backbone to predict a relative SE(3) transformation to match the observed and rendered images of an object. Taking the observed image, the rendered image and their corresponding masks as input, the conv. layers output a feature map which is then forwarded through several fully connected layers to predict the translation and rotation. The same feature map, combined with feature maps from the previous layers, is also used to predict the optical flow and the foreground mask during training.
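At inference the refinement is a short render-predict-update loop. The sketch below illustrates only that loop: renderer and matcher are assumed stand-ins for the rendering function and the trained matching network, a pose is an (R, t) pair, and the simple left-composition used here ignores DeepIM's disentangled, object-centered parameterization of the update.

import numpy as np

def refine_pose(observed_rgb, obj_model, init_pose, renderer, matcher, n_iters=4):
    # init_pose: (R, t) with R a 3x3 rotation matrix and t a 3-vector
    R, t = init_pose
    for _ in range(n_iters):
        rendered_rgb, rendered_mask = renderer(obj_model, (R, t))  # render the current estimate
        dR, dt = matcher(observed_rgb, rendered_rgb)               # predicted relative SE(3)
        R, t = dR @ R, dR @ t + dt                                 # naive composition of the update
    return R, t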
DeepIM: Deep Iterative Matching for 6D Pose Estimation
Pose refinement results on the Occlusion LINEMOD dataset
CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
• 6-DoF object pose estimation from a single RGB image is a fundamental and long-standing problem in computer vision.
• Current leading approaches solve it by training deep networks either to regress both rotation and translation directly from the image, or to construct 2D-3D correspondences and solve them indirectly via PnP.
• It is argued that rotation and translation should be treated differently because of their significant differences. This work proposes a novel 6-DoF pose estimation approach, the Coordinates-based Disentangled Pose Network (CDPN), which disentangles the pose and predicts rotation and translation separately to achieve highly accurate and robust pose estimation.
• The method is flexible, efficient, and can deal with texture-less and occluded objects.
• The approach exceeds state-of-the-art RGB-based methods on commonly used metrics.
CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
Given an input image, the network first zooms in on the target object, and then the rotation and translation are disentangled for estimation. Concretely, the rotation is solved by PnP from the predicted 3D coordinates, while the translation is estimated directly from the image.
CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
• First, a fast, lightweight detector (e.g. tiny YOLOv3) is employed for coarse detection.
• Second, a fixed-size segmentation is implemented to extract the object pixels.
• For detection, the pose estimation system can tolerate detection errors to a large extent thanks to the Dynamic Zoom-In (DZI), so a fast but less precise detector is enough.
• For segmentation, it is merged into the coordinate regression to keep it light and fast.
• This two-step pipeline can efficiently extract the exact object region in various situations.
• For translation, to achieve a more robust and accurate estimate, it is predicted from the image rather than from 2D-3D correspondences, to avoid the influence of scale errors in the predicted 3D coordinates.
• Instead of regressing the translation from the whole image, a Scale-Invariant Translation Estimation (SITE) method estimates it from the detected object region.
• In this way, the disentangled processing of rotation and translation is unified into a single network, the Coordinates-based Disentangled Pose Network (CDPN).
CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
Qualitative results for 6-DoF pose estimation and 3D coordinates regression.
DPOD: 6D Pose Object Detector and Refiner
• This paper presents a deep learning method for 3D object detection and 6D pose estimation from RGB images only.
• The method, named DPOD (Dense Pose Object Detector), estimates dense multi-class 2D-3D correspondence maps between an input image and the available 3D models.
• Given the correspondences, a 6DoF pose is computed via PnP and RANSAC.
• An additional RGB pose refinement of the initial pose estimates is performed using a custom deep learning-based refinement scheme.
• Unlike other methods that mainly use real data for training and do not train on synthetic renderings, evaluation is performed with both synthetic and real training data, demonstrating superior results before and after refinement when compared to all recent detectors.
• While being precise, the presented approach is still real-time capable.
DPOD: 6D Pose Object Detector and Refiner
Given an input RGB image, the correspondence block, featuring an encoder-decoder neural network, regresses the object ID mask and the correspondence map. The latter provides explicit 2D-3D correspondences, whereas the ID mask indicates which correspondences should be taken for each detected object. The respective 6D poses are then efficiently computed by the pose block based on PnP + RANSAC.
DPOD: 6D Pose Object Detector and Refiner
• The inference pipeline is divided into two blocks: the correspondence block and the pose block.
• The correspondence block consists of an encoder-decoder CNN with three decoder heads which regress the ID mask and the dense 2D-3D correspondence map from an RGB image of size 320×240×3.
• The encoder part is based on a 12-layer ResNet-like architecture featuring residual layers that allow for faster convergence.
• The decoders upsample the features back to the original size using a stack of bilinear interpolations followed by convolutional layers.
• The pose block is responsible for the pose prediction: given the estimated ID mask, it observes which objects were detected in the image and their 2D locations, whereas the correspondence map maps each 2D point to a coordinate on the actual 3D model.
• The 6D pose is then estimated using the Perspective-n-Point (PnP) method, which estimates the camera pose given the correspondences and the intrinsic parameters of the camera.
DPOD: 6D Pose Object Detector and Refiner
Correspondence model: given a 3D model of interest (1), a 2-channel correspondence texture (2) is applied to it. The resulting correspondence model (3) is then used to generate ground-truth maps and estimate poses. To learn dense 2D-3D correspondences, each model of the dataset is textured with a correspondence map.
Refinement architecture: the network predicts a refined pose given an initial pose proposal. Crops of the real image and the rendering are fed into two parallel branches. The difference of the computed feature tensors is used to estimate the refined pose.
DPOD: 6D Pose Object Detector and Refiner
Qualitative results: poses predicted with the approach on (a) the LINEMOD dataset and (b) the OCCLUSION dataset.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
• Feature-based and template-based methods used to be popular for 6D object pose estimation.
• Feature-based methods rely on distinguishable features and perform badly for texture-poor objects.
• Template-based methods do not work well if objects are partially occluded.
• With deep learning methods showing success for different image-related problem settings, models inspired by or extending them have been used increasingly.
• Symmetric objects pose a particular challenge for orientation estimation, because multiple solutions or manifolds of solutions exist.
• This work introduces ConvPoseCNN, a fully convolutional architecture that avoids cutting out individual objects.
• It puts forward pixel-wise, dense prediction of both the translation and orientation components of the object pose, where the dense orientation is represented in quaternion form.
• It presents different approaches for aggregating the dense orientation predictions, including averaging and clustering schemes (an averaging sketch follows below).
• The dense orientation prediction implicitly learns to attend to occlusion-free, feature-rich object regions.
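One reasonable way to aggregate the dense quaternion predictions for an object (the averaging variant mentioned above) is the standard eigenvector-based quaternion mean, which is insensitive to the q / -q sign ambiguity. This is a generic sketch, not ConvPoseCNN's exact aggregation code; the weights could, for instance, be the pixel-wise segmentation scores.

import numpy as np

def average_quaternions(quats, weights=None):
    # quats: (N, 4) unit quaternions predicted at the pixels of one object
    # weights: optional (N,) per-pixel confidences
    if weights is None:
        weights = np.ones(len(quats))
    # Weighted sum of outer products; the principal eigenvector is the average rotation.
    M = np.einsum('n,ni,nj->ij', weights, quats, quats)
    _, vecs = np.linalg.eigh(M)
    q = vecs[:, -1]  # eigenvector with the largest eigenvalue
    return q / np.linalg.norm(q)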
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
Dense prediction of 6D pose parameters inside ConvPoseCNN. The dense predictions are aggregated at the object level to form the 6D pose outputs.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
ConvPoseCNN
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
• The ConvPoseCNN architecture is derived from PoseCNN and predicts, from RGB images, 6D poses for each object in the image.
• The network starts with the convolutional backbone of VGG16, which extracts features.
• These are subsequently processed in three branches: the fully convolutional segmentation branch that predicts a pixel-wise semantic segmentation, the fully convolutional vertex branch, which predicts a pixel-wise estimate of the center direction and center depth, and the quaternion estimation branch.
• The segmentation and vertex branch results are combined to vote for object centers in a Hough transform layer.
• The Hough layer also predicts bounding boxes for the detected objects.
• PoseCNN then uses these bounding boxes to crop and pool the extracted features, which are fed into a fully connected neural network architecture.
• This fully connected part predicts an orientation quaternion for each bounding box.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
Qualitative results from ConvPoseCNN L2 on the YCB-Video test set. Top: (orange) ground truth bounding boxes, (green) 6D pose predictions. Middle: angular error of the dense quaternion prediction w.r.t. ground truth. Bottom: quaternion prediction norm before normalization.
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
• Current 6D object pose estimation methods usually require a 3D model for each object.
• These methods also require additional training in order to incorporate new objects.
• As a result, they are difficult to scale to a large number of objects and cannot be directly applied to unseen objects.
• This work proposes a framework for 6D pose estimation of unseen objects.
• It designs an end-to-end neural network that reconstructs a latent 3D representation of an object using a small number of reference views of the object.
• Using the learned 3D representation, the network is able to render the object from arbitrary views.
• Using this neural renderer, the pose is directly optimized given an input image.
• By training the network with a large number of 3D shapes for reconstruction and rendering, the network generalizes well to unseen objects.
• A dataset for unseen object pose estimation, MOPED (Model-free Object Pose Estimation Dataset), is presented.
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
The end-to-end differentiable modeling and rendering pipeline used to perform pose estimation with simple gradient updates.
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
• Given a set of N reference images with associated object poses and object segmentation masks, the goal is to construct a representation of the object which can be rendered with arbitrary camera parameters.
• The object is represented as a latent 3D voxel grid that can be directly manipulated using standard 3D transformations, naturally accommodating the requirement of novel-view rendering.
• There are two main components in the reconstruction pipeline: 1) modeling the object by predicting per-view feature volumes and fusing them into a single canonical latent representation; 2) rendering the latent representation to depth and color images.
• The modeling step is inspired by space carving in that the network takes observations from multiple views and leverages multi-view consistency to build a canonical representation.
• The rendering module takes the fused object volume and renders it given arbitrary camera parameters.
• It does so by first rendering depth and then using an image-based rendering approach to produce a color image, preserving high-frequency details through a neural network.
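Because the reconstruction and rendering are differentiable, pose estimation reduces to gradient descent on the pose parameters against the observed image. Below is a minimal PyTorch-style sketch of that idea under assumed interfaces: render_fn is a differentiable function mapping (latent object, 6-vector pose) to a depth map and a mask, target_mask is a float tensor in [0, 1], and the losses and pose parameterization here are illustrative rather than those used in the paper.

import torch

def optimize_pose(latent_obj, render_fn, target_depth, target_mask,
                  init_pose, n_steps=100, lr=1e-2):
    # init_pose: 6-vector (3 rotation parameters, 3 translation components)
    pose = init_pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(n_steps):
        depth, mask = render_fn(latent_obj, pose)
        depth_loss = torch.nn.functional.l1_loss(depth * target_mask, target_depth * target_mask)
        mask_loss = torch.nn.functional.binary_cross_entropy(
            mask.clamp(1e-4, 1 - 1e-4), target_mask)
        loss = depth_loss + mask_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach()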
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
A high-level overview of the architecture. 1) The modeling network takes an image and mask and predicts a feature volume for each input view. The predicted feature volumes are then fused into a single canonical latent object by the fusion module. 2) Given the latent object, the rendering network produces a depth map and a mask for any output camera.
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
[6] X. Deng, A. Mousavian, Y. Xiang, F. Xia, T. Bretl, and D. Fox. "PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking." Robotics: Science and Systems (RSS), 2019.
HybridPose: 6D Object Pose Estimation under Hybrid Representations
• HybridPose, a 6D object pose estimation approach, utilizes a hybrid intermediate representation to express different kinds of geometric information in the input image, including keypoints, edge vectors, and symmetry correspondences.
• Compared to a unitary representation, the hybrid representation allows the pose regression to exploit more, and more diverse, features when one type of predicted representation is inaccurate (e.g., because of occlusion).
• HybridPose leverages a robust regression module to filter out outliers in the predicted intermediate representation.
• All intermediate representations can be predicted by the same simple neural network without sacrificing the overall performance.
• Compared to state-of-the-art pose estimation approaches, HybridPose is comparable in running time and is significantly more accurate.
• The HybridPose code: https://github.com/chensong1995/HybridPose.
HybridPose: 6D Object Pose Estimation under Hybrid Representations
HybridPose predicts keypoints, edge vectors, and symmetry correspondences. (a) input RGB image; (b) red markers denote predicted 2D keypoints; (c) edge vectors are defined by a fully-connected graph among all keypoints; (d) symmetry correspondences connect each 2D pixel on the object to its symmetric counterpart.
HybridPose: 6D Object Pose Estimation under Hybrid Representations
• The input to HybridPose is an image containing an object of a known class, taken by a pinhole camera with known intrinsic parameters.
• Assuming that the class of objects has a canonical coordinate system (i.e. the 3D point cloud), HybridPose outputs the 6D camera pose of the imaged object in that system.
• HybridPose consists of a prediction module and a pose regression module.
• HybridPose utilizes three prediction networks to estimate a set of keypoints, a set of edges between keypoints, and a set of symmetry correspondences between image pixels.
• The keypoint network employs an off-the-shelf prediction network, PVNet.
• The edge network predicts edge vectors along a pre-defined graph of keypoints, which stabilizes pose regression when keypoints are cluttered in the input image.
• The symmetry network predicts symmetry correspondences that reflect the underlying (partial) reflection symmetry (an extension of FlowNet 2.0).
• The pose regression module optimizes the object pose to fit the output of the three prediction networks (similar to the P3P solver, following the EPnP framework).
HybridPose: 6D Object Pose Estimation under Hybrid Representations
HybridPose consists of intermediate-representation prediction networks and a pose regression module. The prediction networks take an image as input and output predicted keypoints, edge vectors, and symmetry correspondences. The pose regression module consists of an initialization sub-module and a refinement sub-module. The initialization sub-module solves a linear system with the predicted intermediate representations to obtain an initial pose. The refinement sub-module utilizes the GM robust norm in the optimization to obtain the final pose prediction.
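For the refinement step, assuming "GM" refers to the Geman-McClure robust kernel, the residuals of the hybrid constraints are scored with rho(r) = r^2 / (r^2 + mu^2), whose influence saturates for large residuals so that outlier predictions barely affect the optimization. A small sketch of the kernel and its equivalent IRLS weight (mu is an arbitrary scale parameter here):

import numpy as np

def gm_rho(r, mu=5.0):
    # Geman-McClure robust norm: bounded above by 1, approximately (r/mu)^2 for small r.
    return (r * r) / (r * r + mu * mu)

def gm_irls_weight(r, mu=5.0):
    # Weight for iteratively reweighted least squares (proportional to rho'(r)/r):
    # large residuals receive a near-zero weight and are effectively ignored.
    return (mu * mu) / (r * r + mu * mu) ** 2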
HybridPose: 6D Object Pose Estimation under Hybrid Representations
HybridPose handles situations where the object has no occlusion (a, d, f, h), light occlusion (b, c), and severe occlusion (e, g).
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
• Estimating a 6DoF object pose from a single image is very challenging due to occlusions or texture-less appearances.
• Vector-field based keypoint voting has demonstrated its effectiveness and superiority in tackling those issues.
• However, direct regression of vector-fields neglects that the distances between pixels and keypoints also affect the deviations of the hypotheses dramatically.
• In other words, small errors in direction vectors may generate severely deviated hypotheses when pixels are far away from a keypoint.
• This paper aims to reduce such errors by incorporating the distances between pixels and keypoints into the objective.
• To this end, it develops a differentiable proxy voting loss (DPVL) which mimics the hypothesis selection in the voting procedure.
• By exploiting the voting loss, the network can be trained in an end-to-end manner.
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
Differentiable Proxy Voting Loss (DPVL) illustration: provided that the estimation errors of the direction vectors are the same (e.g., α), the distance between a pixel and a keypoint affects the closeness between a hypothesis and the keypoint. DPVL minimizes the distance d⋆ between a proxy hypothesis fk(p⋆) and a keypoint ki to achieve accurate hypotheses for keypoint voting.
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
• This work focuses on obtaining an accurate initial pose estimate.
• In particular, the method is designed to localize and estimate the orientations and translations of an object accurately without any refinement.
• The object pose is represented by a rigid transformation from the object coordinate system to the camera coordinate system.
• Since voting-based methods have demonstrated their robustness to occlusions and view changes, the voting-based pose estimation pipeline is followed here.
• Specifically, the method first votes for the 2D positions of the object keypoints from the vector-fields and then estimates the 6DoF pose by solving a PnP problem.
• Prior works regress pixel-wise vector-fields with an l1 loss.
• However, small errors in the vector-fields may lead to large deviation errors of the hypotheses, because the loss does not take the distance between a pixel and a keypoint into account.
• Therefore, a differentiable proxy voting loss (DPVL) is presented to reduce such errors by mimicking the hypothesis selection in the voting procedure.
• Furthermore, benefiting from DPVL, the network is able to converge much faster.
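The core of a DPVL-style term is the perpendicular distance from the ground-truth keypoint to the line through a pixel along its predicted direction: the proxy hypothesis is the foot of that perpendicular, so the distance grows with both the angular error and the pixel-keypoint distance. The sketch below computes these per-pixel distances; wrapping them in a smooth-l1 (or l1) penalty gives a loss in the spirit of DPVL, not the paper's exact formulation.

import numpy as np

def proxy_voting_distances(pixels, unit_dirs, keypoint):
    # pixels: (N, 2) pixel locations p; unit_dirs: (N, 2) predicted unit vectors v at those pixels
    # keypoint: (2,) ground-truth keypoint k
    # Distance from k to the line {p + t*v}: |v_perp . (k - p)| with v_perp = (-vy, vx).
    diff = keypoint[None, :] - pixels
    v_perp = np.stack([-unit_dirs[:, 1], unit_dirs[:, 0]], axis=1)
    return np.abs(np.sum(v_perp * diff, axis=1))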
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
The system pipeline
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
Qualitative results of pose estimation on the LINEMOD dataset.
Qualitative results on the Occlusion LINEMOD dataset.