Object Pose Estimation from
RGB Images by Deep Learning
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline (obsolete)
• PoseCNN: A CNN for 6D Object Pose Estimation in Cluttered Scenes
• BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of
Challenging Objects without Using Depth
• SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again
• Real-Time Seamless Single Shot 6D Object Pose Prediction
• Implicit 3D Orientation Learning for 6D Object Detection from RGB Images
• Vehicle Detection and Pose Estimation for Autonomous Driving (Thesis)
• A mixed classification-regression framework for pose estimation from images
• Classification and Pose Estimation of Vehicles in Videos by 3D Modeling within Discrete-
Continuous Optimization
• Improved Object Detection and Pose Using Part-Based Models
• Object Detection and Viewpoint Estimation with a Deformable 3D Model
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
• Estimating the 6D pose of known objects is important for robots to interact with the real world.
• The problem is challenging due to the variety of objects as well as the complexity of a scene
caused by clutter and occlusions between objects.
• This work introduces PoseCNN for 6D pose estimation, which estimates the 3D translation of
an object by localizing its center in the image and predicting its distance from the camera.
• The 3D rotation of the object is estimated by regressing to a quaternion representation.
• It also introduces a loss function that enables PoseCNN to handle symmetric objects.
• It builds a large scale video dataset for 6D object pose estimation, called YCB-Video dataset.
• This dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92
videos with 133,827 frames.
• The code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
The PoseCNN is trained to perform three tasks: semantic labeling, 3D translation
estimation, and 3D rotation regression.
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
• The PoseCNN network contains two stages.
• The 1st stage consists of 13 conv. layers and 4 max-pooling layers, which extract feature
maps at different resolutions from the input image. This stage is the network backbone,
since the extracted features are shared across all the tasks performed by the network.
• The 2nd stage consists of an embedding step that embeds the high-dimensional feature
maps generated by the first stage into low-dimensional, task-specific features. Then the
network performs 3 different tasks that lead to the 6D pose estimation, i.e., semantic
labeling (a variation of FCN), 3D translation estimation, and 3D rotation regression.
• It estimates the 3D translation by localizing the 2D object center in the image and estimating
the object's distance from the camera.
• The network regresses a center direction for each pixel in the image, and a Hough voting
layer finds the 2D center of each object (see the voting sketch after this list).
• Using the object bounding boxes predicted from the Hough voting layer, it utilizes two RoI
pooling layers to “crop and pool” the visual features generated by the 1st stage of the network
for the 3D rotation regression.
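To make the voting step concrete, here is a minimal NumPy sketch of pixel-wise center voting in the spirit of PoseCNN's Hough voting layer; the function name, the ray-marching parameters, and the simple argmax readout are illustrative assumptions, not the actual layer.

```python
import numpy as np

def hough_vote_center(directions, mask, step=1.0, n_steps=300):
    """Sketch of pixel-wise center voting (not the actual PoseCNN Hough layer).

    directions: (H, W, 2) unit vectors predicted at each pixel, pointing toward the object center.
    mask: (H, W) boolean semantic mask for one object class.
    Returns the (x, y) image location receiving the most votes.
    """
    H, W, _ = directions.shape
    votes = np.zeros((H, W), dtype=np.int32)
    ys, xs = np.nonzero(mask)
    for x, y, (dx, dy) in zip(xs, ys, directions[ys, xs]):
        # March along the predicted ray and vote at every pixel it crosses.
        for t in np.arange(0.0, n_steps * step, step):
            vx, vy = int(round(x + t * dx)), int(round(y + t * dy))
            if 0 <= vx < W and 0 <= vy < H:
                votes[vy, vx] += 1
    cy, cx = np.unravel_index(np.argmax(votes), votes.shape)
    return cx, cy
```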
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
Architecture of PoseCNN
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
The 3D translation can be estimated by localizing the
2D center of the object and estimating the 3D center
distance from the camera.
Each pixel casts votes for 2-D image
locations along the ray predicted
from the network.
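Given the voted 2D center and a predicted distance, the 3D translation follows from standard pinhole back-projection. A small sketch, assuming the distance output is the depth Tz along the optical axis (as in PoseCNN):

```python
import numpy as np

def translation_from_center(center_px, Tz, K):
    """Back-project the voted 2D object center to the 3D translation.

    center_px: (cx, cy) center in pixels; Tz: predicted depth along the optical axis;
    K: 3x3 camera intrinsic matrix. T = Tz * K^-1 [cx, cy, 1]^T.
    """
    cx, cy = center_px
    ray = np.linalg.inv(K) @ np.array([cx, cy, 1.0])  # viewing ray through the center (z-component = 1)
    return Tz * ray
```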
PoseCNN: A Convolutional Neural Network for 6D
Object Pose Estimation in Cluttered Scenes
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
• This paper introduces a method for 3D object detection and pose estimation from color
images only.
• It first uses segmentation to detect the objects of interest in 2D even in presence of partial
occlusions and cluttered background.
• In contrast with recent patch-based methods, it relies on a “holistic” approach: it applies to
the detected objects a CNN trained to predict their 3D poses in the form of 2D projections
of the corners of their 3D bounding boxes (see the PnP sketch after this list).
• This, however, is not sufficient for handling objects from the recent T-LESS dataset: These
objects exhibit an axis of rotational symmetry, and the similarity of two images of such an
object under two different poses makes training the CNN challenging.
• It solves this problem by restricting the range of poses used for training, and by introducing
a classifier to identify the range of a pose at run-time before estimating it.
• It also uses an optional additional step that refines the predicted poses.
• The full approach is also scalable, as a single network can be trained for multiple objects
simultaneously.
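The corner-projection representation reduces pose recovery to a classical PnP problem. A minimal sketch using OpenCV's solver; the EPnP flag and float32 casts are implementation choices, not taken from the paper.

```python
import numpy as np
import cv2

def pose_from_bbox_corners(corners_2d, corners_3d, K):
    """Recover a 6D pose from the predicted 2D projections of the 3D bounding-box corners.

    corners_2d: (8, 2) CNN-predicted image projections; corners_3d: (8, 3) bounding-box corners
    in the object frame; K: 3x3 intrinsics. BB8 solves exactly this kind of 2D-3D
    correspondence problem with a PnP algorithm.
    """
    ok, rvec, tvec = cv2.solvePnP(corners_3d.astype(np.float32),
                                  corners_2d.astype(np.float32),
                                  K.astype(np.float32), None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix and translation vector of the object
    return R, tvec
```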
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
Localization: (a) The input image is resized to 512 × 384 and split into regions of size 128 × 128. (b)
Each region is first segmented into a binary mask of 8 × 8 for each possible object o. (c) Only the
largest component is kept if several components are present; the active locations are then segmented
more finely. (d) The centroid of the final segmentation is used as the 2D object center.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
• It first finds the objects in 2D, then obtains a first estimate of the 3D poses, including for
objects with a rotational symmetry, and finally refines the initial pose estimates.
• It identifies the 2D centers of the objects of interest in the input images.
• It could use a standard 2D object detector, but it develops an approach based on
segmentation that resulted in better performance as it can provide accurate locations even
under partial occlusions.
• It predicts the 3D pose of an object by applying a Deep Network to an image window
centered on the 2D object center.
• As for the segmentation, it uses VGG as a basis for this network, which allows it to handle all
the objects of the target dataset with a single network.
• For an object with an angle of symmetry α, it can therefore restrict the poses used for
training to the poses where the angle of rotation around the symmetry axis is within the
range [0; α], to avoid the ambiguity between images.
• Denote by β the rotation angle, and introduce the intervals r1 = [0, α/2) and r2 = [α/2, α).
• To avoid ambiguity, restrict β to be in r1 for the training images used in the optimization.
• It introduces a CNN classifier to predict at run-time whether β is in r1 or r2 (see the sketch after this list).
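A sketch of the run-time handling of a rotationally symmetric object with α = 180°, assuming hypothetical callables `range_classifier` (returns True when β is in r2) and `corner_net` (trained only on poses with β in r1); this mirrors the image/projection flipping described above rather than the authors' exact implementation.

```python
import numpy as np

def predict_corners_with_symmetry(img, corner_net, range_classifier):
    """Predict 2D bounding-box corner projections for an object with 180-degree symmetry."""
    W = img.shape[1]
    in_second_range = range_classifier(img)             # is the rotation angle in r2?
    inp = img[:, ::-1] if in_second_range else img      # mirror so the pose falls in the trained range r1
    corners = np.asarray(corner_net(inp), dtype=np.float64)  # (8, 2) predicted projections
    if in_second_range:
        corners[:, 0] = (W - 1) - corners[:, 0]          # mirror the predicted projections back
    return corners
```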
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
Objects with symmetry of rotation: object #5 of T-LESS has an angle of symmetry α of 180◦, ignoring
the small screw and electrical contact. If restricting the range of poses in the training set
between 0◦ (a) and 180◦ (b), pose estimation still fails for test samples with an angle of rotation close
to 0◦ modulo 180◦ (c). The solution is to restrict the range during training to be between 0◦ and 90◦. It
uses a classifier to detect if the pose in an input image is between 90◦ and 180◦. If this is the case (d),
it mirrors the input image (e), and mirrors back the predicted projections for the corners (f).
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
Refining the pose. Given a first pose estimate,
shown by the blue bounding box (a), it generates a
binary mask (b) or a color rendering (c) of the object.
Given the input image and this mask or rendering, it
can predict an update that improves the object pose,
shown by the red bounding box (d).
Two generated training images for different
objects from the LINEMOD dataset. The object
is shifted from the center to handle the
inaccuracy of the detection method, and the
background is random to make sure that the
network cannot exploit the context specific to
the dataset.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for
Predicting the 3D Poses of Challenging Objects without Using Depth
First row: LINEMOD dataset; Second row:
Occlusion dataset; Third row: T-LESS dataset
(for objects of revolution, we represent the
pose with a cylinder rather than a box); Last
row: Some failure cases.
Outline
• PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation (2018.12)
• SilhoNet: An RGB Method for 6D Object Pose Estimation (2019.6)
• Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation (2019.8)
• DeepIM: Deep Iterative Matching for 6D Pose Estimation (2019.10)
• Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction (2019.10)
• CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF
Object Pose Estimation (ICCV, 2019)
• DPOD: 6D Pose Object Detector and Refiner (ICCV, 2019)
• ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation (2019.12)
• LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object
Pose Estimation (2019.12)
• HybridPose: 6D Object Pose Estimation under Hybrid Representations (2020.1)
• 6DoF Object Pose Estimation via Differentiable Proxy Voting Loss (2020.2)
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
• This paper addresses the challenge of 6DoF pose estimation from a single RGB image under
severe occlusion or truncation.
• Many recent works have shown that a two-stage approach, which first detects keypoints and
then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable
performance.
• However, most of these methods only localize a set of sparse keypoints by regressing their
image coordinates or heatmaps, which are sensitive to occlusion and truncation.
• It introduces a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to
the keypoints and use these vectors to vote for keypoint locations using RANSAC (see the voting sketch after this list).
• This creates a flexible representation for localizing occluded or truncated keypoints.
• Another important feature of this representation is that it provides uncertainties of keypoint
locations that can be further leveraged by the PnP solver.
• Experiments show that the proposed approach outperforms the state of the art on the
LINEMOD, Occlusion LINEMOD, and YCB-Video datasets by a large margin, while being
efficient for real-time pose estimation.
• The code will be available at https://zju-3dv.github.io/pvnet/.
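A simplified NumPy sketch of the RANSAC-style voting for a single keypoint: hypotheses come from intersecting pairs of pixel rays and are scored by how many pixels' predicted directions agree with them. The real PVNet additionally weights hypotheses and fits a spatial covariance for the PnP solver; the counts and threshold below are illustrative.

```python
import numpy as np

def ransac_keypoint(pixels, dirs, n_hyp=256, thresh=0.99):
    """Vote for one keypoint location from pixel-wise direction predictions.

    pixels: (N, 2) foreground pixel coordinates; dirs: (N, 2) predicted unit vectors pointing
    from each pixel toward the keypoint. Returns the best-supported hypothesis.
    """
    rng = np.random.default_rng(0)
    best_kp, best_votes = None, -1
    for _ in range(n_hyp):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect the rays p_i + t*d_i and p_j + s*d_j (2x2 linear system).
        A = np.stack([dirs[i], -dirs[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-6:
            continue
        t, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        kp = pixels[i] + t * dirs[i]
        # A pixel votes for the hypothesis if its direction agrees with the direction to kp.
        to_kp = kp - pixels
        to_kp /= (np.linalg.norm(to_kp, axis=1, keepdims=True) + 1e-9)
        votes = np.sum(np.sum(to_kp * dirs, axis=1) > thresh)
        if votes > best_votes:
            best_kp, best_votes = kp, votes
    return best_kp
```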
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
The 6D pose estimation problem is formulated as a Perspective-n-Point (PnP) problem, which
requires correspondences between 2D and 3D keypoints.
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
Overview of the keypoint localization: (a) An image. (b) The architecture of PVNet. (c) Pixel-wise unit vectors
pointing to the object keypoints. (d) Semantic labels. (e) Hypotheses of the keypoint locations generated by
voting. (f) Probability distributions of the keypoint locations estimated from hypotheses.
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
SilhoNet: An RGB Method for 6D Object Pose Estimation
• Autonomous robot manipulation involves estimating the translation and orientation of the
object to be manipulated as a 6-degree-of-freedom (6D) pose.
• Methods using RGB-D data have shown great success in solving this problem.
• However, there are situations where cost constraints or the working environment may limit
the use of RGB-D sensors.
• When limited to mono camera data only, the problem of pose estimation is very challenging.
• Knowing how the target object is occluded in the image is important for certain applications,
such as AR, where it is desirable to project over only the visible portion of an object.
• This work introduces SilhoNet, an RGB-based deep learning method that predicts the 6D object
pose from monocular images.
• It uses a CNN pipeline that takes in ROI proposals to predict an intermediate silhouette
representation for objects with an associated occlusion mask and a 3D translation vector.
• The 3D orientation is then regressed from the predicted silhouettes.
SilhoNet: An RGB Method for 6D Object Pose Estimation
The SilhoNet pipeline for
silhouette prediction and
6D object pose estimation
SilhoNet: An RGB Method for 6D Object Pose Estimation
• The 3D orientation is predicted from an intermediate un-occluded silhouette representation.
• The method also predicts an occlusion mask which can be used to determine which parts of
the object model are visible in the image.
• The method operates in two stages, first predicting an intermediate silhouette
representation and occlusion mask of an object along with a vector describing the 3D
translation and then regressing the 3D orientation quaternion from the predicted silhouette.
• The input to the network is an RGB image with ROI proposals for detected objects and the
associated class labels.
• The 1st stage uses a VGG16 backbone with deconvolution layers at the end to produce a
feature map from the RGB input image. (This is the same as used in PoseCNN)
• Extracted features from the input image are concatenated with features from a set of
rendered object viewpoints and then passed through 3 network branches, two of which have
identical structure to predict a full un-occluded silhouette and an occlusion mask.
• The 3rd branch predicts a 3D vector encoding the object center in pixel coordinates and the
range of the object center from the camera.
• The 2nd stage of the network passes the predicted silhouette through a ResNet-18
architecture with two FCLs at the end to output an L2-normalized quaternion, representing
the 3D orientation.
SilhoNet: An RGB Method for 6D Object Pose Estimation
• A Faster R-CNN from Tensorpack is trained on the YCB-Video dataset to predict ROI proposals;
• For each class, it renders a set of 12 viewpoints of the object model at size 224×224;
• The 1st stage of the network predicts an intermediate silhouette representation of the object
as a 64×64 binary mask.
• This silhouette represents the full un-occluded visual hull of the object as though it were
rendered with the same 3D orientation but centered in the frame.
• The size of the silhouette in the frame is invariant to the scale of the object in the image and
is determined by a fixed distance of the object from the camera at which the silhouette
appears to be rendered.
• This distance is chosen for each object so that the silhouette just fits within the frame for any
3D orientation.
• Given the smallest field of view of the camera, A, and the object size as width, height, and
depth (w, h, d), the render distance r is chosen so that the object's bounding volume fits within the field of view.
SilhoNet: An RGB Method for 6D Object Pose Estimation
• If an object at a given range is shifted along the arc with the camera center as the focus, the
Z coordinate will change while the object appearance in the shifted ROI will be unchanged.
• By predicting the object range rather than directly regressing the Z coordinate, this method
does not suffer from ambiguities and can recover the Z coordinate with good accuracy.
• Given the camera focal length f, the pixel coordinates of the object center (px, py) with
respect to the image center, and the range r of the object center from the camera center,
similar triangles can be used to recover the 3D object translation (X, Y, Z) (see the sketch
after this list).
• Given an ROI with lower x and y coordinate bounds (bx, by), the coordinates of the image
principal point (cx, cy), and the predicted normalized output from the network (nx, ny), the
object center pixel coordinates (px, py) are recovered by mapping the normalized prediction back into image coordinates.
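A sketch of the similar-triangles recovery of (X, Y, Z) from the predicted range r and the center offset (px, py): the viewing ray through the center has direction (px/f, py/f, 1), and scaling its unit vector by r yields the translation. This reproduces the geometric relation described above; the paper's exact equations are expressed in the same quantities.

```python
import numpy as np

def translation_from_range(px, py, r, f):
    """Recover the 3D translation from the object-center pixel offset and the predicted range.

    (px, py): center in pixels relative to the principal point; r: distance of the object center
    from the camera center; f: focal length in pixels.
    """
    ray = np.array([px / f, py / f, 1.0])   # direction of the viewing ray through the center
    t = r * ray / np.linalg.norm(ray)       # scale the unit ray by the predicted range
    return t                                 # (X, Y, Z); Z = r / sqrt(1 + (px/f)^2 + (py/f)^2)
```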
SilhoNet: An RGB Method for 6D Object Pose Estimation
• The network predicts the apparent orientation as though the ROI were extracted from the
center of the image.
• Given the predicted object translation, the true orientation is recovered by applying a pitch
δθ and roll δφ adjustment, computed from the viewing-ray angles of the object center, to the predicted orientation.
• The network’s 2nd stage takes in the predicted silhouette probability maps, thresholded at
some value into binary masks, and outputs a quaternion prediction for the object orientation.
• This stage is composed of a ResNet-18 backbone, with the layers from the average pooling and
below replaced with two fully connected layers.
• It constructs a transform matrix T with a z translation equal to the render distance r for the
corresponding object class and the x and y translation components set to 0.
• The rotation is formed from the predicted apparent orientation. With this transform and the
camera intrinsics K, each vertex of the object model can be projected onto the occlusion mask,
which is scaled up to fit the minimum dimension of the input image.
SilhoNet: An RGB Method for 6D Object Pose Estimation
Example prediction of occluded and un-occluded silhouettes from a test image
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
• Estimating the 6D pose of objects using only RGB images remains challenging because of
problems such as occlusion and symmetries.
• It is also difficult to construct 3D models with precise texture without expert knowledge or
specialized scanning devices.
• To address these problems, it proposes a pose estimation method, Pix2Pose, that predicts
the 3D coordinates of each object pixel without textured models.
• An auto-encoder architecture is designed to estimate the 3D coordinates and errors per pixel.
• These pixel-wise predictions are then used in multiple stages to form 2D-3D
correspondences to directly compute poses with the PnP algorithm with RANSAC iterations (see the sketch after this list).
• This method is robust to occlusion by leveraging recent achievements in generative
adversarial training to precisely recover occluded parts.
• Furthermore, a loss function, the transformer loss, is proposed to handle symmetric objects
by guiding predictions to the closest symmetric pose.
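The last step of the pipeline is an ordinary PnP + RANSAC solve over the predicted per-pixel coordinates. A minimal sketch, assuming the coordinate map has already been de-normalized to metric object coordinates and that unreliable pixels are dropped with an (illustrative) error threshold.

```python
import numpy as np
import cv2

def pose_from_coordinate_map(coord_map, error_map, mask, K, err_thresh=0.1):
    """Build 2D-3D correspondences from per-pixel 3D coordinate predictions and solve PnP.

    coord_map: (H, W, 3) predicted object-frame coordinates; error_map: (H, W) predicted
    per-pixel error; mask: (H, W) boolean valid-object mask; K: 3x3 intrinsics.
    """
    ys, xs = np.nonzero(mask & (error_map < err_thresh))
    pts_2d = np.stack([xs, ys], axis=1).astype(np.float32)
    pts_3d = coord_map[ys, xs].astype(np.float32)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K.astype(np.float32), None,
                                                 iterationsCount=100, reprojectionError=3.0)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec, inliers
```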
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
An overview of the architecture of Pix2Pose and the training pipeline.
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
• Pix2Pose predicts the 3D coordinates of individual pixels using a cropped region for an object.
• The robust estimation is established by recovering 3D coordinates of occluded parts and
using all pixels of an object for pose prediction.
• A single network is trained and used for each object class.
• The texture of a 3D model is not necessary for training and inference.
• The network input is a cropped image Is obtained using the bounding box of a detected object class.
• The network outputs are normalized 3D coordinates of each pixel in the object coordinate system,
together with estimated errors of each prediction, from the Pix2Pose network.
• The target output includes coordinate predictions of occluded parts, which makes the
prediction more robust to partial occlusion.
• Since a coordinate consists of three values similar to RGB values in an image, the output can
be regarded as a color image.
• Therefore, the ground truth output is easily derived by rendering the colored coordinate
model in the ground truth pose.
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
An example of the pose estimation process. An image and 2D detection results are the input. In the 1st stage, the
predicted results are used to specify important pixels and adjust bounding boxes while removing backgrounds and
uncertain pixels. In the 2nd stage, pixels with valid coordinate values and small error predictions are used to
estimate poses using the PnP algorithm with RANSAC. Green and blue lines in the result represent 3D bounding
boxes of objects in ground truth poses and estimated poses.
Pix2Pose: Pixel-Wise Coordinate Regression of
Objects for 6D Pose Estimation
Accurate 6D Object Pose Estimation by Pose
Conditioned Mesh Reconstruction
• Current 6D object pose methods consist of deep CNN models fully optimized for a single
object but with their architecture standardized across objects with different shapes.
• This work explicitly exploits each object's distinct topological information, i.e., dense 3D meshes,
in the pose estimation model, prior to any post-processing refinement stage.
• In order to achieve this, it proposes a learning framework in which a Graph Convolutional
Neural Network reconstructs a pose conditioned 3D mesh of the object.
• An estimate of the allocentric orientation is recovered by computing, in a differentiable
manner, the Procrustes alignment between the canonical and reconstructed dense 3D meshes (see the sketch after this list).
• 6D egocentric pose is lifted using additional mask and 2D centroid projection estimations.
• This method is capable of self-validating its pose estimation by measuring the quality of the
reconstructed mesh, which is invaluable in real-life applications.
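The orientation-from-mesh step is an orthogonal Procrustes (Kabsch) alignment, which is differentiable through the SVD. A minimal PyTorch sketch over corresponding vertex sets; the paper's full pipeline also handles centering/translation and operates on its reconstructed dense meshes.

```python
import torch

def procrustes_rotation(canonical, reconstructed):
    """Differentiable orthogonal Procrustes alignment between two corresponding vertex sets.

    canonical, reconstructed: (N, 3) tensors of corresponding mesh vertices. Returns the
    rotation R that best maps the centered canonical mesh onto the centered reconstruction.
    """
    a = canonical - canonical.mean(dim=0, keepdim=True)       # center both vertex sets
    b = reconstructed - reconstructed.mean(dim=0, keepdim=True)
    H = a.t() @ b                                              # 3x3 cross-covariance
    U, S, Vh = torch.linalg.svd(H)
    d = torch.sign(torch.det(Vh.t() @ U.t()))                  # fix a possible reflection
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    return Vh.t() @ D @ U.t()                                  # rotation with det(R) = +1
```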
Accurate 6D Object Pose Estimation by Pose
Conditioned Mesh Reconstruction
The pipeline where it fully exploits the object shape topology both in 2D and 3D for 6D pose estimation
Accurate 6D Object Pose Estimation by Pose
Conditioned Mesh Reconstruction
• Given a monocular RGB input image, the goal is to estimate full 6D pose of a rigid object.
• It aims to design a distinct per object architecture in an automated manner by taking full
advantage of prior information of the object.
• The reconstruction stage combines the use of the object’s known topology with encoded
pose information extracted from the image.
• The estimated mesh info is used to recover the allocentric orientation of the target object.
• Egocentric orientation can be recovered and lifted to 6D by adopting different approaches.
• It uses a pretrained Faster R-CNN based 2D object detector and fine-tunes the model on
training data in order to detect an object in 2D space.
• The detector is used to crop an object ROI for further processing; the ROI is used at high
resolution to extract fine details of the object's appearance in the next stages of the pipeline.
• This ad hoc detector is trained independently.
Accurate 6D Object Pose Estimation by Pose
Conditioned Mesh Reconstruction
Qualitative results obtained with this method
DeepIM: Deep Iterative Matching for 6D Pose Estimation
• While several recent techniques have used depth cameras for object pose estimation, such
cameras have limitations with respect to frame rate, field of view, resolution, and depth
range, making it very difficult to detect small, thin, transparent, or fast moving objects.
• Estimating 6D poses of objects from images is an important problem in various applications
such as robot manipulation and virtual reality.
• While direct regression of images to object poses has limited accuracy, matching rendered
images of an object against the input image can produce accurate results.
• This work proposes a deep neural network for 6D pose matching named DeepIM.
• Given an initial pose estimation, this network is able to iteratively refine the pose by
matching the rendered image against the observed image.
• The network is trained to predict a relative pose transformation using a disentangled
representation of 3D location and 3D orientation and an iterative training process.
• DeepIM is able to match previously unseen objects.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
DeepIM, a deep iterative matching network for 6D object pose estimation. The network is trained to predict a
relative SE(3) transformation that can be applied to an initial pose estimation for iterative pose refinement.
Given a 6D pose estimation of an object, either from PoseCNN or the refined pose from the previous iteration, along
with the 3D model of the object, it generates the rendered image showing the appearance of the target object
under this rough pose estimation. From the pair of rendered and observed images, the network
predicts a relative transformation which can be applied to refine the input pose (a sketch of this loop follows).
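The render-and-compare loop itself is simple; a sketch under the assumption that `deepim`, `renderer`, and 4x4 pose matrices are available as shown (names are illustrative). Note that the actual DeepIM update uses a disentangled rotation/translation representation rather than a plain SE(3) matrix product.

```python
def refine_pose(observed_img, initial_pose, model_3d, deepim, renderer, n_iters=4):
    """Iterative matching: render at the current estimate, predict a relative update, compose, repeat.

    initial_pose: 4x4 object pose (e.g., from PoseCNN); deepim maps (observed, rendered) images
    to a relative 4x4 transform; renderer renders the 3D model under a given pose.
    """
    pose = initial_pose
    for _ in range(n_iters):
        rendered = renderer(model_3d, pose)       # object appearance under the current rough estimate
        delta = deepim(observed_img, rendered)    # predicted relative transformation
        pose = delta @ pose                       # apply the update (simplified composition)
    return pose
```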
DeepIM: Deep Iterative Matching for 6D Pose Estimation
• The observed image, the rendered image, and the two masks, are concatenated into an 8-
channel tensor input to the network (3 channels for observed/rendered image, 1 channel for
each mask).
• It uses the FlowNetSimple architecture as the backbone network, which is trained to predict
optical flow between two images.
• It tried using the VGG16 image classification network as the backbone network, but the
results were very poor, confirming the intuition that a representation related to optical flow
is very useful for pose matching.
• The pose estimation branch takes the feature map after 10 convolution layers from
FlowNetSimple as input.
• It contains two fully-connected layers each with dimension 256, followed by two additional
fully-connected layers for predicting the quaternion of the 3D rotation and the 3D
translation, respectively.
• During training, two auxiliary branches are added to regularize the feature representation of the
network and increase training stability and performance.
• One branch is trained to predict optical flow between the rendered and the observed image, and
the other branch to predict the foreground mask of the object in the observed image.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
DeepIM uses a FlowNetSimple backbone to predict a relative SE(3) transformation that matches the
observed and rendered images of an object. Taking the observed image, the rendered image, and their
corresponding masks as input, the conv. layers output a feature map which is then forwarded
through several FCLs to predict the translation and rotation. The same feature map, combined with
feature maps from the previous layers, is also used to predict the optical flow and foreground mask during training.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
Pose refinement results on the Occlusion LINEMOD dataset
CDPN: Coordinates-Based Disentangled Pose Network for
Real-Time RGB-Based 6-DoF Object Pose Estimation
• 6-DoF object pose estimation from a single RGB image is a fundamental and long-standing
problem in computer vision.
• Current leading approaches solve it by training deep networks to either regress both
rotation and translation from image directly or to construct 2D-3D correspondences and
further solve them via PnP indirectly.
• It argues that rotation and translation should be treated differently because of their significant differences.
This work proposes a novel 6-DoF pose estimation approach: the Coordinates-based
Disentangled Pose Network (CDPN), which disentangles the pose to predict rotation and
translation separately to achieve highly accurate and robust pose estimation.
• This method is flexible and efficient, and can deal with texture-less and occluded objects.
• This approach exceeds the state-of-the-art RGB-based methods on commonly used metrics.
CDPN: Coordinates-Based Disentangled Pose Network for
Real-Time RGB-Based 6-DoF Object Pose Estimation
Given an input image, it first zooms in on the target object, and then the rotation and translation are
disentangled for estimation. Concretely, the rotation is solved by PnP from the predicted 3D coordinates,
while the translation is estimated directly from the image.
CDPN: Coordinates-Based Disentangled Pose Network for
Real-Time RGB-Based 6-DoF Object Pose Estimation
• First, a fast, lightweight detector (e.g. tiny YOLOv3) is employed for coarse detection;
• Second, a fixed size segmentation is implemented to extract the object pixels.
• For detection, the pose estimation system can tolerate detection errors to a large extent
owing to the Dynamic Zoom-In (DZI), so a fast but less-precise detector is enough.
• For segmentation, it is merged into the coordinate regression to keep it light and fast.
• This two-step pipeline can efficiently extract the exact object region in various situations.
• In terms of translation, to achieve more robust and accurate estimation, it predicts the translation from
the image instead of from 2D-3D correspondences, to avoid the influence of scale errors in
the predicted 3D coordinates.
• Instead of regressing translation from the whole image, a Scale-Invariant Translation
Estimation (SITE) method estimates it from the detected object region.
• In this way, the disentangled processes regarding rotation and translation are unified into a
single network, namely Coordinates-based Disentangled Pose Network (CDPN).
CDPN: Coordinates-Based Disentangled Pose Network for
Real-Time RGB-Based 6-DoF Object Pose Estimation
Qualitative results for 6-DoF pose estimation and 3D coordinates regression.
DPOD: 6D Pose Object Detector and Refiner
• This paper presents a deep learning method for 3D object detection and 6D pose estimation
from RGB images only.
• This method, named DPOD (Dense Pose Object Detector), estimates dense multi-class
2D-3D correspondence maps between an input image and available 3D models.
• Given the correspondences, a 6DoF pose is computed via PnP and RANSAC.
• An additional RGB pose refinement of the initial pose estimates is performed using a custom
deep learning-based refinement scheme.
• Unlike other methods that mainly use real data for training and do not train on synthetic
renderings, it performs evaluation on both synthetic and real training data, demonstrating
superior results before and after refinement when compared to all recent detectors.
• While being precise, the presented approach is still real-time capable.
DPOD: 6D Pose Object Detector and Refiner
Given an input RGB image, the correspondence block, featuring an encoder-decoder neural network, regresses the
object ID mask and the correspondence map. The latter one provides us with explicit 2D-3D correspondences,
whereas the ID mask estimates which correspondences should be taken for each detected object. The respective
6D poses are then efficiently computed by the pose block based on PnP + RANSAC.
DPOD: 6D Pose Object Detector and Refiner
• The inference pipeline is divided into two blocks: the correspondence and the pose block.
• The correspondence block consists of an encoder-decoder CNN with three decoder heads
which regress the ID mask and dense 2D-3D correspondence map from an RGB image of
size 320×240×3.
• The encoder part is based on a 12-layer ResNet-like architecture featuring residual layers
that allow for faster convergence.
• The decoders upsample the feature map to its original size using a stack of bilinear
interpolations followed by convolutional layers.
• The pose block is responsible for the pose prediction: Given the estimated ID mask, it
observes which objects were detected in the image and their 2D locations, whereas the
correspondence map maps each 2D point to a coordinate on an actual 3D model.
• The 6D pose is then estimated using the Perspective-n-Point (PnP) pose estimation method,
which estimates the camera pose given correspondences and the intrinsic parameters of the
camera (see the sketch after this list).
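A sketch of the pose block, assuming a hypothetical lookup `uv_to_xyz` that maps the 2-channel correspondence values back to 3D model coordinates; the selection by ID mask and the PnP + RANSAC solve follow the description above.

```python
import numpy as np
import cv2

def dpod_style_pose(id_mask, corr_map, uv_to_xyz, obj_id, K):
    """Select one object's pixels via the ID mask, build 2D-3D correspondences, and solve PnP.

    id_mask: (H, W) predicted object IDs; corr_map: (H, W, 2) predicted correspondence (texture)
    values; uv_to_xyz: hypothetical lookup taking (N, 2) correspondence values to (N, 3) model
    coordinates; obj_id: the object of interest; K: 3x3 intrinsics.
    """
    ys, xs = np.nonzero(id_mask == obj_id)
    pts_2d = np.stack([xs, ys], axis=1).astype(np.float32)
    pts_3d = uv_to_xyz(corr_map[ys, xs]).astype(np.float32)   # 2D-3D correspondences
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K.astype(np.float32), None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```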
DPOD: 6D Pose Object Detector and Refiner
Correspondence model: Given a 3D
model of interest (1), it applies a 2-channel
correspondence texture (2) to
it. The resulting correspondence
model (3) is then used to generate GT
maps and estimate poses.
Refinement architecture: The network predicts a refined pose
given an initial pose proposal. Crops of the real image and the
rendering are fed into two parallel branches. The difference of
the computed feature tensors is used to estimate the refined pose.
To learn dense 2D-3D correspondences,
each model of the dataset is textured
with a correspondence map.
DPOD: 6D Pose Object Detector and Refiner
Qualitative results: Poses predicted with the
approach on (a) the LineMOD dataset and
(b) the OCCLUSION dataset.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
• Feature-based and template-based methods were popular for 6D object pose estimation.
• Feature-based methods rely on distinguishable features and perform badly for texture-poor
objects.
• Template-based methods do not work well if objects are partially occluded.
• With deep learning methods showing success for different image-related problem settings,
models inspired or extending these have been used increasingly.
• Symmetric objects pose a particular challenge for orientation estimation, because multiple
solutions or manifolds of solutions exist.
• This work introduces ConvPoseCNN, a fully convolutional architecture that avoids cutting out
individual objects.
• It puts forward pixel-wise, dense prediction of both the translation and orientation components of the
object pose, where the dense orientation is represented in quaternion form.
• It presents different approaches for aggregating the dense orientation predictions, including
averaging and clustering schemes (a quaternion-averaging sketch follows this list).
• The dense orientation prediction implicitly learns to attend to occlusion-free, and feature-rich
object regions.
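One standard way to aggregate dense quaternion predictions is the eigenvector (outer-product) average, which is insensitive to the q/-q sign ambiguity. A sketch of such a weighted average; the paper additionally evaluates clustering schemes and different pixel weightings.

```python
import numpy as np

def average_quaternions(quats, weights=None):
    """Aggregate pixel-wise quaternion predictions for one object into a single orientation.

    quats: (N, 4) per-pixel quaternions; weights: optional (N,) confidences (e.g. segmentation
    scores). The average is the principal eigenvector of the weighted outer-product matrix.
    """
    q = quats / np.linalg.norm(quats, axis=1, keepdims=True)
    w = np.ones(len(q)) if weights is None else weights
    M = (q * w[:, None]).T @ q                 # 4x4 accumulator, invariant to per-sample sign flips
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -1]                      # unit quaternion with the largest eigenvalue
```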
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
Dense Prediction of 6D pose parameters inside ConvPoseCNN. The dense
predictions are aggregated on the object level to form 6D pose outputs.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
ConvPoseCNN
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
• The ConvPoseCNN architecture is derived from PoseCNN, which predicts, starting from RGB
images, 6D poses for each object in the image.
• The network starts with the convolutional backbone of VGG16 that extracts features.
• These are subsequently processed in three branches: The fully-convolutional segmentation
branch that predicts a pixel-wise semantic segmentation, the fully-convolutional vertex branch,
which predicts a pixel-wise estimation of the center direction and center depth, and the
quaternion estimation branch.
• The segmentation and vertex branch results are combined to vote for object centers in a Hough
transform layer.
• The Hough layer also predicts bounding boxes for the detected objects.
• PoseCNN then uses these bounding boxes to crop and pool the extracted features which are
then fed into a fully-connected neural network architecture.
• This fully-connected part predicts an orientation quaternion for each bounding box.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
Qualitative results from ConvPoseCNN L2 on the YCB-Video test set. Top: (orange) ground truth
bounding boxes, (green) 6D pose prediction. Middle: Angular error of the dense quaternion
prediction w.r.t. ground truth. Bottom: Quaternion prediction norm before normalization.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
• Current 6D object pose estimation methods usually require a 3D model for each object.
• These methods also require additional training in order to incorporate new objects.
• As a result, they are difficult to scale to a large number of objects and cannot be directly applied
to unseen objects.
• This work proposes a framework for 6D pose estimation of unseen objects.
• It designs an end-to-end neural network that reconstructs a latent 3D representation of an
object using a small number of reference views of the object.
• Using the learned 3D representation, the network is able to render the object from arbitrary views.
• Using this neural renderer, it directly optimizes for the pose given an input image.
• By training the network with a large number of 3D shapes for reconstruction and rendering, this
network generalizes well to unseen objects.
• A dataset for unseen object pose estimation–MOPED (Model-free Object Pose Estimation
Dataset) is presented.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
This is the end-to-end differentiable modeling and rendering pipeline to
perform pose estimation using simple gradient updates.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
• Given a set of N reference images with associated object poses and object segmentation
masks, it seeks to construct a representation of the object which can be rendered with
arbitrary camera parameters.
• It represents the object as a latent 3D voxel grid, directly manipulated using standard 3D
transformations–naturally accommodating the requirement of novel view rendering.
• There are two main components to the reconstruction pipeline: 1) Modeling the object by
predicting per-view feature volumes and fusing them into a single canonical latent
representation; 2) Rendering the latent representation to depth and color images.
• The modeling step is inspired by space carving in that the network takes observations from
multiple views and leverages multi-view consistency to build a canonical representation.
• The rendering module takes the fused object volume and renders it given arbitrary camera
parameters.
• It does so by first rendering depth and then using an image-based rendering approach to
produce a color image, preserving high-frequency details through a neural network.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
A high-level overview of this architecture. 1) This modeling network takes an image and mask and predicts a
feature volume for each input view. The predicted feature volumes are then fused into a single canonical
latent object by the fusion module. 2) Given the latent object, the rendering network produces a depth map
and a mask for any output camera.
LatentFusion: End-to-End Differentiable Reconstruction and
Rendering for Unseen Object Pose Estimation
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
• HybridPose, a 6D object pose estimation approach, utilizes a hybrid intermediate
representation to express different geometric information in the input image, including
keypoints, edge vectors, and symmetry correspondences.
• Compared to a unitary representation, the hybrid representation allows pose regression to
exploit more, and more diverse, features when one type of predicted representation is inaccurate
(e.g., because of occlusion).
• HybridPose leverages a robust regression module to filter out outliers in predicted
intermediate representation.
• All intermediate representations can be predicted by the same simple neural network
without sacrificing the overall performance.
• Compared to state-of-the-art pose estimation approaches, HybridPose is comparable in running time
and is significantly more accurate.
• The HybridPose code is available at https://github.com/chensong1995/HybridPose.
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
HybridPose predicts keypoints, edge vectors, and symmetry correspondences. (a) input RGB image. (b) red
markers denote predicted 2D keypoints. (c) edge vectors are defined by a fully-connected graph among all
keypoints. (d) symmetry correspondences connect each 2D pixel on the object to its symmetric counterpart.
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
• The input to HybridPose is an image containing an object in a known class, taken by a
pinhole camera with known intrinsic parameters.
• It assumes that the class of objects has a canonical coordinate system (i.e., the 3D point
cloud), under which HybridPose outputs the 6D camera pose of the imaged object.
• HybridPose consists of a prediction module and a pose regression module.
• HybridPose utilizes three prediction networks to estimate a set of keypoints, a set of edges
between keypoints and a set of symmetry correspondences between image pixels.
• The keypoint network employs an off-the-shelf prediction network PVNet;
• The edge network predicts edge vectors along a pre-defined graph of keypoints, which
stabilizes pose regression when keypoints are cluttered in the input image;
• The symmetry network predicts symmetry correspondences that reflect the underlying
(partial) reflection symmetry (an extension of FlowNet 2.0).
• The pose regression module optimizes the object pose to fit the output of the three
prediction networks (similar to the P3P solver, following the EPnP framework).
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
HybridPose consists of intermediate representation prediction networks and a pose regression module. The
prediction networks take an image as input, and output predicted keypoints, edge vectors, and symmetry
correspondences. The pose regression module consists of an initialization sub-module and a refinement sub-module.
The initialization sub-module solves a linear system with predicted intermediate representations to obtain an initial
pose. The refinement sub-module utilizes GM robust norm in the optimization to obtain the final pose prediction.
HybridPose: 6D Object Pose Estimation under
Hybrid Representations
HybridPose handles situations where the object has no occlusion (a, d, f, h),
light occlusion (b, c), and severe occlusion (e, g).
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
• Estimating a 6DOF object pose from a single image is very challenging due to occlusions or
texture-less appearances.
• Vector-field based keypoint voting has demonstrated its effectiveness and superiority on
tackling those issues.
• However, direct regression of vector-fields neglects that the distances between pixels and
keypoints also affect the deviations of hypotheses dramatically.
• In other words, small errors in direction vectors may generate severely deviated hypotheses
when pixels are far away from a keypoint.
• This paper aims to reduce such errors by incorporating the distances between pixels and
keypoints into the objective.
• To this end, it develops a differentiable proxy voting loss (DPVL) which mimics the
hypothesis selection in the voting procedure.
• By exploiting the voting loss, it can train the network in an end-to-end manner.
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
Differentiable Proxy Voting Loss (DPVL)
illustration. Provided that the estimation
errors of direction vectors are the same
(e.g., α), the distance between a pixel
and a keypoint affects the closeness
between a hypothesis and the keypoint.
DPVL minimizes the distance d⋆ between
a proxy hypothesis fk(p⋆ ) and a keypoint
ki to achieve accurate hypotheses for
keypoint voting.
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
• This work focuses on obtaining accurate initial pose estimation.
• In particular, this method is designed to localize and estimate the orientations and
translations of an object accurately without any refinement.
• The object pose is represented by a rigid transformation from the object coordinate system
to the camera coordinate system.
• Since voting based methods have demonstrated their robustness to occlusions and view
changes, here it follows the voting based pose estimation pipeline.
• Specifically, this method first votes for the 2D positions of the object keypoints from the vector-
fields and then estimates the 6DOF pose by solving a PnP problem.
• Prior works regress pixel-wise vector-fields by an l1 loss.
• However, small errors in the vector-fields may lead to large deviation errors of hypotheses
because the loss does not take the distance between a pixel and a keypoint into account.
• Therefore, it presents a differentiable proxy voting loss (DPVL) to reduce such errors by
mimicking the hypothesis selection in the voting procedure (see the sketch after this list).
• Furthermore, benefiting from DPVL, the network is able to converge much faster.
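A PyTorch sketch of a proxy voting loss for one keypoint: it penalizes the perpendicular distance from the ground-truth keypoint to the ray cast from each pixel along its predicted direction, so distant pixels are penalized more for the same angular error. The smooth-L1 wrapper and normalization details are assumptions rather than the paper's exact formulation.

```python
import torch

def proxy_voting_loss(pixels, dirs, keypoint):
    """Proxy voting loss for one keypoint (assumption-level re-implementation).

    pixels: (N, 2) foreground pixel coordinates; dirs: (N, 2) predicted direction vectors;
    keypoint: (2,) ground-truth keypoint location.
    """
    d = dirs / (dirs.norm(dim=1, keepdim=True) + 1e-9)   # unit directions
    rel = keypoint.unsqueeze(0) - pixels                  # vectors from pixels to the keypoint
    # 2D cross-product magnitude = distance from the keypoint to the line through the pixel.
    dist = (d[:, 0] * rel[:, 1] - d[:, 1] * rel[:, 0]).abs()
    return torch.nn.functional.smooth_l1_loss(dist, torch.zeros_like(dist))
```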
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
The system pipeline
6DoF Object Pose Estimation via Differentiable
Proxy Voting Loss
Qualitative results of pose estimation on the LINEMOD dataset
Qualitative results on the Occlusion LINEMOD dataset
Pose estimation from RGB images by deep learning

More Related Content

What's hot

What's hot (20)

image classification
image classificationimage classification
image classification
 
fusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIfusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving II
 
Object Recognition
Object RecognitionObject Recognition
Object Recognition
 
Image Processing Basics
Image Processing BasicsImage Processing Basics
Image Processing Basics
 
03 digital image fundamentals DIP
03 digital image fundamentals DIP03 digital image fundamentals DIP
03 digital image fundamentals DIP
 
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
 
Object recognition
Object recognitionObject recognition
Object recognition
 
Depth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors IIDepth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors II
 
Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)
 
Deep VO and SLAM
Deep VO and SLAMDeep VO and SLAM
Deep VO and SLAM
 
Deep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal DataDeep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal Data
 
Image segmentation with deep learning
Image segmentation with deep learningImage segmentation with deep learning
Image segmentation with deep learning
 
04 image enhancement edge detection
04 image enhancement edge detection04 image enhancement edge detection
04 image enhancement edge detection
 
Introduction to object detection
Introduction to object detectionIntroduction to object detection
Introduction to object detection
 
3D visualisation of medical images
3D visualisation of medical images3D visualisation of medical images
3D visualisation of medical images
 
Neural Radiance Fields & Neural Rendering.pdf
Neural Radiance Fields & Neural Rendering.pdfNeural Radiance Fields & Neural Rendering.pdf
Neural Radiance Fields & Neural Rendering.pdf
 
Image enhancement
Image enhancementImage enhancement
Image enhancement
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
 
Enhancement in Digital Image Processing
Enhancement in Digital Image ProcessingEnhancement in Digital Image Processing
Enhancement in Digital Image Processing
 
Object Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning FrameworkObject Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning Framework
 

Similar to Pose estimation from RGB images by deep learning

10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
mokamojah
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
c.choi
 

Similar to Pose estimation from RGB images by deep learning (20)

3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving
 
3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III
 
fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving I
 
LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)LiDAR-based Autonomous Driving III (by Deep Learning)
LiDAR-based Autonomous Driving III (by Deep Learning)
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V
 
[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp
 
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
 
ei2106-submit-opt-415
ei2106-submit-opt-415ei2106-submit-opt-415
ei2106-submit-opt-415
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Mmpaper draft10
Mmpaper draft10Mmpaper draft10
Mmpaper draft10
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
 
sduGroupEvent
sduGroupEventsduGroupEvent
sduGroupEvent
 
cvpresentation-190812154654 (1).pptx
cvpresentation-190812154654 (1).pptxcvpresentation-190812154654 (1).pptx
cvpresentation-190812154654 (1).pptx
 
ppt 20BET1024.pptx
ppt 20BET1024.pptxppt 20BET1024.pptx
ppt 20BET1024.pptx
 
Computer Vision - Real Time Face Recognition using Open CV and Python
Computer Vision - Real Time Face Recognition using Open CV and PythonComputer Vision - Real Time Face Recognition using Open CV and Python
Computer Vision - Real Time Face Recognition using Open CV and Python
 
998-isvc16
998-isvc16998-isvc16
998-isvc16
 
Scrdet++ analysis
Scrdet++ analysisScrdet++ analysis
Scrdet++ analysis
 
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
 

More from Yu Huang

More from Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 

PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
The 3D translation can be estimated by localizing the 2D center of the object and estimating the distance of the 3D center from the camera. Each pixel casts votes for 2D image locations along the ray predicted by the network.
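To make the voting step concrete, here is a minimal, naive sketch of dense center voting in Python. It assumes a per-pixel unit direction field and a binary class mask as inputs and simply walks each predicted ray through an accumulator; the actual Hough voting layer in PoseCNN is implemented differently (and also aggregates the predicted depth), so this is only an illustration.

import numpy as np

def vote_object_center(center_dirs, mask, n_steps=400):
    # center_dirs: (H, W, 2) unit vectors (dx, dy) pointing toward the object center
    # mask: (H, W) boolean semantic mask for one object class
    H, W = mask.shape
    acc = np.zeros((H, W), dtype=np.int32)
    ys, xs = np.nonzero(mask)
    for x, y, (dx, dy) in zip(xs, ys, center_dirs[ys, xs]):
        # walk along the predicted ray and vote for every pixel it crosses
        for s in range(1, n_steps):
            u, v = int(round(x + s * dx)), int(round(y + s * dy))
            if not (0 <= u < W and 0 <= v < H):
                break
            acc[v, u] += 1
    cy, cx = np.unravel_index(acc.argmax(), acc.shape)  # pixel with the most votes
    return (cx, cy), acc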
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
• This paper introduces a method for 3D object detection and pose estimation from color images only.
• It first uses segmentation to detect the objects of interest in 2D, even in the presence of partial occlusion and cluttered backgrounds.
• In contrast with recent patch-based methods, it relies on a “holistic” approach: a CNN is applied to the detected objects to predict their 3D poses in the form of 2D projections of the corners of their 3D bounding boxes.
• This, however, is not sufficient for handling objects from the recent T-LESS dataset: these objects exhibit an axis of rotational symmetry, and the similarity of two images of such an object under two different poses makes training the CNN challenging.
• It solves this problem by restricting the range of poses used for training, and by introducing a classifier to identify the range of a pose at run-time before estimating it.
• It also uses an optional additional step that refines the predicted poses.
• The full approach is scalable, as a single network can be trained for multiple objects simultaneously.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
Localization: (a) The input image is resized to 512 × 384 and split into regions of size 128 × 128. (b) Each region is first segmented into a binary mask of 8 × 8 for each possible object o. (c) Only the largest component is kept if several are present; the active locations are then segmented more finely. (d) The centroid of the final segmentation is used as the 2D object center.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
• It first finds the objects in 2D, obtains an initial estimate of their 3D poses (including objects with a rotational symmetry), and finally refines the initial pose estimates.
• It identifies the 2D centers of the objects of interest in the input images.
• It could use a standard 2D object detector, but instead develops an approach based on segmentation that results in better performance, as it can provide accurate locations even under partial occlusion.
• It predicts the 3D pose of an object by applying a deep network to an image window centered on the 2D object center.
• As for the segmentation, it uses VGG as the basis of this network, which makes it possible to handle all the objects of the target dataset with a single network.
• For an object with an angle of symmetry α, it can therefore restrict the poses used for training to those where the angle of rotation around the symmetry axis is within the range [0, α], avoiding the ambiguity between images.
• Denote by β the rotation angle, and introduce the intervals r1 = [0, α/2) and r2 = [α/2, α).
• To avoid ambiguity, β is restricted to r1 for the training images used in the optimization.
• A CNN classifier is introduced to predict at run-time whether β is in r1 or r2.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
Objects with symmetry of rotation: object #5 of T-LESS has an angle of symmetry α of 180°, if ignoring the small screw and electrical contact. If restricting the range of poses in the training set between 0° (a) and 180° (b), pose estimation still fails for test samples with an angle of rotation close to 0° modulo 180° (c). The solution is to restrict the range during training to be between 0° and 90°. A classifier detects whether the pose in an input image is between 90° and 180°. If this is the case (d), the input image is mirrored (e), and the predicted projections of the corners are mirrored back (f).
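The run-time handling of the restricted rotation range can be summarized in a few lines. The sketch below is an illustration only: corner_regressor and half_range_classifier are hypothetical stand-ins for the trained BB8 networks, and the crop is assumed to be an (H, W, 3) array centered on the detected object.

import numpy as np

def predict_corners_with_symmetry(crop, corner_regressor, half_range_classifier):
    # half_range_classifier: returns True if the rotation angle beta around the
    # symmetry axis falls in the second half-range r2 = [alpha/2, alpha).
    in_r2 = half_range_classifier(crop)
    inp = crop[:, ::-1, :] if in_r2 else crop             # mirror the crop horizontally
    corners = corner_regressor(inp)                        # (8, 2) projected box corners
    if in_r2:
        corners = corners.copy()
        corners[:, 0] = inp.shape[1] - 1 - corners[:, 0]   # mirror the x-coordinates back
    return corners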
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
Refining the pose: given a first pose estimate, shown by the blue bounding box (a), a binary mask (b) or a color rendering (c) of the object is generated. Given the input image and this mask or rendering, an update that improves the object pose can be predicted, shown by the red bounding box (d).
Two generated training images for different objects from the LINEMOD dataset: the object is shifted from the center to handle the inaccuracy of the detection method, and the background is random so that the network cannot exploit the context specific to the dataset.
BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth
First row: LINEMOD dataset. Second row: Occlusion dataset. Third row: T-LESS dataset (for objects of revolution, the pose is represented with a cylinder rather than a box). Last row: some failure cases.
Outline
• PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation (2018.12)
• SilhoNet: An RGB Method for 6D Object Pose Estimation (2019.6)
• Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation (2019.8)
• DeepIM: Deep Iterative Matching for 6D Pose Estimation (2019.10)
• Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction (2019.10)
• CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation (ICCV, 2019)
• DPOD: 6D Pose Object Detector and Refiner (ICCV, 2019)
• ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation (2019.12)
• LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation (2019.12)
• HybridPose: 6D Object Pose Estimation under Hybrid Representations (2020.1)
• 6DoF Object Pose Estimation via Differentiable Proxy Voting Loss (2020.2)
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
• This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation.
• Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance.
• However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation.
• It introduces a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and uses these vectors to vote for keypoint locations using RANSAC (a voting sketch follows below).
• This creates a flexible representation for localizing occluded or truncated keypoints.
• Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver.
• Experiments show that the approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation.
• The code is available at https://zju-3dv.github.io/pvnet/.
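A minimal RANSAC-style illustration of the voting idea, not PVNet's implementation: pairs of object pixels are sampled, the intersection of their predicted rays gives a keypoint hypothesis, and hypotheses are scored by how many pixels agree with them. PVNet additionally fits a mean and covariance over the weighted hypotheses to obtain the uncertainty used by the PnP solver; that step is omitted here.

import numpy as np

def intersect_rays(p1, d1, p2, d2):
    # Solve p1 + t1*d1 = p2 + t2*d2 for the 2D intersection point (None if near-parallel).
    A = np.array([[d1[0], -d2[0]], [d1[1], -d2[1]]])
    if abs(np.linalg.det(A)) < 1e-8:
        return None
    t1, _ = np.linalg.solve(A, p2 - p1)
    return p1 + t1 * d1

def vote_keypoint(vectors, mask, n_hyp=128, cos_thresh=0.99, seed=0):
    # vectors: (H, W, 2) predicted unit vectors toward one keypoint; mask: (H, W) object pixels
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    pix = np.stack([xs, ys], axis=1).astype(np.float64)
    dirs = vectors[ys, xs].astype(np.float64)
    best, best_score = None, -1
    for _ in range(n_hyp):
        i, j = rng.choice(len(pix), size=2, replace=False)
        h = intersect_rays(pix[i], dirs[i], pix[j], dirs[j])
        if h is None:
            continue
        to_h = h - pix
        to_h /= np.linalg.norm(to_h, axis=1, keepdims=True) + 1e-8
        score = int(((to_h * dirs).sum(axis=1) > cos_thresh).sum())  # inlier count
        if score > best_score:
            best, best_score = h, score
    return best, best_score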
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
The 6D pose estimation problem is formulated as a Perspective-n-Point (PnP) problem, which requires correspondences between 2D and 3D keypoints.
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
Overview of the keypoint localization: (a) an image; (b) the architecture of PVNet; (c) pixel-wise unit vectors pointing to the object keypoints; (d) semantic labels; (e) hypotheses of the keypoint locations generated by voting; (f) probability distributions of the keypoint locations estimated from the hypotheses.
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
SilhoNet: An RGB Method for 6D Object Pose Estimation
• Autonomous robot manipulation involves estimating the translation and orientation of the object to be manipulated as a 6-degree-of-freedom (6D) pose.
• Methods using RGB-D data have shown great success in solving this problem.
• However, there are situations where cost constraints or the working environment may limit the use of RGB-D sensors.
• When limited to monocular camera data only, the pose estimation problem is very challenging.
• Knowing how the target object is occluded in the image is important for certain applications, such as AR, where it is desirable to project over only the visible portion of an object.
• This work introduces SilhoNet, an RGB-based deep learning method that predicts 6D object pose from monocular images.
• It uses a CNN pipeline that takes ROI proposals as input to predict an intermediate silhouette representation for objects, with an associated occlusion mask and a 3D translation vector.
• The 3D orientation is then regressed from the predicted silhouettes.
SilhoNet: An RGB Method for 6D Object Pose Estimation
The SilhoNet pipeline for silhouette prediction and 6D object pose estimation
SilhoNet: An RGB Method for 6D Object Pose Estimation
• The 3D orientation is predicted from an intermediate un-occluded silhouette representation.
• The method also predicts an occlusion mask, which can be used to determine which parts of the object model are visible in the image.
• The method operates in two stages: it first predicts an intermediate silhouette representation and occlusion mask of an object, along with a vector describing the 3D translation, and then regresses the 3D orientation quaternion from the predicted silhouette.
• The input to the network is an RGB image with ROI proposals for detected objects and the associated class labels.
• The 1st stage uses a VGG16 backbone with deconvolution layers at the end to produce a feature map from the RGB input image (the same backbone as used in PoseCNN).
• Extracted features from the input image are concatenated with features from a set of rendered object viewpoints and then passed through 3 network branches, two of which have identical structure and predict a full un-occluded silhouette and an occlusion mask.
• The 3rd branch predicts a 3D vector encoding the object center in pixel coordinates and the range of the object center from the camera.
• The 2nd stage of the network passes the predicted silhouette through a ResNet-18 architecture with two fully connected layers (FCLs) at the end to output an L2-normalized quaternion, representing the 3D orientation.
SilhoNet: An RGB Method for 6D Object Pose Estimation
• A Faster R-CNN from Tensorpack, trained on the YCB-Video dataset, predicts the ROI proposals.
• For each class, a set of 12 viewpoints of size 224×224 is rendered from the object model.
• The 1st stage of the network predicts an intermediate silhouette representation of the object as a 64×64 binary mask.
• This silhouette represents the full un-occluded visual hull of the object, as though it were rendered with the same 3D orientation but centered in the frame.
• The size of the silhouette in the frame is invariant to the scale of the object in the image and is determined by a fixed distance of the object from the camera at which the silhouette appears to be rendered.
• This distance is chosen for each object so that the silhouette just fits within the frame for any 3D orientation.
• Given the smallest field of view of the camera A and the object size as width, height and depth (w, h, d), the render distance r follows from this constraint.
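The slide gives the render distance as an equation that did not survive extraction here. Under the stated "just fits for any orientation" constraint, assuming the silhouette must enclose the object's bounding sphere (diameter sqrt(w² + h² + d²)) within the smallest field of view A, a natural reconstruction is

r = \frac{\sqrt{w^2 + h^2 + d^2}}{2\,\tan(A/2)}

i.e., the distance at which the bounding sphere exactly subtends the angle A; the paper's exact expression may include an additional margin factor.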
SilhoNet: An RGB Method for 6D Object Pose Estimation
• If an object at a given range is shifted along an arc with the camera center as the focus, the Z coordinate will change while the object appearance in the shifted ROI remains unchanged.
• By predicting the object range rather than directly regressing the Z coordinate, the method does not suffer from such ambiguities and can recover the Z coordinate with good accuracy.
• Given the camera focal length f, the pixel coordinates of the object center (px, py) with respect to the image center, and the range r of the object center from the camera center, similar triangles show that the 3D object translation (X, Y, Z) can be recovered from these quantities.
• Given a ROI with lower x and y coordinate bounds (bx, by), the coordinates of the image principal point (cx, cy) and the predicted normalized output from the network (nx, ny), the object center pixel coordinates (px, py) are recovered from these values.
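The recovery equations on this slide are images and are not reproduced in the text above. The range-to-translation step can be reconstructed from the stated similar-triangles argument: the object center lies on the ray through pixel (px, py) at distance r from the camera center, which the following sketch implements. The ROI-to-pixel-coordinate step depends on how (nx, ny) is normalized, so it is not reproduced here.

import numpy as np

def translation_from_range(px, py, r, f):
    # px, py: object-center pixel coordinates relative to the principal point
    # r: predicted range (distance) of the object center from the camera center
    # f: focal length in pixels
    # The center lies along the ray (px, py, f); scaling that ray to length r gives (X, Y, Z).
    ray = np.array([px, py, f], dtype=np.float64)
    return r * ray / np.linalg.norm(ray)

For example, a center predicted 100 px to the right of the principal point with f = 600 and r = 1.0 m yields Z ≈ 0.986 m rather than 1.0 m, which is exactly the ambiguity the range parameterization avoids.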
SilhoNet: An RGB Method for 6D Object Pose Estimation
• The network predicts the apparent orientation as though the ROI were extracted from the center of the image.
• Given the predicted object translation, the true orientation is recovered by applying a pitch (δθ) and roll (δφ) adjustment to the predicted orientation.
• The network's 2nd stage takes in the predicted silhouette probability maps, thresholded at some value into binary masks, and outputs a quaternion prediction for the object orientation.
• This stage is composed of a ResNet-18 backbone, with the layers from the average pooling downward replaced by two fully connected layers.
• A transform matrix T is constructed with a z translation equal to the render distance r for the corresponding object class and the x and y translation components set to 0. The rotation is formed from the predicted apparent orientation.
• With this transform and the camera intrinsics K, each vertex of the object model can be projected onto the occlusion mask, which is scaled up to fit the minimum dimension of the input image.
SilhoNet: An RGB Method for 6D Object Pose Estimation
Example prediction of occluded and un-occluded silhouettes from a test image
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
• Estimating the 6D pose of objects using only RGB images remains challenging because of problems such as occlusion and symmetries.
• It is also difficult to construct 3D models with precise texture without expert knowledge or specialized scanning devices.
• To address these problems, this work proposes a pose estimation method, Pix2Pose, that predicts the 3D coordinates of each object pixel without textured models.
• An auto-encoder architecture is designed to estimate the 3D coordinates and errors per pixel.
• These pixel-wise predictions are then used in multiple stages to form 2D-3D correspondences, from which poses are directly computed with the PnP algorithm with RANSAC iterations.
• This method is robust to occlusion by leveraging recent achievements in generative adversarial training to precisely recover occluded parts.
• Furthermore, a loss function, the transformer loss, is proposed to handle symmetric objects by guiding predictions to the closest symmetric pose.
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
An overview of the architecture of Pix2Pose and the training pipeline.
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
• Pix2Pose predicts the 3D coordinates of individual pixels using a cropped region for an object.
• Robust estimation is achieved by recovering the 3D coordinates of occluded parts and using all pixels of an object for pose prediction.
• A single network is trained and used for each object class.
• The texture of a 3D model is not necessary for training and inference.
• The network input is a cropped image Is obtained using a bounding box of a detected object class.
• The network outputs are normalized 3D coordinates of each pixel in the object coordinate frame and estimated errors of each prediction from the Pix2Pose network.
• The target output includes coordinate predictions of occluded parts, which makes the prediction more robust to partial occlusion.
• Since a coordinate consists of three values, similar to RGB values in an image, the output can be regarded as a color image.
• Therefore, the ground truth output is easily derived by rendering the colored coordinate model in the ground truth pose.
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
An example of the pose estimation process. An image and 2D detection results are the input. In the 1st stage, the predicted results are used to specify important pixels and adjust bounding boxes while removing backgrounds and uncertain pixels. In the 2nd stage, pixels with valid coordinate values and small error predictions are used to estimate poses using the PnP algorithm with RANSAC. Green and blue lines in the result represent 3D bounding boxes of objects in ground truth poses and estimated poses.
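The 2nd stage reduces to building 2D-3D correspondences from the predicted coordinate and error maps and handing them to PnP with RANSAC. Below is a minimal sketch of that step using OpenCV; the validity and error thresholds are illustrative placeholders, and the coordinate map is assumed to be already de-normalized to the object frame.

import numpy as np
import cv2

def pose_from_coordinate_map(coord_map, error_map, K, err_thresh=0.1):
    # coord_map: (H, W, 3) predicted per-pixel 3D coordinates in the object frame
    # error_map: (H, W) predicted per-pixel error; K: 3x3 camera intrinsics
    valid = (np.linalg.norm(coord_map, axis=2) > 1e-3) & (error_map < err_thresh)
    ys, xs = np.nonzero(valid)
    if len(xs) < 6:
        return None  # not enough correspondences
    pts2d = np.stack([xs, ys], axis=1).astype(np.float64)
    pts3d = coord_map[ys, xs].astype(np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, None, reprojectionError=3.0, iterationsCount=200)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix from the Rodrigues vector
    return R, tvec, inliers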
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
• Current 6D object pose methods consist of deep CNN models fully optimized for a single object but with the architecture standardized across objects with different shapes.
• This work explicitly exploits each object's distinct topological information, i.e., its dense 3D mesh, in the pose estimation model, prior to any post-processing refinement stage.
• In order to achieve this, it proposes a learning framework in which a graph convolutional neural network reconstructs a pose-conditioned 3D mesh of the object.
• An estimate of the allocentric orientation is recovered by computing, in a differentiable manner, the Procrustes alignment between the canonical and reconstructed dense 3D meshes.
• The 6D egocentric pose is lifted using additional mask and 2D centroid projection estimations.
• This method is capable of self-validating its pose estimation by measuring the quality of the reconstructed mesh, which is invaluable in real-life applications.
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
The pipeline, which fully exploits the object shape topology both in 2D and 3D for 6D pose estimation
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
• Given a monocular RGB input image, the goal is to estimate the full 6D pose of a rigid object.
• It aims to design a distinct per-object architecture in an automated manner by taking full advantage of prior information about the object.
• The reconstruction stage combines the use of the object's known topology with encoded pose information extracted from the image.
• The estimated mesh information is used to recover the allocentric orientation of the target object.
• The egocentric orientation can then be recovered and lifted to 6D by adopting different approaches.
• It uses a pretrained Faster R-CNN based 2D object detector and fine-tunes the model on the training data in order to detect an object in 2D space.
• The detector is used to crop an object ROI for further processing, which is used at a high resolution to extract fine details of the object appearance in the next stages of the pipeline.
• This ad hoc detector is trained independently.
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
Qualitative results obtained with this method
DeepIM: Deep Iterative Matching for 6D Pose Estimation
• While several recent techniques have used depth cameras for object pose estimation, such cameras have limitations with respect to frame rate, field of view, resolution, and depth range, making it very difficult to detect small, thin, transparent, or fast-moving objects.
• Estimating the 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality.
• While direct regression from images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results.
• This work proposes a deep neural network for 6D pose matching named DeepIM.
• Given an initial pose estimate, the network is able to iteratively refine the pose by matching the rendered image against the observed image.
• The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process.
• DeepIM is able to match previously unseen objects.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
DeepIM is a deep iterative matching network for 6D object pose estimation. The network is trained to predict a relative SE(3) transformation that can be applied to an initial pose estimate for iterative pose refinement. Given a 6D pose estimate of an object, either from PoseCNN or the refined pose from the previous iteration, along with the 3D model of the object, a rendered image is generated showing the appearance of the target object under this rough pose estimate. With the pair of rendered and observed images, the network predicts a relative transformation which can be applied to refine the input pose.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
• The observed image, the rendered image, and the two masks are concatenated into an 8-channel tensor input to the network (3 channels each for the observed and rendered images, 1 channel for each mask).
• It uses the FlowNetSimple architecture as the backbone network, which is trained to predict optical flow between two images.
• Using the VGG16 image classification network as the backbone was also tried, but the results were very poor, confirming the intuition that a representation related to optical flow is very useful for pose matching.
• The pose estimation branch takes the feature map after 10 convolution layers of FlowNetSimple as input.
• It contains two fully connected layers, each with dimension 256, followed by two additional fully connected layers for predicting the quaternion of the 3D rotation and the 3D translation, respectively.
• During training, two auxiliary branches are used to regularize the feature representation of the network and increase training stability and performance.
• One branch is trained to predict the optical flow between the rendered and observed images, and the other to predict the foreground mask of the object in the observed image.
DeepIM: Deep Iterative Matching for 6D Pose Estimation
DeepIM uses a FlowNetSimple backbone to predict a relative SE(3) transformation to match the observed and rendered images of an object. Taking the observed image, the rendered image and their corresponding masks as input, the conv. layers output a feature map which is then forwarded through several fully connected layers to predict the translation and rotation. The same feature map, combined with feature maps from the previous layers, is also used to predict the optical flow and the foreground mask during training.
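At inference the refinement is a short render-predict-update loop. The sketch below illustrates only that loop: renderer and matcher are assumed stand-ins for the rendering function and the trained matching network, a pose is an (R, t) pair, and the simple left-composition used here ignores DeepIM's disentangled, object-centered parameterization of the update.

import numpy as np

def refine_pose(observed_rgb, obj_model, init_pose, renderer, matcher, n_iters=4):
    # init_pose: (R, t) with R a 3x3 rotation matrix and t a 3-vector
    R, t = init_pose
    for _ in range(n_iters):
        rendered_rgb, rendered_mask = renderer(obj_model, (R, t))  # render the current estimate
        dR, dt = matcher(observed_rgb, rendered_rgb)               # predicted relative SE(3)
        R, t = dR @ R, dR @ t + dt                                 # naive composition of the update
    return R, t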
DeepIM: Deep Iterative Matching for 6D Pose Estimation
Pose refinement results on the Occlusion LINEMOD dataset
CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
• 6-DoF object pose estimation from a single RGB image is a fundamental and long-standing problem in computer vision.
• Current leading approaches solve it by training deep networks either to regress both rotation and translation directly from the image, or to construct 2D-3D correspondences and solve them indirectly via PnP.
• It is argued that rotation and translation should be treated differently because of their significant differences. This work proposes a novel 6-DoF pose estimation approach, the Coordinates-based Disentangled Pose Network (CDPN), which disentangles the pose and predicts rotation and translation separately to achieve highly accurate and robust pose estimation.
• The method is flexible, efficient, and can deal with texture-less and occluded objects.
• The approach exceeds state-of-the-art RGB-based methods on commonly used metrics.
CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
Given an input image, the network first zooms in on the target object, and then the rotation and translation are disentangled for estimation. Concretely, the rotation is solved by PnP from the predicted 3D coordinates, while the translation is estimated directly from the image.
CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
• First, a fast, lightweight detector (e.g. tiny YOLOv3) is employed for coarse detection.
• Second, a fixed-size segmentation is implemented to extract the object pixels.
• For detection, the pose estimation system can tolerate detection errors to a large extent thanks to the Dynamic Zoom-In (DZI), so a fast but less precise detector is enough.
• For segmentation, it is merged into the coordinate regression to keep it light and fast.
• This two-step pipeline can efficiently extract the exact object region in various situations.
• For translation, to achieve a more robust and accurate estimate, it is predicted from the image rather than from 2D-3D correspondences, to avoid the influence of scale errors in the predicted 3D coordinates.
• Instead of regressing the translation from the whole image, a Scale-Invariant Translation Estimation (SITE) method estimates it from the detected object region.
• In this way, the disentangled processing of rotation and translation is unified into a single network, the Coordinates-based Disentangled Pose Network (CDPN).
CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
Qualitative results for 6-DoF pose estimation and 3D coordinates regression.
DPOD: 6D Pose Object Detector and Refiner
• This paper presents a deep learning method for 3D object detection and 6D pose estimation from RGB images only.
• The method, named DPOD (Dense Pose Object Detector), estimates dense multi-class 2D-3D correspondence maps between an input image and the available 3D models.
• Given the correspondences, a 6DoF pose is computed via PnP and RANSAC.
• An additional RGB pose refinement of the initial pose estimates is performed using a custom deep learning-based refinement scheme.
• Unlike other methods that mainly use real data for training and do not train on synthetic renderings, evaluation is performed with both synthetic and real training data, demonstrating superior results before and after refinement when compared to all recent detectors.
• While being precise, the presented approach is still real-time capable.
DPOD: 6D Pose Object Detector and Refiner
Given an input RGB image, the correspondence block, featuring an encoder-decoder neural network, regresses the object ID mask and the correspondence map. The latter provides explicit 2D-3D correspondences, whereas the ID mask indicates which correspondences should be taken for each detected object. The respective 6D poses are then efficiently computed by the pose block based on PnP + RANSAC.
DPOD: 6D Pose Object Detector and Refiner
• The inference pipeline is divided into two blocks: the correspondence block and the pose block.
• The correspondence block consists of an encoder-decoder CNN with three decoder heads which regress the ID mask and the dense 2D-3D correspondence map from an RGB image of size 320×240×3.
• The encoder part is based on a 12-layer ResNet-like architecture featuring residual layers that allow for faster convergence.
• The decoders upsample the features back to the original size using a stack of bilinear interpolations followed by convolutional layers.
• The pose block is responsible for the pose prediction: given the estimated ID mask, it observes which objects were detected in the image and their 2D locations, whereas the correspondence map maps each 2D point to a coordinate on the actual 3D model.
• The 6D pose is then estimated using the Perspective-n-Point (PnP) method, which estimates the camera pose given the correspondences and the intrinsic parameters of the camera.
DPOD: 6D Pose Object Detector and Refiner
Correspondence model: given a 3D model of interest (1), a 2-channel correspondence texture (2) is applied to it. The resulting correspondence model (3) is then used to generate ground-truth maps and estimate poses. To learn dense 2D-3D correspondences, each model of the dataset is textured with a correspondence map.
Refinement architecture: the network predicts a refined pose given an initial pose proposal. Crops of the real image and the rendering are fed into two parallel branches. The difference of the computed feature tensors is used to estimate the refined pose.
DPOD: 6D Pose Object Detector and Refiner
Qualitative results: poses predicted with the approach on (a) the LINEMOD dataset and (b) the OCCLUSION dataset.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
• Feature-based and template-based methods used to be popular for 6D object pose estimation.
• Feature-based methods rely on distinguishable features and perform badly for texture-poor objects.
• Template-based methods do not work well if objects are partially occluded.
• With deep learning methods showing success for different image-related problem settings, models inspired by or extending them have been used increasingly.
• Symmetric objects pose a particular challenge for orientation estimation, because multiple solutions or manifolds of solutions exist.
• This work introduces ConvPoseCNN, a fully convolutional architecture that avoids cutting out individual objects.
• It puts forward pixel-wise, dense prediction of both the translation and orientation components of the object pose, where the dense orientation is represented in quaternion form.
• It presents different approaches for aggregating the dense orientation predictions, including averaging and clustering schemes (an averaging sketch follows below).
• The dense orientation prediction implicitly learns to attend to occlusion-free, feature-rich object regions.
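One reasonable way to aggregate the dense quaternion predictions for an object (the averaging variant mentioned above) is the standard eigenvector-based quaternion mean, which is insensitive to the q / -q sign ambiguity. This is a generic sketch, not ConvPoseCNN's exact aggregation code; the weights could, for instance, be the pixel-wise segmentation scores.

import numpy as np

def average_quaternions(quats, weights=None):
    # quats: (N, 4) unit quaternions predicted at the pixels of one object
    # weights: optional (N,) per-pixel confidences
    if weights is None:
        weights = np.ones(len(quats))
    # Weighted sum of outer products; the principal eigenvector is the average rotation.
    M = np.einsum('n,ni,nj->ij', weights, quats, quats)
    _, vecs = np.linalg.eigh(M)
    q = vecs[:, -1]  # eigenvector with the largest eigenvalue
    return q / np.linalg.norm(q)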
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
Dense prediction of 6D pose parameters inside ConvPoseCNN. The dense predictions are aggregated at the object level to form the 6D pose outputs.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
ConvPoseCNN
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
• The ConvPoseCNN architecture is derived from PoseCNN and predicts, from RGB images, 6D poses for each object in the image.
• The network starts with the convolutional backbone of VGG16, which extracts features.
• These are subsequently processed in three branches: the fully convolutional segmentation branch that predicts a pixel-wise semantic segmentation, the fully convolutional vertex branch, which predicts a pixel-wise estimate of the center direction and center depth, and the quaternion estimation branch.
• The segmentation and vertex branch results are combined to vote for object centers in a Hough transform layer.
• The Hough layer also predicts bounding boxes for the detected objects.
• PoseCNN then uses these bounding boxes to crop and pool the extracted features, which are fed into a fully connected neural network architecture.
• This fully connected part predicts an orientation quaternion for each bounding box.
ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation
Qualitative results from ConvPoseCNN L2 on the YCB-Video test set. Top: (orange) ground truth bounding boxes, (green) 6D pose predictions. Middle: angular error of the dense quaternion prediction w.r.t. ground truth. Bottom: quaternion prediction norm before normalization.
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
• Current 6D object pose estimation methods usually require a 3D model for each object.
• These methods also require additional training in order to incorporate new objects.
• As a result, they are difficult to scale to a large number of objects and cannot be directly applied to unseen objects.
• This work proposes a framework for 6D pose estimation of unseen objects.
• It designs an end-to-end neural network that reconstructs a latent 3D representation of an object using a small number of reference views of the object.
• Using the learned 3D representation, the network is able to render the object from arbitrary views.
• Using this neural renderer, the pose is directly optimized given an input image.
• By training the network with a large number of 3D shapes for reconstruction and rendering, the network generalizes well to unseen objects.
• A dataset for unseen object pose estimation, MOPED (Model-free Object Pose Estimation Dataset), is presented.
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
The end-to-end differentiable modeling and rendering pipeline used to perform pose estimation with simple gradient updates.
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
• Given a set of N reference images with associated object poses and object segmentation masks, the goal is to construct a representation of the object which can be rendered with arbitrary camera parameters.
• The object is represented as a latent 3D voxel grid that can be directly manipulated using standard 3D transformations, naturally accommodating the requirement of novel-view rendering.
• There are two main components in the reconstruction pipeline: 1) modeling the object by predicting per-view feature volumes and fusing them into a single canonical latent representation; 2) rendering the latent representation to depth and color images.
• The modeling step is inspired by space carving in that the network takes observations from multiple views and leverages multi-view consistency to build a canonical representation.
• The rendering module takes the fused object volume and renders it given arbitrary camera parameters.
• It does so by first rendering depth and then using an image-based rendering approach to produce a color image, preserving high-frequency details through a neural network.
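Because the reconstruction and rendering are differentiable, pose estimation reduces to gradient descent on the pose parameters against the observed image. Below is a minimal PyTorch-style sketch of that idea under assumed interfaces: render_fn is a differentiable function mapping (latent object, 6-vector pose) to a depth map and a mask, target_mask is a float tensor in [0, 1], and the losses and pose parameterization here are illustrative rather than those used in the paper.

import torch

def optimize_pose(latent_obj, render_fn, target_depth, target_mask,
                  init_pose, n_steps=100, lr=1e-2):
    # init_pose: 6-vector (3 rotation parameters, 3 translation components)
    pose = init_pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(n_steps):
        depth, mask = render_fn(latent_obj, pose)
        depth_loss = torch.nn.functional.l1_loss(depth * target_mask, target_depth * target_mask)
        mask_loss = torch.nn.functional.binary_cross_entropy(
            mask.clamp(1e-4, 1 - 1e-4), target_mask)
        loss = depth_loss + mask_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach()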
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
A high-level overview of the architecture. 1) The modeling network takes an image and mask and predicts a feature volume for each input view. The predicted feature volumes are then fused into a single canonical latent object by the fusion module. 2) Given the latent object, the rendering network produces a depth map and a mask for any output camera.
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
[6] X. Deng, A. Mousavian, Y. Xiang, F. Xia, T. Bretl, and D. Fox. "PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking." Robotics: Science and Systems (RSS), 2019.
HybridPose: 6D Object Pose Estimation under Hybrid Representations
• HybridPose, a 6D object pose estimation approach, utilizes a hybrid intermediate representation to express different kinds of geometric information in the input image, including keypoints, edge vectors, and symmetry correspondences.
• Compared to a unitary representation, the hybrid representation allows the pose regression to exploit more, and more diverse, features when one type of predicted representation is inaccurate (e.g., because of occlusion).
• HybridPose leverages a robust regression module to filter out outliers in the predicted intermediate representation.
• All intermediate representations can be predicted by the same simple neural network without sacrificing the overall performance.
• Compared to state-of-the-art pose estimation approaches, HybridPose is comparable in running time and is significantly more accurate.
• The HybridPose code: https://github.com/chensong1995/HybridPose.
HybridPose: 6D Object Pose Estimation under Hybrid Representations
HybridPose predicts keypoints, edge vectors, and symmetry correspondences. (a) input RGB image; (b) red markers denote predicted 2D keypoints; (c) edge vectors are defined by a fully-connected graph among all keypoints; (d) symmetry correspondences connect each 2D pixel on the object to its symmetric counterpart.
HybridPose: 6D Object Pose Estimation under Hybrid Representations
• The input to HybridPose is an image containing an object of a known class, taken by a pinhole camera with known intrinsic parameters.
• Assuming that the class of objects has a canonical coordinate system (i.e. the 3D point cloud), HybridPose outputs the 6D camera pose of the imaged object in that system.
• HybridPose consists of a prediction module and a pose regression module.
• HybridPose utilizes three prediction networks to estimate a set of keypoints, a set of edges between keypoints, and a set of symmetry correspondences between image pixels.
• The keypoint network employs an off-the-shelf prediction network, PVNet.
• The edge network predicts edge vectors along a pre-defined graph of keypoints, which stabilizes pose regression when keypoints are cluttered in the input image.
• The symmetry network predicts symmetry correspondences that reflect the underlying (partial) reflection symmetry (an extension of FlowNet 2.0).
• The pose regression module optimizes the object pose to fit the output of the three prediction networks (similar to the P3P solver, following the EPnP framework).
HybridPose: 6D Object Pose Estimation under Hybrid Representations
HybridPose consists of intermediate-representation prediction networks and a pose regression module. The prediction networks take an image as input and output predicted keypoints, edge vectors, and symmetry correspondences. The pose regression module consists of an initialization sub-module and a refinement sub-module. The initialization sub-module solves a linear system with the predicted intermediate representations to obtain an initial pose. The refinement sub-module utilizes the GM robust norm in the optimization to obtain the final pose prediction.
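For the refinement step, assuming "GM" refers to the Geman-McClure robust kernel, the residuals of the hybrid constraints are scored with rho(r) = r^2 / (r^2 + mu^2), whose influence saturates for large residuals so that outlier predictions barely affect the optimization. A small sketch of the kernel and its equivalent IRLS weight (mu is an arbitrary scale parameter here):

import numpy as np

def gm_rho(r, mu=5.0):
    # Geman-McClure robust norm: bounded above by 1, approximately (r/mu)^2 for small r.
    return (r * r) / (r * r + mu * mu)

def gm_irls_weight(r, mu=5.0):
    # Weight for iteratively reweighted least squares (proportional to rho'(r)/r):
    # large residuals receive a near-zero weight and are effectively ignored.
    return (mu * mu) / (r * r + mu * mu) ** 2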
HybridPose: 6D Object Pose Estimation under Hybrid Representations
HybridPose handles situations where the object has no occlusion (a, d, f, h), light occlusion (b, c), and severe occlusion (e, g).
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
• Estimating a 6DoF object pose from a single image is very challenging due to occlusions or texture-less appearances.
• Vector-field based keypoint voting has demonstrated its effectiveness and superiority in tackling those issues.
• However, direct regression of vector-fields neglects that the distances between pixels and keypoints also affect the deviations of the hypotheses dramatically.
• In other words, small errors in direction vectors may generate severely deviated hypotheses when pixels are far away from a keypoint.
• This paper aims to reduce such errors by incorporating the distances between pixels and keypoints into the objective.
• To this end, it develops a differentiable proxy voting loss (DPVL) which mimics the hypothesis selection in the voting procedure.
• By exploiting the voting loss, the network can be trained in an end-to-end manner.
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
Differentiable Proxy Voting Loss (DPVL) illustration: provided that the estimation errors of the direction vectors are the same (e.g., α), the distance between a pixel and a keypoint affects the closeness between a hypothesis and the keypoint. DPVL minimizes the distance d⋆ between a proxy hypothesis fk(p⋆) and a keypoint ki to achieve accurate hypotheses for keypoint voting.
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
• This work focuses on obtaining an accurate initial pose estimate.
• In particular, the method is designed to localize and estimate the orientations and translations of an object accurately without any refinement.
• The object pose is represented by a rigid transformation from the object coordinate system to the camera coordinate system.
• Since voting-based methods have demonstrated their robustness to occlusions and view changes, the voting-based pose estimation pipeline is followed here.
• Specifically, the method first votes for the 2D positions of the object keypoints from the vector-fields and then estimates the 6DoF pose by solving a PnP problem.
• Prior works regress pixel-wise vector-fields with an l1 loss.
• However, small errors in the vector-fields may lead to large deviation errors of the hypotheses, because the loss does not take the distance between a pixel and a keypoint into account.
• Therefore, a differentiable proxy voting loss (DPVL) is presented to reduce such errors by mimicking the hypothesis selection in the voting procedure.
• Furthermore, benefiting from DPVL, the network is able to converge much faster.
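The core of a DPVL-style term is the perpendicular distance from the ground-truth keypoint to the line through a pixel along its predicted direction: the proxy hypothesis is the foot of that perpendicular, so the distance grows with both the angular error and the pixel-keypoint distance. The sketch below computes these per-pixel distances; wrapping them in a smooth-l1 (or l1) penalty gives a loss in the spirit of DPVL, not the paper's exact formulation.

import numpy as np

def proxy_voting_distances(pixels, unit_dirs, keypoint):
    # pixels: (N, 2) pixel locations p; unit_dirs: (N, 2) predicted unit vectors v at those pixels
    # keypoint: (2,) ground-truth keypoint k
    # Distance from k to the line {p + t*v}: |v_perp . (k - p)| with v_perp = (-vy, vx).
    diff = keypoint[None, :] - pixels
    v_perp = np.stack([-unit_dirs[:, 1], unit_dirs[:, 0]], axis=1)
    return np.abs(np.sum(v_perp * diff, axis=1))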
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
The system pipeline
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
Qualitative results of pose estimation on the LINEMOD dataset.
Qualitative results on the Occlusion LINEMOD dataset.