Deconstructing SfM-Net architecture and beyond
Deep Learning for Structure-from-Motion (SfM)
Purpose of this presentation
● Deconstruction of the novel SfM-Net deep learning architecture for Structure-from-Motion (SfM)
- Anticipating the increased use of deep learning for SfM and “geometric” computer vision problems in general.
- Hopefully this presentation is able to incite feelings of “that could be done in a better way”, leading to better and better deep SfM architectures.
● “A bit of everything for everyone” approach, targeted mainly at computer vision engineers who are not at an advanced level in geometric deep learning.
● Trying to address the typical problem of finding the relevant “seed literature” for a new topic, helping fresh grad students, postdocs, software engineers and startup founders.
- Answering “Do you know if someone has done some work on the various steps involved in SfM?” to identify which wheels do not need to be re-invented
Background
SfM • Structure from Motion Basics recap • Camera Projections
Structure-from-Motion (SfM). Instead of a
single stereo pair, the SfM technique requires
multiple, overlapping photographs as input to
feature extraction and 3-D reconstruction
algorithms. - Westoby et al
The SfM method essentially computes the relative camera positions between all related photos. Once every relative camera pose is found, the scheme uses these matrices to reconstruct all feature points using triangulation.
Thus there are two main problems:
1) Image registration (e.g. SIFT,
SURF, ORB, etc)
2) Pose Estimation (e.g.
Perspective-n-Point with
RANSAC)
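To make the two-step recipe concrete, here is a minimal OpenCV sketch of classical two-view SfM (not from the slides; the image file names, the intrinsics matrix K and all parameter values are illustrative assumptions):

```python
import cv2
import numpy as np

# Hypothetical inputs: two overlapping grayscale frames and known intrinsics K.
img1 = cv2.imread("img1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.png", cv2.IMREAD_GRAYSCALE)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

# 1) Image registration: detect and match local features (ORB here).
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2) Pose estimation: essential matrix with RANSAC, then relative R, t.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                  prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

# Triangulate the inlier correspondences to get sparse 3D structure.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
inl = inliers.ravel().astype(bool)
pts4d = cv2.triangulatePoints(P1, P2, pts1[inl].T, pts2[inl].T)
pts3d = (pts4d[:3] / pts4d[3]).T  # homogeneous -> Euclidean

# For later frames, 2D-3D correspondences + cv2.solvePnPRansac give each new pose.
```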
Image registration
e.g Find corresponding features from image pair (http://cs.brown.edu/courses/cs143/proj3a/)
The more robust the algorithm used to find the features (e.g. old school vs. deep learning), the higher the quality of the reconstructed point cloud.
The camera projection matrix converts 3D real-world coordinates to 2D image coordinates.
(Figure: Perspective Camera Toy; source link truncated: ults/proj3/html/agartia3/index.html)
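As a small illustration of such a projection matrix (a sketch with made-up intrinsics and pose, not values from the slides):

```python
import numpy as np

# Minimal pinhole projection sketch: world point -> pixel via P = K [R | t].
# K, R, t below are illustrative values.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                        # camera rotation (world -> camera)
t = np.array([[0.0], [0.0], [1.0]])  # camera translation

P = K @ np.hstack([R, t])            # 3x4 camera projection matrix

X_world = np.array([0.2, -0.1, 4.0, 1.0])  # homogeneous 3D point
x = P @ X_world                             # homogeneous image point
u, v = x[0] / x[2], x[1] / x[2]             # 2D pixel coordinates
print(u, v)
```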
SfM • Structure from Motion Basics recap • Fundamental Matrix
SfM • Structure from Motion Basics recap • pose estimates
Evaluating Pose Estimation Methods for
Stereo Visual Odometry on Robots
Date of Original Version: 8-2010
Hatem Alismail, Carnegie Mellon University;
Brett Browning, Carnegie Mellon University;
M. Bernardine Dias, Carnegie Mellon University
http://repository.cmu.edu/robotics/745/
Structure-From-Motion (SFM) methods, using
stereo data, are among the best performing
algorithms for motion estimation from video
imagery, or visual odometry. Critical to the
success of SFM methods is the quality of the initial
pose estimation algorithm from feature
correspondences. In this work, we evaluate the
performance of pose estimation algorithms
commonly used in SFM visual odometry. We
consider two classes of techniques to develop the
initial pose estimate: Absolute Orientation (AO)
methods, and Perspective-n-Point (PnP)
methods.
To date, there has not been a comparative study of
their performance on robot visual odometry tasks.
We undertake such a study to measure the
accuracy, repeatability, and robustness of these
techniques for vehicles moving in indoor
environments and in outdoor suburban roadways.
Our results show that PnP methods outperform
AO methods, with P3P being the best performing
algorithm. This is particularly true when stereo
triangulation uncertainty is high due to a wide
Field of View lens and small stereo-rig baseline.
Random forests versus Neural Networks — What's best for camera localization?
Daniela Massiceti ; Alexander Krull ; Eric Brachmann ; Carsten Rother ; Philip H.S. Torr
Robotics and Automation (ICRA), 2017 IEEE International Conference on; https://doi.org/10.1109/ICRA.2017.7989598
“To summarize, our best method, a ForestNet with a robust average, which has an equivalent fast and lightweight
RF, improves over the state-of-the-art for camera localization on the 7-Scenes dataset. While this work focuses
on scene coordinate regression for camera localization, our innovations may also be applied to other continuous
regression tasks.”
Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional
Neural Network
Zakaria Laskar, Iaroslav Melekhov, Surya Kalia, Juho Kannala
https://arxiv.org/abs/1707.09733
“The neural network is trained for relative pose estimation in an end-to-end manner using training image pairs.
In contrast to previous work, our approach does not require scene-specific training of the network, which improves
scalability, and it can also be applied to scenes which are not available during the training of the network.”
DSAC - Differentiable RANSAC for Camera Localization
Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, Carsten Rother
https://arxiv.org/abs/1611.05705
“We call this approach DSAC, the differentiable counterpart of RANSAC. We apply DSAC to the problem of
camera localization, where deep learning has so far failed to improve on traditional approaches. We
demonstrate that by directly minimizing the expected loss of the output camera poses, robustly estimated by
RANSAC, we achieve an increase in accuracy. In the future, any deep learning pipeline can use DSAC as a robust
optimization component.”
Deep 6-DOF Tracking
Mathieu Garon, Jean-François Lalonde
https://arxiv.org/abs/1703.09771
“We present a temporal 6-DOF tracking method which leverages deep learning to achieve state-of-the-art
performance on challenging datasets of real world capture. Our method is both more accurate and more robust to
occlusions than the existing best performing approaches while maintaining real-time performance. To assess its
efficacy, we evaluate our approach on several challenging RGBD sequences of real objects in a variety of
conditions. Notably, we systematically evaluate robustness to occlusions through a series of sequences where the
object to be tracked is increasingly occluded. Finally, our approach is purely data-driven and does not require
any hand-designed features: robust tracking is automatically learned from data.”
SfM-Net • Intro
SfM-NeT • Abstract
Computer Science > Computer Vision and Pattern Recognition
SfM-Net: Learning of Structure and Motion from Video
Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki
Google Research; Inria, Grenoble, France; Carnegie Mellon University
(Submitted on 25 Apr 2017) arXiv:1704.07804 [cs.CV] | https://arxiv.org/abs/1704.07804
We propose SfM-Net, a geometry-aware neural network for motion
estimation in videos that decomposes frame-to-frame pixel motion in
terms of scene and object depth, camera motion and 3D object rotations
and translations. Given a sequence of frames, SfM-Net predicts depth,
segmentation, camera and rigid object motions, converts those into a
dense frame-to-frame motion field (optical flow), differentiably warps
frames in time to match pixels and back-propagates.
The model can be trained with various degrees of supervision:
1) Self-supervised by the re-projection photometric error (completely
unsupervised),
2) Supervised by ego-motion (camera motion), or
3) Supervised by depth (e.g., as provided by RGBD sensors). SfM-Net
extracts meaningful depth estimates and successfully estimates
frame-to-frame camera rotations and translations.
It often successfully segments the moving objects in the scene, even though such supervision is never provided.
Figure: SfM-Net. Given a pair of frames as input, our model decomposes frame-to-frame pixel motion into 3D scene depth, 3D camera rotation and translation, a set of motion masks and corresponding 3D rigid rotations and translations. It backprojects the resulting 3D scene flow into 2D optical flow and warps accordingly to match pixels from one frame to the next. Forward-backward consistency checks constrain the estimated depth.
SfM-NeT • Inspiration
SfM-Net is inspired by works that impose geometric
constraints on optical flow, exploiting rigidity of the visual
scene, such as early low-parametric optical flow methods [e.g. Zelnik-Manor and Irani (2000)] or the so-called direct methods for visual SLAM (Simultaneous Localization and Mapping) that perform dense pixel matching from frame to frame while estimating a camera trajectory and depth of the pixels in the scene [e.g. Schöps et al. (2014) and Engel et al. (2014)].
In contrast to those, instead of optimizing directly over
optical flow vectors, 3D point coordinates or camera
rotation and translation, our model optimizes over neural
network weights that, given a pair of frames, produce such
3D structure and motion. In this way, our method learns to
estimate structure and motion, and can in principle
improve as it processes more videos, in contrast to non-
learning based alternatives. It can thus be made robust to
lack of texture, degenerate camera motion trajectories or
dynamic objects (our model explicitly accounts for those),
by providing appropriate supervision.
Our work is also inspired by and builds upon recent works on learning geometrically interpretable optical flow fields for point cloud prediction in time [Byravan and Fox (2016)], backpropagating through camera projection for 3D human pose estimation [Wu et al. (2016)], and single-view depth estimation [Zhou et al. (2017), https://github.com/tinghuiz/SfMLearner].
The training data to our system consists solely of
unlabeled image sequences capturing scene appearance
from different viewpoints, where the poses of the images
are not provided. Our training procedure produces two
models that operate independently, one for single-view
depth prediction, and one for multiview camera pose
estimation.
https://arxiv.org/abs/1704.07813
Overview of the Large Scale Direct Monocular SLAM (LSD-SLAM) algorithm
http://www.doc.ic.ac.uk/~ab9515/lsdslam.html.
SfM-NeT • related Work
Differentiable warping [Jaderberg et al. (2015)] has been used to learn end-to-end unsupervised optical flow [Yu et al. (2016)], disparity flow in a stereo rig [Godard et al. (2016)] and video prediction [Patraucean et al. (2015)]. The closest previous works to ours are SE3-Nets [Byravan and Fox (2016)], 3D image interpreter [Wu et al. (2016)], and Garg et al.'s (2016) depth CNN.
SE3-Nets[Byravan and Fox (2016)]
use an actuation force from a
robot and an input point cloud to forecast a set of 3D rigid
object motions (rotation and translations) and
corresponding pixel motion assignment masks under a
static camera assumption.
Our work uses similar representation of pixel motion masks
and 3D motions to capture the dynamic objects in the
scene. However, our work differs in that
1) We predict depth and camera motion while SE3-
Nets operate on given point clouds and assume no
camera motion,
2) SE3-Nets are supervised with pre-recorded 3D
optical flow, while this work admits diverse and
much weaker supervision, as well as complete
lack of supervision,
3) SE3-Nets consider one frame and an action as
input to predict the future motion, while our model
uses pairs of frames as input to estimate the intra-
frame motion, and
4) SE3-Nets are applied to toy or lab-like setups
whereas we show results on real videos.
https://doi.org/10.1109/ICRA.2017.7989023
SE3-NET architecture. Input is a 3D point cloud and an n-dimensional action vector (bold-italics), both of which are
encoded and concatenated to a joint feature vector (CAT). The decoder uses this encoding to predict "k" object masks M
and "k" SE(3) transforms which are used to transform the input cloud via the "Transform layer" to generate the output.
Mask weights are sharpened and normalized before use for prediction. Conv = Convolution, FC = Fully Connected,
Deconv = Deconvolution, CAT = Concatenation
SfM-Net • Architecture
SfM-NeT • SfM-Net architecture
SfM-Net architecture. For each pair of consecutive frames (I_t, I_t+1), a conv/deconv sub-network predicts depth d_t while another predicts a set of K segmentation masks m_t. The coarsest feature maps of the motion-mask encoder are further decoded through fully connected layers towards 3D rotations and translations for the camera and the K segmentations. The predicted depth is converted into a per-frame point cloud using estimated or known camera intrinsics. Then, it is transformed according to the predicted 3D scene flow, as composed by the 3D camera motion and independent 3D mask motions. The transformed 3D depth is projected back to the 2D next frame, and thus provides corresponding 2D optical flow fields. Differentiable backward warping maps frame I_t+1 to I_t, and gradients are computed based on pixel errors. Forward-backward constraints are imposed by repeating this process for the inverted frame pair (I_t+1, I_t) and constraining the depths d_t and d_t+1 to be consistent through the estimated scene motion.
(Architecture diagram detail: the coarse feature map is fed into 2 x fully connected layers)
SfM-NeT • SfM-Net architecture • structure Network
We compute per frame depth using a
standard conv/deconv subnetwork
operating on a single frame (the
structure network on previous slide).
We use a RELU activation at our final
layer, since depth values are non-
negative.
Given depth d_t, we obtain the 3D point cloud corresponding to the pixels in the scene using a pinhole camera model. Let (x_t^i, y_t^i) be the column and row positions of the i-th pixel in frame I_t and let (c_x, c_y, f) be the camera intrinsics; then Equation (1) of the paper gives the backprojected 3D point, where d_t^i denotes the depth value of the i-th pixel. We use the camera intrinsics when available and revert to default values of (0.5, 0.5, 1.0) otherwise. Therefore, the predicted depth will only be correct up to a scalar multiplier.
https://youtu.be/vZELygPzV0M?t=51m47s
by Cyrill Stachniss
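A minimal PyTorch sketch of this backprojection, assuming the standard pinhole relation X = d(x − c_x)/f, Y = d(y − c_y)/f, Z = d that Equation (1) denotes (function and variable names are illustrative):

```python
import torch

def depth_to_pointcloud(depth, cx=0.5, cy=0.5, f=1.0):
    """Backproject a depth map (B, 1, H, W) to a per-pixel 3D point cloud.

    Pixel coordinates are normalized to [0, 1], matching the default
    intrinsics (cx, cy, f) = (0.5, 0.5, 1.0) used when no calibration is
    available, so the result is only defined up to a scale factor.
    """
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h),
                            torch.linspace(0, 1, w), indexing="ij")
    xs = xs.expand(b, 1, h, w)
    ys = ys.expand(b, 1, h, w)
    X = depth * (xs - cx) / f
    Y = depth * (ys - cy) / f
    Z = depth
    return torch.cat([X, Y, Z], dim=1)  # (B, 3, H, W)
```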
Xu et al. (2015)
https://machinelearningonline.blog/ by narasimman
Activation function variants for CIFAR-10/100 dataset
Godin et al. (2017): Test errors of the ResNet-110 architecture
using Dual ReLUs and Dual Exponential Linear Units [DELUs,
Clevert et al., (2016)], compared to the initial version with ReLUs
and the extension with Concatenated ReLU [CreLU,
Shang et al. (2017)] on CIFAR-10 and CIFAR-100, using an equal
parameter budget.
Comparison of activation functions: the rectified linear unit (ReLU), the leaky ReLU (LReLU, α = 0.1), the shifted ReLUs (SReLUs), and the exponential linear unit (ELU, α = 1.0). Clevert et al. (2016)
The conv/deconv subnetwork is the same idea as U-Net: it first downsamples the dense map to a coarse map through max pooling [ENCODER], which is then followed by upsampling [DECODER]. Wojna et al. (2017)
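A toy PyTorch sketch of such a conv/deconv (encoder-decoder) subnetwork; the channel counts and depths are illustrative and not the actual SfM-Net configuration:

```python
import torch
import torch.nn as nn

class ConvDeconv(nn.Module):
    """Toy U-Net-style encoder-decoder: max-pool downsampling followed by
    upsampling back to input resolution. Channel sizes are illustrative."""
    def __init__(self, in_ch=3, out_ch=1):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU())
        # ReLU at the final layer keeps the predicted depth non-negative.
        self.head = nn.Sequential(nn.Conv2d(16, out_ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        e1 = self.pool(self.enc1(x))   # 1/2 resolution
        e2 = self.pool(self.enc2(e1))  # 1/4 resolution (coarse map)
        d1 = self.dec1(e2)             # back to 1/2 resolution
        d2 = self.dec2(d1)             # back to full resolution
        return self.head(d2)

depth = ConvDeconv()(torch.randn(2, 3, 128, 384))  # -> (2, 1, 128, 384)
```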
SfM-NeT • SfM-Net architecture • Scene&Object Motion Network
We compute the motion of the camera and of independently moving objects in the
scene using a conv/deconv subnetwork that operates on a pair of images (the
motion network of architecture).
We depth-concatenate the pair of frames and use a series of convolutional layers
to produce an embedding layer. We use two fully-connected layers to predict the
motion of the camera between the frames and a predefined number K of rigid
body motions that explain moving objects in the scene.
The fully-connected layers are used to predict the translation parameters t_c, the pivot points of the camera rotation p_c, and sin α, sin β, sin γ. These last three parameters are constrained to be in the interval [−1, 1] by using a ReLU activation and the minimum function.
We use similar representations as for camera motion and predict parameters using
fully-connected layers on top of the same embedding E. While camera motion is a
global transformation applied to all the pixels in the scene, the object motion
transforms are weighted by the predicted membership probability of each pixel to
each rigid motion. These masks are produced by feeding the embedding layer
through a deconvolutional tower. We use sigmoid activations at the last layer
instead of softmax in order to allow each pixel to belong to any number of rigid
body motions.
When a pixel has zero activation across all K maps it is assigned to the static
background whose motion is a function of the global camera motion alone. We
allow a pixel to belong to multiple rigid body transforms in order to capture
composition of motions, e.g., through kinematic chains, such as articulated
bodies. Learning the required number of motions for a sequence is an interesting
open problem. We found that we could fix K = 3 for all experiments presented here.
Note that our method can learn to ignore unnecessary object motions in a
sequence by assigning no pixels to the corresponding mask.
http://www.math.tau.ac.il/~dcor/Graphics/cg-slides/trans3d.pdf
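A rough PyTorch sketch of these motion heads under the assumptions above; layer sizes are illustrative, and the paper's ReLU-plus-minimum construction for the sine parameters is stood in for by a simple clamp:

```python
import torch
import torch.nn as nn

class MotionHeads(nn.Module):
    """Sketch of SfM-Net-style motion heads (camera + K object motions).
    Layer sizes and the clamping of the sine parameters are illustrative."""
    def __init__(self, embed_dim=512, K=3):
        super().__init__()
        self.K = K
        # Each motion = 3 translation params + 3 pivot params + 3 sines.
        self.fc = nn.Sequential(nn.Linear(embed_dim, 512), nn.ReLU(),
                                nn.Linear(512, (K + 1) * 9))

    def forward(self, embedding):                  # (B, embed_dim)
        p = self.fc(embedding).view(-1, self.K + 1, 9)
        trans, pivot, sines = p[..., :3], p[..., 3:6], p[..., 6:]
        # Constrain sin(alpha), sin(beta), sin(gamma) to [-1, 1]
        # (stand-in for the paper's ReLU + minimum construction).
        sines = torch.clamp(sines, -1.0, 1.0)
        return trans, pivot, sines                 # index 0 = camera, 1..K = objects

class MaskHead(nn.Module):
    """Deconv tower producing K sigmoid motion masks (only the last conv shown).
    Sigmoid (not softmax) lets a pixel belong to several rigid motions."""
    def __init__(self, feat_ch=64, K=3):
        super().__init__()
        self.out = nn.Conv2d(feat_ch, K, 1)

    def forward(self, decoder_features):
        return torch.sigmoid(self.out(decoder_features))  # (B, K, H, W) in [0, 1]
```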
SfM-NeT • SfM-Net architecture • optical Flow
We obtain optical flow by first transforming the point
cloud obtained in Equation (1) using the camera and object
motion rigid body transformations followed by projecting
the 3D point on to the image plane using the camera
intrinsics.
In the following, we drop the pixel superscript i from the 3D coordinates, since it is clear we are referring to the motion transformation of the i-th pixel of the t-th frame. We first apply the object transformations, then the camera transformation, before projecting back to the image plane (the equations themselves are given in the paper).
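A hedged sketch of this transform-then-project step; it assumes the camera and object motions have already been composed into a single rigid transform per point, and the names and shapes are illustrative:

```python
import torch

def pointcloud_to_flow(points, pix_xy, R_comp, t_comp, cx=0.5, cy=0.5, f=1.0):
    """Sketch: rigidly transform a point cloud by the (already composed)
    camera/object motion and reproject it to obtain 2D optical flow.

    points: (B, N, 3) 3D points from the depth backprojection
    pix_xy: (B, N, 2) original normalized pixel coordinates
    R_comp: (B, 3, 3) rotation, t_comp: (B, 1, 3) translation
    """
    moved = points @ R_comp.transpose(1, 2) + t_comp   # rigid transform
    x_new = f * moved[..., 0] / moved[..., 2] + cx      # pinhole projection
    y_new = f * moved[..., 1] / moved[..., 2] + cy
    new_xy = torch.stack([x_new, y_new], dim=-1)
    return new_xy - pix_xy                              # 2D flow (B, N, 2)
```

In the full model, the K object motions are first blended per pixel by the predicted motion masks before the global camera motion is applied; that blending is omitted here for brevity.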
Upgrade • SfM-NeT • Upgrade to architecture #1
The Devil is in the Decoder
Zbigniew Wojna, Vittorio Ferrari,
Sergio Guadarrama, Nathan Silberman,
Liang-Chieh Chen, Alireza Fathi, Jasper Uijlings
While encoders have been studied rigorously, relatively few
studies address the decoder side. Therefore this paper
presents an extensive comparison of a variety of
decoders for a variety of pixel-wise prediction tasks. Our
contributions are:
1) Decoders matter: we observe significant variance in
results between different types of decoders on various
problems.
2) We introduce a novel decoder: bilinear additive
upsampling.
3) We introduce new residual-like connections for decoders.
4) We identify two decoder types which give a consistently
high performance.
SfM-Net is more like a proof-of-concept network that will probably be upgraded by the authors themselves to skip-connection / residual-like connections for better performance?
Wojna et al. (2017) found little advantage in depth prediction when using skip layers:
“For depth prediction, all layers except bilinear upsampling have good performance, whereas
adding skip layers to these results in equal performance except for depth-to-space, where it
slightly lowers performance”
When using residual connections, performance consistently improves:
“For the majority of combinations, we see that adding residual connections is beneficial.
Interestingly, we now can identify two upsampling methods which have consistently good
results on all problems presented in this paper, both which have residual connections: (1)
transposed convolutions + residual connections. (2) bilinear additive upsampling + residual
connections (both with and without skip connections).”
Our main results comparing a
variety of decoders on five
machine vision problems. The
upper part shows decoders
without residual-like connections;
the bottom shows decoders with
residual-like connections. The
colors represent relative
performance: red means top
performance, yellow means
reasonable performance, blue
means poor performance.
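A possible PyTorch sketch of the bilinear additive upsampling idea with a residual-like connection, following the verbal description above; the channel-grouping factor and the exact placement of the residual addition are illustrative choices rather than the paper's precise recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bilinear_additive_upsample(x, scale=2, channel_reduction=2):
    """Sketch of 'bilinear additive upsampling' (Wojna et al. 2017):
    bilinearly upsample, then sum groups of channels to reduce depth."""
    b, c, h, w = x.shape
    x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    x = x.view(b, c // channel_reduction, channel_reduction, h * scale, w * scale)
    return x.sum(dim=2)

class ResidualDecoderBlock(nn.Module):
    """Decoder step with a residual-like connection: a conv refinement is
    added on top of the parameter-free upsampled features."""
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch // 2
        self.conv = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                  nn.ReLU(),
                                  nn.Conv2d(out_ch, out_ch, 3, padding=1))

    def forward(self, x):
        up = bilinear_additive_upsample(x)  # (B, in_ch/2, 2H, 2W)
        return up + self.conv(up)           # residual refinement
```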
Upgrade • SfM-NeT • Upgrade to architecture #2
Learning a Multi-View
Stereo Machine
Abhishek Kar, Christian Häne, Jitendra Malik. UC Berkeley
https://people.eecs.berkeley.edu/~akar/deepmvs.pdf
In this work, we present Learnt Stereo Machines (LSM) -
a system which is able to reconstruct object geometry as
voxel occupancy grids or per-view depth maps from a small
number of views, including just a single image. We design
our system inspired by classical approaches while learning
each component from data embedded in an end to end
system. LSMs have built in projective geometry, enabling
reasoning in metric 3D space and effectively exploiting the
geometric structure of the Multi-view stereopsis (MVS)
problem.
Compared to classical approaches, which are designed to
exploit a specific cue such as silhouettes or photo-
consistency, our system learns to exploit the cues that are
relevant to the particular instance while also using priors
about shape to predict geometry for unseen regions.
Compared to recent learning based reconstruction
approaches, our system is able to better use camera pose
information leading to significantly large improvements
while adding more views. Finally, we show successful
generalization to unseen object categories demonstrating
that our network goes beyond semantic cues and strongly
uses geometric information for unified single and multi-view
3D reconstruction
Overview of a Learnt Stereo Machine (LSM). It takes as input one or more views and camera poses. The images are processed through a feature encoder, and the resulting features are then unprojected into the 3D world frame using a differentiable unprojection operation. LSMs can produce two kinds of outputs – voxel occupancy grids (Voxel LSM) decoded from G_o, or per-view depth maps (Depth LSM) decoded after a projection operation.
Qualitative results for
per-view depth map
prediction on
ShapeNet. We show
the depth maps
predicted by Depth-
LSM (visualized with
shading from a shifted
viewpoint) and the
point cloud obtained
by unprojecting them
into world coordinates.
Upgrade • SfM-NeT • Relu alternatives
A summary of recommendations: use the ELU non-linearity without batchnorm, or ReLU with it.
Upgrade • SfM-NeT • Normalization techniques
Batch normalization, what was this?
TL;DR To reduce covariate shift (explained by Alex Smola)
[D] Weight normalization vs. layer normalization, has
anyone done benchmarks?
(self.MachineLearning)
submitted 3 months ago by carlthome
Batch normalization is the norm (pun
intended) but for RNNs or small batch sizes
layer normalization and
weight normalization look like attractive
alternatives.
In the NIPS submission for weight normalization,
they have the layer normalization paper listed as
a reference (although never cited in the text), but
it has since been removed. This got me thinking
about pros/cons of the respective methods. Has
anyone done benchmarks comparing weight
normalization to layer normalization (particularly
for ResNets or RNNs)?
PS: Recurrent batch normalization is
memory intensive and should be avoided IMO,
but that too would be interesting to benchmark.
Batch Renormalization: Towards Reducing Minibatch
Dependence in Batch-Normalized Models
Sergey Ioffe, Google Inc., sioffe@google.com
(Submitted on 10 Feb 2017 (v1), last revised 30 Mar 2017 (this version, v2))
https://arxiv.org/abs/1702.03275 | https://github.com/titu1994/BatchRenormalization
Batch Normalization—What the hey?
By Karl N. Jun 8, 2016
Batch ReNorm is useful
especially with smaller batch
sizes. Validation accuracy for
models trained with either
batchnorm or Batch Renorm, where
normalization is performed for sets
of 4 examples (but with the
gradients aggregated over all 50×32
examples processed by the 50
workers). Batch Renorm allows the
model to train faster and achieve a
higher accuracy, although
normalizing sets of 32 examples
performs better.
“Batch normalization is applied to all
convolutional layer outputs.”
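To make the options concrete, a small PyTorch sketch of batch-size-independent alternatives to batch normalization mentioned above (Batch Renormalization itself is not a built-in PyTorch layer, so group normalization and weight normalization are shown as stand-ins; sizes are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

x = torch.randn(4, 64, 32, 32)  # small batch of feature maps (B = 4)

# Batch normalization: statistics over the batch; degrades for tiny batches.
bn = nn.BatchNorm2d(64)
y_bn = bn(x)

# Group normalization: batch-size independent, attractive for small batches
# or recurrent models (layer norm is the single-group special case).
gn = nn.GroupNorm(num_groups=8, num_channels=64)
y_gn = gn(x)

# Weight normalization: reparameterizes the conv weights instead of
# normalizing the activations.
wn_conv = weight_norm(nn.Conv2d(64, 64, 3, padding=1))
y_wn = wn_conv(x)
```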
Upgrade • SfM-NeT • pooling vs. strides vs. dilation
Christian Perone, R&D Machine Learning Engineer and Software Engineer at HP
https://www.slideshare.net/perone/deep-learning-convolutional-neural-networks
Typical architectures use pooling after each convolution
layer.
In practice (figure on the left), 16 convolutions need to be computed when pooling is used, and then the max is taken over each 2x2 window; whereas with a 2x2 stride and no pooling, one would compute only 4 convolutions while obtaining the same downsampling.
Springenberg et al. (2015) suggest that the performance
decrease from this simplification may be negligible in
practice: “We find that max-pooling can simply be replaced
by a convolutional layer with increased stride without loss in
accuracy on several image recognition benchmarks”
Dilated convolutions (the à trous algorithm, in practice the stationary wavelet transform) do not reduce dimensionality but rather aggregate multi-scale contextual information. See for example Yu et al. (2015).
IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: PP, Issue: 99)
Generalizing Pooling Functions in CNNs:
Mixed, Gated, and Tree
Chen-Yu Lee ; Patrick Gallagher ; Zhuowen Tu
https://doi.org/10.1109/TPAMI.2017.2703082
Each convolutional layer consists of a series of 3 × 3 convolutional layers alternating between stride 1 and stride 2.
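A small PyTorch sketch contrasting the three options discussed above (channel counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 64, 64)

# (a) Convolution followed by 2x2 max pooling: compute at full resolution, then pool.
conv_pool = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.MaxPool2d(2))

# (b) Strided convolution: same 2x downsampling, but only 1/4 of the convolutions
#     are evaluated (Springenberg et al. 2015 report little accuracy loss).
conv_stride = nn.Conv2d(32, 64, 3, stride=2, padding=1)

# (c) Dilated (a trous) convolution: no downsampling at all, but a larger
#     receptive field for aggregating multi-scale context (Yu et al. 2015).
conv_dilated = nn.Conv2d(32, 64, 3, dilation=2, padding=2)

print(conv_pool(x).shape, conv_stride(x).shape, conv_dilated(x).shape)
# torch.Size([1, 64, 32, 32]) torch.Size([1, 64, 32, 32]) torch.Size([1, 64, 64, 64])
```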
Upgrade • SfM-NeT • Enforcing sharp boundaries #1
The downsampling-upsampling combination can smooth out or remove some thin structures and sharp boundaries, and several papers have been written to address this.
“Unlike skip connections and previous encoder-decoder methods, we first
learn a coarse feature map after the encoder stage in a feedforward pass, and
then refine this feature map in a top-down strategy during the decoder stage
utilizing features at successively lower layers. Therefore, the deconvolutional
process is conducted stepwise, which is guided by Deeply-Supervision Net
providing the integrated direct supervision.”
https://arxiv.org/abs/1705.04456
Upgrade • SfM-NeT • Enforcing sharp boundaries #2
Segmentation-Aware Convolutional
Networks Using Local Attention Masks
Adam W. Harley, Konstantinos G. Derpanis, Iasonas Kokkinos
(Submitted on 15 Aug 2017)
https://arxiv.org/abs/1708.04607 | http://cs.cmu.edu/~aharley/segaware
Segmentation-aware convolution
filters are invariant to backgrounds.
We achieve this in three steps: (i)
compute segmentation cues for each
pixel (i.e., “embeddings”), (ii) create a
foreground mask for each patch, and
(iii) combine the masks with
convolution, so that the filters only
process the local foreground in each
image patch.
Segmentation-aware bilateral filtering. Given an input image (left), a CNN
typically produces a smooth prediction map (middle top). Using learned per-pixel
embeddings (middle bottom), we adaptively smooth the FC8 feature map with our
segmentation-aware bilateral filter (right).
General schematic for our segmentation-aware CNN. The first part is an
embedding network, which is guided to compute embedding-like representations
at multiple scales, and constructs a final embedding as a weighted sum of the
intermediate embeddings. The loss on these layers operates on pairwise distances
computed from the embeddings. These same distances are then used to construct
local attention masks, that intercept the convolutions in a task-specific network.
The final objective backpropagates through both networks, fine-tuning the
embeddings for the task.
Visualizations of optical flow produced by
FlowNet and its segmentation-aware variant on
the FlyingChairs test set: segmentation-
awareness yields much sharper results than
the baseline.
SfM-Net • Supervision
SfM-NeT • Supervision
SfM-Net inverts the image
formation and extracts depth,
camera and object motions that gave
rise to the observed temporal
differences, similar to previous SfM
works [1, 6].
Such inverse problems are ill-posed
as many solutions of depth, camera
and object motion can give rise to the
same observed frame-to-frame pixel
values.
A learning-based solution, as
opposed to direct optimization, has
the advantage of learning to handle
such ambiguities through partial
supervision of their weights or
appropriate pre-training, or simply
because the same coefficients
(network weights) need to explain a
large abundance of video data
consistently.
We detail the various supervision
modes below and explore a subset of
them in the experimental section.
Kyong Hwan Jin ; Michael T. McCann ; Emmanuel Froustey ; Michael Unser | https://doi.org/10.1109/TIP.2017.2713099
IEEE Transactions on Image Processing ( Volume: 26, Issue: 9, Sept. 2017 )
ME5286 – Lecture 2 (Theory):
Image Formation and Cameras
by Saad J Bedros, University of Minnesota,
http://www.me.umn.edu/courses/me5286/
Adversarial Inversion: Inverse Graphics with Adversarial Priors
Hsiao-Yu Fish Tung, Adam Harley, William Seto, Katerina Fragkiadaki (Submitted on 31 May 2017)
https://arxiv.org/abs/1705.11166
Multi-view Supervision for Single-view Reconstruction via
Differentiable Ray Consistency
Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, Jitendra Malik
https://arxiv.org/abs/1704.06254
Toward Geometric Deep SLAM
MagicPoint and MagicWarp (from Magic Leap)
Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich (Submitted on 24 Jul 2017)
https://arxiv.org/abs/1707.07410
Geometric deep learning: going beyond Euclidean data
Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, Pierre Vandergheynst last revised 3 May 2017
https://arxiv.org/abs/1611.08097
SfM-NeT • Supervision • Self-supervision
Given unconstrained video, without
accompanying ground-truth
structure or motion information, our
model is trained to minimize the
photometric error between the
first frame and the second frame
warped towards the first according
to the predicted motion field, based
on well-known brightness
constancy assumptions (assuming
Lambertian surfaces). We use
differentiable image warping
proposed in the spatial transformer
work (Jaderberg et al., 2015) and
compute color constancy loss in a
fully differentiable manner.
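A hedged PyTorch sketch of this photometric self-supervision, using grid_sample as the differentiable (spatial-transformer-style) warp; the flow convention and the plain L1 loss are illustrative simplifications:

```python
import torch
import torch.nn.functional as F

def photometric_loss(frame1, frame2, flow):
    """Warp frame2 towards frame1 with the predicted flow via differentiable
    bilinear sampling and penalize the brightness-constancy residual.

    frame1, frame2: (B, 3, H, W); flow: (B, 2, H, W) in pixels.
    """
    b, _, h, w = frame1.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame1.dtype, device=frame1.device),
        torch.arange(w, dtype=frame1.dtype, device=frame1.device),
        indexing="ij")
    grid = torch.stack([xs, ys], dim=0).unsqueeze(0) + flow   # (B, 2, H, W)
    # Normalize sampling coordinates to [-1, 1] as grid_sample expects.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack([grid_x, grid_y], dim=-1)       # (B, H, W, 2)
    warped = F.grid_sample(frame2, sample_grid, align_corners=True)
    return (frame1 - warped).abs().mean()                     # L1 photometric error
```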
https://arxiv.org/abs/1608.05842
“In particular, we use a loss function that combines a data
term that measures photometric constancy over time with a
spatial term that models the expected variation of flow across
the image. The photometric loss measures the difference
between the first input image and the (inverse) warped
subsequent image based on the predicted optical flow by the
network. The smoothness loss measures the difference
between spatially neighbouring flow predictions. Together,
these two losses form a proxy for losses based on the
groundtruth flow.”
https://www.slideshare.net/yuhuang/optic-flow-estimatio
n-with-deep-learning
https://arxiv.org/abs/1506.02025
Light diffuseness metric Part 1: Theory
L Xia, MSc, SC Pont, PhD, I Heynderickx, PhD
Lighting Research & Technology Vol 49, Issue 4, 2017
http://doi.org/10.1177/1477153516631391
Thomas Y. Lee; David H. Brainard
Journal of Vision January 2014, Vol.14, 24. doi: 10.1167/14.1.24
“Human perception not necessarily the same as
a camera system”
SfM-NeT • Supervision • Spatial smoothness priors
When our network is self-supervised, we add
robust spatial smoothness penalties on the
optical flow field, the depth, and the inferred
motion maps, by penalizing the L1 norm of the
gradients across adjacent pixels, as usually
done in previous works [Kong and Black (2015)].
For depth prediction, we penalize the norm of
second order gradients in order to encourage
not constant but rather smoothly changing
depth values.
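A minimal sketch of these penalties (the weighting between the terms and any robust reweighting are omitted):

```python
import torch

def first_order_smoothness(field):
    """L1 penalty on spatial gradients (used for flow and motion masks)."""
    dx = (field[..., :, 1:] - field[..., :, :-1]).abs()
    dy = (field[..., 1:, :] - field[..., :-1, :]).abs()
    return dx.mean() + dy.mean()

def second_order_smoothness(depth):
    """L1 penalty on second-order gradients, favouring smoothly varying
    (not necessarily constant) depth, as described above."""
    dxx = (depth[..., :, 2:] - 2 * depth[..., :, 1:-1] + depth[..., :, :-2]).abs()
    dyy = (depth[..., 2:, :] - 2 * depth[..., 1:-1, :] + depth[..., :-2, :]).abs()
    return dxx.mean() + dyy.mean()
```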
http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/
By Michael Zibulevsky and Michael Elad
IEEE SIGNAL PROCESSING MAGAZINE [76] MAY 2010
DOI: 10.1109/MSP.2010.936023
Depth map inpainting under a second-order smoothness prior
Daniel Herrera C.†, Juho Kannala† , Lubor Ladický‡ , and Janne Heikkilä†
†Center for Machine Vision Research University of Oulu, Finland
‡Visual Geometry Group University of Oxford, UK
Levin's (2004) approach uses a first-order prior, i.e. it favors constant depth, whereas our second-order prior favors a constant depth derivative. This is clearly seen in the results of Figure 1. Levin's approach correctly separates the surfaces but fills the missing pixels (across the surface boundary) with a constant depth, while our method provides a smooth result that matches the ground truth shape.
Difference between 1st-order and 2nd-order gradients for a depth image (zoomed portion below), as computed with Matlab's imgradient.
https://doi.org/10.1007/978-3-642-38886-6_52 | Cited by 9 articles
SfM-NeT • Supervision • Forward-backward consistency constraints
We incorporate forward-backward
consistency constraints between
inferred scene depth in different
frames. Composing scene flow
forward and backward across
consecutive frames allows us to
impose such forward-backward
consistency cycles across more than
one frame gaps, however, we have
not yet seen empirical gain from
doing so.
In other words, one could "robustify" the network by using more temporal samples, which should improve inlier/outlier separation.
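A very rough sketch of what such a forward-backward depth consistency term could look like; it assumes the transformed depth and the warped next-frame depth have already been computed elsewhere (e.g. with the same differentiable warping used for the photometric loss):

```python
import torch

def forward_backward_depth_consistency(depth_t, z_t_moved, depth_t1_warped):
    """Penalize disagreement between the depth of frame t pushed through the
    estimated scene motion (z_t_moved) and the depth predicted for frame t+1,
    sampled at the corresponding flow-warped pixel locations.

    All tensors are (B, 1, H, W); pixels without a valid prediction are masked.
    """
    valid = (depth_t > 0).float()  # ignore empty / invalid pixels (assumption)
    return ((z_t_moved - depth_t1_warped).abs() * valid).mean()
```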
3D Vision (3DV), 2016 Fourth International Conference on, 16-18 Nov. 2016
A Depth Restoration Occlusionless Temporal Dataset
Daniel Rotman ; Guy Gilboa Electrical Engineering Department, Technion - Israel Institute of Technology.
https://doi.org/10.1109/3DV.2016.26
“Utilizing multiple frames, we create a number of possibilities for an initial degraded depth map,
which allows us to arrive at a more educated decision when refining depth images. Evaluating this
method with our dataset shows significant benefits, particularly for overcoming real sensor-noise
artifacts.”
The dataset is freely downloadable at: http://visl.technion.ac.il/databases/drot2016/
Science of Electrical Engineering (ICSEE), 2016 IEEE International Conference on the
Frame rate reduction of depth cameras by RGB-based depth prediction
Daniel Rotman ; Omer Cohen ; Guy Gilboa Electrical Engineering Department, Technion - Israel Institute of Technology.
https://doi.org/10.1109/ICSEE.2016.7806153
“Depth cameras are becoming widely used for facilitating fast and robust natural user interaction. But
measuring depth can be high in power consumption mainly due to the active infrared illumination
involved in the acquisition process, for both structured-light and time-of-flight technologies. It
becomes a critical issue when the sensors are mounted on hand-held (mobile) devices, where
power usage is of the essence. A method is proposed to reduce the depth acquisition frame rate,
possibly by factors of 2 or 3, thus saving considerable power.
The compensation is done by calculating reliable depth estimations using a coupled color (RGB)
camera working at full frame rate. These predictions, which are shown to perform outstandingly,
create for the end user or application the perception of a depth sensor working at full frame rate.
Quality measures based on skeleton extraction and depth inaccuracy are used to calculate the
deviation from the ground truth.”
SfM-NeT • Supervision • Supervising depth
If depth is available on parts of the
input image, such as with video
sequences captured by a Kinect
sensor, we can use depth supervision
in the form of robust depth regression.
Now, in theory, we can generate targets automatically for an SfM pipeline designed to operate:
1) without a depth sensor, such as a traditional smartphone
- Target with a Kinect or a high-quality laser scanner
2) on a Google Tango smartphone with "low-quality depth sensing"
- Target with a high-quality laser scanner
No need for a massive Mechanical Turk workforce for boring, time-consuming labeling.
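A minimal sketch of such depth supervision as a masked robust (L1) regression; the validity test `gt_depth > 0` is an assumption about how missing sensor measurements are encoded:

```python
import torch

def depth_supervision_loss(pred_depth, gt_depth):
    """Robust depth regression against a (possibly sparse) sensor depth map:
    an L1 penalty evaluated only where ground truth exists, e.g. valid
    Kinect / Tango / laser-scan pixels."""
    valid = (gt_depth > 0).float()            # mask of measured pixels
    num_valid = valid.sum().clamp(min=1.0)
    return ((pred_depth - gt_depth).abs() * valid).sum() / num_valid
```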
3D with Kinect
Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on
Jan Smisek ; Michal Jancosek ; Tomas Pajdla
Date of Conference: 6-13 Nov. 2011
https://doi.org/10.1109/ICCVW.2011.6130380
“We demonstrate the functionality of
Kinect calibration by integrating it
into an SfM pipeline where 3D
measurements from a moving Kinect
are transformed into a common
coordinate system by computing
relative poses from matches in color
camera.”
SfM performs better when one has both RGB and depth data available.
SfM-NeT • Supervision • Supervising camera motion
Supervising camera
motion. If ground-truth
camera pose trajectories
are available, we can
supervise our model by
computing corresponding
ground-truth camera
rotation and translation
from frame to frame, and
constrain our camera
motion predictions
accordingly.
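A hedged sketch of this kind of pose supervision; the relative-pose convention and the plain L1 penalties are illustrative choices, not the paper's exact formulation:

```python
import torch

def camera_motion_supervision(pred_R, pred_t, pose_t, pose_t1):
    """Supervise camera motion from a ground-truth trajectory: compute the
    relative ground-truth rotation/translation between frames t and t+1 and
    penalize the predicted motion against it.

    pose_t, pose_t1: (B, 4, 4) camera-to-world poses (convention assumed);
    pred_R: (B, 3, 3), pred_t: (B, 3).
    """
    rel = torch.linalg.inv(pose_t1) @ pose_t     # ground-truth relative pose
    gt_R, gt_t = rel[:, :3, :3], rel[:, :3, 3]
    rot_loss = (pred_R - gt_R).abs().mean()
    trans_loss = (pred_t - gt_t).abs().mean()
    return rot_loss + trans_loss
```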
IEEE Transactions on Image Processing
( Volume: 23, Issue: 12, Dec. 2014 )
Online Camera-Gyroscope Autocalibration
for Cell Phones
Chao Jia ; Brian L. Evans https://doi.org/10.1109/TIP.2014.2360120
Our contributions are: simultaneous online
camera self-calibration and camera-
gyroscope calibration based on an implicit
extended Kalman filter and generalization of
the multiple-view coplanarity constraint on
camera rotation in a rolling shutter camera
model for cell phones.
Now on the Google Tango platform you could use:
1) RGB video
2) Depth video
3) Gyroscope data
Sensor-based camera motion detection for unconstrained slam
Original Assignee: Qualcomm Incorporated
Publication date: Jul 12, 2016
US 9390344 B2 Techniques are presented for monocular visual
simultaneous localization and mapping (SLAM) based on detecting a
translational motion in the movement of the camera using at least one
motion sensor, while the camera is performing panoramic SLAM, and
initializing a three dimensional map for tracking of finite features. Motion
sensors may include one or more sensors, including inertial (gyroscope,
accelerometer), magnetic (compass), vision (camera) or any other sensors
built into mobile devices.
Virtual Reality, 2001. Proceedings. IEEE
Fusion of vision and gyro tracking for robust
augmented reality registration
S. You ; U. Neumann https://doi.org/10.1109/VR.2001.913772
The framework includes a two-channel
complementary motion filter that combines
the low-frequency stability of vision sensors
with the high-frequency tracking of gyroscope
sensors, hence achieving stable static and
dynamic six-degree-of-freedom pose tracking.
Our implementation uses an extended Kalman
filter (EKF).
Poling and Lerman (2016): We present a deeply integrated method of
exploiting low-cost gyroscopes to improve general purpose feature
tracking. Most previous methods use gyroscopes to initialize and bound
the search for features. In contrast, we use them to regularize the tracking
energy function so that they can directly assist in the tracking of
ambiguous and poor-quality features.
SfM-NeT • Supervision • Supervising optical flow and object motion
Supervising optical flow and
object motion. Ground-truth
optical flow, object masks, or
object motions require
expensive human annotation
on real videos. However,
these signals are available in
recent synthetic datasets
[20]. In such cases, our
model could be trained to
minimize, for example, an L1
regression loss between
predicted and ground-truth flow vectors.
https://arxiv.org/abs/1707.06316
In this paper, we propose to use DenseNet for optical flow prediction. Our contributions are
two-fold. First, we extend current DenseNet to a fully convolutional network. Our model is
totally unsupervised, and achieves performance close to supervised approaches. Second,
we empirically show that replacing convolutions with dense blocks in the expanding part
yields better performance
https://arxiv.org/abs/1603.03911
(a) Semantic segmentation breaks the image into regions such as road, bike, person, sky, etc. (c) Existing optical flow
algorithms do not have access to either the segmentations or the semantics of the classes. (d) Our semantic optical
flow algorithm computes motion differently in different regions, depending on the semantic class label, resulting in
more precise flow, particularly at object boundaries. (b) The flow also helps refine the segmentation of the
foreground objects.
Upgrade • Supervision • Loss Function #1
http://doi.ieeecomputersociety.org/10.1109/TPAMI.2007.1171
Nearly all existing methods for stereo reconstruction assume that scene reflectance is
Lambertian{*}
and make use of brightness constancy (BC) as a matching invariant. We
introduce a new invariant for stereo reconstruction called light transport constancy
(LTC), which allows completely arbitrary scene reflectance (bidirectional reflectance
distribution functions (BRDFs)). This invariant can be used to formulate a rank constraint
on multiview stereo matching when the scene is observed by several lighting
configurations in which only the lighting intensity varies.
{*} Lambertian reflectance in practice means that the surface looks equally bright regardless of the direction from which you view it. This is not true for specular and mirror reflections.
“CS 354 Lighting” by Mark Kilgard
Graphics Software Engineer at NVIDIA
Direct Visual Odometry using Bit-Planes
Hatem Alismail, Brett Browning, and Simon Lucey, The Robotics Institute, Carnegie Mellon University
https://arxiv.org/abs/1604.00990 (2016)
At the core of direct Visual SLAM is the reliance on a consistent photometric
appearance across images, otherwise known as the brightness constancy assumption.
Unfortunately, brightness constancy seldom holds in real world applications
In this work, we overcome brightness constancy by incorporating feature descriptors
into a direct visual odometry framework. This combination results in an efficient
algorithm that combines the strength of both feature-based algorithms and direct
methods. Namely, we achieve robustness to arbitrary photometric variations while
operating in low-textured and poorly lit environments.
An illustration of our Bit-Planes
descriptor where each channel is
composed of bits. Since the residual
vector is binary, least squares
minimization becomes equivalent to
minimizing the Hamming distance.
Principles of Remote Sensing; Soudarissanane (2016)
Upgrade • Supervision • Loss Function #2
Geometric Loss Functions for
Camera Pose Regression with
Deep Learning
Alex Kendall, Roberto Cipolla
(Submitted on 2 Apr 2017 (v1), last revised 23 May 2017 (this version, v2))
https://arxiv.org/abs/1704.00390
We show that our geometric approach can improve PoseNet's efficacy across many different datasets – narrowing the deficit to traditional SIFT feature-based algorithms. For outdoor scenes ranging from 50,000 m^2 to 2 km^2 we can achieve relocalisation accuracies of a few meters and a few degrees. In small rooms we are able to achieve accuracies of 0.2−0.4 m.
Comparison of different loss functions. We use an L1 distance for the
residuals in each loss. Linear sum combines position and orientation losses
with a constant scaling parameter β (Kendall and Cipolla 2015) and is defined
in (2). Learn weighting is the loss function in (3) which learns to combine
position and orientation using homoscedastic uncertainty. Reprojection error
implicitly combines rotation and translation by using the reprojection error of
the scene geometry as the loss (7). We find that homoscedastic uncertainty is
able to learn an effective weighting between position and orientation
quantities. The reprojection loss was not able to converge from random
initialisation. However, when used to fine-tune a network pretrained with (3) it
yields the best results.
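A small PyTorch sketch of the learned (homoscedastic-uncertainty) weighting described in the caption; the residual definitions and the initial values of the two log-variances are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LearnedPoseLossWeighting(nn.Module):
    """Homoscedastic-uncertainty weighting between position and orientation
    losses (in the spirit of Kendall and Cipolla): two learnable
    log-variances s_x, s_q replace the hand-tuned beta of the linear sum."""
    def __init__(self):
        super().__init__()
        self.s_x = nn.Parameter(torch.tensor(0.0))   # log-variance, position
        self.s_q = nn.Parameter(torch.tensor(-3.0))  # log-variance, orientation

    def forward(self, pred_xyz, gt_xyz, pred_q, gt_q):
        pos_loss = (pred_xyz - gt_xyz).abs().sum(dim=-1).mean()  # L1 residuals
        rot_loss = (pred_q - gt_q).abs().sum(dim=-1).mean()
        return (pos_loss * torch.exp(-self.s_x) + self.s_x +
                rot_loss * torch.exp(-self.s_q) + self.s_q)
```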
Upgrade • Supervision • Semi-supervised targets
Recurrent Ladder Networks
Alexander Ilin, Isabeau Prémont-Schwarz, Tele Hotloo Hao,
Antti Rasmus, Rinu Boney, Harri Valpola
(Submitted on 28 Jul 2017)
https://arxiv.org/abs/1707.09219
“We propose a recurrent
extension of the Ladder network,
which is motivated by the
inference required in hierarchical
latent variable models. We
demonstrate that the recurrent
Ladder is able to handle a wide
variety of complex learning tasks
that benefit from iterative
inference and temporal
modeling. The architecture
shows close-to-optimal results
on temporal modeling of video
data, competitive results on
music modeling, and improved
perceptual grouping based on
higher order abstractions, such
as stochastic textures and
motion cues. We present results
for fully supervised, semi-
supervised, and unsupervised
tasks. The results suggest that
the proposed architecture and
principles are powerful tools for
learning a hierarchy of
abstractions, handling temporal
information, modeling relations
and interactions between
objects.”
(a): Simple static hierarchical latent variable model. (b): Directions of message propagation. (c):
Computational graph implementing message propagation in (b). (d): The structure of the
Ladder network can be seen as a computational graph implementing message propagation in
(c). The red circles mark the operations corresponding to the nodes of the graph in (b). (d): The
structure of the recurrent Ladder (RLadder) network.
https://arxiv.org/abs/1703.01780
https://github.com/CuriousAI/mean-teacher
A sketch of a binary classification task with two labeled examples (large blue dots) and one unlabeled
example, demonstrating how the choice of unlabeled target (black circle) affects the fitted function
(gray curve).
(a) A model with no regularization is free to fit any function. (b) A model trained with noisy labeled data
(small dots) learns to give consistent predictions around labeled data points. (c) Consistency to noise
around unlabeled examples provides additional smoothing. For the clarity of illustration, the teacher
model (blue curve) is first fitted to the labeled examples, and then left unchanged during the training of
the student model. Also for clarity, we will omit the small dots in figures d and e. (d) Noise on the teacher
model reduces the bias of the targets without additional training. The expected direction of stochastic
gradient descent is towards the mean (large blue circle) of individual noisy targets (small blue circles).
(e) An ensemble of models gives an even better expected target. Both Temporal Ensembling
and the Mean Teacher method use this approach
Upgrade • Supervision • “proxy” supervised targets
https://arxiv.org/abs/1702.02295 (Submitted on 8 Feb 2017 (v1), last revised 1 Jul 2017 (this version, v2))
We study the unsupervised learning of CNNs for optical flow estimation using proxy
ground truth data. Supervised CNNs, due to their immense learning capacity, have
shown superior performance on a range of computer vision problems including optical flow
prediction. They however require the ground truth flow which is usually not accessible
except on limited synthetic data. Without the guidance of ground truth optical flow,
unsupervised CNNs often perform worse as they are naturally ill-conditioned.
We therefore propose a novel framework in which proxy ground truth data generated
from classical approaches is used to guide the CNN learning. The models are further
refined in an unsupervised fashion using an image reconstruction loss. Our guided
learning approach is competitive with or superior to state-of-the-art approaches on three
standard benchmark datasets yet is completely unsupervised and can run in real time.
“More broadly, we introduce a paradigm which can be integrated into
future state-of-the-art motion estimation networks [Ranjan and Black (2016)]
to improve performance. In future work, we plan to experiment with
large-scale video corpora to learn non-rigid real world motion patterns
rather than just learning limited motions found in synthetic datasets.”
Upgrade • Supervision • Self-supervision
https://arxiv.org/abs/1612.03777
We improve CNN-based optical flow estimation in real videos by adding the
extra self-supervised task of future frame prediction, and training the
network with a mixture of synthetic and real-world videos. This
combination is made possible by putting a “multiplexer” at the entry of the
network which mixes data from the two sources on a timely basis.
Upgrade • Supervision • Dense object segmentation
Semantic Video Segmentation by
Gated Recurrent Flow Propagation
David Nilsson, Cristian Sminchisescu
(Submitted on 28 Dec 2016)
https://arxiv.org/abs/1612.08871
Deep Semantic Segmentation for Automated
Driving: Taxonomy, Roadmap and Challenges
Mennatullah Siam, Sara Elkerdawy, Martin Jagersand, Senthil Yogamani
Submitted on 8 Jul 2017 (v1), last revised 3 Aug 2017
https://arxiv.org/abs/1707.02432
Overview of our Spatial
Transformer Gated
Recurrent Unit (STGRU),
combining a Spatial
Transformer Network for
optical flow warping with
a Gated Recurrent Unit to
adaptively propagate and
fuse semantic
segmentation information
over time.
Modular End to End learning: We
use the term modular end to end
learning when there are auxiliary
losses to ensure safety and interpretability. For instance, segmentation
loss can be added as an auxiliary loss
for an end to end driving CNN [
Xu et al. (2016)]. Using this auxiliary
loss, the CNN loosely learns to
semantically segment, but it also
learns to have a better representation
for the intermediate features. It was
shown in that work that using auxiliary
loss outperforms the vanilla end to
end learning. The work also uses
recurrent gated unit after the CNN to
model temporal information.
Upgrade • Supervision • generative motion and content
https://arxiv.org/abs/1707.04993
https://github.com/sergeytulyakov/mocogan
https://arxiv.org/abs/1706.08033
https://sites.google.com/a/umich.edu/rubenevillegas/iclr2017
We propose a deep neural network for the prediction of future frames in natural video
sequences. To effectively handle complex evolution of pixels in videos, we propose to
decompose the motion and content, two key components generating dynamics in videos. Our
model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional
LSTM for pixel-level prediction, which independently capture the spatial layout of an image
and the corresponding temporal dynamics.
By independently modeling motion and content, predicting the next frame reduces to
converting the extracted content features into the next frame content by the identified motion
features, which simplifies the task of prediction. Our model is end-to-end trainable over
multiple time steps, and naturally learns to decompose motion and content without separate
training.
Upgrade • Supervision • data Augmentation
Depth degradation techniques [taken from Yang et al. (2012)].
(a) under-sampling, (b) under-sampling with signal-dependent
noise, (c) random missing, and (d) structural missing.
[Rotman and Gilboa (2016)]
From left to right, Kinect 1, 2 and R200 RealSense. The Kinect 1
sensor features invalid (black) depth values, and crooked edges.
The Kinect 2 has false intermediate depth values (on the right side
of the object). The RS shows depth artifacts with erroneous values
(outlier white pixel on left bottom). [Rotman and Gilboa (2016)]
Alismail et al. (2016)
An Image Degradation Model for Depth-
augmented Image Editing
(2015) James W. Hennessey, Niloy J. Mitra,
http://dx.doi.org/10.1111/cgf.12707
Upgrade • Supervision • (multimodal) decomposition
Intrinsic Depth.
(a) Input video.
(b),(c) Albedo and
shading estimated by the intrinsic
video method.
(d) Surface
contours modified to combine RGB,
albedo and shading information.
(e) Proxy depth by propagating
sparse SfM depth using video
segments from [9].
(f) Depth estimated by our
method, which combines the previous
two methods.
(g) Depth from the original Depth
Transfer method.
(h) Depth from the fully-metric
method.
(i) Depth from the example-based
single image method.
(j) Ground truth depth. Note that
integrating information from different
intrinsic images improves the
estimation of the depth structure.
In (e) and (j), black pixels
indicate that no valid depth values are
provided.
Kong and Black (2015)
Decomposing Single Images for Layered Photo Retouching
Carlo Innamorati, Tobias Ritschel, Tim Weyrich, Niloy J. Mitra
University College London
http://dx.doi.org/10.1111/cgf.13220
http://geometry.cs.ucl.ac.uk/projects/2017/layered-retouching/
Outline of proposed technique. (a) The Kinect depth of an object is combined with (b) three photos at different rotations of a polarizing
filter. (c) Integration of surface normals obtained from Fresnel equations. Note the azimuthal ambiguity (observed as a flip in the shape)
and distortion of the zenith angle (observed as flatness in the shape). (d) Integration of surface normals after correcting for azimuthal
ambiguity removes the flip, and the final result is shown in (e) after correcting for zenith distortion and using physics-based integration.
Kadambi et al. (2015) - http://web.media.mit.edu/~achoo/polar3D/ - http://news.mit.edu/2015/algorithms-boost-3-d-imaging-resolution-1000-times-1201
Polarization-sensing to improve depth
Upgrade • Supervision • Multimodal Sensing • Rolling shutter motion
IEEE Transactions on Visualization and Computer Graphics ( Volume: 22, Issue: 11, Nov. 2016 )
Towards Kilo-Hertz 6-DoF Visual Tracking Using an Egocentric
Cluster of Rolling Shutter Cameras
Akash Bapat ; Enrique Dunn ; Jan-Michael Frahm
https://doi.org/10.1109/TVCG.2016.2593757
“The key idea is that a rolling shutter camera works by capturing the rows of
an image in rapid succession, essentially acting as a high-frequency 1D
image sensor. By integrating multiple rolling shutter cameras on the AR
device, our tracker is able to perform 6-DOF markerless tracking in a static
indoor environment with minimal latency.”
https://arxiv.org/abs/1704.07163
“Rolling Shutter (RS) cameras have become popularized because of low-cost imaging
capability. However, the RS cameras suffer from undesirable artifacts when the camera or
the subject is moving, or illumination condition changes. For that reason, Monocular Visual
Odometry (MVO) with RS cameras produces inaccurate ego-motion estimates. Previous
works solve this RS distortion problem with motion prediction from images and/or inertial
sensors. However, the MVO still has trouble in handling the RS distortion when the camera
motion changes abruptly (e.g. vibration of mobile cameras causes extremely fast motion
instantaneously).”
Coded Rolling Shutter Photography:
Flexible Space-Time Sampling
http://www.cs.columbia.edu/CAVE/projects/crsp/
The interlaced readout
can be used to compute
optical flow between the
two sub-images after
vertical interpolation. The
optical flow can be used
for motion interpolation,
skew compensation, and
motion deblur. Please refer
to the paper for details.
Upgrade • Supervision • Transfer learning #1
Application of transfer learning in
RGB-D object recognition
Advances in Computing, Communications and Informatics (ICACCI), 2016 International Conference on
Abhishek Kumar ; S. Nithin Shrivatsav ; G. R. K. S. Subrahmanyam ; Deepak Mishra
https://doi.org/10.1109/ICACCI.2016.7732108
“Firstly we trained a CNN network with 10 classes of different objects and then we
transfer the parameters to RGB and depth CNN network. This enables the network to
train faster and also achieve higher accuracy for a given number of epochs.”
Depth CNNs for RGB-D Scene Recognition: Learning from Scratch Better
than Transferring from RGB-CNNs
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)
Xinhang Song, Luis Herranz, Shuqiang Jiang
https://github.com/songxinhang/D-CNN
https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14695/14310
HHA encoding for depth data (Gupta et al. 2014), is a three
channel representation (horizontal disparity, height above
ground, and angle with the direction of gravity) of depth data.
Transferring deep representations
within the same modality (e.g. Places-
CNN fine tuned on SUN397) works
well, since low-level patterns have
similar distributions, and bottom layers
can be reused while adjusting the more
dataset-specific top layers. However,
fine tuning is not that effective in inter-
modal transfer, such as Places-CNN
to depth in the HHA space, where low-
level features require modality-
specific filters. In this paper, we focus
on the bottom layers, because they are
more critical to represent depth data
properly. By reducing the number of
parameters of the network, and using
weakly supervised learning over
patches, the complexity of the model
matches better the amount of data
available. This depth representation is
not only more discriminative than those
fine tuned from Places-CNN but also
when combined with RGB features the
gain is higher, showing that both are
complementary. Notice also, that we do
not depend (for depth) on large
datasets such as Places.
https://doi.org/10.1016/j.patcog.2017.07.026
“The RGB-specific detection network is initialized with an ImageNet [Deng et al. (2009)] RGB classification model. To better leverage the depth information, the modality-correlated and depth-specific networks are initialized from a supervision transfer model [Gupta et al. (2016)].”
Upgrade • Supervision • Transfer learning #2
Learning Transferrable Knowledge for Semantic
Segmentation With Deep Convolutional Neural Network
Seunghoon Hong, Junhyuk Oh, Honglak Lee, Bohyung Han;
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3204-3212
https://doi.org/10.1109/CVPR.2016.349
Overall architecture of the proposed algorithm. Given a
feature extracted from the encoder, the attention model
estimates adaptive spatial saliency of each category associated
with input image. The outputs of attention model are
subsequently fed into the decoder, which generates foreground
segmentation mask of each focused region. During training, we
fix the encoder by pre-trained weights, and leverage the
segmentation annotations from source domain to train both the
decoder and the attention model, and image-level class labels in
both domains to train the attention model under classification
objective. After training, semantic segmentation on the target
domain is performed naturally by exploiting the decoder trained
with source images and the attention model adapted to target
domain
The contributions of this paper are summarized below.
● We propose a new paradigm for weakly-supervised semantic segmentation, which exploits segmentation annotations
from different categories to guide segmentations with weak annotations. To our knowledge, this is the first attempt to tackle the
weakly-supervised semantic segmentation problem by transfer learning.
● We propose a novel encoder-decoder architecture with attention model, which is appropriate to transfer the segmentation
knowledge across categories.
● The proposed algorithm achieves substantial performance improvement over existing weakly-supervised approaches by
exploiting segmentation annotations in exclusive categories.
Upgrade • Supervision • Transfer learning #3
Borrowing Treasures from the Wealthy:
Deep Transfer Learning through Selective
Joint Fine-tuning
Weifeng Ge, Yizhou Yu (Submitted on 28 Feb 2017 (v1),
last revised 6 Jun 2017 (this version, v2))
https://arxiv.org/abs/1702.08690
https://github.com/ZYYSzj/Selective-Joint-Fine-tuning
In this paper, we introduce a source-target selective joint
fine-tuning scheme for improving the performance of
deep learning tasks with insufficient training data. In this
scheme, a target learning task with insufficient training
data is carried out simultaneously with another source
learning task with abundant training data. However, the
source learning task does not use all existing training data.
Our core idea is to identify and use a subset of training
images from the original source learning task whose low-
level characteristics are similar to those from the target
learning task, and jointly fine-tune shared convolutional
layers for both tasks.
Pipeline of the proposed selective joint fine-tuning. From left to right: (a) Datasets in the source domain and the target domain. (b) Select nearest
neighbors of each target domain training sample in the source domain via a low-level feature space. (c) Deep convolutional neural network initialized
with weights pre-trained on ImageNet or Places. (d) Jointly optimize the source and target cost functions in their own label spaces.
Similar Image Search
There is a unique step in our pipeline. For each image
from the target domain, we search a certain number
of images with similar low-level characteristics
from the source domain. Only images returned from
these searches are used as training images for the
source learning task in selective joint fine-tuning. We
elaborate this image search step below.
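A hypothetical sketch of such a low-level similarity search is given below; the gradient-orientation histogram descriptor and the scikit-learn nearest-neighbour search are assumptions chosen for brevity, not the filter-bank descriptors used in the paper.

# Hypothetical sketch of the "similar image search" step; the descriptor
# (gradient-orientation histogram) and k are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def low_level_descriptor(gray, n_bins=32):
    gy, gx = np.gradient(gray.astype(float))
    mag, ang = np.hypot(gx, gy), np.arctan2(gy, gx)
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)            # normalised orientation histogram

def select_source_subset(target_grays, source_grays, k=50):
    t = np.stack([low_level_descriptor(g) for g in target_grays])
    s = np.stack([low_level_descriptor(g) for g in source_grays])
    nn = NearestNeighbors(n_neighbors=k).fit(s)
    _, idx = nn.kneighbors(t)                    # k nearest source images per target image
    return np.unique(idx.ravel())                # union of neighbours = source training subset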
In summary, this paper has the following contributions:
● We introduce a new deep transfer learning scheme, called selective joint fine-tuning, for improving the
performance of deep learning tasks with insufficient training data. It is an important step forward in the
context of the widely adopted strategy of fine-tuning a pre-trained deep neural network.
● We develop a novel pipeline for implementing this deep transfer learning scheme. Specifically, we
compute descriptors from linear or nonlinear filter bank responses on training images from both
tasks, and use such descriptors to search for a desired subset of training samples for the source
learning task.
● Experiments demonstrate that our deep transfer learning scheme achieves state-of-the-art
performance on multiple visual classification tasks with insufficient training data for deep learning.
SfM-Net • Implementation
SfM-Net • Implementation Details
coarse map
2 x fully connected layers
Our depth-predicting structure and object-mask-predicting motion conv/deconv networks share similar architectures but use independent weights. Each consists of a series of 3×3 convolutional layers alternating between stride 1 and stride 2, followed by deconvolutional operations consisting of a depth-to-space upsampling, concatenation with corresponding feature maps from the convolutional portion, and a 3×3 convolutional layer. Batch normalization is applied to all convolutional layer outputs.
The structure network takes a single frame as input, while
the motion network takes a pair of frames. We predict depth
values using a 1×1 convolutional layer on top of the image-
sized feature map. We use ReLU activations because depths are positive, and a bias of 1 to prevent small depth values. The maximum predicted depth value is further clipped at 100 to prevent large gradients.
We predict object masks from the image-sized feature map of the motion network using a 1×1 convolutional layer with sigmoid activations. To encourage sharp masks we multiply the logits of the masks by a parameter that is a function of the number of steps for which the network has been trained. The pivot variables are predicted as heat maps using a softmax function over all the locations in the image, followed by a weighted average of the pixel locations.
Stride-1 block: keras.layers.convolutional.Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1))
Stride-2 block: keras.layers.convolutional.Conv2D(filters=64, kernel_size=(3, 3), strides=(2, 2))
Upsampling: keras.layers.convolutional.UpSampling2D(size=(2, 2))
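Putting the pieces above together, a minimal sketch of the described conv/deconv pattern might look as follows in tf.keras; the number of stages and the filter counts are illustrative assumptions, and this is not the authors' implementation (the object-mask head is omitted).

# Sketch (not the authors' code) of the conv/deconv structure network described above,
# assuming tf.keras; stage count and filter sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn(x, filters, stride):
    x = layers.Conv2D(filters, 3, strides=stride, padding='same', activation='relu')(x)
    return layers.BatchNormalization()(x)        # batch norm on every conv output

def depth_net(inputs):
    # Encoder: 3x3 convolutions alternating stride 1 and stride 2.
    c1 = conv_bn(inputs, 32, 1)
    c2 = conv_bn(c1, 64, 2)
    c3 = conv_bn(c2, 128, 1)
    c4 = conv_bn(c3, 128, 2)
    # Decoder: depth-to-space upsampling, concatenation with encoder features, 3x3 conv.
    u1 = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(c4)
    d1 = conv_bn(layers.Concatenate()([u1, c3]), 64, 1)
    u2 = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(d1)
    d2 = conv_bn(layers.Concatenate()([u2, c1]), 32, 1)
    # Depth head: 1x1 conv, ReLU with a bias of 1 so depths stay positive, clipped at 100.
    depth = layers.Conv2D(1, 1, activation='relu',
                          bias_initializer=tf.keras.initializers.Constant(1.0))(d2)
    return layers.Lambda(lambda t: tf.clip_by_value(t, 0.0, 100.0))(depth)

# inputs = tf.keras.Input(shape=(128, 384, 3)); model = tf.keras.Model(inputs, depth_net(inputs))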
SfM-Net • Results
SfM-Net • Experimental Results #1
Qualitative comparison of the estimated
depth using our unsupervised model
on sequences versus using stereo pairs
in the KITTI 2012 benchmark. When
using stereo pairs the camera pose
between the pair is constant and hence
the model is equivalent to the approach
of Garg et al. (2016).
For sequences, our model needs to
additionally predict camera rotation and
translation between the two frames. The
first six rows show successful
predictions even without camera pose
information and the last two illustrate
failure cases.
The failure cases show that when there
is no translation between the two frames
depth estimation fails whereas when
using stereo pairs there is always a
constant offset between the frames.
SfM-Net • Experimental Results #2
Ground truth
segmentation and
flow compared to
predicted motion
masks and flow
from SfM-Net in
KITTI 2015.
The model was
trained in a fully
unsupervised
manner. The top
six rows show
successful
prediction and the
last two show
typical failure
cases.
SfM-Net • Experimental Results #3
Motion segments computed from SfM-Net on MoSeg [Brox and Malik (2010)].
The model was trained in a fully unsupervised manner.
“We report camera rotation and translation error in Table 2 for
each of the Freiburg sequences compared to the error in the
benchmark’s baseline trajectories. Our model was trained from
scratch for each sequence and used the focal length value
provided with the dataset. We observe that our results better
estimate the frame-to-frame translation and are comparable for
rotation.”
SfM-Net • Summary
SfM-Net • Conclusion
Current geometric SLAM methods obtain excellent egomotion and rigid 3D reconstruction
results, but often come at a price of extensive engineering, low tolerance to moving
objects — which are treated as noise during reconstruction — and sensitivity to camera
calibration.
Furthermore, matching and reconstruction are difficult in low textured regions.
Incorporating learning into depth reconstruction, camera motion prediction and object
segmentation, while still preserving the constraints of image formation, is a promising way to
robustify SLAM and visual odometry even further. However, the exact training scenario
required to solve this more difficult inference problem remains an open question.
Exploiting long history and far in time forward-backward constraints with visibility
reasoning is an important future direction. Further, exploiting a small amount of
annotated videos for object segmentation, depth, and camera motion, and combining
those with an abundance of self-supervised videos, could help initialize the network
weights in the right regime and facilitate learning. Many other curriculum learning
regimes, including those that incorporate synthetic datasets, can also be considered.
Future • Architecture
Pipeline • Future paths • “GRID” Architecture • Review
https://arxiv.org/abs/1702.08670
Our technical results are corroborated by an extensive
set of evaluations, presented in this paper as well as
independent empirical observations reported by other
groups. We also perform experiments showing the
practical implications of our framework for choosing the
best fully-connected design for a given problem.
Given the fast pace of deep learning as a field, it is very hard to find a good methodological review. Be cautious when reading these: e.g. a review from February 2017 is already quite dated when read in August 2017.
https://arxiv.org/abs/1512.07108
Systematic evaluation of CNN advances on the ImageNet
Dmytro Mishkin, Nikolay Sergievskiy, Jiri Matas 16 May 2017
https://doi.org/10.1016/j.cviu.2017.05.007
https://arxiv.org/abs/1606.02228
https://www.researchgate.net/publication/316970253_Systematic_Evaluation_of_Convo
lution_Neural_Network_Advances_on_the_ImageNet
The commonly used input to CNN is raw RGB pixels and the commonly adopted
recommendation is not to use any pre-processing. There has not been much
research on the optimal colorspace or pre-processing techniques for CNN.
Rachmadi and Purnama (2015) explored different colorspaces for vehicle color
identification, Dong et al. (2014) compared YCrCb and RGB channels for image
superresolution, Graham (2015) extracted local average color from retina images in
winning solution to the Kaggle Diabetic Retinopathy Detection competition.
Petteri: The authors could also have tested CIELab, which might have been an interesting colorspace especially in photo-enhancement applications [e.g. Yan et al. (2016)]
Pipeline • Future paths • ConvNet Architecture • DenseNet #1
To ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map
sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all
preceding layers and passes on its own feature-maps to all subsequent layers.
Implicit Deep Supervision. One explanation for the improved accuracy of dense convolutional networks may be that
individual layers receive additional supervision from the loss function through the shorter connections. One can
interpret DenseNets to perform a kind of “deep supervision”. The benefits of deep supervision have previously been
shown in deeply-supervised nets (DSN; Lee et al. 2014), which have classifiers attached to every hidden layer, enforcing
the intermediate layers to learn discriminative features.
https://arxiv.org/abs/1611.09326
Pipeline • Future paths • ConvNet Architecture • DenseNet #2
https://arxiv.org/abs/1608.06993
Image classification Semantic Segmentation Optical Flow
https://arxiv.org/abs/1611.09326 https://arxiv.org/abs/1707.06316
Pipeline • Future paths • ConvNet Architecture • DenseNet #3
https://arxiv.org/abs/1702.02295 https://arxiv.org/abs/1707.06316
Classical expanding uses a series of convolutions, deconvolutions, and skip connections to recover the spatial resolution in order to get the per-pixel prediction results. Due to the good properties of DenseNet, we propose to replace the convolutions with dense blocks during expanding as well. However, if we follow the same dense connectivity pattern, the number of feature maps after each dense block will keep increasing. Considering that the resolution of the feature maps also increases during expanding, the computational cost would be intractable for current GPUs. Thus, for a dense block in the expanding part, we do not concatenate the input to its final output. For example, if the input has k₀ channels, the output of an L-layer dense block will have Lk feature maps, where k is the growth rate of a DenseNet, defining the number of feature maps each layer produces. Note that dense blocks in the contracting part will output k₀ + Lk feature maps. For symmetry, we also introduce four dense blocks in the expanding part, each of which has four layers. The bottom-layer feature maps at the same resolution are concatenated through skip connections. Between the dense blocks, there are transition-up layers composed of two 3×3 deconvolutions with a stride of 2: one for upsampling the estimated optical flow, and the other for upsampling the feature maps.
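As a concrete contrast between the two block types, a minimal tf.keras sketch is shown below; the BN-ReLU-3×3-conv composite function and the growth rate are the usual DenseNet choices and assumptions here, not necessarily the exact configuration of the cited work.

# Dense block sketch: the contracting-path variant re-concatenates the input
# (k0 + L*k output channels), the expanding-path variant drops it (L*k channels).
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12, keep_input=True):
    feats = [x]
    for _ in range(num_layers):
        h = layers.Concatenate()(feats) if len(feats) > 1 else feats[0]
        h = layers.BatchNormalization()(h)
        h = layers.Activation('relu')(h)
        h = layers.Conv2D(growth_rate, 3, padding='same')(h)   # each layer adds k feature maps
        feats.append(h)
    outputs = feats if keep_input else feats[1:]               # expanding path: no input concat
    return outputs[0] if len(outputs) == 1 else layers.Concatenate()(outputs)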
“Our model is totally unsupervised, thus we can experiment with large-scale video corpora in future work, to learn non-rigid real-world motion patterns. Through comparison of
popular CNN architectures, we found that it is important to
design novel operators or networks for optical flow
estimation instead of relying on existing architectures
for image classification.”
In this work, we choose FlowFields (Bailer et al. 2015) as our classical optical flow estimator. To
our knowledge, it is one of the most accurate flow estimators among the published work. We
hope that by using FlowFields to generate proxy ground truth, we can learn to estimate
motion between image pairs as effectively as using the true ground truth.
If a classical approach fails to detect certain motion patterns, a network trained on the proxy
ground truth is also likely to miss these patterns. This leads us to ask whether there is other unsupervised guidance that can improve the network training.
The unsupervised approach (Yu et al. 2016) treats optical flow estimation as an image
reconstruction problem based on the intuition that if the estimated flow and the next frame can
be used to reconstruct the current frame then the network has learned useful representations
of the underlying motions.
Note that we could add additional unsupervised guides, like a gradient constancy assumption or an edge-aware weighted smoothness loss (Godard et al. 2016), to further fine-tune our models.
Pipeline • Future paths • ConvNet Architecture • DenseNet #4
degradation. However, during inference the lower layers do not know about the higher-layer features, even though those layers contain contextual high-level semantics that could help the lower layers adaptively extract informative features for later layers.
In this paper, we study the influence of
backward skip connections which are in the
opposite direction to forward shortcuts, i.e.
paths from high layers to low layers. To achieve
this -- which indeed runs counter to the nature
of feed-forward networks -- we propose a new
fully convolutional model that consists of a pair
of networks. A `Slave' network is dedicated to
provide the backward connections from its top
layers to the `Master' network's bottom layers.
‘Forward skip’ is an illustration of one forward skip connection in our
own baseline FCN-8s-F1. ‘Backward Skip’ is our proposed design
where we align two networks, a Master and Slave. The Slave network
provides the backward skip connections from its higher layers to the
Master network lower layers. ‘Block in MSNet’ shows an illustration of
our final model’s skip connections, where additional forward skip
connections have been added to the Master.
A detailed overview of our MSNet-FB1 model. The model consists of two networks; Master
and Slave. Slave is a basic FCN-8s. The Master is a FCN-8s-F1 (as our baseline) and
contains the backward skip connections taken from the Slave network. The forward
propagation starts in the Slave network, and then the generated feature maps are fed back
to lower layers in the Master network serving as backward skip connections within the
Master. Notice that the backward skip connection (blue arrows) are exactly the inverse of
the forward skip connections (purple arrows).
An illustration of several convolutional blocks of our MSNet-FB2
with dense skip connections. Notice that the Master network is
FCN-8s-F2. We set P = 3 and N = 3. Here the dense backward skip
connections are in inverse directions to the dense forward skip
connections
Pipeline • Future paths • ConvNet Architecture • DenseNet #5
The use of backward connections is actually closer to the cliché of deep learning being analogous to the human brain.
Deep neural network explains early visual and inferior temporal (IT) representations of object images.
Each representation in model and brain was characterized by the dissimilarity matrix of the response patterns
elicited by a set of real-world photos of objects.
(a) Representations become monotonically more similar to those of human inferior temporal (IT) cortex as we
ascend the layers of the Krizhevsky et al. (2012) neural network. When the final representational stages are
linearly remixed to emphasize the same semantic dimensions as IT using linear category discriminants
(second bar from the right), and when each layer and each discriminant are assigned a weight to model the
prevalence of different computational features in IT (cross-validated to avoid overfitting to the image set;
rightmost bar), the noise ceiling ( gray shaded region) is reached, indicating that the model fully explains the
data. (b) Lower layers of the deep neural network resemble the representations in the foveal confluence of
early visual areas (V1–V3).
http://dx.doi.org/10.1146/annurev-vision-082114-035447
https://arxiv.org/abs/1604.03640
Center for Brains, Minds and Machines, McGovern Institute, MIT
“Feed-forward zombie”: The Sciences of Consciousness: Progress and Problems, Center for Brains, Minds and Machines (CBMM), Christof Koch, Allen Institute for Brain Science, https://youtu.be/4gT-1S3FO4s?t=1h9m34s (“Not pleasing the people worshipping at the altar of computationalism”)
Pipeline • Future paths • Uncertainty • with DenseNet
https://arxiv.org/abs/1506.02142 https://arxiv.org/abs/1705.07832
https://arxiv.org/abs/1703.04977
There are two major types of uncertainty one can
model. Aleatoric uncertainty captures noise inherent in
the observations. On the other hand, epistemic
uncertainty accounts for uncertainty in the model --
uncertainty which can be explained away given enough
data. Traditionally it has been difficult to model
epistemic uncertainty in computer vision, but with new
Bayesian deep learning tools this is now possible. We
study the benefits of modeling epistemic vs. aleatoric
uncertainty in Bayesian deep learning models for vision
tasks.
Our model based on DenseNet can process a 640 ×
480 resolution image in 150ms on a NVIDIA Titan X
GPU. The aleatoric uncertainty models add negligible
compute. However, epistemic models require
expensive Monte Carlo dropout sampling. For
models such as ResNet, this is possible to achieve
economically because only the last few layers contain
dropout. Other models, like DenseNet, require the entire
architecture to be sampled. This is difficult to parallelize
due to GPU memory constraints, and often results in a
50× slowdown for 50 Monte Carlo samples.
https://arxiv.org/abs/1703.02910
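For reference, the Monte Carlo dropout sampling referred to above needs only a few lines at inference time; the sketch below assumes a tf.keras model that already contains Dropout layers and simply keeps them active while sampling.

# Monte Carlo dropout sketch (assumes a tf.keras model with Dropout layers);
# calling the model with training=True keeps dropout active at test time.
import numpy as np

def mc_dropout_predict(model, x, n_samples=50):
    samples = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
    return samples.mean(axis=0), samples.var(axis=0)   # predictive mean, epistemic variance

The 50 forward passes are exactly where the quoted ~50× slowdown comes from when the whole architecture, rather than only the last few layers, has to be sampled.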
Pipeline • Future paths • Uncertainty • With model compression
Bayesian Compression
for Deep Learning
Christos Louizos, Karen Ullrich, Max Welling
(Submitted on 24 May 2017 (v1), last revised 10 Aug 2017 (this version, v3))
https://arxiv.org/abs/1705.08665
From a Bayesian perspective network pruning and
reducing bit precision for the weights is aligned
with achieving high accuracy, because Bayesian
methods search for the optimal model structure
(which leads to pruning with sparsity inducing
priors), and reward uncertain posteriors over
parameters through the bits back argument
[Hinton and Van Camp, 1993]
(which leads to removing
insignificant bits). This relation is made explicit in
the MDL principle [Grünwald, 2007]
which is known to be
related to Bayesian inference.
By employing sparsity inducing priors for hidden
units (and not individual weights) we can prune
neurons including all their ingoing and outgoing
weights. This avoids more complicated and
inefficient coding schemes needed for pruning or
vector quantizing individual weights. As an
additional Bayesian bonus we can use the
posterior uncertainties to assess which bits
are significant and remove the ones which
fluctuate too much under posterior sampling.
From this we derive the optimal fixed point
precision per layer, which is still practical on chip.
For the actual compression task
we compare our method to
current work in three different
scenarios: (i) compression
achieved only by pruning, here, for
non-group methods we use the
CSC format to store parameters;
(ii) compression based on the
former but with reduced bit
precision per layer (only for the
weights); and (iii) the maximum
compression rate as proposed by
Han et al. [2016]. We believe these
to be relevant scenarios because
(i) can be applied with already
existing frameworks such as
Tensorflow, (ii) is a practical
scheme given upcoming GPUs and
frameworks will be designed to
work with low and mixed precision
arithmetics [Lin and Talathi, 2016,
Gysel, 2016]
Pipeline • Future paths • Uncertainty • Geometric problems
http://mi.eng.cam.ac.uk/projects/relocalisation/
Future • Geometric Architectures
Pipeline • Future paths • Geometric Deep Learning #1
Bronstein et al. (July 2017): “Geometric deep learning (
http://geometricdeeplearning.com/) is an umbrella term for emerging
techniques attempting to generalize (structured) deep neural models to non-
Euclidean domains, such as graphs and manifolds. The purpose of this article
is to overview different examples of geometric deep-learning problems and
present available solutions, key difficulties, applications, and future research
directions in this nascent field”
SCNN (2013)
GCNN/ChebNet (2016)
GCN (2016)
GNN (2009)
Geodesic CNN (2015)
Anisotropic CNN (2016)
MoNet (2016)
Localized SCNN (2015)
Pipeline • Future paths • Geometric Deep Learning #2
Bronstein et al. (July 2017): “The non-Euclidean nature of data
implies that there are no such familiar properties as global
parameterization, common system of coordinates, vector space
structure, or shift-invariance. Consequently, basic operations like
convolution that are taken for granted in the Euclidean case are even
not well defined on non-Euclidean domains.”
“First attempts to generalize neural networks to graphs we are aware of are due to Gori et al. (2005), who proposed a scheme combining recurrent neural networks and random walk models. This approach went almost unnoticed, re-emerging in a modern form in Sukhbaatar et al. (2016) and Li et al. (2015) due to the renewed recent
interest in deep learning.”
“In a parallel effort in the computer vision and graphics community,
Masci et al. (2015) showed the first CNN model on meshed surfaces,
resorting to a spatial definition of the convolution operation based on
local intrinsic patches. Among other applications, such models were
shown to achieve state-of-the-art performance in finding
correspondence between deformable 3D shapes. Followup works
proposed different construction of intrinsic patches on point clouds
Boscaini et al. (2016)a,b and general graphs Monti et al. (2016).”
In calculus, the notion of derivative describes
how the value of a function changes with an
infinitesimal change of its argument. One of the
big differences distinguishing classical calculus
from differential geometry is a lack of vector
space structure on the manifold, prohibiting us
from naïvely using expressions like f(x+dx). The
conceptual leap that is required to generalize
such notions to manifolds is the need to work
locally in the tangent space.
Physically, a tangent vector field can be
thought of as a flow of material on a manifold.
The divergence measures the net flow of a field
at a point, allowing to distinguish between field
‘sources’ and ‘sinks’. Finally, the Laplacian (or
Laplace-Beltrami operator in differential
geometric jargon)
“A centerpiece of classical Euclidean signal processing is the property of the Fourier
transform diagonalizing the convolution operator, colloquially referred to as the
Convolution Theorem. This property allows to express the convolution f⋆g of two
functions in the spectral domain as the element-wise product of their Fourier transforms.
Unfortunately, in the non-Euclidean case we cannot even define the operation x-x’ on the
manifold or graph, so the notion of convolution does not directly extend to this case.”
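For reference, the standard spectral construction used by several of the models listed earlier sidesteps this by defining convolution through the eigenbasis of the (graph) Laplacian; the sketch below is the common textbook form, not specific to any single model:

L = \Phi \Lambda \Phi^{\top}, \qquad
f \star g \;=\; \Phi \left( (\Phi^{\top} g) \odot (\Phi^{\top} f) \right)

where the columns of \Phi are the Laplacian eigenvectors (playing the role of the Fourier basis) and \odot is the element-wise (Hadamard) product, so filtering again becomes multiplication in the spectral domain, just as the Convolution Theorem prescribes in the Euclidean case.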
Pipeline • Future paths • Geometric Deep Learning #3
Bronstein et al. (July 2017): “We expect the following years to bring exciting new approaches
and results, and conclude our review with a few observations of current key difficulties and
potential directions of future research.”
Generalization: Generalizing
deep learning models to
geometric data requires not only
finding non-Euclidean
counterparts of basic building
blocks (such as convolutional
and pooling layers), but also
generalization across different
domains. Generalization
capability is a key requirement in
many applications, including
computer graphics, where a
model is learned on a training set
of non-Euclidean domains (3D
shapes) and then applied to
previously unseen ones.
Time-varying domains: An
interesting extension of geometric
deep learning problems discussed
in this review is coping with signals
defined over a dynamically
changing structure. In this case, we
cannot assume a fixed domain and
must track how these changes
affect signals. This could prove
useful to tackle applications such
as abnormal activity detection in
social or financial networks. In the
domain of computer graphics and
vision, potential applications deal
with dynamic shapes (e.g. 3D video
captured by a range sensor).
Computation: The final consideration is
a computational one. All existing deep
learning software frameworks are
primarily optimized for Euclidean data.
One of the main reasons for the
computational efficiency of deep
learning architectures (and one of the
factors that contributed to their
renaissance) is the assumption of
regularly structured data on 1D or 2D
grid, allowing to take advantage of
modern GPU hardware. Geometric data,
on the other hand, in most cases do not
have a grid structure, requiring different
ways to achieve efficient computations.
It seems that computational paradigms
developed for large-scale graph
processing are more adequate
frameworks for such applications.
Pipeline • Future paths • sparsity primer from neuroscience
Questioning the role of sparse
coding in the brain
Anton Spanne, Henrik Jörntell
July 2015 - http://doi.org/10.1016/j.tins.2015.05.005
Optimal Degrees of Synaptic
Connectivity
Litwin-Kumar A, Harris KD, Axel R, Sompolinsky H, Abbott LF
Feb 2017 - https://doi.org/10.1016/j.neuron.2017.01.030
3D Visual Response Properties of MSTd Emerge
from an Efficient, Sparse Population Code
Michael Beyeler, Nikil Dutt and Jeffrey L. Krichmar
Journal of Neuroscience 10 August 2016
http://doi.org/10.1523/JNEUROSCI.0396-16.2016
● Sparse coding is questioned on both theoretical
and experimental grounds.
● Generalization is important to current brain
models but is weak under sparse coding.
● The beneficial properties ascribed to sparse
coding can be achieved by alternative means.
Sparse code is a compromise between local code and dense code. (A)
Comparison of coding schemes that differ in their ratio of active neurons: in
other words, in their sparseness. The activities within the population during
two hypothetical contexts (context A and context B) are shown as examples of
how different contexts are represented within the population. Note that by our
definition only a single context would be active at any time because a context
represents the global brain state (i.e., all the neurons). In local code, a context is
represented by the activity of a single neuron, or a small subset of neurons, and
different contexts are represented by different neurons. Notably, the activities
of the neurons are not independent because if a neuron is responding to
context A, it will not respond to any other context. In dense code, all neurons
are active and their combined activity is used to encode each context. Any
state in between the two extreme cases of local and dense code can in
principle be labeled sparse code. The reduction of average activation leads to a
reduction in the overlap or interference between the activation during
different contexts. (B) In the special case of binary activation functions,
maximal representational capacity is obtained if 50% of the neurons are active
during each context. For this reason an average activation of 50% is usually
considered dense code in the binary case.
● Sparse synaptic wiring can optimize a neural
representation for associative learning
● Maximizing dimension predicts the degree of
connectivity for cerebellum-like circuits
● Supervised plasticity of input connections is
needed to exploit dense wiring
● Performance of a Hebbian readout neuron is
formally related to dimension
(A) Schematic of network with a classifier that computes a weighted sum
of mixed-layer activity to determine the valence of an input pattern.
Example flow fields generated with the motion field model
[Longuet-Higgins and Prazdny (1980); Raudies (2013)]. Generated from a
pinhole camera with image plane.
Pipeline • Future paths• Non-euclidean computability & Geometric Operators #1
https://arxiv.org/abs/1707.06316
“Our model is totally unsupervised, thus we can experiment with large-scale video corpora in future work, to learn non-rigid real-world motion patterns. Through comparison of popular CNN architectures,
we found that it is important to design novel operators or
networks for optical flow estimation instead of relying on
existing architectures for image classification”
Away from GRID
Szegedy et al. 2014 (Google, cited by 4,027 articles):
“Today’s computing infrastructures are very inefficient
when it comes to numerical calculation on non-uniform
sparse data structures. Even if the number of arithmetic
operations is reduced by 100×, the overhead of lookups
and cache misses is so dominant that switching to sparse
matrices would not pay off. The gap is widened even further
by the use of steadily improving, highly tuned, numerical
libraries that allow for extremely fast dense matrix
multiplication. Also, non-uniform sparse models require
more sophisticated engineering and computing
infrastructure.”
https://arxiv.org/abs/1704.04760
Google not quite there yet for hardware-accelerated
sparse matrix deep learning.
“The Matrix Unit of this custom ASIC, called a Tensor Processing Unit (TPU), is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs.”
Pipeline • Future paths• Non-euclidean computability & Geometric Operators #2
Speeding up Convolutional Neural Networks
By Exploiting the Sparsity of Rectifier Units
Shaohuai Shi, Xiaowen Chu
(Submitted on 25 Apr 2017 (v1), last revised 15 May 2017 (this version, v2))
https://arxiv.org/abs/1704.07724
Rectifier neuron units (ReLUs) have been widely used in deep
convolutional networks. An ReLU converts negative values to zeros,
and does not change positive values, which leads to a high sparsity of
neurons. In this work, we first examine the sparsity of the outputs of
ReLUs in some popular deep convolutional architectures. And then
we use the sparsity property of ReLUs to accelerate the calculation
of convolution by skipping calculations of zero-valued neurons. The
proposed sparse convolution algorithm achieves some speedup
improvements on CPUs compared to the traditional matrix-matrix
multiplication algorithm for convolution when the sparsity is not less
than 0.9.
We measure the speed of compared algorithms on the Intel CPU: E5-2630v4 at
the core frequency of 2.20GHz with 128 GB memory.
We propose the inverse sparse convolution (ISC) algorithm in three steps: First, we skip all the zero elements of the input data and store the non-zero values in a vector together with their column and row information. Second, the kernel matrix is stored as a column-major matrix, so that for each non-zero input element I_{c,i,j} a contiguous block of memory storing the kernels can be fetched and multiplied by I_{c,i,j} in one pass with AVX or SSE instructions. Third, the temporary results from the second step are transposed to generate the outputs.
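As a toy illustration of the underlying idea of skipping zero-valued (post-ReLU) activations (this is not the ISC implementation described above), a 1×1 convolution in NumPy could be written as:

# Toy NumPy illustration of skipping zero post-ReLU activations in a 1x1 convolution;
# not the paper's ISC algorithm, just the idea of saving multiplies on zeros.
import numpy as np

def sparse_conv_1x1(activations, weights):
    # activations: (C_in, H, W) post-ReLU feature map; weights: (C_out, C_in)
    c_out = weights.shape[0]
    _, h, w = activations.shape
    out = np.zeros((c_out, h, w))
    for c, y, x in zip(*np.nonzero(activations)):   # visit non-zero neurons only
        out[:, y, x] += weights[:, c] * activations[c, y, x]
    return out

With 90% of activations at zero, this loop performs roughly a tenth of the multiplies of the dense product, which is the sparsity regime in which the authors report speedups over dense matrix-matrix multiplication.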
The Power of Sparsity in
Convolutional Neural Networks
Soravit Changpinyo, Mark Sandler, Andrey Zhmoginov
(Submitted on 21 Feb 2017)
https://arxiv.org/abs/1702.06257
We deactivate connections between filters in convolutional
layers in a way that allows us to harvest savings both in run-time
and memory for many network architectures. More specifically,
we generalize 2D convolution to use a channel-wise sparse
connection structure and show that this leads to significantly
better results than the baseline approach for large networks
including VGG and Inception V3.
“For example, our method applied to Inception V3 (Fig. 4) achieves AlexNet-level accuracy with fewer than 400K parameters, and VGG-level accuracy (Fig. 5) with roughly 3.5M parameters. In addition, we show that our method leads to an interesting novel incremental training technique, where we take advantage of sparse (and smaller) models to build a dense network. One interesting open direction is to enable incremental training not to simply densify the network over time, but also to increase the number of channels. This would allow us to grow the network without having to fix its original shape in place.”
Efficient Sparse-Winograd Con
-volutional Neural Networks
Xingyu Liu, Song Han, Huizi Mao, William J. Dally
17 Feb 2017 (modified: 19 Feb 2017)
ICLR 2017 workshop submission
https://openreview.net/forum?id=r1rqJyHKg
Convolutional Neural Networks (CNNs) are
compute intensive which limits their application
on mobile devices. Their energy is dominated by
the number of multiplies needed to perform the
convolutions. Winograd’s minimal filtering
algorithm (Lavin and Gray (2015)) and network
pruning (Han et al. (2015)) reduce the operation
count. Unfortunately, these two methods cannot
be combined—because applying the Winograd
transform fills in the sparsity in both the weights
and the activations.
We propose two modifications to Winograd-based
CNNs to enable these methods to exploit sparsity.
First, we prune the weights in the ”Winograd
domain” (after the transform) to exploit static
weight sparsity.
Second, we move the ReLU operation into the
”Winograd domain” to improve the sparsity of the
transformed activations. On CIFAR-10, our method
reduces the number of multiplications in the VGG-
nagadomi model by 10.2x with no loss of
accuracy.
Pipeline • Future paths• Non-euclidean computability & Geometric Operators #3
SPARCNet: A Hardware Accelerator
for Efficient Deployment of Sparse
Convolutional Networks
Adam Page, Ali Jafari, Colin Shea, Tinoosh Mohsenin
ACM Journal on Emerging Technologies in Computing Systems (JETC) - Special Issue on Hardware
and Algorithms for Learning On-a-chip and Special Issue on Alternative Computing Systems.
Volume 13 Issue 3, May 2017 - Article No. 31.
https://doi.org/10.1145/3005448
The SPARCNet accelerator with different numbers of
processing engines is implemented on a low-power Artix-7
FPGA platform. The FPGA-based accelerator is developed
using a combination of pure HDL written in Verilog and IP
cores developed using Xilinx’s Vivado HLS.
Additionally, the same networks are optimally implemented on a number of
embedded commercial-off-the-shelf platforms including NVIDIAs
CPU+GPU SoCs TK1 and TX1 and Intel Edison. Compared to NVIDIAs TK1
and TX1, the FPGA-based accelerator obtains 11.8× and 7.5× improvements in energy efficiency. In addition to improving efficiency, the accelerator has built-in support for sparsification techniques and the ability to perform in-place rectified linear unit (ReLU) activation, max-pooling, and batch normalization.
Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC
E Nurvitadhi, D Sheffield, J Sim… 2017
Field-Programmable Technology (FPT), 2016 International Conference on
Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?
E Nurvitadhi, G Venkatesh, J Sim, D Marr, R Huang FPGA’17
Hardware accelerator for analytics of sparse data
E Nurvitadhi, A Mishra, Y Wang, G Venkatesh… - Proceedings of the 2016
Sparse Matrix Multiplication on CAM Based Accelerator
L Yavits, R Ginosar - arXiv preprint arXiv:1705.09937, 2017
Cambricon-X: An accelerator for sparse neural networks
S Zhang, Z Du, L Zhang, H Lan, S Liu… - … (MICRO), 2016 49th …, 2016
Accelerator for Sparse Machine Learning
L Yavits, R Ginosar - IEEE Computer Architecture Letters, 2017
A Scalable FPGA-Based Accelerator for High-Throughput MCMC Algorithms
M Hosseini, R Islam, A Kulkarni… - … (FCCM), 2017 IEEE
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
A Parashar, M Rhu, A Mukkara, A Puglielli… - Proceedings of the 44th …, 2017
NullHop: A Flexible Convolutional Neural Network Accelerator Based on
Sparse Representations of Feature Maps
A Aimar, H Mostafa, E Calabrese… - arXiv preprint arXiv: …, 2017
Geometric DNNs • implementation options in practice #1: GVNN
ankurhanda/gvnn
Insights gvnn: Geometric Vision with Neural
Networks
gvnn is primarily intended for self-supervised
learning using low-level vision. It is inspired by the
Spatial Transformer Networks (STN) paper that
appeared in NIPS in 2015 and its open source code
made available by Maxime Oquab. The code is self-contained, i.e. the original implementation of STN by
Maxime is also within the repository.
STs were mainly limited to applying only 2D
transformations to the input. We added a new set of
transformations often needed for manipulating
data in 3D geometric computer vision. These
include the 3D counterparts of what were used in
original STN together with a lot more new
transformations and different M-estimators.
SO3 Layer
Rotations are represented as an so(3) 3-vector, which is turned into a rotation matrix via the exponential map. For a more detailed view of the so(3) representation and the exponential map, see Ethan Eade's Lie-algebra tutorial and Tom Drummond's notes on Lie algebras. The reason for choosing the so(3) representation is mainly its appealing properties when linearising rotations (via Taylor-series expansion) for iterative image alignment with the classic linearise-solve-update rule. The figure below shows how linearisation for SO(3) amounts to fitting a local plane on the sphere.
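For concreteness, the exponential map from an so(3) 3-vector to a rotation matrix is Rodrigues' formula; a small NumPy sketch (independent of the Torch implementation in gvnn) is:

# Rodrigues' formula: exponential map from an so(3) 3-vector to a rotation matrix.
import numpy as np

def so3_exp(omega, eps=1e-8):
    theta = np.linalg.norm(omega)
    K = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])        # skew-symmetric matrix [omega]_x
    if theta < eps:
        return np.eye(3) + K                          # first-order approximation near zero
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

The small-angle branch is also what makes this representation convenient for the linearise-solve-update iterations mentioned above, since exp([ω]×) ≈ I + [ω]× near the identity.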
Optical Flow
Lens Distortion
Projection Layer
 
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
Skin temperature as a proxy for core body temperature (CBT) and circadian phaseSkin temperature as a proxy for core body temperature (CBT) and circadian phase
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
 
Summary of "Precision strength training: The future of strength training with...
Summary of "Precision strength training: The future of strength training with...Summary of "Precision strength training: The future of strength training with...
Summary of "Precision strength training: The future of strength training with...
 
Precision strength training: The future of strength training with data-driven...
Precision strength training: The future of strength training with data-driven...Precision strength training: The future of strength training with data-driven...
Precision strength training: The future of strength training with data-driven...
 
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
Intracerebral Hemorrhage (ICH): Understanding the CT imaging featuresIntracerebral Hemorrhage (ICH): Understanding the CT imaging features
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
 
Hand Pose Tracking for Clinical Applications
Hand Pose Tracking for Clinical ApplicationsHand Pose Tracking for Clinical Applications
Hand Pose Tracking for Clinical Applications
 
Precision Physiotherapy & Sports Training: Part 1
Precision Physiotherapy & Sports Training: Part 1Precision Physiotherapy & Sports Training: Part 1
Precision Physiotherapy & Sports Training: Part 1
 
Multimodal RGB-D+RF-based sensing for human movement analysis
Multimodal RGB-D+RF-based sensing for human movement analysisMultimodal RGB-D+RF-based sensing for human movement analysis
Multimodal RGB-D+RF-based sensing for human movement analysis
 
Creativity as Science: What designers can learn from science and technology
Creativity as Science: What designers can learn from science and technologyCreativity as Science: What designers can learn from science and technology
Creativity as Science: What designers can learn from science and technology
 
Light Treatment Glasses
Light Treatment GlassesLight Treatment Glasses
Light Treatment Glasses
 
Deep Learning for Biomedical Unstructured Time Series
Deep Learning for Biomedical  Unstructured Time SeriesDeep Learning for Biomedical  Unstructured Time Series
Deep Learning for Biomedical Unstructured Time Series
 
Hyperspectral Retinal Imaging
Hyperspectral Retinal ImagingHyperspectral Retinal Imaging
Hyperspectral Retinal Imaging
 
Instrumentation for in vivo intravital microscopy
Instrumentation for in vivo intravital microscopyInstrumentation for in vivo intravital microscopy
Instrumentation for in vivo intravital microscopy
 
Future of Retinal Diagnostics
Future of Retinal DiagnosticsFuture of Retinal Diagnostics
Future of Retinal Diagnostics
 
OCT Monte Carlo & Deep Learning
OCT Monte Carlo & Deep LearningOCT Monte Carlo & Deep Learning
OCT Monte Carlo & Deep Learning
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Deep Learning for Structure-from-Motion (SfM)

• 6. SfM • Structure from Motion Basics recap • pose estimates

Random forests versus Neural Networks — What's best for camera localization?
Daniela Massiceti; Alexander Krull; Eric Brachmann; Carsten Rother; Philip H.S. Torr
Robotics and Automation (ICRA), 2017 IEEE International Conference on; https://doi.org/10.1109/ICRA.2017.7989598
“To summarize, our best method, a ForestNet with a robust average, which has an equivalent fast and lightweight RF, improves over the state-of-the-art for camera localization on the 7-Scenes dataset. While this work focuses on scene coordinate regression for camera localization, our innovations may also be applied to other continuous regression tasks.”

Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network
Zakaria Laskar, Iaroslav Melekhov, Surya Kalia, Juho Kannala
https://arxiv.org/abs/1707.09733
“The neural network is trained for relative pose estimation in an end-to-end manner using training image pairs. In contrast to previous work, our approach does not require scene-specific training of the network, which improves scalability, and it can also be applied to scenes which are not available during the training of the network.”

DSAC - Differentiable RANSAC for Camera Localization
Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, Carsten Rother
https://arxiv.org/abs/1611.05705
“We call this approach DSAC, the differentiable counterpart of RANSAC. We apply DSAC to the problem of camera localization, where deep learning has so far failed to improve on traditional approaches. We demonstrate that by directly minimizing the expected loss of the output camera poses, robustly estimated by RANSAC, we achieve an increase in accuracy. In the future, any deep learning pipeline can use DSAC as a robust optimization component.”

Deep 6-DOF Tracking
Mathieu Garon, Jean-François Lalonde
https://arxiv.org/abs/1703.09771
“We present a temporal 6-DOF tracking method which leverages deep learning to achieve state-of-the-art performance on challenging datasets of real world capture. Our method is both more accurate and more robust to occlusions than the existing best performing approaches while maintaining real-time performance. To assess its efficacy, we evaluate our approach on several challenging RGBD sequences of real objects in a variety of conditions. Notably, we systematically evaluate robustness to occlusions through a series of sequences where the object to be tracked is increasingly occluded. Finally, our approach is purely data-driven and does not require any hand-designed features: robust tracking is automatically learned from data.”
• 8. SfM-NeT • Abstract

SfM-Net: Learning of Structure and Motion from Video
Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki
Google Research; Inria, Grenoble, France; Carnegie Mellon University
(Submitted on 25 Apr 2017) arXiv:1704.07804 [cs.CV] | https://arxiv.org/abs/1704.07804

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates. The model can be trained with various degrees of supervision: 1) self-supervised by the re-projection photometric error (completely unsupervised), 2) supervised by ego-motion (camera motion), or 3) supervised by depth (e.g., as provided by RGBD sensors). SfM-Net extracts meaningful depth estimates and successfully estimates frame-to-frame camera rotations and translations. It often successfully segments the moving objects in the scene, even though such supervision is never provided.

Figure: Given a pair of frames as input, the model decomposes frame-to-frame pixel motion into 3D scene depth, 3D camera rotation and translation, a set of motion masks and corresponding 3D rigid rotations and translations. It backprojects the resulting 3D scene flow into 2D optical flow and warps accordingly to match pixels from one frame to the next. Forward-backward consistency checks constrain the estimated depth.
• 9. SfM-NeT • Inspiration

SfM-Net is inspired by works that impose geometric constraints on optical flow, exploiting rigidity of the visual scene, such as early low-parametric optical flow methods [e.g. Zelnik-Manor and Irani (2000)] or the so-called direct methods for visual SLAM (Simultaneous Localization and Mapping) that perform dense pixel matching from frame to frame while estimating a camera trajectory and the depth of the pixels in the scene [e.g. Schöps et al. (2014) and Engel et al. (2014)].

In contrast to those, instead of optimizing directly over optical flow vectors, 3D point coordinates or camera rotation and translation, our model optimizes over neural network weights that, given a pair of frames, produce such 3D structure and motion. In this way, our method learns to estimate structure and motion, and can in principle improve as it processes more videos, in contrast to non-learning-based alternatives. It can thus be made robust to lack of texture, degenerate camera motion trajectories or dynamic objects (our model explicitly accounts for those), by providing appropriate supervision.

Our work is also inspired by and builds upon recent works on learning geometrically interpretable optical flow fields for point cloud prediction in time [Byravan and Fox (2016)] and backpropagating through camera projection for 3D human pose estimation [Wu et al. (2016)] or single-view depth estimation [Zhou et al. (2017), https://github.com/tinghuiz/SfMLearner]: “The training data to our system consists solely of unlabeled image sequences capturing scene appearance from different viewpoints, where the poses of the images are not provided. Our training procedure produces two models that operate independently, one for single-view depth prediction, and one for multiview camera pose estimation.” https://arxiv.org/abs/1704.07813

Figure: Overview of the Large Scale Direct Monocular SLAM (LSD-SLAM) algorithm, http://www.doc.ic.ac.uk/~ab9515/lsdslam.html
• 10. SfM-NeT • related Work

Differentiable warping [Jaderberg et al. (2015)] has been used to learn end-to-end unsupervised optical flow [Yu et al. (2016)], disparity flow in a stereo rig [Godard et al. (2016)] and video prediction [Patraucean et al. (2015)]. The closest previous works to ours are SE3-Nets [Byravan and Fox (2016)], 3D image interpreter [Wu et al. (2016)], and Garg et al.'s (2016) depth CNN.

SE3-Nets [Byravan and Fox (2016)] use an actuation force from a robot and an input point cloud to forecast a set of 3D rigid object motions (rotations and translations) and corresponding pixel motion assignment masks under a static camera assumption. Our work uses a similar representation of pixel motion masks and 3D motions to capture the dynamic objects in the scene. However, our work differs in that:
1) We predict depth and camera motion while SE3-Nets operate on given point clouds and assume no camera motion,
2) SE3-Nets are supervised with pre-recorded 3D optical flow, while this work admits diverse and much weaker supervision, as well as complete lack of supervision,
3) SE3-Nets consider one frame and an action as input to predict the future motion, while our model uses pairs of frames as input to estimate the intra-frame motion, and
4) SE3-Nets are applied to toy or lab-like setups whereas we show results on real videos.

https://doi.org/10.1109/ICRA.2017.7989023
Figure: SE3-NET architecture. Input is a 3D point cloud and an n-dimensional action vector (bold-italics), both of which are encoded and concatenated to a joint feature vector (CAT). The decoder uses this encoding to predict "k" object masks M and "k" SE(3) transforms which are used to transform the input cloud via the "Transform layer" to generate the output. Mask weights are sharpened and normalized before use for prediction. Conv = Convolution, FC = Fully Connected, Deconv = Deconvolution, CAT = Concatenation.
• 12. SfM-NeT • SfM-Net architecture

Figure: SfM-Net architecture. For each pair of consecutive frames I_t, I_t+1, a conv/deconv sub-network predicts depth d_t while another predicts a set of K segmentation masks m_t. The coarsest feature maps of the motion-mask encoder are further decoded through fully connected layers towards 3D rotations and translations for the camera and the K segmentations. The predicted depth is converted into a per-frame point cloud using estimated or known camera intrinsics. Then, it is transformed according to the predicted 3D scene flow, as composed by the 3D camera motion and independent 3D mask motions. Transformed 3D depth is projected back to the 2D next frame, and thus provides corresponding 2D optical flow fields. Differentiable backward warping maps frame I_t+1 to I_t, and gradients are computed based on pixel errors. Forward-backward constraints are imposed by repeating this process for the inverted frame pair I_t+1, I_t and constraining the depths d_t and d_t+1 to be consistent through the estimated scene motion.
• 13. SfM-NeT • SfM-Net architecture • structure Network

We compute per-frame depth using a standard conv/deconv subnetwork operating on a single frame (the structure network on the previous slide). We use a ReLU activation at the final layer, since depth values are non-negative. Given depth d_t, we obtain the 3D point cloud corresponding to the pixels in the scene using a pinhole camera model. Let (x_t^i, y_t^i) be the column and row positions of the i-th pixel in frame I_t and let (c_x, c_y, f) be the camera intrinsics; then

    X_t^i = d_t^i * ( (x_t^i - c_x) / f,  (y_t^i - c_y) / f,  1 )^T     (1)

where d_t^i denotes the depth value of the i-th pixel. We use the camera intrinsics when available and revert to default values of (0.5, 0.5, 1.0) otherwise. Therefore, the predicted depth will only be correct up to a scalar multiplier.

Note: the conv/deconv subnetwork is essentially a U-Net that first downsamples the dense (coarse) map through max pooling [ENCODER], followed by upsampling [DECODER]. Wojna et al. (2017)

Sources: https://youtu.be/vZELygPzV0M?t=51m47s by Cyrill Stachniss; Xu et al. (2015); https://machinelearningonline.blog/ by narasimman

Figure: Activation function variants for the CIFAR-10/100 dataset. Godin et al. (2017): test errors of the ResNet-110 architecture using Dual ReLUs and Dual Exponential Linear Units [DELUs, Clevert et al. (2016)], compared to the initial version with ReLUs and the extension with Concatenated ReLU [CReLU, Shang et al. (2017)] on CIFAR-10 and CIFAR-100, using an equal parameter budget.

Figure: Comparison of activation functions: the rectified linear unit (ReLU), the leaky ReLU (LReLU, α = 0.1), the shifted ReLUs (SReLUs), and the exponential linear unit (ELU, α = 1.0). Clevert et al. (2016)
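To make the backprojection in Eq. (1) concrete, below is a minimal NumPy sketch (not the authors' code): pixel coordinates are assumed to be normalized to [0, 1] so that the default intrinsics (0.5, 0.5, 1.0) apply, and the function name and grid construction are illustrative.

```python
import numpy as np

def depth_to_point_cloud(depth, cx=0.5, cy=0.5, f=1.0):
    """Backproject a depth map into a camera-frame point cloud, following Eq. (1).

    depth : (H, W) array of per-pixel depth d_t^i.
    (cx, cy, f) : pinhole intrinsics; pixel coordinates are assumed to be
    normalized to [0, 1], matching the paper's default (0.5, 0.5, 1.0).
    Returns an (H, W, 3) array of 3D points X_t^i.
    """
    h, w = depth.shape
    # Normalized pixel grid (x along columns, y along rows).
    x, y = np.meshgrid(np.linspace(0.0, 1.0, w), np.linspace(0.0, 1.0, h))
    X = depth * (x - cx) / f
    Y = depth * (y - cy) / f
    Z = depth
    return np.stack([X, Y, Z], axis=-1)

# Example: a fronto-parallel plane at depth 2 backprojects to Z == 2 everywhere.
cloud = depth_to_point_cloud(np.full((4, 4), 2.0))
assert np.allclose(cloud[..., 2], 2.0)
```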
• 14. SfM-NeT • SfM-Net architecture • Scene & Object Motion Network

We compute the motion of the camera and of independently moving objects in the scene using a conv/deconv subnetwork that operates on a pair of images (the motion network of the architecture). We depth-concatenate the pair of frames and use a series of convolutional layers to produce an embedding layer. We use two fully-connected layers to predict the motion of the camera between the frames and a predefined number K of rigid body motions that explain moving objects in the scene. The fully-connected layers are used to predict the translation parameters t_c, the pivot points of the camera rotation p_c, and sin α, sin β, sin γ. These last three parameters are constrained to be in the interval [−1, 1] by using a ReLU activation and the minimum function.

We use similar representations as for camera motion and predict object motion parameters using fully-connected layers on top of the same embedding E. While camera motion is a global transformation applied to all the pixels in the scene, the object motion transforms are weighted by the predicted membership probability of each pixel to each rigid motion. These masks are produced by feeding the embedding layer through a deconvolutional tower. We use sigmoid activations at the last layer instead of softmax in order to allow each pixel to belong to any number of rigid body motions. When a pixel has zero activation across all K maps it is assigned to the static background, whose motion is a function of the global camera motion alone. We allow a pixel to belong to multiple rigid body transforms in order to capture composition of motions, e.g., through kinematic chains, such as articulated bodies. Learning the required number of motions for a sequence is an interesting open problem. We found that we could fix K = 3 for all experiments presented here. Note that our method can learn to ignore unnecessary object motions in a sequence by assigning no pixels to the corresponding mask.

3D transformations reference: http://www.math.tau.ac.il/~dcor/Graphics/cg-slides/trans3d.pdf
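As an illustration of the fully-connected motion heads described above, here is a hedged PyTorch sketch; the layer widths, the two-FC-layer trunk and the use of clamp to keep the predicted sines in [−1, 1] are assumptions for readability, not the authors' exact implementation (the paper describes a ReLU/minimum construction for the same range).

```python
import torch
import torch.nn as nn

class MotionHead(nn.Module):
    """Sketch of the camera/object motion predictor described on this slide.

    Takes the flattened embedding E from the motion encoder and predicts:
    camera translation t_c, camera rotation pivot p_c and camera sin-angles,
    plus K object translations, pivots and sin-angles. Layer widths and the
    number of FC layers are illustrative placeholders.
    """
    def __init__(self, embed_dim=512, hidden=512, num_objects=3):
        super().__init__()
        self.k = num_objects
        self.fc = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden), nn.ReLU())
        # 3 translation + 3 pivot + 3 sin-angle parameters per rigid motion.
        self.cam = nn.Linear(hidden, 9)
        self.obj = nn.Linear(hidden, 9 * num_objects)

    @staticmethod
    def _split(params):
        t, pivot, sines = params[..., 0:3], params[..., 3:6], params[..., 6:9]
        # Keep the predicted sines of the Euler angles inside [-1, 1]; clamp is
        # a simple stand-in for the paper's ReLU/minimum construction.
        return t, pivot, torch.clamp(sines, -1.0, 1.0)

    def forward(self, embedding):
        h = self.fc(embedding)
        cam_t, cam_pivot, cam_sines = self._split(self.cam(h))
        obj = self.obj(h).view(-1, self.k, 9)
        obj_t, obj_pivot, obj_sines = self._split(obj)
        return (cam_t, cam_pivot, cam_sines), (obj_t, obj_pivot, obj_sines)
```

The K motion masks themselves would come from the deconvolutional tower with a sigmoid at the last layer, as described above, and are not part of this sketch.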
• 15. SfM-NeT • SfM-Net architecture • optical Flow

We obtain optical flow by first transforming the point cloud obtained in Equation (1) using the camera and object motion rigid body transformations, followed by projecting the 3D points onto the image plane using the camera intrinsics. In the following, we drop the pixel superscript i from the 3D coordinates, since it is clear we are referring to the motion transformation of the i-th pixel of the t-th frame. We first apply the object transformations, then the camera transformation, before projecting back to 2D (see the sketch below).
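The projection step can be sketched as the inverse of the backprojection above; this NumPy snippet assumes the same normalized-intrinsics convention and reads off the flow as projected position minus original pixel position. The function name and the epsilon guard are illustrative.

```python
import numpy as np

def project_to_flow(points_t1, cx=0.5, cy=0.5, f=1.0, eps=1e-8):
    """Project a transformed point cloud back to the image and derive flow.

    points_t1 : (H, W, 3) 3D points after applying the predicted camera and
    object motions to the frame-t point cloud. Returns dense optical flow
    (H, W, 2) from frame t to t+1, i.e. projected position minus the original
    (normalized) pixel position.
    """
    h, w, _ = points_t1.shape
    X, Y, Z = points_t1[..., 0], points_t1[..., 1], points_t1[..., 2]
    x_proj = f * X / (Z + eps) + cx
    y_proj = f * Y / (Z + eps) + cy
    x0, y0 = np.meshgrid(np.linspace(0.0, 1.0, w), np.linspace(0.0, 1.0, h))
    return np.stack([x_proj - x0, y_proj - y0], axis=-1)

# Sanity check: an untransformed fronto-parallel plane projects back onto
# itself, so the derived flow is zero (same convention as the backprojection
# sketch on the structure-network slide).
h, w, d = 4, 4, 2.0
x0, y0 = np.meshgrid(np.linspace(0.0, 1.0, w), np.linspace(0.0, 1.0, h))
cloud = np.stack([d * (x0 - 0.5), d * (y0 - 0.5), np.full((h, w), d)], axis=-1)
assert np.allclose(project_to_flow(cloud), 0.0, atol=1e-6)
```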
• 16. Upgrade • SfM-NeT • Upgrade to architecture #1

The Devil is in the Decoder
Zbigniew Wojna, Vittorio Ferrari, Sergio Guadarrama, Nathan Silberman, Liang-Chieh Chen, Alireza Fathi, Jasper Uijlings

While encoders have been studied rigorously, relatively few studies address the decoder side. Therefore this paper presents an extensive comparison of a variety of decoders for a variety of pixel-wise prediction tasks. Our contributions are: 1) Decoders matter: we observe significant variance in results between different types of decoders on various problems. 2) We introduce a novel decoder: bilinear additive upsampling. 3) We introduce new residual-like connections for decoders. 4) We identify two decoder types which give a consistently high performance.

SfM-Net looks more like a proof-of-concept network that will probably be upgraded, perhaps by the authors themselves, with skip connections / residual-like connections for better performance.

Wojna et al. (2017) found little advantage in depth prediction when using skip layers: “For depth prediction, all layers except bilinear upsampling have good performance, whereas adding skip layers to these results in equal performance except for depth-to-space, where it slightly lowers performance.” When using residual connections, performance consistently improves: “For the majority of combinations, we see that adding residual connections is beneficial. Interestingly, we now can identify two upsampling methods which have consistently good results on all problems presented in this paper, both of which have residual connections: (1) transposed convolutions + residual connections, and (2) bilinear additive upsampling + residual connections (both with and without skip connections).”

Figure: Main results comparing a variety of decoders on five machine vision problems. The upper part shows decoders without residual-like connections; the bottom shows decoders with residual-like connections. The colors represent relative performance: red means top performance, yellow means reasonable performance, blue means poor performance.
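As we read Wojna et al., bilinear additive upsampling upsamples spatially and then sums groups of consecutive channels so the feature volume stays roughly constant; the PyTorch sketch below reflects that reading, with the channel-reduction factor chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def bilinear_additive_upsample(x, scale=2, channel_reduction=4):
    """Bilinear additive upsampling (sketch of the decoder from Wojna et al.).

    Upsamples spatially with bilinear interpolation, then sums every group of
    `channel_reduction` consecutive channels. A residual connection can then
    be added on top, as in the paper's best-performing decoder variants.
    """
    n, c, h, w = x.shape
    assert c % channel_reduction == 0
    x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    x = x.view(n, c // channel_reduction, channel_reduction, h * scale, w * scale)
    return x.sum(dim=2)

# Example: 16 channels at 8x8 become 4 channels at 16x16.
out = bilinear_additive_upsample(torch.randn(1, 16, 8, 8))
assert out.shape == (1, 4, 16, 16)
```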
• 17. Upgrade • SfM-NeT • Upgrade to architecture #2

Learning a Multi-View Stereo Machine
Abhishek Kar, Christian Häne, Jitendra Malik. UC Berkeley
https://people.eecs.berkeley.edu/~akar/deepmvs.pdf

In this work, we present Learnt Stereo Machines (LSM) - a system which is able to reconstruct object geometry as voxel occupancy grids or per-view depth maps from a small number of views, including just a single image. We design our system inspired by classical approaches while learning each component from data embedded in an end-to-end system. LSMs have built-in projective geometry, enabling reasoning in metric 3D space and effectively exploiting the geometric structure of the multi-view stereopsis (MVS) problem. Compared to classical approaches, which are designed to exploit a specific cue such as silhouettes or photo-consistency, our system learns to exploit the cues that are relevant to the particular instance while also using priors about shape to predict geometry for unseen regions. Compared to recent learning based reconstruction approaches, our system is able to better use camera pose information, leading to significantly large improvements while adding more views. Finally, we show successful generalization to unseen object categories, demonstrating that our network goes beyond semantic cues and strongly uses geometric information for unified single and multi-view 3D reconstruction.

Figure: Overview of a Learnt Stereo Machine (LSM). It takes as input one or more views and camera poses. The images are processed through a feature encoder and are then unprojected into the 3D world frame using a differentiable unprojection operation. LSMs can produce two kinds of outputs – voxel occupancy grids (Voxel LSM) decoded from the grid G_o, or per-view depth maps (Depth LSM) decoded after a projection operation.

Figure: Qualitative results for per-view depth map prediction on ShapeNet. We show the depth maps predicted by Depth-LSM (visualized with shading from a shifted viewpoint) and the point cloud obtained by unprojecting them into world coordinates.
• 18. Upgrade • SfM-NeT • ReLU alternatives

A summary of recommendations: use the ELU non-linearity without batch normalization, or ReLU together with it (see the sketch below for the two block variants).
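A minimal PyTorch sketch of the two conv-block variants implied by that recommendation; the 3x3 kernel and channel arguments are illustrative choices, not taken from SfM-Net.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, use_batchnorm):
    """Two variants of a convolution block following the recommendation above:
    ReLU together with batch normalization, or ELU without normalization."""
    if use_batchnorm:
        return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
                             nn.BatchNorm2d(out_ch),
                             nn.ReLU(inplace=True))
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.ELU(inplace=True))
```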
• 19. Upgrade • SfM-NeT • Normalization techniques

Batch normalization, what was this? TL;DR: to reduce covariate shift (explained by Alex Smola). In SfM-Net, “Batch normalization is applied to all convolutional layer outputs.”

[D] Weight normalization vs. layer normalization, has anyone done benchmarks? (self.MachineLearning, submitted by carlthome): “Batch normalization is the norm (pun intended) but for RNNs or small batch sizes layer normalization and weight normalization look like attractive alternatives. In the NIPS submission for weight normalization, they have the layer normalization paper listed as a reference (although never cited in the text), but it has since been removed. This got me thinking about pros/cons of the respective methods. Has anyone done benchmarks comparing weight normalization to layer normalization (particularly for ResNets or RNNs)? PS: Recurrent batch normalization is memory intensive and should be avoided IMO, but that too would be interesting to benchmark.”

Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
Sergey Ioffe, Google Inc. (Submitted on 10 Feb 2017 (v1), last revised 30 Mar 2017 (this version, v2))
https://arxiv.org/abs/1702.03275 | https://github.com/titu1994/BatchRenormalization
Batch Renorm is useful especially with smaller batch sizes.

See also: Batch Normalization—What the hey? by Karl N., Jun 8, 2016.

Figure: Validation accuracy for models trained with either batchnorm or Batch Renorm, where normalization is performed for sets of 4 examples (but with the gradients aggregated over all 50×32 examples processed by the 50 workers). Batch Renorm allows the model to train faster and achieve a higher accuracy, although normalizing sets of 32 examples performs better.
• 20. Upgrade • SfM-NeT • pooling vs. strides vs. dilation

Christian Perone, R&D Machine Learning Engineer and Software Engineer at HP
https://www.slideshare.net/perone/deep-learning-convolutional-neural-networks

Typical architectures use pooling after each convolution layer. In practice (left), 16 convolutions need to be computed with pooling, and then the max needs to be computed for each 2x2 window. With a 2x2 stride and no pooling, one would compute only 4 convolutions for the same downsampling. Springenberg et al. (2015) suggest that the performance decrease from this simplification may be negligible in practice: “We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks.”

Dilated convolutions (algorithme à trous, in practice the stationary wavelet transform) do not reduce dimensionality but rather aggregate multi-scale contextual information; see for example Yu et al. (2015).

Generalizing Pooling Functions in CNNs: Mixed, Gated, and Tree
Chen-Yu Lee; Patrick Gallagher; Zhuowen Tu
IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: PP, Issue: 99), https://doi.org/10.1109/TPAMI.2017.2703082

In SfM-Net, “Each convolutional layer consists of a series of 3 × 3 convolutional layers alternating between stride 1 and stride 2.”
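The trade-off can be seen directly from the output shapes; the PyTorch snippet below contrasts conv + max pooling, a strided convolution and a dilated convolution on the same input (channel counts and kernel sizes are arbitrary illustration values).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# (a) Convolution followed by 2x2 max pooling: full-resolution convolution,
#     then downsample by taking the max over each 2x2 window.
conv_pool = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.MaxPool2d(2))

# (b) Strided convolution: the same downsampling with roughly 4x fewer
#     convolution evaluations, as discussed above.
conv_stride = nn.Conv2d(16, 32, 3, stride=2, padding=1)

# (c) Dilated (a trous) convolution: no downsampling, but a larger receptive
#     field that aggregates multi-scale context.
conv_dilated = nn.Conv2d(16, 32, 3, padding=2, dilation=2)

assert conv_pool(x).shape == conv_stride(x).shape == (1, 32, 16, 16)
assert conv_dilated(x).shape == (1, 32, 32, 32)
```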
• 21. Upgrade • SfM-NeT • Enforcing sharp boundaries #1

The downsampling-upsampling combination can smooth out or remove thin structures and sharp boundaries, and several papers have been written to address this. “Unlike skip connections and previous encoder-decoder methods, we first learn a coarse feature map after the encoder stage in a feedforward pass, and then refine this feature map in a top-down strategy during the decoder stage utilizing features at successively lower layers. Therefore, the deconvolutional process is conducted stepwise, which is guided by Deeply-Supervision Net providing the integrated direct supervision.” https://arxiv.org/abs/1705.04456
• 22. Upgrade • SfM-NeT • Enforcing sharp boundaries #2

Segmentation-Aware Convolutional Networks Using Local Attention Masks
Adam W. Harley, Konstantinos G. Derpanis, Iasonas Kokkinos (Submitted on 15 Aug 2017)
https://arxiv.org/abs/1708.04607 | http://cs.cmu.edu/~aharley/segaware

Segmentation-aware convolution filters are invariant to backgrounds. We achieve this in three steps: (i) compute segmentation cues for each pixel (i.e., “embeddings”), (ii) create a foreground mask for each patch, and (iii) combine the masks with convolution, so that the filters only process the local foreground in each image patch.

Figure: Segmentation-aware bilateral filtering. Given an input image (left), a CNN typically produces a smooth prediction map (middle top). Using learned per-pixel embeddings (middle bottom), we adaptively smooth the FC8 feature map with our segmentation-aware bilateral filter (right).

Figure: General schematic for our segmentation-aware CNN. The first part is an embedding network, which is guided to compute embedding-like representations at multiple scales, and constructs a final embedding as a weighted sum of the intermediate embeddings. The loss on these layers operates on pairwise distances computed from the embeddings. These same distances are then used to construct local attention masks that intercept the convolutions in a task-specific network. The final objective backpropagates through both networks, fine-tuning the embeddings for the task.

Figure: Visualizations of optical flow produced by FlowNet and its segmentation-aware variant on the FlyingChairs test set: segmentation-awareness yields much sharper results than the baseline.
• 24. SfM-NeT • Supervision

SfM-Net inverts the image formation and extracts depth, camera and object motions that gave rise to the observed temporal differences, similar to previous SfM works [1, 6]. Such inverse problems are ill-posed as many solutions of depth, camera and object motion can give rise to the same observed frame-to-frame pixel values. A learning-based solution, as opposed to direct optimization, has the advantage of learning to handle such ambiguities through partial supervision of their weights or appropriate pre-training, or simply because the same coefficients (network weights) need to explain a large abundance of video data consistently. We detail the various supervision modes below and explore a subset of them in the experimental section.

Related reading:
- Kyong Hwan Jin; Michael T. McCann; Emmanuel Froustey; Michael Unser. IEEE Transactions on Image Processing (Volume: 26, Issue: 9, Sept. 2017) | https://doi.org/10.1109/TIP.2017.2713099
- ME5286 – Lecture 2 (Theory): Image Formation and Cameras, by Saad J Bedros, University of Minnesota, http://www.me.umn.edu/courses/me5286/
- Adversarial Inversion: Inverse Graphics with Adversarial Priors. Hsiao-Yu Fish Tung, Adam Harley, William Seto, Katerina Fragkiadaki (Submitted on 31 May 2017) https://arxiv.org/abs/1705.11166
- Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency. Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, Jitendra Malik https://arxiv.org/abs/1704.06254
- Toward Geometric Deep SLAM (MagicPoint and MagicWarp, from Magic Leap). Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich (Submitted on 24 Jul 2017) https://arxiv.org/abs/1707.07410
- Geometric deep learning: going beyond Euclidean data. Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, Pierre Vandergheynst (last revised 3 May 2017) https://arxiv.org/abs/1611.08097
• 25. SfM-NeT • Supervision • Self-supervision

Given unconstrained video, without accompanying ground-truth structure or motion information, our model is trained to minimize the photometric error between the first frame and the second frame warped towards the first according to the predicted motion field, based on well-known brightness constancy assumptions (assuming Lambertian surfaces). We use the differentiable image warping proposed in the spatial transformer work (Jaderberg et al., 2015, https://arxiv.org/abs/1506.02025) and compute the color constancy loss in a fully differentiable manner.

From Yu et al. (2016), https://arxiv.org/abs/1608.05842: “In particular, we use a loss function that combines a data term that measures photometric constancy over time with a spatial term that models the expected variation of flow across the image. The photometric loss measures the difference between the first input image and the (inverse) warped subsequent image based on the predicted optical flow by the network. The smoothness loss measures the difference between spatially neighbouring flow predictions. Together, these two losses form a proxy for losses based on the groundtruth flow.”

See also: https://www.slideshare.net/yuhuang/optic-flow-estimation-with-deep-learning

Note that human perception is not necessarily the same as a camera system, and real scenes are rarely purely Lambertian:
- Light diffuseness metric, Part 1: Theory. L. Xia, S.C. Pont, I. Heynderickx. Lighting Research & Technology, Vol 49, Issue 4, 2017, http://doi.org/10.1177/1477153516631391
- Thomas Y. Lee; David H. Brainard. Journal of Vision, January 2014, Vol. 14, 24. doi: 10.1167/14.1.24
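A minimal sketch of such a photometric loss in PyTorch, assuming a dense 2D flow field in pixel units and bilinear warping via grid_sample; the paper's full objective also includes the smoothness and mask terms discussed on the neighbouring slides, which are omitted here.

```python
import torch
import torch.nn.functional as F

def photometric_loss(frame_t, frame_t1, flow):
    """Brightness-constancy (photometric) loss with differentiable warping.

    frame_t, frame_t1 : (N, C, H, W) consecutive frames.
    flow : (N, 2, H, W) predicted forward flow in pixels, ordered (x, y).
    Warps frame_t1 back towards frame_t with bilinear sampling (a spatial-
    transformer style warp) and returns the mean L1 photometric error.
    """
    n, _, h, w = frame_t.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=flow.dtype),
                            torch.arange(w, dtype=flow.dtype), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # sampling x per pixel
    grid_y = ys.unsqueeze(0) + flow[:, 1]          # sampling y per pixel
    # grid_sample expects coordinates normalized to [-1, 1], ordered (x, y).
    grid = torch.stack([2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0], dim=-1)
    warped = F.grid_sample(frame_t1, grid, mode="bilinear",
                           padding_mode="border", align_corners=True)
    return (warped - frame_t).abs().mean()
```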
• 26. SfM-NeT • Supervision • Spatial smoothness priors

When our network is self-supervised, we add robust spatial smoothness penalties on the optical flow field, the depth, and the inferred motion maps, by penalizing the L1 norm of the gradients across adjacent pixels, as usually done in previous works [Kong and Black (2015)]. For depth prediction, we penalize the norm of second-order gradients in order to encourage not constant but rather smoothly changing depth values.

See also:
- Differences between L1 and L2 as loss function and regularization: http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/
- Michael Zibulevsky and Michael Elad, IEEE Signal Processing Magazine, May 2010, DOI: 10.1109/MSP.2010.936023

Depth map inpainting under a second-order smoothness prior
Daniel Herrera C.†, Juho Kannala†, Lubor Ladický‡, and Janne Heikkilä†
†Center for Machine Vision Research, University of Oulu, Finland; ‡Visual Geometry Group, University of Oxford, UK
https://doi.org/10.1007/978-3-642-38886-6_52
Levin (2004)'s approach uses a first-order prior, i.e. it favors constant depth, whereas our second-order prior favors constant depth derivative. This is clearly seen in the results of Figure 1. Levin's approach correctly separates the surfaces but fills the missing pixels (across the surface boundary) with a constant depth, while our method provides a smooth result that matches the ground truth shape.

Figure: Difference between 1st-order and 2nd-order gradients for a depth image (zoomed portion below) as computed with Matlab's imgradient.
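The two penalties can be written in a few lines; this PyTorch sketch assumes (N, C, H, W) tensors and plain finite differences, with first-order L1 smoothness for flow and motion maps and second-order L1 smoothness for depth as described above (no edge-aware weighting is included).

```python
import torch

def first_order_smoothness(field):
    """L1 penalty on gradients across adjacent pixels (flow, motion masks)."""
    dx = (field[..., :, 1:] - field[..., :, :-1]).abs()
    dy = (field[..., 1:, :] - field[..., :-1, :]).abs()
    return dx.mean() + dy.mean()

def second_order_smoothness(depth):
    """L1 penalty on second-order gradients: favours smoothly changing
    (not necessarily constant) depth, as described above."""
    dxx = (depth[..., :, 2:] - 2 * depth[..., :, 1:-1] + depth[..., :, :-2]).abs()
    dyy = (depth[..., 2:, :] - 2 * depth[..., 1:-1, :] + depth[..., :-2, :]).abs()
    return dxx.mean() + dyy.mean()

# A linear depth ramp has zero second-order penalty but a non-zero
# first-order one, illustrating the "constant derivative" preference.
ramp = torch.arange(16.0).view(1, 1, 4, 4)
assert second_order_smoothness(ramp) == 0
assert first_order_smoothness(ramp) > 0
```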
• 27. SfM-NeT • Supervision • Forward-backward consistency constraints

We incorporate forward-backward consistency constraints between inferred scene depth in different frames. Composing scene flow forward and backward across consecutive frames allows us to impose such forward-backward consistency cycles across gaps of more than one frame; however, we have not yet seen empirical gain from doing so. In other words, one could “robustify” the network by using more temporal samples, which should improve inlier/outlier separation (a minimal 2D illustration of the cycle idea is sketched below).

A Depth Restoration Occlusionless Temporal Dataset
Daniel Rotman; Guy Gilboa, Electrical Engineering Department, Technion - Israel Institute of Technology
3D Vision (3DV), 2016 Fourth International Conference on, 16-18 Nov. 2016, https://doi.org/10.1109/3DV.2016.26
“Utilizing multiple frames, we create a number of possibilities for an initial degraded depth map, which allows us to arrive at a more educated decision when refining depth images. Evaluating this method with our dataset shows significant benefits, particularly for overcoming real sensor-noise artifacts.” The dataset is freely downloadable at: http://visl.technion.ac.il/databases/drot2016/

Frame rate reduction of depth cameras by RGB-based depth prediction
Daniel Rotman; Omer Cohen; Guy Gilboa, Electrical Engineering Department, Technion - Israel Institute of Technology
Science of Electrical Engineering (ICSEE), IEEE International Conference on the, https://doi.org/10.1109/ICSEE.2016.7806153
“Depth cameras are becoming widely used for facilitating fast and robust natural user interaction. But measuring depth can be high in power consumption mainly due to the active infrared illumination involved in the acquisition process, for both structured-light and time-of-flight technologies. It becomes a critical issue when the sensors are mounted on hand-held (mobile) devices, where power usage is of the essence. A method is proposed to reduce the depth acquisition frame rate, possibly by factors of 2 or 3, thus saving considerable power. The compensation is done by calculating reliable depth estimations using a coupled color (RGB) camera working at full frame rate. These predictions, which are shown to perform outstandingly, create for the end user or application the perception of a depth sensor working at full frame rate. Quality measures based on skeleton extraction and depth inaccuracy are used to calculate the deviation from the ground truth.”
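A hedged 2D illustration of the cycle idea, not the paper's exact depth-consistency formulation: compose the forward flow with the backward flow sampled at the forward-warped locations and penalize the residual, which should vanish for consistently explained (inlier) pixels.

```python
import torch
import torch.nn.functional as F

def forward_backward_flow_consistency(flow_fwd, flow_bwd):
    """Illustrative forward-backward consistency penalty on 2D flow.

    flow_fwd, flow_bwd : (N, 2, H, W) flows t->t+1 and t+1->t in pixels.
    Samples the backward flow at the positions reached by the forward flow
    and penalizes the round-trip residual. SfM-Net applies the analogous
    cycle constraint to depth via the composed 3D scene flow; this 2D
    version only demonstrates the mechanism.
    """
    n, _, h, w = flow_fwd.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=flow_fwd.dtype),
                            torch.arange(w, dtype=flow_fwd.dtype), indexing="ij")
    x1 = xs + flow_fwd[:, 0]
    y1 = ys + flow_fwd[:, 1]
    grid = torch.stack([2.0 * x1 / (w - 1) - 1.0,
                        2.0 * y1 / (h - 1) - 1.0], dim=-1)
    bwd_at_target = F.grid_sample(flow_bwd, grid, mode="bilinear",
                                  padding_mode="border", align_corners=True)
    cycle = flow_fwd + bwd_at_target      # ~0 for consistent pixels
    return cycle.abs().mean()
```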
• 28. SfM-NeT • Supervision • Supervising depth

If depth is available on parts of the input image, such as with video sequences captured by a Kinect sensor, we can use depth supervision in the form of robust depth regression (a masked robust regression loss is sketched below).

Now, in theory, we can generate targets automatically for an SfM pipeline designed to operate:
1) without a depth sensor, such as a traditional smartphone - target with a Kinect or a high-quality laser scanner
2) on a Google Tango smartphone with “low-quality depth sensing” - target with a high-quality laser scanner
No need for a massive Mechanical Turk workforce for boring, time-consuming labeling.

3D with Kinect
Jan Smisek; Michal Jancosek; Tomas Pajdla
Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, 6-13 Nov. 2011, https://doi.org/10.1109/ICCVW.2011.6130380
“We demonstrate the functionality of Kinect calibration by integrating it into an SfM pipeline where 3D measurements from a moving Kinect are transformed into a common coordinate system by computing relative poses from matches in color camera.”

SfM performs better when one has both RGB and depth data available.
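One possible form of robust depth regression against partial ground truth is sketched below; the validity-mask convention (invalid pixels encoded as non-positive depth) and the Huber threshold are assumptions for illustration, not values from the paper.

```python
import torch

def robust_depth_loss(pred_depth, gt_depth, delta=1.0):
    """Robust depth regression against partial ground truth (e.g. Kinect).

    Pixels where the sensor returned no measurement (gt <= 0 by assumption)
    are masked out; the remaining residuals are scored with a Huber penalty
    so a few gross sensor errors do not dominate the loss.
    """
    valid = gt_depth > 0
    if valid.sum() == 0:
        return pred_depth.sum() * 0.0      # keep the graph intact, zero loss
    residual = (pred_depth - gt_depth)[valid].abs()
    quadratic = torch.clamp(residual, max=delta)
    # 0.5*r^2 inside |r| <= delta, linear growth outside.
    return (0.5 * quadratic ** 2 + delta * (residual - quadratic)).mean()
```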
• 29. SfM-NeT • Supervision • Supervising camera motion

If ground-truth camera pose trajectories are available, we can supervise our model by computing corresponding ground-truth camera rotation and translation from frame to frame, and constrain our camera motion predictions accordingly.

On the Google Tango platform you could use: 1) RGB video, 2) depth video, and 3) gyroscope data.

Online Camera-Gyroscope Autocalibration for Cell Phones
Chao Jia; Brian L. Evans
IEEE Transactions on Image Processing (Volume: 23, Issue: 12, Dec. 2014), https://doi.org/10.1109/TIP.2014.2360120
“Our contributions are: simultaneous online camera self-calibration and camera-gyroscope calibration based on an implicit extended Kalman filter and generalization of the multiple-view coplanarity constraint on camera rotation in a rolling shutter camera model for cell phones.”

Sensor-based camera motion detection for unconstrained SLAM
Original Assignee: Qualcomm Incorporated. Publication date: Jul 12, 2016. US 9390344 B2
“Techniques are presented for monocular visual simultaneous localization and mapping (SLAM) based on detecting a translational motion in the movement of the camera using at least one motion sensor, while the camera is performing panoramic SLAM, and initializing a three dimensional map for tracking of finite features. Motion sensors may include one or more sensors, including inertial (gyroscope, accelerometer), magnetic (compass), vision (camera) or any other sensors built into mobile devices.”

Fusion of vision and gyro tracking for robust augmented reality registration
S. You; U. Neumann
Virtual Reality, 2001. Proceedings. IEEE, https://doi.org/10.1109/VR.2001.913772
“The framework includes a two-channel complementary motion filter that combines the low-frequency stability of vision sensors with the high-frequency tracking of gyroscope sensors, hence achieving stable static and dynamic six-degree-of-freedom pose tracking. Our implementation uses an extended Kalman filter (EKF).”

Poling and Lerman (2016): “We present a deeply integrated method of exploiting low-cost gyroscopes to improve general purpose feature tracking. Most previous methods use gyroscopes to initialize and bound the search for features. In contrast, we use them to regularize the tracking energy function so that they can directly assist in the tracking of ambiguous and poor-quality features.”
• 30. SfM-NeT • Supervision • Supervising optical flow and object motion

Ground-truth optical flow, object masks, or object motions require expensive human annotation on real videos. However, these signals are available in recent synthetic datasets [20]. In such cases, our model could be trained to minimize, for example, an L1 regression loss between predicted and ground-truth flow vectors.

From https://arxiv.org/abs/1707.06316: “In this paper, we propose to use DenseNet for optical flow prediction. Our contributions are two-fold. First, we extend current DenseNet to a fully convolutional network. Our model is totally unsupervised, and achieves performance close to supervised approaches. Second, we empirically show that replacing convolutions with dense blocks in the expanding part yields better performance.”

Figure (https://arxiv.org/abs/1603.03911): (a) Semantic segmentation breaks the image into regions such as road, bike, person, sky, etc. (b) The flow also helps refine the segmentation of the foreground objects. (c) Existing optical flow algorithms do not have access to either the segmentations or the semantics of the classes. (d) Our semantic optical flow algorithm computes motion differently in different regions, depending on the semantic class label, resulting in more precise flow, particularly at object boundaries.
• 31. Upgrade • Supervision • Loss Function #1

http://doi.ieeecomputersociety.org/10.1109/TPAMI.2007.1171
“Nearly all existing methods for stereo reconstruction assume that scene reflectance is Lambertian{*} and make use of brightness constancy (BC) as a matching invariant. We introduce a new invariant for stereo reconstruction called light transport constancy (LTC), which allows completely arbitrary scene reflectance (bidirectional reflectance distribution functions, BRDFs). This invariant can be used to formulate a rank constraint on multiview stereo matching when the scene is observed by several lighting configurations in which only the lighting intensity varies.”
{*} Lambertian reflectance in practice means that the surface looks equally bright regardless of the direction from which you view it. This is not true for specular and mirror reflection. (“CS 354 Lighting” by Mark Kilgard, Graphics Software Engineer at NVIDIA)

Direct Visual Odometry using Bit-Planes
Hatem Alismail, Brett Browning, and Simon Lucey, The Robotics Institute, Carnegie Mellon University
https://arxiv.org/abs/1604.00990 (2016)
“At the core of direct Visual SLAM is the reliance on a consistent photometric appearance across images, otherwise known as the brightness constancy assumption. Unfortunately, brightness constancy seldom holds in real world applications. In this work, we overcome brightness constancy by incorporating feature descriptors into a direct visual odometry framework. This combination results in an efficient algorithm that combines the strength of both feature-based algorithms and direct methods. Namely, we achieve robustness to arbitrary photometric variations while operating in low-textured and poorly lit environments.”

Figure: An illustration of the Bit-Planes descriptor where each channel is composed of bits. Since the residual vector is binary, least squares minimization becomes equivalent to minimizing the Hamming distance.

Figure: Principles of Remote Sensing; Soudarissanane (2016)
  • 32. Upgrade • Supervision • Loss Function #2 Geometric Loss Functions for Camera Pose Regression with Deep Learning Alex Kendall, Roberto Cipolla (Submitted on 2 Apr 2017 (v1), last revised 23 May 2017 (this version, v2)) https://arxiv.org/abs/1704.00390 We show that our geometric approach can improve PoseNet’s efficacy across many different datasets – narrowing the deficit to traditional SIFT feature-based algorithms. For outdoor scenes ranging from 50,000 m² to 2 km² we can achieve relocalisation accuracies of a few meters and a few degrees. In small rooms we are able to achieve accuracies of 0.2–0.4 m. Comparison of different loss functions. We use an L1 distance for the residuals in each loss. Linear sum combines position and orientation losses with a constant scaling parameter β (Kendall and Cipolla 2015) and is defined in (2). Learn weighting is the loss function in (3) which learns to combine position and orientation using homoscedastic uncertainty. Reprojection error implicitly combines rotation and translation by using the reprojection error of the scene geometry as the loss (7). We find that homoscedastic uncertainty is able to learn an effective weighting between position and orientation quantities. The reprojection loss was not able to converge from random initialisation. However, when used to fine-tune a network pretrained with (3) it yields the best results.
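The two weighting schemes compared above can be sketched as follows; the forms follow the linear-sum loss (2) and the learned homoscedastic weighting (3) referenced in the caption, with s_x and s_q the learned log-variance parameters and the default β purely illustrative.

```python
import numpy as np

def linear_sum_loss(pos_err_l1, ori_err_l1, beta=500.0):
    """Fixed weighting as in (2): beta has to be hand-tuned per scene."""
    return pos_err_l1 + beta * ori_err_l1

def learned_weighting_loss(pos_err_l1, ori_err_l1, s_x, s_q):
    """Homoscedastic-uncertainty weighting as in (3): s_x and s_q are learned
    log-variance parameters, optimized jointly with the network weights."""
    return pos_err_l1 * np.exp(-s_x) + s_x + ori_err_l1 * np.exp(-s_q) + s_q
```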
  • 33. Upgrade • Supervision • Semi-supervised targets Recurrent Ladder Networks Alexander Ilin, Isabeau Prémont-Schwarz, Tele Hotloo Hao, Antti Rasmus, Rinu Boney, Harri Valpola (Submitted on 28 Jul 2017) https://arxiv.org/abs/1707.09219 “We propose a recurrent extension of the Ladder network, which is motivated by the inference required in hierarchical latent variable models. We demonstrate that the recurrent Ladder is able to handle a wide variety of complex learning tasks that benefit from iterative inference and temporal modeling. The architecture shows close-to-optimal results on temporal modeling of video data, competitive results on music modeling, and improved perceptual grouping based on higher order abstractions, such as stochastic textures and motion cues. We present results for fully supervised, semi- supervised, and unsupervised tasks. The results suggest that the proposed architecture and principles are powerful tools for learning a hierarchy of abstractions, handling temporal information, modeling relations and interactions between objects.” (a): Simple static hierarchical latent variable model. (b): Directions of message propagation. (c): Computational graph implementing message propagation in (b). (d): The structure of the Ladder network can be seen as a computational graph implementing message propagation in (c). The red circles mark the operations corresponding to the nodes of the graph in (b). (d): The structure of the recurrent Ladder (RLadder) network. https://arxiv.org/abs/1703.01780 https://github.com/CuriousAI/mean-teacher A sketch of a binary classification task with two labeled examples (large blue dots) and one unlabeled example, demonstrating how the choice of unlabeled target (black circle) affects the fitted function (gray curve). (a) A model with no regularization is free to fit any function. (b) A model trained with noisy labeled data (small dots) learns to give consistent predictions around labeled data points. (c) Consistency to noise around unlabeled examples provides additional smoothing. For the clarity of illustration, the teacher model (blue curve) is first fitted to the labeled examples, and then left unchanged during the training of the student model. Also for clarity, we will omit the small dots in figures d and e. (d) Noise on the teacher model reduces the bias of the targets without additional training. The expected direction of stochastic gradient descent is towards the mean (large blue circle) of individual noisy targets (small blue circles). (e) An ensemble of models gives an even better expected target. Both Temporal Ensembling and the Mean Teacher method use this approach
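As a concrete illustration of the Mean Teacher idea cited above (teacher weights as an exponential moving average of the student, plus a consistency cost between their predictions on noisy inputs), a minimal numpy sketch; hyperparameters and function names are illustrative, not the reference implementation.

```python
import numpy as np

def ema_update(teacher_weights, student_weights, alpha=0.99):
    """Mean Teacher: teacher parameters track an exponential moving average
    of the student parameters (alpha is the EMA decay)."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_weights, student_weights)]

def consistency_loss(student_pred, teacher_pred):
    """Consistency cost: the student's prediction on one noisy view of an input
    should match the teacher's prediction on another noisy view of it."""
    return np.mean((student_pred - teacher_pred) ** 2)
```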
  • 34. Upgrade • Supervision • “proxy” supervised targets https://arxiv.org/abs/1702.02295 (Submitted on 8 Feb 2017 (v1), last revised 1 Jul 2017 (this version, v2)) We study the unsupervised learning of CNNs for optical flow estimation using proxy ground truth data. Supervised CNNs, due to their immense learning capacity, have shown superior performance on a range of computer vision problems including optical flow prediction. They however require the ground truth flow which is usually not accessible except on limited synthetic data. Without the guidance of ground truth optical flow, unsupervised CNNs often perform worse as they are naturally ill-conditioned. We therefore propose a novel framework in which proxy ground truth data generated from classical approaches is used to guide the CNN learning. The models are further refined in an unsupervised fashion using an image reconstruction loss. Our guided learning approach is competitive with or superior to state-of-the-art approaches on three standard benchmark datasets yet is completely unsupervised and can run in real time. “More broadly, we introduce a paradigm which can be integrated into future state-of-the-art motion estimation networks [Ranjan and Black (2016)] to improve performance. In future work, we plan to experiment with large-scale video corpora to learn non-rigid real world motion patterns rather than just learning limited motions found in synthetic datasets.”
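A sketch of how the proxy-supervised and unsupervised terms described above could be combined into one training loss; `backward_warp` is an assumed differentiable-warping helper (bilinear sampling in practice), and the weights are illustrative.

```python
import numpy as np

def guided_flow_loss(flow_pred, flow_proxy, frame1, frame2, backward_warp,
                     proxy_weight=1.0, photo_weight=1.0):
    """Combine proxy supervision (flow from a classical estimator such as
    FlowFields) with an unsupervised image-reconstruction term.
    backward_warp(image, flow) is an assumed helper that samples frame2 at the
    flow-displaced coordinates."""
    # Endpoint error against the proxy ground truth.
    epe = np.sqrt(((flow_pred - flow_proxy) ** 2).sum(axis=-1)).mean()
    # Photometric term: frame2 warped by the predicted flow should match frame1.
    recon = backward_warp(frame2, flow_pred)
    photo = np.abs(recon - frame1).mean()
    return proxy_weight * epe + photo_weight * photo
```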
  • 35. Upgrade • Supervision • Self-supervision https://arxiv.org/abs/1612.03777 We improve CNN-based optical flow estimation in real videos by adding the extra self-supervised task of future frame prediction, and training the network with a mixture of synthetic and real-world videos. This combination is made possible by putting a “multiplexer” at the entry of the network which mixes data from the two sources on a timely basis.
  • 36. Upgrade • Supervision • Dense object segmentation Semantic Video Segmentation by Gated Recurrent Flow Propagation David Nilsson, Cristian Sminchisescu (Submitted on 28 Dec 2016) https://arxiv.org/abs/1612.08871 Deep Semantic Segmentation for Automated Driving: Taxonomy, Roadmap and Challenges Mennatullah Siam, Sara Elkerdawy, Martin Jagersand, Senthil Yogamani (Submitted on 8 Jul 2017 (v1), last revised 3 Aug 2017) https://arxiv.org/abs/1707.02432 Overview of our Spatial Transformer Gated Recurrent Unit (STGRU), combining a Spatial Transformer Network for optical flow warping with a Gated Recurrent Unit to adaptively propagate and fuse semantic segmentation information over time. Modular end to end learning: We use the term modular end to end learning when there are auxiliary losses to ensure safety and interpretability. For instance, segmentation loss can be added as an auxiliary loss for an end to end driving CNN [Xu et al. (2016)]. Using this auxiliary loss, the CNN loosely learns to semantically segment, but it also learns to have a better representation for the intermediate features. It was shown in that work that using the auxiliary loss outperforms vanilla end to end learning. That work also uses a recurrent gated unit after the CNN to model temporal information.
  • 37. Upgrade • Supervision • generative motion and content https://arxiv.org/abs/1707.04993 https://github.com/sergeytulyakov/mocogan https://arxiv.org/abs/1706.08033 https://sites.google.com/a/umich.edu/rubenevillegas/iclr2017 We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training.
  • 38. Upgrade • Supervision • data Augmentation Depth degradation techniques [taken from Yang et al. (2012)]. (a) under-sampling, (b) under-sampling with signal-dependent noise, (c) random missing, and (d) structural missing. [Rotman and Gilboa (2016)] From left to right, Kinect 1, 2 and R200 RealSense. The Kinect 1 sensor features invalid (black) depth values, and crooked edges. The Kinect 2 has false intermediate depth values (on the right side of the object). The RS shows depth artifacts with erroneous values (outlier white pixel on left bottom). [Rotman and Gilboa (2016)] Alismail et al. (2016) An Image Degradation Model for Depth-augmented Image Editing (2015) James W. Hennessey, Niloy J. Mitra http://dx.doi.org/10.1111/cgf.12707
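For data augmentation along these lines, a small numpy sketch that mimics degradation types (a)-(c) from Yang et al. (2012); the parameter values and the exact noise model are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def degrade_depth(depth, undersample=2, p_missing=0.05, noise_sigma=0.01, rng=None):
    """Illustrative depth-degradation augmentation: under-sampling,
    signal-dependent noise, and random missing values."""
    rng = rng or np.random.default_rng()
    d = depth.astype(np.float32).copy()
    # (a) under-sampling: keep every n-th pixel, then upsample by repetition.
    d = d[::undersample, ::undersample]
    d = np.repeat(np.repeat(d, undersample, axis=0), undersample, axis=1)
    d = d[:depth.shape[0], :depth.shape[1]]
    # (b) signal-dependent noise: noise grows with depth.
    d += rng.normal(0.0, noise_sigma, d.shape) * d
    # (c) random missing: invalidate (zero out) a fraction of the pixels.
    d[rng.random(d.shape) < p_missing] = 0.0
    return d
```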
  • 39. Upgrade • Supervision • (multimodal) decomposition Intrinsic Depth. (a) Input video. (b),(c) Albedo and shading estimated by the intrinsic video method. (d) Surface contours modified to combine RGB, albedo and shading information. (e) Proxy depth by propagating sparse SfM depth using video segments from [9]. (f) Depth estimated by our method, which combines the previous two methods. (g) Depth from the original Depth Transfer method. (h) Depth from the fully-metric method. (i) Depth from the example-based single image method. (j) Ground truth depth. Note that integrating information from different intrinsic images improves the estimation of the depth structure. In (e) and (j), black pixels indicate that no valid depth values are provided. Kong and Black (2015) Decomposing Single Images for Layered Photo Retouching Carlo Innamorati, Tobias Ritschel, Tim Weyrich, Niloy J. Mitra, University College London http://dx.doi.org/10.1111/cgf.13220 http://geometry.cs.ucl.ac.uk/projects/2017/layered-retouching/ Outline of proposed technique. (a) The Kinect depth of an object is combined with (b) three photos at different rotations of a polarizing filter. (c) Integration of surface normals obtained from Fresnel equations. Note the azimuthal ambiguity (observed as a flip in the shape) and distortion of the zenith angle (observed as flatness in the shape). (d) Integration of surface normals after correcting for azimuthal ambiguity removes the flip, and the final result is shown in (e) after correcting for zenith distortion and using physics-based integration. Kadambi et al. (2015) - http://web.media.mit.edu/~achoo/polar3D/ - http://news.mit.edu/2015/algorithms-boost-3-d-imaging-resolution-1000-times-1201 Polarization-sensing to improve depth
  • 40. Upgrade • Supervision • Multimodal Sensing • Rolling shutter motion IEEE Transactions on Visualization and Computer Graphics ( Volume: 22, Issue: 11, Nov. 2016 ) Towards Kilo-Hertz 6-DoF Visual Tracking Using an Egocentric Cluster of Rolling Shutter Cameras Akash Bapat ; Enrique Dunn ; Jan-Michael Frahm https://doi.org/10.1109/TVCG.2016.2593757 “The key idea is that a rolling shutter camera works by capturing the rows of an image in rapid succession, essentially acting as a high-frequency 1D image sensor. By integrating multiple rolling shutter cameras on the AR device, our tracker is able to perform 6-DOF markerless tracking in a static indoor environment with minimal latency.” https://arxiv.org/abs/1704.07163 “Rolling Shutter (RS) cameras have become popularized because of low-cost imaging capability. However, the RS cameras suffer from undesirable artifacts when the camera or the subject is moving, or illumination condition changes. For that reason, Monocular Visual Odometry (MVO) with RS cameras produces inaccurate ego-motion estimates. Previous works solve this RS distortion problem with motion prediction from images and/or inertial sensors. However, the MVO still has trouble in handling the RS distortion when the camera motion changes abruptly (e.g. vibration of mobile cameras causes extremely fast motion instantaneously).” Coded Rolling Shutter Photography: Flexible Space-Time Sampling http://www.cs.columbia.edu/CAVE/projects/crsp/ The interlaced readout can be used to compute optical flow between the two sub-images after vertical interpolation. The optical flow can be used for motion interpolation, skew compensation, and motion deblur. Please refer to the paper for details.
  • 41. Upgrade • Supervision • Transfer learning #1 Application of transfer learning in RGB-D object recognition Advances in Computing, Communications and Informatics (ICACCI), 2016 International Conference on Abhishek Kumar; S. Nithin Shrivatsav; G. R. K. S. Subrahmanyam; Deepak Mishra https://doi.org/10.1109/ICACCI.2016.7732108 “Firstly we trained a CNN network with 10 classes of different objects and then we transfer the parameters to RGB and depth CNN network. This enables the network to train faster and also achieve higher accuracy for a given number of epochs.” Depth CNNs for RGB-D Scene Recognition: Learning from Scratch Better than Transferring from RGB-CNNs Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) Xinhang Song, Luis Herranz, Shuqiang Jiang https://github.com/songxinhang/D-CNN https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14695/14310 HHA encoding for depth data (Gupta et al. 2014) is a three-channel representation (horizontal disparity, height above ground, and angle with the direction of gravity) of depth data. Transferring deep representations within the same modality (e.g. Places-CNN fine-tuned on SUN397) works well, since low-level patterns have similar distributions, and bottom layers can be reused while adjusting the more dataset-specific top layers. However, fine-tuning is not that effective in inter-modal transfer, such as Places-CNN to depth in the HHA space, where low-level features require modality-specific filters. In this paper, we focus on the bottom layers, because they are more critical to represent depth data properly. By reducing the number of parameters of the network, and using weakly supervised learning over patches, the complexity of the model matches better the amount of data available. This depth representation is not only more discriminative than those fine-tuned from Places-CNN but also when combined with RGB features the gain is higher, showing that both are complementary. Notice also, that we do not depend (for depth) on large datasets such as Places. https://doi.org/10.1016/j.patcog.2017.07.026 “The RGB-specific detection network is initialized with the ImageNet [Deng et al. (2009)] RGB classification model. To better leverage the depth information, the modality-correlated and depth-specific networks are initialized from a supervision transfer model [Gupta et al. (2016)]”
  • 42. Upgrade • Supervision • Transfer learning #2 Learning Transferrable Knowledge for Semantic Segmentation With Deep Convolutional Neural Network Seunghoon Hong, Junhyuk Oh, Honglak Lee, Bohyung Han; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3204-3212 https://doi.org/10.1109/CVPR.2016.349 Overall architecture of the proposed algorithm. Given a feature extracted from the encoder, the attention model estimates adaptive spatial saliency of each category associated with input image. The outputs of attention model are subsequently fed into the decoder, which generates foreground segmentation mask of each focused region. During training, we fix the encoder by pre-trained weights, and leverage the segmentation annotations from source domain to train both the decoder and the attention model, and image-level class labels in both domains to train the attention model under classification objective. After training, semantic segmentation on the target domain is performed naturally by exploiting the decoder trained with source images and the attention model adapted to target domain The contributions of this paper are summarized below. ● We propose a new paradigm for weakly-supervised semantic segmentation, which exploits segmentation annotations from different categories to guide segmentations with weak annotations. To our knowledge, this is the first attempt to tackle the weakly-supervised semantic segmentation problem by transfer learning. ● We propose a novel encoder-decoder architecture with attention model, which is appropriate to transfer the segmentation knowledge across categories. ● The proposed algorithm achieves substantial performance improvement over existing weakly-supervised approaches by exploiting segmentation annotations in exclusive categories.
  • 43. Upgrade • Supervision • Transfer learning #3 Borrowing Treasures from the Wealthy: Deep Transfer Learning through Selective Joint Fine-tuning Weifeng Ge, Yizhou Yu (Submitted on 28 Feb 2017 (v1), last revised 6 Jun 2017 (this version, v2)) https://arxiv.org/abs/1702.08690 https://github.com/ZYYSzj/Selective-Joint-Fine-tuning In this paper, we introduce a source-target selective joint fine-tuning scheme for improving the performance of deep learning tasks with insufficient training data. In this scheme, a target learning task with insufficient training data is carried out simultaneously with another source learning task with abundant training data. However, the source learning task does not use all existing training data. Our core idea is to identify and use a subset of training images from the original source learning task whose low- level characteristics are similar to those from the target learning task, and jointly fine-tune shared convolutional layers for both tasks. Pipeline of the proposed selective joint fine-tuning. From left to right: (a) Datasets in the source domain and the target domain. (b) Select nearest neighbors of each target domain training sample in the source domain via a low-level feature space. (c) Deep convolutional neural network initialized with weights pre-trained on ImageNet or Places. (d) Jointly optimize the source and target cost functions in their own label spaces. Similar Image Search There is a unique step in our pipeline. For each image from the target domain, we search a certain number of images with similar low-level characteristics from the source domain. Only images returned from these searches are used as training images for the source learning task in selective joint fine-tuning. We elaborate this image search step below. In summary, this paper has the following contributions: ● We introduce a new deep transfer learning scheme, called selective joint fine-tuning, for improving the performance of deep learning tasks with insufficient training data. It is an important step forward in the context of the widely adopted strategy of fine-tuning a pre-trained deep neural network. ● We develop a novel pipeline for implementing this deep transfer learning scheme. Specifically, we compute descriptors from linear or nonlinear filter bank responses on training images from both tasks, and use such descriptors to search for a desired subset of training samples for the source learning task. ● Experiments demonstrate that our deep transfer learning scheme achieves state-of-the-art performance on multiple visual classification tasks with insufficient training data for deep learning.
  • 45. SfM-NeT • implementation Details coarse map 2 x fully connected layers Our depth-predicting structure and object-mask-predicting motion conv/deconv networks share similar architectures but use independent weights. Each consists of a series of 3×3 convolutional layers alternating between stride 1 and stride 2, followed by deconvolutional operations consisting of a depth-to-space upsampling, concatenation with corresponding feature maps from the convolutional portion, and a 3×3 convolutional layer. Batch normalization is applied to all convolutional layer outputs. The structure network takes a single frame as input, while the motion network takes a pair of frames. We predict depth values using a 1×1 convolutional layer on top of the image-sized feature map. We use ReLU activations because depths are positive, and a bias of 1 to prevent small depth values. The maximum predicted depth value is further clipped at 100 to prevent large gradients. We predict object masks from the image-sized feature map of the motion network using a 1×1 convolutional layer with sigmoid activations. To encourage sharp masks we multiply the logits of the masks by a parameter that is a function of the number of steps for which the network has been trained. The pivot variables are predicted as heat maps using a softmax function over all the locations in the image followed by a weighted average of the pixel locations. In Keras terms: keras.layers.convolutional.Conv2D(filters, kernel_size, strides) with kernel_size=(3,3), e.g. filters=32, strides=(1,1) for the stride-1 layers and filters=64, strides=(2,2) for the stride-2 layers, and keras.layers.convolutional.UpSampling2D(size=(2, 2)) for the upsampling.
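Putting the pieces above together, a minimal tf.keras sketch of the structure network; the layer widths, the input resolution, and the use of UpSampling2D in place of the paper's depth-to-space upsampling are simplifying assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_bn(x, filters, stride):
    """3x3 convolution + batch normalization + ReLU, as described above."""
    x = layers.Conv2D(filters, 3, strides=stride, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def structure_net(input_shape=(128, 384, 3)):
    """Sketch of the depth ('structure') conv/deconv network."""
    inp = layers.Input(input_shape)
    # Encoder: 3x3 convolutions alternating stride 1 and stride 2.
    e1 = conv_bn(conv_bn(inp, 32, 1), 32, 2)
    e2 = conv_bn(conv_bn(e1, 64, 1), 64, 2)
    e3 = conv_bn(conv_bn(e2, 128, 1), 128, 2)
    # Decoder: upsample, concatenate the matching encoder map, 3x3 convolution.
    d2 = conv_bn(layers.Concatenate()([layers.UpSampling2D()(e3), e2]), 64, 1)
    d1 = conv_bn(layers.Concatenate()([layers.UpSampling2D()(d2), e1]), 32, 1)
    d0 = conv_bn(layers.UpSampling2D()(d1), 32, 1)
    # 1x1 convolution with ReLU, a bias of 1 and clipping at 100 for the depth map.
    depth = layers.Conv2D(1, 1, activation='relu',
                          bias_initializer=tf.keras.initializers.Constant(1.0))(d0)
    depth = layers.Lambda(lambda t: tf.minimum(t, 100.0))(depth)
    return Model(inp, depth)
```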
  • 47. SfM-NeT • Experimental Results #1 Qualitative comparison of the estimated depth using our unsupervised model on sequences versus using stereo pairs in the KITTI 2012 benchmark. When using stereo pairs the camera pose between the pair is constant and hence the model is equivalent to the approach of Garg et al. (2016). For sequences, our model needs to additionally predict camera rotation and translation between the two frames. The first six rows show successful predictions even without camera pose information and the last two illustrate failure cases. The failure cases show that when there is no translation between the two frames depth estimation fails, whereas when using stereo pairs there is always a constant offset between the frames.
  • 48. SfM-NeT • Experimental Results #2 Ground truth segmentation and flow compared to predicted motion masks and flow from SfM-Net in KITTI 2015. The model was trained in a fully unsupervised manner. The top six rows show successful prediction and the last two show typical failure cases.
  • 49. SfM-NeT • Experimental Results #3 Motion segments computed from SfM-Net in MoSeg [Brox and Malik (2010)]. The model was trained in a fully unsupervised manner. “We report camera rotation and translation error in Table 2 for each of the Freiburg sequences compared to the error in the benchmark’s baseline trajectories. Our model was trained from scratch for each sequence and used the focal length value provided with the dataset. We observe that our results better estimate the frame-to-frame translation and are comparable for rotation.”
  • 51. SfM-NeT • Conclusion Current geometric SLAM methods obtain excellent egomotion and rigid 3D reconstruction results, but often come at a price of extensive engineering, low tolerance to moving objects — which are treated as noise during reconstruction — and sensitivity to camera calibration. Furthermore, matching and reconstruction are difficult in low textured regions. Incorporating learning into depth reconstruction, camera motion prediction and object segmentation, while still preserving the constraints of image formation, is a promising way to robustify SLAM and visual odometry even further. However, the exact training scenario required to solve this more difficult inference problem remains an open question. Exploiting long history and far in time forward-backward constraints with visibility reasoning is an important future direction. Further, exploiting a small amount of annotated videos for object segmentation, depth, and camera motion, and combining those with an abundance of self-supervised videos, could help initialize the network weights in the right regime and facilitate learning. Many other curriculum learning regimes, including those that incorporate synthetic datasets, can also be considered.
  • 53. Pipeline • Future paths • “GRID” Architecture • Review https://arxiv.org/abs/1702.08670 Our technical results are corroborated by an extensive set of evaluations, presented in this paper as well as independent empirical observations reported by other groups. We also perform experiments showing the practical implications of our framework for choosing the best fully-connected design for a given problem. Due to the fast pace of deep learning as a field, it is very hard to find a good methodological review. Thus, be cautious when reading these, as e.g. Feb 2017 is already quite old when reading the review in August 2017. https://arxiv.org/abs/1512.07108 Systematic evaluation of CNN advances on the ImageNet Dmytro Mishkin, Nikolay Sergievskiy, Jiri Matas 16 May 2017 https://doi.org/10.1016/j.cviu.2017.05.007 https://arxiv.org/abs/1606.02228 https://www.researchgate.net/publication/316970253_Systematic_Evaluation_of_Convolution_Neural_Network_Advances_on_the_ImageNet The commonly used input to CNN is raw RGB pixels and the commonly adopted recommendation is not to use any pre-processing. There has not been much research on the optimal colorspace or pre-processing techniques for CNN. Rachmadi and Purnama (2015) explored different colorspaces for vehicle color identification, Dong et al. (2014) compared YCrCb and RGB channels for image super-resolution, and Graham (2015) extracted local average color from retina images in the winning solution to the Kaggle Diabetic Retinopathy Detection competition. Petteri: The authors could have tested CIELab as well, which might have been an interesting colorspace especially in photo enhancement applications [e.g. Yan et al. (2016)]
  • 54. Pipeline • Future paths • ConvNet Architecture • DenseNet #1 To ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. Implicit Deep Supervision. One explanation for the improved accuracy of dense convolutional networks may be that individual layers receive additional supervision from the loss function through the shorter connections. One can interpret DenseNets to perform a kind of “deep supervision”. The benefits of deep supervision have previously been shown in deeply-supervised nets (DSN; Lee et al. 2014), which have classifiers attached to every hidden layer, enforcing the intermediate layers to learn discriminative features. https://arxiv.org/abs/1611.09326
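The dense connectivity pattern described above can be written compactly in Keras; this is a generic DenseNet-style block (BN-ReLU-Conv ordering), with num_layers and growth_rate as illustrative defaults rather than values from any particular paper.

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    """DenseNet-style block: each layer sees the concatenation of all preceding
    feature maps and contributes growth_rate new channels to the stack."""
    features = [x]
    for _ in range(num_layers):
        h = layers.Concatenate()(features) if len(features) > 1 else features[0]
        h = layers.BatchNormalization()(h)
        h = layers.ReLU()(h)
        h = layers.Conv2D(growth_rate, 3, padding='same')(h)
        features.append(h)
    return layers.Concatenate()(features)
```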
  • 55. Pipeline • Future paths • ConvNet Architecture • DenseNet #2 https://arxiv.org/abs/1608.06993 Image classification Semantic Segmentation Optical Flow https://arxiv.org/abs/1611.09326 https://arxiv.org/abs/1707.06316
  • 56. Pipeline • Future paths • ConvNet Architecture • DenseNet #3 https://arxiv.org/abs/1702.02295 https://arxiv.org/abs/1707.06316 Classical expanding uses a series of convolutions, deconvolutions, and skip connections to recover the spatial resolution in order to get the per-pixel prediction results. Due to the good properties of DenseNet, we propose to replace the convolutions with dense blocks during expanding as well. However, if we follow the same dense connectivity pattern, the number of feature maps after each dense block will keep increasing. Considering that the resolution of the feature maps also increases during expanding, the computational cost will be intractable for current GPUs. Thus, for a dense block in the expanding part, we do not concatenate the input to its final output. For example, if the input has k0 channels, the output of an L-layer dense block will have Lk feature maps. k is the growth rate of a DenseNet, defining the number of feature maps each layer produces. Note that dense blocks in the contracting part will output k0 + Lk feature maps. For symmetry, we also introduce four dense blocks in the expanding part, each of which has four layers. The bottom layer feature maps at the same resolution are concatenated through skip connections. Between the dense blocks, there are transition up layers composed of two 3×3 deconvolutions with a stride of 2. One is for upsampling the estimated optical flow, and the other is for upsampling the feature maps. “Our model is totally unsupervised, thus we can experiment with large-scale video corpora in future work, to learn non-rigid real world motion patterns. Through comparison of popular CNN architectures, we found that it is important to design novel operators or networks for optical flow estimation instead of relying on existing architectures for image classification.” In this work, we choose FlowFields (Bailer et al. 2015) as our classical optical flow estimator. To our knowledge, it is one of the most accurate flow estimators among the published work. We hope that by using FlowFields to generate proxy ground truth, we can learn to estimate motion between image pairs as effectively as using the true ground truth. If a classical approach fails to detect certain motion patterns, a network trained on the proxy ground truth is also likely to miss these patterns. This leads us to ask whether there is other unsupervised guidance that can improve the network training. The unsupervised approach (Yu et al. 2016) treats optical flow estimation as an image reconstruction problem, based on the intuition that if the estimated flow and the next frame can be used to reconstruct the current frame then the network has learned useful representations of the underlying motions. Note that we could add additional unsupervised guides like a gradient constancy assumption or an edge-aware weighted smoothness loss (Godard et al. 2016) to further fine-tune our models.
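The two modifications described above (no input concatenation in the expanding dense blocks, and paired transition-up deconvolutions for features and flow) could look roughly like this in Keras; the channel counts, kernel sizes, and function names are illustrative where the text does not pin them down.

```python
from tensorflow.keras import layers

def expanding_dense_block(x, num_layers=4, growth_rate=12):
    """Dense block for the expanding path: same dense connectivity inside the
    block, but the block input is not concatenated to the output, so the block
    emits num_layers * growth_rate channels instead of k0 + L*k."""
    features, new_features = [x], []
    for _ in range(num_layers):
        h = layers.Concatenate()(features) if len(features) > 1 else features[0]
        h = layers.BatchNormalization()(h)
        h = layers.ReLU()(h)
        h = layers.Conv2D(growth_rate, 3, padding='same')(h)
        features.append(h)
        new_features.append(h)
    return layers.Concatenate()(new_features)

def transition_up(features, flow, filters):
    """Two stride-2 3x3 deconvolutions between dense blocks: one upsamples the
    feature maps, the other upsamples the current optical-flow estimate."""
    up_feat = layers.Conv2DTranspose(filters, 3, strides=2, padding='same')(features)
    up_flow = layers.Conv2DTranspose(2, 3, strides=2, padding='same')(flow)
    return up_feat, up_flow
```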
  • 57. Pipeline • Future paths • ConvNet Architecture • DenseNet #4 However, during inference the lower layers do not know about high layer features, although they contain contextual high semantics that benefit low layers to adaptively extract informative features for later layers. In this paper, we study the influence of backward skip connections which are in the opposite direction to forward shortcuts, i.e. paths from high layers to low layers. To achieve this -- which indeed runs counter to the nature of feed-forward networks -- we propose a new fully convolutional model that consists of a pair of networks. A `Slave' network is dedicated to provide the backward connections from its top layers to the `Master' network's bottom layers. ‘Forward skip’ is an illustration of one forward skip connection in our own baseline FCN-8s-F1. ‘Backward Skip’ is our proposed design where we align two networks, a Master and a Slave. The Slave network provides the backward skip connections from its higher layers to the Master network's lower layers. ‘Block in MSNet’ shows an illustration of our final model’s skip connections, where additional forward skip connections have been added to the Master. A detailed overview of our MSNet-FB1 model. The model consists of two networks; Master and Slave. The Slave is a basic FCN-8s. The Master is a FCN-8s-F1 (as our baseline) and contains the backward skip connections taken from the Slave network. The forward propagation starts in the Slave network, and then the generated feature maps are fed back to lower layers in the Master network serving as backward skip connections within the Master. Notice that the backward skip connections (blue arrows) are exactly the inverse of the forward skip connections (purple arrows). An illustration of several convolutional blocks of our MSNet-FB2 with dense skip connections. Notice that the Master network is FCN-8s-F2. We set P = 3 and N = 3. Here the dense backward skip connections are in inverse directions to the dense forward skip connections
  • 58. Pipeline • Future paths • ConvNet Architecture • DenseNet #5 The use of backward connections is actually closer to the cliché of deep learning being analogous to the human brain. Deep neural network explains early visual and inferior temporal (IT) representations of object images. Each representation in model and brain was characterized by the dissimilarity matrix of the response patterns elicited by a set of real-world photos of objects. (a) Representations become monotonically more similar to those of human inferior temporal (IT) cortex as we ascend the layers of the Krizhevsky et al. (2012) neural network. When the final representational stages are linearly remixed to emphasize the same semantic dimensions as IT using linear category discriminants (second bar from the right), and when each layer and each discriminant are assigned a weight to model the prevalence of different computational features in IT (cross-validated to avoid overfitting to the image set; rightmost bar), the noise ceiling (gray shaded region) is reached, indicating that the model fully explains the data. (b) Lower layers of the deep neural network resemble the representations in the foveal confluence of early visual areas (V1–V3). http://dx.doi.org/10.1146/annurev-vision-082114-035447 https://arxiv.org/abs/1604.03640 Center for Brains, Minds and Machines, McGovern Institute, MIT “Feed-forward zombie” The Sciences of Consciousness: Progress and Problems: Center for Brains, Minds and Machines (CBMM), Christof Koch - Allen Institute for Brain Science, https://youtu.be/4gT-1S3FO4s?t=1h9m34s “Not pleasing the people worshipping at the altar of computationalism”
  • 59. Pipeline • Future paths • Uncertainty • with DenseNet https://arxiv.org/abs/1506.02142 https://arxiv.org/abs/1705.07832 https://arxiv.org/abs/1703.04977 There are two major types of uncertainty one can model. Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model -- uncertainty which can be explained away given enough data. Traditionally it has been difficult to model epistemic uncertainty in computer vision, but with new Bayesian deep learning tools this is now possible. We study the benefits of modeling epistemic vs. aleatoric uncertainty in Bayesian deep learning models for vision tasks. Our model based on DenseNet can process a 640 × 480 resolution image in 150ms on a NVIDIA Titan X GPU. The aleatoric uncertainty models add negligible compute. However, epistemic models require expensive Monte Carlo dropout sampling. For models such as ResNet, this is possible to achieve economically because only the last few layers contain dropout. Other models, like DenseNet, require the entire architecture to be sampled. This is difficult to parallelize due to GPU memory constraints, and often results in a 50× slowdown for 50 Monte Carlo samples. https://arxiv.org/abs/1703.02910
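A minimal sketch of test-time Monte Carlo dropout for the epistemic part, assuming a Keras model whose dropout layers stay active when called with training=True; the 50-sample default mirrors the slowdown figure quoted above, and the helper name is illustrative.

```python
import numpy as np

def mc_dropout_predict(model, x, num_samples=50):
    """Epistemic uncertainty via Monte Carlo dropout: run the network several
    times with dropout active at test time and look at the spread of the outputs."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(num_samples)])
    mean = preds.mean(axis=0)       # predictive mean
    epistemic = preds.var(axis=0)   # spread across samples = model uncertainty
    return mean, epistemic
```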
  • 60. Pipeline • Future paths • Uncertainty • With model compression Bayesian Compression for Deep Learning Christos Louizos, Karen Ullrich, Max Welling (Submitted on 24 May 2017 (v1), last revised 10 Aug 2017 (this version, v3)) https://arxiv.org/abs/1705.08665 From a Bayesian perspective, network pruning and reducing bit precision for the weights is aligned with achieving high accuracy, because Bayesian methods search for the optimal model structure (which leads to pruning with sparsity inducing priors), and reward uncertain posteriors over parameters through the bits back argument [Hinton and Van Camp, 1993] (which leads to removing insignificant bits). This relation is made explicit in the MDL principle [Grünwald, 2007] which is known to be related to Bayesian inference. By employing sparsity inducing priors for hidden units (and not individual weights) we can prune neurons including all their ingoing and outgoing weights. This avoids more complicated and inefficient coding schemes needed for pruning or vector quantizing individual weights. As an additional Bayesian bonus we can use the posterior uncertainties to assess which bits are significant and remove the ones which fluctuate too much under posterior sampling. From this we derive the optimal fixed point precision per layer, which is still practical on chip. For the actual compression task we compare our method to current work in three different scenarios: (i) compression achieved only by pruning, here, for non-group methods we use the CSC format to store parameters; (ii) compression based on the former but with reduced bit precision per layer (only for the weights); and (iii) the maximum compression rate as proposed by Han et al. [2016]. We believe these to be relevant scenarios because (i) can be applied with already existing frameworks such as Tensorflow, (ii) is a practical scheme given upcoming GPUs and frameworks will be designed to work with low and mixed precision arithmetics [Lin and Talathi, 2016, Gysel, 2016]
  • 61. Pipeline • Future paths • Uncertainty • Geometric problems http://mi.eng.cam.ac.uk/projects/relocalisation/
  • 62. Future • Geometric Architectures
  • 63. Pipeline • Future paths • Geometric Deep Learning #1 Bronstein et al. (July 2017): “Geometric deep learning (http://geometricdeeplearning.com/) is an umbrella term for emerging techniques attempting to generalize (structured) deep neural models to non-Euclidean domains, such as graphs and manifolds. The purpose of this article is to overview different examples of geometric deep-learning problems and present available solutions, key difficulties, applications, and future research directions in this nascent field” SCNN (2013) GCNN/ChebNet (2016) GCN (2016) GNN (2009) Geodesic CNN (2015) Anisotropic CNN (2016) MoNet (2016) Localized SCNN (2015)
  • 64. Pipeline • Future paths • Geometric Deep Learning #2 Bronstein et al. (July 2017): “The non-Euclidean nature of data implies that there are no such familiar properties as global parameterization, common system of coordinates, vector space structure, or shift-invariance. Consequently, basic operations like convolution that are taken for granted in the Euclidean case are even not well defined on non-Euclidean domains.” “First attempts to generalize neural networks to graphs we are aware of are due to Gori et al. (2005), who proposed a scheme combining recurrent neural networks and random walk models. This approach went almost unnoticed, re-emerging in a modern form in Sukhbaatar et al. (2016) and Li et al. (2015) due to the renewed recent interest in deep learning.” “In a parallel effort in the computer vision and graphics community, Masci et al. (2015) showed the first CNN model on meshed surfaces, resorting to a spatial definition of the convolution operation based on local intrinsic patches. Among other applications, such models were shown to achieve state-of-the-art performance in finding correspondence between deformable 3D shapes. Follow-up works proposed different constructions of intrinsic patches on point clouds (Boscaini et al. 2016a,b) and general graphs (Monti et al. 2016).” In calculus, the notion of derivative describes how the value of a function changes with an infinitesimal change of its argument. One of the big differences distinguishing classical calculus from differential geometry is a lack of vector space structure on the manifold, prohibiting us from naïvely using expressions like f(x+dx). The conceptual leap that is required to generalize such notions to manifolds is the need to work locally in the tangent space. Physically, a tangent vector field can be thought of as a flow of material on a manifold. The divergence measures the net flow of a field at a point, allowing to distinguish between field ‘sources’ and ‘sinks’. Finally, the Laplacian (or Laplace-Beltrami operator in differential geometric jargon) “A centerpiece of classical Euclidean signal processing is the property of the Fourier transform diagonalizing the convolution operator, colloquially referred to as the Convolution Theorem. This property allows to express the convolution f⋆g of two functions in the spectral domain as the element-wise product of their Fourier transforms. Unfortunately, in the non-Euclidean case we cannot even define the operation x-x’ on the manifold or graph, so the notion of convolution does not directly extend to this case.”
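To make the spectral-domain analogy concrete, a small numpy sketch of filtering a graph signal through the eigenbasis of the graph Laplacian (the graph Fourier basis); this is the naive O(n³) construction that ChebNet-style methods approximate with polynomial filters, and the example filter is illustrative.

```python
import numpy as np

def spectral_graph_filter(W, f, spectral_filter):
    """Spectral filtering on a graph: the Laplacian's eigenvectors play the role
    of the Fourier basis, so 'convolution' becomes an element-wise product in the
    spectral domain (the non-Euclidean analogue of the Convolution Theorem).
    W: symmetric adjacency matrix, f: signal on the nodes,
    spectral_filter: function of the eigenvalues."""
    d = W.sum(axis=1)
    L = np.diag(d) - W              # (unnormalized) graph Laplacian
    lam, U = np.linalg.eigh(L)      # graph Fourier basis and frequencies
    f_hat = U.T @ f                 # graph Fourier transform of the signal
    g_hat = spectral_filter(lam)    # filter defined on the spectrum
    return U @ (g_hat * f_hat)      # inverse transform of the product

# Example: a low-pass filter that attenuates high graph frequencies.
# smoothed = spectral_graph_filter(W, f, lambda lam: np.exp(-0.5 * lam))
```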
  • 65. Pipeline • Future paths • Geometric Deep Learning #3 Bronstein et al. (July 2017): “We expect the following years to bring exciting new approaches and results, and conclude our review with a few observations of current key difficulties and potential directions of future research.” Generalization: Generalizing deep learning models to geometric data requires not only finding non-Euclidean counterparts of basic building blocks (such as convolutional and pooling layers), but also generalization across different domains. Generalization capability is a key requirement in many applications, including computer graphics, where a model is learned on a training set of non-Euclidean domains (3D shapes) and then applied to previously unseen ones. Time-varying domains: An interesting extension of geometric deep learning problems discussed in this review is coping with signals defined over a dynamically changing structure. In this case, we cannot assume a fixed domain and must track how these changes affect signals. This could prove useful to tackle applications such as abnormal activity detection in social or financial networks. In the domain of computer graphics and vision, potential applications deal with dynamic shapes (e.g. 3D video captured by a range sensor). Computation: The final consideration is a computational one. All existing deep learning software frameworks are primarily optimized for Euclidean data. One of the main reasons for the computational efficiency of deep learning architectures (and one of the factors that contributed to their renaissance) is the assumption of regularly structured data on 1D or 2D grid, allowing to take advantage of modern GPU hardware. Geometric data, on the other hand, in most cases do not have a grid structure, requiring different ways to achieve efficient computations. It seems that computational paradigms developed for large-scale graph processing are more adequate frameworks for such applications.
  • 66. Pipeline • Future paths • sparsity primer from neuroscience Questioning the role of sparse coding in the brain Anton Spanne, Henrik Jörntell July 2015 - http://doi.org/10.1016/j.tins.2015.05.005 Optimal Degrees of Synaptic Connectivity Litwin-Kumar A, Harris KD, Axel R, Sompolinsky H, Abbott LF Feb 2017 - http://doi.org/10.1016/j.tins.2015.05.005 3D Visual Response Properties of MSTd Emerge from an Efficient, Sparse Population Code Michael Beyeler, Nikil Dutt and Jeffrey L. Krichmar Journal of Neuroscience 10 August 2016 http://doi.org/10.1523/JNEUROSCI.0396-16.2016 ● Sparse coding is questioned on both theoretical and experimental grounds. ● Generalization is important to current brain models but is weak under sparse coding. ● The beneficial properties ascribed to sparse coding can be achieved by alternative means. Sparse code is a compromise between local code and dense code. (A) Comparison of coding schemes that differ in their ratio of active neurons: in other words, in their sparseness. The activities within the population during two hypothetical contexts (context A and context B) are shown as examples of how different contexts are represented within the population. Note that by our definition only a single context would be active at any time because a context represents the global brain state (i.e., all the neurons). In local code, a context is represented by the activity of a single neuron, or a small subset of neurons, and different contexts are represented by different neurons. Notably, the activities of the neurons are not independent because if a neuron is responding to context A, it will not respond to any other context. In dense code, all neurons are active and their combined activity is used to encode each context. Any state in between the two extreme cases of local and dense code can in principle be labeled sparse code. The reduction of average activation leads to a reduction in the overlap or interference between the activation during different contexts. (B) In the special case of binary activation functions, maximal representational capacity is obtained if 50% of the neurons are active during each context. For this reason an average activation of 50% is usually considered dense code in the binary case. ● Sparse synaptic wiring can optimize a neural representation for associative learning ● Maximizing dimension predicts the degree of connectivity for cerebellum-like circuits ● Supervised plasticity of input connections is needed to exploit dense wiring ● Performance of a Hebbian readout neuron is formally related to dimension (A) Schematic of network with a classifier that computes a weighted sum of mixed-layer activity to determine the valence of an input pattern. Example flow fields generated with the motion field model [Longuet-Higgins and Prazdny (1980); Raudies (2013)]. Generated from a pinhole camera with image plane.
  • 67. Pipeline • Future paths • Non-euclidean computability & Geometric Operators #1 https://arxiv.org/abs/1707.06316 “Our model is totally unsupervised, thus we can experiment with large-scale video corpora in future work, to learn non-rigid real world motion patterns. Through comparison of popular CNN architectures, we found that it is important to design novel operators or networks for optical flow estimation instead of relying on existing architectures for image classification” Away from GRID Szegedy et al. 2014 (Google, cited by 4,027 articles): “Today’s computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned, numerical libraries that allow for extremely fast dense matrix multiplication. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure.” https://arxiv.org/abs/1704.04760 Google is not quite there yet for hardware-accelerated sparse matrix deep learning: “The matrix unit of a custom ASIC—called a Tensor Processing Unit (TPU)—is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs.”
  • 68. Pipeline • Future paths • Non-euclidean computability & Geometric Operators #2 Speeding up Convolutional Neural Networks By Exploiting the Sparsity of Rectifier Units Shaohuai Shi, Xiaowen Chu (Submitted on 25 Apr 2017 (v1), last revised 15 May 2017 (this version, v2)) https://arxiv.org/abs/1704.07724 Rectifier neuron units (ReLUs) have been widely used in deep convolutional networks. A ReLU converts negative values to zeros, and does not change positive values, which leads to a high sparsity of neurons. In this work, we first examine the sparsity of the outputs of ReLUs in some popular deep convolutional architectures. And then we use the sparsity property of ReLUs to accelerate the calculation of convolution by skipping calculations of zero-valued neurons. The proposed sparse convolution algorithm achieves some speedup improvements on CPUs compared to the traditional matrix-matrix multiplication algorithm for convolution when the sparsity is not less than 0.9. We measure the speed of compared algorithms on the Intel CPU: E5-2630v4 at the core frequency of 2.20GHz with 128 GB memory. We propose the inverse sparse convolution (ISC) algorithm by three steps: First, we skip all the zero elements of the input data, and store the non-zero values in a vector with their column and row information. Second, the kernel matrix is stored as a column-major matrix such that for each non-zero element I_{c,i,j} of the inputs, a continuous block of memory that stores kernels can be fetched and multiplied by I_{c,i,j} at one time with AVX or SSE techniques. Third, transpose temporary results from the second step to generate outputs. The Power of Sparsity in Convolutional Neural Networks Soravit Changpinyo, Mark Sandler, Andrey Zhmoginov (Submitted on 21 Feb 2017) https://arxiv.org/abs/1702.06257 We deactivate connections between filters in convolutional layers in a way that allows us to harvest savings both in run-time and memory for many network architectures. More specifically, we generalize 2D convolution to use a channel-wise sparse connection structure and show that this leads to significantly better results than the baseline approach for large networks including VGG and Inception V3. “For example, when applied to Inception V3 (Fig. 4) it achieves AlexNet-level accuracy with fewer than 400K parameters and VGG-level accuracy (Fig. 5) with roughly 3.5M parameters. In addition, we show that our method leads to an interesting novel incremental training technique, where we take advantage of sparse (and smaller) models to build a dense network. One interesting open direction is to enable incremental training not to simply densify the network over time, but also increase the number of channels. This would allow us to grow the network without having to fix its original shape in place.” Efficient Sparse-Winograd Convolutional Neural Networks Xingyu Liu, Song Han, Huizi Mao, William J. Dally 17 Feb 2017 (modified: 19 Feb 2017) ICLR 2017 workshop submission https://openreview.net/forum?id=r1rqJyHKg Convolutional Neural Networks (CNNs) are compute intensive which limits their application on mobile devices. Their energy is dominated by the number of multiplies needed to perform the convolutions. Winograd’s minimal filtering algorithm (Lavin and Gray (2015)) and network pruning (Han et al. (2015)) reduce the operation count. Unfortunately, these two methods cannot be combined — because applying the Winograd transform fills in the sparsity in both the weights and the activations.
We propose two modifications to Winograd-based CNNs to enable these methods to exploit sparsity. First, we prune the weights in the “Winograd domain” (after the transform) to exploit static weight sparsity. Second, we move the ReLU operation into the “Winograd domain” to improve the sparsity of the transformed activations. On CIFAR-10, our method reduces the number of multiplications in the VGG-nagadomi model by 10.2x with no loss of accuracy.
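Two toy numpy helpers to illustrate the point: measuring how sparse ReLU outputs actually are, and skipping the zero entries in a matrix-vector product (the simplest analogue of the ISC idea above). Real implementations rely on vectorized kernels, AVX/SSE, or dedicated hardware; this is only a conceptual sketch.

```python
import numpy as np

def relu_sparsity(activations):
    """Fraction of zero-valued outputs after a ReLU; the sparse-convolution
    approaches above only pay off when this is high (e.g. >= 0.9)."""
    return float(np.mean(activations == 0.0))

def sparse_matvec(A_dense, x):
    """Skip zero activations: only the columns of A that multiply non-zero
    entries of x contribute to the result."""
    nz = np.nonzero(x)[0]
    return A_dense[:, nz] @ x[nz]
```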
  • 69. Pipeline • Future paths• Non-euclidean computability & Geometric Operators #3 SPARCNet: A Hardware Accelerator for Efficient Deployment of Sparse Convolutional Networks Adam Page, Ali Jafari, Colin Shea, Tinoosh Mohsenin ACM Journal on Emerging Technologies in Computing Systems (JETC) - Special Issue on Hardware and Algorithms for Learning On-a-chip and Special Issue on Alternative Computing Systems. Volume 13 Issue 3, May 2017 - Article No. 31. https://doi.org/10.1145/3005448 The SPARCNet accelerator with different numbers of processing engines is implemented on a low-power Artix-7 FPGA platform. The FPGA-based accelerator is developed using a combination of pure HDL written in Verilog and IP cores developed using Xilinx’s Vivado HLS. Additionally, the same networks are optimally implemented on a number of embedded commercial-off-the-shelf platforms including NVIDIAs CPU+GPU SoCs TK1 and TX1 and Intel Edison. Compared to NVIDIAs TK1 and TX1, the FPGA-based accelerator obtains 11.8 × and 7.5 × improvement in energy efficiency In addition to improving efficiency, the accelerator has built-in support for sparsification techniques and ability to perform in-place rectified linear unit (ReLU) activation function, max-pooling, and batch normalization. Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC E Nurvitadhi, D Sheffield, J Sim… 2017 Field-Programmable Technology (FPT), 2016 International Conference on Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? E Nurvitadhi, G Venkatesh, J Sim, D Marr, R Huang FPGA’17 Hardware accelerator for analytics of sparse data E Nurvitadhi, A Mishra, Y Wang, G Venkatesh… - Proceedings of the 2016 Sparse Matrix Multiplication on CAM Based Accelerator L Yavits, R Ginosar - arXiv preprint arXiv:1705.09937, 2017 Cambricon-X: An accelerator for sparse neural networks S Zhang, Z Du, L Zhang, H Lan, S Liu… - … (MICRO), 2016 49th …, 2016 Accelerator for Sparse Machine Learning L Yavits, R Ginosar - IEEE Computer Architecture Letters, 2017 A Scalable FPGA-Based Accelerator for High-Throughput MCMC Algorithms M Hosseini, R Islam, A Kulkarni… - … (FCCM), 2017 IEEE SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks A Parashar, M Rhu, A Mukkara, A Puglielli… - Proceedings of the 44th …, 2017 NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps A Aimar, H Mostafa, E Calabrese… - arXiv preprint arXiv: …, 2017
  • 70. Geometric DNNs • implementation options in practice #1: GVNN ankurhanda/gvnn gvnn: Geometric Vision with Neural Networks gvnn is primarily intended for self-supervised learning using low-level vision. It is inspired by the Spatial Transformer Networks (STN) paper that appeared in NIPS in 2015 and its open source code made available by Maxime Oquab. The code is self-contained, i.e. the original implementation of STN by Maxime is also within the repository. STNs were mainly limited to applying only 2D transformations to the input. We added a new set of transformations often needed for manipulating data in 3D geometric computer vision. These include the 3D counterparts of what were used in the original STN together with a lot more new transformations and different M-estimators. SO3 Layer Rotations are represented as an so(3) 3-vector. This vector is turned into a rotation matrix via the exponential map. For a more detailed view of the so(3) representation and the exponential map, read the Lie-Algebra Tutorial from Ethan Eade. Also, Tom Drummond's notes on Lie algebra are a great source to learn about exponential maps. The reason for choosing the so(3) representation is mainly due to its appealing properties when linearising rotations (via Taylor series expansion) for iterative image alignment via the classic linearise-solve-update rule. The figure below shows how linearisation for SO3 amounts to fitting a local plane on the sphere. Optical Flow • Lens Distortion • Projection Layer
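For reference, the so(3) exponential map used by such an SO3 layer is Rodrigues' formula; a numpy sketch follows (the tolerance and the first-order fallback near zero are implementation choices, not gvnn's actual code).

```python
import numpy as np

def so3_exp(omega):
    """Exponential map from an so(3) 3-vector to a rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(omega)
    if theta < 1e-8:
        return np.eye(3)                 # near zero: identity (first-order approximation)
    wx, wy, wz = omega / theta
    K = np.array([[0.0, -wz,  wy],
                  [ wz, 0.0, -wx],
                  [-wy,  wx, 0.0]])      # skew-symmetric matrix of the unit rotation axis
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```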