MANHATTAN SCENE UNDERSTANDING USING MONOCULAR, STEREO, AND 3D FEATURES
Alex Flint, David Murray, and Ian Reid
University of Oxford
SEMANTICS IN GEOMETRIC MODELS

1. Motivation
2. Prior work
3. The indoor Manhattan representation
4. Probabilistic model and inference
5. Results and conclusion
MOTIVATION

Single View Computer Vision vs. Multiple View Geometry

[Figure: left, a single view parsed into semantic regions (sky, tree, rock, water, human, sand, beach); right, scene categorisation examples with per-category confidence scores (classroom, dining room, locker room, hospital room, store).]
MOTIVATION

The multiple view setting is increasingly relevant:
• Powerful mobile devices with cameras
• Bandwidth no longer constrains video on the internet
• Depth-sensing cameras becoming increasingly prevalent

Structure-from-motion does not immediately solve:
• Scene categorisation
• Object recognition
• Many scene understanding tasks
MOTIVATION

We seek a representation that:
• leads naturally to semantic-level scene understanding tasks;
• integrates both photometric and geometric data;
• is suitable for both monocular and multiple-view scenarios.

The indoor Manhattan representation (Lee et al., 2009):
• Parallel floor and ceiling planes
• Walls terminate at vertical boundaries
• A sub-class of Manhattan scenes

Lee, Kanade, Hebert, "Geometric reasoning for single image structure recovery", CVPR 2009
Where would a person stand?
Where would doors be found?
What is the direction of gravity?
Is this an office or house?
How wide (in absolute units)?
Goal is to ignore clutter
PRIOR WORK

• Kosecka and Zhang, "Video Compass", ECCV 2002
• Furukawa, Curless, Seitz, and Szeliski, "Manhattan World Stereo", CVPR 2009
• Posner, Schroeter, and Newman, "Online generation of scene descriptions in urban environments", RAS 2008
• Vasudevan, Gachter, Nguyen, and Siegwart, "Cognitive maps for mobile robots -- an object-based approach", RAS 2007
• Bao and Savarese, "Semantic Structure From Motion", CVPR 2011

[Figures: target image, depth map, depth normal map, and mesh from Manhattan World Stereo; semantic labels output by the system of Posner et al.; an object-centric map from Vasudevan et al. showing object detections, identified doorways, and an inferred place category.]
PRIOR WORK

• Delage, Lee, and Ng, "A dynamic Bayesian network for autonomous 3d reconstruction from a single indoor image", CVPR 2006
• Hoiem, Efros, and Hebert, "Geometric context from a single image", ICCV 2005
• Saxena, Sun, and Ng, "Make3D: Learning 3D scene structure from a single still image", PAMI 2008
• Lee, Kanade, Hebert, "Geometric reasoning for single image structure recovery", CVPR 2009

[Figures: excerpts from the cited papers, including Make3D's 3d reconstruction of a corridor from a single still image.]
PROBLEM STATEMENT

Given:
• K views of a scene
• Camera poses from structure-from-motion
• A point cloud

Recover an indoor Manhattan model.
Pre-processing

1. Detect vanishing points
2. Estimate the Manhattan homology H
3. Vertically rectify images

Let the floor and ceiling points in column x be p_x = (x, y_x) and q_x = (x, y'_x). Since each p_x lies on the floor plane and each q_x lies on the ceiling plane, we have

    p_x = H q_x ,    (1)

where H is a planar homology. Once H is known, any indoor Manhattan model is fully described by the values {y_x}, leading to the parametrization

    M = {y_x}, x = 1, ..., N_x .    (2)

To check whether a pixel (x_0, y_0) lies on a vertical or horizontal surface we need only check whether y_0 lies between y_{x_0} and y'_{x_0}. Since we know the 3D positions of the floor and ceiling planes, we can recover the depth of every pixel: if it lies on the floor or ceiling we back-project its ray onto the corresponding plane; otherwise we back-project onto the corresponding wall plane.

Structure recovery

Express the posterior on models as a per-column likelihood plus a per-corner prior,

    log P(M | X) = Σ_x π(x, y_x) + Σ_i ψ(M, i) .

In (Flint et al., ECCV 2010) we described an exact dynamic programming solution for problems of this form; a simplified sketch follows below.

[1] Flint, Mei, Murray, and Reid, "A Dynamic Programming Approach to Reconstructing Building Interiors", ECCV 2010
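To make the structure recovery step concrete, below is a minimal dynamic-programming sketch in Python. It is a simplification, not the exact algorithm of [1]: it assumes a precomputed payoff table pi[x, y] and a single hypothetical corner penalty, whereas the full method distinguishes corner types and enforces the Manhattan geometry of walls.

import numpy as np

def reconstruct_columns(pi, corner_penalty):
    # Maximise sum_x pi[x, y_x] plus a penalty for every corner, i.e. every
    # column where the boundary row y_x changes.
    # pi: (n_cols, n_rows) payoff table; corner_penalty: log-prior cost (< 0).
    n_cols, n_rows = pi.shape
    best = pi[0].copy()                     # best score of paths ending at (0, y)
    back = np.zeros((n_cols, n_rows), dtype=int)
    for x in range(1, n_cols):
        jump = best.max() + corner_penalty  # place a corner just before column x
        take_jump = jump > best             # per row: corner vs. same boundary
        back[x] = np.where(take_jump, best.argmax(), np.arange(n_rows))
        best = np.where(take_jump, jump, best) + pi[x]
    y = np.empty(n_cols, dtype=int)         # backtrack the optimal y_x
    y[-1] = int(best.argmax())
    for x in range(n_cols - 1, 0, -1):
        y[x - 1] = back[x, y[x]]
    return y

Each column contributes one lookup and one running maximum, so the sweep costs O(N_x N_y) time.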
Preliminaries

If the camera intrinsics are unknown then we construct the camera matrix from the detected vanishing points, by assuming that the camera centre is the image centre and choosing a focal length and aspect ratio such that the detected vanishing points are mutually orthogonal.

• The mapping H_{c→f} from the ceiling plane to the floor plane is a planar homology.
• Following rectification, H_{c→f} transforms points along image columns.
• Given the label y_x at some column x, the orientation of every pixel in that column can be recovered:
  1. Compute y'_x = H_{c→f} [x, y_x, 1]^T.
  2. Pixels between y_x and y'_x are vertical; all others are horizontal.

An indoor Manhattan scene has exactly one floor and one ceiling plane, both with normal direction v_v. It will be useful in the following sections to have available the mapping H_{c→f} between the image locations of ceiling points and the image locations of the floor points that are vertically below them (see Figure 1b). H_{c→f} is a planar homology with axis h = v_l × v_r and vertex v_v [15], and can be recovered given the image location of any pair of corresponding floor/ceiling points (x_f, x_c) as

    H_{c→f} = I + μ (v_v h^T) / (v_v · h) ,    (1)

where μ = ⟨v_v, x_c ; x_f, (x_c × x_f) × h⟩ is the characteristic cross ratio of H_{c→f}.

Although we do not have a priori any such pair (x_f, x_c), we can recover H_{c→f} using the following RANSAC algorithm. First we sample one point x̂_c from the region above the horizon in the Canny edge map, then we sample a second point x̂_f collinear with the first and v_v from the region below the horizon. We compute the hypothesis map Ĥ_{c→f} as described above, which we then score by the number of edge pixels that Ĥ_{c→f} maps onto other edge pixels (according to the Canny edge map). After repeating this for a fixed number of iterations we return the hypothesis with greatest score.

Many images contain either no view of the floor or no view of the ceiling. In such cases H_{c→f} is unimportant, since there are no corresponding points in the image. If the hypothesis output from the RANSAC process has a score below a threshold k_t then we set μ to a large value that will transfer all pixels outside the image bounds; H_{c→f} will then have no impact on the estimated model.
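Below is a minimal sketch of constructing H_{c→f} from a single corresponding ceiling/floor pair, following (1). The RANSAC sampling and Canny-based scoring around it are omitted, and μ is solved directly from the constraint H x_c ∝ x_f rather than evaluated as a cross ratio.

import numpy as np

def homology_from_pair(v_v, h, x_c, x_f):
    # Planar homology H_{c->f} = I + mu * v_v h^T / (v_v . h) with vertex v_v
    # (vertical vanishing point) and axis h = v_l x v_r.  All inputs are
    # homogeneous image coordinates; x_f lies vertically below x_c, so
    # x_f = s*x_c + t*v_v for some scalars s, t.
    v_v, h, x_c, x_f = (np.asarray(a, float) for a in (v_v, h, x_c, x_f))
    A = np.stack([x_c, v_v], axis=1)
    s, t = np.linalg.lstsq(A, x_f, rcond=None)[0]
    mu = (t / s) * (v_v @ h) / (h @ x_c)    # chosen so that H @ x_c ~ x_f
    return np.eye(3) + mu * np.outer(v_v, h) / (v_v @ h)

The RANSAC loop then amounts to sampling candidate pairs (x̂_c, x̂_f), building Ĥ_{c→f} with this helper, and scoring each hypothesis by edge overlap.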
MODEL

An indoor Manhattan model is a sequence of corners c_i interleaved with wall parameters (r_i, a_i),

    M = {c_1, (r_1, a_1), ..., c_{k-1}, (r_{k-1}, a_{k-1}), c_k} .    (33)

Its score decomposes into per-column payoffs and per-corner penalties,

    C(M) = Σ_x π(x, y_x) + Σ_{i=0}^{k} ψ(M, i) ,    (34)

    M̂ = argmax_M Σ_x π(x, y_x) + Σ_i ψ(M, i) .    (35)

With monocular, stereo, and 3D features assumed conditionally independent given M,

    M̂ = argmax_M P(M) P(X_mono | M) P(X_stereo | M) P(X_3D | M) ,    (36)

    P(M | X) ∝ P(X_mono | M) P(X_stereo | M) P(X_3D | M) P(M) ,    (37)

    log P(M | X) = log P(X_mono | M) + log P(X_stereo | M) + log P(X_3D | M) + log P(M) .    (38)
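As a sanity check on (34), a candidate model can also be scored directly. The sketch below assumes the model is already reduced to its per-column boundary rows y and uses a single illustrative corner penalty; the prior slide refines this into separate concave, convex, and occluding penalties.

def score_model(y, pi, corner_penalty):
    # C(M) = sum_x pi[x, y_x] + sum_i psi(M, i)   (eq. 34), with every
    # change of boundary row counted as one corner.
    data = sum(pi[x, y[x]] for x in range(len(y)))
    corners = sum(1 for x in range(1, len(y)) if y[x] != y[x - 1])
    return data + corner_penalty * corners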
Prior

Corners are penalised according to their type,

    ψ(M, i) = log λ_1, if c_i is a concave corner;
              log λ_2, if c_i is a convex corner;
              log λ_3, if c_i is an occluding corner.    (36)

The posterior then separates into a likelihood term and a prior term over concave, convex, and occluding corners,

    log P(M | X) = Σ_x π(x, y_x) + Σ_i ψ(M, i) .    (39)

Equivalently,

    P(M) = (1/Z) λ_1^{n_1} λ_2^{n_2} λ_3^{n_3} ,    (40)

where n_1, n_2, n_3 count the corners of each type and Z is a normalizing constant.
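The corner prior is a simple lookup; a minimal sketch with illustrative λ values (not the trained ones):

import math

LAMBDAS = {"concave": 0.4, "convex": 0.4, "occluding": 0.2}  # illustrative only

def psi(corner_type):
    # psi(M, i) = log lambda_1 / lambda_2 / lambda_3 by corner type (eq. 36)
    return math.log(LAMBDAS[corner_type])

def log_prior(corner_types):
    # log P(M) = n_1 log l_1 + n_2 log l_2 + n_3 log l_3 - log Z   (eq. 40);
    # Z is the same for all models, so it can be dropped during optimization.
    return sum(psi(t) for t in corner_types)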
Likelihood For Photometric Features

We associate an orientation a ∈ {1, 2, 3} with each pixel, with values corresponding to the three Manhattan orientations (shown as red, green, and blue regions in figure 1). As described in section 3, a is deterministic given the model M. We assume a linear likelihood on pixel features Φ,

    P(Φ | a) = w_a^T Φ / Σ_j w_j^T Φ .    (5)

We now derive MAP inference. The posterior on M is

    P(M | Φ) = (1/Z) P(M) Π_i P(Φ_i | a*_i) ,    (6)

where a*_i is the orientation deterministically predicted by model M at pixel p_i and Z is a normalizing constant. We omitted P(a_i | M) since it equals 1 for a_i = a*_i and 0 otherwise. Taking logarithms,

    log P(M | Φ) = n_1 λ'_1 + n_2 λ'_2 + n_3 λ'_3 + Σ_i log P(Φ_i | a_i) + k ,    (7)

where λ'_3 = log λ_3 and similarly for the other penalties, and k corresponds to the normalizing denominators in (6), which we henceforth drop since it makes no difference to the optimization to come. We can now put (7) into payoff form (3) by writing

    π_mono(x, y_x) = Σ_y log P(Φ_{(x,y)} | a*_{(x,y)}) .    (42)
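In payoff form, the monocular term is a sum of per-pixel log-likelihoods down each column. A sketch, assuming non-negative feature responses so that (5) defines a valid distribution:

import numpy as np

def pi_mono(features, weights, a_star):
    # features: (H, D) feature vectors Phi for the pixels of one column
    # weights:  (3, D) one weight vector w_a per Manhattan orientation
    # a_star:   (H,) orientation in {0, 1, 2} predicted by the model M
    scores = features @ weights.T                        # w_a^T Phi, shape (H, 3)
    probs = scores / scores.sum(axis=1, keepdims=True)   # eq. (5)
    return float(np.log(probs[np.arange(len(a_star)), a_star]).sum())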
Likelihood For Photoconsistency Features

[Figure: a pixel in Frame 0 reprojected through the model into Frame i.]

We write π_stereo for the case where auxiliary frames I_1, ..., I_M are available, with camera poses as output by the structure-from-motion system. The likelihood over multiple views is computed by reprojecting each pixel of the base frame through the model,

    log P(I^{1:K} | M) = Σ_{p ∈ I_0} Σ_{k=1}^{K} PC(p, reproj_k(p, M)) ,    (47)

where PC is the photo-consistency measure and reproj_k(p, M) is the reprojection of p into frame k. Writing reproj_k(p; y_x), this takes payoff form

    π_stereo(x, y_x) = Σ_{y=1}^{N_y} Σ_{k=1}^{M} PC(p, reproj_k(p, y_x)) ,    (10)

where p = (x, y). To see this, substitute (10) into (3) and observe that the result is precisely (9).

This is equivalent to the canonical stereo formulation, subject to the indoor Manhattan assumption.
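A sketch of the photoconsistency payoff (10). PC is taken here to be negative squared intensity difference, one simple photo-consistency measure; reproj is a hypothetical callback that maps a pixel through the hypothesised model surface into auxiliary frame k, returning None when the reprojection leaves the frame.

def pi_stereo(column_pixels, I0, aux_frames, reproj):
    # pi_stereo(x, y_x) = sum_y sum_k PC(p, reproj_k(p, y_x))   (eq. 10)
    total = 0.0
    for p in column_pixels:                  # pixels p = (x, y) of one column
        for k, Ik in enumerate(aux_frames):
            q = reproj(p, k)                 # reproj_k(p, y_x)
            if q is None:
                continue                     # out-of-frame: contributes nothing
            diff = float(I0[p[1], p[0]]) - float(Ik[q[1], q[0]])
            total -= diff * diff             # PC as negative squared difference
    return total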
Likelihood For Point Cloud Features

[Figure 6: the graphical model relating indoor Manhattan models to 3D points; a hidden variable t indicates whether each point is inside, outside, or coincident with the model.]

[Figure 7: depth measurements d_i might be generated by a surface in our model (t_i = ON) or by an object inside or outside the environment (t_i = IN, OUT respectively), e.g. seen through a window.]

The likelihoods we use are

    P(d | p, M, IN)  = α if 0 < d < r(p; M), 0 otherwise ;    (11)
    P(d | p, M, OUT) = β if r(p; M) < d < N_d, 0 otherwise ;    (12)
    P(d | p, M, ON)  = N(d ; r(p; M), σ) ,    (13)

where α and β are determined by the requirement that the probabilities sum to 1, and r(p; M) denotes the depth predicted by M at p.
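A direct transcription of (11)-(13): α and β follow from requiring each branch to integrate to one over its interval, and σ is the noise scale of the ON component.

import math

def depth_likelihood(d, r, t, sigma, N_d):
    # r = r(p; M): depth predicted by the model at pixel p
    if t == "IN":                                        # object inside the model
        return 1.0 / r if 0 < d < r else 0.0             # alpha = 1/r
    if t == "OUT":                                       # e.g. seen through a window
        return 1.0 / (N_d - r) if r < d < N_d else 0.0   # beta = 1/(N_d - r)
    z = (d - r) / sigma                                  # t == "ON", eq. (13)
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))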
Likelihood For Point Cloud Features

The depth likelihood at a pixel in column x depends on M only through the column label, so we write

    P(d | p, y_x) = P(d | p, M) .    (15)

Let D denote all depth measurements, P denote all pixels, and D_x contain indices for all depth measurements in column x. Then

    P(M | D, P) ∝ P(M) Π_x Π_{i∈D_x} P(d_i | p_i, y_x) ,    (16)

    log P(M | D, P) = log P(M) + Σ_x Σ_{i∈D_x} log P(d_i | p_i, y_x) ,    (17)

which we write in payoff form as

    π_3D(x, y_x) = Σ_{i∈D_x} log P(d_i | p_i, y_x) ,    (18)

and the penalty function ψ remains as in (8).
Alex Flint, David Murray, Ian Reid             “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
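A sketch of assembling the 3D payoff matrix of eq. (18). The helper `predicted_depth`, standing in for $r(p; M)$ as a function of the candidate wall position $y_x$, is hypothetical, and the uniform prior over the hidden label $t$ is our assumption, since the slides leave $P(t)$ implicit.

```python
import numpy as np

def payoff_3d(depths, rows, cols, predicted_depth, N_d, sigma, Nx, Ny):
    """Build Pi_3D[x, y_x] of eq. (18) from a sparse point cloud.

    depths, rows, cols : arrays giving d_i and the pixel p_i = (rows[i], cols[i])
    predicted_depth    : hypothetical helper (x, row, y_x) -> r(p_i; M)
    """
    pi_3d = np.zeros((Nx, Ny))
    for x in range(Nx):
        idx = np.flatnonzero(cols == x)        # D_x: measurements in column x
        for y_x in range(Ny):
            total = 0.0
            for i in idx:
                d = depths[i]
                r = predicted_depth(x, rows[i], y_x)
                p_in = (1.0 / r) if 0.0 < d < r else 0.0
                p_out = (1.0 / (N_d - r)) if r < d < N_d else 0.0
                p_on = np.exp(-0.5 * ((d - r) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
                # Marginalise over t with an assumed uniform prior P(t) = 1/3.
                total += np.log((p_in + p_out + p_on) / 3.0 + 1e-300)
            pi_3d[x, y_x] = total
    return pi_3d
```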
Combining Features

We combine the three feature types into a single model by assuming conditional independence given $M$,

$$P(M \mid X_{mono}, X_{stereo}, X_{3D}) \propto P(M)\, P(X_{mono} \mid M)\, P(X_{stereo} \mid M)\, P(X_{3D} \mid M) \qquad (19)$$

Taking logarithms leads to summation over payoffs,

$$\Pi_{joint}(x) = \Pi_{mono}(x) + \Pi_{stereo}(x) + \Pi_{3D}(x). \qquad (20)$$

No approximations other than conditional independence and occlusions:

$$\log P(M \mid X) = \underbrace{\sum_x \pi(x, y_x)}_{\text{likelihood}} + \underbrace{\sum_i \psi(M, i)}_{\text{prior}}$$

Resolving the floor and ceiling planes

We resolve the equations of the floor and ceiling planes as follows. If $C$ is the camera matrix for any frame and $v_v$ is the vertical vanishing point in that frame, then $n = C^{-1} v_v$ is normal to the floor and ceiling planes. We sweep a plane with this orientation through the scene, recording at each step the number of points within a distance $\tau$ of the plane ($\tau = 0.1\%$ of the diameter of the point cloud in our experiments). We take as the floor and ceiling planes the minimum and maximum locations such that the plane contains at least 5 points. We found that this simple heuristic worked without failure on our training set.

Let the two non-vertical vanishing points be $v_l$ and $v_r$, and let $h = v_l \times v_r$. Selecting any two corresponding points $x_f$ and $x_c$ on the floor and ceiling then determines the planar homology (with axis $h$ and vertex $v_v$) that maps between the two planes.

Alex Flint, David Murray, Ian Reid         “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
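The plane-sweep heuristic above translates directly into a few lines of numpy. The sketch below follows the slide's $\tau$ (0.1% of the cloud diameter) and the 5-point support threshold; the discretisation of the sweep into steps of size $\tau$ is our choice.

```python
import numpy as np

def resolve_floor_ceiling(points, n, min_support=5):
    """points: (N, 3) point cloud; n: plane normal C^{-1} v_v.
    Returns (floor, ceiling) as signed offsets along n."""
    n = n / np.linalg.norm(n)
    offsets = points @ n                          # signed position of each point along n
    diameter = np.linalg.norm(points.max(axis=0) - points.min(axis=0))
    tau = 1e-3 * diameter                         # 0.1% of the point-cloud diameter
    # Sweep a plane through the scene, counting points within tau at each step.
    steps = np.arange(offsets.min(), offsets.max() + tau, tau)
    support = np.array([np.count_nonzero(np.abs(offsets - s) < tau) for s in steps])
    valid = steps[support >= min_support]         # locations supported by >= 5 points
    return valid.min(), valid.max()               # minimum and maximum such locations
```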
INFERENCE

Following rectification, vertical lines in the world appear vertical in the image, and the seams at which adjacent walls meet project to vertical lines, so each image column intersects exactly one wall segment. Let the image dimensions be $N_x \times N_y$, and let the ends of the wall segment in column $x$ be $p_x = (x, y_x)$ on the floor plane and $q_x = (x, y'_x)$ on the ceiling plane, so that $p_x = H q_x$ for the planar homology $H$ above. Once $H$ is known, any indoor Manhattan model is fully described by the values $\{y_x\}$, leading to the parametrization

$$M = \{y_x\}_{x=1}^{N_x}. \qquad (2)$$

MAP inference:

$$\hat{M} = \operatorname*{argmax}_M P(M \mid X) \qquad (36)$$

Reduced to optimisation over the payoff matrix:

$$C(M) = \sum_x \pi(x, y_x) + \sum_{i=0}^{k} \psi(M, i) \qquad (34)$$

$$\hat{M} = \operatorname*{argmax}_M \sum_x \pi(x, y_x) + \sum_{i=0}^{k} \psi(M, i) \qquad (35)$$

where, writing a model as $M = \{c_1, (r_1, a_1), \ldots, c_{k-1}, (r_{k-1}, a_{k-1}), c_k\}$ (33) with corners $c_i$,

$$\psi(M, i) = \begin{cases} \log \lambda_1, & \text{if } c_i \text{ is a concave corner} \\ \log \lambda_2, & \text{if } c_i \text{ is a convex corner} \\ \log \lambda_3, & \text{if } c_i \text{ is an occluding corner.} \end{cases}$$

Alex Flint, David Murray, Ian Reid         “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
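Given a precomputed payoff matrix, scoring one candidate model $M = \{y_x\}$ is a direct sum, as in (34)-(35). A hypothetical sketch, assuming the corner classification of $M$ is given:

```python
def model_score(pi, y, corners, log_lambda):
    """C(M) = sum_x pi[x, y_x] + sum_i psi(M, i)  (eq. 34).

    pi         : (Nx, Ny) payoff matrix
    y          : wall position y_x for each column x, i.e. M = {y_x}
    corners    : label in {'concave', 'convex', 'occluding'} for each corner c_i
    log_lambda : dict mapping each corner type to its log penalty
    """
    data_term = sum(pi[x, y[x]] for x in range(len(y)))   # column payoffs pi(x, y_x)
    prior_term = sum(log_lambda[c] for c in corners)      # corner penalties psi(M, i)
    return data_term + prior_term
```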
     3D position of Ian floor                        “Manhattan
t a mul-       error is a per–pixel average of (23).                                                              ⌃fup (x, y 1, a )          (x)
    Figure 9 also shows that joint estimation is superior to
   a mul-       error is a per–pixel average of (23).                                                               ⇧
  pproach,                                                                              fout (x, y, a) = 0max          fdown (x, y + 1, a )      (x)
pproach, one sensor modality alone. Anecdotally we find
  sing any
  have ex-                                                                                                 a ⇥{1,2} ⌃
                                                                                                                    ⌅
 have ex-
         Recursive Sub-problem Formulation
hat using 3D cues future work we intend to use indoor Manhattan mod-
 n we feel           In alone often fails within large textureless
nnefitfeelin which tofuture work we intend to use indoorscene categories.
   we of
 egions a        elsInthe structure–from–motion system failed mod-
                         reason about objects, actions, and Manhattan
                                                                                                                       fin (x, y, a )     (x)
                                                                                                                                                   (25)
                els to reason about objects, actions, and scenefor learning
nefitcues. points, whereas to investigate structuralcues alone             categories.
o trackaany We also intend stereo or monocular SVMs
  3D of                                                                                                                                     ⇥
                We also intend to investigate us to relaxSVMs for learning fup (x, y, a) = max f (·), fup (x, y 1, a) ,
                 parameters, which may allow structural the conditional in-                                                                        (26)
3D cues.
  figure 9.        What is the optimal model up to column x?
  ften performparameters,such regions but us to lack precision
                  better in which may allow can relax the conditional in-
 figure 9. anddependence assumptions between sensor modalities.
                                                                                                                   in
                                                                                                                                              ⇥
 rs. Even
 t corners        boundaries.
 s. Even        dependence assumptions between sensor modalities.                     fdown (x, y, a) = max fin (·), fdown (x, y + 1, a) , (27)
m outper- 11 shows timing results for our system. For each
    Figure
   outper-
 flects the       8. Appendix                                                                                                            ⇥
flects rep- frames, our system requires on average less than
riplet of
attan the       8. Appendix                                                              fin (x, y, a) = max fout (x , y , a) +           ,        (28)
 ttansecond to compute features for all three MAP inference.
  ne rep-            Recurrence relations for frames and less                     Let                      x0 <x
han 100 milliseconds to perform optimization. Ny ,inference. be the
 % of our
                 fout (x, y, a), 1 ⇤relations, 1 ⇤ yMAP a ⇧ {1, 2}
                    Recurrence x ⇤ Nx for ⇤                                       Let                            ⌥x

% of our
                fout (x, y, a), 1 ⇤ for any x , 1 ⇤ y ⇤ Ny , a model 2} be the
                 maximum payoff x ⇤ N indoor Manhattan ⇧ {1, M span-                                          =       ⇥(i, y ) .                   (29)
 show re-
                maximum payoff for any indoor(i) M contains a floor/wall
                 ning columns [1, x], such that Manhattan model M span-
7.
 s. Conclusion
show re-
    Label-
                ning columns [1, x], such that the M contains a floor/wall
                 intersection at (x, y), and (ii) (i) wall that intersects col-
                                                                                                                 i=x0

 . Label-
ular–only
    We have presented a Bayesian and (ii) the out can be intersects col- we have treated fin , fup , and fdown simply as nota-
 lar–only
n of 10%
                intersection orientation a. Then f wall that computed by
                 umn x has at (x, y), framework for scene un-                      Here
                umn x has orientation a. Then fout can be computed by              tional placeholders; for their interpretations in terms of sub–
  erstanding in the context of a of the recurrence relations,ap-
   of 10%
rocedure,
                 recursive evaluation
                                           moving camera. Our
 ocedure,               Recurrence relations
                recursive evaluation of the recurrence relations,
                                                  ⇤
  roach draws on the indoor Manhattan assumption intro-
                                                  ⌃
                                                                                            Boundary Conditions
                                                                                   problems see [7]. Finally, the base cases are
  perior for monocular reasoning and we up (x, y shown) that (x)
  uced   to                                       ⇤f
                                                  ⇧ have         1, a
                                                  ⌃f (x, y 1, a )            (x)
y we find from monocular = a0max ⇧fup (x, y + 1, a )
perior to
echniques
                      fout (x, y, a)
                                     and max ⌃f      down
                                           stereo vision can+ 1, inte- (x)
                                           ⇥{1,2} ⌅                be a )
                                                                               (x)                    fout (0, y, a) = 0           ⌃y, a           (30)
   we find
 xtureless           fout (x, y, a) = 0              down (x, y
  rated with 3D data in a coherent Bayesian framework.(x)
                                        a ⇥{1,2} ⌃fin (x, y, a )
                                                  ⌅                                                    fup (x, 0, a) = ⌅            ⌃x, a          (31)
xtureless
em failed                                           fin (x, y, a )     (x)       (25)
 uesfailed excludes cases for which [14] was unable to find overlapping
 m 1 This row
      alone                                                              ⇥       (25)             fdown (x, Nx , a) = ⌅             ⌃x, a .        (32)
 precision initialization. (x, y, a) = max fin (·), fup (x, y 1, a) ⇥ ,
                       fup                                                       (26)
 nesalone
  es during
precision             fup (x, y, a) = max fin (·), fup (x, y 1, a) ⇥ (26)  ,
                    fdown (x, y, a) = max fin (·), fdown (x, y + 1, a) ⇥ , (27)
  For each
                   fdown (x, y, a) = max fin (·), fdown (x, y + 1, a) , (27)
                                                                     ⇥
For each
 less than             fin (x, y, a) = max fout (x , y , a) + ⇥ ,                (28)
sess than
   and less                             x0 <x
                       fin (x, y, a) = max fout (x , y , a) +          ,         (28)
                                                                                                                   O(WH)
                                                x
  and less                              x0 <x ⌥
                                           = ⌥ ⇥(i, y ) .
                                                x                                (29)
                                           = i=x0 ⇥(i, y ) .                     (29)
                                          i=x0
 ceneFlint, Mei, Murray,we have treated fin , Programming Approach to as nota-
      un-        Here and Reid, “A Dynamic fup , and fdown simply Reconstructing       Building Interiors”, In ECCV 2010
cene Alex Flint,Here we haveIan Reid fin , fup , and fdown in termsas nota-
 Our un-
      ap-        tional placeholders; for their interpretations simply of sub–
                                treated
                 David Murray, [7]. Finally, the base cases are “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
                 problems see
                tional placeholders; for their interpretations in terms of sub–
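The recurrences translate almost line for line into a dynamic program over the payoff matrix. The sketch below is a simplification, not the paper's algorithm: it drops the orientation index $a$ and treats walls as horizontal in the rectified image, so $f_{in}$ reduces to a per-row running maximum. It returns only the optimal payoff; recovering the model itself would additionally require back-pointers.

```python
import numpy as np

def solve_dp(pi, psi):
    """Simplified DP over recurrences (25)-(32).

    pi  : (Nx, Ny) payoff matrix pi(x, y)
    psi : per-column corner penalty psi(x), assumed precomputed
    """
    Nx, Ny = pi.shape
    NEG = -np.inf
    f_out_prev = np.zeros(Ny)           # f_out(0, y) = 0 for all y       (eq. 30)
    f_in_prev = np.full(Ny, NEG)
    for x in range(Nx):
        # Extend a wall into column x: continue the current wall or start a
        # new one at the previous corner, collecting pi(x, y)   (eqs. 28-29).
        f_in = np.maximum(f_out_prev, f_in_prev) + pi[x]
        # Wall boundary sweeping down from above; f_up(x, 0) = -inf (eqs. 26, 31).
        f_up = np.full(Ny, NEG)
        for y in range(1, Ny):
            f_up[y] = max(f_in[y], f_up[y - 1])
        # Wall boundary sweeping up from below; f_down(x, Ny) = -inf (eqs. 27, 32).
        f_down = np.full(Ny, NEG)
        for y in range(Ny - 2, -1, -1):
            f_down[y] = max(f_in[y], f_down[y + 1])
        # Terminate a wall at (x, y), paying the corner penalty psi(x) (eq. 25).
        f_out = np.full(Ny, NEG)
        for y in range(Ny):
            best = f_in[y]
            if y > 0:
                best = max(best, f_up[y - 1])
            if y < Ny - 1:
                best = max(best, f_down[y + 1])
            f_out[y] = best + psi[x]
        f_out_prev, f_in_prev = f_out, f_in
    return f_in_prev.max()              # payoff of the best full-width model
```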
RESULTS
Input
  • 3 frames sampled at 1 second intervals
  • Camera poses from SLAM
  • Point cloud (approx. 100 points)


Dataset
  • 204 triplets from 10 video sequences
  • Image dimensions 640 x 480
  • Manually annotated ground truth




Alex Flint, David Murray, Ian Reid         “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS




Alex Flint, David Murray, Ian Reid    “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
Algorithm               Mean depth error (%)    Labeling error (%)
Our approach (full)             14.5                  24.5
Stereo only                     17.4                  30.5
3D only [1]                     15.2                  28.9
Monocular only                  24.8                  30.8
Brostow et al. [2]               —                    39.4
Lee et al. [3]                  79.8                  54.5


[1] Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, ECCV 2010

[2] Brostow, Shotton, Fauqueur, and Cipolla, “Segmentation and recognition using structure from motion point clouds”,
ECCV 2008

[3] Lee, Hebert, and Kanade, “Geometric reasoning for single image structure recovery”, CVPR 2009


Alex Flint, David Murray, Ian Reid                  “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS

• Monocular features: 160 ms
• Stereo features: 730 ms
• 3D features: 9 ms
• Inference: 102 ms

997 ms mean processing time per instance
Alex Flint, David Murray, Ian Reid              “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
                 Sparse texture                              Non-Manhattan




Alex Flint, David Murray, Ian Reid    “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
                                     Poor Lighting Conditions




Alex Flint, David Murray, Ian Reid           “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS

                                      Clutter




Alex Flint, David Murray, Ian Reid    “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
                                     Failure Cases




Alex Flint, David Murray, Ian Reid     “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
                                     Failure Cases




Alex Flint, David Murray, Ian Reid     “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
SUMMARY

•   We wish to leverage multiple-view geometry for scene understanding.
•   Indoor Manhattan models are a simple and meaningful model family.
•   We have presented a probabilistic model integrating monocular, stereo, and point
    cloud features.
•   A fast and exact inference algorithm exists.
•   Results show state-of-the-art performance.




Alex Flint, David Murray, Ian Reid     “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”

More Related Content

Similar to Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features

ICCV 2011 Presentation
ICCV 2011 PresentationICCV 2011 Presentation
ICCV 2011 PresentationAlex Flint
 
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-DepthSurface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-DepthChinghang chen
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)npinto
 
P.maria sheeba 15 mco010
P.maria sheeba 15 mco010P.maria sheeba 15 mco010
P.maria sheeba 15 mco010W3Edify
 
Resume
ResumeResume
Resumebutest
 
Mit6870 template matching and histograms
Mit6870 template matching and histogramsMit6870 template matching and histograms
Mit6870 template matching and histogramszukun
 
Keynote Virtual Efficiency Congress 2012
Keynote Virtual Efficiency Congress 2012Keynote Virtual Efficiency Congress 2012
Keynote Virtual Efficiency Congress 2012Christian Sandor
 
Land scene classification from remote sensing images using improved artificia...
Land scene classification from remote sensing images using improved artificia...Land scene classification from remote sensing images using improved artificia...
Land scene classification from remote sensing images using improved artificia...IJECEIAES
 
Different Image Fusion Techniques –A Critical Review
Different Image Fusion Techniques –A Critical ReviewDifferent Image Fusion Techniques –A Critical Review
Different Image Fusion Techniques –A Critical ReviewIJMER
 
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATIONYOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATIONIRJET Journal
 
Copy of AggiE_Challenge Poster_latest.ppt
Copy of AggiE_Challenge Poster_latest.pptCopy of AggiE_Challenge Poster_latest.ppt
Copy of AggiE_Challenge Poster_latest.pptAnthony Vazhapilly
 
Fcv learn sudderth
Fcv learn sudderthFcv learn sudderth
Fcv learn sudderthzukun
 
Review by g siminon latest 2011
Review by g siminon latest 2011Review by g siminon latest 2011
Review by g siminon latest 2011ujjwal9191
 
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...Virtual Tourism
 

Similar to Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features (20)

ICCV 2011 Presentation
ICCV 2011 PresentationICCV 2011 Presentation
ICCV 2011 Presentation
 
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-DepthSurface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
 
P.maria sheeba 15 mco010
P.maria sheeba 15 mco010P.maria sheeba 15 mco010
P.maria sheeba 15 mco010
 
Resume
ResumeResume
Resume
 
Mit6870 template matching and histograms
Mit6870 template matching and histogramsMit6870 template matching and histograms
Mit6870 template matching and histograms
 
Scientific visualization
Scientific visualizationScientific visualization
Scientific visualization
 
Keynote Virtual Efficiency Congress 2012
Keynote Virtual Efficiency Congress 2012Keynote Virtual Efficiency Congress 2012
Keynote Virtual Efficiency Congress 2012
 
V2 v posenet
V2 v posenetV2 v posenet
V2 v posenet
 
Land scene classification from remote sensing images using improved artificia...
Land scene classification from remote sensing images using improved artificia...Land scene classification from remote sensing images using improved artificia...
Land scene classification from remote sensing images using improved artificia...
 
Different Image Fusion Techniques –A Critical Review
Different Image Fusion Techniques –A Critical ReviewDifferent Image Fusion Techniques –A Critical Review
Different Image Fusion Techniques –A Critical Review
 
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATIONYOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
 
AR/SLAM and IoT
AR/SLAM and IoTAR/SLAM and IoT
AR/SLAM and IoT
 
Copy of AggiE_Challenge Poster_latest.ppt
Copy of AggiE_Challenge Poster_latest.pptCopy of AggiE_Challenge Poster_latest.ppt
Copy of AggiE_Challenge Poster_latest.ppt
 
Ku2518881893
Ku2518881893Ku2518881893
Ku2518881893
 
Ku2518881893
Ku2518881893Ku2518881893
Ku2518881893
 
Fcv learn sudderth
Fcv learn sudderthFcv learn sudderth
Fcv learn sudderth
 
Review by g siminon latest 2011
Review by g siminon latest 2011Review by g siminon latest 2011
Review by g siminon latest 2011
 
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
 
ICPRAM 2012
ICPRAM 2012ICPRAM 2012
ICPRAM 2012
 

Recently uploaded

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features

  • 1. MANHATTAN SCENE UNDERSTANDING USING MONOCULAR, STEREO, AND 3D FEATURES Alex Flint, David Murray, and Ian Reid University of Oxford
  • 2. SEMANTICS IN GEOMETRIC MODELS 1. Motivation 2. Prior work 3. The indoor Manhattan representation 4. Probabilistic model and inference 5. Results and conclusion Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 3. MOTIVATION Single View Computer Vision Multiple View Geometry Sky Tree Water Rock classroom (2.09) classroom (1.99) classroom (1.98) fastfood (−0.18) garage (−0.69) bathroom (−0.99) kitchen (−1.27) Human classroom Sand Beach restaurant (1.57) livingroom (1.55) pantry (1.53) fastfood (−0.12) waitingroom (−0.59) restaurant (−0.89) kitchen (−1.16) dining room bathroom (2.45) bathroom (2.14) bedroom (2.01) laundromat (0.36) operating room(−0.23) dental office (−0.65) bookstore (−1.04) locker room hospitalroom locker room (2.52) corridor (2.27) locker room (2.22) office (−0.04) prisoncell (−0.52) kindergarden (−0.86) bathroom (−1.16) mall (1.69) videostore (1.44) videostore (1.39) tv studio (−0.14) bathroom (−0.51) concert hall (−0.78) concert hall (−1.01) i tore Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 4. MOTIVATION The multiple view setting is increasingly relevant • Powerful mobile devices with cameras • Bandwidth no longer constrains video on the internet • Depth sensing cameras becoming increasingly prevalent Structure-from-motion does not immediately solve: • Scene categorisation • Object recognition • Many scene understanding tasks Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 5. MOTIVATION We seek a representation that: • leads naturally to semantic-level scene understanding tasks; • integrates both photometric and geometric data; • is suitable for both monocular and multiple-view scenarios. The indoor Manhattan representation (Lee et al, 2009) • Parallel floor and ceiling planes • Walls terminate at vertical boundaries • A sub-class of Manhattan scenes Lee, Kanade, Hebert, “Geometric reasoning for single image structure recovery”, CVPR 2009 Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 6. Where would a person stand? Where would doors be found? What is the direction of gravity? Is this an office or house? How wide (in absolute units)? Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 7. Goal is to ignore clutter
  • 8. PRIOR WORK Target image Depth map Depth normal map Mesh • Kosecka and Zhang, “Video Compass”, ECCV 2002 • Furukawa, Curless, Seitz, and Szeliski, “Manhattan World Stereo”, CVPR 2009 2.2 Context in Robotics 11 • Posner, Schroeter, and Newman, “Online generation of scene descriptions in urban environments”, RAS 2008 2.3 Context in Computer Vision 13 • Vasudevan, Gachter, Nguyen, Siegwart, “Cognitive maps for mobile Figure 2.3: Semantic labels output by the system of Posner et al [PSN08]. 2.2.2 Map–centric approaches robots -- an object-based approach”, RAS 2007 An alternative approach to deriving context in robotics applications is to integrate new mea- surements into a map, and then reason about semantics within the map representation. In general this approach enables stronger integration of measurements taken over several time steps, at the cost of relying on the ability to correctly build a map. Buschka and Saffiotti [BS02] have taken a map–centric approach to the problem of identi- • Bao and Savarese, “Semantic Structure From Motion”, CVPR 2011 fying room boundaries within indoor environments and recognising the resultant rooms. A series of laser range scans are fused into a 2D occupancy grid representing the probability that each cell is occupied by some object or boundary. Rooms boundaries are identified by applying dilation and erosion to the occupancy map, which are standard morphological fil- ters from visual segmentation [FP02]. The authors demonstrate that this can be performed with fixed computational cost by discarding old parts of the environment as the robot moves Figurethrough the environment. 2.4: Example of an object–centric map of [VGNS07]. The blue triangles show object detections, the red and green stars show doorways the system has identified, and the red The result of their algorithm is a series of “nodes” with topological connections between dot shows the robot’s inferred place category for the outlined room, which in this case is an them, which correspond to the various rooms and corridors within the robot’s environment office. and while the doorways that connect them. The authors goal it is still instructive to reviewthe and this is not aligned exactly with our own proceed to characterise each node by these Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features” contributions because the ideas they propose for inferring context are often separable from
  • 9. geometry of a scene. For example, Kosaka and Kak [11] rithm using the imag presented a navigation algorithm that allows a monocular fine the “floor-wall” g robot to track its position in a building by associating vi- recovering 3d inform sual cues, such as lines and corners, with the configura- its training process, an tion of hallways on a plan. However, this approach would likely floor-wall bound fail in a new environment where the plan of the room is we present a quantitat not available beforehand. To succeed more generally, one struction on test imag PRIOR WORK needs to rely on a more flexible geometric model. With a the algorithm by appl Manhattan world assumption on a given scene (i.e. one that ages. contains many orthogonal shapes, like in many urban en- vironments), Coughlan and Yuille [3], and Schindler and 2. Background M Dellaert [16] have developed efficient techniques to recover autonomously both extrinsic and intrinsic camera param- eters from a single image. Another successful attempt in In this paper, we fo the field of monocular 3d reconstruction was developed by scenes of the sort that Han and Zhu [7, 8], which used models both of man-made bile robot. We make “block-shaped objects” and of some natural objects, such as camera: trees and grass. Unfortunately, this approach has so far been 1. The image is ob applied only to fairly simple images, and seems unlikely ing a calibrated c to scale in its present form to complex, textured images as Thus, as present shown in Figure 1. world is projecte in homogeneous if:3 • Delage, Lee, and Ng, “A dynamic Bayesian network for Make3D: Learning 3D Scene Structure2. fromtoconta Object Detection The image sponding N d a autonomous 3d reconstruction from a single indoor the floor plane. (F Single Still Image which all surface image”, CVPR 2006 2 A calibrated camera m Ashutosh Saxena, Min Sun and Andrew Y. Ng 1 to the optical axis is known 3 Here, K, q and Q are Figure 2. 3d reconstruction of a corridor from   f 0 ∆u single image presented in Figure 1 using our Make3D: Learning 3D Scene Structure from a Abstract— We consider the problem autonomous algorithm. of estimating detailed 3-d structure from a single still image of an unstructured K=  0 f ∆v  , 0 0 1 Thus, Q is projected onto a • Hoiem, Efros, and Ebert, “Geometric context from a singleSingle Still Hoiemthat focuses also generating aesthetically pleasing Image on developed independently an al- gorithm et al. [9] environment. Our goal is to create 3-d models which are both quantitatively accurate as well as visually pleasing. is some constant α so that Q 4 Vanishing points in the are parallel in 3d space mee “pop-up book” versions of outdoor pictures. Although their For each small homogeneous patch in the image, we use a image”, CVPR 2005 Ashutosh Saxena, Min Sun and Andrew Y. in spirit, it is different from ours in de- algorithm is related Ng Markov Random Field (MRF) to infer a set of “plane parame- ters” that capture both the 3-d location and 3-d orientation of the tail. We will describe a comparison of our method with perspective geometry. Beca cial scenes, they form impo that has mainly orthogonal patch. The MRF, trained via supervised learning, models both image depth cues as well as the relationships between different parts of the image. 
Other than assuming that the environment Abstract— We consider the problem of estimating detailed is made up of a number of small planes, our model makes no 3-d structure from a single still image of an unstructured • Saxena, Sun, and Ng, “Make3d: Learning 3D scene structure explicit assumptions about the structure of the scene; this enables environment. Our goal is to create 3-d models captureare both the algorithm to which much more detailed 3-d structure than quantitatively accurate as well does prior art, and also give a much richer experience in the 3-d as visually pleasing. from a single still image, PAMI 2008 For each small homogeneous patch in the image, we use a flythroughs created using image-based rendering, even for scenes Markov Random Field (MRF) to infer a set of “plane parame- with significant non-vertical structure. ters” that capture both the 3-d location and 3-d orientation have created qualitatively correct 3-d Using this approach, we of the patch. The MRF, trained via supervised learning, models both downloaded from the internet. models for 64.9% of 588 images image depth cues as well as the relationships extended different We have also between our model to produce large scale 3d parts of the image. Other than assuming thatfew images.1 models from a the environment is made up of a number of small planes, our model makes no Fig. 1. (a) An original image. (b) Oversegmentation of the image to • Lee, Kanade, Hebert, “Geometric reasoning for single image explicit assumptions about the structure Terms— Machineenables Monocular vision, Learning Index of the scene; this learning, “superpixels”. (c) The 3-d model predicted by the algorithm. (d) A scre the algorithm to capture much depth, detailed and structure than more Vision 3-d Scene Understanding, Scene Analysis: Depth of the textured 3-d model. cues. does prior art, and also give a much richer experience in the 3-d structure recovery”, CVPR 2009 flythroughs created using image-based rendering, even for scenes with significant non-vertical structure. I. I NTRODUCTION Using this approach, we have created qualitatively correct 3-d these methods therefore do not apply to the many scenes th models for 64.9% of 588 images Upon seeing an image such as Fig. 1a, a human has no difficulty downloaded from the internet. not made up only of vertical surfaces standing on a hori We have also extended our model to produce 3-d structure (Fig. 1c,d). However, inferring understanding its large scale 3d floor. Some examples include images of mountains, trees models from a few images.1 such 3-d structure remains extremely 1. (a) An original image. (b) Fig. 15b and 13d), staircases (e.g., Fig. 15a), arches (e.g., Fi Fig. challenging for current Oversegmentation of the image to obtain narrow mathematical sense, and 15k), rooftops (e.g., Fig. 15m), etc. that often have Index Terms— Machine learning, Monocular systems.Learningin a“superpixels”. (c) The 3-d model predicted by the algorithm. (d) A screenshot computer vision vision, Indeed, depth, Vision and Scene Understanding, Sceneto recover 3-d depth from a single model. it is impossible Analysis: Depth of the textured 3-dimage, since richer 3-d structure. cues. we can never know if it is a picture of a painting (in which case In this paper, our goal is to infer 3-d models that are the depth is flat) or if it is a picture of an actual 3-d environment. quantitatively accurate as well as visually pleasing. 
W Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features” Yet in practice people perceive depth remarkably well given do not the insight that most 3-dthat are can be segmented into I. I NTRODUCTION these methods therefore just apply to the many scenes scenes
  • 10. PROBLEM STATEMENT Given: • K views of a scene • Camera poses from structure-from-motion • Point cloud Recover an indoor Manhattan model Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 11. ck k X ) and q X (x, y 0 ) in column x be px = = yx y ) x = (M, i) C(M ) (x, ⇡(x, re- (34) x x y (depicted in figure ??). Since each i=0 lies on the x=0 px ne andPre-processing X each q x lies on the ceiling plane, we have Xk ˆ M = argmax ⇡(x, yx ) (M, i) (35) 1. Detect p = Hq .x vanishing points M i=0 (1) x x 2. Estimate Manhattan homology 8 is a planar homology [?]. We ci is a concave corner >log( 1 ), if show how to recover < 3. Vertically rectify imagesif c is a concex corner tion 3.5. Once H >log( 2 ), any indoor Manhattan (36) (M, i) = is known, i : fully described by log( values c{yxan occluding corner the 3 ), if i is }, leading to the Structure recovery arametrization, ck = W (37) M = {yx }Nx . x=1 (2) ck < W (38) y this parametrization as follows.as check whether Express posterior on models To x0 , y0 ) lies on a vertical zor horizontal surface we likelihood X }| prior { z }| { X eed to check whether y0 is ⇡(x, yx ) yx0 (M, i)yx0 . (39) log P (M |X) = between and 0 ow the 3D position of the xfloor and ceiling planes i can recover the depth of2010) we described an exactIf In (Flint et al, ECCV every pixel as follows. dynamic lies on the floor or ceiling then we simplyform. programming solution for problems of this back– ray onto the corresponding plane. If not, we back– [1] Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, In ECCV 2010 Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 12. he camera intrinsics are unknown then we construct the camera matrix v m the detected vanishing points by assuming that the the image locations of ceiling points an available the mapping Hc!f between camera centre is indoor Manhattan scene has exactlyof thefloor and one ceiling plane, both one floor points that are image centre and the image a focal length and aspect ratio such vertically below them (see Figu choosing locations that the h normal direction vv . It willa be useful in the with axis h = v ⇥v and vertex v [15] and ca 1b).arec!f is planar homology following sections to have ted vanishing points H mutually orthogonal. l r v ailable the mapping Hc!f between the image locations of ceiling points and be recovered given the image location of any pair of corresponding floor/ceilin Preliminaries (xf ,withas h = v ⇥v and vertex v [15] and can image locations of the floor points that are vertically below them (see Figure points ). Hc!f is a planar homology xc ) axis Identifying the floor and ceiling planes.l r v T recovered mapping from ceiling plane toof any pair Hc!f = I + µ vv h , • The given the image location floor plane, of corresponding floor/ceiling ( nts (xf a xc ) as homology. , planar vv · h is door Manhattan scene has exactly one floor Tand one ceiling plane, both normalFollowing rectification, H be v , xc+ µ vxh⇥ following sections to have (1) cross ratio of Hc! • where µ =< useful in the f ⇥ direction vv . It will v = I , xf , v c xalongh > is the characteristic c!f transforms points , Although we do notv locations of ceiling such pair (xf , xc ), we can recov ble the mapping Hc!f between the image · h v have a priori any points and image columns. age locations of the c!f using the following RANSAC algorithm. First, we sample one point x Hfloor points that are vertically below them (see Figure ˆ from ⇥with axisabove lthe r and vertex vv [15] and can map, f ⇥h ere µ =< vv , xc , xf , xc thexregion > is the characteristic the Canny of Hc!f . then we sample c!f• Given the label yx at some column x, the orientation is a planar homology h = v ⇥v horizon in cross ratio edge Although we do second pointpriori any such pair (xf , and vwefrom the region below the horizo not have a x collinear with the first xc ), ˆof any recovered as can recover overedfor every pixel in that column can be pair of corresponding floor/ceiling given the image location f v !f using the following RANSAC algorithm. First,H we sample one point xc We compute the hypothesis map ˆ c!f as described above, which we then sco ˆ (xf , xc ) as m thefollows. above the horizon in the Canny edge ˆ region map, then we (x,yx) asample by the number of edge pixels that Hc!f maps onto other edge pixels (accordin T ond point xf collinear with the first µ vv h v from the region below the horizon. ˆ [x + and v 1. Compute yx’ the c!f = Iedge map). , After repeating this for a (1) to = H Cannyˆyx 1] v · h T fixed number of iteratio compute the hypothesis map Hhypothesis with greatest which we then score c!f as described above, score. v 2. Pixels between yx and theH vertical, others are we return yx’ are that>ˆis contain eitherother edgeratio of Hc!f . view of the ceiling. the numbercof edge pixels ⇥images the characteristic view ofpixels (according µ =< vv , x , xf , xc ⇥ xf h Many c!f maps onto no cross the floor or no horizontalmap). 
After repeating this for a fixed number of iterations the Canny do not such cases H any unimportant since therecan recover hough we edge have a priori is such pair (xf , xc ), we are no corresponding points in t c!f return the hypothesis with the best H image. If greatest score. output from the one point xc using the following RANSAC algorithm. First, we sample RANSAC process has a score below ˆ c!f he region above the horizon no viewwe set µ to amap, view of that ceiling. In Many images contain eitherk in the Canny floor or no then wethe will transfer all pixels outsi threshold t then of the edge large value sample a h cases f collinear with the first and there arethethen have no the horizon. theestimated model. point xHc!f is unimportant since vv c!f will region below impact on the ˆ the image bounds. H from no corresponding points in mpute the best Hc!f map Hc!f asthe RANSAC process has a then score a age. If the hypothesis output from described above, which we score below ˆ eshold kt then we set µ to a large value that will transfer all pixels(x,yx’) ˆ c!f maps onto other edge pixels (according outside number of edge pixels that H image bounds. Hc!f will then have no impact on the estimated model. CannyFlint, David Murray, Ian Reid repeating this for a fixed number of iterations Stereo, and 3D Features” Alex edge map). After “Manhattan Scene Understanding Using Monocular,
  • 13. log P (X | M ) M = {c1 , (r1 , a1 ), . . . , ck 1 , (rk 1 , ak 1 ), ck } (33) log P (X M = {c1 , (r1 , a1 ), . . . , ck 1 , (rk 1 , ak 1 ), ck } (33) C(M ) = X ck ⇡(x, yx ) X ck MODEL X k X (M, i) k (34) log P (I 1:K | M) = XX X x=0 i=0 p2Io k C(M ) = ⇡(x, yx ) (M, i) (34) log P (I 1:K | M ) = X x=0 X ki=0 p ˆ M = argmax ⇡(x, yx ) M (M, i) (35) X X k M x i=0 ˆ M = argmax ⇡(x, yx ) (M, i) (35) M x i=0 ˆ M = argmax P (M )P (XXmono M )P (Xstereo | M )PX3D | M ) mono | (X3D M X stereo ˆ (36) M = argmax P (M )P (Xmono | M )P (Xstereo | M )P (X3D | M ) M (36) P (M | X) = P (Xmono | M )P (Xstereo | M )P (X3D | M )P (M ) (37) P (M | X) = P (Xmono | M )P (Xstereo | M )P (X3D | M )P (M ) (37) log P (M | X) = log P (Xmono | M )+log P (Xstereo | M )+log P (X3D | M )+log P (M ) }| { z }| { (38) z = log P (Xmono | M )+log P (Xstereo | M )+log P (X3D | M )+log P (M ) log P (M | X) X X log P (M |X) = 8 ⇡(x, yx ) (M, i) (38) >log x 1 , if ci is a concave corner < i 8 > 2 , 1 , is i is a concave corner (39) (M, i) = log log if ci if ca concex corner > < : Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 14. x i=0 8 ck < W (38) Prior>log( < if ci is a concave corner 1 ), (M, i) = log( 2 ), if ci is a concex corner (36) > : z log( X }| ci is an z }| corner 3 ), if { X occluding { log P (M |X) = ⇡(x, yx ) concave convex (M, i) occluding (39) M = {c1 , (rck a1 ), . . . , ck x =W (37) (33) 1, i 1 , (rk 1 , ak 1 ), ck } 1 n1 n 2 n 3 P (M ) = ck1 <X W ck 2 3 Xk (38) (40) Z C(M ) = ⇡(x, yx ) (M, i) (34) X x=0 i=0 ⇤ log P (M | ) =z }| P{( z | a}|) +{c X log X p p k (41) X log P (M |X)ˆ= p ⇡(x, yx ) (M,X i) (39) M = argmax ⇡(x, yx ) (M, i) (35) x i X M x i=0 ⇡mono (x, yx ) = log P ( i | a⇤ ) i (42) 8 y0 >log 1 , < if ci is a concave corner (M, i) = 1 log n1 , n2ci is a concex corner if n3 (36) P (M ) = > 1 2 : 2 3 (43) Zlog 3 , if ci is an occluding corner ck = W (37) ck < W (38) Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 15. LIKELIHOOD FOR PHOTOMETRIC FEATURES
• Each pixel takes a label a ∈ {1, 2, 3} corresponding to the three Manhattan orientations (shown as red, green, and blue regions in Figure 1). As described in Section 3, a is deterministic given the model M.
• We assume a linear likelihood over pixel features Φ:

  P(Φ | a) = w_aᵀ Φ    (5)

• We now derive MAP inference. The posterior on M is

  P(M | Φ) ∝ P(M) Π_i P(Φ_i | a_i*)    (6)

  where a_i* is the orientation deterministically predicted by model M at pixel p_i. We have omitted P(a_i | M) since it equals 1 for a_i* and 0 otherwise.
• Taking logarithms,

  log P(M | Φ) = n_1 log λ_1 + n_2 log λ_2 + n_3 log λ_3 + Σ_i log P(Φ_i | a_i) + k    (7)

  where k corresponds to the normalizing denominators in (6), which we henceforth drop since it makes no difference to the optimization. This puts the likelihood into payoff form with the column payoff

  π_mono(x, y_x) = Σ_{i in column x} log P(Φ_i | a_i*)    (42)

Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
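A minimal sketch of the column payoff π_mono under the linear likelihood (5). It assumes features and weights that keep w_aᵀΦ strictly positive (as a valid likelihood requires); the array shapes and function name are ours.

```python
import numpy as np

def mono_payoff(features, labels, W, eps=1e-12):
    """pi_mono for one column: sum over its pixels of log P(Phi_i | a_i*)
    under the linear likelihood (5).
    features: (H, F) per-pixel feature vectors Phi for the column;
    labels:   (H,) int orientations a* in {0, 1, 2} predicted by the model;
    W:        (3, F) weight vectors w_a."""
    lik = np.einsum('hf,hf->h', features, W[labels])   # w_a . Phi per pixel
    return float(np.log(np.maximum(lik, eps)).sum())   # clamp guards w_a.Phi > 0
```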
• 16. LIKELIHOOD FOR PHOTOCONSISTENCY FEATURES
[Figure: a pixel p in frame 0 and its reprojection reproj_k(p; y_x) into frame k]
• When multiple views I_1, ..., I_K are available, with camera poses as output by the SLAM system, we assume a photo-consistency likelihood:

  log P(I^{1:K} | M) = Σ_{p ∈ I_0} Σ_{k=1}^{K} PC(p, reproj_k(p, M))    (47)

• This decomposes column-wise into the payoff

  π_stereo(x, y_x) = Σ_{y=1}^{N_y} Σ_{k=1}^{K} PC(p, reproj_k(p, y_x)),  where p = (x, y)    (10)

  To see this, substitute (10) into (3) and observe that the result is precisely (9).
• Equivalent to the canonical stereo formulation subject to the indoor Manhattan assumption.
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
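A sketch of the payoff in (10). The slide does not fix a particular photo-consistency measure PC, so a plain negative absolute intensity difference stands in for it here; `reproject` is a hypothetical helper mapping a base-frame pixel into frame k via the wall/floor/ceiling planes the hypothesis y_x induces.

```python
def stereo_payoff(x, y_x, base, frames, reproject):
    """pi_stereo(x, y_x) per (10): photo-consistency of column x under the
    wall-boundary hypothesis y_x, summed over the auxiliary frames.
    base:   (H, W) grayscale base frame I_0;
    frames: list of (H, W) auxiliary frames I_1..I_K;
    reproject(p, y_x, k) -> (u, v): assumed model-induced reprojection."""
    total = 0.0
    for y in range(base.shape[0]):
        for k, frame in enumerate(frames):
            u, v = reproject((x, y), y_x, k)
            ui, vi = int(round(u)), int(round(v))
            if 0 <= vi < frame.shape[0] and 0 <= ui < frame.shape[1]:
                # Negative absolute difference as a simple stand-in for PC.
                total -= abs(float(base[y, x]) - float(frame[vi, ui]))
    return total
```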
• 17. LIKELIHOOD FOR POINT CLOUD FEATURES
[Figure 7: Depth measurements d_i might be generated by a surface in our model (t_i = ON) or by an object inside or outside the environment (t_i = IN or OUT respectively), e.g. seen through a window.]
• The likelihoods we use are:

  P(d | p, M, IN) = α, if 0 < d < r(p; M); 0 otherwise    (11)
  P(d | p, M, OUT) = β, if r(p; M) < d < N_d; 0 otherwise    (12)
  P(d | p, M, ON) = N(d; r(p; M), σ)    (13)

  where α and β are determined by the requirement that the probabilities sum to 1, and r(p; M) denotes the depth predicted by M at p.
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 18. LIKELIHOOD FOR POINT CLOUD FEATURES (CONTINUED)
• Marginalising over the hidden variable t gives

  P(d | p, M) = Σ_t P(t) P(d | p, M, t)    (15)

• Let D denote all depth measurements, P denote all pixels, and D_x contain indices for all depth measurements in column x. Then

  P(M | D, P) ∝ P(M) Π_x Π_{i ∈ D_x} P(d_i | p_i, y_x)    (16)

  log P(M | D, P) = log P(M) + Σ_x Σ_{i ∈ D_x} log P(d_i | p_i, y_x)    (17)

  which we write in payoff form as

  π_3D(x, y_x) = Σ_{i ∈ D_x} log P(d_i | p_i, y_x)    (18)

  and the penalty function λ remains as in (8).
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
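A sketch of the marginal depth likelihood (11)-(15) and the resulting column payoff (18). Interpreting "sum to 1" as each conditional normalising over its depth interval gives α = 1/r and β = 1/(N_d − r); that reading, and the uniform prior over t, are assumptions made for illustration.

```python
import numpy as np

def depth_likelihood(d, r, N_d, sigma, priors=(1/3, 1/3, 1/3)):
    """P(d | p, M) marginalised over t in {IN, ON, OUT}, cf. (11)-(15).
    r = r(p; M) is the depth the model predicts at p; N_d the maximum depth."""
    p_in = 1.0 / r if 0.0 < d < r else 0.0             # (11), alpha = 1/r
    p_out = 1.0 / (N_d - r) if r < d < N_d else 0.0    # (12), beta = 1/(N_d - r)
    p_on = np.exp(-0.5 * ((d - r) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    w_in, w_on, w_out = priors                          # assumed uniform P(t)
    return w_in * p_in + w_on * p_on + w_out * p_out    # (15)

def payoff_3d(column_depths, r, N_d, sigma):
    """pi_3D(x, y_x) per (18): log-likelihood of the column's depth samples."""
    return sum(np.log(max(depth_likelihood(d, r, N_d, sigma), 1e-12))
               for d in column_depths)
```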
• 19. COMBINING FEATURES
• We combine features by assuming conditional independence given M:

  P(M | X_mono, X_stereo, X_3D) ∝ P(M) P(X_mono | M) P(X_stereo | M) P(X_3D | M)    (19)

• Taking logarithms leads to summation over payoffs:

  π_joint(x) = π_mono(x) + π_stereo(x) + π_3D(x)    (20)

• No approximations other than conditional independence.
• Resolving the floor and ceiling planes: if C is the camera matrix for any frame and v_v is the vertical vanishing point in that frame, then n = C⁻¹ v_v is normal to the floor and ceiling planes. We sweep a plane with this orientation through the scene, recording at each step the number of points within a distance τ of the plane (τ = 0.1% of the diameter of the point cloud in our experiments). We take as the floor and ceiling planes the minimum and maximum locations such that the plane contains at least 5 points. We found that this simple heuristic worked without failure on our training set. A sketch of the sweep follows below.
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
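A minimal NumPy sketch of the plane sweep just described. Two simplifications are ours: candidate offsets are taken at each point's own projection rather than on a fixed step grid, and the cloud diameter is approximated by the bounding-box diagonal.

```python
import numpy as np

def sweep_floor_ceiling(points, n, tau_frac=0.001, min_pts=5):
    """Plane-sweep heuristic for the floor and ceiling planes (sketch).
    points: (N, 3) point cloud; n: shared plane normal, i.e. C^{-1} v_v.
    Returns the extreme plane offsets along n supported by >= min_pts points,
    with tau = 0.1% of the (approximate) cloud diameter."""
    n = n / np.linalg.norm(n)
    offsets = points @ n                    # signed offset of each point along n
    tau = tau_frac * np.linalg.norm(points.max(axis=0) - points.min(axis=0))
    candidates = np.sort(offsets)           # O(N^2) support count; fine for ~100 pts
    support = np.array([(np.abs(offsets - c) < tau).sum() for c in candidates])
    good = candidates[support >= min_pts]
    return good.min(), good.max()            # floor offset, ceiling offset
```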
• 20. INFERENCE
• It simplifies the remainder if vertical world lines appear vertical in the image; to this end we use a simple rectification procedure.
• Parametrization: let the image dimensions be N_x × N_y. Following rectification, the vertical seams at which adjacent walls meet project to vertical lines, so each image column intersects exactly one wall segment. Let the top and bottom of the wall segment in column x be p_x = (x, y'_x) and q_x = (x, y_x) respectively. Since each p_x lies on the ceiling plane and each q_x lies on the floor plane, we have

  p_x = H q_x    (1)

  where H is a planar homology (Section 3.5). Once H is known, any indoor Manhattan model is fully described by the values {y_x}, leading to the parametrization

  M = {y_x}, x = 1, ..., N_x    (2)

• To check whether a point (x_0, y_0) lies on a vertical or horizontal surface we need only check whether y_0 is between y_{x_0} and y'_{x_0}.
• MAP inference

  M̂ = argmax_M P(M | X)    (36)

  is thereby reduced to optimisation over a payoff matrix:

  M̂ = argmax_M Σ_x π(x, y_x) + Σ_i λ(M, i)    (35)

Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 21. RECURSIVE SUB-PROBLEM FORMULATION
[Figure 9: What is the optimal model up to column x?]
• Let f_out(x, y, a), 1 ≤ x ≤ N_x, 1 ≤ y ≤ N_y, a ∈ {1, 2}, be the maximum payoff for any indoor Manhattan model M spanning columns [1, x], such that (i) M contains a floor/wall intersection at (x, y), and (ii) the wall that intersects column x has orientation a. Then f_out can be computed by recursive evaluation of the recurrence relations

  f_out(x, y, a) = max_{a' ∈ {1,2}} max{ f_up(x, y−1, a') + ψ(x),
                                         f_down(x, y+1, a') + ψ(x),
                                         f_in(x, y, a') + ψ(x) }    (25)

  f_up(x, y, a) = max{ f_in(·), f_up(x, y−1, a) }    (26)

  f_down(x, y, a) = max{ f_in(·), f_down(x, y+1, a) }    (27)

  f_in(x, y, a) = max_{x' < x} f_out(x', y', a) + Σ    (28)

  Σ = Σ_{i=x'}^{x} π(i, y_i)    (29)

• Here we have treated f_in, f_up, and f_down simply as notational placeholders; for their interpretations in terms of sub-problems see [7]. Finally, the base cases are

  f_out(0, y, a) = 0  ∀ y, a    (30)
  f_up(x, 0, a) = −∞  ∀ x, a    (31)
  f_down(x, N_y, a) = −∞  ∀ x, a    (32)

Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, ECCV 2010
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
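The recurrences admit a bottom-up dynamic program once the inner max over x' in (28) is carried as a running maximum over prefix sums of π. The sketch below makes two strong simplifications, flagged in the comments: the boundary row is held constant along each wall (fronto-parallel walls), whereas the real algorithm follows each wall's vanishing line, and a single corner penalty ψ replaces the three corner types; see the ECCV 2010 paper above for the full algorithm.

```python
import numpy as np

def dp_payoff(pi, psi):
    """Bottom-up sketch of the recurrences (25)-(32).
    pi:  (Nx, Ny) payoff matrix, pi[x, y] = pi(x, y);
    psi: scalar corner penalty (the paper distinguishes corner types)."""
    Nx, Ny = pi.shape
    # S[x, y] = sum of pi over columns < x at row y (constant-row walls).
    S = np.vstack([np.zeros((1, Ny)), np.cumsum(pi, axis=0)])
    # Running max over x' of f_out(x', y, a) - S[x', y]; f_out(0,.,.) = 0  (30)
    best = np.zeros((Ny, 2)) - S[0][:, None]
    ans = -np.inf
    for x in range(1, Nx + 1):
        f_in = best + S[x][:, None]                      # (28)-(29)
        f_up = f_in.copy()
        for y in range(1, Ny):                           # (26); top row inline (31)
            f_up[y] = np.maximum(f_in[y], f_up[y - 1])
        f_down = f_in.copy()
        for y in range(Ny - 2, -1, -1):                  # (27); bottom row inline (32)
            f_down[y] = np.maximum(f_in[y], f_down[y + 1])
        f_out = np.full((Ny, 2), -np.inf)
        for y in range(Ny):
            for a in (0, 1):
                a2 = 1 - a                               # a corner switches orientation
                cands = [f_in[y, a2] + psi]              # (25)
                if y > 0:
                    cands.append(f_up[y - 1, a2] + psi)
                if y < Ny - 1:
                    cands.append(f_down[y + 1, a2] + psi)
                f_out[y, a] = max(cands)
        best = np.maximum(best, f_out - S[x][:, None])   # extend running max over x'
        if x == Nx:
            ans = f_in.max()                             # model spans the full width
    return float(ans)

# e.g. dp_payoff(np.random.rand(640, 480), psi=-1.0)
```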
• 22. RESULTS
Input
• 3 frames sampled at 1-second intervals
• Camera poses from SLAM
• Point cloud (approx. 100 points)
Dataset
• 204 triplets from 10 video sequences
• Image dimensions 640 × 480
• Manually annotated ground truth
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 23. RESULTS Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 24. RESULTS

  Algorithm           | Mean depth error (%) | Labeling error (%)
  --------------------|----------------------|-------------------
  Our approach (full) | 14.5                 | 24.5
  Stereo only         | 17.4                 | 30.5
  3D only [1]         | 15.2                 | 28.9
  Monocular only      | 24.8                 | 30.8
  Brostow et al. [2]  | –                    | 39.4
  Lee et al. [3]      | 79.8                 | 54.5

[1] Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, ECCV 2010
[2] Brostow, Shotton, Fauqueur, and Cipolla, “Segmentation and recognition using structure from motion point clouds”, ECCV 2008
[3] Lee, Hebert, and Kanade, “Geometric reasoning for single image structure recovery”, CVPR 2009
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 25. RESULTS
Mean processing time per instance: 997 ms
• Monocular features: 160 ms
• Stereo features: 730 ms
• 3D features: 9 ms
• Inference: 102 ms
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 26. RESULTS Sparse texture Non-Manhattan Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 27. RESULTS Poor Lighting Conditions Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 28. RESULTS Clutter Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 29. RESULTS Failure Cases Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 30. RESULTS Failure Cases Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 31. SUMMARY
• We wish to leverage multiple-view geometry for scene understanding.
• Indoor Manhattan models are a simple and meaningful model family.
• We have presented a probabilistic model for monocular, stereo, and point cloud features.
• A fast and exact inference algorithm exists.
• Results show state-of-the-art performance.
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”