MANHATTAN SCENE UNDERSTANDING USING MONOCULAR, STEREO, AND 3D FEATURES
Alex Flint, David Murray, and Ian Reid
University of Oxford
SEMANTICS IN GEOMETRIC MODELS

1. Motivation
2. Prior work
3. The indoor Manhattan representation
4. Probabilistic model and inference
5. Results and conclusion
MOTIVATION

Single View Computer Vision vs. Multiple View Geometry

[Figure: left, a single view parsed into semantic regions (sky, tree, rock, water, human, sand, beach); right, scene categorisation examples with per-category confidence scores (classroom, dining room, locker room, hospital room, store).]
MOTIVATION

The multiple view setting is increasingly relevant:
• Powerful mobile devices with cameras
• Bandwidth no longer constrains video on the internet
• Depth-sensing cameras becoming increasingly prevalent

Structure-from-motion does not immediately solve:
• Scene categorisation
• Object recognition
• Many scene understanding tasks
MOTIVATION

We seek a representation that:
• leads naturally to semantic-level scene understanding tasks;
• integrates both photometric and geometric data;
• is suitable for both monocular and multiple-view scenarios.

The indoor Manhattan representation (Lee et al., 2009):
• Parallel floor and ceiling planes
• Walls terminate at vertical boundaries
• A sub-class of Manhattan scenes

Lee, Kanade, Hebert, "Geometric reasoning for single image structure recovery", CVPR 2009
Where would a person stand?
Where would doors be found?
What is the direction of gravity?
Is this an office or house?
How wide (in absolute units)?
Goal is to ignore clutter
PRIOR WORK

• Kosecka and Zhang, "Video Compass", ECCV 2002
• Furukawa, Curless, Seitz, and Szeliski, "Manhattan World Stereo", CVPR 2009
• Posner, Schroeter, and Newman, "Online generation of scene descriptions in urban environments", RAS 2008
• Vasudevan, Gachter, Nguyen, and Siegwart, "Cognitive maps for mobile robots -- an object-based approach", RAS 2007
• Bao and Savarese, "Semantic Structure From Motion", CVPR 2011

[Figures: target image, depth map, depth normal map, and mesh from Manhattan World Stereo; semantic labels output by the system of Posner et al.; an object-centric map from Vasudevan et al. showing object detections, identified doorways, and an inferred place category.]
PRIOR WORK

• Delage, Lee, and Ng, "A dynamic Bayesian network for autonomous 3d reconstruction from a single indoor image", CVPR 2006
• Hoiem, Efros, and Hebert, "Geometric context from a single image", ICCV 2005
• Saxena, Sun, and Ng, "Make3D: Learning 3D scene structure from a single still image", PAMI 2008
• Lee, Kanade, Hebert, "Geometric reasoning for single image structure recovery", CVPR 2009

[Figures: excerpts from the cited papers, including Make3D's 3d reconstruction of a corridor from a single still image.]
PROBLEM STATEMENT

Given:
• K views of a scene
• Camera poses from structure-from-motion
• A point cloud

Recover an indoor Manhattan model.
Pre-processing

1. Detect vanishing points
2. Estimate the Manhattan homology H
3. Vertically rectify images

Let the floor and ceiling points in column x be p_x = (x, y_x) and q_x = (x, y'_x). Since each p_x lies on the floor plane and each q_x lies on the ceiling plane, we have

    p_x = H q_x ,    (1)

where H is a planar homology. Once H is known, any indoor Manhattan model is fully described by the values {y_x}, leading to the parametrization

    M = {y_x}, x = 1, ..., N_x .    (2)

To check whether a pixel (x_0, y_0) lies on a vertical or horizontal surface we need only check whether y_0 lies between y_{x_0} and y'_{x_0}. Since we know the 3D positions of the floor and ceiling planes, we can recover the depth of every pixel: if it lies on the floor or ceiling we back-project its ray onto the corresponding plane; otherwise we back-project onto the corresponding wall plane.

Structure recovery

Express the posterior on models as a per-column likelihood plus a per-corner prior,

    log P(M | X) = Σ_x π(x, y_x) + Σ_i ψ(M, i) .

In (Flint et al., ECCV 2010) we described an exact dynamic programming solution for problems of this form; a simplified sketch follows below.

[1] Flint, Mei, Murray, and Reid, "A Dynamic Programming Approach to Reconstructing Building Interiors", ECCV 2010
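To make the structure recovery step concrete, below is a minimal dynamic-programming sketch in Python. It is a simplification, not the exact algorithm of [1]: it assumes a precomputed payoff table pi[x, y] and a single hypothetical corner penalty, whereas the full method distinguishes corner types and enforces the Manhattan geometry of walls.

import numpy as np

def reconstruct_columns(pi, corner_penalty):
    # Maximise sum_x pi[x, y_x] plus a penalty for every corner, i.e. every
    # column where the boundary row y_x changes.
    # pi: (n_cols, n_rows) payoff table; corner_penalty: log-prior cost (< 0).
    n_cols, n_rows = pi.shape
    best = pi[0].copy()                     # best score of paths ending at (0, y)
    back = np.zeros((n_cols, n_rows), dtype=int)
    for x in range(1, n_cols):
        jump = best.max() + corner_penalty  # place a corner just before column x
        take_jump = jump > best             # per row: corner vs. same boundary
        back[x] = np.where(take_jump, best.argmax(), np.arange(n_rows))
        best = np.where(take_jump, jump, best) + pi[x]
    y = np.empty(n_cols, dtype=int)         # backtrack the optimal y_x
    y[-1] = int(best.argmax())
    for x in range(n_cols - 1, 0, -1):
        y[x - 1] = back[x, y[x]]
    return y

Each column contributes one lookup and one running maximum, so the sweep costs O(N_x N_y) time.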
Preliminaries

If the camera intrinsics are unknown then we construct the camera matrix from the detected vanishing points, by assuming that the camera centre is the image centre and choosing a focal length and aspect ratio such that the detected vanishing points are mutually orthogonal.

• The mapping H_{c→f} from the ceiling plane to the floor plane is a planar homology.
• Following rectification, H_{c→f} transforms points along image columns.
• Given the label y_x at some column x, the orientation of every pixel in that column can be recovered:
  1. Compute y'_x = H_{c→f} [x, y_x, 1]^T.
  2. Pixels between y_x and y'_x are vertical; all others are horizontal.

An indoor Manhattan scene has exactly one floor and one ceiling plane, both with normal direction v_v. It will be useful in the following sections to have available the mapping H_{c→f} between the image locations of ceiling points and the image locations of the floor points that are vertically below them (see Figure 1b). H_{c→f} is a planar homology with axis h = v_l × v_r and vertex v_v [15], and can be recovered given the image location of any pair of corresponding floor/ceiling points (x_f, x_c) as

    H_{c→f} = I + μ (v_v h^T) / (v_v · h) ,    (1)

where μ = ⟨v_v, x_c ; x_f, (x_c × x_f) × h⟩ is the characteristic cross ratio of H_{c→f}.

Although we do not have a priori any such pair (x_f, x_c), we can recover H_{c→f} using the following RANSAC algorithm. First we sample one point x̂_c from the region above the horizon in the Canny edge map, then we sample a second point x̂_f collinear with the first and v_v from the region below the horizon. We compute the hypothesis map Ĥ_{c→f} as described above, which we then score by the number of edge pixels that Ĥ_{c→f} maps onto other edge pixels (according to the Canny edge map). After repeating this for a fixed number of iterations we return the hypothesis with greatest score.

Many images contain either no view of the floor or no view of the ceiling. In such cases H_{c→f} is unimportant, since there are no corresponding points in the image. If the hypothesis output from the RANSAC process has a score below a threshold k_t then we set μ to a large value that will transfer all pixels outside the image bounds; H_{c→f} will then have no impact on the estimated model.
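Below is a minimal sketch of constructing H_{c→f} from a single corresponding ceiling/floor pair, following (1). The RANSAC sampling and Canny-based scoring around it are omitted, and μ is solved directly from the constraint H x_c ∝ x_f rather than evaluated as a cross ratio.

import numpy as np

def homology_from_pair(v_v, h, x_c, x_f):
    # Planar homology H_{c->f} = I + mu * v_v h^T / (v_v . h) with vertex v_v
    # (vertical vanishing point) and axis h = v_l x v_r.  All inputs are
    # homogeneous image coordinates; x_f lies vertically below x_c, so
    # x_f = s*x_c + t*v_v for some scalars s, t.
    v_v, h, x_c, x_f = (np.asarray(a, float) for a in (v_v, h, x_c, x_f))
    A = np.stack([x_c, v_v], axis=1)
    s, t = np.linalg.lstsq(A, x_f, rcond=None)[0]
    mu = (t / s) * (v_v @ h) / (h @ x_c)    # chosen so that H @ x_c ~ x_f
    return np.eye(3) + mu * np.outer(v_v, h) / (v_v @ h)

The RANSAC loop then amounts to sampling candidate pairs (x̂_c, x̂_f), building Ĥ_{c→f} with this helper, and scoring each hypothesis by edge overlap.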
MODEL

An indoor Manhattan model is a sequence of corners c_i interleaved with wall parameters (r_i, a_i),

    M = {c_1, (r_1, a_1), ..., c_{k-1}, (r_{k-1}, a_{k-1}), c_k} .    (33)

Its score decomposes into per-column payoffs and per-corner penalties,

    C(M) = Σ_x π(x, y_x) + Σ_{i=0}^{k} ψ(M, i) ,    (34)

    M̂ = argmax_M Σ_x π(x, y_x) + Σ_i ψ(M, i) .    (35)

With monocular, stereo, and 3D features assumed conditionally independent given M,

    M̂ = argmax_M P(M) P(X_mono | M) P(X_stereo | M) P(X_3D | M) ,    (36)

    P(M | X) ∝ P(X_mono | M) P(X_stereo | M) P(X_3D | M) P(M) ,    (37)

    log P(M | X) = log P(X_mono | M) + log P(X_stereo | M) + log P(X_3D | M) + log P(M) .    (38)
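As a sanity check on (34), a candidate model can also be scored directly. The sketch below assumes the model is already reduced to its per-column boundary rows y and uses a single illustrative corner penalty; the prior slide refines this into separate concave, convex, and occluding penalties.

def score_model(y, pi, corner_penalty):
    # C(M) = sum_x pi[x, y_x] + sum_i psi(M, i)   (eq. 34), with every
    # change of boundary row counted as one corner.
    data = sum(pi[x, y[x]] for x in range(len(y)))
    corners = sum(1 for x in range(1, len(y)) if y[x] != y[x - 1])
    return data + corner_penalty * corners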
Prior

Corners are penalised according to their type,

    ψ(M, i) = log λ_1, if c_i is a concave corner;
              log λ_2, if c_i is a convex corner;
              log λ_3, if c_i is an occluding corner.    (36)

The posterior then separates into a likelihood term and a prior term over concave, convex, and occluding corners,

    log P(M | X) = Σ_x π(x, y_x) + Σ_i ψ(M, i) .    (39)

Equivalently,

    P(M) = (1/Z) λ_1^{n_1} λ_2^{n_2} λ_3^{n_3} ,    (40)

where n_1, n_2, n_3 count the corners of each type and Z is a normalizing constant.
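The corner prior is a simple lookup; a minimal sketch with illustrative λ values (not the trained ones):

import math

LAMBDAS = {"concave": 0.4, "convex": 0.4, "occluding": 0.2}  # illustrative only

def psi(corner_type):
    # psi(M, i) = log lambda_1 / lambda_2 / lambda_3 by corner type (eq. 36)
    return math.log(LAMBDAS[corner_type])

def log_prior(corner_types):
    # log P(M) = n_1 log l_1 + n_2 log l_2 + n_3 log l_3 - log Z   (eq. 40);
    # Z is the same for all models, so it can be dropped during optimization.
    return sum(psi(t) for t in corner_types)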
Likelihood For Photometric Features

We associate an orientation a ∈ {1, 2, 3} with each pixel, with values corresponding to the three Manhattan orientations (shown as red, green, and blue regions in figure 1). As described in section 3, a is deterministic given the model M. We assume a linear likelihood on pixel features Φ,

    P(Φ | a) = w_a^T Φ / Σ_j w_j^T Φ .    (5)

We now derive MAP inference. The posterior on M is

    P(M | Φ) = (1/Z) P(M) Π_i P(Φ_i | a*_i) ,    (6)

where a*_i is the orientation deterministically predicted by model M at pixel p_i and Z is a normalizing constant. We omitted P(a_i | M) since it equals 1 for a_i = a*_i and 0 otherwise. Taking logarithms,

    log P(M | Φ) = n_1 λ'_1 + n_2 λ'_2 + n_3 λ'_3 + Σ_i log P(Φ_i | a_i) + k ,    (7)

where λ'_3 = log λ_3 and similarly for the other penalties, and k corresponds to the normalizing denominators in (6), which we henceforth drop since it makes no difference to the optimization to come. We can now put (7) into payoff form (3) by writing

    π_mono(x, y_x) = Σ_y log P(Φ_{(x,y)} | a*_{(x,y)}) .    (42)
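In payoff form, the monocular term is a sum of per-pixel log-likelihoods down each column. A sketch, assuming non-negative feature responses so that (5) defines a valid distribution:

import numpy as np

def pi_mono(features, weights, a_star):
    # features: (H, D) feature vectors Phi for the pixels of one column
    # weights:  (3, D) one weight vector w_a per Manhattan orientation
    # a_star:   (H,) orientation in {0, 1, 2} predicted by the model M
    scores = features @ weights.T                        # w_a^T Phi, shape (H, 3)
    probs = scores / scores.sum(axis=1, keepdims=True)   # eq. (5)
    return float(np.log(probs[np.arange(len(a_star)), a_star]).sum())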
Likelihood For Photoconsistency Features

[Figure: a pixel in Frame 0 reprojected through the model into Frame i.]

We write π_stereo for the case where auxiliary frames I_1, ..., I_M are available, with camera poses as output by the structure-from-motion system. The likelihood over multiple views is computed by reprojecting each pixel of the base frame through the model,

    log P(I^{1:K} | M) = Σ_{p ∈ I_0} Σ_{k=1}^{K} PC(p, reproj_k(p, M)) ,    (47)

where PC is the photo-consistency measure and reproj_k(p, M) is the reprojection of p into frame k. Writing reproj_k(p; y_x), this takes payoff form

    π_stereo(x, y_x) = Σ_{y=1}^{N_y} Σ_{k=1}^{M} PC(p, reproj_k(p, y_x)) ,    (10)

where p = (x, y). To see this, substitute (10) into (3) and observe that the result is precisely (9).

This is equivalent to the canonical stereo formulation, subject to the indoor Manhattan assumption.
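A sketch of the photoconsistency payoff (10). PC is taken here to be negative squared intensity difference, one simple photo-consistency measure; reproj is a hypothetical callback that maps a pixel through the hypothesised model surface into auxiliary frame k, returning None when the reprojection leaves the frame.

def pi_stereo(column_pixels, I0, aux_frames, reproj):
    # pi_stereo(x, y_x) = sum_y sum_k PC(p, reproj_k(p, y_x))   (eq. 10)
    total = 0.0
    for p in column_pixels:                  # pixels p = (x, y) of one column
        for k, Ik in enumerate(aux_frames):
            q = reproj(p, k)                 # reproj_k(p, y_x)
            if q is None:
                continue                     # out-of-frame: contributes nothing
            diff = float(I0[p[1], p[0]]) - float(Ik[q[1], q[0]])
            total -= diff * diff             # PC as negative squared difference
    return total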
Likelihood For Point Cloud Features

[Figure 6: the graphical model relating indoor Manhattan models to 3D points; a hidden variable t indicates whether each point is inside, outside, or coincident with the model.]

[Figure 7: depth measurements d_i might be generated by a surface in our model (t_i = ON) or by an object inside or outside the environment (t_i = IN, OUT respectively), e.g. seen through a window.]

The likelihoods we use are

    P(d | p, M, IN)  = α if 0 < d < r(p; M), 0 otherwise ;    (11)
    P(d | p, M, OUT) = β if r(p; M) < d < N_d, 0 otherwise ;    (12)
    P(d | p, M, ON)  = N(d ; r(p; M), σ) ,    (13)

where α and β are determined by the requirement that the probabilities sum to 1, and r(p; M) denotes the depth predicted by M at p.
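A direct transcription of (11)-(13): α and β follow from requiring each branch to integrate to one over its interval, and σ is the noise scale of the ON component.

import math

def depth_likelihood(d, r, t, sigma, N_d):
    # r = r(p; M): depth predicted by the model at pixel p
    if t == "IN":                                        # object inside the model
        return 1.0 / r if 0 < d < r else 0.0             # alpha = 1/r
    if t == "OUT":                                       # e.g. seen through a window
        return 1.0 / (N_d - r) if r < d < N_d else 0.0   # beta = 1/(N_d - r)
    z = (d - r) / sigma                                  # t == "ON", eq. (13)
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))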
Likelihood For Point Cloud Features

The depth likelihood at a pixel in column x depends on M only through the column label, so we write

    P(d | p, y_x) = P(d | p, M) .    (15)

Let D denote all depth measurements, P denote all pixels, and D_x contain indices for all depth measurements in column x. Then

    P(M | D, P) ∝ P(M) Π_x Π_{i∈D_x} P(d_i | p_i, y_x) ,    (16)

    log P(M | D, P) = log P(M) + Σ_x Σ_{i∈D_x} log P(d_i | p_i, y_x) ,    (17)

which we write in payoff form as

    π_3D(x, y_x) = Σ_{i∈D_x} log P(d_i | p_i, y_x) ,    (18)

and the penalty function ψ remains as in (8).
Alex Flint, David Murray, Ian Reid             “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
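A sketch of assembling the 3D payoff matrix of eq. (18). The helper `predicted_depth`, standing in for $r(p; M)$ as a function of the candidate wall position $y_x$, is hypothetical, and the uniform prior over the hidden label $t$ is our assumption, since the slides leave $P(t)$ implicit.

```python
import numpy as np

def payoff_3d(depths, rows, cols, predicted_depth, N_d, sigma, Nx, Ny):
    """Build Pi_3D[x, y_x] of eq. (18) from a sparse point cloud.

    depths, rows, cols : arrays giving d_i and the pixel p_i = (rows[i], cols[i])
    predicted_depth    : hypothetical helper (x, row, y_x) -> r(p_i; M)
    """
    pi_3d = np.zeros((Nx, Ny))
    for x in range(Nx):
        idx = np.flatnonzero(cols == x)        # D_x: measurements in column x
        for y_x in range(Ny):
            total = 0.0
            for i in idx:
                d = depths[i]
                r = predicted_depth(x, rows[i], y_x)
                p_in = (1.0 / r) if 0.0 < d < r else 0.0
                p_out = (1.0 / (N_d - r)) if r < d < N_d else 0.0
                p_on = np.exp(-0.5 * ((d - r) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
                # Marginalise over t with an assumed uniform prior P(t) = 1/3.
                total += np.log((p_in + p_out + p_on) / 3.0 + 1e-300)
            pi_3d[x, y_x] = total
    return pi_3d
```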
Combining Features

We combine the three feature types into a single model by assuming conditional independence given $M$,

$$P(M \mid X_{mono}, X_{stereo}, X_{3D}) \propto P(M)\, P(X_{mono} \mid M)\, P(X_{stereo} \mid M)\, P(X_{3D} \mid M) \qquad (19)$$

Taking logarithms leads to summation over payoffs,

$$\Pi_{joint}(x) = \Pi_{mono}(x) + \Pi_{stereo}(x) + \Pi_{3D}(x). \qquad (20)$$

No approximations other than conditional independence and occlusions:

$$\log P(M \mid X) = \underbrace{\sum_x \pi(x, y_x)}_{\text{likelihood}} + \underbrace{\sum_i \psi(M, i)}_{\text{prior}}$$

Resolving the floor and ceiling planes

We resolve the equations of the floor and ceiling planes as follows. If $C$ is the camera matrix for any frame and $v_v$ is the vertical vanishing point in that frame, then $n = C^{-1} v_v$ is normal to the floor and ceiling planes. We sweep a plane with this orientation through the scene, recording at each step the number of points within a distance $\tau$ of the plane ($\tau = 0.1\%$ of the diameter of the point cloud in our experiments). We take as the floor and ceiling planes the minimum and maximum locations such that the plane contains at least 5 points. We found that this simple heuristic worked without failure on our training set.

Let the two non-vertical vanishing points be $v_l$ and $v_r$, and let $h = v_l \times v_r$. Selecting any two corresponding points $x_f$ and $x_c$ on the floor and ceiling then determines the planar homology (with axis $h$ and vertex $v_v$) that maps between the two planes.

Alex Flint, David Murray, Ian Reid         “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
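The plane-sweep heuristic above translates directly into a few lines of numpy. The sketch below follows the slide's $\tau$ (0.1% of the cloud diameter) and the 5-point support threshold; the discretisation of the sweep into steps of size $\tau$ is our choice.

```python
import numpy as np

def resolve_floor_ceiling(points, n, min_support=5):
    """points: (N, 3) point cloud; n: plane normal C^{-1} v_v.
    Returns (floor, ceiling) as signed offsets along n."""
    n = n / np.linalg.norm(n)
    offsets = points @ n                          # signed position of each point along n
    diameter = np.linalg.norm(points.max(axis=0) - points.min(axis=0))
    tau = 1e-3 * diameter                         # 0.1% of the point-cloud diameter
    # Sweep a plane through the scene, counting points within tau at each step.
    steps = np.arange(offsets.min(), offsets.max() + tau, tau)
    support = np.array([np.count_nonzero(np.abs(offsets - s) < tau) for s in steps])
    valid = steps[support >= min_support]         # locations supported by >= 5 points
    return valid.min(), valid.max()               # minimum and maximum such locations
```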
INFERENCE

Following rectification, vertical lines in the world appear vertical in the image, and the seams at which adjacent walls meet project to vertical lines, so each image column intersects exactly one wall segment. Let the image dimensions be $N_x \times N_y$, and let the ends of the wall segment in column $x$ be $p_x = (x, y_x)$ on the floor plane and $q_x = (x, y'_x)$ on the ceiling plane, so that $p_x = H q_x$ for the planar homology $H$ above. Once $H$ is known, any indoor Manhattan model is fully described by the values $\{y_x\}$, leading to the parametrization

$$M = \{y_x\}_{x=1}^{N_x}. \qquad (2)$$

MAP inference:

$$\hat{M} = \operatorname*{argmax}_M P(M \mid X) \qquad (36)$$

Reduced to optimisation over the payoff matrix:

$$C(M) = \sum_x \pi(x, y_x) + \sum_{i=0}^{k} \psi(M, i) \qquad (34)$$

$$\hat{M} = \operatorname*{argmax}_M \sum_x \pi(x, y_x) + \sum_{i=0}^{k} \psi(M, i) \qquad (35)$$

where, writing a model as $M = \{c_1, (r_1, a_1), \ldots, c_{k-1}, (r_{k-1}, a_{k-1}), c_k\}$ (33) with corners $c_i$,

$$\psi(M, i) = \begin{cases} \log \lambda_1, & \text{if } c_i \text{ is a concave corner} \\ \log \lambda_2, & \text{if } c_i \text{ is a convex corner} \\ \log \lambda_3, & \text{if } c_i \text{ is an occluding corner.} \end{cases}$$

Alex Flint, David Murray, Ian Reid         “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
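Given a precomputed payoff matrix, scoring one candidate model $M = \{y_x\}$ is a direct sum, as in (34)-(35). A hypothetical sketch, assuming the corner classification of $M$ is given:

```python
def model_score(pi, y, corners, log_lambda):
    """C(M) = sum_x pi[x, y_x] + sum_i psi(M, i)  (eq. 34).

    pi         : (Nx, Ny) payoff matrix
    y          : wall position y_x for each column x, i.e. M = {y_x}
    corners    : label in {'concave', 'convex', 'occluding'} for each corner c_i
    log_lambda : dict mapping each corner type to its log penalty
    """
    data_term = sum(pi[x, y[x]] for x in range(len(y)))   # column payoffs pi(x, y_x)
    prior_term = sum(log_lambda[c] for c in corners)      # corner penalties psi(M, i)
    return data_term + prior_term
```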
     3D position of Ian floor                        “Manhattan
t a mul-       error is a per–pixel average of (23).                                                              ⌃fup (x, y 1, a )          (x)
    Figure 9 also shows that joint estimation is superior to
   a mul-       error is a per–pixel average of (23).                                                               ⇧
  pproach,                                                                              fout (x, y, a) = 0max          fdown (x, y + 1, a )      (x)
pproach, one sensor modality alone. Anecdotally we find
  sing any
  have ex-                                                                                                 a ⇥{1,2} ⌃
                                                                                                                    ⌅
 have ex-
         Recursive Sub-problem Formulation
hat using 3D cues future work we intend to use indoor Manhattan mod-
 n we feel           In alone often fails within large textureless
nnefitfeelin which tofuture work we intend to use indoorscene categories.
   we of
 egions a        elsInthe structure–from–motion system failed mod-
                         reason about objects, actions, and Manhattan
                                                                                                                       fin (x, y, a )     (x)
                                                                                                                                                   (25)
                els to reason about objects, actions, and scenefor learning
nefitcues. points, whereas to investigate structuralcues alone             categories.
o trackaany We also intend stereo or monocular SVMs
  3D of                                                                                                                                     ⇥
                We also intend to investigate us to relaxSVMs for learning fup (x, y, a) = max f (·), fup (x, y 1, a) ,
                 parameters, which may allow structural the conditional in-                                                                        (26)
3D cues.
  figure 9.        What is the optimal model up to column x?
  ften performparameters,such regions but us to lack precision
                  better in which may allow can relax the conditional in-
 figure 9. anddependence assumptions between sensor modalities.
                                                                                                                   in
                                                                                                                                              ⇥
 rs. Even
 t corners        boundaries.
 s. Even        dependence assumptions between sensor modalities.                     fdown (x, y, a) = max fin (·), fdown (x, y + 1, a) , (27)
m outper- 11 shows timing results for our system. For each
    Figure
   outper-
 flects the       8. Appendix                                                                                                            ⇥
flects rep- frames, our system requires on average less than
riplet of
attan the       8. Appendix                                                              fin (x, y, a) = max fout (x , y , a) +           ,        (28)
 ttansecond to compute features for all three MAP inference.
  ne rep-            Recurrence relations for frames and less                     Let                      x0 <x
han 100 milliseconds to perform optimization. Ny ,inference. be the
 % of our
                 fout (x, y, a), 1 ⇤relations, 1 ⇤ yMAP a ⇧ {1, 2}
                    Recurrence x ⇤ Nx for ⇤                                       Let                            ⌥x

% of our
                fout (x, y, a), 1 ⇤ for any x , 1 ⇤ y ⇤ Ny , a model 2} be the
                 maximum payoff x ⇤ N indoor Manhattan ⇧ {1, M span-                                          =       ⇥(i, y ) .                   (29)
 show re-
                maximum payoff for any indoor(i) M contains a floor/wall
                 ning columns [1, x], such that Manhattan model M span-
7.
 s. Conclusion
show re-
    Label-
                ning columns [1, x], such that the M contains a floor/wall
                 intersection at (x, y), and (ii) (i) wall that intersects col-
                                                                                                                 i=x0

 . Label-
ular–only
    We have presented a Bayesian and (ii) the out can be intersects col- we have treated fin , fup , and fdown simply as nota-
 lar–only
n of 10%
                intersection orientation a. Then f wall that computed by
                 umn x has at (x, y), framework for scene un-                      Here
                umn x has orientation a. Then fout can be computed by              tional placeholders; for their interpretations in terms of sub–
  erstanding in the context of a of the recurrence relations,ap-
   of 10%
rocedure,
                 recursive evaluation
                                           moving camera. Our
 ocedure,               Recurrence relations
                recursive evaluation of the recurrence relations,
                                                  ⇤
  roach draws on the indoor Manhattan assumption intro-
                                                  ⌃
                                                                                            Boundary Conditions
                                                                                   problems see [7]. Finally, the base cases are
  perior for monocular reasoning and we up (x, y shown) that (x)
  uced   to                                       ⇤f
                                                  ⇧ have         1, a
                                                  ⌃f (x, y 1, a )            (x)
y we find from monocular = a0max ⇧fup (x, y + 1, a )
perior to
echniques
                      fout (x, y, a)
                                     and max ⌃f      down
                                           stereo vision can+ 1, inte- (x)
                                           ⇥{1,2} ⌅                be a )
                                                                               (x)                    fout (0, y, a) = 0           ⌃y, a           (30)
   we find
 xtureless           fout (x, y, a) = 0              down (x, y
  rated with 3D data in a coherent Bayesian framework.(x)
                                        a ⇥{1,2} ⌃fin (x, y, a )
                                                  ⌅                                                    fup (x, 0, a) = ⌅            ⌃x, a          (31)
xtureless
em failed                                           fin (x, y, a )     (x)       (25)
 uesfailed excludes cases for which [14] was unable to find overlapping
 m 1 This row
      alone                                                              ⇥       (25)             fdown (x, Nx , a) = ⌅             ⌃x, a .        (32)
 precision initialization. (x, y, a) = max fin (·), fup (x, y 1, a) ⇥ ,
                       fup                                                       (26)
 nesalone
  es during
precision             fup (x, y, a) = max fin (·), fup (x, y 1, a) ⇥ (26)  ,
                    fdown (x, y, a) = max fin (·), fdown (x, y + 1, a) ⇥ , (27)
  For each
                   fdown (x, y, a) = max fin (·), fdown (x, y + 1, a) , (27)
                                                                     ⇥
For each
 less than             fin (x, y, a) = max fout (x , y , a) + ⇥ ,                (28)
sess than
   and less                             x0 <x
                       fin (x, y, a) = max fout (x , y , a) +          ,         (28)
                                                                                                                   O(WH)
                                                x
  and less                              x0 <x ⌥
                                           = ⌥ ⇥(i, y ) .
                                                x                                (29)
                                           = i=x0 ⇥(i, y ) .                     (29)
                                          i=x0
 ceneFlint, Mei, Murray,we have treated fin , Programming Approach to as nota-
      un-        Here and Reid, “A Dynamic fup , and fdown simply Reconstructing       Building Interiors”, In ECCV 2010
cene Alex Flint,Here we haveIan Reid fin , fup , and fdown in termsas nota-
 Our un-
      ap-        tional placeholders; for their interpretations simply of sub–
                                treated
                 David Murray, [7]. Finally, the base cases are “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
                 problems see
                tional placeholders; for their interpretations in terms of sub–
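The recurrences translate almost line for line into a dynamic program over the payoff matrix. The sketch below is a simplification, not the paper's algorithm: it drops the orientation index $a$ and treats walls as horizontal in the rectified image, so $f_{in}$ reduces to a per-row running maximum. It returns only the optimal payoff; recovering the model itself would additionally require back-pointers.

```python
import numpy as np

def solve_dp(pi, psi):
    """Simplified DP over recurrences (25)-(32).

    pi  : (Nx, Ny) payoff matrix pi(x, y)
    psi : per-column corner penalty psi(x), assumed precomputed
    """
    Nx, Ny = pi.shape
    NEG = -np.inf
    f_out_prev = np.zeros(Ny)           # f_out(0, y) = 0 for all y       (eq. 30)
    f_in_prev = np.full(Ny, NEG)
    for x in range(Nx):
        # Extend a wall into column x: continue the current wall or start a
        # new one at the previous corner, collecting pi(x, y)   (eqs. 28-29).
        f_in = np.maximum(f_out_prev, f_in_prev) + pi[x]
        # Wall boundary sweeping down from above; f_up(x, 0) = -inf (eqs. 26, 31).
        f_up = np.full(Ny, NEG)
        for y in range(1, Ny):
            f_up[y] = max(f_in[y], f_up[y - 1])
        # Wall boundary sweeping up from below; f_down(x, Ny) = -inf (eqs. 27, 32).
        f_down = np.full(Ny, NEG)
        for y in range(Ny - 2, -1, -1):
            f_down[y] = max(f_in[y], f_down[y + 1])
        # Terminate a wall at (x, y), paying the corner penalty psi(x) (eq. 25).
        f_out = np.full(Ny, NEG)
        for y in range(Ny):
            best = f_in[y]
            if y > 0:
                best = max(best, f_up[y - 1])
            if y < Ny - 1:
                best = max(best, f_down[y + 1])
            f_out[y] = best + psi[x]
        f_out_prev, f_in_prev = f_out, f_in
    return f_in_prev.max()              # payoff of the best full-width model
```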
RESULTS
Input
  • 3 frames sampled at 1 second intervals
  • Camera poses from SLAM
  • Point cloud (approx. 100 points)


Dataset
  • 204 triplets from 10 video sequences
  • Image dimensions 640 x 480
  • Manually annotated ground truth




Alex Flint, David Murray, Ian Reid         “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS




Alex Flint, David Murray, Ian Reid    “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
Algorithm               Mean depth error (%)    Labeling error (%)
Our approach (full)             14.5                  24.5
Stereo only                     17.4                  30.5
3D only [1]                     15.2                  28.9
Monocular only                  24.8                  30.8
Brostow et al. [2]               —                    39.4
Lee et al. [3]                  79.8                  54.5


[1] Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, ECCV 2010

[2] Brostow, Shotton, Fauqueur, and Cipolla, “Segmentation and recognition using structure from motion point clouds”,
ECCV 2008

[3] Lee, Hebert, and Kanade, “Geometric reasoning for single image structure recovery”, CVPR 2009


Alex Flint, David Murray, Ian Reid                  “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS

• Monocular features: 160 ms
• Stereo features: 730 ms
• 3D features: 9 ms
• Inference: 102 ms

997 ms mean processing time per instance
Alex Flint, David Murray, Ian Reid              “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
                 Sparse texture                              Non-Manhattan




Alex Flint, David Murray, Ian Reid    “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
                                     Poor Lighting Conditions




Alex Flint, David Murray, Ian Reid           “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS

                                      Clutter




Alex Flint, David Murray, Ian Reid    “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
                                     Failure Cases




Alex Flint, David Murray, Ian Reid     “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
RESULTS
                                     Failure Cases




Alex Flint, David Murray, Ian Reid     “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
SUMMARY

•   We wish to leverage multiple-view geometry for scene understanding.
•   Indoor Manhattan models are a simple and meaningful model family.
•   We have presented a probabilistic model integrating monocular, stereo, and point
    cloud features.
•   A fast and exact inference algorithm exists.
•   Results show state-of-the-art performance.




Alex Flint, David Murray, Ian Reid     “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”

More Related Content

Similar to Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features

ICCV 2011 Presentation
ICCV 2011 PresentationICCV 2011 Presentation
ICCV 2011 PresentationAlex Flint
 
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-DepthSurface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-DepthChinghang chen
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)npinto
 
P.maria sheeba 15 mco010
P.maria sheeba 15 mco010P.maria sheeba 15 mco010
P.maria sheeba 15 mco010W3Edify
 
Resume
ResumeResume
Resumebutest
 
Mit6870 template matching and histograms
Mit6870 template matching and histogramsMit6870 template matching and histograms
Mit6870 template matching and histogramszukun
 
Keynote Virtual Efficiency Congress 2012
Keynote Virtual Efficiency Congress 2012Keynote Virtual Efficiency Congress 2012
Keynote Virtual Efficiency Congress 2012Christian Sandor
 
Land scene classification from remote sensing images using improved artificia...
Land scene classification from remote sensing images using improved artificia...Land scene classification from remote sensing images using improved artificia...
Land scene classification from remote sensing images using improved artificia...IJECEIAES
 
Different Image Fusion Techniques –A Critical Review
Different Image Fusion Techniques –A Critical ReviewDifferent Image Fusion Techniques –A Critical Review
Different Image Fusion Techniques –A Critical ReviewIJMER
 
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATIONYOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATIONIRJET Journal
 
Copy of AggiE_Challenge Poster_latest.ppt
Copy of AggiE_Challenge Poster_latest.pptCopy of AggiE_Challenge Poster_latest.ppt
Copy of AggiE_Challenge Poster_latest.pptAnthony Vazhapilly
 
Fcv learn sudderth
Fcv learn sudderthFcv learn sudderth
Fcv learn sudderthzukun
 
Review by g siminon latest 2011
Review by g siminon latest 2011Review by g siminon latest 2011
Review by g siminon latest 2011ujjwal9191
 
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...Virtual Tourism
 

Similar to Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features (20)

ICCV 2011 Presentation
ICCV 2011 PresentationICCV 2011 Presentation
ICCV 2011 Presentation
 
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-DepthSurface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
Surface Normal Prediction using Hypercolumn Skip-Net & Normal-Depth
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
 
P.maria sheeba 15 mco010
P.maria sheeba 15 mco010P.maria sheeba 15 mco010
P.maria sheeba 15 mco010
 
Resume
ResumeResume
Resume
 
Mit6870 template matching and histograms
Mit6870 template matching and histogramsMit6870 template matching and histograms
Mit6870 template matching and histograms
 
Scientific visualization
Scientific visualizationScientific visualization
Scientific visualization
 
Keynote Virtual Efficiency Congress 2012
Keynote Virtual Efficiency Congress 2012Keynote Virtual Efficiency Congress 2012
Keynote Virtual Efficiency Congress 2012
 
V2 v posenet
V2 v posenetV2 v posenet
V2 v posenet
 
Land scene classification from remote sensing images using improved artificia...
Land scene classification from remote sensing images using improved artificia...Land scene classification from remote sensing images using improved artificia...
Land scene classification from remote sensing images using improved artificia...
 
Different Image Fusion Techniques –A Critical Review
Different Image Fusion Techniques –A Critical ReviewDifferent Image Fusion Techniques –A Critical Review
Different Image Fusion Techniques –A Critical Review
 
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATIONYOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
YOLO BASED SHIP IMAGE DETECTION AND CLASSIFICATION
 
AR/SLAM and IoT
AR/SLAM and IoTAR/SLAM and IoT
AR/SLAM and IoT
 
Copy of AggiE_Challenge Poster_latest.ppt
Copy of AggiE_Challenge Poster_latest.pptCopy of AggiE_Challenge Poster_latest.ppt
Copy of AggiE_Challenge Poster_latest.ppt
 
Ku2518881893
Ku2518881893Ku2518881893
Ku2518881893
 
Ku2518881893
Ku2518881893Ku2518881893
Ku2518881893
 
Fcv learn sudderth
Fcv learn sudderthFcv learn sudderth
Fcv learn sudderth
 
Review by g siminon latest 2011
Review by g siminon latest 2011Review by g siminon latest 2011
Review by g siminon latest 2011
 
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
Evaluation of the Acceptance of Virtual Worlds in the Tourism Sector: An Ext...
 
ICPRAM 2012
ICPRAM 2012ICPRAM 2012
ICPRAM 2012
 

Recently uploaded

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features

  • 1. MANHATTAN SCENE UNDERSTANDING USING MONOCULAR, STEREO, AND 3D FEATURES Alex Flint, David Murray, and Ian Reid University of Oxford
  • 2. SEMANTICS IN GEOMETRIC MODELS 1. Motivation 2. Prior work 3. The indoor Manhattan representation 4. Probabilistic model and inference 5. Results and conclusion Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 3. MOTIVATION Single View Computer Vision Multiple View Geometry Sky Tree Water Rock classroom (2.09) classroom (1.99) classroom (1.98) fastfood (−0.18) garage (−0.69) bathroom (−0.99) kitchen (−1.27) Human classroom Sand Beach restaurant (1.57) livingroom (1.55) pantry (1.53) fastfood (−0.12) waitingroom (−0.59) restaurant (−0.89) kitchen (−1.16) dining room bathroom (2.45) bathroom (2.14) bedroom (2.01) laundromat (0.36) operating room(−0.23) dental office (−0.65) bookstore (−1.04) locker room hospitalroom locker room (2.52) corridor (2.27) locker room (2.22) office (−0.04) prisoncell (−0.52) kindergarden (−0.86) bathroom (−1.16) mall (1.69) videostore (1.44) videostore (1.39) tv studio (−0.14) bathroom (−0.51) concert hall (−0.78) concert hall (−1.01) i tore Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 4. MOTIVATION The multiple view setting is increasingly relevant • Powerful mobile devices with cameras • Bandwidth no longer constrains video on the internet • Depth sensing cameras becoming increasingly prevalent Structure-from-motion does not immediately solve: • Scene categorisation • Object recognition • Many scene understanding tasks Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 5. MOTIVATION We seek a representation that: • leads naturally to semantic-level scene understanding tasks; • integrates both photometric and geometric data; • is suitable for both monocular and multiple-view scenarios. The indoor Manhattan representation (Lee et al, 2009) • Parallel floor and ceiling planes • Walls terminate at vertical boundaries • A sub-class of Manhattan scenes Lee, Kanade, Hebert, “Geometric reasoning for single image structure recovery”, CVPR 2009 Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 6. Where would a person stand? Where would doors be found? What is the direction of gravity? Is this an office or house? How wide (in absolute units)? Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 7. Goal is to ignore clutter
  • 8. PRIOR WORK Target image Depth map Depth normal map Mesh • Kosecka and Zhang, “Video Compass”, ECCV 2002 • Furukawa, Curless, Seitz, and Szeliski, “Manhattan World Stereo”, CVPR 2009 2.2 Context in Robotics 11 • Posner, Schroeter, and Newman, “Online generation of scene descriptions in urban environments”, RAS 2008 2.3 Context in Computer Vision 13 • Vasudevan, Gachter, Nguyen, Siegwart, “Cognitive maps for mobile Figure 2.3: Semantic labels output by the system of Posner et al [PSN08]. 2.2.2 Map–centric approaches robots -- an object-based approach”, RAS 2007 An alternative approach to deriving context in robotics applications is to integrate new mea- surements into a map, and then reason about semantics within the map representation. In general this approach enables stronger integration of measurements taken over several time steps, at the cost of relying on the ability to correctly build a map. Buschka and Saffiotti [BS02] have taken a map–centric approach to the problem of identi- • Bao and Savarese, “Semantic Structure From Motion”, CVPR 2011 fying room boundaries within indoor environments and recognising the resultant rooms. A series of laser range scans are fused into a 2D occupancy grid representing the probability that each cell is occupied by some object or boundary. Rooms boundaries are identified by applying dilation and erosion to the occupancy map, which are standard morphological fil- ters from visual segmentation [FP02]. The authors demonstrate that this can be performed with fixed computational cost by discarding old parts of the environment as the robot moves Figurethrough the environment. 2.4: Example of an object–centric map of [VGNS07]. The blue triangles show object detections, the red and green stars show doorways the system has identified, and the red The result of their algorithm is a series of “nodes” with topological connections between dot shows the robot’s inferred place category for the outlined room, which in this case is an them, which correspond to the various rooms and corridors within the robot’s environment office. and while the doorways that connect them. The authors goal it is still instructive to reviewthe and this is not aligned exactly with our own proceed to characterise each node by these Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features” contributions because the ideas they propose for inferring context are often separable from
  • 9. geometry of a scene. For example, Kosaka and Kak [11] rithm using the imag presented a navigation algorithm that allows a monocular fine the “floor-wall” g robot to track its position in a building by associating vi- recovering 3d inform sual cues, such as lines and corners, with the configura- its training process, an tion of hallways on a plan. However, this approach would likely floor-wall bound fail in a new environment where the plan of the room is we present a quantitat not available beforehand. To succeed more generally, one struction on test imag PRIOR WORK needs to rely on a more flexible geometric model. With a the algorithm by appl Manhattan world assumption on a given scene (i.e. one that ages. contains many orthogonal shapes, like in many urban en- vironments), Coughlan and Yuille [3], and Schindler and 2. Background M Dellaert [16] have developed efficient techniques to recover autonomously both extrinsic and intrinsic camera param- eters from a single image. Another successful attempt in In this paper, we fo the field of monocular 3d reconstruction was developed by scenes of the sort that Han and Zhu [7, 8], which used models both of man-made bile robot. We make “block-shaped objects” and of some natural objects, such as camera: trees and grass. Unfortunately, this approach has so far been 1. The image is ob applied only to fairly simple images, and seems unlikely ing a calibrated c to scale in its present form to complex, textured images as Thus, as present shown in Figure 1. world is projecte in homogeneous if:3 • Delage, Lee, and Ng, “A dynamic Bayesian network for Make3D: Learning 3D Scene Structure2. fromtoconta Object Detection The image sponding N d a autonomous 3d reconstruction from a single indoor the floor plane. (F Single Still Image which all surface image”, CVPR 2006 2 A calibrated camera m Ashutosh Saxena, Min Sun and Andrew Y. Ng 1 to the optical axis is known 3 Here, K, q and Q are Figure 2. 3d reconstruction of a corridor from   f 0 ∆u single image presented in Figure 1 using our Make3D: Learning 3D Scene Structure from a Abstract— We consider the problem autonomous algorithm. of estimating detailed 3-d structure from a single still image of an unstructured K=  0 f ∆v  , 0 0 1 Thus, Q is projected onto a • Hoiem, Efros, and Ebert, “Geometric context from a singleSingle Still Hoiemthat focuses also generating aesthetically pleasing Image on developed independently an al- gorithm et al. [9] environment. Our goal is to create 3-d models which are both quantitatively accurate as well as visually pleasing. is some constant α so that Q 4 Vanishing points in the are parallel in 3d space mee “pop-up book” versions of outdoor pictures. Although their For each small homogeneous patch in the image, we use a image”, CVPR 2005 Ashutosh Saxena, Min Sun and Andrew Y. in spirit, it is different from ours in de- algorithm is related Ng Markov Random Field (MRF) to infer a set of “plane parame- ters” that capture both the 3-d location and 3-d orientation of the tail. We will describe a comparison of our method with perspective geometry. Beca cial scenes, they form impo that has mainly orthogonal patch. The MRF, trained via supervised learning, models both image depth cues as well as the relationships between different parts of the image. 
Other than assuming that the environment Abstract— We consider the problem of estimating detailed is made up of a number of small planes, our model makes no 3-d structure from a single still image of an unstructured • Saxena, Sun, and Ng, “Make3d: Learning 3D scene structure explicit assumptions about the structure of the scene; this enables environment. Our goal is to create 3-d models captureare both the algorithm to which much more detailed 3-d structure than quantitatively accurate as well does prior art, and also give a much richer experience in the 3-d as visually pleasing. from a single still image, PAMI 2008 For each small homogeneous patch in the image, we use a flythroughs created using image-based rendering, even for scenes Markov Random Field (MRF) to infer a set of “plane parame- with significant non-vertical structure. ters” that capture both the 3-d location and 3-d orientation have created qualitatively correct 3-d Using this approach, we of the patch. The MRF, trained via supervised learning, models both downloaded from the internet. models for 64.9% of 588 images image depth cues as well as the relationships extended different We have also between our model to produce large scale 3d parts of the image. Other than assuming thatfew images.1 models from a the environment is made up of a number of small planes, our model makes no Fig. 1. (a) An original image. (b) Oversegmentation of the image to • Lee, Kanade, Hebert, “Geometric reasoning for single image explicit assumptions about the structure Terms— Machineenables Monocular vision, Learning Index of the scene; this learning, “superpixels”. (c) The 3-d model predicted by the algorithm. (d) A scre the algorithm to capture much depth, detailed and structure than more Vision 3-d Scene Understanding, Scene Analysis: Depth of the textured 3-d model. cues. does prior art, and also give a much richer experience in the 3-d structure recovery”, CVPR 2009 flythroughs created using image-based rendering, even for scenes with significant non-vertical structure. I. I NTRODUCTION Using this approach, we have created qualitatively correct 3-d these methods therefore do not apply to the many scenes th models for 64.9% of 588 images Upon seeing an image such as Fig. 1a, a human has no difficulty downloaded from the internet. not made up only of vertical surfaces standing on a hori We have also extended our model to produce 3-d structure (Fig. 1c,d). However, inferring understanding its large scale 3d floor. Some examples include images of mountains, trees models from a few images.1 such 3-d structure remains extremely 1. (a) An original image. (b) Fig. 15b and 13d), staircases (e.g., Fig. 15a), arches (e.g., Fi Fig. challenging for current Oversegmentation of the image to obtain narrow mathematical sense, and 15k), rooftops (e.g., Fig. 15m), etc. that often have Index Terms— Machine learning, Monocular systems.Learningin a“superpixels”. (c) The 3-d model predicted by the algorithm. (d) A screenshot computer vision vision, Indeed, depth, Vision and Scene Understanding, Sceneto recover 3-d depth from a single model. it is impossible Analysis: Depth of the textured 3-dimage, since richer 3-d structure. cues. we can never know if it is a picture of a painting (in which case In this paper, our goal is to infer 3-d models that are the depth is flat) or if it is a picture of an actual 3-d environment. quantitatively accurate as well as visually pleasing. 
W Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features” Yet in practice people perceive depth remarkably well given do not the insight that most 3-dthat are can be segmented into I. I NTRODUCTION these methods therefore just apply to the many scenes scenes
  • 10. PROBLEM STATEMENT Given: • K views of a scene • Camera poses from structure-from-motion • Point cloud Recover an indoor Manhattan model Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 11. ck k X ) and q X (x, y 0 ) in column x be px = = yx y ) x = (M, i) C(M ) (x, ⇡(x, re- (34) x x y (depicted in figure ??). Since each i=0 lies on the x=0 px ne andPre-processing X each q x lies on the ceiling plane, we have Xk ˆ M = argmax ⇡(x, yx ) (M, i) (35) 1. Detect p = Hq .x vanishing points M i=0 (1) x x 2. Estimate Manhattan homology 8 is a planar homology [?]. We ci is a concave corner >log( 1 ), if show how to recover < 3. Vertically rectify imagesif c is a concex corner tion 3.5. Once H >log( 2 ), any indoor Manhattan (36) (M, i) = is known, i : fully described by log( values c{yxan occluding corner the 3 ), if i is }, leading to the Structure recovery arametrization, ck = W (37) M = {yx }Nx . x=1 (2) ck < W (38) y this parametrization as follows.as check whether Express posterior on models To x0 , y0 ) lies on a vertical zor horizontal surface we likelihood X }| prior { z }| { X eed to check whether y0 is ⇡(x, yx ) yx0 (M, i)yx0 . (39) log P (M |X) = between and 0 ow the 3D position of the xfloor and ceiling planes i can recover the depth of2010) we described an exactIf In (Flint et al, ECCV every pixel as follows. dynamic lies on the floor or ceiling then we simplyform. programming solution for problems of this back– ray onto the corresponding plane. If not, we back– [1] Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, In ECCV 2010 Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 12. he camera intrinsics are unknown then we construct the camera matrix v m the detected vanishing points by assuming that the the image locations of ceiling points an available the mapping Hc!f between camera centre is indoor Manhattan scene has exactlyof thefloor and one ceiling plane, both one floor points that are image centre and the image a focal length and aspect ratio such vertically below them (see Figu choosing locations that the h normal direction vv . It willa be useful in the with axis h = v ⇥v and vertex v [15] and ca 1b).arec!f is planar homology following sections to have ted vanishing points H mutually orthogonal. l r v ailable the mapping Hc!f between the image locations of ceiling points and be recovered given the image location of any pair of corresponding floor/ceilin Preliminaries (xf ,withas h = v ⇥v and vertex v [15] and can image locations of the floor points that are vertically below them (see Figure points ). Hc!f is a planar homology xc ) axis Identifying the floor and ceiling planes.l r v T recovered mapping from ceiling plane toof any pair Hc!f = I + µ vv h , • The given the image location floor plane, of corresponding floor/ceiling ( nts (xf a xc ) as homology. , planar vv · h is door Manhattan scene has exactly one floor Tand one ceiling plane, both normalFollowing rectification, H be v , xc+ µ vxh⇥ following sections to have (1) cross ratio of Hc! • where µ =< useful in the f ⇥ direction vv . It will v = I , xf , v c xalongh > is the characteristic c!f transforms points , Although we do notv locations of ceiling such pair (xf , xc ), we can recov ble the mapping Hc!f between the image · h v have a priori any points and image columns. age locations of the c!f using the following RANSAC algorithm. First, we sample one point x Hfloor points that are vertically below them (see Figure ˆ from ⇥with axisabove lthe r and vertex vv [15] and can map, f ⇥h ere µ =< vv , xc , xf , xc thexregion > is the characteristic the Canny of Hc!f . then we sample c!f• Given the label yx at some column x, the orientation is a planar homology h = v ⇥v horizon in cross ratio edge Although we do second pointpriori any such pair (xf , and vwefrom the region below the horizo not have a x collinear with the first xc ), ˆof any recovered as can recover overedfor every pixel in that column can be pair of corresponding floor/ceiling given the image location f v !f using the following RANSAC algorithm. First,H we sample one point xc We compute the hypothesis map ˆ c!f as described above, which we then sco ˆ (xf , xc ) as m thefollows. above the horizon in the Canny edge ˆ region map, then we (x,yx) asample by the number of edge pixels that Hc!f maps onto other edge pixels (accordin T ond point xf collinear with the first µ vv h v from the region below the horizon. ˆ [x + and v 1. Compute yx’ the c!f = Iedge map). , After repeating this for a (1) to = H Cannyˆyx 1] v · h T fixed number of iteratio compute the hypothesis map Hhypothesis with greatest which we then score c!f as described above, score. v 2. Pixels between yx and theH vertical, others are we return yx’ are that>ˆis contain eitherother edgeratio of Hc!f . view of the ceiling. the numbercof edge pixels ⇥images the characteristic view ofpixels (according µ =< vv , x , xf , xc ⇥ xf h Many c!f maps onto no cross the floor or no horizontalmap). 
After repeating this for a fixed number of iterations the Canny do not such cases H any unimportant since therecan recover hough we edge have a priori is such pair (xf , xc ), we are no corresponding points in t c!f return the hypothesis with the best H image. If greatest score. output from the one point xc using the following RANSAC algorithm. First, we sample RANSAC process has a score below ˆ c!f he region above the horizon no viewwe set µ to amap, view of that ceiling. In Many images contain eitherk in the Canny floor or no then wethe will transfer all pixels outsi threshold t then of the edge large value sample a h cases f collinear with the first and there arethethen have no the horizon. theestimated model. point xHc!f is unimportant since vv c!f will region below impact on the ˆ the image bounds. H from no corresponding points in mpute the best Hc!f map Hc!f asthe RANSAC process has a then score a age. If the hypothesis output from described above, which we score below ˆ eshold kt then we set µ to a large value that will transfer all pixels(x,yx’) ˆ c!f maps onto other edge pixels (according outside number of edge pixels that H image bounds. Hc!f will then have no impact on the estimated model. CannyFlint, David Murray, Ian Reid repeating this for a fixed number of iterations Stereo, and 3D Features” Alex edge map). After “Manhattan Scene Understanding Using Monocular,
  • 13. log P (X | M ) M = {c1 , (r1 , a1 ), . . . , ck 1 , (rk 1 , ak 1 ), ck } (33) log P (X M = {c1 , (r1 , a1 ), . . . , ck 1 , (rk 1 , ak 1 ), ck } (33) C(M ) = X ck ⇡(x, yx ) X ck MODEL X k X (M, i) k (34) log P (I 1:K | M) = XX X x=0 i=0 p2Io k C(M ) = ⇡(x, yx ) (M, i) (34) log P (I 1:K | M ) = X x=0 X ki=0 p ˆ M = argmax ⇡(x, yx ) M (M, i) (35) X X k M x i=0 ˆ M = argmax ⇡(x, yx ) (M, i) (35) M x i=0 ˆ M = argmax P (M )P (XXmono M )P (Xstereo | M )PX3D | M ) mono | (X3D M X stereo ˆ (36) M = argmax P (M )P (Xmono | M )P (Xstereo | M )P (X3D | M ) M (36) P (M | X) = P (Xmono | M )P (Xstereo | M )P (X3D | M )P (M ) (37) P (M | X) = P (Xmono | M )P (Xstereo | M )P (X3D | M )P (M ) (37) log P (M | X) = log P (Xmono | M )+log P (Xstereo | M )+log P (X3D | M )+log P (M ) }| { z }| { (38) z = log P (Xmono | M )+log P (Xstereo | M )+log P (X3D | M )+log P (M ) log P (M | X) X X log P (M |X) = 8 ⇡(x, yx ) (M, i) (38) >log x 1 , if ci is a concave corner < i 8 > 2 , 1 , is i is a concave corner (39) (M, i) = log log if ci if ca concex corner > < : Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 14. x i=0 8 ck < W (38) Prior>log( < if ci is a concave corner 1 ), (M, i) = log( 2 ), if ci is a concex corner (36) > : z log( X }| ci is an z }| corner 3 ), if { X occluding { log P (M |X) = ⇡(x, yx ) concave convex (M, i) occluding (39) M = {c1 , (rck a1 ), . . . , ck x =W (37) (33) 1, i 1 , (rk 1 , ak 1 ), ck } 1 n1 n 2 n 3 P (M ) = ck1 <X W ck 2 3 Xk (38) (40) Z C(M ) = ⇡(x, yx ) (M, i) (34) X x=0 i=0 ⇤ log P (M | ) =z }| P{( z | a}|) +{c X log X p p k (41) X log P (M |X)ˆ= p ⇡(x, yx ) (M,X i) (39) M = argmax ⇡(x, yx ) (M, i) (35) x i X M x i=0 ⇡mono (x, yx ) = log P ( i | a⇤ ) i (42) 8 y0 >log 1 , < if ci is a concave corner (M, i) = 1 log n1 , n2ci is a concex corner if n3 (36) P (M ) = > 1 2 : 2 3 (43) Zlog 3 , if ci is an occluding corner ck = W (37) ck < W (38) Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 15. LIKELIHOOD FOR PHOTOMETRIC FEATURES
• Each pixel takes a label a ∈ {1, 2, 3} corresponding to the three Manhattan orientations (shown as red, green, and blue regions in Figure 1). As described in Section 3, a is deterministic given the model M.
• We assume a linear likelihood over pixel features Φ:

  P(Φ | a) = w_aᵀ Φ    (5)

• We now derive MAP inference. The posterior on M is

  P(M | Φ) ∝ P(M) Π_i P(Φ_i | a_i*)    (6)

  where a_i* is the orientation deterministically predicted by model M at pixel p_i. We have omitted P(a_i | M) since it equals 1 for a_i* and 0 otherwise.
• Taking logarithms,

  log P(M | Φ) = n_1 log λ_1 + n_2 log λ_2 + n_3 log λ_3 + Σ_i log P(Φ_i | a_i) + k    (7)

  where k corresponds to the normalizing denominators in (6), which we henceforth drop since it makes no difference to the optimization. This puts the likelihood into payoff form with the column payoff

  π_mono(x, y_x) = Σ_{i in column x} log P(Φ_i | a_i*)    (42)

Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
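A minimal sketch of the column payoff π_mono under the linear likelihood (5). It assumes features and weights that keep w_aᵀΦ strictly positive (as a valid likelihood requires); the array shapes and function name are ours.

```python
import numpy as np

def mono_payoff(features, labels, W, eps=1e-12):
    """pi_mono for one column: sum over its pixels of log P(Phi_i | a_i*)
    under the linear likelihood (5).
    features: (H, F) per-pixel feature vectors Phi for the column;
    labels:   (H,) int orientations a* in {0, 1, 2} predicted by the model;
    W:        (3, F) weight vectors w_a."""
    lik = np.einsum('hf,hf->h', features, W[labels])   # w_a . Phi per pixel
    return float(np.log(np.maximum(lik, eps)).sum())   # clamp guards w_a.Phi > 0
```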
• 16. LIKELIHOOD FOR PHOTOCONSISTENCY FEATURES
[Figure: a pixel p in frame 0 and its reprojection reproj_k(p; y_x) into frame k]
• When multiple views I_1, ..., I_K are available, with camera poses as output by the SLAM system, we assume a photo-consistency likelihood:

  log P(I^{1:K} | M) = Σ_{p ∈ I_0} Σ_{k=1}^{K} PC(p, reproj_k(p, M))    (47)

• This decomposes column-wise into the payoff

  π_stereo(x, y_x) = Σ_{y=1}^{N_y} Σ_{k=1}^{K} PC(p, reproj_k(p, y_x)),  where p = (x, y)    (10)

  To see this, substitute (10) into (3) and observe that the result is precisely (9).
• Equivalent to the canonical stereo formulation subject to the indoor Manhattan assumption.
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
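A sketch of the payoff in (10). The slide does not fix a particular photo-consistency measure PC, so a plain negative absolute intensity difference stands in for it here; `reproject` is a hypothetical helper mapping a base-frame pixel into frame k via the wall/floor/ceiling planes the hypothesis y_x induces.

```python
def stereo_payoff(x, y_x, base, frames, reproject):
    """pi_stereo(x, y_x) per (10): photo-consistency of column x under the
    wall-boundary hypothesis y_x, summed over the auxiliary frames.
    base:   (H, W) grayscale base frame I_0;
    frames: list of (H, W) auxiliary frames I_1..I_K;
    reproject(p, y_x, k) -> (u, v): assumed model-induced reprojection."""
    total = 0.0
    for y in range(base.shape[0]):
        for k, frame in enumerate(frames):
            u, v = reproject((x, y), y_x, k)
            ui, vi = int(round(u)), int(round(v))
            if 0 <= vi < frame.shape[0] and 0 <= ui < frame.shape[1]:
                # Negative absolute difference as a simple stand-in for PC.
                total -= abs(float(base[y, x]) - float(frame[vi, ui]))
    return total
```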
• 17. LIKELIHOOD FOR POINT CLOUD FEATURES
[Figure 7: Depth measurements d_i might be generated by a surface in our model (t_i = ON) or by an object inside or outside the environment (t_i = IN or OUT respectively), e.g. seen through a window.]
• The likelihoods we use are:

  P(d | p, M, IN) = α, if 0 < d < r(p; M); 0 otherwise    (11)
  P(d | p, M, OUT) = β, if r(p; M) < d < N_d; 0 otherwise    (12)
  P(d | p, M, ON) = N(d; r(p; M), σ)    (13)

  where α and β are determined by the requirement that the probabilities sum to 1, and r(p; M) denotes the depth predicted by M at p.
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 18. LIKELIHOOD FOR POINT CLOUD FEATURES (CONTINUED)
• Marginalising over the hidden variable t gives

  P(d | p, M) = Σ_t P(t) P(d | p, M, t)    (15)

• Let D denote all depth measurements, P denote all pixels, and D_x contain indices for all depth measurements in column x. Then

  P(M | D, P) ∝ P(M) Π_x Π_{i ∈ D_x} P(d_i | p_i, y_x)    (16)

  log P(M | D, P) = log P(M) + Σ_x Σ_{i ∈ D_x} log P(d_i | p_i, y_x)    (17)

  which we write in payoff form as

  π_3D(x, y_x) = Σ_{i ∈ D_x} log P(d_i | p_i, y_x)    (18)

  and the penalty function λ remains as in (8).
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
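A sketch of the marginal depth likelihood (11)-(15) and the resulting column payoff (18). Interpreting "sum to 1" as each conditional normalising over its depth interval gives α = 1/r and β = 1/(N_d − r); that reading, and the uniform prior over t, are assumptions made for illustration.

```python
import numpy as np

def depth_likelihood(d, r, N_d, sigma, priors=(1/3, 1/3, 1/3)):
    """P(d | p, M) marginalised over t in {IN, ON, OUT}, cf. (11)-(15).
    r = r(p; M) is the depth the model predicts at p; N_d the maximum depth."""
    p_in = 1.0 / r if 0.0 < d < r else 0.0             # (11), alpha = 1/r
    p_out = 1.0 / (N_d - r) if r < d < N_d else 0.0    # (12), beta = 1/(N_d - r)
    p_on = np.exp(-0.5 * ((d - r) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    w_in, w_on, w_out = priors                          # assumed uniform P(t)
    return w_in * p_in + w_on * p_on + w_out * p_out    # (15)

def payoff_3d(column_depths, r, N_d, sigma):
    """pi_3D(x, y_x) per (18): log-likelihood of the column's depth samples."""
    return sum(np.log(max(depth_likelihood(d, r, N_d, sigma), 1e-12))
               for d in column_depths)
```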
• 19. COMBINING FEATURES
• We combine features by assuming conditional independence given M:

  P(M | X_mono, X_stereo, X_3D) ∝ P(M) P(X_mono | M) P(X_stereo | M) P(X_3D | M)    (19)

• Taking logarithms leads to summation over payoffs:

  π_joint(x) = π_mono(x) + π_stereo(x) + π_3D(x)    (20)

• No approximations other than conditional independence.
• Resolving the floor and ceiling planes: if C is the camera matrix for any frame and v_v is the vertical vanishing point in that frame, then n = C⁻¹ v_v is normal to the floor and ceiling planes. We sweep a plane with this orientation through the scene, recording at each step the number of points within a distance τ of the plane (τ = 0.1% of the diameter of the point cloud in our experiments). We take as the floor and ceiling planes the minimum and maximum locations such that the plane contains at least 5 points. We found that this simple heuristic worked without failure on our training set. A sketch of the sweep follows below.
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
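A minimal NumPy sketch of the plane sweep just described. Two simplifications are ours: candidate offsets are taken at each point's own projection rather than on a fixed step grid, and the cloud diameter is approximated by the bounding-box diagonal.

```python
import numpy as np

def sweep_floor_ceiling(points, n, tau_frac=0.001, min_pts=5):
    """Plane-sweep heuristic for the floor and ceiling planes (sketch).
    points: (N, 3) point cloud; n: shared plane normal, i.e. C^{-1} v_v.
    Returns the extreme plane offsets along n supported by >= min_pts points,
    with tau = 0.1% of the (approximate) cloud diameter."""
    n = n / np.linalg.norm(n)
    offsets = points @ n                    # signed offset of each point along n
    tau = tau_frac * np.linalg.norm(points.max(axis=0) - points.min(axis=0))
    candidates = np.sort(offsets)           # O(N^2) support count; fine for ~100 pts
    support = np.array([(np.abs(offsets - c) < tau).sum() for c in candidates])
    good = candidates[support >= min_pts]
    return good.min(), good.max()            # floor offset, ceiling offset
```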
• 20. INFERENCE
• It simplifies the remainder if vertical world lines appear vertical in the image; to this end we use a simple rectification procedure.
• Parametrization: let the image dimensions be N_x × N_y. Following rectification, the vertical seams at which adjacent walls meet project to vertical lines, so each image column intersects exactly one wall segment. Let the top and bottom of the wall segment in column x be p_x = (x, y'_x) and q_x = (x, y_x) respectively. Since each p_x lies on the ceiling plane and each q_x lies on the floor plane, we have

  p_x = H q_x    (1)

  where H is a planar homology (Section 3.5). Once H is known, any indoor Manhattan model is fully described by the values {y_x}, leading to the parametrization

  M = {y_x}, x = 1, ..., N_x    (2)

• To check whether a point (x_0, y_0) lies on a vertical or horizontal surface we need only check whether y_0 is between y_{x_0} and y'_{x_0}.
• MAP inference

  M̂ = argmax_M P(M | X)    (36)

  is thereby reduced to optimisation over a payoff matrix:

  M̂ = argmax_M Σ_x π(x, y_x) + Σ_i λ(M, i)    (35)

Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 21. RECURSIVE SUB-PROBLEM FORMULATION
[Figure 9: What is the optimal model up to column x?]
• Let f_out(x, y, a), 1 ≤ x ≤ N_x, 1 ≤ y ≤ N_y, a ∈ {1, 2}, be the maximum payoff for any indoor Manhattan model M spanning columns [1, x], such that (i) M contains a floor/wall intersection at (x, y), and (ii) the wall that intersects column x has orientation a. Then f_out can be computed by recursive evaluation of the recurrence relations

  f_out(x, y, a) = max_{a' ∈ {1,2}} max{ f_up(x, y−1, a') + ψ(x),
                                         f_down(x, y+1, a') + ψ(x),
                                         f_in(x, y, a') + ψ(x) }    (25)

  f_up(x, y, a) = max{ f_in(·), f_up(x, y−1, a) }    (26)

  f_down(x, y, a) = max{ f_in(·), f_down(x, y+1, a) }    (27)

  f_in(x, y, a) = max_{x' < x} f_out(x', y', a) + Σ    (28)

  Σ = Σ_{i=x'}^{x} π(i, y_i)    (29)

• Here we have treated f_in, f_up, and f_down simply as notational placeholders; for their interpretations in terms of sub-problems see [7]. Finally, the base cases are

  f_out(0, y, a) = 0  ∀ y, a    (30)
  f_up(x, 0, a) = −∞  ∀ x, a    (31)
  f_down(x, N_y, a) = −∞  ∀ x, a    (32)

Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, ECCV 2010
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
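The recurrences admit a bottom-up dynamic program once the inner max over x' in (28) is carried as a running maximum over prefix sums of π. The sketch below makes two strong simplifications, flagged in the comments: the boundary row is held constant along each wall (fronto-parallel walls), whereas the real algorithm follows each wall's vanishing line, and a single corner penalty ψ replaces the three corner types; see the ECCV 2010 paper above for the full algorithm.

```python
import numpy as np

def dp_payoff(pi, psi):
    """Bottom-up sketch of the recurrences (25)-(32).
    pi:  (Nx, Ny) payoff matrix, pi[x, y] = pi(x, y);
    psi: scalar corner penalty (the paper distinguishes corner types)."""
    Nx, Ny = pi.shape
    # S[x, y] = sum of pi over columns < x at row y (constant-row walls).
    S = np.vstack([np.zeros((1, Ny)), np.cumsum(pi, axis=0)])
    # Running max over x' of f_out(x', y, a) - S[x', y]; f_out(0,.,.) = 0  (30)
    best = np.zeros((Ny, 2)) - S[0][:, None]
    ans = -np.inf
    for x in range(1, Nx + 1):
        f_in = best + S[x][:, None]                      # (28)-(29)
        f_up = f_in.copy()
        for y in range(1, Ny):                           # (26); top row inline (31)
            f_up[y] = np.maximum(f_in[y], f_up[y - 1])
        f_down = f_in.copy()
        for y in range(Ny - 2, -1, -1):                  # (27); bottom row inline (32)
            f_down[y] = np.maximum(f_in[y], f_down[y + 1])
        f_out = np.full((Ny, 2), -np.inf)
        for y in range(Ny):
            for a in (0, 1):
                a2 = 1 - a                               # a corner switches orientation
                cands = [f_in[y, a2] + psi]              # (25)
                if y > 0:
                    cands.append(f_up[y - 1, a2] + psi)
                if y < Ny - 1:
                    cands.append(f_down[y + 1, a2] + psi)
                f_out[y, a] = max(cands)
        best = np.maximum(best, f_out - S[x][:, None])   # extend running max over x'
        if x == Nx:
            ans = f_in.max()                             # model spans the full width
    return float(ans)

# e.g. dp_payoff(np.random.rand(640, 480), psi=-1.0)
```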
• 22. RESULTS
Input
• 3 frames sampled at 1-second intervals
• Camera poses from SLAM
• Point cloud (approx. 100 points)
Dataset
• 204 triplets from 10 video sequences
• Image dimensions 640 × 480
• Manually annotated ground truth
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 23. RESULTS Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 24. RESULTS

  Algorithm           | Mean depth error (%) | Labeling error (%)
  --------------------|----------------------|-------------------
  Our approach (full) | 14.5                 | 24.5
  Stereo only         | 17.4                 | 30.5
  3D only [1]         | 15.2                 | 28.9
  Monocular only      | 24.8                 | 30.8
  Brostow et al. [2]  | –                    | 39.4
  Lee et al. [3]      | 79.8                 | 54.5

[1] Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, ECCV 2010
[2] Brostow, Shotton, Fauqueur, and Cipolla, “Segmentation and recognition using structure from motion point clouds”, ECCV 2008
[3] Lee, Hebert, and Kanade, “Geometric reasoning for single image structure recovery”, CVPR 2009
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 25. RESULTS
Mean processing time per instance: 997 ms
• Monocular features: 160 ms
• Stereo features: 730 ms
• 3D features: 9 ms
• Inference: 102 ms
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 26. RESULTS Sparse texture Non-Manhattan Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 27. RESULTS Poor Lighting Conditions Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 28. RESULTS Clutter Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 29. RESULTS Failure Cases Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
  • 30. RESULTS Failure Cases Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”
• 31. SUMMARY
• We wish to leverage multiple-view geometry for scene understanding.
• Indoor Manhattan models are a simple and meaningful model family.
• We have presented a probabilistic model for monocular, stereo, and point cloud features.
• A fast and exact inference algorithm exists.
• Results show state-of-the-art performance.
Alex Flint, David Murray, Ian Reid “Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”