Three Dimensional Representation and Reasoning for Indoor Scene
Understanding
David C. Lee
August 2011
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213
Thesis Committee:
Takeo Kanade, Chair
Martial Hebert
Alexei A. Efros
Marios Savvides
Jitendra Malik, UC Berkeley
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer
Engineering
©2011 by David C. Lee. All rights reserved.
Abstract
When addressing the problem of scene understanding from a single image, we
want our system to understand not only where objects are in the image, but also
where they are in the 3D world. Segmenting and labeling regions only in the 2D
image plane does not achieve this goal. We need a representation that inherently
encodes the 3D properties of the scene. In addition to understanding the location in
3D, we also want our system to make use of physical knowledge about valid config-
urations of our world by rejecting configurations that violate physical constraints,
such as two objects occupying the same volume. 3D geometric properties can also
aid in detecting and identifying certain classes of objects that are well characterized
by their geometry. In this thesis, we will demonstrate the benefits of using 3D rep-
resentation for indoor scene understanding. We will show that the use of models
provides a natural way to represent objects in 3D and inject knowledge we have
about the world to perform geometric reasoning.
Acknowledgements
I would first like to thank my advisor Professor Takeo Kanade for his support
and guidance throughout my PhD study. He has provided practical guidance and
has steered me to pursue bigger goals. I would also like to thank Professor Martial
Hebert for his advice and encouragement. I thank friends at CMU for making
my stay in Pittsburgh fun and memorable. Finally, I thank my family and my wife
SooYoon for their endless support and love.
The work presented in this thesis was supported in part by NSF Grant EEEC-
0540865, ONR MURI Grant N00014-07-1-0747, NSF Grant IIS-0905402, and
ONR Grant N000141010766.
Contents
1 Introduction
  1.1 Challenges
  1.2 Our Approach
2 Related Work
3 Representation of the Structure of Building Interiors
  3.1 Introduction
  3.2 Prior Work
  3.3 Indoor World Model
  3.4 Geometric Reasoning
  3.5 Finding Building Structure
    3.5.1 Line Segment Detection and Vanishing Point Estimation
    3.5.2 Generating Building Hypotheses
    3.5.3 Evaluating Building Hypotheses
    3.5.4 Converting Building Models to 3D
  3.6 Experiments
  3.7 Populating the Scene Frame with Objects
  3.8 Conclusion
4 Volumetric Reasoning for Structure and Objects
  4.1 Introduction
    4.1.1 Background
  4.2 Overview
  4.3 Estimating Surface Geometry
  4.4 Generating Scene Configuration Hypothesis
    4.4.1 Generating Room Hypotheses
    4.4.2 Generating Object Hypotheses
    4.4.3 Volumetric Compatibility of Scene Configuration
  4.5 Evaluating Scene Configurations
    4.5.1 Inference
    4.5.2 Learning the Score Function
  4.6 Experimental Results
  4.7 Conclusion
5 Detecting Objects Characterized by Geometry
  5.1 Introduction
  5.2 Related Work
  5.3 Representation of objects and building structure
    5.3.1 Objects
    5.3.2 Geometric Properties of Objects
  5.4 Method Details
    5.4.1 Creating rectangle hypotheses
    5.4.2 Lifting Rectangle Hypotheses to 3D
    5.4.3 Creating Building Structure Hypotheses
  5.5 Results
6 Conclusion
  6.1 Future Work
List of Figures
1.1 An example of a complex indoor environment
1.2 The Penrose triangle, an example of a physically impossible object.
1.3 An example of an invalid configuration, where an object protrudes
into a wall of a room.
3.1 Line segments. Can you recognize the building structure? Can you
find doors?
3.2 Levels of completeness of line drawings. Left: Complete. Middle:
Missing. Not all structure edges in the real world are present in the
image. Right: Missing and Spurious. Not all lines in the image are
structure edges or even part of the target structure.
3.3 Examples of building models under Indoor World model. All building
models are built by connecting three basic types of corners. Top
left: concave(-) corner. Top middle: convex(+) corner. Top right:
occluding(>) corner. Bottom row: combinations of corners.
3.4 Regions divided by vanishing lines and restrictions on types of
corners. Top: Line drawing, vanishing points, and vanishing lines.
Bottom: Types of possible corners in each of the three regions.
Enclosed in small boxes are depictions of corners as they would appear
in the image, and next to them are top-down views of each corner.
In each of the three regions, four types of corners can exist: one
convex(+), one concave(-), and two occluding(>) corners.
3.5 Solid lines are the minimal set of lines needed to define a corner.
Three lines are needed for convex(+) and concave(-) corners. Four
lines are needed for occluding(>) corners.
3.6 Generating hypotheses. Left: The process of a hypothesis being
generated by four line segments. Right: A sample of generated
building hypotheses.
3.7 Line segments and Orientation map. (a) Line segments, vanishing
points, and vanishing lines. (b) Orientation map. Line segments
and regions are colored according to their orientation. (Best viewed
in color)
3.8 The shaded area denotes the sweep S(l, v_y, α) of line l towards
vanishing point v_y by amount α, and it potentially supports the region
to be orthogonal to v_x and v_y.
3.9 3D models
3.10 Percentage of pixels with correct orientation.
3.11 Comparison of floor boundary error
3.12 Examples of doors and people in a scene frame.
3.13 Examples (Best viewed in color)
3.14 Examples with occluding objects. Unobstructed view of the
ceiling-wall boundary helps find the underlying building structure. (Best
viewed in color)
3.15 Failure examples. (Best viewed in color)
3.16 Examples of images downloaded from the web. Top two rows:
Success. Bottom two rows: Failure. (Best viewed in color)
4.1 (a) Input image. (b) Estimate of the spatial layout of the room
without object reasoning. Colors represent the output of the surface
geometry by [36]. Green: floor, red: left wall, yellow: center wall,
cyan: right wall. (c) Evidence from object region removed. (d)
Spatial layout with 2D object reasoning. (e) Object fitted with 3D
parametric model. (f) Spatial layout with 3D volumetric reasoning.
The wall is pushed by the volume occupied by the object.
4.2 Overview of our approach for estimating the spatial layout of the
room and the objects.
4.3 Examples of volumetric constraint violation.
4.4 Object hypothesis generation: we use the orientation maps to
generate object hypotheses by finding convex edges.
4.5 Two qualitative examples showing how 3D volumetric reasoning
aids estimation of the spatial layout of the room.
4.6 Additional examples to show the performance on a wide variety
of scenes. Dotted lines represent the room estimate without object
reasoning.
4.7 Failure examples. The first two examples are the failure cases when
the cuboids are either missed or estimated wrong. The last two
failure cases are due to errors in vanishing point estimation.
5.1 Three common “Sculpted Objects” objects modeled using
rectangles. (a) Desk and Computer monitor. (b) Doors.
5.2 Relational geometric properties specific to object categories
5.3 Four types of L-junctions. (a) Given a designation of “up” and
“right” direction, L-junctions are categorized into four types:
top-left, top-right, bottom-left, and bottom-right. (b)(c) L-junctions are
formed by connecting two line segments. Depending on the relative
configuration of two line segments, they form different types of
L-junctions. (b) A bottom-right junction. (c) A top-left junction.
5.4 Connection of L-junctions. ID of line segments that form L-junctions
determine which L-junctions can be connected with each other. A
top-left type junction with ID (12,6) can connect with top-right
junction with ID (12,17) but not with ID (3,17)
5.5 Result for estimating building structure and detecting doors, desks,
and monitors. (a) Office sequence. (b) Common area sequence.
List of Tables
4.1 Percentage of pixels with correct estimate of room surfaces. First
row performs no reasoning about objects. Second row is our
approach with 3D volumetric reasoning of objects. Columns show
the features that are used. OM: Orientation map from [45]. GC:
Geometric context from [36].
Chapter 1
Introduction
Seeing is a major part of our daily lives. We visually perceive the scene that sur-
rounds us at almost every moment that we are awake. Our perception is not limited
to detecting objects of interest, such as faces, people, cars, chairs, desks, etc. It
includes understanding the entire scene and perceiving the environment, such as
knowing that we are on a busy street and cars are on a road, or that we are in an
office and there are desks and chairs. Furthermore, our understanding is not limited
to understanding just the semantic category of the environment and objects. Our
understanding extends to the underlying 3D geometry of the scene and we know
where things are in the real 3D world, which allows us to navigate in our 3D world
and perform daily tasks, such as approaching and sitting on a chair.
Our goal in this thesis is to create computer vision methods that mimic the human
ability to understand a scene in 3D. We would like our system to understand
the 3D spatial layout of its environment and locate the 3D positions of objects in the
environment. Such a system could allow robots to navigate and manipulate objects in
an environment. It could also be used as an assistive device to help people with visual
impairment perceive their surroundings.
For scene understanding, we believe that it is crucial to attempt to understand it
in three dimensions, rather than to recognize just the semantic category of the scene
and objects in the scene, or to detect objects just in the image and not in the 3D
world. Three dimensional understanding is necessary for a robot to navigate or to
help people perceive their environment. However, aside from these practical implications,
we believe that it is better to understand in 3D, even from a pure computer vision
perspective, because 1) there are objects better defined by their 3D geometric properties
rather than appearance properties, and 2) 3D geometry provides strong constraints
on the size and relative location of various components in the scene.
Our goal of three dimensional scene understanding also differs from pure 3D re-
construction methods, such as stereo vision, structure from motion, or depth cam-
eras. Such methods provide only 3D point clouds of a scene and are not capable
of assigning semantic meaning to those point clouds, such as floor, walls, desk,
and so on. Our goal of scene understanding provides a higher level semantic under-
standing of an environment.
1.1 Challenges
One of the major challenges in scene understanding comes from the loss of three
dimensional information as a result of the perspective projection of the 3D world
onto the 2D image plane. Thus, to recover the three dimensional information from
an image, one must make use of the rules and regularities that exist in the world to
resolve the inherent ambiguity caused by perspective projection.
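The ambiguity described above can be illustrated with a minimal sketch. It assumes an idealized pinhole camera at the origin looking along the Z axis; the function name `project` and the sample ray are hypothetical and are not taken from the thesis.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a 3D camera-frame point (X, Y, Z) onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# Three different 3D points on the same viewing ray through the camera center.
ray_direction = (0.5, -0.25, 1.0)
points_on_ray = [tuple(depth * c for c in ray_direction) for depth in (1.0, 2.0, 8.0)]

# All of them project to the same image point, so depth cannot be read off a pixel.
images = [project(p) for p in points_on_ray]
```

Because every point on the ray maps to one image location, any single-image method has to bring in extra assumptions about the world to choose among them.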
Figure 1.1: An example of a complex indoor environment
Our focus in this thesis is on discovering rules that every physical object must
obey in our world and applying the rules to guide us in scene understanding. For
example, Figures 1.2 and 1.3 show examples in which such rules are violated: the
Penrose triangle (Fig 1.2) and a scene where multiple objects occupy the
same volume (Fig 1.3). The Penrose triangle is an example of an object that in itself
cannot be realized in our world. The second example is a scene in which the individual
components of the scene, i.e. the room and the object, are physically valid but their
relation prevents the configuration from being realized in our world. By explicitly
ruling out such configurations that do not exist in our world, we can make the task
of understanding the scene easier.
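As a rough illustration of ruling out configurations in which solids occupy the same volume, the following sketch tests axis-aligned boxes for overlap. It is a simplification under stated assumptions: the `Box` type, the function names, and the example dimensions are hypothetical, and this is not the volumetric reasoning procedure developed later in the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned cuboid given by its minimum and maximum corners (x, y, z)."""
    lo: tuple
    hi: tuple


def volumes_overlap(a: Box, b: Box) -> bool:
    """True if the interiors of two boxes share volume (touching faces do not count)."""
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(3))


def is_physically_valid(boxes) -> bool:
    """Reject a configuration as soon as any pair of solids occupies the same volume."""
    return not any(
        volumes_overlap(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )


wall = Box((0.0, 0.0, 0.0), (0.2, 4.0, 2.5))      # a thick wall slab
desk_ok = Box((0.2, 1.0, 0.0), (1.0, 2.5, 0.8))   # flush against the wall
desk_bad = Box((0.1, 1.0, 0.0), (0.9, 2.5, 0.8))  # protrudes into the wall
```

Here `is_physically_valid([wall, desk_ok])` accepts the configuration, while `is_physically_valid([wall, desk_bad])` rejects it, mirroring the kind of invalid scene shown in Figure 1.3.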
Figure 1.2: The Penrose triangle, an example of a physically impossible object.
Figure 1.3: An example of an invalid configuration, where an object protrudes into a wall of
a room.
Another challenge is in choosing the right model to represent the 3D scene. Our
goal is to understand both the semantic category and the 3D geometry of
components in a scene. There has been a recent surge in work on scene
understanding [60, 66, 33, 22]. Most of this work represents objects in the image as segmented
regions with associated category labels. These methods can tell what objects
are in the scene, but are unable to tell where those objects are in the 3D scene.
Instead, if the models we use to represent objects are 3D models, we can detect
and localize the object simultaneously and achieve our goal of 3D scene
understanding. However, choosing the right model to represent the 3D scene is not an
easy problem. For example, representing the environment using dense 3D point
clouds or 3D polygons will suffice to provide an accurate geometric structure of the
scene but will be unable to assign semantic meaning to the structure. Also, 3D point
clouds and polygons have the potential to represent the given scene to a very high
level of detail, but they are very hard to fit robustly to the scene with limited input,
such as just a single image of the scene.
Therefore, we need a model that can represent both the semantic category and
the 3D structure, while striking the right balance between model complexity and
robustness of fitting, and between generalizability and adherence to common
environments. In this thesis, we propose a model to represent the structure of building
interiors that can represent the 3D structure of the scene and identify the major
surfaces, such as floor, walls, and ceiling, and is easy to manage in the 2D image space.
We also propose the use of simple geometric primitives, such as rectangles
and cuboids, to represent common objects found in indoor environments.
1.2 Our Approach
Our goal is to understand a scene, given an image acquired by a camera. We would
like to build computer algorithms that can recover the structure of building interiors
in 3D given a single image. In addition to the structure of building interiors, we also
detect common indoor objects, such as doors, desks, and computer monitors.
Our approach towards scene understanding is to use 3D representation and rea-
soning. We carefully make observations about our physical world and then decide
on the representation that is suitable to model our target environment. We then dis-
cover rules about the geometric properties of indoor environments, which objects in
the real world must satisfy. We consider both rules about individual parameterized
models, described in Chapter 3, and rules among different objects in the scene,
described in Chapter 4. Such rules allow us to limit search to geometrically valid
configurations, resulting in an improved estimate of the scene due to the smaller search
space, while guaranteeing that the estimated configuration is always physically valid.
Finally, we extend the idea of 3D representation to detecting the identity of the ob-
jects by recognizing object categories that are better characterized by their geometry
than their appearance, described in Chapter 5.
We limit our target environment to man-made indoor environments, as we spend
a major part of our days indoors and indoor scene understanding has huge
implications for robots and assistive technology. In addition, indoor environments are
highly structured, so it is easy to represent components in the scene using
parametric models and easy to discover and apply geometric constraints.
The following are the key contributions of this thesis:
• Estimation of structure of building interiors.
– Model to represent building interiors and geometric reasoning to rule
out invalid structures
– Method to estimate local surface orientation in Manhattan environments.
• Detection of objects in indoor environments
– Reasoning about occupied volume of objects and building structures
– Use of three dimensional geometric properties as the main characteriz-
ing feature to identify objects
Chapter 2
Related Work
3D scene understanding is one of the most important problems in computer vision
and has received much attention from many researchers. It is related to many dif-
ferent subfields in computer vision. Our work has been influenced by and is built
upon prior work. In this chapter, we introduce related work and place our
work in its context.
Scene understanding involves understanding the overall scene and the various
components in an image. In order to understand the components in an image, many
have utilized the relationship among the components in a scene. Ohta et al. [53]
have modeled the relationships of properties among substructures in a scene. More
recently, many researchers have applied machine learning to model the relationships
among various objects, such as a computer mouse being next to a keyboard, and used that
information to detect objects together [64, 43, 13, 30]. The relationships that they
considered were two dimensional, such as cars being above the road in the image.
While two dimensional relationships are useful, there are cases when
two dimensional relationships cannot correctly model a scene. For example, when
a car in the foreground is occluding the road behind it, the road appears to be above
the car in the image.
More explicit modeling of the three dimensional relationship between
components has been done recently by Hoiem et al. [35]. They modeled the pitch
angle of the camera while simultaneously detecting objects. This simple model
puts a constraint on the size and position of objects when the real-world sizes of
objects are known and the objects are assumed to rest on the ground.
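To make this kind of constraint concrete, here is a small sketch of how the horizon row ties image measurements to real-world height. It assumes a pinhole camera with no roll and small pitch, so that the horizon is a horizontal image line; the formula is the standard ground-plane relation under those assumptions, not a quotation of the model in [35], and the function name and numbers are hypothetical.

```python
def object_height(camera_height, horizon_row, bottom_row, top_row):
    """Estimate the real-world height of an object standing on the ground plane.

    Image rows increase downward. Assumes a pinhole camera with no roll and
    small pitch, so the horizon is a horizontal image line at `horizon_row`.
    """
    if bottom_row <= horizon_row:
        raise ValueError("an object resting on the ground must touch it below the horizon")
    # Similar triangles: (bottom - top) / (bottom - horizon) = height / camera_height.
    return camera_height * (bottom_row - top_row) / (bottom_row - horizon_row)


# Hypothetical detection: camera 1.6 m above the floor, horizon at row 240,
# object spanning rows 220 (top) to 400 (bottom).
estimated_height = object_height(1.6, horizon_row=240, bottom_row=400, top_row=220)
```

Read in the other direction, a known object height restricts which image sizes are plausible at a given image position, which is how such a model can prune implausible detections.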
Recovering the 3D structure from images is also a major part of our goal of 3D
scene understanding. There are a number of methods to recover 3D structure from
multiple images, such as structure-from-motion and stereo [15, 29, 47]. The
theoretical aspects of such methods have matured, and modern stereo systems are able
to produce point clouds to a high level of accuracy [2]. But methods relying on
multiple images have a fundamental limitation imposed by the distance between
the cameras at the time the images are acquired, that is, the baseline distance
in the case of stereo, or the distance traveled in the case of structure-from-motion.
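The baseline limitation can be seen from the standard rectified-stereo relation Z = f B / d. The sketch below assumes an idealized rectified pair; the function names and the camera parameters are hypothetical values chosen for illustration.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_uncertainty(focal_px, baseline_m, depth_m, disparity_error_px=1.0):
    """First-order depth error from a disparity error: dZ ~= Z**2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Hypothetical camera: 500 px focal length, 10 cm baseline, 1 px disparity error.
focal_px, baseline_m = 500.0, 0.1
near_error = depth_uncertainty(focal_px, baseline_m, depth_m=1.0)
far_error = depth_uncertainty(focal_px, baseline_m, depth_m=10.0)
```

With these numbers the error grows by a factor of 100 when the depth grows by a factor of 10, and it shrinks only in proportion to the baseline, which is why a short baseline limits the usable depth range.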
There have been some recent developments showing that 3D structure can
be estimated from a single image [34, 57, 58]. Such methods rely on the fact that
there is a pattern in the appearance of image patches that depends on the surface normal
or the distance from the camera. For example, the appearance of the ground is
different from that of buildings or the sky, and the texture of tree leaves is different when
viewed from nearby or far away. These methods use machine learning to learn the
distribution of appearance features and map appearance to depth or surface normals,
and eventually recover the underlying 3D structure. They do not have
the fundamental limiting factor of baseline distance, so they work for scenes with
greater depth. However, the fidelity of the reconstruction cannot be guaranteed, and
these methods do not generalize to scenes that differ greatly from the scenes they
were trained on.
The most recent breakthrough in obtaining 3D structure is dedicated
hardware that measures depth directly [7]. Such depth cameras have existed in the past,
but their cost has dropped drastically in the past year to the consumer level, so that it is
now possible to use them for practical applications.
The three methods mentioned for 3D reconstruction, multiple-view,
single-view, and depth cameras, estimate only the 3D structure and are unable to
assign any semantic meaning to the recovered scene. Our goal of understanding a given
scene includes both semantic understanding and geometric reconstruction of
the scene.
For man-made environments, a useful subclass of scenes called “Manhattan
World” has been proposed [8]. It assumes that the world is made up of planar
surfaces that have three mutually orthogonal orientations. Such an assumption
holds for many man-made environments, both indoors and outdoors, and has proved
to be useful. In a Manhattan World, there are three vanishing points, which are
points in the image to which parallel lines in 3D converge. Estimating vanishing
points [55, 67, 40, 62, 3] allows us to infer the 3D orientation of parallel lines and
provides useful information for later processes. A number of works have detected
rectangular structures [41, 49, 28, 51, 71] by benefiting from vanishing point
estimation and the Manhattan World assumption. There are also multiple-view
methods that make use of the Manhattan assumption to achieve impressive results [20, 21].
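The notion of a vanishing point can be sketched with homogeneous coordinates: the images of two parallel 3D lines intersect at their vanishing point. The following minimal example intersects just two image segments and is only an illustration; the estimation methods cited above handle many noisy segments, and all names here are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(segment_a, segment_b):
    """Intersection of the lines supporting two segments, or None if parallel in the image."""
    x, y, w = cross(line_through(*segment_a), line_through(*segment_b))
    if abs(w) < 1e-12:
        return None  # the image lines meet at infinity
    return (x / w, y / w)


# Two segments whose supporting lines converge toward the image point (100, 50).
vp = vanishing_point(((0.0, 0.0), (50.0, 25.0)), ((0.0, 100.0), (50.0, 75.0)))
```

In a Manhattan World, repeating this for the three dominant line directions yields the three vanishing points mentioned above.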
Another subclass of “Manhattan World”, called “Indoor Manhattan World”, has
recently been explored, both by the work in this thesis and by others working
at around the same time. Our work [45] first proposed this subclass by adding
to the Manhattan World assumption the constraint that there are at most
two horizontal surfaces, the floor and the ceiling. This constraint allowed us to
build a model that can represent most indoor building structures. A slightly simpler
model that represents rooms by boxes was proposed by [31]. Since then, many
works have adapted the model to estimate the structure of indoor environments [44,
19, 32, 70, 26].
At the object level, the past decade was particularly successful for object
detectors, which have matured enough to be of practical use for a few classes of objects,
such as faces [68] and pedestrians [10]. Such efforts are expanding to more classes
of objects [18, 17], driven by organized challenges such as the PASCAL
challenge [14]. This success has been based on methods that make use of appearance
features. But as reported in [14], some objects turn out to be harder to detect than
others, even when the same appearance-based method is applied.
In contrast to recent appearance-based object detection methods, geometry-based
methods were explored earlier and were the primary approach for
most of the history of computer vision, from the 1960s to the 1990s, before the surge of
appearance-based methods. Early geometry-based methods are well summarized in
the article by Mundy [50].
One of the earliest and most influential is the work on blocks world [54]. It
assumes that the world is made of a composition of polyhedral components and
solves for the parameters of polyhedral models to fit edges. The work has been
extended by many researchers, especially in exploring constraints for labeling edges [27,
6, 37, 69, 48, 61]. These works were limited to either contrived scenes or ground
truth line drawing images, rather than real scenes, and the objects they considered
were artificial blocks and not realistic objects. Also, their focus was on recovering
the geometric structure of objects, rather than determining the semantic category.
A group of works has emerged that recognizes objects by aligning manually
defined 3D object models to images [46, 4, 24, 1, 5, 38, 63]. Such methods bypass
the problem of grouping features and are robust to occlusion or missing
evidence. However, these methods eventually ran into the problem of ambiguity of
image features, so the focus of research shifted away from geometry toward
methods that learn statistical distributions of appearance.
Our work tries to make use of 3D geometry and three dimensional reasoning
at all levels of scene understanding: to represent the global structure, to rule out
physically invalid configurations, and to detect objects. This is the main motivation
of our work.
Chapter 3
Representation of the Structure of
Building Interiors
We study the problem of generating plausible interpretations of a scene from a
collection of line segments automatically extracted from a single indoor image.
We show that we can recognize the three dimensional structure of the interior of a
building, even in the presence of occluding objects. Several physically valid struc-
ture hypotheses are proposed by geometric reasoning and verified to find the best
fitting model to line segments, which is then converted to a full 3D model. Our ex-
periments demonstrate that our structure recovery from line segments is comparable
with methods using full image appearance. Our approach shows how a set of rules
describing geometric constraints between groups of segments can be used to prune
scene interpretation hypotheses and to generate the most plausible interpretation.
3.1 Introduction
It is easy for us to recognize the building structure in Figure 3.1, as well as locate
a few doors. However, automatic recognition of structure from a collection of line
segments is challenging, as not all lines defining the building structure are perfectly
detected by low level image processing. To further complicate the problem, extra
edges may lie on surfaces of walls or even on objects that are not part of the target
structure (Figure 3.2). We can still interpret the collection of line segments because
1) we perform geometric reasoning and only consider physically plausible interpre-
tations, 2) we have the ability to look globally at the overall structure, and 3) we
have prior knowledge on how the world, in our case the interior of a building, is
structured.
As images are projections of the real world, it is desirable to interpret them only
in ways which can be realized in the real world. Geometric inference, when jointly
done with semantic labeling, may be more demanding, but it may significantly
reduce the problem space and make the problem, in fact, easier.
In this work, we tackle the problem of interpreting a collection of line segments to
recognize the structure of buildings. We search for building models that translate to
physically plausible three dimensional building models. We perform geometric rea-
soning to generate many physically valid structure hypotheses from line segments.
Each hypothesis is tested to find the one that best matches the collection of line
segments. We have also done preliminary experiments to detect objects, using the
recovered structure as a “scene frame”, which provides geometric context to objects
in the scene.
Figure 3.1: Line segments. Can you recognize the building structure? Can you find doors?
Figure 3.2: Levels of completeness of line drawings. Left: Complete. Middle: Missing.
Not all structure edges in the real world are present in the image. Right: Missing and
Spurious. Not all lines in the image are structure edges or even part of the target structure.
3.2 Prior Work
Line drawings have been studied from the early days of computer vision.
Guzman [27] was the first to interpret line drawings to separate a collection of polyhedral
objects into parts. Huffman [37] and Clowes [6] came up with a formal scheme
of labeling lines as convex, concave, or occluding for polyhedral objects, with
which 3D descriptions of objects can be recovered and impossible objects can be
rejected. Mackworth [48] introduced the concept of gradient space and surface based
constraints. Waltz [69] expanded the problem by allowing line drawings to include
shadows, cracks, and missing edges (Figure 3.2). Kanade [39] dealt with the “origami
world”, which includes hollow shells and planar sheets, and utilized heuristics, such
as that parallel lines in the image are parallel in space. Sugihara [61] provided an algebraic
optimization approach for interpreting line drawings. However, these approaches
were limited to synthetic line drawings and were not applied to real images.
Kosecka’s group has published a number of papers on images of the Manhattan world
that use information from line segments. Kosecka and Wei [40] developed a method
to recover vanishing points and camera parameters from a single image by
using line segments found in Manhattan structures. Using the recovered vanishing
points, rectangular surfaces aligned with major orientations were detected by Wei
and Kosecka [41] and more recently by Micusik et al. [49]. Han and Zhu [28]
have also worked on finding rectangles aligned with vanishing points from line
segments. They used top-down grammars, which helped find rectangles forming
regular patterns, such as grid or box patterns. However, these approaches operate
directly in 2D image space (except when multiple images were used) and do not
attempt to extract three dimensional information from a single image.
A number of papers address the problem of recovering three dimensional struc-
ture from a single image. Three dimensional information can be extracted from a
single image when there is a reference in the image [9]. A commonly used refer-
ence is the ground plane. Hoiem et al. [34] and Delage et al. [12] take a two-step
approach for recovering 3D structure of outdoor images and indoor images respec-
tively: 1) estimate image region orientation (e.g., ground, vertical) using statistical
methods on image properties, such as color, texture, edge orientation, position in
image, etc. 2) “pop-up” vertical regions by “folding” along the crease between
ground and vertical regions. Saxena et al. have taken a different approach by esti-
mating absolute depth directly from image properties [57], and smoothly connect-
ing regions under weak assumptions, such as connectivity or coplanarity, without
the explicit assumption of a ground plane [58].
An interesting observation was made by Nedovic et al. [52] that a typical scene
can be categorized into a limited number of categories of 3D scene geometry, which
they call “stages”. Categories of stages include sky+ground, box, corner, and per-
son+background, and the stage information can potentially serve as a guide for a
more complete depth estimation or a more detailed scene understanding.
3.3 Indoor World Model
Most indoor environments satisfy the Manhattan World assumption [8], i.e., most
planes lie in one of three mutually orthogonal orientations. In addition, indoor envi-
ronments usually have a single floor plane and a single ceiling plane with constant
ceiling height. Combining the “Manhattan World” and “single-floor single-ceiling”
models, we propose the “Indoor World” model as a useful approximation for indoor
scenes.
This world model applies to most indoor environments and has a number of de-
sirable properties. First of all, it is easy to represent a physically valid model of a
scene in two dimensional image space, which can be effortlessly translated into a
three dimensional model. By geometric reasoning on the configuration of edges, we
can represent a scene structure in two dimensions that encodes a physically valid
Figure 3.3: Examples of building models under Indoor World model. All building models
are built by connecting three basic types of corners. Top left: concave(-) corner. Top
middle: convex(+) corner. Top right: occluding(>) corner. Bottom row: combinations of
corners.
three dimensional structure. Examples of such representations of scenes are depicted
in Figure 3.3.
Another desirable property is the symmetry that it introduces between the shape
of the ceiling and the floor. Building models under this assumption have symmetric
floor and ceiling shapes. Evidence to infer building structure from a single
image mostly comes from the position of boundaries between planes, but floor-
wall boundaries are often occluded by objects such as desks, chairs, and bookcases,
as shown in Figure 3.14. Even in those cases, ceiling-wall boundaries are rarely
occluded, so observing ceiling-wall boundaries and assuming symmetry between
them allows us to infer the location of floor-wall boundaries.
3.4 Geometric Reasoning
As the world is made up of solid objects, projections of the world onto an image
obey a set of rules. In particular, projections of buildings under the Indoor World
assumption are geometrically constrained by a small set of rules defined on connections
of walls, which we define as corners. An indoor scene can be fully represented
by corners, so geometric constraints on corners guarantee that the entire structure
is valid.
There are three types of corners: convex(+), concave(-), and occluding(>). A
convex(+) or concave(-) corner is formed when two walls meet at one place in 3D
space and an occluding(>) corner is formed when one wall is in front of another
wall but appears to be adjacent in the image. The type and position of a corner are
constrained depending on where the corner is in the image.
The simplest constraint on a corner is that it should consist of two junctions, one
above the horizon and one below the horizon. This rule holds because the camera
itself is between the floor and the ceiling. Regions divided by vertical vanishing
lines also create constraints. In each of the three regions divided by two vertical
vanishing lines, only a total of four types of corners can exist, as illustrated in Fig-
ure 3.4. These rules are derived from facts about the physical world and geometry,
such as, the camera must be in an empty quadrant of a wall in order for it to be able
to observe the corner, and walls should have non-zero thickness.
These constraints are simple to adhere to, even at an early stage of inference when
no 3D coordinates have been considered. Also, they can be applied
to local and primitive corner structures alone, even when the
global structure of the scene has not yet been considered. Yet, performing geometric reasoning
Figure 3.4: Regions divided by vanishing lines and restrictions on types of corners. Top:
Line drawing, vanishing points, and vanishing lines. Bottom: Types of possible corners
in each of the three regions. Enclosed in small boxes are depictions of corners as they
would appear in the image, and next to each is the top-down view of the corner. In each of
the three regions, four types of corners can exist: one convex(+), one concave(-), and two
occluding(>) corners.
according to these constraints will guarantee that our entire building model encodes
a valid model, which can be easily converted to a valid 3D model without ambiguity.
3.5 Finding Building Structure
Finding the building structure is done in three steps: line segments and vanishing
points are found, many plausible building model hypotheses are created, and each
hypothesis is tested against an orientation map, which is a map of local belief of re-
gion orientations, to find the best matching hypothesis. Each step will be explained
in detail in the following sections.
3.5.1 Line Segment Detection and Vanishing Point Estimation
We extract line segments using the Matlab toolbox by Kovesi [42], which runs
the Canny edge detector, links edge pixels, and fits line segments. We then recover
vanishing points from these line segments.
From the three vanishing points, we can recover the orientation of the three axes
of the building in the camera coordinate frame using the formulas in the Appendix. This allows us to
reconstruct an accurate 3D model, even when none of the camera axes are aligned
with world coordinates.
We loosely follow Rother [55] to find three orthogonal vanishing points. Two
pairs of lines are randomly sampled in RANSAC fashion and the intersection of
each pair of lines generates a candidate vanishing point. Orthogonality of the two
vanishing points is verified using the formulas in the Appendix and the third vanishing
point is computed to be orthogonal to the two vanishing points. Then the three
candidates are evaluated using the cost function proposed in [55]. Finally, the x,
y coordinates of the best RANSAC solution are fine tuned using non-linear opti-
mization (Matlab fminsearch) with the same cost function. To ensure orthogonality
under optimization, vanishing points are translated into a rotation matrix, which
can then be parameterized with three unbounded parameters using Rodrigues’ for-
mula [16]. The highly non-convex nature of the cost function is not a big issue, as
the RANSAC solution was already close to the true solution.
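The candidate-generation step above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in this work: the function names, the 5-degree orthogonality tolerance, and the use of NumPy are assumptions, and the cost function of [55] is not reproduced. Lines and points are handled in homogeneous coordinates, so a line through two points and the intersection of two lines are both cross products.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersection(l1, l2):
    """Homogeneous intersection point of two homogeneous lines."""
    return np.cross(l1, l2)

def unit_direction(K, v):
    """Unit 3D direction corresponding to a homogeneous vanishing point v."""
    d = np.linalg.solve(K, np.asarray(v, dtype=float))
    return d / np.linalg.norm(d)

def third_vanishing_point(K, v1, v2, tol_deg=5.0):
    """Vanishing point orthogonal to v1 and v2 (homogeneous).

    Returns None when v1 and v2 deviate from orthogonality by more than
    tol_deg degrees (tol_deg is a hypothetical threshold).
    """
    d1, d2 = unit_direction(K, v1), unit_direction(K, v2)
    # |cos(angle)| equals the sine of the deviation from 90 degrees.
    if abs(d1 @ d2) > np.sin(np.radians(tol_deg)):
        return None
    # The third direction is the cross product; project it back with K.
    return K @ np.cross(d1, d2)
```

In a RANSAC loop, two sampled line pairs would each be passed through `intersection` to obtain `v1` and `v2`, and the resulting triple would then be scored with the chosen cost function.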
For uncalibrated images with no available camera intrinsic parameters, three
pairs of lines are sampled to create a proposal, and orthogonality is loosely en-
forced by constraining three vanishing points to be apart from each other. Once
three vanishing points are found in image space, the focal length of the camera can
be recovered by finding a focal length that makes the angles exactly 90 degrees.
In practice, this method returned vanishing points within a few pixels of the true
vanishing points for all 102 test images when camera parameters were available,
and 40 out of 44 images when camera parameters were not available. It failed when
there were no lines in one of the three directions, or when many lines were not in the
principal directions.
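For the uncalibrated case, one common closed form for the focal length is sketched below. It is not necessarily the procedure used here, and it rests on assumptions that are not stated in the text: square pixels, zero skew, a known principal point $c$ (for example, the image center), and finite vanishing points. Under those assumptions, two orthogonal directions with vanishing points $v_i$ and $v_j$ satisfy $(v_i - c) \cdot (v_j - c) + f^2 = 0$. The function name and the use of the median over the three pairs are illustrative choices.

```python
import numpy as np

def focal_from_vanishing_points(vps, principal_point):
    """Estimate focal length (pixels) from three finite vanishing points.

    vps: three (x, y) image points assumed to correspond to mutually
    orthogonal directions. Assumes square pixels and zero skew.
    """
    c = np.asarray(principal_point, dtype=float)
    pts = [np.asarray(v, dtype=float) - c for v in vps]
    estimates = []
    for i in range(3):
        for j in range(i + 1, 3):
            f_squared = -np.dot(pts[i], pts[j])
            if f_squared > 0:  # skip pairs that give no real solution
                estimates.append(np.sqrt(f_squared))
    if not estimates:
        raise ValueError("no pair of vanishing points yields a real focal length")
    return float(np.median(estimates))
```

With noisy vanishing points the three pairwise estimates generally disagree, which is why a robust summary such as the median is used in this sketch.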
3.5.2 Generating Building Hypotheses
For this and the following section, we define “orientation of a line segment” to
be the orientation of the line in the world, which can be estimated by the vanishing
point that lies on the extension of the line segment in the image. Similarly, “parallel”
line segments means parallel in the world. “Orientation of a surface” is defined as
the normal orientation of the surface in the world and “pixel orientation” as the
orientation of the surface projected to the pixel.
Building models can be generated by connecting line segments to create corners,
and connecting corners to create building models. A corner consists of five lines,
but not all five lines need to be present to define a corner. Concave(-) and convex(+)
corners need three lines, and occluding(>) corners need four lines to be defined
Figure 3.5: Solid lines are the minimal set of lines needed to define a corner. Three lines
are needed for convex(+) and concave(-) corners. Four lines are needed for occluding(>)
corners.
(Figure 3.5). A new corner is proposed when a minimal set of lines defines a corner,
while obeying the constraints on corners described in Section 3.4.
The process of generating hypotheses is illustrated in Figure 3.6. We start by
creating building hypotheses with zero corners, i.e., scenes with just one wall. Two
parallel line segments, one above the horizon and one below the horizon, are extended
to the image boundaries to define the floor-wall and ceiling-wall boundaries
of a wall. Next, we search for line segments that can be extended to “attach” to
existing walls to propose a new corner. Note that an existing wall already defines
two lines, so only one additional line needs to be added to propose a concave(-) or
a convex(+) corner, and two for an occluding(>) corner. By repeatedly attaching
more corners to an existing structure, we can create a scene with many corners.
This process is described in Algorithm 1.
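The level-wise growth of hypotheses can be sketched as follows. This is a schematic sketch only: the predicates `above_horizon`, `below_horizon`, and `overlap`, and the function `attach_corners`, are hypothetical placeholders supplied by the caller, and the geometric corner tests of Section 3.4 are assumed to live inside `attach_corners`.

```python
from itertools import permutations

def generate_hypotheses(lines, above_horizon, below_horizon, overlap,
                        attach_corners, max_corners):
    """Grow building hypotheses one corner at a time.

    attach_corners(h, lines) must return every hypothesis obtained by
    attaching one new, geometrically valid corner to hypothesis h.
    """
    # Level 0: single-wall scenes from a ceiling-wall / floor-wall line pair.
    levels = [[(li, lj) for li, lj in permutations(lines, 2)
               if above_horizon(li) and below_horizon(lj) and overlap(li, lj)]]
    # Level k: every level-(k-1) hypothesis extended by one more corner.
    for _ in range(max_corners):
        levels.append([h_new for h in levels[-1]
                       for h_new in attach_corners(h, lines)])
    # The result is the union of all levels.
    return [h for level in levels for h in level]
```

Because each level is built only from the previous one, every returned hypothesis is valid by construction as long as `attach_corners` enforces the corner constraints.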
3.5.3 Evaluating Building Hypotheses
We test all building hypotheses to find the best fitting hypothesis to a given col-
lection of line segments. This is done by evaluating the fitness of hypotheses to
an orientation map (Figure 3.7), which is a map that expresses the local belief of
region orientations computed from line segments. The fitness of a hypothesis to an
Figure 3.6: Generating hypotheses. Left: The process of a hypothesis being generated by
four line segments. Right: A sample of generated building hypotheses.
orientation map is defined as the total number of pixels at which the orientation
encoded by the hypothesis agrees with the orientation given by the orientation map. The
hypothesis with the largest fitness is chosen as the best fitting hypothesis.
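A minimal sketch of this fitness measure is given below. It assumes, for illustration only, that both the hypothesis and the orientation map are rendered as integer label images with values 1, 2, 3 for the three orientations, and that 0 marks pixels where the orientation map holds no belief; this encoding and the function names are assumptions.

```python
import numpy as np

def fitness(hypothesis_labels, orientation_map):
    """Number of pixels where the hypothesis agrees with the orientation map.

    Pixels labeled 0 in orientation_map carry no belief and never count.
    """
    agree = (orientation_map != 0) & (hypothesis_labels == orientation_map)
    return int(np.count_nonzero(agree))

def best_hypothesis(hypotheses, orientation_map):
    """Index of the hypothesis with the largest fitness."""
    return max(range(len(hypotheses)),
               key=lambda i: fitness(hypotheses[i], orientation_map))
```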
Two line segments of different orientations supporting a pixel strongly indicate
that the orientation of the pixel is perpendicular to the orientations of both
lines. For example, we, as humans, believe that pixel (1) in Figure 3.7(a) is on a horizontal
surface because a green line above it and a blue line to its right support
pixel (1) being perpendicular to the orientations of both lines. Pixel (2) seems to be
on a vertical surface because green lines above and below and red lines to the left
support it. Notice that, although there is a blue line below pixel (2), its support is
blocked by the green line between the blue line and the pixel. The support of a line
extends until it hits a line which has the same orientation as the normal orientation
of the surface it is supporting. This is because a line cannot lie on a plane that is
Algorithm 1 Generating building hypotheses
  Set $H_0 \leftarrow \emptyset$, where $H_0$ is the set of hypotheses with zero corners.
  for all pairs of line segments $(l_i, l_j)$ do
    if $l_i$ is above the horizon ∧ $l_j$ is below the horizon ∧ $l_i$ and $l_j$ overlap then
      Add the scene with no corner $(l_i, l_j)$ to $H_0$
    end if
  end for
  for $k = 1$ to $n$, where $n$ is the maximum number of corners in a scene do
    Set $H_k \leftarrow \emptyset$, where $H_k$ is the set of hypotheses with $k$ corners.
    for all $h \in H_{k-1}$ do
      Find sets of lines that create corners that attach to $h$ and satisfy the geometric constraints.
      $H' \leftarrow$ set of all scenes with a new corner attached to $h$
      $H_k \leftarrow H_k \cup H'$
    end for
  end for
  return $H \leftarrow H_0 \cup H_1 \cup \cdots \cup H_n$
perpendicular to it. This logic usually produces an accurate orientation map, except
around occluding boundaries.
More formally, let $L_x = \{l_{x,1}, l_{x,2}, \ldots, l_{x,n_x}\}$ be the set of line segments of
orientation $x$, where $x \in \{1, 2, 3\}$ denotes one of the three orientations. A “sweep”
$S(l_{x,i}, v_y, \alpha)$ of a line $l_{x,i}$ towards vanishing point $v_y$ by amount $\alpha$ is the set of
pixels that is supported by line $l_{x,i}$ to be of orientation $z$ (Figure 3.8). Here $x$, $y$, and $z$ take
values in $\{1, 2, 3\}$ and all three must be different ($x \neq y$, $x \neq z$, and $y \neq z$).
Given a line segment $l_{x,i}$ with end points $p_1$ and $p_2$, $S(l_{x,i}, v_y, \alpha)$ is the convex hull
of $p_1$, $p_2$, $p'_1$, and $p'_2$, where $p'_1$ and $p'_2$ are given by

$$p'_1 = p_1 + \alpha (v_y - p_1),$$

$$p'_2 = \mathrm{intersection}(\mathrm{line}(v_x, p'_1), \mathrm{line}(v_y, p_2)),$$

where $\mathrm{line}(\cdot, \cdot)$ denotes a line passing through two points and $\mathrm{intersection}(\cdot, \cdot)$
denotes the point of intersection of two lines.
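The two swept end points can be computed directly in homogeneous coordinates, where both the line through two points and the intersection of two lines are cross products. The sketch below is illustrative: the function name is hypothetical, and it assumes that both vanishing points are finite image points.

```python
import numpy as np

def _homogeneous(p):
    return np.array([p[0], p[1], 1.0])

def sweep_corners(p1, p2, vx, vy, alpha):
    """Corners (p1, p2, p2', p1') of the sweep of segment p1-p2 towards vy.

    vx is the vanishing point of the segment itself. A negative alpha
    sweeps away from vy. All inputs are finite (x, y) image points.
    """
    p1, p2, vx, vy = (np.asarray(a, dtype=float) for a in (p1, p2, vx, vy))
    p1_swept = p1 + alpha * (vy - p1)
    # line(vx, p1') and line(vy, p2) as homogeneous lines.
    line_a = np.cross(_homogeneous(vx), _homogeneous(p1_swept))
    line_b = np.cross(_homogeneous(vy), _homogeneous(p2))
    x = np.cross(line_a, line_b)  # homogeneous intersection point
    p2_swept = x[:2] / x[2]
    return p1, p2, p2_swept, p1_swept
```

The four returned points are ordered around the quadrilateral, so they can be passed to a polygon rasterizer to obtain the pixel set of the sweep.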
The sweep extends until the sweep region contains a line that “blocks” the sweep.
The amounts of sweep $\hat{\alpha}_{x,i}$ and $-\hat{\beta}_{x,i}$, towards and away from the sweep direction, are

$$\hat{\alpha}_{x,i} = \max(\alpha), \qquad \hat{\beta}_{x,i} = \max(\beta),$$

such that $\alpha \geq 0$, $\beta \geq 0$, and no lines in $L_z$ intersect $S(l_{x,i}, v_y, \alpha)$ and $S(l_{x,i}, v_y, -\beta)$.
The set of pixels that is supported by all lines in $L_x$ swept towards $v_y$ to be of
orientation $z$ is

$$P_{x,y,z} = \bigcup_{l_{x,i} \in L_x} S(l_{x,i}, v_y, \hat{\alpha}_{x,i}) \cup S(l_{x,i}, v_y, -\hat{\beta}_{x,i}).$$

A pixel is believed to have orientation $z$ when two lines of different orientations $x$
and $y$ support the pixel, and only when it is exclusively supported to be $z$. The final
orientation map $O_z$ for orientation $z$ is given by

$$R_z = P_{x,y,z} \cap P_{y,x,z},$$

$$O_z = R_z \cap R_x^c \cap R_y^c,$$

where the superscript $c$ denotes the set complement.
Figure 3.7(b) shows $O_1$, $O_2$, and $O_3$ colored in red, green, and blue.
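Once the swept regions are rasterized, the set operations above reduce to element-wise Boolean operations on masks. The following sketch assumes, as an illustration, that the six sets $P_{x,y,z}$ are available as Boolean images stored in a dictionary keyed by the tuple `(x, y, z)`; the data layout and function name are assumptions.

```python
import numpy as np

def orientation_maps(P):
    """Compute O_1, O_2, O_3 from swept-region masks.

    P maps each permutation (x, y, z) of (1, 2, 3) to a Boolean HxW mask
    of pixels supported to be of orientation z by lines of orientation x
    swept towards vanishing point y.
    """
    others = {z: [k for k in (1, 2, 3) if k != z] for z in (1, 2, 3)}
    # R_z: pixels supported by lines of both remaining orientations.
    R = {}
    for z, (x, y) in others.items():
        R[z] = P[(x, y, z)] & P[(y, x, z)]
    # O_z: pixels supported exclusively to be of orientation z.
    O = {}
    for z, (x, y) in others.items():
        O[z] = R[z] & ~R[x] & ~R[y]
    return O
```

Pixels that are supported for more than one orientation end up in none of the three maps, which matches the exclusivity condition stated above.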
Figure 3.7: Line segments and Orientation map. (a) Line segments, vanishing points, and
vanishing lines. (b) Orientation map. Line segments and regions are colored according to
their orientation. (Best viewed in color)
Figure 3.8: The shaded area denotes the sweep $S(l, v_y, \alpha)$ of line $l$ towards vanishing point
$v_y$ by amount $\alpha$, and it potentially supports the region to be orthogonal to $v_x$ and $v_y$.
3.5.4 Converting Building Models to 3D
Two dimensional building model hypotheses always encode valid 3D models, so
computing 3D coordinates can be done easily without ambiguity. 3D coordinates
can be computed sequentially for floor, then walls using the constraint that floor and
walls are connected, and finally the ceiling, using the following formulas.
All metric units are in camera height, i.e., the distance between the floor and
the camera measured perpendicular to the floor equals 1, since absolute distances
cannot be measured from images. Lower-case symbols denote 2D homogeneous coordinates
and upper-case symbols denote 3D coordinates. The subscript 1 ($v_1$, $V_1$) indicates
the vertical vanishing point. $K$ is the camera intrinsic parameter matrix.
• Ray:

$$P = \lambda K^{-1} p, \quad \lambda > 0$$

• Normal direction of the three major axes, given the coordinates $(x_k, y_k)$ of the three vanishing
points in the image:

$$v_k = (x_k, y_k, 1)^T \;\Leftrightarrow\; V_k = \frac{K^{-1} v_k}{\|K^{-1} v_k\|_2}$$

• 3D coordinate of a point on the floor. Note that the height is normalized to 1.

$$P = \frac{K^{-1} p}{V_1^T K^{-1} p}$$

• Height $h$ between two points $p_1$ and $p_2$, with $p_1$ being a point on the floor.
$p_1$, $p_2$, and $v_1$ should roughly be in line when applying this formula, as we
assume $P_1$ and $P_2$ are vertically aligned in 3D.

$$P_2 = \lambda K^{-1} p_2 = P_1 + h V_1 = \frac{K^{-1} p_1}{V_1^T K^{-1} p_1} + h V_1$$

$$\begin{bmatrix} -V_1 & K^{-1} p_2 \end{bmatrix} \begin{bmatrix} h \\ \lambda \end{bmatrix} = \frac{K^{-1} p_1}{V_1^T K^{-1} p_1}$$

Solving least-squares gives $h$.
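The floor-point and height formulas can be sketched numerically as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: the function names are hypothetical, `V1` is passed in as a unit vector, and the sign of $h$ depends on which way `V1` points. In the sketch `V1` points from the camera towards the floor, so the floor is the plane $V_1^T P = 1$ and points above the floor have negative $h$.

```python
import numpy as np

def floor_point(K, V1, p):
    """3D position (in camera heights) of image point p lying on the floor."""
    ray = np.linalg.solve(K, np.asarray(p, dtype=float))  # K^{-1} p
    return ray / (V1 @ ray)  # scale the ray so that V1 . P = 1

def height_from_floor(K, V1, p1, p2):
    """Signed offset h along V1 from floor point p1 to the point seen at p2.

    Solves [-V1, K^{-1} p2] [h, lambda]^T = P1 in the least-squares sense.
    """
    P1 = floor_point(K, V1, p1)
    ray2 = np.linalg.solve(K, np.asarray(p2, dtype=float))
    A = np.column_stack([-V1, ray2])
    (h, lam), *_ = np.linalg.lstsq(A, P1, rcond=None)
    return float(h)
```

The system has three equations and two unknowns, so least squares absorbs small misalignments between $p_1$, $p_2$, and $v_1$.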
Figure 3.9: 3D models
Small errors can accumulate during the above mentioned sequential process, so
we follow Delage et al. [11] to globally minimize the distances between connected
planes using linear programming. Recovered 3D models are visualized in Fig-
ure 3.9.
3.6 Experiments
We have collected 54 images of indoor scenes. We have also included objects in
the image that obstruct the view of the scene frame. We have manually labeled
the ground truth orientation for every pixel, ignoring the occluding objects. The
percentage of pixels that have the correct orientation for each image is reported
in Figure 3.10. On average, 81% of the pixels were classified correctly. 76% of
the images had less than 30% misclassified pixels, and 44% had less than 10%
misclassified pixels. Qualitatively, around 70% of the images returned acceptable
3D models. Notice that even when objects occlude the floor-wall boundary, the
underlying building structure could be recovered (Figure 3.14). In these cases, the
unobstructed view of the ceiling-wall boundary has helped in finding the underlying
building structure. Typical failure cases are: hallways being cut off early when
there are no lines supporting down the hallway, missing corners, or misaligned
boundaries (Figure 3.15).
We have compared our results with other works on recovering indoor structure
from a single image. Our results were comparable to those of Delage et al. [12] on their
experimental setup and dataset, which has 48 images of indoor campus scenes.
RMS error between the estimated and ground truth floor boundary was measured
in pixel space, and is plotted as a function of the position of the true floor boundary
(Figure 3.11). Comparing with Hoiem et al. [35], using their classifier trained for
indoor images, we have a higher percentage of correctly classified pixel orientation
on 20 out of 48 images, and a mean percentage of 80% versus 87%. In both cases,
our results are comparable, while relying only on line segments and not on image
properties such as colors and image gradients, which can be scene specific.
We have also tested on the 44 images downloaded from the web, also collected
by Delage et al. Qualitatively, around 20 of them returned acceptable 3D models.
Failures were due to many objects that cluttered the scene, and scenes that do not
match our building model. Sample results are shown in Figure 3.16.
Figure 3.10: Percentage of pixels with correct orientation. (Plot: percentage versus image index; average = 0.82.)
Figure 3.11: Comparison of floor boundary error. (Plot: RMS error in localisation (pixels) versus height of ground truth (pixels), for our result and Delage et al.)
3.7 Populating the Scene Frame with Objects
Now that we have the scene structure, we would like to use it as a “frame” that
defines the scene, and populate it with objects in the scene. Recovering the “scene
Figure 3.12: Examples of doors and people in a scene frame.
frame” is a stepping stone toward a more complete scene understanding, as it pro-
vides a global geometric context of the scene. Our ultimate goal is to recognize all
the objects in a scene. Most objects of interest fall into one of the two categories:
objects that lie on the floor, and objects that are attached to a wall. Objects that lie
on the floor interact with the scene frame by being supported at the points where they contact
the floor of the frame, which determines their 3D locations. These objects need to be
in an empty space of the frame, and not inside walls. Locations of objects attached
to walls are also constrained by the scene frame. Figure 3.12 shows results of integrating
the recovered scene structure with door and pedestrian detection. A more
thorough study on improving building structure recovery by adding objects into the
framework is presented in Chapter 4. A study on improving object detection using
geometry is presented in Chapter 5.
3.8 Conclusion
We have proposed a framework to interpret a collection of line segments to recover
three dimensional building structure. We have shown that, by geometric reasoning,
and by using the prior knowledge of indoor environments, we can recover the struc-
ture of a building, using only line segments. An interesting future problem would
be to use our recovered structure as a “scene frame” to recognize more components
in the scene and step towards the grand goal of complete scene interpretation.
Figure 3.13: Examples (Best viewed in color)
Figure 3.14: Examples with occluding objects. Unobstructed view of the ceiling-wall
boundary helps in finding the underlying building structure. (Best viewed in color)
Figure 3.15: Failure examples. (Best viewed in color)
Figure 3.16: Examples of images downloaded from the web. Top two rows: Success.
Bottom two rows: Failure. (Best viewed in color)
Chapter 4
Volumetric Reasoning for Structure
and Objects
In the previous chapter, we have developed a method to recover the structure of
building interiors, while treating objects in the scene as outliers. In this chapter,
we show that by explicitly modeling objects and applying volumetric constraints
derived from the principles based on the physical world, the estimated structure is
geometrically plausible and the performance of the estimate improves.
4.1 Introduction
Consider the indoor image shown in Figure 4.1. Understanding such a complex
scene not only involves visual recognition of objects but also requires extracting
the 3D spatial layout of the room (ceiling, floor and walls). Extraction of the spatial
layout of a room provides crucial geometric context required for visual recognition.
There has been a recent push to extract spatial layout of the room by classifiers
which predict qualitative surface orientation labels (floor, ceiling, left, right, center
wall and object) from appearance features and then fit a parametric model of the
room. However, such an approach is limited in that it does not use the additional
information conveyed by the configuration of objects in the room and, therefore, it
fails to use all of the available cues for estimating the spatial layout.
In this work, we propose to incorporate an explicit volumetric representation of
objects in 3D into the spatial interpretation process. Unlike previous approaches which
model objects by their projection in the image plane, we propose a parametric rep-
resentation of the 3D volumes occupied by objects in the scene. We show that such
a parametric representation of the volume occupied by an object can provide crucial
evidence for estimating the spatial layout of the rooms. This evidence comes from
volumetric reasoning between the objects in the room and the spatial layout of the
room. We propose to augment the existing structured classification approaches with
volumetric reasoning in 3D for extracting the spatial layout of the room.
Figure 4.1 shows an example of a case where volumetric reasoning is crucial in
estimating the surface layout of the room. Figure 4.1(b) shows the estimated spatial
layout for the room (overlaid on surface orientation labels predicted by a classi-
fier) when no reasoning about the objects is performed. In this case, the couch is
predicted as floor and therefore there is substantial error in estimating the spatial
layout. If the couch is predicted as clutter and the image evidence from the couch
is ignored (Figure 4.1(c)), multiple room hypotheses can be selected based on the
predicted labels of the pixels on the wall (Figure 4.1(d)) and there is still not enough
evidence in the image to select one hypothesis over another in a confident manner.
However, if we represent the object by a 3D parametric model, such as a cuboid
(Figure 4.1(e)), then simple volumetric reasoning (the 3D volume occupied by the
couch should be contained in the free space of the room) can help us reject physically
invalid hypotheses and estimate the correct layout of the room by pushing the
walls to completely contain the cuboid (Figure 4.1(f)).
Figure 4.1: (a) Input image. (b) Estimate of the spatial layout of the room without object
reasoning. Colors represent the output of the surface geometry by [36]. Green: floor, red:
left wall, yellow: center wall, cyan: right wall. (c) Evidence from object region removed.
(d) Spatial layout with 2D object reasoning. (e) Object fitted with 3D parametric model. (f)
Spatial layout with 3D volumetric reasoning. The wall is pushed by the volume occupied
by the object.
In this work, we propose a method to perform volumetric reasoning by combin-
ing classical constrained search techniques and current structured prediction tech-
niques. We show that the resulting approach leads to substantially improved per-
formance on standard datasets with the added benefit of a more complete scene
description that includes objects in addition to surface layout.
4.1.1 Background
The goal of extracting 3D geometry by using geometric relationships between ob-
jects dates back to the start of computer vision around four decades ago. In the early
days of computer vision, researchers extracted lines from “blockworld” scenes [54]
and reasoned about geometric relationships by applying constraint satisfaction algorithms to
junctions [27, 69]. However, the reasoning approaches used in these blockworld scenarios
(synthetic line drawings) proved too brittle for real-world images and could
not handle the errors in extraction of line-segments or generalize to other shapes.
In recent years, there has been renewed interest in extracting camera param-
eters and three-dimensional structures in restricted domains such as Manhattan
Worlds [8]. Kosecka et al. [40] developed a method to recover vanishing points and
camera parameters from a single image by using line segments found in Manhat-
tan structures. Using the recovered vanishing points, rectangular surfaces aligned
with major orientations were also detected by [41]. However, these approaches are
only concerned with dominant directions in the 3D world and do not attempt to extract
three dimensional information about the room and the objects in the room. Yu et
al. [71] inferred the relative depth-order of rectangular surfaces by considering their
relationship. However, this method only provides depth cues of partial rectangular
regions in the image and not the entire scene.
There has been a recent series of methods related to our work that attempt to
model geometric scene structure from a single image, including geometric label
classification [36, 57] and finding vertical/ground fold-lines [12]. Lee et al. [45]
introduced parameterized models of indoor environments, constrained by rules in-
spired by blockworld to guarantee physical validity. However, since this approach
samples possible spatial layout hypotheses without clutter, it is prone to errors
caused by occlusion and tends to fit rooms in which the walls coincide with
the object surfaces. A recent paper by Hedau et al. [31] uses an appearance based
clutter classifier and computes visual features only from the regions classified as
“non-clutter”, while parameterizing the 3D structure of the scene by a box. They
use structured approaches to estimate the best fitting room box to the image. A sim-
ilar approach has been used by Wang et al. [70] which does not require the ground
truth labels of clutter. In these methods, however, the modeling of interactions between
clutter and spatial-layout of the room is only done in the image plane and the
3D interactions between room and clutter are not considered.
In work concurrent to ours, Hedau et al. [32] have also modeled objects as
three dimensional cuboids and considered the volumetric intersection with the room
structure. The goal of their work differs from ours. Their primary goal is to improve
object detection, such as beds, by using information of scene geometry, whereas our
goal is to improve scene understanding by proposing a control structure that incor-
porates volumetric constraints. Therefore, we are able to improve the estimate of
the room by estimating the objects and vice versa, whereas in their work informa-
tion flows in only one direction (from scene to objects).
In recent work by Gupta et al. [25], qualitative reasoning of scene geometry
was done by modeling objects as “blocks” for outdoor scenes. In contrast, we
use stronger parametric models for rooms and objects in indoor scenes, which are
more structured; this allows us to do more explicit and exact 3D volumetric reasoning.
4.2 Overview
Our goal is to jointly extract the spatial layout of the room and the configuration of
objects in the scene. We model the spatial layout of the room by 3D boxes and we
model the objects as solids which occupy 3D volumes in the free space defined by
the room walls. Given a set of room hypotheses and object hypotheses, our goal
is to search the space of scene configurations and select the configuration that best
matches the local surface geometry estimated from image cues and satisfies the vol-
umetric constraints of the physical world. These constraints (shown in Figure 4.3)
are:
• Finite volume: Every object in the world should have a non-zero finite volume.
• Spatial exclusion: The objects are assumed to be solid objects which cannot
intersect. Therefore, the volumes occupied by different objects are mutually
exclusive. This implies that the volumetric intersection between two objects
should be empty.
• Containment: Every object should be contained in the free space defined
by the walls of the room (i.e., none of the objects should be outside the room
walls).
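As a rough illustration of these three constraints, the sketch below checks them for axis-aligned boxes expressed in a common, room-aligned coordinate frame. Both the axis-aligned simplification and the `Box` class are assumptions made for this example; they are not the parametric representation used in this chapter.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    lo: tuple  # minimum corner (x, y, z)
    hi: tuple  # maximum corner (x, y, z)

def has_finite_volume(b):
    """Finite volume: non-zero, finite extent along every axis."""
    return all(math.isfinite(l) and math.isfinite(h) and l < h
               for l, h in zip(b.lo, b.hi))

def intersects(a, b):
    """True when the interiors overlap; boxes that only touch do not."""
    return all(al < bh and bl < ah
               for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))

def contains(room, obj):
    """Containment: the object lies entirely inside the room box."""
    return all(rl <= ol and oh <= rh
               for rl, rh, ol, oh in zip(room.lo, room.hi, obj.lo, obj.hi))

def is_valid(room, objects):
    """Check finite volume, containment, and pairwise spatial exclusion."""
    if not all(has_finite_volume(o) and contains(room, o) for o in objects):
        return False
    return not any(intersects(a, b)
                   for i, a in enumerate(objects) for b in objects[i + 1:])
```

For example, a couch box that pokes through a wall of a room hypothesis fails `contains`, which is the situation in which that room hypothesis is rejected or the wall has to be pushed outward.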
Our approach is illustrated in Figure 4.2. We first extract line segments and
estimate three mutually orthogonal vanishing points (Figure 4.2(b)). The vanishing
points define the orientation of the major surfaces in the scene [41, 45, 31] and
hence constrain the layout of ceilings, floor and walls of the room. Using the line
segments labeled by their orientations, we then generate multiple hypotheses for
Figure 4.2: Overview of our approach for estimating the spatial layout of the room and the
objects. (a) Input image. (b) Line segments and vanishing points. (c) Geometric context.
(d) Orientation map. (e) Room hypotheses. (f) Cube hypotheses. (g) Reject invalid
configurations. (h) Scene configuration hypotheses. (i) Evaluate. (j) Final scene configuration.
rooms and objects (Figure 4.2(e)(f)). A hypothesis of a room is a 3D parametric
representation of the layout of major surfaces of the scene, such as floor, left wall,
center wall, right wall, and ceiling. A hypothesis of an object is a 3D parametric
representation of an object in the scene, approximated as a cuboid.
The room and cuboid hypotheses are then combined to form the set of possible
configurations of the entire scene (Figure 4.2(h)). The configuration of the entire
scene is represented as one sample of the room hypothesis along with some subset
of object hypotheses. The number of possible scene configurations is exponential
in the number of object hypotheses: $O(n \cdot 2^m)$, where $n$ is the number of room hypotheses
and $m$ is the number of object hypotheses. However, not all cuboid and room subsets
are compatible with each other. We use simple 3D spatial reasoning to enforce the
volumetric constraints described above (see Figure 4.2(g)). We therefore test each
room-object pair and each object-object pair for their 3D volumetric compatibility,
so that we allow only the scene configurations which have no room-object and no
object-object volumetric intersection.
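The pairwise compatibility tests can be turned into an exhaustive enumeration of valid scene configurations, as sketched below. This brute-force version is only meant to make the combinatorics concrete for small inputs; it is not the search procedure used in this work. The callables `room_ok` and `pair_ok` are hypothetical stand-ins for the room-object and object-object volumetric tests.

```python
from itertools import combinations

def valid_configurations(n_rooms, n_objects, room_ok, pair_ok):
    """List every (room, object subset) with no volumetric conflict.

    room_ok(r, o): True when object o fits inside room r.
    pair_ok(a, b): True when objects a and b do not intersect.
    """
    configs = []
    for r in range(n_rooms):
        # Objects that conflict with this room can never be selected.
        usable = [o for o in range(n_objects) if room_ok(r, o)]
        for k in range(len(usable) + 1):
            for subset in combinations(usable, k):
                if all(pair_ok(a, b) for a, b in combinations(subset, 2)):
                    configs.append((r, subset))
    return configs
```

Because only pairwise tests are needed, the two compatibility relations can be computed once and cached before any search over configurations begins.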
Finally, we evaluate the scene configurations created by combinations of room
hypotheses and object hypotheses to find the scene configuration that best matches
the image (Figure 4.2(i)). As the scene configuration is a structured variable, we
use a variant of the structured prediction algorithm [65] to learn the cost function.
We use two sources of surface geometry, orientation map [45] and geometric con-
text [36], which serve as features in the cost function. Since it is computationally ex-
pensive to test exhaustive combinations of scene configurations in practice, we use
beam-search to sample the scene configurations that are volumetrically-compatible
(Section 4.5.1).
4.3 Estimating Surface Geometry
We would like to predict the local surface geometry of the regions in the image.
A scene configuration should satisfy local surface geometry extracted from image
cues and should satisfy the 3D volumetric constraints. The estimated surface geom-
etry is therefore used as features in a scoring function that evaluates a given scene
configuration.
For estimating surface geometry we use two methods: the line-sweeping algo-
rithm [45] and a multiple segmentation classifier [36]. The line-sweeping algorithm
takes line segments as input and predicts an orientation map in which regions are
classified as surfaces into one of the three possible orientations. Figure 4.2(d) shows
an example of an orientation map. The region estimated as horizontal surface is
colored in red, and vertical surfaces are colored in green and blue, corresponding
to the associated vanishing point. This orientation map is used to evaluate scene
configuration hypotheses. The multiple segmentation classifier [36] takes the full
image as input, uses image features, such as combinations of color and texture, and
predicts geometric context represented by surface geometry labels for each super-
pixel (floor, ceiling, vertical (left, center, right), solid, and porous regions). Similar
to orientation maps, the predicted labels are used to evaluate scene configuration
hypotheses.
4.4 Generating Scene Configuration Hypothesis
Given the local surface geometry and the oriented line segments extracted from the
image, we now create multiple hypotheses for the spatial layout of the room
and for the layout of objects in the room. These hypotheses are then combined to produce
scene configurations in which all objects occupy exclusive 3D volumes
and lie inside the free space of the room defined by the walls.
4.4.1 Generating Room Hypotheses
A room hypothesis encodes the position and orientation of walls, floor, and ceil-
ing. In this work, we represent a room hypothesis by a parametric box model [31].
Room hypotheses are generated from line segments in a way similar to the method
described in the previous chapter. There, we examined exhaustive
combinations of line segments and checked which of the resulting combinations
define physically valid room models. Here, we instead sample random tuples of line
segments that define the boundaries of the parametric box. Only the minimum
number of line segments needed to define the parametric room model is sampled.
Figure 4.2(e) shows examples of generated room hypotheses.
4.4.2 Generating Object Hypotheses
Our goal is to extract the 3D geometry of the clutter objects to perform 3D spatial
reasoning. Estimating precise 3D models of objects from a single image is an ex-
tremely difficult problem and probably requires recognition of object classes such
as couches and tables. However, our goal is to perform coarse 3D reasoning about
the spatial layout of rooms and spatial layout of objects in the room. We only need
to model a subset of objects in the scene to provide enough constraints for volu-
metric reasoning. Therefore, we adopt a coarse 3D model of objects in the scene
and model each object volume as a cuboid. We found that parameterizing objects as
cuboids provides a good approximation to the occupied volume in man-made en-
vironments. Furthermore, by modeling objects by a parametric model of a cuboid,
we can determine the location and dimensions in 3D up to scale, which allows
volumetric reasoning about the 3D interaction between objects and the room.
We generate object hypotheses from the orientation map described above. Fig-
ure 4.4(a)(b) shows an example scene and its orientation map. The three colors
represent the three possible plane orientations used in the orientation map. We can
see from the figure that the distribution of surfaces on the objects estimated by the
orientation map suggests the presence of a cuboidal object. Figure 4.4(c) shows a
pair of regions which can potentially form a convex edge if the regions represent
the visible surfaces on a cuboidal object.
We test all pairs of regions in the orientation map to check whether they can
form convex edges. This is achieved by checking the estimated orientation of the
regions and the spatial location of the regions with respect to the vanishing points.
If the region pair can form a convex corner, we utilize these regions to form an
object hypothesis. To generate a cuboidal object hypothesis from pairs of regions,
we first fit tight bounding quadrilaterals (Figure 4.4(c)) to each region in the pair
and then sample all combinations of three points out of the eight vertices on the
two quadrilaterals, which do not lie on a plane. Three is the minimum number of
points (with (x, y) coordinates) that have enough information to define a cuboid
projected onto a 2D image plane, which has five degrees of freedom. We can then
hypothesize a cuboid whose corner best approximates the three points. Figure 4.4(d)
shows a sample of a cuboidal object hypothesis generated from the given orientation
map.
4.4.3 Volumetric Compatibility of Scene Configuration
Given a room configuration and a set of candidate objects, a key operation is to eval-
uate whether the resulting combination satisfies the three fundamental volumetric
compatibility constraints described in Section 4.2. The problem of estimating the
three dimensional layout of a scene from a single image is inherently ambiguous
because any measurement from a single image can only be determined up to scale.
In order to test the volumetric compatibility of room-object hypothesis pairs and
object-object hypothesis pairs, we make the assumption that all objects rest on the
floor. This assumption fixes the scale ambiguity between room and object hypothe-
ses and allows us to reason about their 3D location.
To test whether an object is contained within the free space of a room, we check
whether the projection of the bottom surface of the object onto the image is
completely contained within the projection of the floor surface of the room. If the
projection of the bottom surface of the object is not completely within the floor surface,
the corresponding 3D object model must be protruding into the walls of the room.
Figure 4.3(a) shows an example of an incompatible room-object pair.
Figure 4.3: Examples of volumetric constraint violation. (a) Containment constraint.
(b) Spatial exclusion constraint.
Similarly, to test whether the volume occupied by two objects is exclusive, we
assume that the two objects rest on the same floor plane and we compare the
projections of their bottom surfaces onto the image. If there is any overlap between the
projections of the bottom surfaces of the two object hypotheses, they
occupy intersecting volumes in 3D. Figure 4.3(b) shows an example of an
incompatible object-object pair.
Figure 4.4: Object hypothesis generation: we use the orientation maps to generate object
hypotheses by finding convex edges. (a) Image. (b) Orientation map. (c) Convex edge check.
(d) Hypothesized cuboid.
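The containment and spatial exclusion tests of Section 4.4.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the projected bottom surfaces and the projected floor are assumed to be convex polygons given as lists of (x, y) image coordinates, and the function names `contained` and `overlaps` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def point_in_convex(p: Point, poly: List[Point]) -> bool:
    """True if p lies inside or on a convex polygon (either winding order)."""
    n = len(poly)
    signs = [_cross(poly[i], poly[(i + 1) % n], p) for i in range(n)]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)


def contained(footprint: List[Point], floor: List[Point]) -> bool:
    """Containment test: every footprint vertex lies within the convex floor."""
    return all(point_in_convex(v, floor) for v in footprint)


def overlaps(a: List[Point], b: List[Point]) -> bool:
    """Exclusion test via the separating axis theorem for convex polygons.

    Footprints that only touch along an edge are not counted as overlapping.
    """
    for poly in (a, b):
        n = len(poly)
        for i in range(n):
            ex = poly[(i + 1) % n][0] - poly[i][0]
            ey = poly[(i + 1) % n][1] - poly[i][1]
            nx, ny = -ey, ex  # normal of the current edge
            proj_a = [nx * x + ny * y for x, y in a]
            proj_b = [nx * x + ny * y for x, y in b]
            if max(proj_a) <= min(proj_b) or max(proj_b) <= min(proj_a):
                return False  # a separating axis exists
    return True


# Hypothetical image-plane polygons (assumed values for illustration).
floor = [(0, 0), (10, 0), (8, 6), (2, 6)]
obj_a = [(3, 1), (5, 1), (5, 3), (3, 3)]
obj_b = [(4, 2), (7, 2), (7, 4), (4, 4)]
obj_c = [(9, 4), (11, 4), (11, 5), (9, 5)]
obj_d = [(6, 1), (7, 1), (7, 1.5), (6, 1.5)]
```

With these values, `obj_a` passes the containment test while `obj_c` protrudes beyond the floor, and `obj_a` and `obj_b` violate spatial exclusion while `obj_a` and `obj_d` do not.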
4.5 Evaluating Scene Configurations
4.5.1 Inference
Given an image $x$, a set of room hypotheses $\{r_1, r_2, \ldots, r_n\}$, and a set of object
hypotheses $\{o_1, o_2, \ldots, o_m\}$, our goal is to find the best scene configuration
$y = (y_r, y_o)$, where $y_r = (y_r^1, \ldots, y_r^n)$ and $y_o = (y_o^1, \ldots, y_o^m)$.
Here $y_r^i = 1$ if room hypothesis
$r_i$ is used in the scene configuration and $y_r^i = 0$ otherwise, and $y_o^i = 1$ if object
hypothesis $o_i$ is present in the scene configuration and $y_o^i = 0$ otherwise. Note that
$\sum_i y_r^i = 1$, as only one room hypothesis is needed to define the scene configuration.
Suppose that we are given a function $f(x, y)$ that returns a score for $y$. Finding
the best scene configuration $y^* = \arg\max_y f(x, y)$ by testing all possible
scene configurations requires $n \cdot 2^m$ evaluations of the score function. We resort to
using beam search (a fixed-width search tree) to keep the computation manageable
by avoiding evaluating all scene configurations.
In the first level of the search tree, scene configurations with a room hypothesis
and no object hypothesis are evaluated. In the following levels, an object hypothesis
is added to its parent configuration and the configuration is evaluated. The top $k_l$
nodes with the highest score are added to the search tree as child nodes, where
$k_l$ is a pre-determined beam width for level $l$ (see footnote 2). The search is continued for a fixed
number of levels or until no cubes that are compatible with existing configurations
can be added. After the search tree has been explored, the best scoring node in the
tree is returned as the best scene configuration.
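The search procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the implementation used in this work: hypotheses are referred to by integer index, and the callables `score` and `compatible`, as well as the toy values at the end, are hypothetical stand-ins for the score function and the volumetric compatibility test.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# A configuration is (room index, set of object indices).
Config = Tuple[int, FrozenSet[int]]


def beam_search(
    n_rooms: int,
    n_objects: int,
    score: Callable[[Config], float],
    compatible: Callable[[Config, int], bool],
    beam_widths: Sequence[int] = (100, 5, 2, 1),
) -> Config:
    """Return the best-scoring configuration kept in the search tree."""
    # First level: each room hypothesis with no object hypothesis.
    level: List[Config] = [(r, frozenset()) for r in range(n_rooms)]
    level = sorted(level, key=score, reverse=True)[: beam_widths[0]]
    best = level[0]
    # Following levels: add one compatible object to each parent.
    for width in beam_widths[1:]:
        children = set()
        for room, objs in level:
            for o in range(n_objects):
                if o not in objs and compatible((room, objs), o):
                    children.add((room, objs | {o}))
        if not children:
            break  # no compatible object can be added
        level = sorted(children, key=score, reverse=True)[:width]
        best = max([best] + level, key=score)
    return best


# Toy example (assumed values): 2 rooms, 3 objects, objects 0 and 2 collide.
ROOM_SCORE = [1.0, 2.0]
OBJECT_GAIN = [0.5, -1.0, 0.25]


def toy_score(cfg: Config) -> float:
    room, objs = cfg
    return ROOM_SCORE[room] + sum(OBJECT_GAIN[o] for o in objs)


def toy_compatible(cfg: Config, new_obj: int) -> bool:
    _, objs = cfg
    return not (({new_obj} | objs) >= {0, 2})


best = beam_search(2, 3, toy_score, toy_compatible)
```

In the toy example the search selects the second room with only object 0, since adding object 2 would violate compatibility and adding object 1 lowers the score.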
4.5.2 Learning the Score Function
We set the score function to $f(x, y) = w^T \psi(x, y) + w_\phi^T \phi(y)$, where $\psi(x, y)$ is a
feature vector for a given image $x$ that measures the compatibility of the scene
configuration $y$ with the estimated surface geometry, and $\phi(y)$ is the penalty term for
incompatible configurations, which penalizes the room and object configurations that
violate volumetric constraints.
Footnote 2: We set $k_l$ to (100, 5, 2, 1), with a maximum of 4 levels. The results were not
sensitive to these parameters.
We use a structured SVM [65] to learn the weight vector $w$. The weights are
learned by solving

$$\min_{w,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$
$$\text{s.t.} \quad w^T \psi(x_i, y_i) - w^T \psi(x_i, y) - w_\phi^T \phi(y) \ge \Delta(y_i, y) - \xi_i, \quad \forall i, \forall y,$$
$$\xi_i \ge 0, \quad \forall i,$$

where $x_i$ are the images, $y_i$ are the ground truth configurations, $\xi_i$ are slack variables,
and $\Delta(y_i, y)$ is the loss function that measures the error of configuration $y$.
Tsochantaridis et al. [65] deal with the large number of constraints by iteratively adding the most
violated constraints. We simplify this by sampling a fixed number of configurations
for each training image, using the same beam search process used for inference,
and solving the resulting problem with quadratic programming.
Loss Function: The loss function $\Delta(y_i, y)$ is the percentage of pixels in the
entire image that have an incorrect label. For example, pixels that are labeled as left wall
when they actually belong to the center wall, or pixels labeled as object when they
actually belong to the floor, are counted as incorrectly labeled pixels. A wall is
labeled as center if its surface normal is within 45 degrees of the camera optical
axis, and as left or right otherwise.
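The pixel-based loss can be illustrated with a short sketch. It assumes, for illustration only, that the ground truth and the predicted configuration have each been rendered into a per-pixel label map of the same size (a list of rows of label strings). The function name `pixel_loss` is hypothetical, and the value is returned as a fraction rather than a percentage.

```python
def pixel_loss(truth, predicted):
    """Fraction of pixels whose predicted label differs from the ground truth."""
    total = 0
    wrong = 0
    for truth_row, pred_row in zip(truth, predicted):
        for t, p in zip(truth_row, pred_row):
            total += 1
            wrong += (t != p)
    return wrong / total


# Hypothetical 2 x 3 label maps: two of the six pixels disagree.
truth = [
    ["left wall", "center wall", "center wall"],
    ["floor", "floor", "object"],
]
predicted = [
    ["left wall", "left wall", "center wall"],
    ["floor", "object", "object"],
]
loss = pixel_loss(truth, predicted)
```

Here one center-wall pixel is labeled as left wall and one floor pixel is labeled as object, so the loss is 2/6.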
Feature Vector: The feature vector $\psi(x, y)$ is computed by measuring how well
each surface in the scene configuration $y$ is supported by the orientation map and
the geometric context. A feature is computed for each of the six surfaces in the
scene configuration (floor, left wall, center wall, right wall, ceiling, object) as the
relative area over which the orientation map or the geometric context correctly explains
the attribute of the surface. This results in a twelve-dimensional feature vector for a
given scene configuration. For example, the features for the floor surface in the scene
configuration are the relative area over which the orientation map predicts a
horizontal surface and the relative area over which the geometric context predicts a floor label.
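The structure of the twelve-dimensional feature vector can be sketched as follows. The sketch makes two assumptions that are not stated in the text: the relative area is normalized by the area of the surface in the configuration, and the agreement of the orientation map and of the geometric context with the surface has already been decided for each pixel. The function name `feature_vector` and the input format are hypothetical.

```python
SURFACES = ("floor", "left wall", "center wall", "right wall", "ceiling", "object")


def feature_vector(pixels):
    """Build a 12-D feature vector from per-pixel agreement flags.

    pixels: iterable of (surface, om_agrees, gc_agrees), where surface is the
    label assigned by the scene configuration and the two flags say whether
    the orientation map and the geometric context support that label.
    """
    area = {s: 0 for s in SURFACES}
    om_ok = {s: 0 for s in SURFACES}
    gc_ok = {s: 0 for s in SURFACES}
    for surface, om_agrees, gc_agrees in pixels:
        area[surface] += 1
        om_ok[surface] += bool(om_agrees)
        gc_ok[surface] += bool(gc_agrees)
    features = []
    for s in SURFACES:
        denom = area[s] or 1  # avoid division by zero for absent surfaces
        features += [om_ok[s] / denom, gc_ok[s] / denom]
    return features


# Hypothetical three-pixel example.
pixels = [
    ("floor", True, True),
    ("floor", True, False),
    ("ceiling", False, True),
]
psi = feature_vector(pixels)
```

Each surface contributes two consecutive entries, the first from the orientation map and the second from the geometric context.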
Volumetric Penalty: The penalty term $\phi(y)$ measures how much the volumetric
constraints are violated. (1) The first term, $\phi(y_r, y_o)$, measures the volumetric
intersection between the volume defined by the room walls and the objects. It penalizes
configurations in which object hypotheses lie outside the room volume, and
the penalty is proportional to the volume outside the room. (2) The second term,
$\sum_{i,j} \phi(y_o^i, y_o^j)$, measures the volume intersection between two objects $(i, j)$. The
penalty from this term is proportional to the overlap of the cubes projected on the
floor.
4.6 Experimental Results
We evaluated our 3D geometric reasoning approach on an indoor image dataset in-
troduced in [31]. The dataset consists of 314 images, and the ground-truth consists
of the marked spatial layout of the room and the clutter layouts. For our experi-
ments, we use the same training-test split as used in [31] (209 training and 105 test
images). We use training images to estimate the weight vector.
Qualitative Evaluation: Figure 4.5 illustrates the benefit of the 3D spatial reasoning
introduced in our approach. If no 3D clutter reasoning is used and the room box
is fitted to the orientation map and geometric context, the box gets fit to the object
surfaces, which leads to substantial error in the spatial layout estimation.
However, if we use 3D object reasoning, the walls get pushed due to the containment
constraint and the spatial layout estimation improves. We can also see from the
examples that extracting a subset of objects in the scene is enough for reasoning and
improving the spatial layout estimation. Figures 4.6 and 4.7 show more examples of
the spatial layout and the estimated clutter objects in the images. Additional results
are in the supplementary material.
Figure 4.5: Two qualitative examples showing how 3D volumetric reasoning aids estimation
of the spatial layout of the room. Columns: input image, room only, room and objects,
orientation map, geometric context.
Figure 4.6: Additional examples to show the performance on a wide variety of scenes.
Dotted lines represent the room estimate without object reasoning.
Figure 4.7: Failure examples. The first two examples are failure cases in which the cuboids
are either missed or estimated incorrectly. The last two failure cases are due to errors in
vanishing point estimation.
Method                  OM+GC    OM       GC
No object reasoning     18.6%    24.7%    22.7%
Volumetric reasoning    16.2%    19.5%    20.2%
Table 4.1: Percentage of pixels with incorrect estimate of room surfaces (lower is better).
The first row performs no reasoning about objects. The second row is our approach with 3D
volumetric reasoning of objects. Columns show the features that are used. OM: orientation
map from [45]. GC: geometric context from [36].
Quantitative Evaluation: We evaluate the performance of our approach in esti-
mating the spatial layout of the room. We use the pixel-based measure introduced
in [31] which counts the percentage of pixels on the room surfaces that disagree
with the ground truth. For comparison, we employ the simple multiple segmen-
tation classifier [36] and the recent approach introduced in [31] as baselines. The
images in the dataset have significant clutter; therefore, simple classification-based
approaches with no clutter reasoning perform poorly and have an error of 26.5%.
The state-of-the-art approach [31], which utilizes clutter reasoning in the image
plane, has an error of 21.2%. In contrast, our approach, which uses a parametric
model of clutter and simple 3D volumetric reasoning, outperforms both
approaches and has an error of 16.2%.
We also performed several experiments to measure the significance of each step
and feature in our approach. When we use only the surface layout estimates
from [36] as features of the cost function, our approach has an error rate of 20.2%,
whereas using only orientation maps as features yields an error rate of 19.5%. We
also tried several techniques for searching the space of hypotheses. With a greedy
approach (best cube added at each iteration), we
achieved an error rate of 19.2%, which shows that early commitment to partial
configurations leads to error and that a search strategy that allows late commitment, such as
beam search, should be used.
4.7 Conclusion
This chapter proposes the use of volumetric reasoning between objects and surfaces
of room layout to recover the spatial layout of a scene. By parametrically represent-
ing the 3D volume of objects and rooms, we can apply constraints for volumetric
reasoning, such as spatial exclusion and containment. Our experiments show that
volumetric reasoning improves the estimate of the room layout and provides a richer
interpretation about objects in the scene. The rich geometric information provided
by our method can provide crucial information for object recognition and eventually
aid in complete scene understanding.
Chapter 5
Detecting Objects Characterized by
Geometry
5.1 Introduction
Most successful object detection methods have focused on using the regularities
in the appearance of objects to identify them. These methods have had success
on objects that can be characterized by the regularities in their appearance in the
image, such as faces [56, 59, 68], pedestrians from a distance [10], and cars [59].
These methods are continuing to be applied to broader categories, such as bicycles,
televisions, potted plants, and so on, with varying degrees of success.
These methods rely on the fact that the appearance of objects in images is fairly
consistent. In the case of faces, the eye and mouth regions consistently have darker
intensity than the nose and cheek regions. By designing features that can capture those
consistent characteristics, vision researchers have been able to develop a face detector [68].
In the case of pedestrians, edge histograms have been found to be consistent
across different instances of pedestrians and were successfully used to develop a
pedestrian detector [10]. By continuing to improve ways to capture the consistencies
in appearance, we can expect these methods to improve for categories that have
strong consistencies in their appearance.
However, there is an entirely different class of objects, which do not have
strong characteristics and consistency in their appearance, but are instead easily
characterized by their geometry in the 3D world. Some examples of such objects are
doors, desks, computer monitors, beds, and chairs. For example, the characteristic
that defines a door is the fact that it is a rectangle that has the appropriate size and
location for a person to pass through, and that it is attached to a wall, so that
it serves its function as a passage to a different room in a building. Therefore, the
geometry of doors in 3D is very consistent, which makes it a desirable property
to use when developing a door detector. On the other hand, the appearance of doors
in an image varies to a greater degree due to the various colors and textures of doors,
posters attached to doors, and varying viewing angles, which lead to perspective
distortion.
Similarly, computer monitors are, appearance-wise, simply a rectangle, but they
can be characterized by the fact that they have the proper dimensions of a monitor,
usually about 19 inches diagonally, and that they are usually placed on a desk. The
dimensions of beds are also consistent and are standardized into twin, queen, king,
etc.
We would like to make the distinction between the class of objects that can be better
characterized by their appearance in the image and the class of objects that can be
better characterized by their geometry in 3D. We define the first class of objects as
“painted” objects and the second class of objects as “sculpted” objects.
We argue that, in order to build an object detector, a different approach must be
taken for painted objects and sculpted objects. Appearance-based detectors can
be expected to work well for detecting painted objects, since these objects have
consistent appearance. However, for sculpted objects, the detector must focus on the
geometric properties of the object, which are what best characterize the object.
The focus of our work will be on developing a detector for sculpted objects, as
most prior work so far on object recognition has focused on the consistency of ap-
pearance of painted objects. Most objects, however, will not belong exclusively to
one class but will have aspects of both classes. Ultimately, we believe that both
aspects of appearance and geometry are very important for building an object de-
tector and must be used in conjunction for best performance. But as an initial study,
it is worth focusing our attention on sculpted objects to learn about the potential of
using geometric properties for object detection.
The physical dimensions of an object are well-defined properties in the real world.
Therefore, the prior distribution of the physical dimensions
can be obtained by direct measurement in the real world, rather than being learned from
training images. One can also consult publicly available statistics of the physical
dimensions when available, as done by Hoiem et al. [35] for the height of people and
cars.
In this work, we have built a system that detects objects in conjunction with the
building structure in indoor environments. The goal of this work is to detect
common objects in indoor environments that are strongly characterized by their geometry,
such as doors, desks, and monitors, as well as to recover the structure of the building
interior. We examine the physical dimensions of objects to verify that they have the
correct size and location to be a certain object. We also examine the physical
relations between objects to make sure that they are physically and semantically
correct.
5.2 Related Work
Object detection has a long history in computer vision. The past decade was
particularly successful, and the field has matured enough to be of practical use for a few classes
of objects, such as faces [68] and pedestrians [10]. Such efforts are expanding to
more classes of objects [18, 17], driven by organized challenges, such as the
PASCAL challenge [14]. Such success has been based on methods that make use of
appearance features. But as reported in [14], some objects turn out to be
easier to detect than others, even when the same method is applied.
In this work, we argue that, although some objects are effectively characterized
by their appearance, there are classes of objects that are better characterized by their
geometry. This idea has indeed been explored in the past and was the primary
method for most of the history of computer vision, from the 1960s to the 1990s,
before the surge of appearance-based methods. Early geometry-based methods are
well summarized in the article by Mundy [50].
One of the earliest and most influential is the work on blocks world [54]. It
assumes that the world is composed of polyhedral components and
solves for the parameters of polyhedral models that fit edges. The work has been
extended by many researchers, especially in exploring constraints for labeling edges [27,
6, 37, 69, 48, 61]. These works were limited to either contrived scenes or ground
truth line drawing images, rather than real scenes, and the objects they considered
were artificial blocks and not realistic objects. Also, their focus was on recovering
the geometric structure of objects, rather than determining the semantic category.
Then, a group of works emerged that recognize objects by aligning manually
defined 3D object models to images [46, 4, 24, 1, 5, 38, 63]. Such methods
bypass the problem of grouping features and are robust to occlusion or missing
evidence. However, these methods eventually ran into the problem of ambiguity of
image features, so the focus of research shifted away from geometry and toward
methods that focus on learning statistical distributions of appearance.
Interesting recent work on chairs has been done in [23]. Chairs are a class
of objects that has been very difficult to detect because chairs can
vary so much in their appearance. Not only are there many types of chairs,
such as office chairs, dining chairs, couches, etc., but even within a single type of
chair, their appearance in the image can look drastically different from one another.
However, what is universally common among chairs is the fact that they have a
support surface for a person to sit on, usually at a fairly consistent height, and an
optional surface for back support. In fact, what defines a chair is its geometry.
In [23], chairs are detected by looking at occupied voxels and finding places where
people can sit. This work recognizes the fact that detecting some objects requires
examining not only their appearance but also their geometry.
5.3 Representation of objects and building structure
We have used simple three dimensional models to represent objects and the building
structure. We fit these models to a given image to estimate the location of objects
in the 3D world and the physical dimensions of objects.
Figure 5.1: Three common “sculpted” objects modeled using rectangles. (a) Desk
and computer monitor. (b) Doors.
5.3.1 Objects
We have used a simple geometric primitive, a rectangle, to represent three common
“sculpted” objects in indoor environments. The objects we have considered
are doors, desks, and computer monitors. The main rectangular surface of doors and
monitors is modeled with a vertical rectangle, and the top surface of the desk is
modeled with a horizontal rectangle. Figure 5.1 illustrates rectangles representing
the three objects.
5.3.2 Geometric Properties of Objects
In order to verify a candidate hypothesis of a “sculpted” object, we measure
its geometric properties. We consider both the geometric properties of the object
itself (self geometric properties), such as the width and length of the object, and the
geometric properties of the object in relation to other components in the scene
(relational geometric properties). Both self and relational geometric properties are used
to evaluate whether a given candidate corresponds to an actual object.
Self Geometric Properties
Self geometric properties describe the geometric properties relating to the object
itself. For rectangular objects, such as doors, desks, and monitors, the physical
dimensions are represented with two parameters, width and height. Other dimensions
can be used for objects modeled using different primitives. For “sculpted”
objects, we expect physical dimensions to be fairly consistent across different
instances of the same object category. Therefore, we expect them to be good features
to use for detecting objects.
We use a Gaussian distribution to model the probability of the parameters. The
distribution can be learned from a set of training data or can be collected from available
census data.
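A Gaussian model over the two self geometric parameters can be sketched as follows. The sketch assumes, for simplicity, that width and height are modeled as independent Gaussians, which the text does not specify. The prior values in `DOOR_PRIOR` are hypothetical placeholders in meters, not measured statistics, and the function names are illustrative.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def self_geometry_log_likelihood(width, height, prior):
    """Sum of log densities of width and height (independence is assumed)."""
    return (
        gaussian_log_pdf(width, *prior["width"])
        + gaussian_log_pdf(height, *prior["height"])
    )


# Hypothetical prior for doors: (mean, standard deviation) in meters.
DOOR_PRIOR = {"width": (0.9, 0.1), "height": (2.0, 0.1)}

door_like = self_geometry_log_likelihood(0.9, 2.0, DOOR_PRIOR)
monitor_like = self_geometry_log_likelihood(0.4, 0.3, DOOR_PRIOR)
```

A rectangle with door-like dimensions receives a much higher log likelihood under the door prior than a monitor-sized rectangle, which is the behavior that makes physical dimensions useful as a feature.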
Relational Geometric Properties
Relational geometric properties describe the geometric relation between objects and
other components in the scene. One subset of relational geometric properties has
already been considered in the previous chapter: the rules that
arise from the physical constraints of our world and that apply to all objects in
the world, regardless of their semantic object category. The rules considered in the
previous chapter say that multiple objects may not share the same volume in the
world, and we applied this principle to aid in scene understanding. By following
those principles, we were able to discover locations that are occupied by an object,
but we did not attempt to identify the category of those objects. In addition to the
rules due to physics that apply to all object categories, we expand this set of rules
to those that are specific to each object category (Figure 5.2). For example, the
height of the top surface of a desk relative to the floor is a characteristic feature
that defines a desk. A desk usually has a specific height, so that it is comfortable
for humans to work on its surface. Also, computer monitors are placed on a desk,
with the bottom edge of the monitor being slightly raised above the top surface of
the desk. And doors are on a wall with their bottom edge aligned with the floor.
These relational geometric properties are consistent across different instances of
the same semantic relationship, and therefore, we believe that they are useful features
for identifying objects, along with self geometric properties.
Figure 5.2: Relational geometric properties specific to object categories. Doors are on
walls. Desks are within the boundaries of a room and at a specific height from the floor.
Computer monitors are on desks.
5.4 Method Details
5.4.1 Creating rectangle hypotheses
The process to generate rectangle hypotheses is based on [51]. We create
rectangle hypotheses by connecting four line segments to define the four edges of a
rectangle. Line segments and their associated vanishing points are given as the
input. The associated vanishing point of a line segment determines its 3D orientation.
Given this input, we first decide on the orientation of the rectangle
that we want to generate. We can later repeat this process for rectangles of other
orientations. For a given orientation of a rectangle, we select the two sets of line
segments having the orientations of the edges of the rectangle. Each line segment is assigned
a unique ID. Then L-junctions (two edges) are formed from two line segments, one
of each orientation. L-junctions are categorized into four types: top-left,
top-right, bottom-left, and bottom-right (Figure 5.3). The notions of top/bottom and
left/right need not correspond to the actual meaning of top/bottom or left/right, as
long as they mark a direction and are consistent within the given image. The IDs of the line
segments forming each L-junction are also recorded.
Once all L-junctions of the four types are formed, we build rectangles by
tying L-junctions together. We first pick a type for the starting L-junction,
for example top-left, and progress in one direction, e.g., clockwise, to build all
U-junctions (three edges) of type left-top-right, and finally generate rectangles with
all four edges, left-top-right-bottom.
U-junctions are built by tying two L-junctions together. This is done efficiently
by looking at the type of the L-junction and the IDs of the line segments forming the
L-junctions. That is, a U-junction with left-top-right edges can be formed by
connecting two L-junctions of type top-left and top-right that share the same ID for the
top line segment. Finally, U-junctions are closed to form a full cycle of four edges
and four L-junctions. This is done by first adding another L-junction to each U-junction
to form a structure with three L-junctions (top-left, top-right, and bottom-right). The
structure is made up of four line segments, with two bridging the three L-junctions
and two with open ends. We then search for the final L-junction of bottom-left type
marked with the IDs of the two open-ended line segments. If such an L-junction
exists, then the structure with three L-junctions, closed with the final L-junction,
is added as a completed rectangle.
Figure 5.3: Four types of L-junctions. (a) Given a designation of “up” and “right” directions,
L-junctions are categorized into four types: top-left, top-right, bottom-left, and
bottom-right. (b)(c) L-junctions are formed by connecting two line segments. Depending on the
relative configuration of the two line segments, they form different types of L-junctions. (b) A
bottom-right junction. (c) A top-left junction.
Figure 5.4: Connection of L-junctions. The IDs of the line segments that form L-junctions
determine which L-junctions can be connected with each other. A top-left junction with ID (12,6)
can connect with a top-right junction with ID (12,17) but not with one with ID (3,17).
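The assembly of rectangles from L-junctions by matching line segment IDs can be sketched as follows. This is an illustrative sketch, not the implementation used in this work. It assumes that each L-junction is stored as (type, ID of the horizontal segment, ID of the vertical segment), following the ID pairs shown in Figure 5.4. The function name `build_rectangles` and line 20 in the example are hypothetical.

```python
from collections import defaultdict


def build_rectangles(junctions):
    """Return rectangles as (left, top, right, bottom) line segment IDs.

    junctions: iterable of (type, horizontal_id, vertical_id), where type is
    one of "top-left", "top-right", "bottom-right", "bottom-left".
    """
    by_type = defaultdict(set)
    for kind, horizontal, vertical in junctions:
        by_type[kind].add((horizontal, vertical))

    # Index top-right junctions by their top line and bottom-right junctions
    # by their right line, so that matching is a dictionary lookup.
    right_by_top = defaultdict(list)
    for top, right in by_type["top-right"]:
        right_by_top[top].append(right)
    bottom_by_right = defaultdict(list)
    for bottom, right in by_type["bottom-right"]:
        bottom_by_right[right].append(bottom)

    rectangles = []
    for top, left in sorted(by_type["top-left"]):
        for right in sorted(right_by_top[top]):  # U-junction left-top-right
            for bottom in sorted(bottom_by_right[right]):  # three L-junctions
                # Close the cycle with a bottom-left junction that carries
                # the IDs of the two open-ended segments.
                if (bottom, left) in by_type["bottom-left"]:
                    rectangles.append((left, top, right, bottom))
    return rectangles


# IDs from Figure 5.4, plus a hypothetical bottom line with ID 20.
junctions = [
    ("top-left", 12, 6),
    ("top-right", 12, 17),
    ("top-right", 3, 17),
    ("bottom-right", 20, 17),
    ("bottom-left", 20, 6),
]
rectangles = build_rectangles(junctions)
```

The top-right junction with ID (3,17) does not share a top line with any top-left junction, so it contributes no rectangle, and the remaining four junctions close into a single rectangle.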
5.4.2 Lifting Rectangle Hypotheses to 3D
Rectangle hypotheses generated from line segments have known orientation in 3D,
which is determined by the orientation of the four edges. However, the location of
the rectangle is not determined by the process of connecting line segments. The
location of a rectangle can be determined in two ways. The first method relates the
rectangle to the environment and requires only a single image as input. If we
already know the 3D structure of the room, and we know how the rectangle contacts
the room, we can infer the 3D location of the rectangle from the point where the
rectangle and the room are in contact. For example, if we know that an edge of a
desk contacts a wall, then the contact point on the desk can be assumed to have the
same 3D location as the corresponding point on the wall. We can then infer the 3D
coordinates of the rest of the rectangle. The second method is through independent
3D measurements, such as through a stereo camera, structure from motion, or a
depth camera. In this work, we use 3D point clouds obtained from a stereo camera
and use the points on the rectangle to determine the 3D position of the rectangle.
Keeping the known orientation of the rectangle, its translation is estimated as the
median of the translations of the points that fall on the rectangle when projected
onto the image. This method requires an additional means of measurement, but the
location of rectangles can be determined more reliably because it is estimated
independently of the environment.
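As an illustration of the second method, the following Python sketch shows one possible reading of the median estimate: since the orientation is fixed, the only unknown is the offset of the supporting plane along its normal, which is taken as the median over the 3D points that project inside the rectangle. The function names, the pinhole camera with a single focal length, and the assumption that the points are already expressed in camera coordinates and already selected as falling on the rectangle are all choices made for this example, not details taken from the method above.

```python
from statistics import median


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def estimate_plane_offset(points, normal):
    """Median offset, along the unit `normal`, of 3D points on the rectangle."""
    return median(dot(p, normal) for p in points)


def backproject(pixel, focal, center, normal, offset):
    """Intersect the viewing ray of `pixel` with the plane normal . X = offset."""
    ray = ((pixel[0] - center[0]) / focal, (pixel[1] - center[1]) / focal, 1.0)
    scale = offset / dot(normal, ray)
    return tuple(scale * r for r in ray)


# Hypothetical data: a fronto-parallel rectangle about 2 units from the camera,
# with one gross outlier in the stereo point cloud.
normal = (0.0, 0.0, 1.0)
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.1), (0.0, 1.0, 1.9),
          (1.0, 1.0, 2.0), (0.5, 0.5, 9.0)]
offset = estimate_plane_offset(points, normal)
corner = backproject((150.0, 50.0), 100.0, (50.0, 50.0), normal, offset)
print(offset, corner)
```

The median keeps the estimate at the dominant depth even when some points behind or in front of the rectangle are projected onto it, which is why it is preferred here over a mean.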
5.4.3 Creating Building Structure Hypotheses
We create building structure hypotheses by creating instances of the indoor Manhattan
models proposed in [45, 44]. The model assumes a single floor plane, a single
ceiling plane, and walls that are orthogonal to each other. Hypotheses can be created
from either a single image or from 3D measurements from a stereo camera. To
create hypotheses from a single image, we adopt the method directly from [45, 44].
This method samples line segments and connects them to form building models. To
create hypotheses from 3D measurements from a stereo camera, we first obtain the
3D orientation of major surfaces from vanishing points and then fit planes with fixed
orientation to 3D point clouds for potential walls, floor, and ceiling. Combinations
of walls, floor, and ceiling provide hypotheses for the entire building structure.
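The text above does not specify how planes with fixed orientation are fitted, so the sketch below shows only one simple possibility: with the normal fixed by a vanishing point, each 3D point votes for a plane offset along that normal, and offsets with enough support become candidate walls, floor, or ceiling. The function name `candidate_plane_offsets` and the parameters `bin_width` and `min_support` are hypothetical.

```python
from collections import defaultdict
from statistics import median


def candidate_plane_offsets(points, normal, bin_width=0.1, min_support=3):
    """Offsets d of planes normal . X = d supported by enough 3D points."""
    bins = defaultdict(list)
    for p in points:
        d = sum(a * b for a, b in zip(p, normal))
        bins[round(d / bin_width)].append(d)  # vote in a 1D histogram
    # Keep well-supported bins and refine each offset with the median.
    return sorted(median(ds) for ds in bins.values() if len(ds) >= min_support)


# Hypothetical point cloud: two opposite walls near x = 0 and x = 3,
# plus one stray point in the middle of the room.
normal = (1.0, 0.0, 0.0)
cloud = [(0.01, 0.5, 1.0), (0.02, 1.5, 2.0), (-0.01, 1.0, 0.5),
         (3.01, 0.2, 1.0), (3.0, 1.1, 2.2), (2.99, 0.8, 0.3),
         (1.5, 1.0, 1.0)]
wall_offsets = candidate_plane_offsets(cloud, normal)
print(wall_offsets)
```

Repeating this for each of the three dominant orientations and combining the resulting candidate planes would give a set of building structure hypotheses in the spirit of the description above.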
5.5 Results
We present results of our method on two sequences taken from indoor environ-
ments. We used video sequences taken with a stereo camera. We have applied a
stereo egomotion algorithm [2] to obtain sparse 3D point clouds and camera motion.
A single configuration is estimated for the entire sequence by collecting evidence
from all frames. Object hypotheses from each frame are projected onto other frames
in the sequence using camera motion recovered from the stereo system.
In the “Office sequence”, all four walls and the floor have been accurately
estimated, even though the walls and the floor are occluded by other objects. Two
desk surfaces and two monitors have also been detected. Examining the self geometric
properties of rectangle hypotheses ensures that the detected desks and monitors have
plausible dimensions. The relational constraint between monitor and desk rules out the
majority of monitor hypotheses that have the correct size to be a monitor but are not
supported by a desk. In the “Common area sequence”, again all walls, the floor, and
the ceiling have been accurately estimated. Relational geometric properties ensure
that detected doors lie on walls. However, geometric properties are not sufficient to
rule out rectangular structures caused by windows, which have the proper dimensions
to be doors; this results in false detections at the beginning of the sequence.
Figure 5.5: Result for estimating building structure and detecting doors, desks, and moni-
tors. (a) Office sequence. (b) Common area sequence.
Chapter 6
Conclusion
In this thesis, we have developed methods for scene understanding using three di-
mensional representation and reasoning. As our world is in three dimensions and is
made up of three dimensional components, modeling our world using a three
dimensional representation rules out invalid structures that can only exist in a
drawing on a 2D image and helps us keep the problem tractable. At the same time, the
resulting structure is guaranteed to be physically valid. Rules derived by careful
observation of the 3D representation allow us to perform 3D reasoning that makes
inference efficient and tractable. We have also considered the geometric relationships
among components in the scene. This ensures that the resulting configuration of
components is physically valid and improves the accuracy of the estimate. Finally,
we have demonstrated that, while some objects have proven to be effectively
characterized by their appearance, there are also classes of objects that can be better
characterized by their geometry. For such objects, we have demonstrated the use of
geometric features to detect and localize them.
6.1 Future Work
Our focus has been on indoor environments. In the future, it would be interesting
to see geometry-based methods applied to broader domains, such as outdoor
environments. Outdoor environments are not as highly structured as indoor environ-
ments, but similar ideas may be applied to outdoor scene understanding to produce
geometrically plausible estimates and to improve performance.
We have suggested using 3D geometry as the characterizing feature for certain
object classes. However, all objects lie somewhere on the spectrum of being well
characterized by their appearance and being well characterized by their geometry.
Therefore, we think that the next step is to fuse appearance-based methods and
3D-geometry-based methods to build object classifiers that apply to all classes of
objects. There is already work that incorporates 2D spatial factors into modeling
the appearance. It would be interesting to see detectors that consider three dimen-
sional spatial factors and appearance together.
To apply methods that make use of 3D geometry for scene understanding, it is
natural to use three dimensional measurements, rather than to be confined to using
only images. Multi-view methods have matured and can now produce accurate 3D
point clouds. Recently introduced low-cost depth cameras provide an easy way to
obtain accurate 3D point clouds. It would be interesting to use 3D geometry-based
methods with 3D measurements along with images to advance scene understanding.
Bibliography
[1] N. Ayache and O. Faugeras. Hyper: A new approach for the recognition
and positioning of two-dimensional objects. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 1986.
[2] H. Badino and T. Kanade. A head-wearable short-baseline stereo system for
the simultaneous estimation of structure and motion. In IAPR Conference on
Machine Vision Applications (MVA), Nara, Japan, 2011.
[3] Olga Barinova, Victor Lempitsky, Elena Tretiak, and Pushmeet Kohli.
Geometric image parsing in man-made environments. In European Conference
on Computer Vision, 2010.
[4] R. Bolles and R. Cain. Recognizing and locating partially visible objects:
The local-feature-focus method. International Journal of Robotics Research,
1982.
[5] R. Bolles and R. Horaud. 3dpo: A three-dimensional part orientation system.
International Journal of Robotics Research, 1986.
[6] M. B. Clowes. On seeing things. In Artificial Intelligence, 1971.
[7] Microsoft Corp. Redmond WA. Kinect for Xbox 360.
[8] J.M. Coughlan and A.L. Yuille. Manhattan world: Compass direction from a
single image by bayesian inference. In Proceedings ICCV, 1999.
[9] A. Criminisi, I. Reid, and A. Zisserman. Single view metrology. In Proc.
International Conference on Computer Vision (ICCV), 1999.
[10] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human
detection. In Proceedings of IEEE Conference Computer Vision and Pattern
Recognition, 2005.
[11] Erick Delage, Honglak Lee, and Andrew Y. Ng. Automatic single-image 3d
reconstructions of indoor manhattan world scenes. In ISRR, 2005.
[12] Erick Delage, Honglak Lee, and Andrew Y. Ng. A dynamic bayesian net-
work model for autonomous 3d reconstruction from a single indoor image. In
CVPR, 2006.
[13] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class
object layout. In ICCV, 2009.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman.
The pascal visual object classes (voc) challenge. International Journal of
Computer Vision, 88(2):303–338, June 2010.
[15] Olivier Faugeras, Quang-Tuan Luong, and Theodore Papadopoulo. The ge-
ometry of multiple images. MIT Press, 2001.
[16] Olivier Faugeras and Quang-Tuan Luong. The Geometry of Multiple Images.
The MIT Press, 2001.
[17] P. Felzenszwalb, R. Girshick, and D. McAllester. Cascade object detection
with deformable part models. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2010.
[18] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained,
multiscale, deformable part model. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2008.
[19] Alex Flint, Christopher Mei, David Murray, and Ian Reid. A dynamic pro-
gramming approach to reconstructing building interiors. 2010.
[20] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski.
Manhattan-world stereo. In CVPR, 2009.
[21] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski.
Reconstructing building interiors from images. In ICCV, 2009.
[22] Stephen Gould, Richard Fulton, and Daphne Koller. Decomposing a scene
into geometric and semantically consistent regions. In ICCV, 2009.
[23] H. Grabner, J. Gall, and L. van Gool. What makes a chair a chair? In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR’11), 2011.
[24] W. E. L. Grimson and T. Lozano-Perez. Model-based recognition and local-
ization from sparse range or tactile data. International Journal of Robotics
Research, 1984.
[25] Abhinav Gupta, Alexei Efros, and Martial Hebert. Blocks world revisited:
Image understanding using qualitative geometry and mechanics. In European
Conference on Computer Vision (ECCV), 2010.
[26] Abhinav Gupta, Scott Satkin, Alexei A. Efros, and Martial Hebert. From
3d scene geometry to human workspace. In Computer Vision and Pattern
Recognition (CVPR), 2011.
[27] A. Guzman. Decomposition of a visual scene into three-dimensional bodies.
In Proceedings of Fall Joint Computer Conference, 1968.
[28] F. Han and S.C. Zhu. Bottom-up/top-down image parsing by attribute graph
grammar. In Proc. Int’l Conf. on Computer Vision (ICCV), 2005.
[29] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer
vision. Cambridge University Press, 2003.
[30] X. He, R. S. Zemel, and M. A. Carreira-Perpinan. Multiscale conditional
random fields for image labeling. In CVPR, 2004.
[31] Varsha Hedau, Derek Hoiem, and David Forsyth. Recovering the spatial
layout of cluttered rooms. In International Conference on Computer Vision
(ICCV), 2009.
[32] Varsha Hedau, Derek Hoiem, and David Forsyth. Thinking inside the box:
Using appearance models and context based on room geometry. In European
Conference on Computer Vision (ECCV), 2010.
[33] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models:
Combining models for holistic scene understanding. In NIPS, 2008.
[34] Derek Hoiem, Alexei Efros, and Martial Hebert. Geometric context from
a single image. In Proceedings of IEEE Conference Computer Vision and
Pattern Recognition, 2005.
[35] Derek Hoiem, Alexei Efros, and Martial Hebert. Putting objects in perspec-
tive. In CVPR, 2006.
[36] Derek Hoiem, Alexei Efros, and Martial Hebert. Recovering surface lay-
out from an image. International Journal on Computer Vision (IJCV), 75(1),
2007.
[37] D. A. Huffman. Impossible objects as nonsense sentences. In Machine Intel-
ligence, 1971.
[38] D. P. Huttenlocher and S. Ullman. Object recognition using alignment. In
Proceedings of the First International Conference on Computer Vision, 1987.
[39] T. Kanade. A theory of origami world. In Artificial Intelligence, 1980.
[40] J. Kosecka and W. Zhang. Video compass. In Proceedings of European
Conference on Computer Vision, pages 657–673, 2002.
[41] J. Kosecka and W. Zhang. Extraction, matching and pose recovery based
on dominant rectangular structures. Computer Vision Image Understanding,
2005.
[42] P. D. Kovesi. MATLAB and Octave functions for computer vi-
sion and image processing. School of Computer Science & Soft-
ware Engineering, The University of Western Australia. Available from:
<http://www.csse.uwa.edu.au/∼pk/research/matlabfns/>.
[43] Sanjiv Kumar and Martial Hebert. Discriminative fields for modeling spatial
dependencies in natural images. In Proc. Advances in Neural Information
Processing Systems (NIPS), December 2003.
[44] David C. Lee, Abhinav Gupta, Martial Hebert, and Takeo Kanade. Estimating
spatial layout of rooms using volumetric reasoning about objects and surfaces.
In Advances in Neural Information Processing Systems 24 (NIPS), 2010.
[45] David Changsoo Lee, Martial Hebert, and Takeo Kanade. Geometric reason-
ing for single image structure recovery. In IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR), June 2009.
[46] David Lowe. Perceptual organization and visual recognition. Kluwer Aca-
demic Publishers, 1985.
[47] Yi Ma, S. Shankar Sastry, Jana Kosecka, and Stefano Soatto. An invitation
to 3-d vision: From images to geometric models. Interdisciplinary Applied
Mathematics Series. Springer-Verlag New York, 2003.
[48] A. K. Mackworth. Interpreting pictures of polyhedral scenes. In Artificial
Intelligence, 1973.
[49] B. Micusik, H. Wildenauer, and J. Kosecka. Detection and matching of recti-
linear structures. In IEEE Conference on Computer Vision and Pattern Recog-
nition, 2008.
[50] Joseph L. Mundy. Object recognition in the geometric era: A retrospective.
In Toward Category-Level Object Recognition, volume 4170 of Lecture Notes
in Computer Science, pages 3–29. Springer, 2006.
[51] Ana Cris Murillo, J. Kosecka, J. J. Guerrero, and C. Sagues. Visual door
detection integrating appearance and shape cues. Robotics and Autonomous
Systems, 2008.
[52] Vladimir Nedovic, Arnold W.M. Smeulders, and Andre Redert. Depth
information by stage classification. In Proc. International Conference on Computer
Vision, 2007.
[53] Y. Ohta, T. Kanade, and T. Sakai. An analysis system for scenes containing
objects with substructures. IJCPR, pages 752-754, 1978.
[54] Lawrence G. Roberts. Machine perception of three-dimensional solids.
OEOIP, pages 159-197, 1965.
[55] C. Rother. A new approach for vanishing point detection in architectural en-
vironments. In BMVC, pages 382–391, 2000.
[56] Henry Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based
face detection. In Computer Vision and Pattern Recognition ’96, June 1996.
[57] Ashutosh Saxena, Sung H. Chung, and Andrew Y. Ng. Learning depth
from single monocular images. In Neural Information Processing Systems
(NIPS), 2005.
[58] Ashutosh Saxena, Min Sun, and Andrew Y. Ng. Make3d: Learning 3d scene
structure from a single still image. IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI), 2008.
[59] Henry Schneiderman and Takeo Kanade. A statistical model for 3d object
detection applied to faces and cars. In IEEE Conference on Computer Vision
and Pattern Recognition. IEEE, June 2000.
[60] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint
appearance, shape and context modeling for multi-class object recognition and
segmentation. In ECCV, 2006.
[61] K. Sugihara. A necessary and sufficient condition for a picture to represent a
polyhedral scene. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, PAMI, 1984.
[62] J.-P. Tardif. Non-iterative approach for fast and accurate vanishing point
detection. In 12th IEEE International Conference on Computer Vision, 2009.
[63] D. W. Thompson and J. L. Mundy. Three-dimensional model matching from
an unconstrained viewpoint. In Proceedings of the International Conference
on Robotics and Automation, 1987.
[64] Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Contextual
models for object detection using boosted random fields. In NIPS, 2005.
[65] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin
Altun. Large margin methods for structured and interdependent output
variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
[66] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR,
2008.
[67] F.A. van den Heuvel. Vanishing point detection for architectural photogram-
metry.
[68] Paul Viola and Michael Jones. Robust real-time face detection. In IEEE
International Conference on Computer Vision, 2001.
[69] D. A. Waltz. Generating semantic descriptions from line drawings of scenes
with shadows. Technical report, MIT, 1972.
[70] Huayan Wang, Stephen Gould, and Daphne Koller. Discriminative learning
with latent variables for cluttered indoor scene understanding. In European
Conference on Computer Vision (ECCV), 2010.
[71] Stella Yu, Hao Zhang, and Jitendra Malik. Inferring spatial layout from a sin-
gle image via depth-ordered grouping. In IEEE Computer Society Workshop
on Perceptual Organization in Computer Vision, 2008.