Three Dimensional Representation and Reasoning for Indoor Scene

Understanding

David C. Lee

August 2011

Department of Electrical and Computer Engineering

Carnegie Mellon University

Pittsburgh, Pennsylvania 15213

Thesis Committee:

Takeo Kanade, Chair

Martial Hebert

Alexei A. Efros

Marios Savvides

Jitendra Malik, UC Berkeley

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer

Engineering

© 2011 by David C. Lee. All rights reserved.

Abstract

When addressing the problem of scene understanding from a single image, we

want our system to understand not only where objects are in the image, but also

where they are in the 3D world. Segmenting and labeling regions only in the 2D

image plane does not achieve this goal. We need a representation that inherently

encodes the 3D properties of the scene. In addition to understanding the location in

3D, we also want our system to make use of physical knowledge about valid config-

urations of our world by rejecting configurations that violate physical constraints,

such as two objects occupying the same volume. 3D geometric properties can also

aid in detecting and identifying certain classes of objects that are well characterized

by their geometry. In this thesis, we will demonstrate the benefits of using 3D rep-

resentation for indoor scene understanding. We will show that the use of models

provides a natural way to represent objects in 3D and inject knowledge we have

about the world to perform geometric reasoning.

Acknowledgements

I would first like to thank my advisor Professor Takeo Kanade for his support

and guidance throughout my PhD study. He has provided practical guidance and

has steered me to pursue bigger goals. I would also like to thank Professor Martial

Hebert for his advice and encouragement. I thank friends at CMU for making

my stay in Pittsburgh fun and memorable. Finally, I thank my family and my wife

SooYoon for their endless support and love.

The work presented in this thesis was supported in part by NSF Grant EEEC-

0540865, ONR MURI Grant N00014-07-1-0747, NSF Grant IIS-0905402, and

ONR Grant N000141010766.

Contents

1 Introduction

1.1 Challenges

1.2 Our Approach

2 Related Work

3 Representation of the Structure of Building Interiors

3.1 Introduction

3.2 Prior Work

3.3 Indoor World Model

3.4 Geometric Reasoning

3.5 Finding Building Structure

3.5.1 Line Segment Detection and Vanishing Point Estimation

3.5.2 Generating Building Hypotheses

3.5.3 Evaluating Building Hypotheses

3.5.4 Converting Building Models to 3D

3.6 Experiments

3.7 Populating the Scene Frame with Objects

3.8 Conclusion

4 Volumetric Reasoning for Structure and Objects

4.1 Introduction

4.1.1 Background

4.2 Overview

4.3 Estimating Surface Geometry

4.4 Generating Scene Configuration Hypothesis

4.4.1 Generating Room Hypotheses

4.4.2 Generating Object Hypotheses

4.4.3 Volumetric Compatibility of Scene Configuration

4.5 Evaluating Scene Configurations

4.5.1 Inference

4.5.2 Learning the Score Function

4.6 Experimental Results

4.7 Conclusion

5 Detecting Objects Characterized by Geometry

5.1 Introduction

5.2 Related Work

5.3 Representation of objects and building structure

5.3.1 Objects

5.3.2 Geometric Properties of Objects

5.4 Method Details

5.4.1 Creating rectangle hypotheses

5.4.2 Lifting Rectangle Hypotheses to 3D

5.4.3 Creating Building Structure Hypotheses

5.5 Results

6 Conclusion

6.1 Future Work

List of Figures

1.1 An example of a complex indoor environment

1.2 The Penrose triangle, an example of a physically impossible object.

1.3 An example of an invalid configuration, where an object protrudes into a wall of a room.

3.1 Line segments. Can you recognize the building structure? Can you find doors?

3.2 Levels of completeness of line drawings. Left: Complete. Middle: Missing. Not all structure edges in the real world are present in the image. Right: Missing and Spurious. Not all lines in the image are structure edges or even part of the target structure.

3.3 Examples of building models under Indoor World model. All building models are built by connecting three basic types of corners. Top left: concave(-) corner. Top middle: convex(+) corner. Top right: occluding(>) corner. Bottom row: combinations of corners.

3.4 Regions divided by vanishing lines and restrictions on types of corners. Top: Line drawing, vanishing points, and vanishing lines. Bottom: Types of possible corners in each of the three regions. Enclosed in small boxes are depictions of corners as they would appear in the image, and next to them are the top-down views of each corner. In each of the three regions, four types of corners can exist: one convex(+), one concave(-), and two occluding(>) corners.

3.5 Solid lines are the minimal set of lines needed to define a corner. Three lines are needed for convex(+) and concave(-) corners. Four lines are needed for occluding(>) corners.

3.6 Generating hypotheses. Left: The process of a hypothesis being generated by four line segments. Right: A sample of generated building hypotheses.

3.7 Line segments and Orientation map. (a) Line segments, vanishing points, and vanishing lines. (b) Orientation map. Line segments and regions are colored according to their orientation. (Best viewed in color)

3.8 The shaded area denotes the sweep $S(l, v_y, \alpha)$ of line $l$ towards vanishing point $v_y$ by amount $\alpha$, and it potentially supports the region to be orthogonal to $v_x$ and $v_y$.

3.9 3D models

3.10 Percentage of pixels with correct orientation.

3.11 Comparison of floor boundary error

3.12 Examples of doors and people in a scene frame.

3.13 Examples (Best viewed in color)

3.14 Examples with occluding objects. An unobstructed view of the ceiling-wall boundary helps find the underlying building structure. (Best viewed in color)

3.15 Failure examples. (Best viewed in color)

3.16 Examples of images downloaded from the web. Top two rows: Success. Bottom two rows: Failure. (Best viewed in color)

4.1 (a) Input image. (b) Estimate of the spatial layout of the room without object reasoning. Colors represent the output of the surface geometry by [36]. Green: floor, red: left wall, yellow: center wall, cyan: right wall. (c) Evidence from object region removed. (d) Spatial layout with 2D object reasoning. (e) Object fitted with 3D parametric model. (f) Spatial layout with 3D volumetric reasoning. The wall is pushed by the volume occupied by the object.

4.2 Overview of our approach for estimating the spatial layout of the room and the objects.

4.3 Examples of volumetric constraint violation.

4.4 Object hypothesis generation: we use the orientation maps to generate object hypotheses by finding convex edges.

4.5 Two qualitative examples showing how 3D volumetric reasoning aids estimation of the spatial layout of the room.

4.6 Additional examples to show the performance on a wide variety of scenes. Dotted lines represent the room estimate without object reasoning.

4.7 Failure examples. The first two examples are the failure cases when the cuboids are either missed or estimated wrong. The last two failure cases are due to errors in vanishing point estimation.

5.1 Three common “Sculpted Objects” objects modeled using rectangles. (a) Desk and Computer monitor. (b) Doors.

5.2 Relational geometric properties specific to object categories

5.3 Four types of L-junctions. (a) Given a designation of “up” and “right” direction, L-junctions are categorized into four types: top-left, top-right, bottom-left, and bottom-right. (b)(c) L-junctions are formed by connecting two line segments. Depending on the relative configuration of two line segments, they form different types of L-junctions. (b) A bottom-right junction. (c) A top-left junction.

5.4 Connection of L-junctions. The IDs of the line segments that form L-junctions determine which L-junctions can be connected with each other. A top-left type junction with ID (12,6) can connect with a top-right junction with ID (12,17) but not with ID (3,17).

5.5 Result for estimating building structure and detecting doors, desks, and monitors. (a) Office sequence. (b) Common area sequence.

List of Tables

4.1 Percentage of pixels with correct estimate of room surfaces. First row performs no reasoning about objects. Second row is our approach with 3D volumetric reasoning of objects. Columns show the features that are used. OM: Orientation map from [45]. GC: Geometric context from [36].

Chapter 1

Introduction

Seeing is a major part of our daily lives. We visually perceive the scene that sur-

rounds us at almost every moment that we are awake. Our perception is not limited

to detecting objects of interest, such as faces, people, cars, chairs, desks, etc. It

includes understanding the entire scene and perceiving the environment, such as

knowing that we are on a busy street and cars are on a road, or that we are in an

office and there are desks and chairs. Furthermore, our understanding is not limited

to understanding just the semantic category of the environment and objects. Our

understanding extends to the underlying 3D geometry of the scene and we know

where things are in the real 3D world, which allows us to navigate in our 3D world

and perform daily tasks, such as approaching and sitting on a chair.

Our goal in this thesis is to create computer vision methods that mimic the ability

of humans to understand a scene in 3D. We would like our system to understand

the 3D spatial layout of its environment and locate the 3D position of objects in the

environment. Such a system could allow robots to navigate and manipulate objects in

an environment. It could also be used as an assistive device for people with visual

impairment to help perceive their surroundings.

For scene understanding, we believe that it is crucial to attempt to understand it

in three dimensions, rather than to recognize just the semantic category of the scene

and objects in the scene, or to detect objects just in the image and not in the 3D

world. Three dimensional understanding is necessary for a robot to navigate or to

assist people in perceiving their environment. However, aside from these applications, we

believe that it is better to understand in 3D, even from a pure computer vision perspective,

because 1) there are objects better defined by their 3D geometric properties

rather than appearance properties, and 2) 3D geometry provides strong constraints

on the size and relative location of various components in the scene.

Our goal of three dimensional scene understanding also differs from pure 3D reconstruction

methods, such as stereo vision, structure from motion, or depth cameras.

Such methods provide only 3D point clouds of a scene and are not capable

of assigning semantic meaning to those point clouds, such as floor, walls, desk,

and so on. Our goal of scene understanding provides a higher level semantic under-

standing of an environment.

1.1 Challenges

One of the major challenges in scene understanding comes from the loss of three

dimensional information as a result of the perspective projection of the 3D world

onto the 2D image plane. Thus, to recover the three dimensional information from

an image, one must make use of the rules and regularities that exist in the world to

resolve the inherent ambiguity caused by perspective projection.
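
As a minimal illustration of this ambiguity, consider the following Python sketch. It is not part of our method; it assumes an ideal pinhole camera at the origin looking along the Z axis, and the function name `project` and the focal length value are hypothetical. Every 3D point along the same viewing ray maps to the same image location, so depth cannot be recovered from a single projected point without additional knowledge about the world.

```python
import numpy as np

def project(point_3d, focal_length=1.0):
    """Ideal pinhole projection of a 3D point (X, Y, Z), Z > 0, onto the image plane."""
    x, y, z = point_3d
    return np.array([focal_length * x / z, focal_length * y / z])

# Two points on the same viewing ray: the second is three times farther away.
near = np.array([1.0, 0.5, 2.0])
far = 3.0 * near

# Both land on the same image location, so the image alone cannot tell them apart.
print(project(near), project(far))
assert np.allclose(project(near), project(far))
```

Scaling a point by any positive factor leaves its projection unchanged, which is why prior rules and regularities are needed to select one interpretation among the many 3D scenes consistent with an image.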

Figure 1.1: An example of a complex indoor environment

Our focus in this thesis is on discovering rules that every physical object must

obey in our world and applying the rules to guide us in scene understanding. For

example, Figures 1.2 and 1.3 show examples in which such rules are violated. They

are the Penrose triangle (Fig. 1.2) and a scene where multiple objects occupy the

same volume (Fig. 1.3). The Penrose triangle is an example of an object that in itself

cannot be realized in our world. The second example is a scene in which individual

components of the scene, i.e. the room and the object, are physically valid but their

relation prevents the configuration from being realized in our world. By explicitly

ruling out such configurations that do not exist in our world, we can make the task

of understanding the scene easier.
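
The second kind of rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the reasoning procedure developed later in the thesis: the room and the objects are assumed to be axis-aligned boxes in a common 3D frame, and the names `Box`, `volumes_overlap`, `contained_in`, and `is_physically_valid` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by its minimum and maximum corners (hypothetical type)."""
    lo: Vec3
    hi: Vec3

def volumes_overlap(a: Box, b: Box) -> bool:
    """True if the interiors of two boxes intersect; touching faces do not count."""
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(3))

def contained_in(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely within `outer`, e.g. an object inside a room."""
    return all(outer.lo[i] <= inner.lo[i] and inner.hi[i] <= outer.hi[i] for i in range(3))

def is_physically_valid(room: Box, objects: List[Box]) -> bool:
    """Reject configurations where an object leaves the room or two objects collide."""
    if not all(contained_in(obj, room) for obj in objects):
        return False
    return not any(
        volumes_overlap(objects[i], objects[j])
        for i in range(len(objects))
        for j in range(i + 1, len(objects))
    )

room = Box((0, 0, 0), (4, 3, 5))
desk = Box((0, 0, 1), (1.5, 0.8, 2))
cabinet = Box((3.5, 0, 1), (4.5, 2, 2))  # protrudes 0.5 units through a wall

print(is_physically_valid(room, [desk]))           # True
print(is_physically_valid(room, [desk, cabinet]))  # False
```

A check of this kind lets a search procedure discard a hypothesis such as the one in Figure 1.3 before any image evidence is evaluated.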

Another challenge is in choosing the right model to represent the 3D scene. Our

goal is to understand both the semantic category and the 3D geometry of components

in a scene. There has been a recent surge in work on scene understanding [60, 66, 33, 22].

Most of this work represents objects in the image as segmented

regions with associated category labels. These methods can tell what objects

are in the scene, but are unable to tell where those objects are in the 3D scene.

Instead, if the models we use to represent objects are 3D models, we can both detect

and localize the object simultaneously and achieve our goal of 3D scene understanding.

However, choosing the right model to represent the 3D scene is not an

easy problem. For example, representing the environment using dense 3D point

clouds or 3D polygons will suffice in providing accurate geometric structure of the

scene but will be unable to assign semantic meaning to the structure. Also, 3D point

clouds and polygons have the potential to represent the given scene to a very high

level of detail, but they are very hard to fit robustly to the scene with limited input such

as just a single image of the scene.

Figure 1.2: The Penrose triangle, an example of a physically impossible object.

Figure 1.3: An example of an invalid configuration, where an object protrudes into a wall of a room.

Therefore, we need a model that can represent both the semantic category and

the 3D structure, while striking the right balance between model complexity and

robustness to fitting and between generalizability and adherence to common environments.

In this thesis, we propose a model to represent the structure of building

interiors that can represent the 3D structure of the scene and identify the major sur-

faces, such as floor, walls, and ceiling, and is easy to manage in the 2D image space.

We also propose the use of simple geometric primitives, such as rectangles

and cuboids, to represent common objects found in indoor environments.

1.2 Our Approach

Our goal is to understand a scene, given an image acquired by a camera. We would

like to build computer algorithms that can recover the structure of building interiors

in 3D given a single image. In addition to the structure of building interiors, we also

detect common indoor objects, such as doors, desks, and computer monitors.

Our approach towards scene understanding is to use 3D representation and rea-

soning. We carefully make observations about our physical world and then decide

on the representation that is suitable to model our target environment. We then dis-

cover rules about the geometric properties of indoor environments, which objects in

the real world must satisfy. We consider both rules about individual parameterized

models, described in Chapter 3, and rules among different objects in the scene,

described in Chapter 4. Such rules allow us to limit search to geometrically valid

configurations, resulting in an improved estimate of the scene, due to the smaller search

space, while guaranteeing that the estimated configuration is always physically valid.

Finally, we extend the idea of 3D representation to detecting the identity of the ob-

jects by recognizing object categories that are better characterized by their geometry

than their appearance, described in Chapter 5.

We limit our target environment to man-made indoor environments, as we spend

a major part of our days indoors and indoor scene understanding has huge implications

for robots and assistive technology. In addition, indoor environments are

highly structured, so it is easy to represent components in the scene using parametric

models and easy to discover and apply geometric constraints.

The following are the key contributions of this thesis:

• Estimation of structure of building interiors.

– Model to represent building interiors and geometric reasoning to rule

out invalid structures

– Method to estimate local surface orientation in Manhattan environments.

• Detection of objects in indoor environments

– Reasoning about occupied volume of objects and building structures

– Use of three dimensional geometric properties as the main characteriz-

ing feature to identify objects

Chapter 2

Related Work

3D scene understanding is one of the most important problems in computer vision

and has received much attention from many researchers. It is related to many dif-

ferent subfields in computer vision. Our work has been influenced by and is built

upon prior work. In this chapter, we introduce related work and place our

work in its context.

Scene understanding involves understanding the overall scene and the various

components in an image. In order to understand the components in an image, many

have utilized the relationship among the components in a scene. Ohta et al. [53]

have modeled the relationships of properties among substructures in a scene. More

recently, many researchers have applied machine learning to model the relationships among

various objects, such as a computer mouse being next to a keyboard, and used that

information to detect objects together [64, 43, 13, 30]. The relationships that they

considered were two dimensional, such as cars appearing above the road

in the image. While two dimensional relationships are useful, there are cases when

two dimensional relationships cannot correctly model a scene. For example, when

a car in the foreground is occluding the road behind it, the road appears to be above

the car in the image.

More explicit modeling of the three dimensional relationship between components

has been done recently by Hoiem et al. [35]. They have modeled the pitch angle

of the camera while simultaneously detecting objects. This simple model

puts a constraint on the size and position of objects, when the sizes of objects in the real

world are known and the objects are assumed to rest on the ground.

Recovering the 3D structure from images is also a major part of our goal of 3D

scene understanding. There are a number of methods to recover 3D structure from

multiple images, such as structure-from-motion and stereo [15, 29, 47]. The theo-

retical aspects of such methods have matured and modern stereo systems are able

to produce point clouds to a high level of accuracy [2]. But such methods relying on

multiple images have a fundamental limitation imposed by the distance between

camera positions at the time of acquiring the multiple images, that is, the baseline distance,

in the case of stereo, or the distance traveled, in the case of structure-from-motion.
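
To make the baseline limitation concrete, recall the standard relation for a rectified stereo pair, $Z = fB/d$, where $Z$ is depth, $f$ the focal length in pixels, $B$ the baseline, and $d$ the disparity. A fixed disparity error then produces a depth error that grows roughly as $Z^2/(fB)$. The sketch below evaluates this first-order estimate; the function names and the numeric values of the focal length and baseline are illustrative assumptions and are not taken from the cited systems.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px, baseline_m, depth_m, disparity_error_px=1.0):
    """First-order depth uncertainty: |dZ/dd| * error = Z^2 / (f * B) * error."""
    return depth_m ** 2 * disparity_error_px / (focal_px * baseline_m)

# Assumed camera: 500 px focal length and a 10 cm baseline.
focal_px, baseline_m = 500.0, 0.1
for depth_m in (1.0, 5.0, 20.0):
    err = depth_error(focal_px, baseline_m, depth_m)
    print(f"depth {depth_m:5.1f} m -> about {err:.2f} m error per pixel of disparity error")
```

Because the error grows quadratically with depth and only inversely with the baseline, distant structure is poorly constrained unless the cameras are far apart.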

There have been some recent developments that have shown that 3D structure can

be estimated from a single image [34, 57, 58]. Such methods rely on the fact that

there is a pattern in the appearance of image patches that depends on surface normal

or the distance from the camera. For example, the appearance of the ground is

different from buildings or the sky, and the texture of tree leaves is different when

viewed from nearby or far away. They have used machine learning to learn the

distribution of appearance features and map appearance to depth or surface normals,

and eventually recover the underlying 3D structure. These methods do not have

the fundamental limiting factor of baseline distance, so they work for scenes with

greater depth. However, the fidelity of the reconstruction cannot be guaranteed and

these methods do not generalize to scenes that greatly differ from the scenes used for

training.

The most recent breakthrough in obtaining 3D structure is with dedicated hardware

that measures depth directly [7]. Such depth cameras have existed in the past,

but the cost has dropped drastically in the past year to consumer level, so that it is

now possible to use them for practical applications.

The three kinds of methods mentioned for 3D reconstruction, multiple-view,

single-view, and depth cameras, estimate only the 3D structure and are unable to

assign any semantic meaning to the recovered scene. Our goal of understanding a given

scene includes both semantic understanding and geometric reconstruction of

a scene.

For man-made environments, a useful subclass of scenes has been proposed

called “Manhattan World” [8]. It assumes that the world is made up of planar

surfaces that have three mutually orthogonal orientations. Such an assumption

holds for many man-made environments, both indoors and outdoors, and has proved

to be useful. In a Manhattan World, there are three vanishing points, which are

points in the image to which parallel lines in 3D converge. Estimating vanishing

points [55, 67, 40, 62, 3] allows us to infer the 3D orientation of parallel lines and

provides useful information for later processes. A number of works have detected

rectangular structures [41, 49, 28, 51, 71] by benefiting from vanishing point esti-

mation and the Manhattan World assumption. There are also multiple-view meth-

ods that make use of Manhattan assumption to achieve impressive results [20, 21].
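
The definition of a vanishing point can be illustrated with a short sketch. This is not one of the cited estimation methods: it assumes an ideal pinhole camera with unit focal length and noise-free image points, and the helper names `project`, `line_through`, and `intersect` are hypothetical. Two parallel 3D lines are projected into the image, and the intersection of their images is computed with homogeneous coordinates, where the line through two points and the intersection of two lines are both given by cross products.

```python
import numpy as np

def project(point_3d, focal_length=1.0):
    """Ideal pinhole projection of a 3D point (X, Y, Z), Z > 0."""
    p = np.asarray(point_3d, dtype=float)
    return focal_length * p[:2] / p[2]

def line_through(p, q):
    """Homogeneous image line through two 2D points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersect(l1, l2, eps=1e-12):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    v = np.cross(l1, l2)
    if abs(v[2]) < eps:
        return None
    return v[:2] / v[2]

# Two parallel 3D lines sharing the direction (1, 0, 1).
direction = np.array([1.0, 0.0, 1.0])
a0 = np.array([0.0, -1.0, 2.0])
b0 = np.array([0.0, 1.0, 3.0])

l1 = line_through(project(a0), project(a0 + 2.0 * direction))
l2 = line_through(project(b0), project(b0 + 2.0 * direction))
vp = intersect(l1, l2)

# The images of the two lines meet at the projection of their shared 3D direction.
print(vp, project(direction))
assert np.allclose(vp, project(direction))
```

With real images the detected segments are noisy, so vanishing points are estimated from many segments rather than from a single pair as in this idealized example.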

A subclass of “Manhattan World”, called “Indoor Manhattan World”, has

recently been explored, both by the work in this thesis and by others working

at around the same time. Our work [45] first proposed a subclass of “Manhattan

World” by adding an additional constraint to the Manhattan World that there are at most

two horizontal surfaces, the floor and the ceiling. Such a constraint allowed us to

build a model that can represent most indoor building structures. A slightly simpler

model was proposed by [31] that represents rooms by boxes. Since then, many

works have adapted the model to estimate the structure of indoor environments [44,

19, 32, 70, 26].

At the object level, the past decade was particularly successful for object detectors,

which have matured enough to be of practical use for a few classes of objects,

such as faces [68] and pedestrians [10]. Such efforts are expanding to more classes

of objects [18, 17], driven by organized challenges, such as the PASCAL challenge [14].

Such success has been based on methods that make use of appearance

features. But as reported in [14], some objects turn out to be harder to detect than

others, even when the same appearance-based method has been applied.

In contrast to recent appearance-based object detection methods, geometry-based

methods were explored in the past and were the primary approach for

most of the history of computer vision, from the 1960s to the 1990s, before the surge of

appearance-based methods. Early geometry-based methods are well summarized in

the article by Mundy [50].

One of the earliest and most influential is the work on blocks world [54]. It assumes

that the world is composed of polyhedral components and

solves for the parameters of polyhedral models to fit edges. The work has been extended

by many researchers, especially in exploring constraints for labeling edges [27,

6, 37, 69, 48, 61]. These works were limited to either contrived scenes or ground

truth line drawing images, rather than real scenes, and the objects they considered

were artificial blocks and not realistic objects. Also, their focus was on recovering

the geometric structure of objects, rather than determining the semantic category.

A group of works has emerged that recognize objects by aligning manually defined

3D object models to images [46, 4, 24, 1, 5, 38, 63]. Such methods bypass

the problem of grouping of features and are robust to occlusion or missing evidence.

However, these methods eventually ran into the problem of ambiguity of

image features, so the focus of research shifted away from geometry toward

methods that learn the statistical distribution of appearance.

Our work tries to make use of 3D geometry and three dimensional reasoning

at all levels of scene understanding: to represent the global structure, to rule out

physically invalid configurations, and to detect objects. This is the main motivation

of our work.

Chapter 3

Representation of the Structure of

Building Interiors

We study the problem of generating plausible interpretations of a scene from a

collection of line segments automatically extracted from a single indoor image.

We show that we can recognize the three dimensional structure of the interior of a

building, even in the presence of occluding objects. Several physically valid struc-

ture hypotheses are proposed by geometric reasoning and verified to find the best

fitting model to line segments, which is then converted to a full 3D model. Our ex-

periments demonstrate that our structure recovery from line segments is comparable

with methods using full image appearance. Our approach shows how a set of rules

describing geometric constraints between groups of segments can be used to prune

scene interpretation hypotheses and to generate the most plausible interpretation.

3.1 Introduction

It is easy for us to recognize the building structure in Figure 3.1, as well as locate

a few doors. However, automatic recognition of structure from a collection of line

segments is challenging, as not all lines defining the building structure are perfectly

detected by low level image processing. To further complicate the problem, extra

edges may lie on surfaces of walls or even on objects that are not part of the target

structure (Figure 3.2). We can still interpret the collection of line segments because

1) we perform geometric reasoning and only consider physically plausible interpre-

tations, 2) we have the ability to look globally at the overall structure, and 3) we

have prior knowledge on how the world, in our case the interior of a building, is

structured.

As images are projections of the real world, it is desirable to interpret them only

in ways which can be realized in the real world. Geometric inference, when jointly

done with semantic labeling, may be more demanding, but it may significantly

reduce the problem space and make the problem, in fact, easier.

In this work, we tackle the problem of interpreting a collection of line segments to

recognize the structure of buildings. We search for building models that translate to

physically plausible three dimensional building models. We perform geometric rea-

soning to generate many physically valid structure hypotheses from line segments.

Each hypothesis is tested to find the one that best matches the collection of line

segments. We have also done preliminary experiments to detect objects, using the

recovered structure as a “scene frame”, which provides geometric context to objects

in the scene.

Figure 3.1: Line segments. Can you recognize the building structure? Can you find doors?

Figure 3.2: Levels of completeness of line drawings. Left: Complete. Middle: Missing.

Not all structure edges in the real world are present in the image. Right: Missing and

Spurious. Not all lines in the image are structure edges or even part of the target structure.

3.2 Prior Work

Line drawings have been studied from the early days of computer vision. Guzman [27]

was the first to interpret line drawings to separate a collection of polyhedral

objects into parts. Huffman [37] and Clowes [6] came up with a formal scheme

of labeling lines into convex, concave, and occluding for polyhedral objects, with

which 3D description of objects can be recovered and impossible objects can be re-

jected. Mackworth [48] introduced the concept of gradient space and surface based

constraints. Waltz [69] expanded the problem by allowing line drawings to include

shadows, cracks, and missing edges (Figure 3.2). Kanade [39] dealt with “origami

world”, which includes hollow shells and planar sheets, and utilized heuristics, such

as assuming that lines parallel in the image are parallel in space. Sugihara [61] provided an algebraic

optimization approach for interpreting line drawings. However, these approaches

were limited to synthetic line drawings and were not applied to real images.

Kosecka’s group has published a number of papers on interpreting images of the Manhattan world

using information from line segments. Kosecka and Wei [40] developed a method

to recover vanishing points and camera parameters from a single image by us-

ing line segments found in Manhattan structures. Using the recovered vanishing

points, rectangular surfaces aligned with major orientations were detected by Wei

and Kosecka [41] and more recently by Micusik et al. [49]. Han and Zhu [28]

have also worked on finding rectangles aligned with vanishing points from line segments.

They used top-down grammars, which helped find rectangles forming

regular patterns, such as grid or box patterns. However, these approaches operate

directly in 2D image space (except when multiple images were used) and do not

attempt to extract three dimensional information from a single image.

A number of papers address the problem of recovering three dimensional struc-

ture from a single image. Three dimensional information can be extracted from a

single image when there is a reference in the image [9]. A commonly used refer-

ence is the ground plane. Hoiem et al. [34] and Delage et al. [12] take a two-step

approach for recovering 3D structure of outdoor images and indoor images respec-

tively: 1) estimate image region orientation (e.g., ground, vertical) using statistical

methods on image properties, such as color, texture, edge orientation, position in

image, etc. 2) “pop-up” vertical regions by “folding” along the crease between

ground and vertical regions. Saxena et al. have taken a different approach by esti-

mating absolute depth directly from image properties [57], and smoothly connect-

ing regions under weak assumptions, such as connectivity or coplanarity, without

the explicit assumption of a ground plane [58].

An interesting observation was made by Nedovic et al. [52] that a typical scene

can be categorized into a limited number of categories of 3D scene geometry, which

they call “stages”. Categories of stages include sky+ground, box, corner, and per-

son+background, and the stage information can potentially serve as a guide for a

more complete depth estimation or a more detailed scene understanding.

3.3 Indoor World Model

Most indoor environments satisfy the Manhattan World assumption [8], i.e., most

planes lie in one of three mutually orthogonal orientations. In addition, indoor envi-

ronments usually have a single floor plane and a single ceiling plane with constant

ceiling height. Combining the “Manhattan World” and “single-floor single-ceiling”

models, we propose the “Indoor World” model as a useful approximation for

indoor scenes.

This world model applies to most indoor environments and has a number of de-

sirable properties. First of all, it is easy to represent a physically valid model of a

scene in two dimensional image space, which can be effortlessly translated into a

three dimensional model. By geometric reasoning on the configuration of edges, we

can represent a scene structure in two dimensions that encodes a physically valid

Figure 3.3: Examples of building models under Indoor World model. All building models

are built by connecting three basic types of corners. Top left: concave(-) corner. Top

middle: convex(+) corner. Top right: occluding(>) corner. Bottom row: combinations of

corners.

three dimensional structure. Examples of such representation of scenes are depicted

in Figure 3.3.

Another desirable property is the symmetry that it introduces between the shape

of the ceiling and the floor. Building models under this assumption have sym-

metric floor and ceiling shape. Evidence to infer building structure from a single

image mostly comes from the position of boundaries between planes, but floor-

wall boundaries are often occluded by objects such as desks, chairs, and bookcases,

as shown in Figure 3.14. Even in those cases, ceiling-wall boundaries are rarely

occluded, so observing ceiling-wall boundaries and assuming symmetry between

them allows us to infer the location of floor-wall boundaries.

3.4 Geometric Reasoning

As the world is made up of solid objects, projections of the world onto an image

obey a set of rules. In particular, projections of buildings under the Indoor World

assumption are geometrically constrained by a small set of rules defined on connec-

tion of walls, which we define as corners. An indoor scene can be fully represented

by corners, so geometric constraints on corners will guarantee the entire structure to

be valid.

There are three types of corners: convex(+), concave(-), and occluding(>). A

convex(+) or concave(-) corner is formed when two walls meet at one place in 3D

space and an occluding(>) corner is formed when one wall is in front of another

wall but appears to be adjacent in the image. The type and position of a corner are

constrained depending on where the corner is in the image.

The simplest constraint on a corner is that it should consist of two junctions, one

above the horizon and one below the horizon. This rule holds because the camera

itself is between the floor and the ceiling. Regions divided by vertical vanishing

lines also create constraints. In each of the three regions divided by two vertical

vanishing lines, only a total of four types of corners can exist, as illustrated in Fig-

ure 3.4. These rules are derived from facts about the physical world and geometry,

such as, the camera must be in an empty quadrant of a wall in order for it to be able

to observe the corner, and walls should have non-zero thickness.
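The horizon rule can be checked directly in image coordinates. The following is a minimal sketch, assuming the horizon is the line through the two horizontal vanishing points and that both are finite; the function names are illustrative and do not come from the thesis.

```python
def line_through(p, q):
    """Homogeneous line (a, b, c) through two 2D points given as (x, y)."""
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)


def side_of_line(line, pt):
    """Signed value whose sign tells on which side of the line pt lies."""
    a, b, c = line
    return a * pt[0] + b * pt[1] + c


def corner_straddles_horizon(junction_a, junction_b, vp_h1, vp_h2):
    """True if the two junctions lie strictly on opposite sides of the horizon.

    vp_h1 and vp_h2 are the two horizontal vanishing points (assumed finite).
    """
    horizon = line_through(vp_h1, vp_h2)
    return side_of_line(horizon, junction_a) * side_of_line(horizon, junction_b) < 0
```

A corner candidate whose two junctions fall on the same side of the horizon can be discarded before any 3D reasoning is done.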

These constraints are simple to adhere to, even at an early stage of inference when

no consideration of the 3D coordinates has been made. Also, they can be applied

only to local and primitive corner structures, even when no consideration about the

global structure of the scene has been made. Yet, performing geometric reasoning

Figure 3.4: Regions divided by vanishing lines and restrictions on types of corners. Top:

Line drawing, vanishing points, and vanishing lines. Bottom: Types of possible corners

in each of the three regions. Enclosed in small boxes are depictions of corners as they

would appear in the image, and next to them are top-down views of each corner. In each of

the three regions, four types of corners can exist: one convex(+), one concave(-), and two

occluding(>) corners.

according to these constraints will guarantee that our entire building model encodes

a valid model, which can be easily converted to a valid 3D model without ambiguity.

3.5 Finding Building Structure

Finding the building structure is done in three steps: line segments and vanishing

points are found, many plausible building model hypotheses are created, and each

hypothesis is tested against an orientation map, which is a map of local belief of re-

gion orientations, to find the best matching hypothesis. Each step will be explained

in detail in the following sections.

3.5.1 Line Segment Detection and Vanishing Point Estimation

We extract line segments using the Matlab toolbox by Kovesi [42], which runs

Canny edge detector, links edge pixels, and fits line segments. We then recover

vanishing points from these line segments.

From the three vanishing points, we can recover the orientation of the three axes

of the building in camera coordinates using the formulas in the Appendix. This allows us to

reconstruct an accurate 3D model, even when none of the camera axes are aligned

with world coordinates.

We loosely follow Rother [55] to find three orthogonal vanishing points. Two

pairs of lines are randomly sampled in RANSAC fashion and the intersection of

each pair of lines generates a candidate vanishing point. Orthogonality of the two

vanishing points is verified using the formulas in the Appendix and the third vanishing

point is computed to be orthogonal to the two vanishing points. Then the three

candidates are evaluated using the cost function proposed in [55]. Finally, the x,

y coordinates of the best RANSAC solution are fine tuned using non-linear opti-

mization (Matlab fminsearch) with the same cost function. To ensure orthogonality

under optimization, vanishing points are translated into a rotation matrix, which

can then be parameterized with three unbounded parameters using Rodrigues’ for-

mula [16]. The highly non-convex nature of the cost function is not a big issue, as

the RANSAC solution was already close to the true solution.
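The candidate-generation step of the RANSAC loop can be sketched with homogeneous coordinates. This is a simplified illustration, not the thesis implementation: it assumes a pinhole camera K = [[f, 0, cx], [0, f, cy], [0, 0, 1]], the tolerance `tol` is an arbitrary placeholder, and the cost function of [55] and the fminsearch refinement are not reproduced.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)


def segment_line(seg):
    """Homogeneous line through a segment given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return cross((x1, y1, 1.0), (x2, y2, 1.0))


def candidate_vp(seg_a, seg_b):
    """Homogeneous intersection of the lines supporting two segments."""
    return cross(segment_line(seg_a), segment_line(seg_b))


def direction(vp, f, cx, cy):
    """Unit 3D direction K^-1 v for a homogeneous vanishing point v."""
    x, y, w = vp
    return normalize(((x - cx * w) / f, (y - cy * w) / f, w))


def third_direction(d1, d2, tol=0.05):
    """Direction orthogonal to d1 and d2, or None if they are not nearly orthogonal."""
    if abs(dot(d1, d2)) > tol:
        return None
    return normalize(cross(d1, d2))
```

Working with homogeneous vanishing points avoids special cases when a pair of sampled lines is parallel in the image, since the intersection is then a point at infinity rather than an error.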

For uncalibrated images with no available camera intrinsic parameters, three

pairs of lines are sampled to create a proposal, and orthogonality is loosely en-

forced by constraining three vanishing points to be apart from each other. Once

three vanishing points are found in image space, the focal length of the camera can

be recovered by finding a focal length that makes the angles exactly 90 degrees.
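For a single pair of vanishing points, the 90-degree condition has a closed form. The sketch below assumes square pixels, zero skew, and a known principal point c (for example the image center), none of which is stated in the text; under those assumptions, orthogonality of the two directions gives (v1 - c) · (v2 - c) + f² = 0. With three vanishing points, the estimates from the three pairs could be combined.

```python
import math


def focal_from_vps(v1, v2, principal_point):
    """Focal length in pixels from two finite, mutually orthogonal vanishing points.

    Returns None when the pair is inconsistent with orthogonality
    (the dot product about the principal point is not negative).
    """
    cx, cy = principal_point
    s = (v1[0] - cx) * (v2[0] - cx) + (v1[1] - cy) * (v2[1] - cy)
    if s >= 0:
        return None
    return math.sqrt(-s)
```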

In practice, this method returned vanishing points within a few pixels of the true

vanishing points for all 102 test images when camera parameters were available,

and 40 out of 44 images when camera parameters were not available. It failed when

there were no lines in one of the three directions, or when many lines were not in the

principal directions.

3.5.2 Generating Building Hypotheses

For this and the following section, we define “orientation of a line segment” to

be the orientation of the line in the world, which can be estimated by the vanishing

point that lies on the extension of the line segment in the image. Similarly, “parallel”

line segments means parallel in the world. “Orientation of a surface” is defined as

the normal orientation of the surface in the world and “pixel orientation” as the

orientation of the surface projected to the pixel.

Building models can be generated by connecting line segments to create corners,

and connecting corners to create building models. A corner consists of five lines,

but not all five lines need to be present to define a corner. Concave(-) and convex(+)

corners need three lines, and occluding(>) corners need four lines to be defined

Figure 3.5: Solid lines are the minimal set of lines needed to define a corner. Three lines

are needed for convex(+) and concave(-) corners. Four lines are needed for occluding(>)

corners.

(Figure 3.5). A new corner is proposed when a minimal set of lines defines a corner,

while obeying the constraints on corners described in Section 3.4.

The process of generating hypotheses is illustrated in Figure 3.6. We start by

creating building hypotheses with zero corners, i.e., scenes with just one wall. Two

parallel line segments, one above the horizon and one below the horizon, are

extended to the image boundaries to define the floor-wall and ceiling-wall

boundaries of a wall. Next, we search for line segments that can be extended to “attach” to

existing walls to propose a new corner. Note that an existing wall already defines

two lines, so only one additional line needs to be added to propose a concave(-) or

a convex(+) corner, and two for an occluding(>) corner. By repeatedly attaching

more corners to an existing structure, we can create a scene with many corners.

This process is described in Algorithm 1.
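The control flow of Algorithm 1 can be sketched as follows. The geometric tests are passed in as callables (`is_above_horizon`, `is_below_horizon`, `overlaps`, `propose_corners`); these names and the tuple representation of a hypothesis are assumptions made for illustration, and the actual corner constraints of Section 3.4 would live inside `propose_corners`.

```python
def generate_hypotheses(segments, max_corners, is_above_horizon,
                        is_below_horizon, overlaps, propose_corners):
    """Grow building hypotheses one corner at a time.

    A hypothesis is a tuple: the initial (ceiling_line, floor_line) pair
    followed by the corners attached so far.
    """
    # H0: scenes with a single wall and no corners.
    levels = [[]]
    for li in segments:
        for lj in segments:
            if is_above_horizon(li) and is_below_horizon(lj) and overlaps(li, lj):
                levels[0].append(((li, lj),))

    # Hk: attach one more valid corner to every hypothesis in H(k-1).
    for k in range(1, max_corners + 1):
        current = []
        for h in levels[k - 1]:
            for corner in propose_corners(h, segments):
                current.append(h + (corner,))
        levels.append(current)

    # H is the union of all levels.
    return [h for level in levels for h in level]
```

Because each level is built only from the previous one, hypotheses that cannot be extended simply stop growing, while all shorter hypotheses remain in the returned set.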

3.5.3 Evaluating Building Hypotheses

We test all building hypotheses to find the best fitting hypothesis to a given col-

lection of line segments. This is done by evaluating the fitness of hypotheses to

an orientation map (Figure 3.7), which is a map that expresses the local belief of

region orientations computed from line segments. The fitness of a hypothesis to an

Figure 3.6: Generating hypotheses. Left: The process of a hypothesis being generated by

four line segments. Right: A sample of generated building hypotheses.

orientation map is defined as the total number of pixels at which the orientation

encoded by the hypothesis agrees with that given by the orientation map. The

hypothesis with the largest fitness is chosen as the best fitting hypothesis.
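The fitness measure amounts to counting agreeing pixels. A minimal sketch, assuming each hypothesis has already been rendered into a per-pixel label image (nested lists of orientation labels) and that pixels with no belief in the orientation map are stored as None; both representations are assumptions.

```python
def fitness(hypothesis_labels, orientation_map):
    """Number of pixels where the hypothesis label equals the orientation map label."""
    return sum(
        1
        for row_h, row_o in zip(hypothesis_labels, orientation_map)
        for h, o in zip(row_h, row_o)
        if o is not None and h == o
    )


def best_hypothesis(rendered, orientation_map):
    """Key of the rendered hypothesis with the largest fitness."""
    return max(rendered, key=lambda name: fitness(rendered[name], orientation_map))
```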

When two line segments of different orientations support a pixel, this strongly

indicates that the surface at that pixel is perpendicular to the orientations of both

lines. For example, we, as humans, believe pixel (1) in Figure 3.7(a) is on a

horizontal surface because a green line above it and a blue line to the right support

pixel (1) being perpendicular to the orientation of both lines. Pixel (2) seems to be

on a vertical surface because green lines above and below and red lines to the left

support it. Notice that, although there is a blue line below pixel (2), its support is

blocked by the green line between the blue line and the pixel. The support of a line

extends until it hits a line which has the same orientation as the normal orientation

of the surface it is supporting. This is because a line cannot be on a plane that is

Algorithm 1 Generating building hypotheses

Set H0 ← ∅, where H0 is the set of hypotheses with zero corners.
for all pairs of line segments (li, lj) do
    if li is above the horizon ∧ lj is below the horizon ∧ li and lj overlap then
        Add the scene with no corners (li, lj) to H0
    end if
end for
for k = 1 to n, where n is the maximum number of corners in a scene do
    Set Hk ← ∅, where Hk is the set of hypotheses with k corners.
    for all h ∈ Hk−1 do
        Find sets of lines that create corners that attach to h and satisfy the geometric constraints.
        H′ ← set of all scenes with a new corner attached to h
        Hk ← Hk ∪ H′
    end for
end for
return H ← H0 ∪ H1 ∪ · · · ∪ Hn

perpendicular to it. This logic usually produces an accurate orientation map, except

around occluding boundaries.

More formally, let $L_x = \{l_{x,1}, l_{x,2}, \ldots, l_{x,n_x}\}$ be the set of line segments of

orientation $x$, where $x \in \{1, 2, 3\}$ denotes one of the three orientations. A “sweep”

$S(l_{x,i}, v_y, \alpha)$ of a line $l_{x,i}$ towards vanishing point $v_y$ by amount $\alpha$ is the set of

pixels that is supported by line $l_{x,i}$ to be of orientation $z$ (Figure 3.8). $x$, $y$, and $z$ take

values in $\{1, 2, 3\}$ and all three should be different ($x \neq y$, $x \neq z$, and $y \neq z$).

Given a line segment $l_{x,i}$ with end points $p_1$ and $p_2$, $S(l_{x,i}, v_y, \alpha)$ is the convex hull

created by $p_1$, $p_2$, $p'_1$, and $p'_2$, where $p'_1$ and $p'_2$ are given by

$$p'_1 = p_1 + \alpha \, (v_y - p_1),$$

$$p'_2 = \mathrm{intersection}\left(\mathrm{line}(v_x, p'_1), \; \mathrm{line}(v_y, p_2)\right),$$

where $\mathrm{line}(\cdot, \cdot)$ denotes a line passing through two points and $\mathrm{intersection}(\cdot, \cdot)$

denotes the point of intersection of two lines.
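The two sweep formulas translate directly into cross products of homogeneous coordinates. A minimal sketch, assuming both vanishing points are finite and given in pixel coordinates; the function names are illustrative.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def to_h(p):
    """2D point to homogeneous coordinates."""
    return (p[0], p[1], 1.0)


def from_h(p):
    """Homogeneous coordinates back to a 2D point (p[2] must be non-zero)."""
    return (p[0] / p[2], p[1] / p[2])


def sweep_quad(p1, p2, vx, vy, alpha):
    """Corners [p1, p2, p2', p1'] of the sweep of segment (p1, p2) towards vy."""
    # p1' = p1 + alpha * (vy - p1)
    p1s = (p1[0] + alpha * (vy[0] - p1[0]), p1[1] + alpha * (vy[1] - p1[1]))
    # p2' = intersection(line(vx, p1'), line(vy, p2))
    l_x = cross(to_h(vx), to_h(p1s))
    l_y = cross(to_h(vy), to_h(p2))
    p2s = from_h(cross(l_x, l_y))
    return [p1, p2, p2s, p1s]
```

With alpha = 0 the quadrilateral collapses onto the segment itself, which gives a quick sanity check on the construction.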

The sweep extends until the sweep region contains a line that “blocks” the sweep.

The amounts of sweep $\hat{\alpha}_{x,i}$ and $-\hat{\beta}_{x,i}$, towards and away from the sweep direction, are:

$$\hat{\alpha}_{x,i} = \max(\alpha), \qquad \hat{\beta}_{x,i} = \max(\beta),$$

such that $\alpha \geq 0$, $\beta \geq 0$, and no lines in $L_z$ intersect $S(l_{x,i}, v_y, \alpha)$ and $S(l_{x,i}, v_y, -\beta)$.

The set of pixels that is supported by all lines in $L_x$ swept towards $v_y$ to be of

orientation $z$ is:

$$P_{x,y,z} = \bigcup_{l_{x,i} \in L_x} S(l_{x,i}, v_y, \hat{\alpha}_{x,i}) \cup S(l_{x,i}, v_y, -\hat{\beta}_{x,i}).$$

A pixel is believed to have orientation $z$ when two lines of different orientations $x$

and $y$ support the pixel, and only when it is exclusively supported to be $z$. The final

orientation map $O_z$ for orientation $z$ is given by:

$$R_z = P_{x,y,z} \cap P_{y,x,z},$$

$$O_z = R_z \cap R_x^c \cap R_y^c,$$

where the superscript $c$ denotes the set complement.
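Once the swept pixel sets are available, the orientation map reduces to set intersections and differences. A minimal sketch, assuming pixels are stored as hashable coordinates in Python sets and that `P[(x, y, z)]` holds the set swept from lines of orientation x towards the vanishing point of orientation y; this dictionary layout is an assumption.

```python
def orientation_maps(P):
    """Return {z: O_z} from swept pixel sets P[(x, y, z)]."""
    orientations = (1, 2, 3)

    # R_z: pixels supported by lines of both remaining orientations.
    R = {}
    for z in orientations:
        x, y = [o for o in orientations if o != z]
        R[z] = P[(x, y, z)] & P[(y, x, z)]

    # O_z: pixels supported for z and for no other orientation.
    O = {}
    for z in orientations:
        x, y = [o for o in orientations if o != z]
        O[z] = R[z] - R[x] - R[y]
    return O
```

Subtracting the other two sets is the same as intersecting with their complements, so ambiguous pixels end up unlabeled in all three maps.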

Figure 3.7(b) shows O1, O2, and O3 colored in red, green, and blue.

Figure 3.7: Line segments and Orientation map. (a) Line segments, vanishing points, and

vanishing lines. (b) Orientation map. Line segments and regions are colored according to

their orientation. (Best viewed in color)

Figure 3.8: The shaded area denotes the sweep S (l, vy, α) of line l towards vanishing point

vy by amount α, and it potentially supports the region to be orthogonal to vx and vy.

3.5.4 Converting Building Models to 3D

Two dimensional building model hypotheses always encode valid 3D models, so

computing 3D coordinates can be done easily without ambiguity. 3D coordinates

can be computed sequentially for floor, then walls using the constraint that floor and

walls are connected, and finally the ceiling, using the following formulas.

All distances are measured in units of camera height, i.e., the distance between the floor and

the camera measured perpendicular to the floor equals 1, since absolute distances

cannot be measured from images. Lower case: 2D homogeneous coordinates.

Upper case: 3D coordinates. The subscript 1 (v1, V1) indicates

the vertical vanishing point. K: camera intrinsic parameter matrix.

• Ray

$$P = \lambda K^{-1} p, \quad \lambda > 0$$

• Normal direction of the three major axes given coordinates of three vanishing

points $(x_k, y_k)$ in the image.

$$v_k = (x_k, y_k, 1)^T \;\Leftrightarrow\; V_k = \frac{K^{-1} v_k}{\left\| K^{-1} v_k \right\|_2}$$

• 3D coordinate of a point on the floor. Note that the height is normalized to 1.

$$P = \frac{K^{-1} p}{V_1^T K^{-1} p}$$

• Height $h$ between two points $p_1$ and $p_2$, with $p_1$ being a point on the floor.

$p_1$, $p_2$, and $v_1$ should roughly be in line when applying this formula, as we

assume $P_1$ and $P_2$ are vertically aligned in 3D.

$$P_2 = \lambda K^{-1} p_2 = P_1 + h V_1 = \frac{K^{-1} p_1}{V_1^T K^{-1} p_1} + h V_1$$

$$\begin{bmatrix} -V_1 & K^{-1} p_2 \end{bmatrix} \begin{bmatrix} h \\ \lambda \end{bmatrix} = \frac{K^{-1} p_1}{V_1^T K^{-1} p_1}$$

Solving least-squares gives $h$.
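The floor-point and height formulas can be sketched in a few lines. This example assumes a pinhole camera K = [[f, 0, cx], [0, f, cy], [0, 0, 1]] so that K⁻¹p has a closed form, and it solves the 3×2 least-squares problem through its normal equations; both choices are simplifications for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def k_inv(p, f, cx, cy):
    """K^-1 p for a pixel p = (x, y) under the assumed pinhole K."""
    return ((p[0] - cx) / f, (p[1] - cy) / f, 1.0)


def floor_point(p, V1, f, cx, cy):
    """3D point on the floor seen at pixel p, in units of camera height."""
    r = k_inv(p, f, cx, cy)
    s = dot(V1, r)
    return tuple(c / s for c in r)


def height_above(p1, p2, V1, f, cx, cy):
    """Least-squares h in [-V1  K^-1 p2] [h, lam]^T = P1."""
    P1 = floor_point(p1, V1, f, cx, cy)
    a = tuple(-c for c in V1)      # first column of the 3x2 matrix
    b = k_inv(p2, f, cx, cy)       # second column
    # Normal equations [[aa, ab], [ab, bb]] [h, lam]^T = [aP, bP]^T, solved by Cramer's rule.
    aa, ab, bb = dot(a, a), dot(a, b), dot(b, b)
    aP, bP = dot(a, P1), dot(b, P1)
    det = aa * bb - ab * ab
    return (aP * bb - ab * bP) / det
```

Note that under this sign convention, where V1 points from the camera toward the floor so that floor points have positive depth, points above the floor come out with negative h.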

Figure 3.9: 3D models

Small errors can accumulate during the above mentioned sequential process, so

we follow Delage et al. [11] to globally minimize the distances between connected

planes using linear programming. Recovered 3D models are visualized in Fig-

ure 3.9.

3.6 Experiments

We have collected 54 images of indoor scenes. We have also included objects in

the image that obstruct the view of the scene frame. We have manually labeled

the ground truth orientation for every pixel, ignoring the occluding objects. The

percentage of pixels that have the correct orientation for each image is reported

in Figure 3.10. On average, 81% of the pixels were classified correctly. 76% of

the images had less than 30% misclassified pixels, and 44% had less than 10%

misclassified pixels. Qualitatively, around 70% of the images returned acceptable

3D models. Notice that even when objects occlude the floor-wall boundary, the

underlying building structure could be recovered (Figure 3.14). In these cases, the

unobstructed view of the ceiling-wall boundary helped find the underlying

building structure. Typical failure cases are: hallways being cut off early when

there are no lines supporting down the hallway, missing corners, or misaligned

boundaries (Figure 3.15).

We have compared our results with other works on recovering indoor structure

from a single image. Our results were comparable to those of Delage et al. [12], using their

experimental setup and dataset, which had 48 images of indoor campus scenes.

RMS error between the estimated and ground truth floor boundary was measured

in pixel space, and is plotted as a function of the position of the true floor boundary

(Figure 3.11). Comparing with Hoiem et al. [35], using their classifier trained for

indoor images, we have a higher percentage of correctly classified pixel orientation

on 20 out of 48 images, and a mean percentage of 80% versus 87%. In both cases,

our results are comparable, while relying only on line segments and not on image

properties such as colors and image gradients, which can be scene specific.

We have also tested on the 44 images downloaded from the web, also collected

by Delage et al. Qualitatively, around 20 of them returned acceptable 3D models.

Failures were due to many objects that cluttered the scene, and scenes that do not

match our building model. Sample results are shown in Figure 3.16.

Figure 3.10: Percentage of pixels with correct orientation. (Plot: x-axis, image index;

y-axis, percentage; average = 0.82.)

Figure 3.11: Comparison of floor boundary error. (Plot: x-axis, height of ground truth in

pixels; y-axis, RMS error in localisation in pixels; curves: our result and Delage et al.)

3.7 Populating the Scene Frame with Objects

Now that we have the scene structure, we would like to use it as a “frame” that

defines the scene, and populate it with objects in the scene. Recovering the “scene

Figure 3.12: Examples of doors and people in a scene frame.

frame” is a stepping stone toward a more complete scene understanding, as it pro-

vides a global geometric context of the scene. Our ultimate goal is to recognize all

the objects in a scene. Most objects of interest fall into one of the two categories:

objects that lie on the floor, and objects that are attached to a wall. Objects that lie

on the floor interact with the scene frame by being supported at the points where they contact

the floor of the frame, which determines their 3D location. These objects need to be

in an empty space of the frame, and not inside walls. Locations of objects attached

to walls are also constrained by the scene frame. Figure 3.12 shows results of

integrating the recovered scene structure with door and pedestrian detection. A more

thorough study on improving building structure recovery by adding objects into the

framework will be presented in Chapter 4. A study on improving object detection using

geometry will be presented in Chapter 5.

3.8 Conclusion

We have proposed a framework to interpret a collection of line segments to recover

three dimensional building structure. We have shown that, by geometric reasoning,

and by using the prior knowledge of indoor environments, we can recover the struc-

ture of a building, using only line segments. An interesting future problem would

be to use our recovered structure as a “scene frame” to recognize more components

in the scene and step towards the grand goal of complete scene interpretation.

Figure 3.13: Examples (Best viewed in color)

Figure 3.14: Examples with occluding objects. Unobstructed view of the ceiling-wall

boundary helps find the underlying building structure. (Best viewed in color)

Figure 3.15: Failure examples. (Best viewed in color)

Figure 3.16: Examples of images downloaded from the web. Top two rows: Success.

Bottom two rows: Failure. (Best viewed in color)

Chapter 4

Volumetric Reasoning for Structure

and Objects

In the previous chapter, we have developed a method to recover the structure of

building interiors, while treating objects in the scene as outliers. In this chapter,

we show that by explicitly modeling objects and applying volumetric constraints

derived from the principles based on the physical world, the estimated structure is

geometrically plausible and the performance of the estimate improves.

4.1 Introduction

Consider the indoor image shown in Figure 4.1. Understanding such a complex

scene not only involves visual recognition of objects but also requires extracting

the 3D spatial layout of the room (ceiling, floor and walls). Extraction of the spatial

layout of a room provides crucial geometric context required for visual recognition.

There has been a recent push to extract spatial layout of the room by classifiers

which predict qualitative surface orientation labels (floor, ceiling, left, right, center

wall and object) from appearance features and then fit a parametric model of the

room. However, such an approach is limited in that it does not use the additional

information conveyed by the configuration of objects in the room and, therefore, it

fails to use all of the available cues for estimating the spatial layout.

In this work, we propose to incorporate an explicit volumetric representation of

objects in 3D into the spatial interpretation process. Unlike previous approaches, which

model objects by their projection in the image plane, we propose a parametric rep-

resentation of the 3D volumes occupied by objects in the scene. We show that such

a parametric representation of the volume occupied by an object can provide crucial

evidence for estimating the spatial layout of the rooms. This evidence comes from

volumetric reasoning between the objects in the room and the spatial layout of the

room. We propose to augment the existing structured classification approaches with

volumetric reasoning in 3D for extracting the spatial layout of the room.

Figure 4.1 shows an example of a case where volumetric reasoning is crucial in

estimating the surface layout of the room. Figure 4.1(b) shows the estimated spatial

layout for the room (overlaid on surface orientation labels predicted by a classi-

fier) when no reasoning about the objects is performed. In this case, the couch is

predicted as floor and therefore there is substantial error in estimating the spatial

layout. If the couch is predicted as clutter and the image evidence from the couch

is ignored (Figure 4.1(c)), multiple room hypotheses can be selected based on the

predicted labels of the pixels on the wall (Figure 4.1(d)) and there is still not enough

evidence in the image to select one hypothesis over another in a confident manner.

However, if we represent the object by a 3D parametric model, such as a cuboid

(Figure 4.1(e)), then simple volumetric reasoning (the 3D volume occupied by the

couch should be contained in the free space of the room) can help us reject

physically invalid hypotheses and estimate the correct layout of the room by pushing the

walls to completely contain the cuboid (Figure 4.1(f)).

Figure 4.1: (a) Input image. (b) Estimate of the spatial layout of the room without object

reasoning. Colors represent the output of the surface geometry by [36]. Green: floor, red:

left wall, yellow: center wall, cyan: right wall. (c) Evidence from object region removed.

(d) Spatial layout with 2D object reasoning. (e) Object fitted with 3D parametric model. (f)

Spatial layout with 3D volumetric reasoning. The wall is pushed by the volume occupied

by the object.

In this work, we propose a method to perform volumetric reasoning by combin-

ing classical constrained search techniques and current structured prediction tech-

niques. We show that the resulting approach leads to substantially improved per-

formance on standard datasets with the added benefit of a more complete scene

description that includes objects in addition to surface layout.

4.1.1 Background

The goal of extracting 3D geometry by using geometric relationships between ob-

jects dates back to the start of computer vision around four decades ago. In the early

days of computer vision, researchers extracted lines from “blockworld” scenes [54]

and used geometric relationships using constraint satisfaction algorithms on junc-

tions [27, 69]. However, the reasoning approaches used in these block world scenar-

ios (synthetic line drawings) proved too brittle for the real-world images and could

not handle the errors in extraction of line-segments or generalize to other shapes.

In recent years, there has been renewed interest in extracting camera param-

eters and three-dimensional structures in restricted domains such as Manhattan

Worlds [8]. Kosecka et al. [40] developed a method to recover vanishing points and

camera parameters from a single image by using line segments found in Manhat-

tan structures. Using the recovered vanishing points, rectangular surfaces aligned

with major orientations were also detected by [41]. However, these approaches are

only concerned with dominant directions in the 3D world and do not attempt to

extract three dimensional information about the room and the objects in the room. Yu et

al. [71] inferred the relative depth-order of rectangular surfaces by considering their

relationship. However, this method only provides depth cues of partial rectangular

regions in the image and not the entire scene.

There has been a recent series of methods related to our work that attempt to

model geometric scene structure from a single image, including geometric label

classification [36, 57] and finding vertical/ground fold-lines [12]. Lee et al. [45]

introduced parameterized models of indoor environments, constrained by rules in-

spired by blockworld to guarantee physical validity. However, since this approach

samples possible spatial layout hypotheses without clutter, it is prone to errors

caused by occlusion and tends to fit rooms in which the walls coincide with

the object surfaces. A recent paper by Hedau et al. [31] uses an appearance based

clutter classifier and computes visual features only from the regions classified as

“non-clutter”, while parameterizing the 3D structure of the scene by a box. They

use structured approaches to estimate the best fitting room box to the image. A sim-

ilar approach has been used by Wang et al. [70] which does not require the ground

truth labels of clutter. In these methods, however, the modeling of interactions

between clutter and spatial-layout of the room is only done in the image plane and the

3D interactions between room and clutter are not considered.

In work concurrent to ours, Hedau et al. [32] have also modeled objects as

three dimensional cuboids and considered the volumetric intersection with the room

structure. The goal of their work differs from ours. Their primary goal is to improve

object detection, such as beds, by using information of scene geometry, whereas our

goal is to improve scene understanding by proposing a control structure that incor-

porates volumetric constraints. Therefore, we are able to improve the estimate of

the room by estimating the objects and vice versa, whereas in their work informa-

tion flows in only one direction (from scene to objects).

In recent work by Gupta et al. [25], qualitative reasoning of scene geometry

was done by modeling objects as “blocks” for outdoor scenes. In contrast, we

use stronger parametric models for rooms and objects in indoor scenes, which are

more structured; this allows us to do more explicit and exact 3D volumetric

reasoning.

4.2 Overview

Our goal is to jointly extract the spatial layout of the room and the configuration of

objects in the scene. We model the spatial layout of the room by 3D boxes and we

model the objects as solids which occupy 3D volumes in the free space defined by

the room walls. Given a set of room hypotheses and object hypotheses, our goal

is to search the space of scene configurations and select the configuration that best

matches the local surface geometry estimated from image cues and satisfies the vol-

umetric constraints of the physical world. These constraints (shown in Figure 4.3)

are:

• Finite volume: Every object in the world should have a non-zero finite vol-

ume.

• Spatial exclusion: The objects are assumed to be solid objects which cannot

intersect. Therefore, the volumes occupied by different objects are mutually

exclusive. This implies that the volumetric intersection between two objects

should be empty.

• Containment: Every object should be contained in the free space defined

by the walls of the room (i.e., none of the objects should be outside the room

walls).
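The three constraints above can be sketched for the simplest case. This example assumes that the room and all objects are boxes aligned with the same three axes and stored as (min_corner, max_corner) pairs in room coordinates; that representation, and the choice to allow touching faces, are assumptions for illustration.

```python
def volume(box):
    """Volume of an axis-aligned box ((x0, y0, z0), (x1, y1, z1))."""
    lo, hi = box
    v = 1.0
    for i in range(3):
        v *= max(0.0, hi[i] - lo[i])
    return v


def has_finite_volume(box):
    """Finite volume: non-zero and finite."""
    return 0.0 < volume(box) < float("inf")


def intersects(a, b):
    """Spatial exclusion test: True if the interiors overlap (touching faces are allowed)."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))


def contained(obj, room):
    """Containment: the object lies entirely inside the room box."""
    return all(room[0][i] <= obj[0][i] and obj[1][i] <= room[1][i] for i in range(3))


def is_valid_scene(room, objects):
    """True if the configuration satisfies all three volumetric constraints."""
    if not all(has_finite_volume(o) and contained(o, room) for o in objects):
        return False
    return not any(
        intersects(objects[i], objects[j])
        for i in range(len(objects))
        for j in range(i + 1, len(objects))
    )
```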

Our approach is illustrated in Figure 4.2. We first extract line segments and

estimate three mutually orthogonal vanishing points (Figure 4.2(b)). The vanishing

points define the orientation of the major surfaces in the scene [41, 45, 31] and

hence constrain the layout of ceilings, floor and walls of the room. Using the line

segments labeled by their orientations, we then generate multiple hypotheses for

Figure 4.2: Overview of our approach for estimating the spatial layout of the room and the

objects. (a) Input image. (b) Line segments and vanishing points. (c) Geometric context.

(d) Orientation map. (e) Room hypotheses. (f) Cube hypotheses. (g) Reject invalid

configurations. (h) Scene configuration hypotheses. (i) Evaluate. (j) Final scene

configuration.

rooms and objects (Figure 4.2(e)(f)). A hypothesis of a room is a 3D parametric

representation of the layout of major surfaces of the scene, such as floor, left wall,

center wall, right wall, and ceiling. A hypothesis of an object is a 3D parametric

representation of an object in the scene, approximated as a cuboid.

The room and cuboid hypotheses are then combined to form the set of possible

configurations of the entire scene (Figure 4.2(h)). The configuration of the entire

scene is represented as one sample of the room hypothesis along with some subset

of object hypotheses. The number of possible scene configurations is exponential

in the number of object hypotheses: there are $O(n \cdot 2^m)$ configurations, where $n$ is
the number of room hypotheses and $m$ is the number of object hypotheses. However, not all
cuboid and room subsets

are compatible with each other. We use simple 3D spatial reasoning to enforce the

volumetric constraints described above (see Figure 4.2(g)). We therefore test each

room-object pair and each object-object pair for their 3D volumetric compatibility,

so that we allow only the scene configurations which have no room-object and no

object-object volumetric intersection.

Finally, we evaluate the scene configurations created by combinations of room

hypotheses and object hypotheses to find the scene configuration that best matches

the image (Figure 4.2(i)). As the scene configuration is a structured variable, we

use a variant of the structured prediction algorithm [65] to learn the cost function.

We use two sources of surface geometry, orientation map [45] and geometric con-

text [36], which serve as features in the cost function. Since it is computationally ex-

pensive to test exhaustive combinations of scene configurations in practice, we use

beam search to sample the scene configurations that are volumetrically compatible

(Section 4.5.1).

4.3 Estimating Surface Geometry

We would like to predict the local surface geometry of the regions in the image.

A scene configuration should satisfy local surface geometry extracted from image

cues and should satisfy the 3D volumetric constraints. The estimated surface geom-

etry is therefore used as features in a scoring function that evaluates a given scene

configuration.

For estimating surface geometry we use two methods: the line-sweeping algo-

rithm [45] and a multiple segmentation classifier [36]. The line-sweeping algorithm

takes line segments as input and predicts an orientation map in which regions are


classified as surfaces having one of three possible orientations. Figure 4.2(d) shows

an example of an orientation map. The region estimated as horizontal surface is

colored in red, and vertical surfaces are colored in green and blue, corresponding

to the associated vanishing point. This orientation map is used to evaluate scene

configuration hypotheses. The multiple segmentation classifier [36] takes the full

image as input, uses image features, such as combinations of color and texture, and

predicts geometric context represented by surface geometry labels for each super-

pixel (floor, ceiling, vertical (left, center, right), solid, and porous regions). Similar

to orientation maps, the predicted labels are used to evaluate scene configuration

hypotheses.

4.4 Generating Scene Configuration Hypothesis

Given the local surface geometry and the oriented line segments extracted from the

image, we now create multiple hypotheses for the possible spatial layout of the room
and the layout of objects in the room. These hypotheses are then combined to produce
scene configurations such that all the objects occupy exclusive 3D volumes
and the objects are inside the free space of the room defined by the walls.

4.4.1 Generating Room Hypotheses

A room hypothesis encodes the position and orientation of walls, floor, and ceil-

ing. In this work, we represent a room hypothesis by a parametric box model [31].

Room hypotheses are generated from line segments in a way similar to the method

described in the previous chapter. In the previous chapter, we examined exhaustive
combinations of line segments and checked which of the resulting combinations
define physically valid room models. Here, we instead sample random tuples of line
segments that define the boundaries of the parametric box. Only the minimum
number of line segments needed to define the parametric room model is sampled.

Figure 4.2(e) shows examples of generated room hypotheses.

4.4.2 Generating Object Hypotheses

Our goal is to extract the 3D geometry of the clutter objects to perform 3D spatial

reasoning. Estimating precise 3D models of objects from a single image is an ex-

tremely difficult problem and probably requires recognition of object classes such

as couches and tables. However, our goal is to perform coarse 3D reasoning about

the spatial layout of rooms and spatial layout of objects in the room. We only need

to model a subset of objects in the scene to provide enough constraints for volu-

metric reasoning. Therefore, we adopt a coarse 3D model of objects in the scene

and model each object volume as a cuboid. We found that parameterizing objects as

cuboids provides a good approximation to the occupied volume in man-made en-

vironments. Furthermore, by modeling objects by a parametric model of a cuboid,

we can determine the location and dimensions in 3D up to scale, which allows

volumetric reasoning about the 3D interaction between objects and the room.

We generate object hypotheses from the orientation map described above. Fig-

ure 4.4(a)(b) shows an example scene and its orientation map. The three colors

represent the three possible plane orientations used in the orientation map. We can

see from the figure that the distribution of surfaces on the objects estimated by the

orientation map suggests the presence of a cuboidal object. Figure 4.4(c) shows a

pair of regions which can potentially form a convex edge if the regions represent

the visible surfaces on a cuboidal object.


We test all pairs of regions in the orientation map to check whether they can

form convex edges. This is achieved by checking the estimated orientation of the

regions and the spatial location of the regions with respect to the vanishing points.

If the region pair can form a convex corner, we utilize these regions to form an

object hypothesis. To generate a cuboidal object hypothesis from pairs of regions,

we first fit tight bounding quadrilaterals (Figure 4.4(c)) to each region in the pair

and then sample all combinations of three points out of the eight vertices on the

two quadrilaterals, which do not lie on a plane. Three is the minimum number of

points (with (x, y) coordinates) that have enough information to define a cuboid

projected onto a 2D image plane, which has five degrees of freedom. We can then

hypothesize a cuboid whose corners best approximate the three points. Figure 4.4(d)

shows a sample of a cuboidal object hypothesis generated from the given orientation

map.

4.4.3 Volumetric Compatibility of Scene Configuration

Given a room configuration and a set of candidate objects, a key operation is to eval-

uate whether the resulting combination satisfies the three fundamental volumetric

compatibility constraints described in Section 4.2. The problem of estimating the

three dimensional layout of a scene from a single image is inherently ambiguous

because any measurement from a single image can only be determined up to scale.

In order to test the volumetric compatibility of room-object hypothesis pairs and
object-object hypothesis pairs, we make the assumption that all objects rest on the

floor. This assumption fixes the scale ambiguity between room and object hypothe-

ses and allows us to reason about their 3D location.

To test whether an object is contained within the free space of a room, we check


Figure 4.3: Examples of volumetric constraint violation. (a) Containment constraint.
(b) Spatial exclusion constraint.

whether the projection of the bottom surface of the object onto the image is com-

pletely contained within the projection of the floor surface of the room. If the pro-

jection of the bottom surface of the object is not completely within the floor surface,

the corresponding 3D object model must be protruding into the walls of the room.

Figure 4.3(a) shows an example of an incompatible room-object pair.

Similarly, to test whether the volume occupied by two objects is exclusive, we

assume that the two objects rest on the same floor plane and we compare the pro-

jection of their bottom surfaces onto the image. If there is any overlap between the

projections of the bottom surface of the two object hypotheses, that means that they

occupy intersecting volumes in 3D. Figure 4.3(b) shows an example of an incom-


Figure 4.4: Object hypothesis generation: we use the orientation maps to generate object
hypotheses by finding convex edges. (a) Image. (b) Orientation map. (c) Convex edge check.
(d) Hypothesized cuboid.

patible object-object pair.
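The two footprint tests described in this section can be sketched as follows. This is a minimal illustration, not the implementation used in our experiments: it assumes that the projected floor of the room and the projected bottom surface of each object are available as convex polygons with vertices in counter-clockwise order, and the function names (`contained`, `footprints_overlap`) are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # assumed convex, vertices in counter-clockwise order


def cross(o: Point, a: Point, b: Point) -> float:
    """z-component of (a - o) x (b - o); positive if o -> a -> b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def point_in_convex(p: Point, poly: Polygon) -> bool:
    """True if p lies inside or on the boundary of a CCW convex polygon."""
    n = len(poly)
    return all(cross(poly[i], poly[(i + 1) % n], p) >= 0 for i in range(n))


def contained(footprint: Polygon, floor: Polygon) -> bool:
    """Containment test: every footprint vertex lies within the floor polygon.

    Checking vertices is sufficient because both polygons are assumed convex.
    """
    return all(point_in_convex(v, floor) for v in footprint)


def footprints_overlap(a: Polygon, b: Polygon) -> bool:
    """Spatial exclusion test using the separating axis theorem.

    Footprints that only touch along an edge or corner are not counted
    as overlapping.
    """
    for poly in (a, b):
        n = len(poly)
        for i in range(n):
            ex = poly[(i + 1) % n][0] - poly[i][0]
            ey = poly[(i + 1) % n][1] - poly[i][1]
            ax, ay = -ey, ex  # normal of the current edge
            proj_a = [ax * x + ay * y for x, y in a]
            proj_b = [ax * x + ay * y for x, y in b]
            if max(proj_a) <= min(proj_b) or max(proj_b) <= min(proj_a):
                return False  # a separating axis exists
    return True
```

A room-object pair would be rejected when `contained` returns False, and an object-object pair would be rejected when `footprints_overlap` returns True.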

4.5 Evaluating Scene Configurations

4.5.1 Inference

Given an image $x$, a set of room hypotheses $\{r_1, r_2, \ldots, r_n\}$, and a set of object
hypotheses $\{o_1, o_2, \ldots, o_m\}$, our goal is to find the best scene configuration
$y = (y_r, y_o)$, where $y_r = (y_r^1, \ldots, y_r^n)$ and $y_o = (y_o^1, \ldots, y_o^m)$.
Here $y_r^i = 1$ if room hypothesis $r_i$ is used in the scene configuration and $y_r^i = 0$
otherwise, and $y_o^i = 1$ if object hypothesis $o_i$ is present in the scene configuration
and $y_o^i = 0$ otherwise. Note that $\sum_i y_r^i = 1$, as only one room hypothesis is
needed to define the scene configuration.

Suppose that we are given a function $f(x, y)$ that returns a score for $y$. Finding
the best scene configuration $y^* = \arg\max_y f(x, y)$ by testing all possible
scene configurations requires $n \cdot 2^m$ evaluations of the score function. We resort to
beam search (a fixed-width search tree) to keep the computation manageable
by avoiding evaluating all scene configurations.

In the first level of the search tree, scene configurations with a room hypothesis

and no object hypothesis are evaluated. In the following levels, an object hypothesis

is added to its parent configuration and the configuration is evaluated. The top $k_l$
nodes with the highest score are added to the search tree as child nodes, where
$k_l$ is a predetermined beam width for level $l$ (footnote 2). The search is continued for a fixed

number of levels or until no cubes that are compatible with existing configurations

can be added. After the search tree has been explored, the best scoring node in the

tree is returned as the best scene configuration.
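The search procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the exact implementation: a configuration is represented as a room index plus a set of object indices, and the callables `score` (standing in for $f(x, y)$) and `compatible` (standing in for the volumetric compatibility test) are hypothetical placeholders supplied by the caller. The default beam widths follow the values reported in the footnote, and at least one room hypothesis is assumed.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Config = Tuple[int, FrozenSet[int]]  # (room index, set of object indices)


def beam_search(
    n_rooms: int,
    n_objects: int,
    score: Callable[[Config], float],
    compatible: Callable[[Config, int], bool],
    beam_widths: Sequence[int] = (100, 5, 2, 1),
) -> Config:
    """Return the best-scoring configuration among the nodes kept in the tree."""
    # First level: every room hypothesis with no objects.
    level: List[Config] = [(r, frozenset()) for r in range(n_rooms)]
    level = sorted(level, key=score, reverse=True)[: beam_widths[0]]
    best = level[0]

    # Following levels: add one compatible object to each parent.
    for width in beam_widths[1:]:
        children = set()
        for room, objs in level:
            for o in range(n_objects):
                if o not in objs and compatible((room, objs), o):
                    children.add((room, objs | {o}))
        if not children:
            break  # no compatible object can be added
        level = sorted(children, key=score, reverse=True)[:width]
        if score(level[0]) > score(best):
            best = level[0]
    return best
```

Because the best node over the whole tree is returned, a configuration with fewer objects can win over its descendants when adding objects lowers the score.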

4.5.2 Learning the Score Function

We set the score function to $f(x, y) = w^T \psi(x, y) + w_\phi^T \phi(y)$, where $\psi(x, y)$
is a feature vector for a given image $x$ that measures the compatibility of the scene
configuration $y$ with the estimated surface geometry, and $\phi(y)$ is the penalty term for
incompatible configurations, which penalizes the room and object configurations that
violate volumetric constraints.

Footnote 2: We set $k_l$ to (100, 5, 2, 1), with a maximum of 4 levels. The results were not
sensitive to these parameters.

We use structured SVM [65] to learn the weight vector w. The weights are

learned by solving

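$$
\begin{aligned}
\min_{w,\,\xi}\quad & \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \\
\text{s.t.}\quad & w^T \psi(x_i, y_i) - w^T \psi(x_i, y) - w_\phi^T \phi(y) \ge \Delta(y_i, y) - \xi_i, \quad \forall i, \forall y \\
& \xi_i \ge 0, \quad \forall i,
\end{aligned}
$$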

where $x_i$ are images, $y_i$ are the ground-truth configurations, $\xi_i$ are slack variables,
and $\Delta(y_i, y)$ is the loss function that measures the error of configuration $y$.
Tsochantaridis et al. [65] deal with the large number of constraints by iteratively adding
the most violated constraints. We simplify this by sampling a fixed number of configurations
for each training image, using the same beam search process used for inference,
and solving using quadratic programming.
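To make the role of the sampled constraints concrete, the following sketch computes the smallest slack $\xi_i$ that satisfies all sampled constraints for one training image, given fixed weights. It is only an illustration: the function name `slack` and the tuple layout of `samples` are assumptions, and the quadratic program itself is left to an off-the-shelf solver that is not shown here.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slack(
    w: Sequence[float],
    w_phi: Sequence[float],
    psi_true: Sequence[float],
    samples: List[Tuple[Sequence[float], Sequence[float], float]],
) -> float:
    """Smallest xi_i >= 0 satisfying every sampled constraint.

    psi_true is psi(x_i, y_i) for the ground-truth configuration.
    Each sample is (psi(x_i, y), phi(y), Delta(y_i, y)) for one sampled y.
    """
    xi = 0.0
    for psi, phi, loss in samples:
        # Left-hand side of the constraint for this sampled configuration.
        margin = dot(w, psi_true) - dot(w, psi) - dot(w_phi, phi)
        # The constraint margin >= loss - xi requires xi >= loss - margin.
        xi = max(xi, loss - margin)
    return xi
```

A slack of zero means the ground-truth configuration already outscores every sampled configuration by at least its loss.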

Loss Function: The loss function $\Delta(y_i, y)$ is the percentage of pixels in the
entire image having an incorrect label. For example, pixels that are labeled as left wall

when they actually belong to the center wall, or pixels labeled as object when they

actually belong to the floor would be counted as incorrectly labeled pixels. A wall is

labeled as center if the surface normal is within 45 degrees from the camera optical

axis and labeled as left or right, otherwise.
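As a small illustration of this loss, the sketch below compares two per-pixel label maps and returns the fraction of pixels that disagree. The representation of a configuration as a 2D grid of label strings and the function name `pixel_loss` are assumptions made for this example.

```python
from typing import List


def pixel_loss(truth: List[List[str]], pred: List[List[str]]) -> float:
    """Fraction of pixels whose predicted surface label differs from the truth.

    Both inputs are label maps of the same size, with labels such as
    'floor', 'left', 'center', 'right', 'ceiling', or 'object'.
    """
    total = 0
    wrong = 0
    for row_t, row_p in zip(truth, pred):
        for t, p in zip(row_t, row_p):
            total += 1
            if t != p:
                wrong += 1
    return wrong / total
```

Multiplying the returned fraction by 100 gives the percentage used in the text.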

Feature Vector: The feature vector $\psi(x, y)$ is computed by measuring how well
each surface in the scene configuration $y$ is supported by the orientation map and
the geometric context. A feature is computed for each of the six surfaces in the
scene configuration (floor, left wall, center wall, right wall, ceiling, object) as the
relative area over which the orientation map or the geometric context correctly explains
the attribute of the surface. This results in a twelve-dimensional feature vector for a
given scene configuration. For example, the features for the floor surface in the scene
configuration are computed as the relative area in which the orientation map predicts a
horizontal surface and the relative area in which the geometric context predicts a floor label.

Volumetric Penalty: The penalty term $\phi(y)$ measures how much the volumetric
constraints are violated. (1) The first term $\phi(y_r, y_o)$ measures the volumetric
intersection between the volume defined by the room walls and the objects. It penalizes
configurations where an object hypothesis lies outside the room volume, and
the penalty is proportional to the volume outside the room. (2) The second term
$\sum_{i,j} \phi(y_o^i, y_o^j)$ measures the volume intersection between two objects $(i, j)$.
The penalty from this term is proportional to the overlap of the cubes projected on the
floor.

4.6 Experimental Results

We evaluated our 3D geometric reasoning approach on an indoor image dataset in-

troduced in [31]. The dataset consists of 314 images, and the ground-truth consists

of the marked spatial layout of the room and the clutter layouts. For our experi-

ments, we use the same training-test split as used in [31] (209 training and 105 test

images). We use training images to estimate the weight vector.

Qualitative Evaluation: Figure 4.5 illustrates the benefit of 3D spatial reasoning

introduced in our approach. If no 3D clutter reasoning is used and the room box

is fitted to the orientation map and geometric context, the box gets fit to the object

surfaces and therefore leads to substantial error in the spatial layout estimation.

However, if we use 3D object reasoning, walls get pushed due to the containment

constraint and the spatial layout estimation improves. We can also see from the

examples that extracting a subset of objects in the scene is enough for reasoning and


Figure 4.5: Two qualitative examples showing how 3D volumetric reasoning aids estimation
of the spatial layout of the room. Columns: input image, room only, room and objects,
orientation map, geometric context.

Figure 4.6: Additional examples to show the performance on a wide variety of scenes.

Dotted lines represent the room estimate without object reasoning.

improving the spatial layout estimation. Figures 4.6 and 4.7 show more examples of

the spatial layout and the estimated clutter objects in the images. Additional results

are in the supplementary material.


Figure 4.7: Failure examples. The first two examples are failure cases in which the cuboids
are either missed or estimated incorrectly. The last two failure cases are due to errors in
vanishing point estimation.

                        OM+GC    OM       GC
No object reasoning     18.6%    24.7%    22.7%
Volumetric reasoning    16.2%    19.5%    20.2%

Table 4.1: Percentage of pixels with an incorrect estimate of room surfaces. The first row
performs no reasoning about objects. The second row is our approach with 3D volumetric
reasoning of objects. Columns show the features that are used. OM: Orientation map from [45].
GC: Geometric context from [36].

Quantitative Evaluation: We evaluate the performance of our approach in esti-

mating the spatial layout of the room. We use the pixel-based measure introduced

in [31] which counts the percentage of pixels on the room surfaces that disagree

with the ground truth. For comparison, we employ the simple multiple segmen-

tation classifier [36] and the recent approach introduced in [31] as baselines. The

images in the dataset have significant clutter; therefore, simple classification based

approaches with no clutter reasoning perform poorly and have an error of 26.5%.

The state-of-the-art approach [31] which utilizes clutter reasoning in the image

plane has an error of 21.2%. In contrast, our approach, which uses a parametric
model of clutter and simple 3D volumetric reasoning, outperforms both
approaches and has an error of 16.2%.

We also performed several experiments to measure the significance of each step

and feature in our approach. When we only use the surface layout estimates

from [36] as features of the cost function, our approach has an error rate of 20.2%

whereas using only orientation maps as features yields an error rate of 19.5%. We

also tried several search techniques to search the space of hypotheses. With a greedy

approach (best cube added at each iteration) to search the hypothesis space, we

achieved an error rate of 19.2% (compared with 16.2% for our full approach), which suggests
that early commitment to partial configurations leads to errors and that a search strategy
that allows late commitment, such as beam search, should be used.

4.7 Conclusion

This chapter proposes the use of volumetric reasoning between objects and surfaces

of room layout to recover the spatial layout of a scene. By parametrically represent-

ing the 3D volume of objects and rooms, we can apply constraints for volumetric

reasoning, such as spatial exclusion and containment. Our experiments show that

volumetric reasoning improves the estimate of the room layout and provides a richer

interpretation about objects in the scene. The rich geometric information provided

by our method can provide crucial information for object recognition and eventually

aid in complete scene understanding.


Chapter 5

Detecting Objects Characterized by

Geometry

5.1 Introduction

Most successful object detection methods have focused on using the regularities

in the appearance of objects to identify them. These methods have had success

on objects that can be characterized by the regularities in their appearance in the

image, such as faces [56, 59, 68], pedestrians from a distance [10], and cars [59].
These methods continue to be applied to broader categories, such as bicycles,
televisions, potted plants, and so on, with varying degrees of success.

These methods rely on the fact that the appearance of objects in images is fairly
consistent. In the case of faces, the eye and mouth regions consistently have darker
intensity than the nose and cheek regions. By designing features that capture those
consistent characteristics, vision researchers have been able to develop a face detector [68].
In the case of pedestrians, edge histograms have been found to be consistent
across different instances of pedestrians and were successfully used to develop a
pedestrian detector [10]. By continuing to improve ways to capture the consistencies
in appearance, we can expect these methods to improve for categories that have
strong consistencies in their appearance.

However, there is an entirely different class of objects that do not have
strong characteristics and consistency in their appearance, but are instead easily
characterized by their geometry in the 3D world. Some examples of such objects are
doors, desks, computer monitors, beds, and chairs. For example, the characteristic
that defines a door is that it is a rectangle with the appropriate size and
location for a person to pass through, and that it is attached to a wall, so that
it serves its function as a passage to a different room in a building. Therefore, the
geometry of doors in 3D is very consistent, which makes it a desirable property
to use when developing a door detector. On the other hand, the appearance of doors
in an image varies to a greater degree due to the various colors and textures of doors,
posters attached to doors, and varying viewing angles, which lead to perspective
distortion.

Similarly, computer monitors are, appearance-wise, simply a rectangle, but they

can be characterized by the fact that they have the proper dimensions of a monitor,

usually about 19 inches diagonally, and that they are usually placed on a desk. The

dimensions of beds are also consistent and are standardized into twin, queen, king,

etc.

We would like to make a distinction between the class of objects that can be better
characterized by their appearance in the image and the class of objects that can be

better characterized by their geometry in 3D. We define the first class of objects as

“painted” objects and the second class of objects as “sculpted” objects.


We argue that, in order to build an object detector, a different approach must be
taken for painted objects and sculpted objects. Appearance-based detectors can
be expected to work well for detecting painted objects, since these objects have
consistent appearance. However, for sculpted objects, the detector must focus on the
geometric properties of the object, which are what best characterize the object.

The focus of our work will be on developing a detector for sculpted objects, as

most prior work so far on object recognition has focused on the consistency of ap-

pearance of painted objects. Most objects, however, will not belong exclusively to

one class but will have aspects of both classes. Ultimately, we believe that both

aspects of appearance and geometry are very important for building an object de-

tector and must be used in conjunction for best performance. But as an initial study,

it is worth focusing our attention on sculpted objects to learn about the potential of

using geometric properties for object detection.

The physical dimensions of an object are a well-defined property in the real world.
Therefore, the prior distribution of the physical dimensions can be obtained by
measuring objects directly in the real world, rather than by learning it from
training images. One can also consult publicly available statistics of the physical
dimensions when available, as done by Hoiem et al. [35] for the heights of people and
cars.

In this work, we have built a system that detects objects in conjunction with the

building structure in indoor environments. The goal of this work is to detect common
objects in indoor environments that are strongly characterized by their geometry,
such as doors, desks, and monitors, as well as to recover the structure of the building
interior. We examine the physical dimensions of objects to verify that they have the
correct size and location to be a certain object. We also examine the physical
relations between objects to make sure that they are physically and semantically
correct.

5.2 Related Work

Object detection has a long history in computer vision. The past decade was
particularly successful, and the field has matured enough to be of practical use for a few
classes of objects, such as faces [68] and pedestrians [10]. Such efforts are expanding to
more classes of objects [18, 17], driven by organized challenges such as the PASCAL
challenge [14]. This success has been based on methods that make use of
appearance features. But as reported in [14], some objects turn out to be
easier to detect than others, even when the same method is applied.

In this work, we argue that, although some objects are effectively characterized

by their appearance, there are classes of objects that are better characterized by their

geometry. This idea has indeed been explored in the past and was the primary
approach for most of the history of computer vision, from the 1960s to the 1990s,
before the surge of appearance-based methods. Early geometry-based methods are
well summarized in the article by Mundy [50].

One of the earliest and most influential is the work on blocks world [54]. It
assumes that the world is composed of polyhedral components and solves for the
parameters of polyhedral models to fit edges. The work has been extended
by many researchers, especially in exploring constraints for labeling edges [27,
6, 37, 69, 48, 61]. These works were limited to either contrived scenes or ground-truth
line drawings, rather than real scenes, and the objects they considered
were artificial blocks and not realistic objects. Also, their focus was on recovering
the geometric structure of objects, rather than determining the semantic category.

Later, a group of works emerged that recognize objects by aligning manually
defined 3D object models to images [46, 4, 24, 1, 5, 38, 63]. Such methods bypass
the problem of grouping features and are robust to occlusion and missing
evidence. However, these methods eventually ran into the problem of ambiguity of
image features, so the focus of research shifted away from geometry toward
methods that learn statistical distributions of appearance.

Interesting recent work on chairs is presented in [23]. Chairs are a class
of objects that has been very difficult to detect because chairs can
vary so much in their appearance. Not only are there many types of chairs,
such as office chairs, dining chairs, couches, etc., but even within a single type of
chair, appearance in the image can differ drastically from one instance to another.
However, what is universally common among chairs is that they have a
support surface for a person to sit on, usually at a fairly consistent height, and an
optional surface for back support. In fact, what defines a chair is its geometry.
In [23], chairs are detected by looking at occupied voxels and finding places where
people can sit. This work recognizes that some objects require
examining not only their appearance but also their geometry for object detection.

5.3 Representation of objects and building structure

We have used simple three dimensional models to represent objects and the building

structure. We fit these models to a given image to estimate the location of objects

in the 3D world and the physical dimensions of objects.


Figure 5.1: Three common “sculpted” objects modeled using rectangles. (a) Desk
and computer monitor. (b) Doors.

5.3.1 Objects

We have used a simple geometric primitive, a rectangle, to represent three common
“sculpted” objects in indoor environments. The objects we have considered
are doors, desks, and computer monitors. The main rectangular surfaces of doors and
monitors are modeled with vertical rectangles, and the top surface of the desk is

modeled with a horizontal rectangle. Figure 5.1 illustrates rectangles representing

the three objects.

5.3.2 Geometric Properties of Objects

In order to verify a candidate hypothesis of a “sculpted” object, we measure
its geometric properties. We consider both the geometric properties of the object
itself (self geometric properties), such as the width and length of the object, and the
geometric properties of the object in relation to other components in the scene
(relational geometric properties). Both self and relational geometric properties are used
to evaluate whether a given candidate corresponds to an actual object.


Self Geometric Properties

Self geometric properties describe the geometric properties relating to the object

itself. For rectangular objects, such as doors, desks, and monitors, their physical di-

mensions are represented with two parameters, width and height. Other dimensions

can be used for objects modeled using different primitives. For “sculpted”
objects, we expect physical dimensions to be fairly consistent across different
instances of the same object category. Therefore, we expect them to be good features
to use for detecting objects.

We use a Gaussian distribution to model the probability of the parameters. The
distribution can be learned from a set of training data or can be collected from available
census data.
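A minimal sketch of how such a Gaussian model can score a candidate rectangle is given below. It assumes, for illustration only, that width and height are modeled by independent Gaussians; the function names are hypothetical, and any numeric prior passed in (for example, a door roughly 0.9 m wide and 2.0 m tall) is a placeholder rather than a measured statistic.

```python
import math
from typing import Dict, Tuple


def gaussian_log_pdf(x: float, mean: float, std: float) -> float:
    """Log density of a one-dimensional Gaussian evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def self_geometry_log_likelihood(
    width: float,
    height: float,
    prior: Dict[str, Tuple[float, float]],
) -> float:
    """Log-likelihood of a candidate's dimensions under a category prior.

    prior maps 'width' and 'height' to (mean, std) pairs. Treating the two
    dimensions as independent is an assumption of this sketch.
    """
    w_mean, w_std = prior["width"]
    h_mean, h_std = prior["height"]
    return gaussian_log_pdf(width, w_mean, w_std) + gaussian_log_pdf(height, h_mean, h_std)
```

A candidate whose dimensions are close to the category means receives a higher log-likelihood than one with the same dimensions swapped, which is the behavior a detector for rectangular objects needs.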

Relational Geometric Properties

Relational geometric properties describe the geometric relation between objects and

other components in the scene. One subset of relational geometric properties has
already been considered in the previous chapter, which considered rules that
arise from the physical constraints of our world and that apply to all objects in
the world, regardless of their semantic object category. The rules considered in the
previous chapter state that multiple objects may not share the same volume in the
world, and we applied this principle to aid in scene understanding. By following
those principles, we were able to discover locations that are occupied by an object,
but we did not attempt to identify the category of those objects. In addition to the
rules due to physics that apply to all object categories, we expand this set of rules
to include those that are specific to each object category (Figure 5.2). For example, the
height of the top surface of a desk relative to the floor is a characteristic feature


Figure 5.2: Relational geometric properties specific to object categories. Doors are on
walls. Desks are within the boundaries of a room and at a specific height from the floor.
Computer monitors are on desks. (Diagram labels: room; door on wall; desk inside room;
monitor on desk.)

that defines a desk. A desk usually has a specific height, so that it is comfortable

for humans to work on its surface. Also, computer monitors are placed on a desk,

with the bottom edge of the monitor being slightly raised above the top surface of

the desk. And doors are on a wall with their bottom edge aligned with the floor.

These relational geometric properties are consistent across different instances of
the same semantic relationship, and therefore, we believe that they are useful features
for identifying objects, along with self geometric properties.


5.4 Method Details

5.4.1 Creating rectangle hypotheses

The process to generate rectangle hypotheses is based on [51]. We create rectangle
hypotheses by connecting four line segments to define the four edges of a
rectangle. Line segments and their associated vanishing points are given as the
input. The vanishing point associated with a line segment determines its 3D
orientation. Given this input, we first decide on the orientation of the rectangle
that we want to generate. We can later repeat this process for rectangles of other
orientations. For a given orientation of a rectangle, we select the two sets of line
segments having the orientations of the edges of the rectangle. Each line segment is
assigned a unique ID. Then L-junctions (two edges) are formed from two line segments,
one of each orientation. L-junctions are categorized into four types: top-left,
top-right, bottom-left, and bottom-right (Figure 5.3). The notions of top/bottom and
left/right need not correspond to the actual meaning of top/bottom or left/right, as
long as they mark a direction and are consistent within the given image. The IDs of the
line segments forming each L-junction are also recorded.

Once all L-junctions of the four types are formed, we proceed to build rectangles by
tying L-junctions together. We first pick a type for the starting L-junction,
for example top-left, and progress in one direction, e.g., clockwise, to build all
U-junctions (three edges) of type left-top-right, and finally generate rectangles with
all four edges, left-top-right-bottom.

U-junctions are built by tying two L-junctions together. This is done efficiently
by looking at the type of each L-junction and the IDs of the line segments forming it.
That is, a U-junction with left-top-right edges can be formed by connecting
two L-junctions of type top-left and top-right that share the same ID for the
top line segment. Finally, U-junctions are closed to form a full cycle of four edges
and four L-junctions. This is done by first adding another L-junction to each U-junction
to form structures with three L-junctions (top-left, top-right, and bottom-right). Each
structure is made up of four line segments, with two bridging the three L-junctions
and two with open ends. We then search for the final L-junction of bottom-left type
marked with the IDs of the two open-ended line segments. If such an L-junction
exists, then the structure with three L-junctions, closed with the final L-junction, is
added as a completed rectangle.

Figure 5.3: Four types of L-junctions. (a) Given a designation of “up” and “right” directions,
L-junctions are categorized into four types: top-left, top-right, bottom-left, and
bottom-right. (b)(c) L-junctions are formed by connecting two line segments. Depending on the
relative configuration of the two line segments, they form different types of L-junctions.
(b) A bottom-right junction. (c) A top-left junction.

Figure 5.4: Connection of L-junctions. The IDs of the line segments that form L-junctions determine which L-junctions can be connected with each other. A top-left junction with ID (12,6) can connect with a top-right junction with ID (12,17), but not with one with ID (3,17).
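The ID-based connection of L-junctions can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the names LJunction and build_rectangles, the (horizontal ID, vertical ID) ordering of each junction's IDs, and the bottom line with ID 20 are assumptions made for the example. The top-left and top-right junctions are taken from Figure 5.4.

```python
from collections import defaultdict, namedtuple

# Hypothetical record for an L-junction: its type, the ID of its top or
# bottom line segment (h), and the ID of its left or right line segment (v).
LJunction = namedtuple("LJunction", ["kind", "h", "v"])


def build_rectangles(junctions):
    """Return rectangles as (left, top, right, bottom) line-segment IDs."""
    top_lefts = [j for j in junctions if j.kind == "top-left"]
    top_right_by_top = defaultdict(list)       # indexed by top segment ID
    bottom_right_by_right = defaultdict(list)  # indexed by right segment ID
    bottom_left_ids = set()                    # (bottom ID, left ID) pairs
    for j in junctions:
        if j.kind == "top-right":
            top_right_by_top[j.h].append(j)
        elif j.kind == "bottom-right":
            bottom_right_by_right[j.v].append(j)
        elif j.kind == "bottom-left":
            bottom_left_ids.add((j.h, j.v))

    rectangles = []
    for tl in top_lefts:
        # U-junction (left-top-right): the top segment ID must be shared.
        for tr in top_right_by_top.get(tl.h, []):
            # Third L-junction: the right segment ID must be shared.
            for br in bottom_right_by_right.get(tr.v, []):
                # Close the cycle with a bottom-left junction that carries
                # the IDs of the two open-ended segments (bottom and left).
                if (br.h, tl.v) in bottom_left_ids:
                    rectangles.append((tl.v, tl.h, tr.v, br.h))
    return rectangles


example = [
    LJunction("top-left", 12, 6),
    LJunction("top-right", 12, 17),
    LJunction("top-right", 3, 17),      # different top segment, not connected
    LJunction("bottom-right", 20, 17),  # line 20 is a made-up bottom segment
    LJunction("bottom-left", 20, 6),
]
rectangles = build_rectangles(example)
```

Indexing the junctions by type and by segment ID means that each connection step is a dictionary lookup rather than a search over all junction pairs, which is one way to realize the efficiency that the ID bookkeeping is meant to provide.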

5.4.2 Lifting Rectangle Hypotheses to 3D

Rectangle hypotheses generated from line segments have a known orientation in 3D, which is determined by the orientation of the four edges. However, the location of the rectangle is not determined by the process of connecting line segments. The location of a rectangle can be determined in two ways. The first method relates the rectangle to the environment and requires only a single image as input. If we already know the 3D structure of the room, and we know how the rectangle contacts the room, we can infer the 3D location of the rectangle from the point where it contacts the room. For example, if we know that an edge of a desk contacts a wall, then the contact point on the desk can be assumed to have the same 3D location as the corresponding point on the wall. We can then infer the 3D coordinates of the rest of the rectangle. The second method uses independent 3D measurements, such as those from a stereo camera, structure from motion, or a depth camera. In this work, we use 3D point clouds obtained from a stereo camera and use the points on the rectangle to determine its 3D position. Keeping the known orientation of the rectangle, its translation is estimated as the median of the translations of the points that fall on the rectangle when projected onto the image. This method requires an additional means of measurement, but the location of rectangles can be determined more reliably because it is estimated independently of the environment.
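The median-based estimate can be illustrated with a short Python sketch. It assumes, as one possible reading of the description above, that the "translation" of a point is its signed distance along the rectangle's known normal, and that the points whose projections fall inside the rectangle have already been selected. The function name plane_offset_from_points is hypothetical.

```python
from statistics import median


def plane_offset_from_points(normal, points):
    """Median signed distance of the 3D points along the plane normal.

    normal: (nx, ny, nz), the known orientation of the rectangle's plane.
    points: iterable of (x, y, z) points assumed to lie on the rectangle.
    """
    length = sum(c * c for c in normal) ** 0.5
    unit = tuple(c / length for c in normal)
    offsets = [sum(u * c for u, c in zip(unit, p)) for p in points]
    return median(offsets)


# A horizontal surface near height 1.0 with one gross outlier at 5.0.
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.1), (0.0, 1.0, 0.9),
          (1.0, 1.0, 5.0), (0.5, 0.5, 1.0)]
offset = plane_offset_from_points((0.0, 0.0, 2.0), points)
```

The median is used here, rather than the mean, because sparse stereo point clouds typically contain outliers; in this example the mean offset would be pulled to 1.8 by the single bad point, while the median stays at 1.0.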

5.4.3 Creating Building Structure Hypotheses

We create building structure hypotheses by creating instances of the indoor Manhattan models proposed in [45, 44]. The model assumes a single floor plane, a single ceiling plane, and walls that are orthogonal to each other. Hypotheses can be created either from a single image or from 3D measurements from a stereo camera. To create hypotheses from a single image, we adopt the method of [45, 44] directly. This method samples line segments and connects them to form building models. To create hypotheses from 3D measurements from a stereo camera, we first obtain the 3D orientations of the major surfaces from vanishing points and then fit planes with fixed orientation to the 3D point clouds to obtain potential walls, floor, and ceiling. Combinations of walls, floor, and ceiling provide hypotheses for the entire building structure.
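Fitting planes of fixed orientation reduces to a one-dimensional problem: once the normal is known, only the offset of the plane along that normal remains to be found. The sketch below shows one simple way to do this, by histogram voting over the projections of the points onto the normal. The voting scheme, the function name candidate_offsets, and the parameters bin_size and min_support are assumptions made for illustration; the fitting procedure actually used is not specified here.

```python
from collections import Counter


def candidate_offsets(points, normal, bin_size=0.5, min_support=3):
    """Offsets d of planes n . p = d that are supported by enough points.

    points: iterable of (x, y, z) points from the point cloud.
    normal: unit normal of the fixed plane orientation.
    """
    votes = Counter()
    for p in points:
        d = sum(n * c for n, c in zip(normal, p))
        votes[round(d / bin_size)] += 1  # vote for the nearest offset bin
    return sorted(b * bin_size for b, count in votes.items()
                  if count >= min_support)


# Points near a floor (z = 0) and a ceiling (z = 2.5), plus one stray point.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.01), (0.0, 1.0, -0.02),
         (0.0, 0.0, 2.5), (1.0, 0.0, 2.51), (0.0, 1.0, 2.49),
         (0.5, 0.5, 1.2)]
horizontal_planes = candidate_offsets(cloud, (0, 0, 1))
```

Running the same function with the two horizontal wall normals would give candidate wall planes, and combinations of the resulting candidates form hypotheses for the whole structure.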

5.5 Results

We present results of our method on two sequences taken in indoor environments. We used video sequences taken with a stereo camera and applied a stereo egomotion algorithm [2] to obtain sparse 3D point clouds and camera motion. A single configuration is estimated for the entire sequence by collecting evidence from all frames. Object hypotheses from each frame are projected onto the other frames in the sequence using the camera motion recovered from the stereo system.

In the “Office sequence”, all four walls and the floor have been accurately estimated, even though the walls and the floor are occluded by other objects. Two desk surfaces and two monitors have also been detected. Examining the self geometric properties of rectangle hypotheses ensures that the detected desks and monitors have plausible dimensions. The relational constraint between monitor and desk rules out the majority of monitor hypotheses that have the correct size to be a monitor but are not supported by a desk. In the “Common area sequence”, again all walls, the floor, and the ceiling have been accurately estimated. Relational geometric properties ensure that detected doors lie on walls. However, geometric properties are not sufficient to rule out rectangular structures caused by windows, which have proper dimensions to be doors, and this results in false detections at the beginning of the sequence.

Figure 5.5: Results for estimating building structure and detecting doors, desks, and monitors. (a) Office sequence. (b) Common area sequence.

Chapter 6

Conclusion

In this thesis, we have developed methods for scene understanding using three dimensional representation and reasoning. As our world is three dimensional and is made up of three dimensional components, modeling it with a three dimensional representation rules out invalid structures that can exist only as drawings on a 2D image, and helps keep the problem tractable. At the same time, the resulting structure is guaranteed to be physically valid. Rules derived by careful observation of the 3D representation allow us to perform 3D reasoning that makes inference efficient and tractable. We have also considered the geometric relationships among components in the scene. This ensures that the resulting configuration of components is physically valid and improves the accuracy of the estimate. Finally, we have demonstrated that, while some objects have proven to be effectively characterized by their appearance, there are also classes of objects that are better characterized by their geometry. For such objects, we have demonstrated the use of geometric features to detect and localize them.

6.1 Future Work

Our focus has been on indoor environments. In the future, it would be interesting to see geometry-based methods applied to broader domains, such as outdoor environments. Outdoor environments are not as highly structured as indoor environments, but similar ideas may be applied to outdoor scene understanding to produce geometrically plausible estimates and to improve performance.

We have suggested using 3D geometry as the characterizing feature for certain object classes. However, all objects lie somewhere on the spectrum between being well characterized by their appearance and being well characterized by their geometry. Therefore, we think that the next step is to fuse appearance-based methods and 3D-geometry-based methods to build object classifiers that apply to all classes of objects. There is already work that incorporates 2D spatial factors into modeling the appearance. It would be interesting to see detectors that consider three dimensional spatial factors and appearance together.

To apply methods that make use of 3D geometry for scene understanding, it is natural to use three dimensional measurements, rather than to be confined to using only images. Multi-view methods have matured and can now produce accurate 3D point clouds. Recently introduced low-cost depth cameras provide an easy way to obtain accurate 3D point clouds. It would be interesting to use 3D-geometry-based methods with 3D measurements along with images to advance scene understanding.

Bibliography

[1] N. Ayache and O. Faugeras. Hyper: A new approach for the recognition

and positioning of two-dimensional objects. IEEE Transactions on Pattern

Analysis and Machine Intelligence, 1986.

[2] H. Badino and T. Kanade. A head-wearable short-baseline stereo system for

the simultaneous estimation of structure and motion. In IAPR Conference on

Machine Vision Applications (MVA), Nara, Japan, 2011.

[3] Olga Barinova, Victor Lempitsky, Elena Tretiak, and Pushmeet Kohli. Geometric image parsing in man-made environments. In European Conference on Computer Vision, 2010.

[4] R. Bolles and R. Cain. Recognizing and locating partially visible objects:

The local-feature-focus method. International Journal of Robotics Research,

1982.

[5] R. Bolles and R. Horaud. 3DPO: A three-dimensional part orientation system. International Journal of Robotics Research, 1986.

[6] M. B. Clowes. On seeing things. In Artificial Intelligence, 1971.

[7] Microsoft Corp. Redmond WA. Kinect for Xbox 360.

[8] J.M. Coughlan and A.L. Yuille. Manhattan world: Compass direction from a

single image by bayesian inference. In Proceedings ICCV, 1999.

[9] A. Criminisi, I. Reid, and A. Zisserman. Single view metrology. In Proc.

International Conference on Computer Vision (ICCV), 1999.

[10] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human

detection. In Proceedings of IEEE Conference Computer Vision and Pattern

Recognition, 2005.

[11] Erick Delage, Honglak Lee, and Andrew Y. Ng. Automatic single-image 3d

reconstructions of indoor manhattan world scenes. In ISRR, 2005.

[12] Erick Delage, Honglak Lee, and Andrew Y. Ng. A dynamic bayesian net-

work model for autonomous 3d reconstruction from a single indoor image. In

CVPR, 2006.

[13] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class

object layout. In ICCV, 2009.

[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman.

The pascal visual object classes (voc) challenge. International Journal of

Computer Vision, 88(2):303–338, June 2010.

[15] Olivier Faugeras, Quang-Tuan Luong, and Theodore Papadopoulo. The ge-

ometry of multiple images. MIT Press, 2001.

[16] Olivier Faugeras and Quang-Tuan Luong. The Geometry of Multiple Images. The MIT Press, 2001.

[17] P. Felzenszwalb, R. Girshick, and D. McAllester. Cascade object detection

with deformable part models. In IEEE Conference on Computer Vision and

Pattern Recognition (CVPR), 2010.

[18] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained,

multiscale, deformable part model. In IEEE Conference on Computer Vision

and Pattern Recognition (CVPR), 2008.

[19] Alex Flint, Christopher Mei, David Murray, and Ian Reid. A dynamic pro-

gramming approach to reconstructing building interiors. 2010.

[20] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski.

Manhattan-world stereo. In CVPR, 2009.

[21] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski.

Reconstructing building interiors from images. In ICCV, 2009.

[22] Stephen Gould, Richard Fulton, and Daphne Koller. Decomposing a scene

into geometric and semantically consistent regions. In ICCV, 2009.

[23] H. Grabner, J. Gall, and L. van Gool. What makes a chair a chair? In IEEE

Conference on Computer Vision and Pattern Recognition (CVPR’11), 2011.

[24] W. E. L. Grimson and T. Lozano-Perez. Model-based recognition and local-

ization from sparse range or tactile data. International Journal of Robotics

Research, 1984.

[25] Abhinav Gupta, Alexei Efros, and Martial Hebert. Blocks world revisited:

Image understanding using qualitative geometry and mechanics. In European

Conference on Computer Vision (ECCV), 2010.

[26] Abhinav Gupta, Scott Satkin, Alexei A. Efros, and Martial Hebert. From 3d scene geometry to human workspace. In Computer Vision and Pattern Recognition (CVPR), 2011.

[27] A. Guzman. Decomposition of a visual scene into three-dimensional bodies.

In Proceedings of Fall Joint Computer Conference, 1968.

[28] F. Han and S.C. Zhu. Bottom-up/top-down image parsing by attribute graph

grammar. In Proc. Int’l Conf. on Computer Vision (ICCV), 2005.

[29] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer

vision. Cambridge University Press, 2003.

[30] X. He, R. S. Zemel, and M. A. Carreira-Perpinan. Multiscale conditional

random fields for image labeling. In CVPR, 2004.

[31] Varsha Hedau, Derek Hoiem, and David Forsyth. Recovering the spatial

layout of cluttered rooms. In International Conference on Computer Vision

(ICCV), 2009.

[32] Varsha Hedau, Derek Hoiem, and David Forsyth. Thinking inside the box:

Using appearance models and context based on room geometry. In European


[33] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. In NIPS, 2008.

[34] Derek Hoiem, Alexei Efros, and Martial Hebert. Geometric context from

a single image. In Proceedings of IEEE Conference Computer Vision and

Pattern Recognition, 2005.

[35] Derek Hoiem, Alexei Efros, and Martial Hebert. Putting objects in perspec-

tive. In CVPR, 2006.

[36] Derek Hoiem, Alexei Efros, and Martial Hebert. Recovering surface lay-

out from an image. International Journal on Computer Vision (IJCV), 75(1),

2007.

[37] D. A. Huffman. Impossible objects as nonsense sentences. In Machine Intel-

ligence, 1971.

[38] D. P. Huttenlocher and S. Ullman. Object recognition using alignment. In

Proceedings of the First International Conference on Computer Vision, 1987.

[39] T. Kanade. A theory of origami world. In Artificial Intelligence, 1980.

[40] J. Kosecka and W. Zhang. Video compass. In Proceedings of European Con-

ference on Computer Vision, pages 657 – 673, 2002.

[41] J. Kosecka and W. Zhang. Extraction, matching and pose recovery based

on dominant rectangular structures. Computer Vision Image Understanding,

2005.

[42] P. D. Kovesi. MATLAB and Octave functions for computer vision and image processing. School of Computer Science & Software Engineering, The University of Western Australia. Available from: <http://www.csse.uwa.edu.au/~pk/research/matlabfns/>.

[43] Sanjiv Kumar and Martial Hebert. Discriminative fields for modeling spatial dependencies in natural images. In Advances in Neural Information Processing Systems (NIPS), December 2003.

[44] David C. Lee, Abhinav Gupta, Martial Hebert, and Takeo Kanade. Estimating

spatial layout of rooms using volumetric reasoning about objects and surfaces.

In Advances in Neural Information Processing Systems 24 (NIPS), 2010.

[45] David Changsoo Lee, Martial Hebert, and Takeo Kanade. Geometric reason-

ing for single image structure recovery. In IEEE Computer Society Conference

on Computer Vision and Pattern Recognition (CVPR), June 2009.

[46] David Lowe. Perceptual organization and visual recognition. Kluwer Aca-

demic Publishers, 1985.

[47] Yi Ma, S. Shankar Sastry, Jana Kosecka, and Stefano Soatto. An invitation

to 3-d vision: From images to geometric models. Interdisciplinary Applied

Mathematics Series. Springer-Verlag New York, 2003.

[48] A. K. Mackworth. Interpreting pictures of polyhedral scenes. In Artificial

Intelligence, 1973.

[49] B. Micusik, H. Wildenauer, and J. Kosecka. Detection and matching of recti-

linear structures. In IEEE Conference on Computer Vision and Pattern Recog-

nition, 2008.

[50] Joseph L. Mundy. Object recognition in the geometric era: A retrospective. In Toward Category-Level Object Recognition, volume 4170 of Lecture Notes in Computer Science, pages 3–29. Springer, 2006.

[51] Ana Cris Murillo, J. Kosecka, J. J. Guerrero, and C. Sagues. Visual door

detection integrating appearance and shape cues. Robotics and Autonomous

Systems, 2008.

[52] Vladimir Nedovic, Arnold W.M. Smeulders, and Andre Redert. Depth information by stage classification. In Proc. International Conference on Computer Vision, 2007.

[53] Y. Ohta, T. Kanade, and T. Sakai. An analysis system for scenes containing

objects with substructures. IJCPR, pages 752-754, 1978.

[54] Lawrence G. Roberts. Machine perception of three-dimensional solids.

OEOIP, pages 159-197, 1965.

[55] C. Rother. A new approach for vanishing point detection in architectural en-

vironments. In BMVC, pages 382–391, 2000.

[56] Henry Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based

face detection. In Computer Vision and Pattern Recognition ’96, June 1996.

[57] Ashutosh Saxena, Sung H. Chung, and Andrew Y. Ng. Learning depth from single monocular images. In Neural Information Processing Systems (NIPS), 2005.

[58] Ashutosh Saxena, Min Sun, and Andrew Y. Ng. Make3d: Learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2008.

[59] Henry Schneiderman and Takeo Kanade. A statistical model for 3d object

detection applied to faces and cars. In IEEE Conference on Computer Vision

and Pattern Recognition. IEEE, June 2000.

[60] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, 2006.

[61] K. Sugihara. A necessary and sufficient condition for a picture to represent a

polyhedral scene. IEEE Transactions on Pattern Analysis and Machine Intel-

ligence, PAMI, 1984.

[62] J.-P. Tardif. Non-iterative approach for fast and accurate vanishing point detection. In 12th IEEE International Conference on Computer Vision, 2009.

[63] D. W. Thompson and J. L. Mundy. Three-dimensional model matching from

an unconstrained viewpoint. In Proceedings of the International Conference

on Robotics and Automation, 1987.

[64] Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Contextual

models for object detection using boosted random fields. In NIPS, 2005.

[65] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin

Altun. Large margin methods for structured and interdependent output vari-

ables. Journal of Machine Learning Research 6: 1453-1484, 2005.

[66] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR, 2008.

[67] F.A. van den Heuvel. Vanishing point detection for architectural photogram-

metry.

[68] Paul Viola and Michael Jones. Robust real-time face detection. In IEEE

International Conference on Computer Vision, 2001.

[69] D. A. Waltz. Generating semantic descriptions from line drawings of scenes

with shadows. Technical report, MIT, 1972.

[70] Huayan Wang, Stephen Gould, and Daphne Koller. Discriminative learning

with latent variables for cluttered indoor scene understanding. In European


[71] Stella Yu, Hao Zhang, and Jitendra Malik. Inferring spatial layout from a sin-

gle image via depth-ordered grouping. In IEEE Computer Society Workshop

on Perceptual Organization in Computer Vision, 2008.
