
Context-based vision system for place and object recognition

Antonio Torralba, Kevin P. Murphy, William T. Freeman (MIT AI Lab, Cambridge, MA 02139) and Mark A. Rubin (Lincoln Labs, Lexington, MA 02420)

Abstract

While navigating in an environment, a vision system has to be able to recognize where it is and what the main objects in the scene are. In this paper we present a context-based vision system for place and object recognition. The goal is to identify familiar locations (e.g., office 610, conference room 941, Main Street), to categorize new environments (office, corridor, street), and to use that information to provide contextual priors for object recognition (e.g., tables are more likely in an office than a street). We present a low-dimensional global image representation that provides relevant information for place recognition and categorization, and show how such contextual information introduces strong priors that simplify object recognition. We have trained the system to recognize over 60 locations (indoors and outdoors) and to suggest the presence and locations of more than 20 different object types. The algorithm has been integrated into a mobile system that provides real-time feedback to the user.

1. Introduction

We want to build a vision system that can tell where it is and what it is looking at as it moves through the world. This problem is very difficult and is largely unsolved. Our approach is to exploit visual context, by which we mean a low-dimensional representation of the whole image (the "gist" of the scene) [6]. Such a representation can be easily computed without having to identify specific regions or objects. Having identified the overall type of scene, one can then proceed to identify specific objects within the scene.

The power of, and need for, context is illustrated in Figure 1. In Figure 1(a), we see a close-up view of an object; this is the kind of view commonly studied in the object recognition community. The recognition of the object as a coffee machine relies on knowing detailed local properties (its typical shape, the materials it is made of, etc.). In Figure 1(b), we see a more generic view, where the object occupies a small portion of the image. The recognition now relies on contextual information, such as the fact that we are in a kitchen. Contextual information helps to disambiguate the identity of the object despite the poverty of the local object features (Figure 1(c)).

Object recognition in context is based on our knowledge of scenes and how objects are organized. The recognition of the scene as a kitchen reduces the number of objects that need to be considered, which allows us to use simple features for recognition. Furthermore, the recognition of this scene as a particular kitchen (here, the kitchen of our lab) further increases the confidence about the identity of the object. Hence place recognition can be useful for object recognition.

Traditionally, place and object recognition are considered separate problems. Place recognition is mostly studied in the mobile robotics community, where the problem is called topological localization. (Topological localization refers to identifying the discrete location of the robot, e.g., which room it is in, as opposed to metric localization, which refers to establishing the precise coordinates of the robot.) The main novelty of our approach to place recognition compared to previous ones, such as [17, 3], is that we also try to identify the category of the location, which works even for places which have not been seen before.

Previous approaches to object recognition have focused, for the most part, on using local features to classify each image patch independently (see, e.g., [4, 7, 9, 10, 16]). Our approach uses global image features to predict the scene, and then uses the scene as a prior for the local detectors.

Another difference between our approach to object recognition and previous approaches is that we try to recognize a large set of object types (24) in a natural, unconstrained setting (images are collected by a wearable camera), as opposed to recognizing a small number of classes (such as faces and cars [7, 10, 16]), or using a constrained setting (such as uniform backgrounds [9, 4]). Finally, when performing object detection and localization, we only use global image features. At the end of this paper, we discuss how our approach to detection can be combined with more traditional object localization methods, which make use of local features.

(a) Isolated object (b) Object in context (c) Low-res Object

Figure 1. (a) A close-up of an object; (b) an object in context; (c) a low-res object out of context. Observers in our lab, all coffee addicts, have difficulty recognizing the coffee machine in (c), yet they recognize it readily in (a) and (b).

2. Global image features

The regularities of real-world scenes suggest that we can define features correlated with scene properties without having to specify individual objects within a scene, just as we can build face templates without needing to specify facial features. Some scene features, like collections of views [2, 17] or color histograms [12], perform well for recognizing specific places, but they are less able to generalize to new places (we show some evidence for this claim in Section 3.4). We would like to use features that are related to functional constraints, as opposed to accidental (and therefore highly variable) properties of the environment. This suggests examining the textural properties of the image and their spatial layout.

To compute texture features, we use a wavelet image decomposition. Each image location is represented by the output of filters tuned to different orientations and scales. We use a steerable pyramid [11] with 6 orientations and 4 scales applied to the intensity (monochrome) image. (Similar results are obtained using Gabor filters.) The local representation of an image at instant $t$ is then given by the jet $v^L_t(x) = \{v_{t,k}(x)\}_{k=1,\dots,N}$, where $N = 24$ is the number of subbands.

We would like to capture global image properties, while keeping some spatial information. Therefore, we take the mean value of the magnitude of the local features averaged over large spatial regions:

$$m_t(x) = \sum_{x'} |v^L_t(x')|\, w(x' - x),$$

where $w(x)$ is the averaging window. The resulting representation is downsampled to a spatial resolution of $M \times M$ pixels (here we use $M = 4$). Thus, $m_t$ has size $M \times M \times N = 384$. We further reduce the dimensionality by projecting $m_t$ onto the first $D$ principal components (PCs), computed using a database of thousands of images collected with our wearable system. The resulting $D$-dimensional feature vector is denoted $v^G_t$. This representation proves to be rich enough to describe important scene context, yet is of low enough dimensionality to allow for tractable learning and inference.

Figure 2. Two images from our data set, and noise patterns which have the same global features. This shows that the features pick up on coarse-grained texture, dominant orientations and spatial organization.

Figure 2 illustrates the information that is retained using this representation with $D = 80$ PCs. Each example shows one image and an equivalent textured image that shares the same 80 global features. The textured images are generated by coercing noise to have the same features as the original image, while matching the statistics of natural images [8].
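As a concrete illustration, the following sketch (not the authors' code) computes a gist-style global feature vector in NumPy, substituting a Gabor-like filter bank built in the Fourier domain for the steerable pyramid (the text notes that Gabor filters give similar results). The PCA mean and basis, `pca_mean` and `pca_basis`, are assumed to have been learned offline from the image database; the radial/angular bandwidths are illustrative choices.

```python
import numpy as np

def oriented_bank(shape, n_orient=6, n_scale=4):
    # Band-pass transfer functions (Fourier domain), one per orientation/scale.
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    rad = np.hypot(fx, fy) + 1e-8
    ang = np.arctan2(fy, fx)
    bank = []
    for s in range(n_scale):
        f0 = 0.25 / (2.0 ** s)                      # center frequency per scale
        for o in range(n_orient):
            dang = np.angle(np.exp(1j * (ang - np.pi * o / n_orient)))
            bank.append(np.exp(-np.log(rad / f0) ** 2 / 0.5) *
                        np.exp(-dang ** 2 / 0.3))
    return bank                                     # N = 24 subbands

def gist(img, pca_mean, pca_basis, grid=4):
    # v_t^G: filter magnitudes -> 4x4 spatial averages -> first D PCs.
    F = np.fft.fft2(img.astype(float))
    pooled = []
    for T in oriented_bank(img.shape):
        mag = np.abs(np.fft.ifft2(F * T))           # |v_{t,k}(x)|
        for rows in np.array_split(mag, grid, axis=0):
            for cell in np.array_split(rows, grid, axis=1):
                pooled.append(cell.mean())          # m_t(x), one value per region
    m = np.asarray(pooled)                          # M*M*N = 384 values
    return pca_basis @ (m - pca_mean)               # D = 80 dimensional v_t^G
```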

3. Place recognition

In this section we describe the context-based place recognition system. We start by describing the set-up used to capture the image sequences used in this paper. Then we study the problem of recognizing familiar places. Finally, we discuss how to do scene categorization when the system is navigating in a new environment.

3.1 Wearable test bed

As a test-bed for the approach proposed here, we use a helmet-mounted mobile system. The system is composed of a web-cam that is set to capture 4 images/second at a resolution of 120x160 pixels (color). The web-cam is mounted on a helmet in order to follow the head movements while the user explores their environment. The user receives feedback about system performance through a head-mounted display.

This system allows us to acquire images under realistic conditions while the user navigates the environment. The resulting sequences contain many low-quality images, due to motion blur, saturation or low contrast (when lighting conditions suddenly change), non-informative views (e.g., a close-up view of a door or wall), unusual camera angles, etc. However, our results show that our system is reasonably robust to all of these difficulties.

Two different users captured the images used for the experiments described in the paper while visiting 63 different locations at different times of day. The locations were visited in a fairly random order.

3.2 Model for place recognition

The goal of the place recognition system is to estimate the most likely (discrete) location of the user given all the visual features up to time $t$. Let the place be denoted by $Q_t \in \{1, \dots, N_Q\}$, where $N_Q = 63$ is the number of places, and let the global features up to time $t$ be denoted by $v^G_{1:t}$. We can use a hidden Markov model (HMM) to solve the localization problem by recursively computing $P(Q_t \mid v^G_{1:t})$ as follows:

$$P(Q_t = q \mid v^G_{1:t}) \propto p(v^G_t \mid Q_t = q) \sum_{q'} A(q', q)\, P(Q_{t-1} = q' \mid v^G_{1:t-1}),$$

where $A(q', q) = P(Q_t = q \mid Q_{t-1} = q')$ is the transition matrix (topological map) and $p(v^G_t \mid Q_t = q)$ is the observation likelihood, described below.¹

We learn the transition matrix from labelled sequence data by counting the number of transitions from location $i$ to location $j$. We add a Dirichlet smoothing prior to the count matrix so that we do not assign zero likelihood to transitions which did not appear in the training data.
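A minimal sketch of this estimation, assuming labeled training sequences are given as lists of integer place indices; the pseudo-count `alpha` is a hypothetical choice.

```python
import numpy as np

def learn_transition_matrix(sequences, n_places, alpha=0.5):
    # Count q -> q' transitions in the labeled training sequences; the
    # Dirichlet pseudo-count alpha keeps unseen transitions possible.
    counts = np.full((n_places, n_places), alpha)
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)   # row q' holds A(q', q)
```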

We model the appearance of each location as a set of views, i.e., a mixture of spherical Gaussians; the variance, $\sigma^2$, and the number of mixture components, $K_q$, are set by cross-validation. We could compute the means and mixture weights using EM. However, in this paper, we adopt the simpler strategy of picking the means to be prototypes (views) chosen uniformly from all views assigned to the location; the weights are then set uniformly. The result is essentially a sparse Parzen window density estimator. In the future, we plan to investigate ways of picking the most informative (discriminative) prototypes, possibly based on entropy-minimizing techniques such as those discussed in [14].
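The observation model and the recursive HMM update can be sketched as follows; `prototypes_per_place[q]` is assumed to hold the stored prototype feature vectors for place q, and `sigma` the cross-validated kernel width.

```python
import numpy as np

def obs_likelihood(v, prototypes_per_place, sigma):
    # Sparse Parzen estimate of p(v_t^G | Q_t = q): a uniform-weight mixture
    # of spherical Gaussians centered on the stored prototype views.
    lik = np.empty(len(prototypes_per_place))
    for q, protos in enumerate(prototypes_per_place):   # protos: (K_q, D) array
        d2 = ((protos - v) ** 2).sum(axis=1)
        lik[q] = np.exp(-d2 / (2.0 * sigma ** 2)).mean()
    return lik

def hmm_filter_step(belief_prev, A, lik):
    # P(Q_t = q | v_{1:t}^G) ∝ p(v_t^G | q) * sum_{q'} A(q', q) P(Q_{t-1} = q' | ...)
    post = lik * (belief_prev @ A)
    return post / post.sum()
```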

3.3 Performance of place recognition

In this section, we discuss the performance of the place recognition system when tested on a sequence that starts indoors (in building 400) and then moves outdoors.

¹ Note that this use of HMMs is quite different from previous approaches in wearable computing such as [12]. In our system, states represent 63 different locations, whereas Starner et al. used a collection of separate left-to-right HMMs to classify approach sequences to one of 14 rooms.


Figure 3. Performance of place recognition for a sequence that starts indoors and then goes outdoors. Top: the solid line represents the true location, and the dots represent the posterior probability associated with each location, $P(Q_t \mid v^G_{1:t})$, where shading intensity is proportional to probability. There are 63 possible locations, but we only show those with non-negligible probability mass. Middle: the estimated category of each location, $P(C_t \mid v^G_{1:t})$. Bottom: the estimated probability of being indoors or outdoors.

The test sequence was captured in the same way as the training sequences, namely by walking around the environment, in no particular order, but with an attempt to capture a variety of views and objects in each place. A qualitative impression of performance can be obtained from Figure 3 (top), which plots the belief state, $P(Q_t \mid v^G_{1:t})$, over time. We see that the system believes the right thing nearly all of the time. Some of the errors are due to the inherent ambiguity of discretizing space into regions; for example, during one interval the system is not sure whether to classify the location as "Draper street" or "Draper plaza". Other errors are due to poorly estimating the transition matrix. For example, at one point there is a transition from "elevator 200/6" to "elevator 400/1" which never occurred in the training set. The Dirichlet prior prevents us from ruling out this possibility, but it is considered unlikely; hence the incorrect belief that the system is in "corridor 6a" just before that transition.

A more quantitative assessment of performance can be obtained by computing precision-recall curves. The recall rate is the fraction of frames which the system is required to label (with the most likely location); this can be varied by adjusting a threshold, $\theta$, and only labeling frames for which $\max_q P(Q_t = q \mid v^G_{1:t}) > \theta$. The precision is the fraction of labeled frames that are labeled correctly.
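A sketch of how such a precision-recall curve could be computed from the filtered beliefs; `beliefs` is assumed to be a (frames × places) array of posteriors and `truth` the labeled locations.

```python
import numpy as np

def precision_recall(beliefs, truth, thresholds):
    # beliefs: (T, N_Q) filtered posteriors; truth: (T,) true place indices.
    conf = beliefs.max(axis=1)                 # confidence of the top place
    pred = beliefs.argmax(axis=1)              # most likely place per frame
    curve = []
    for theta in thresholds:
        keep = conf > theta                    # frames the system labels
        recall = keep.mean()
        precision = (pred[keep] == truth[keep]).mean() if keep.any() else 1.0
        curve.append((recall, precision))
    return curve
```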

The precision-recall framework can be used to assess the performance of a variety of parameters. In Figure 4(a) we compare the performance of three different features, computed by subsampling and then extracting the first 80 principal components from (1) the intensity image, (2) the color image, and (3) the output of the spatially averaged steerable pyramid filter bank. We see that the filter bank works best, then color, and finally PCA applied to the raw intensity image.

In Figure 4(b), we show the effect of "turning the HMM off", by using a uniform transition matrix (i.e., setting $A(q', q) = 1/N_Q$). It is clear that the HMM provides a significant increase in performance (at negligible computational cost), because it performs temporal integration. We also compared to a more ad hoc approach of averaging $P(Q_t \mid v^G_t)$ over a window of the last $W$ frames, as was done in [14]; the best window size was chosen by cross-validation, and this is what is shown in Figure 4(b). Nevertheless, the HMM is much better. (Results for $W = 1$, i.e., without any temporal averaging, are significantly worse and are not shown.)

When using the HMM, we noticed that the observation likelihood terms, $p(v^G_t \mid Q_t = q)$, often dominated the effects of the transition prior. This is a well-known problem with HMMs when using mixtures of high-dimensional Gaussians (see, e.g., [1, p. 142]). We adopt the standard solution of rescaling the likelihood terms; i.e., we use

$$\tilde{p}(v^G_t \mid Q_t = q) \propto p(v^G_t \mid Q_t = q)^{\lambda},$$

where the exponent $\lambda$ is set by cross-validation. The net effect is to "balance" the transition prior with the observation likelihoods. (It is possible that a similar effect could be achieved using a density more appropriate to images, such as a mixture of Laplace distributions.)
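In code, this tempering is a one-liner applied to the likelihood vector before the filtering update; the exponent value shown here is purely illustrative (the paper sets it by cross-validation).

```python
def tempered_likelihood(lik, lam=0.1):
    # Raise the observation likelihood to an exponent lam < 1 so that it
    # does not swamp the transition prior in the HMM update; lik can be a
    # NumPy array of per-place likelihoods.
    return lik ** lam
```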

3.4 Categorizing novel places

In addition to recognizing the specific location, we would like the system to be able to categorize places into various high-level classes such as office, corridor, street, etc. There are several ways to do this. The simplest is to use the HMM described above, and then to sum up the probability mass assigned to all places which belong to the same category; e.g., the probability I am in an office is P(in Antonio's office) + P(in Kevin's office) + P(in Bill's office). However, this will not work in new, unfamiliar environments.
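For familiar environments, this summing strategy is straightforward to implement; `place_to_category` is a hypothetical lookup from place index to category index.

```python
import numpy as np

def category_from_places(place_belief, place_to_category, n_categories):
    # Sum the place posterior mass within each category, e.g.
    # P(office) = P(Antonio's office) + P(Kevin's office) + ...
    post = np.zeros(n_categories)
    for q, p in enumerate(place_belief):
        post[place_to_category[q]] += p
    return post
```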

Instead, we train an independent HMM on category labels, i.e., we estimate the category, $P(C_t \mid v^G_{1:t})$, independently of the place, $P(Q_t \mid v^G_{1:t})$.


Figure 4. Precision-recall curves for different features for place recognition. The solid lines represent median performance computed using leave-one-out cross-validation on all 17 sequences. The error bars represent the 80% probability region around the median. The curves represent different features; from top to bottom: filter bank, color, monochrome (see text for details). (a) With the HMM (learned transition matrix). (b) Without the HMM (uniform transition matrix; temporal averaging over a window of $W$ frames).

(Below we will see that the category appearance model, $p(v^G_t \mid C_t)$, uses different features of $v^G_t$ than the specific place appearance model, $p(v^G_t \mid Q_t)$.) We used 17 categories, including "office", "corridor", "street", etc. We also used the same methodology (a third, independent HMM) to classify the scene as indoor or outdoor.

We train the categorization system on outdoor and indoor sequences from building 200, and then test it on a sequence which starts in building 400 (not in the training set) and then, at $t \approx 1500$, moves outside (included in the training set). The results are shown in Figure 5. Before the transition, the place recognition system has a uniform belief state, representing complete uncertainty, but the categorization system performs well. As soon as we move to familiar territory, the place recognition system becomes confident again (the system is able to re-localize itself within the map).

We also computed precision-recall curves to assess the performance of different features at the categorization task. The results are shown in Figure 6. Categorization performance is worse than recognition performance, despite the fact that there are fewer categories than places (17 instead of 63). There are several reasons for this. First, the variability of a class is much larger than the variability of a place, so the problem is intrinsically harder. Second, some categories (such as "open space" and "office") are visually very similar, and tend to get confused, even by people. Third, we have a smaller training set for estimating $P(C_t \mid C_{t-1})$, since we observe fewer transitions between categories than between instances.

Interestingly, we see that color performs very poorly at the categorization task.


Figure 5. Place categorization when navigating in a new environment not included in the training set (frames 1 to 1500). During the novel sequence, the place recognition system has low confidence everywhere, but the place categorization system is still able to classify offices, corridors and conference rooms. After returning to a known environment (after $t \approx 1500$), performance returns to the levels shown in Figure 3.

This is due to the fact that the color of many categories of places (such as offices, kitchens, etc.) may change dramatically from one environment to the next (see Figure 7). The structural composition of the scene, on the other hand, is more invariant. Hence, although color is a good cue for recognition, it is not so good for categorization (with the exception of certain natural "objects", such as sky, sun and trees, whose color is more constrained).

4. From scenes to objects

Most approaches to object detection and recognition involve examining the local visual features at a variety of positions and scales, and comparing the result with the set of all known object types. However, the context can provide a strong prior for which objects are likely to appear, as well as their expected size and position within the image, thus reducing the need for brute-force search [15, 13]. In addition, the context can help disambiguate cases where local features are insufficient. In this paper, the context consists of the global scene representation, $v^G_t$, the current location, $Q_t$, and/or the current place category, $C_t$. We show how we can use the context to predict properties of objects without even looking at the local visual evidence.


Figure 6. Precision-recall curves for categorization of non-familiar indoor environments, (a) with HMM and (b) without HMM. The curves represent different feature sets; from top to bottom: filter bank, monochrome and color. Note that color now performs worse than monochrome, the opposite of Figure 4.

Figure 7. Averages of the color (top) and texture (bottom) signatures of offices and corridors for two different buildings (200 and 400). While the algorithm uses a richer representation than simply the mean images shown here, these averages show that the overall color of offices/corridors varies significantly between the two buildings, whereas the texture features are more stable.

Let $O_{t,i}$ represent the attributes of all objects of type $i$ in image $v_t$; these attributes could include the number of such objects (zero or more), their size, shape, appearance, etc. Let $\vec{O}_t = (O_{t,1}, \dots, O_{t,N_O})$, where $N_O = 24$ is the number of object types considered here (bicycles, cars, people, buildings, chairs, computers, etc.). We can compute $P(\vec{O}_t \mid v^G_{1:t})$ as follows:

$$P(\vec{O}_t \mid v^G_{1:t}) = \sum_q P(\vec{O}_t \mid Q_t = q, v^G_t)\, P(Q_t = q \mid v^G_{1:t}),$$

where we assume that knowing the current location, $Q_t$, renders previous frames, $v^G_{1:t-1}$, irrelevant. The second term is the output of the HMM, as discussed in Section 3. The first term can be computed using Bayes' rule:

$$P(\vec{O}_t \mid Q_t, v^G_t) \propto p(v^G_t \mid \vec{O}_t, Q_t)\, P(\vec{O}_t \mid Q_t) = \prod_i p(v^G_t \mid O_{t,i}, Q_t) \prod_i P(O_{t,i} \mid Q_t),$$

where we have assumed that the likelihood and the prior are both fully factorized. This allows us to treat each object type independently. In particular, the posterior marginals can be computed as follows:

$$P(O_{t,i} \mid v^G_{1:t}) \propto \sum_q p(v^G_t \mid O_{t,i}, Q_t = q)\, P(O_{t,i} \mid Q_t = q)\, P(Q_t = q \mid v^G_{1:t-1}).$$

In order to compute $p(v_t \mid O_{t,i}, Q_t)$, we have to make some approximations. A common approximation is to assume that the object's properties (presence, location, size, appearance, etc.) only influence a set of local features, $v^L_t$ (a subset of $v_t$). Thus

$$p(v_t \mid O_{t,i}, Q_t) \approx p(v^L_t \mid O_{t,i}, Q_t).$$

However, the global context is a very powerful cue that we want to exploit. Hence we include some global scene features, $v^G_t$ (a deterministic function of $v_t$):

$$p(v_t \mid O_{t,i}, Q_t) \approx p(v^L_t, v^G_t \mid O_{t,i}, Q_t) = p(v^L_t \mid v^G_t, O_{t,i}, Q_t)\, p(v^G_t \mid O_{t,i}, Q_t).$$

The first term, $p(v^L_t \mid v^G_t, O_{t,i}, Q_t)$, can be approximated by $p(v^L_t \mid O_{t,i})$, assuming that the object attributes specify the object appearance (although this ignores the effect of some global scene factors, such as lighting). In this paper, we ignore the first term (i.e., the local appearance model) and focus on the second term, $p(v^G_t \mid O_{t,i}, Q_t)$, which is related to the global context.

4.1 Contextual priming for object detection

In this section, we assume $O_{t,i}$ is just a binary random variable, representing whether any object of type $i$ is present in the image or not. $P(O_{t,i} = 1 \mid v^G_{1:t})$ can be used to do object priming, i.e., to decide which object detectors to run. It can also be used to rank-order images for image retrieval tasks in which we try to find images likely to contain instances of object type $i$. We can compute the probability that an object type is present using Bayes' rule:

$$P(O_{t,i} = 1 \mid v^G_t, Q_t = q) = \frac{p(v^G_t \mid O_{t,i} = 1, q)\, P(O_{t,i} = 1 \mid q)}{\sum_{o \in \{0,1\}} p(v^G_t \mid O_{t,i} = o, q)\, P(O_{t,i} = o \mid q)}.$$

We labeled a set of about 1000 images to specify whether or not each type of object was present. We estimated $P(O_{t,i} \mid Q_t)$ by counting (analogous to estimating the HMM transition matrix). We model the conditional likelihood using another mixture of spherical Gaussians. We chose the means for $p(v^G_t \mid O_{t,i} = 1, Q_t = q)$ from the set of all images labelled as location $q$ in which object $i$ was present. Similarly, we chose the means for $p(v^G_t \mid O_{t,i} = 0, Q_t = q)$ from the set of all images labelled as location $q$ in which object $i$ was absent. We chose the number of mixture components and the variance by cross-validation.
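A sketch of the priming computation under these modeling choices; `protos_pos[q][i]` and `protos_neg[q][i]` are assumed to hold prototype gist vectors for place q with and without object i, and `prior[q][i]` the counted estimate of P(O_i = 1 | Q = q).

```python
import numpy as np

def parzen(v, protos, sigma):
    # Uniform mixture of spherical Gaussians centered on the prototypes.
    if len(protos) == 0:
        return 1e-12                       # no examples: tiny constant likelihood
    d2 = ((np.asarray(protos) - v) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2)).mean()

def object_priming(vG, place_belief, protos_pos, protos_neg, prior, sigma):
    # P(O_{t,i} = 1 | v_{1:t}^G): Bayes' rule per place, averaged over the
    # place posterior produced by the HMM.
    n_places, n_objects = len(protos_pos), len(prior[0])
    post = np.zeros(n_objects)
    for q in range(n_places):
        for i in range(n_objects):
            num = parzen(vG, protos_pos[q][i], sigma) * prior[q][i]
            den = num + parzen(vG, protos_neg[q][i], sigma) * (1.0 - prior[q][i])
            post[i] += place_belief[q] * (num / max(den, 1e-300))
    return post
```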

Figure 8 shows the results of applying this procedure to the same test sequence as used in Figure 3. The system predicts the presence of 24 different kinds of objects quite accurately, without even looking at the local image features. Many of the errors are "inherited" from the place recognition system. For example, at one point the system believes it is in corridor 6a, and predicts objects such as desks and printers (which are visible in 6a); however, the system is actually in the floor 1 elevator lobby, where the only identifiable object is a red couch.

A more quantitative assessment of performance is provided in Figure 9, where we plot ROC (receiver operating characteristic) curves for 20 of the most frequent object classes. (These are computed by varying a threshold $\theta$, declaring an object to be present if $P(O_{t,i} = 1 \mid v^G_{1:t}) > \theta$, and then comparing the estimated positive frames with the ground truth. We did this for the same indoor-outdoor sequence as used in Figures 3 and 5.) The easiest objects to detect are things like buildings, which are almost always present in every outdoor scene (in this data set at least). The hardest objects are moving ones, such as people and cars, since they are only present in a given context for a small fraction of the time.

4.2 Contextual priors for object localization

In this section, we try to predict the location of an object within the image. We can use this information, e.g., to run an object detector only in the most probable image locations. One approach [15] is to model the expected location, $E(X_{t,i} \mid O_{t,i} = 1, v^G_t)$, using cluster-weighted regression (a variant on the mixtures-of-experts model). In this paper, we adopt a different approach: we divide the image into 80 non-overlapping patches, and represent the location using an 80-bit mask: $M_{t,i}(x) = 1$ iff any object of type $i$ overlaps image region $x$, where $x \in \{1, \dots, 80\}$. This provides a crude way of representing location and size/shape; a minimal sketch of the encoding follows.

Let $M_{t,i}$ be the whole image mask (an 80-dimensional bit vector). Since $E(M_{t,i}(x)) = P(M_{t,i}(x) = 1)$, we can summarize the distribution in terms of its marginals using the expected mask. This can be computed as follows:

$$E(M_{t,i} \mid v^G_{1:t}) = \sum_q P(Q_t = q \mid v^G_{1:t}) \sum_{o \in \{0,1\}} E(M_{t,i} \mid v^G_t, Q_t = q, O_{t,i} = o)\, P(O_{t,i} = o \mid v^G_t, Q_t = q),$$

where $P(O_{t,i} \mid v^G_t, Q_t)$ was computed by the object priming system discussed in Section 4.1.


Figure 8. Contextual priors for object detection. We have trained the system to predict the presence of 24 objects. Top: the predicted place, $P(Q_t \mid v^G_{1:t})$ (the same as Figure 3). Middle: the probability of each object being present, $P(O_{t,i} = 1 \mid v^G_{1:t})$. Bottom: ground truth; a black dot means the object was present in the image. We only show results for the frames that have ground truth.


Figure 9. ROC curves for the prediction of object presence in the image (panels: fire hydrant, bicycle, tree, car, building, road, streetlight, sky, stairs, person, sofa, screen, bookshelf, desk, chair, file cabinet, printer, water cooler, coffee machine, freezer). We plot hit rate vs. false alarm rate as we vary the threshold on $P(O_{t,i} = 1 \mid v^G_{1:t})$.

When the object is absent ($O_{t,i} = 0$), we have $E(M_{t,i} \mid v^G_t, Q_t, O_{t,i} = 0) = \vec{0}$. If the object is present ($O_{t,i} = 1$), the expected mask is given by

$$E(M_{t,i} \mid v^G_t, Q_t = q, O_{t,i} = 1) = \frac{\sum_M M\, p(M, v^G_t \mid Q_t = q, O_{t,i} = 1)}{p(v^G_t \mid Q_t = q, O_{t,i} = 1)}.$$

We adopt a product kernel density estimator to model the joint on $v^G_t$ and $M_{t,i}$:

$$p(M_{t,i}, v^G_t \mid Q_t = q, O_{t,i} = 1) = \frac{1}{K_{qi}} \sum_{k=1}^{K_{qi}} \delta(M_{t,i} - M^k_{qi})\, \mathcal{N}(v^G_t; v^k_{qi}, \sigma^2 I),$$

where $M^k_{qi}$ and $v^k_{qi}$ are prototype masks and feature vectors, and the Gaussian kernel is the same as used in the object priming system. Since the mask kernel is a delta function,² we have

$$\sum_M M\, p(M, v^G_t \mid Q_t = q, O_{t,i} = 1) = \frac{1}{K_{qi}} \sum_{k=1}^{K_{qi}} M^k_{qi}\, \mathcal{N}(v^G_t; v^k_{qi}, \sigma^2 I).$$

Putting it all together, we get the intuitive result that the expected mask is a set of weighted prototypes, $M^k_{qi}$:

$$E(M_{t,i} \mid v^G_{1:t}) = \sum_q \sum_k w^k_{qi}\, M^k_{qi},$$

where the weights are given by how similar the image $v^G_t$ is to previous ones associated with place $q$ and object $i$ (as encoded by the $v^k_{qi}$ prototypes), times the probability of being in place $q$ and seeing object $i$:

$$w^k_{qi} \propto \mathcal{N}(v^G_t; v^k_{qi}, \sigma^2 I)\, P(O_{t,i} = 1 \mid v^G_t, Q_t = q)\, P(Q_t = q \mid v^G_{1:t}).$$

We trained this model as follows. We manually created a set of about 1000 image masks by drawing polygons around the objects in each image. (The images were selected from the same training set as used in the previous sections.) We then randomly picked up to 20 prototypes $(M^k_{qi}, v^k_{qi})$ for each location $q$ and object $i$. A small testing set was created in the same way.
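The resulting estimator is easy to express as code; in this sketch, `protos[q]` is assumed to hold the sampled (mask, feature) prototype pairs for place q and a fixed object type i, and `presence[q]` the priming probability P(O_{t,i} = 1 | v^G_t, Q_t = q).

```python
import numpy as np

def expected_mask(vG, place_belief, presence, protos, sigma):
    # E(M_{t,i} | v_{1:t}^G): a weighted sum of stored prototype masks M^k_{qi}.
    mask = np.zeros(80)
    for q, p_q in enumerate(place_belief):
        pairs = protos[q]                          # list of (mask_k, feat_k)
        if not pairs:
            continue
        # Kernel weights: similarity of v_t^G to each stored prototype view.
        w = np.array([np.exp(-((f - vG) ** 2).sum() / (2.0 * sigma ** 2))
                      for _, f in pairs])
        if w.sum() == 0:
            continue
        w = (w / w.sum()) * p_q * presence[q]      # normalize within the place
        for (m_k, _), w_k in zip(pairs, w):
            mask += w_k * m_k
    return mask        # per-cell probability that object i overlaps the cell
```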

Some preliminary results are shown in Figure 10. The figure shows the probability of each object type appearing in each grid cell (the expected mask $E(M_{t,i} \mid v^G_{1:t})$), along with the ground truth segmentation.

² We could use kernels with better generalization properties than a delta function, for instance by using another representation for the mask $M$ instead of a bit vector. We can model the distribution of masks as $p(f(M))$, where $f$ is a one-to-one mapping; for instance, $f$ can be the function that converts a binary number to an integer. Then we can use a Gaussian kernel on $f(M)$.

Figure 10. Some results of object localization. The gray-level images represent the probability of the objects being present at each location; the black-and-white images represent the ground truth segmentation (gray indicates the object is absent). Images are ordered according to $P(O_{t,i} = 1 \mid v^G_{1:t})$. Outdoor example: building (.99), street (.93), tree (.87), sky (.84), car (.81), streetlight (.72), person (.66). Indoor examples: screen (.91), desk (.87), chair (.85), filecabinet (.75), freezer (.61), watercooler (.54), bookshelf (.44); and desk (.97), screen (.93), chair (.77), poster (.65), bookshelf (.56), sofa (.40), person (.38).



Figure 11. Summary of the context-based system. The dotted lines show how local features could be included in the model (not implemented in the system presented here).

In some cases, the corresponding ground truth image is blank (gray), indicating that the object does not appear, even though the system predicts that it might. Such false positives could easily be eliminated by checking the local features at those positions. Overall, we find the results encouraging, despite the small size of the training set.

5. Summary and Conclusions

We have shown how to exploit a holistic, low-dimensional representation of the image to perform robust place recognition, categorization of novel places, and object priming. A summary of the system is shown in Figure 11.

In current work, we are extending this system by combining the top-down prior provided by context with the results of a more standard bottom-up object recognition phase [5].

Acknowledgments

We thank Egon Pasztor (MERL) for implementing the image annotation tool, and Aude Oliva (MSU) for discussions. This work was sponsored by the Air Force under Air Force Contract F19628-00-C-0002, by the U.S. Department of the Interior-Fort Huachuca under DARPA/MARS contract DABT 63-00-C-1012, and by the Nippon Telegraph and Telephone Corporation as part of the NTT/MIT Collaboration Agreement. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the U.S. Government.

References

[1] Y. Bengio. Markovian models for sequential data. Neural Computing Surveys, 2:129-162, 1999.
[2] M. O. Franz, B. Scholkopf, H. A. Mallot, and H. H. Bulthoff. Where did I take that snapshot? Scene-based homing by image matching. Biological Cybernetics, 79:191-202, 1998.
[3] J. Kosecka, L. Zhou, P. Barber, and Z. Duric. Qualitative image-based localization in indoor environments. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[4] H. Murase and S. Nayar. Visual learning and recognition of 3-D objects from appearance. Intl. J. Computer Vision, 14:5-24, 1995.
[5] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: a graphical model relating features, objects and scenes. AI Memo AIM-2003, MIT AI Lab, 2003.
[6] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. Intl. J. Computer Vision, 42(3):145-175, 2001.
[7] C. Papageorgiou and T. Poggio. A trainable system for object detection. Intl. J. Computer Vision, 38(1):15-33, 2000.
[8] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Intl. J. Computer Vision, 40:49-71, 2000.
[9] B. Schiele and J. L. Crowley. Recognition without correspondence using multidimensional receptive field histograms. Intl. J. Computer Vision, 36(1):31-50, 2000.
[10] H. Schneiderman and T. Kanade. A statistical model for 3D object detection applied to faces and cars. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000.
[11] E. P. Simoncelli and W. T. Freeman. The steerable pyramid: a flexible architecture for multi-scale derivative computation. In 2nd IEEE Intl. Conf. on Image Processing, 1995.
[12] T. Starner, B. Schiele, and A. Pentland. Visual contextual awareness in wearable computing. In Intl. Symposium on Wearable Computing, pages 50-57, 1998.
[13] T. M. Strat and M. A. Fischler. Context-based vision: recognizing objects using information from both 2-D and 3-D imagery. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(10):1050-1065, 1991.
[14] A. Torralba and P. Sinha. Recognizing indoor scenes. Technical report, MIT AI Lab, 2001.
[15] A. Torralba and P. Sinha. Statistical context priming for object detection. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 763-770, 2001.
[16] P. Viola and M. Jones. Robust real-time object detection. Intl. J. Computer Vision, to appear, 2002.
[17] J. Wolf, W. Burgard, and H. Burkhardt. Robust vision-based localization for mobile robots using an image retrieval system based on invariant features. In Proc. IEEE Intl. Conf. on Robotics and Automation (ICRA), 2002.
