
Autonomous Robots 18, 59–80, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

Active Appearance-Based Robot Localization Using Stereo Vision

J.M. PORTA, J.J. VERBEEK AND B.J.A. KRÖSE
IAS Group, University of Amsterdam, Kruislaan 403, 1098SJ, Amsterdam, The Netherlands

[email protected]

[email protected]

[email protected]

Abstract. A vision-based robot localization system must be robust: able to keep track of the position of the robot at any time even if illumination conditions change and, in the extreme case of a failure, able to efficiently recover the correct position of the robot. With this objective in mind, we enhance the existing appearance-based robot localization framework in two directions by exploiting the use of a stereo camera mounted on a pan-and-tilt device. First, we move from the classical passive appearance-based localization framework to an active one where the robot sometimes executes actions with the only purpose of gaining information about its location in the environment. Along this line, we introduce an entropy-based criterion for action selection that can be efficiently evaluated in our probabilistic localization system. The execution of the actions selected using this criterion allows the robot to quickly find out its position in case it gets lost. Secondly, we introduce the use of depth maps obtained with the stereo cameras. The information provided by depth maps is less sensitive to changes of illumination than that provided by plain images. The main drawback of depth maps is that they include missing values: points for which it is not possible to reliably determine depth information. The presence of missing values makes Principal Component Analysis (the standard method used to compress images in the appearance-based framework) unfeasible. We describe a novel Expectation-Maximization algorithm to determine the principal components of a data set including missing values and we apply it to depth maps. The experiments we present show that the combination of the active localization with the use of depth maps gives an efficient and robust appearance-based robot localization system.

Keywords: localization, appearance-based modeling, active vision, depth maps, stereo vision

1. Introduction

One of the most basic abilities for a mobile robot is that of localization, i.e. to be able to determine its own position in the environment. To localize, the robot needs some kind of representation of the environment. In the literature on mobile robots, this representation of the environment comes basically in two flavors: explicit or implicit.

The explicit (or geometric) representations are based on maps of free space in the environment (Elfes, 1987; Horswill, 1998; Moravec, 1988) using, for instance, grid maps or polygonal-based representations, or are based on maps with locations of distinct observable objects, i.e., landmarks (Leonard and Durrant-Whyte, 1991). This approach relies on the assumption that geometric information (shape and position of obstacles, landmarks, etc.) can be extracted from the robot's sensor readings. However, the transformation from sensor readings to geometric information is, in general, complex and prone to errors, increasing the difficulty of the localization problem.

As a counterpart, the implicit (or appearance-based, when images are used as sensory input) representation of the environment has attracted a lot of attention recently because of its simplicity. This work derives from the seminal work on object recognition by Murase and Nayar (1995). In this paradigm, the environment is not modeled geometrically but as an appearance map that includes a collection of sensor readings obtained at known positions. The advantage of this representation is that the raw sensor readings obtained at a given moment can be directly compared with the observations in the appearance-based map. This approach has been applied to robot localization using range-based sensors (Burgard et al., 1996; Crowley et al., 1998a) and omnidirectional images (Jogan and Leonardis, 2000; Krose et al., 2002).

Figure 1. The robot Lino.

A comparison between the two families of localization methods using vision as sensory input can be found in Sim and Dudek (2003), showing that appearance-based methods are more robust to noise, occlusions and changes in illumination (when an edge detector is used to pre-process the images) than geometry-based methods. For this reason, and for its efficiency and simplicity, we decided to use this localization paradigm in Lino (Fig. 1), a service robot developed as a part of the European project "Ambience" (van Breemen et al., 2003). In this project, the robot is the personification of an intelligent environment: a digital environment that serves people, that is aware of their presence and context, and that is responsive to their commands, needs, habits, gestures and emotions. The robot can get information from the intelligent environment to make it available to the user and, the other way around, the user can ask for services from the digital environment in a natural way by means of the robot. Lino is equipped with a 'real' head with three degrees of freedom (left/right, up/down and approach/withdraw). The head has a dynamic mouth, eyes and eyebrows since this makes the interaction more attractive and also more natural.

The information about its location is fundamental for Lino. Modules such as navigation and environment awareness are based on information about the robot's position. If the location information is wrong, most of Lino's modules must be stopped until the robot finds out its position again. Thus, to achieve a smooth operation of the robot, the localization module must be as robust as possible. In the extreme case that the robot gets lost (i.e., is completely uncertain about its location), the localization system must recover the position of the robot as fast as possible so that the services offered by the robot are interrupted for as short a time as possible.

It is well known that appearance-based methods are quite brittle in dynamic environments since changes in the environment can make the matching of the current observation with the images in the appearance-based map difficult or even impossible. This is especially critical when localization is based on global features extracted, for instance, from omnidirectional images (Gaspar et al., 2000; Jogan and Leonardis, 2000; Krose et al., 2001). A possible solution to this problem is to extract local features such as corners, lines, uniform regions, etc. (Cobzas et al., 2003; Lamon et al., 2001). We can also obtain local features using a normal camera instead of an omnidirectional one. Due to the limited field of view of normal cameras, we can only get features in a reduced area of the environment, and modifications in the environment would only be relevant if the camera is pointing toward them. If this is the case, we can rotate the cameras to get (local) features in other orientations, hopefully not affected by the environment modifications. This is the solution we adopted in the Lino project. Moreover, in this paper, we present an entropy-based action evaluation criterion that we use to select the camera movement (rotations and possibly displacements of the robot) that is likely to provide better information on the robot's position. We show that, using this criterion, the robot can efficiently recover its location when the localization system is in the initial stages or when a failure forces the robot to re-localize. The active localization problem has been addressed by some authors before (Kaelbling et al., 1996; Kleinberg, 1994; Krose and Bunschoten, 1999) and different entropy-based criteria for action selection similar to the one we describe can be found in the literature for object recognition (Arbel and Ferrie, 1999), environment modeling (Sujan and Dubowsky, 2002) or even for robot localization (Burgard et al., 1997; Fox et al., 1998a; Davison, 1999; Maeda et al., 1997). The difference is that, in our case, the entropy-based evaluation can be computed more efficiently than in existing approaches.

The use of local features increases the robustness of the localization system against local changes in the environment. However, if global changes occur, the robustness of the localization system would decrease. One of the global changes that typically affect images (and, thus, the features derived from them) is a change in illumination. Thus, to achieve a robust system, we must deal appropriately with variable lighting conditions. One possible solution to this problem is to include in the appearance-based map images taken under different illumination settings. These sets of images can be obtained on-line or using rendering techniques on a geometric model of the environment. However, not all possible illumination setups can be devised when defining the appearance map and this clearly limits the usefulness of this approach. A more general solution is to pre-process the images to compensate for the effect of illumination on the extracted features. In this line, techniques such as histogram equalization or gradient filters have been used to obtain images that are, up to a given point, illumination-independent (Jogan et al., 2002; Murase and Nayar, 1994). In this paper, we introduce the use of depth maps for appearance-based localization. This is possible thanks to the stereo camera we mounted on Lino's head. We show that the information provided by stereo depth maps is less sensitive to changes of illumination than that provided by plain images, even when typical illumination compensation techniques are used. The advantage of this setup is that, in contrast to standard laser range finders, which provide a 1-D depth map, stereo vision determines a more informative 2-D depth map. Depth information has been previously used in many applications, including mapping techniques based on map-matching (Olson, 2000) and legged robot navigation (Chow and Chung, 2002), and it is also common in geometric-based localization approaches to increase the robustness of landmark identification (Cobzas and Zhang, 2001; Little et al., 1998; Murray and Little, 2000). In this paper, we introduce a method to use the same kind of information also to increase the robustness of localization but, in our case, within the appearance-based approach.

The two techniques we introduce in this paper (the use of active vision and disparity maps for appearance-based localization) are embedded in a probabilistic localization framework. We first review this basic localization framework (Section 2). Next, in Section 3, we introduce the entropy-based active vision mechanism. After this, in Section 4 we describe how it is possible to use disparity maps for appearance-based localization and how this fits in the general localization framework. Section 5 describes some experiments that validate our two contributions and, finally, in Section 6, we summarize our work and draw some conclusions from it.

2. Appearance-Based Robot Localization

In the following subsections, we introduce the three basic elements of the probabilistic localization system: the Markov localization model, the sensory model of the environment, and the auxiliary particle filter.

2.1. A Probabilistic Model for Robot Localization

Nowadays, the formalization of the robot localization problem in a probabilistic way has become a standard (Thrun et al., 2001). In our particular case, we assume that the orientation of the camera with respect to the robot is given with sufficient accuracy by the pan-tilt device and the absolute pose of the camera is considered as a stochastic (hidden) variable x. The localization method aims at improving the estimation of the pose x_t of the camera at time t taking into account the movements of the robot and its head {u_1, ..., u_t} and the observations of the environment taken by the robot {y_1, ..., y_t} up to that time. In our notation, the Markov process goes through the following sequence

x_0 →^{u_1} (x_1, y_1) →^{u_2} ... →^{u_t} (x_t, y_t).

So, we want to estimate the posterior p(x_t | {u_1, y_1, ..., u_t, y_t}). The Markov assumption states that this probability can be updated from the previous state probability p(x_{t−1}) taking into account only the last executed action u_t and the current observation y_t. Therefore, we only have to estimate p(x_t | u_t, y_t). Applying Bayes' rule we have that

p(x_t | u_t, y_t) ∝ p(y_t | x_t) p(x_t | u_t),   (1)

where the probability p(x_t | u_t) can be computed by propagating from p(x_{t−1} | u_{t−1}, y_{t−1}):

p(x_t | u_t) = ∫ p(x_t | u_t, x_{t−1}) p(x_{t−1} | u_{t−1}, y_{t−1}) dx_{t−1}.   (2)

Equations (1) and (2) define a recursive system to estimate the position of the camera/robot.
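As a concrete illustration of this recursion (ours, not part of the paper's system), the following Python sketch applies the prediction step of Eq. (2) and the correction step of Eq. (1) on a one-dimensional discretized state space; the deterministic action model and the Gaussian likelihood are placeholders chosen only to make the example self-contained.

import numpy as np

def predict(belief, action_model):
    # Eq. (2): p(x_t | u_t) = sum_{x'} p(x_t | u_t, x') p(x' | u_{t-1}, y_{t-1})
    return action_model.T @ belief

def correct(prior, likelihood):
    # Eq. (1): p(x_t | u_t, y_t) is proportional to p(y_t | x_t) p(x_t | u_t)
    posterior = likelihood * prior
    return posterior / posterior.sum()

n = 50
belief = np.full(n, 1.0 / n)                    # uniform p(x_0): global localization
action_model = np.zeros((n, n))                 # action_model[i, j] = p(x_t = j | u, x_{t-1} = i)
for i in range(n):
    action_model[i, min(i + 1, n - 1)] = 1.0    # the action shifts the robot one cell to the right

true_pos = 10
for t in range(5):
    true_pos = min(true_pos + 1, n - 1)
    belief = predict(belief, action_model)
    cells = np.arange(n)
    likelihood = np.exp(-0.5 * ((cells - true_pos) / 2.0) ** 2)   # placeholder p(y_t | x_t)
    belief = correct(belief, likelihood)

print("estimated cell:", int(belief.argmax()), "true cell:", true_pos)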

The probability p(x_t | u_t, x_{t−1}) for any couple of states and for any action is called the action model and it is assumed to be known (i.e., inferred from odometry). On the other hand, p(y_t | x_t) for any observation and state is the sensor model, which has to be defined for each particular problem (see Section 2.2).

If the localization system can deal with a uniform initial distribution p(x_0) then we have a system able to perform global localization. If the system requires p(x_0) to be peaked on the correct robot position, then we have a position tracking localization system that basically compensates for the incremental error in the robot's odometry (Thrun et al., 2001).

2.2. Sensor Model

An open problem in the framework just described is how to compute the sensor model p(y | x). The sensor model is the environment representation used by the robot to localize itself. As mentioned before, due to its simplicity, we advocate an appearance-based sensor model. Such a model must provide the probability of a given observation over the space of configurations of the robot according to the similarity of the current observation with those in the training set: the more similar the current observation is to a given training point, the larger the probability that the robot is close to that training point. A problem with images is their high dimensionality, resulting in large storage requirements and high computational demands. To alleviate this problem, Murase and Nayar (1995) proposed to compress images z (with size D) to low-dimensional feature vectors y (with size d) using a linear projection

y = W z. (3)

The d × D projection matrix W is obtained by Principal Component Analysis, PCA (Jolliffe, 2002), on a supervised training set T = {(x_i, z_i) | i ∈ [1, N]} including images z_i obtained at known states x_i. PCA (also known as the Karhunen-Loeve transform in the context of signal processing) is a commonly used data compression technique that aims at capturing the subset of dimensions that explain most of the variance of the data. Additionally, PCA provides the optimal (in the least squares sense) linear approximation to the given data set. PCA has been widely used in robotics and computer vision for tasks such as face recognition (Sirovich and Kirby, 1987), hand-print recognition (Murase et al., 1981), object modeling (Murase and Nayar, 1995) and, more recently, for state compression in controllers based on partially observable Markov decision processes (Roy and Gordon, 2003). The numerically most stable method to compute the PCA is using Singular Value Decomposition (SVD). SVD is a procedure that decomposes a D × N (D ≥ N) matrix Z as

Z = U S V^T,

with U the D × N orthonormal matrix of left singular vectors of Z, S an N × N diagonal matrix with the singular values, and V an N × N orthonormal matrix with the right singular vectors of Z. In our case, Z contains the images in the training set arranged in columns. A usual way to compute the SVD is by first determining V and S by diagonalizing Z^T Z,

Z^T Z = V S² V^T,

and, then, computing U as

U = Z V S^{−1}.

If the rows of Z are zero-mean, then Z Z^T is the covariance of the data and it can be rewritten as

Z Z^T = (U S V^T)(U S V^T)^T = U S V^T V S U^T = U S² U^T.

Thus, U and S² form a diagonalization of the covariance matrix: U contains the eigenvectors and S² the eigenvalues, which are proportional to the variance in the dimension spanned by each associated vector in U. We want the projection matrix W to capture as much variance as possible with as few dimensions as possible and, thus, we include in the rows of W the singular vectors in U associated with the d largest singular values. The ratio

v = ∑_{i=1}^{d} σ_i² / ∑_{i=1}^{N} σ_i²,

with σ_i the singular values sorted in descending order, is generally used as an indicator of the variance of the data captured by the d main singular vectors. In our applications, d is chosen so that v is about 0.75 which, in general, means a value of d below 15.
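The construction above can be summarized in a few lines of Python. The sketch below (our illustration, using synthetic data instead of real images) computes the SVD of the zero-mean image matrix, picks the smallest d whose variance ratio v reaches 0.75, and builds the projection matrix W of Eq. (3).

import numpy as np

rng = np.random.default_rng(0)
D, N = 300, 80                               # D pixels per image, N training images (synthetic)
Z = rng.normal(size=(D, N))                  # placeholder image matrix, images as columns
Z = Z - Z.mean(axis=1, keepdims=True)        # make the rows zero-mean

U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # Z = U S V^T

v = np.cumsum(s ** 2) / np.sum(s ** 2)       # captured variance ratio for each d
d = int(np.searchsorted(v, 0.75) + 1)        # smallest d with v >= 0.75

W = U[:, :d].T                               # d x D projection matrix
y = W @ Z[:, 0]                              # features of the first image, Eq. (3)
print("d =", d, "captured variance =", round(float(v[d - 1]), 3))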

A classical method to approximate the sensor model p(y | x) from a supervised training set is kernel smoothing (Pitt and Shephard, 1999; Wand and Jones, 1995; Krose et al., 2001). This method uses a multivariate kernel for each training point and, thus, each evaluation of the sensor model scales linearly with the size of the training set. This makes kernel smoothing inefficient for practical applications. Moreover, kernel smoothing only works well in low dimensions (e.g., less than five). If we compress the images via PCA to such a reduced set of features, we might lose lots of information useful for localization. For these reasons, as proposed by Vlassis et al. (2002), we use a mixture to approximate the sensor model, where p(y | x) is computed using only the J points in the training set that are closest to the current observation (after the corresponding PCA dimensionality reduction). Thus, we have that

p(y | x) = ∑_{j=1}^{J} λ_j φ(x | x_j),   (4)

with x_j the sorted set of nearest neighbors (i.e., the set of training points x_i with features y_i most similar to y, with ‖y − y_k‖ ≤ ‖y − y_l‖ if k < l), and φ a Gaussian with a standard deviation equal to half of the average distance between adjacent training points, independently computed for X, Y and the orientation. Finally, λ_j is a set of weights that favor nearest neighbors closer to y and that, in our implementation, is computed as

λ_j = 2(J − j + 1) / (J(J + 1)).

The determination of the sensor model for a given observation y can be performed with cost

O(Dd + J log(N)),

with D the number of pixels of image z, d the number of features extracted from each image, J the number of nearest neighbors used in the sensor model and N the size of the training set. The first factor of the cost, O(Dd), is due to the linear projection of the image to d features, and the second factor, O(J log(N)), is the determination of the nearest neighbors, provided the training set is organized in a KD-tree to speed up the identification of training points similar to the current observation. Once the sensor model parameters are determined, the evaluation for a given state scales with J, which is much smaller than the size of the training set N and which, as mentioned, would be the cost in case we used a kernel smoothing based sensor model.
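The following Python sketch (ours; the synthetic training data, the single kernel width and the KD-tree over feature space are simplifying assumptions) shows how the mixture of Eq. (4) can be evaluated with the cost discussed above: a KD-tree query retrieves the J nearest training observations and the weights λ_j favor the closest ones.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
N, d = 400, 5
train_x = rng.uniform(0, 8, size=(N, 3))     # training poses (X, Y, orientation), synthetic
train_y = rng.normal(size=(N, d))            # their compressed features

tree = cKDTree(train_y)                      # KD-tree over the feature vectors
J = 10
j = np.arange(1, J + 1)
lam = 2.0 * (J - j + 1) / (J * (J + 1))      # weights lambda_j
sigma = 0.4                                  # placeholder kernel width (the paper uses half the
                                             # average distance between adjacent training points)

def sensor_model(y, x):
    # Eq. (4): p(y | x) = sum_j lambda_j phi(x | x_j), with x_j the J nearest training points
    _, idx = tree.query(y, k=J)              # O(J log N) nearest-neighbor search
    diff = x - train_x[idx]
    phi = np.exp(-0.5 * np.sum((diff / sigma) ** 2, axis=1))
    return float(np.sum(lam * phi))

print(sensor_model(rng.normal(size=d), np.array([4.0, 2.0, 1.0])))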

2.3. The Auxiliary Particle Filter

The probabilistic framework presented before is a general formulation and, for each particular case, we have to devise how to represent and update the probability distribution on the state, p(x_t | u_t, y_t). In the update, the main problem we have to confront is how to compute the integral of Eq. (2).

If we assume p(x_t | u_t, y_t) to be a Gaussian, we can use a simple representation for the probabilities (its mean and its covariance matrix would be enough) and the probability update can be done using a Kalman filter (Leonard and Durrant-Whyte, 1991). Systems based on this type of representation are effective in preventing the error on the robot's location from growing without bound. Additionally, the simplicity of the model used in these systems allows for formal proofs of the convergence of the localization (and of the associated mapping process, if any). However, using a single Gaussian, it is not possible to track more than one hypothesis at the same time and, due to this, Kalman-based localization systems are unable to deal with the global localization problem.

Multi-Gaussian probability representations (Cox and Leonard, 1994; Arras et al., 2002; Jensfelt and Kristensen, 2001; Kristensen and Jensfelt, 2003) can track more than one hypothesis simultaneously, but they still rely on restrictive assumptions on the size of the motion error and the shape of the robot's position uncertainty.

Probabilistic occupancy grids (Burgard et al., 1996; Fox et al., 1999; Thrun et al., 1998; Roy and Gordon, 2003) and particle filters (Thrun et al., 2001; Vlassis et al., 2002) can represent distributions with arbitrary shapes and, thus, they can be used to solve the global localization problem. In the probabilistic occupancy grid framework, the area where the robot is expected to move is discretized in small cells and the system maintains the probability for the robot to be in each one of these cells. This approach is quite expensive from a computational point of view. In the particle filter framework, the robot position is estimated using a set of discrete samples, each one with an associated weight to represent its importance. In this way, computational resources are focused on the areas of the configuration space where the robot is more likely to be. Additionally, the computational cost can be adjusted to the available resources by initially defining more or fewer particles or by using techniques that adapt the size of the sample set on-line (Kwok et al., 2003).

More formally, in a particle filter, the continuous posterior p(x_{t−1} | u_{t−1}, y_{t−1}) is approximated by a set of I random samples, called particles, that are positioned at points x^i_{t−1} and have weights π^i_{t−1}. Thus, the posterior is

p(x_{t−1} | u_{t−1}, y_{t−1}) = ∑_{i=1}^{I} π^i_{t−1} δ(x_{t−1} | x^i_{t−1}),

where δ(x_{t−1} | x^i_{t−1}) represents the delta function centered at x^i_{t−1}. Each one of the particles can be seen as a possible hypothesis on the camera's pose. Therefore, the estimation of the camera/robot pose can be defined as a likely value associated with the set of particles (for instance, its weighted mean).

Using the above approach, the integration of Eq. (2) becomes discrete,

p(x_t | u_t) = ∑_{i=1}^{I} π^i_{t−1} p(x_t | u_t, x^i_{t−1}),   (5)

and Eq. (1) reads

p(x_t | u_t, y_t) ∝ p(y_t | x_t) ∑_{i=1}^{I} π^i_{t−1} p(x_t | u_t, x^i_{t−1}).

The central issue in the particle filter approach is how to obtain a set of particles (that is, a new set of states x^i_t and weights π^i_t) to approximate p(x_t | u_t, y_t) from the set of particles x^i_{t−1}, i ∈ [1, I], approximating p(x_{t−1} | u_{t−1}, y_{t−1}). The usual Sampling Importance Resampling (SIR) approach (Dellaert et al., 1999b; Isard and Blake, 1998) samples particles using the motion model p(x_t | u_t, x^i_{t−1}), then it assigns a new weight to each one of these particles proportional to the likelihood p(y_t | x_t) and, finally, it re-samples particles using these new weights in order to make all particle weights equal. The main problem of the SIR approach is that it requires lots of particles to converge when the likelihood p(y | x) is too peaked or when there is only a small overlap between the prior and the posterior likelihood. In our case, the sensor model defined in the previous section is not peaked at all; on the contrary, the Gaussian mixture of Eq. (4) is rather smooth. Thus, in a perfectly static environment, the SIR technique would provide good results. However, we expect our environment to be not completely static: occlusions, small displacements of objects in the environment, people walking around, etc. would affect the observation model. In the worst case, this results in severe outliers and, in the best case, this produces partially wrong matches (i.e., training observations close to the current one but not as close as they would be in case of non-noisy observations). In these circumstances, the SIR technique would require a large number of particles to achieve a robust localization, with the consequent increase in the execution time of the system.

Vlassis et al. (2002) propose a more efficient but still robust alternative: to use an auxiliary particle filter. In the auxiliary particle filter (Pitt and Shephard, 1999), the sampling problem is solved in an elegant way by inserting the likelihood inside the mixture

p(x_t | u_t, y_t) ∝ ∑_{i=1}^{I} π^i_{t−1} p(y_t | x_t) p(x_t | u_t, x^i_{t−1}).

Then p(x_t | u_t, y_t) can be regarded as a mixture of the I transition components p(x_t | u_t, x^i_{t−1}) with weights π^i_{t−1} p(y_t | x_t). Therefore, sampling from p(x_t | u_t, y_t) can be achieved by just selecting one of the components j with probability π^j_{t−1} p(y_t | x_t) and then sampling from the corresponding component p(x_t | u_t, x^j_{t−1}). Sampling is performed in the intersection of the prior and the likelihood and, consequently, particles with larger prior and larger likelihood (even if this likelihood is small in absolute value) are more likely to be used to re-estimate the position of the robot.

Observe that the state x_t involved in p(y_t | x_t) is unknown at the moment the sampling is performed (it is exactly the state we are trying to approximate). Instead of x_t, a likely value associated with the i-th transition component is used, for instance its mean µ^i_t:

p(x_t | u_t, y_t) ∝ ∑_{i=1}^{I} π^i_{t−1} p(y_t | µ^i_t) p(x_t | u_t, x^i_{t−1}).   (6)

After the set of states for the new particles is obtained using the above procedure, we have to define their weights. This is done using

π^j_t ∝ p(y_t | x^j_t) / p(y_t | µ^{i_j}_t),

where i_j is the transition component from which particle j has been sampled.
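A minimal Python sketch of this auxiliary update is given below (our own code, with placeholder one-dimensional motion and sensor models): transition components are selected with probability proportional to π^i_{t−1} p(y_t | µ^i_t), new states are drawn from the selected components, and the weights are corrected by the ratio of the likelihood at the sampled state to the likelihood at the component mean.

import numpy as np

rng = np.random.default_rng(2)

def apf_update(particles, weights, u, y, motion_model, sensor_model):
    # particles: (I, dim) states x^i_{t-1}; weights: normalized pi^i_{t-1}
    I = len(particles)
    mu = motion_model(particles, u, noise=False)            # mu^i_t of each transition component
    first = weights * sensor_model(y, mu)                   # pi^i_{t-1} p(y_t | mu^i_t)
    first = first / first.sum()
    idx = rng.choice(I, size=I, p=first)                    # select components as in Eq. (6)
    new_particles = motion_model(particles[idx], u, noise=True)   # sample p(x_t | u_t, x^{i_j}_{t-1})
    new_weights = sensor_model(y, new_particles) / sensor_model(y, mu[idx])   # weight correction
    return new_particles, new_weights / new_weights.sum()

def motion_model(x, u, noise):                              # placeholder 1-D action model
    return x + u + (rng.normal(0, 0.1, size=x.shape) if noise else 0.0)

def sensor_model(y, x):                                     # placeholder 1-D observation likelihood
    return np.exp(-0.5 * ((x[:, 0] - y) / 0.5) ** 2) + 1e-12

particles = rng.uniform(0, 10, size=(200, 1))
weights = np.full(200, 1 / 200)
particles, weights = apf_update(particles, weights, u=0.5, y=4.0,
                                motion_model=motion_model, sensor_model=sensor_model)
print("pose estimate:", float((weights * particles[:, 0]).sum()))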

The whole filter update is linear in the number of particles and the number of nearest neighbors used in the sensor model. With this, the total cost of each localization step is

O(Dd + J log(N) + I J),

with D the number of pixels of the images, d the number of features for each image, J the number of nearest neighbors used in the sensor model, N the size of the training set and I the number of particles. One of the most efficient localization systems described in the literature is FastSLAM, introduced by Montemerlo et al. (2003). This system performs concurrent map building and localization, while in our approach the map is assumed as given. However, if we take into account the cost of the localization part (and the data association, which is equivalent to our sensor model definition) but not the cost associated with the mapping, FastSLAM scales linearly with the number of particles and logarithmically with the size of the map, which is roughly the cost of our system.

Despite the robustness introduced by the auxiliary particle filter, we cannot discard catastrophic errors in the localization: occlusions or temporary modifications of the environment can lead to wrong matches defining the sensor model for quite a long period, in which odometry is the only source of information that can be used to keep track of the robot's position. The consequence is a possible divergence between the correct robot position and that provided by our localization system. A similar situation occurs if the robot is kidnapped. A kidnap is any displacement that violates the action model: wheel slippage produced while the robot is trying to advance but is colliding with an obstacle, displacements of the robot performed by the user, etc. Our localization system must include a mechanism to detect and correct those error situations. Many approaches to this problem can be found in the literature. For instance, Menegatti et al. (2003) and Lenser and Veloso (2000) propose to sample a portion of the particles directly from the posterior likelihood. Thrun et al. (2000) introduce the Mixture-CML, where normal sampling is used in conjunction with a dual sampling where particles are drawn from the sensor model and then odometry is used to assess their compliance. Finally, Terwijn et al. (2004) describe a particle filter in which the state is extended with information about coherence in the sensor model along time to cope in a unified way with non-Markovian noise in the environment and with kidnap situations. In the system described in this paper, we use a simple but effective strategy consisting in re-sampling all particles when a mismatch between the prior and the likelihood is detected for a while. The agreement between the prior and the likelihood is measured as the sum of the likelihood for the particles as in Eq. (6),

∑_{i=1}^{I} π^i_{t−1} p(y_t | µ^i_t) p(x_t | u_t, x^i_{t−1}).

If this sum is below a user-defined threshold (10^{−6} in our implementation), the observation is considered an outlier and it is not used in the particle update (thus, only the motion model is used). If this situation is repeated for many consecutive time slices (10 in our tests), the particles are re-sampled from scratch using the sensor model of the first available observation. To ensure a fast re-identification of the correct robot position, the active vision procedure described in the next section is triggered.
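The mismatch-detection logic can be summarized as in the sketch below (our own Python code; the class and function names are illustrative, the thresholds are the values quoted above, and the agreement is approximated here by the particle-weighted sum of the observation likelihoods).

import numpy as np

OUTLIER_THRESHOLD = 1e-6      # threshold on the prior/likelihood agreement (value used in the paper)
MAX_OUTLIERS = 10             # consecutive outliers before re-sampling (value used in the paper)

class MismatchDetector:
    def __init__(self):
        self.consecutive_outliers = 0

    def observation_is_outlier(self, weights, likelihoods):
        # approximate agreement: sum_i pi^i_{t-1} p(y_t | mu^i_t)
        return float(np.sum(weights * likelihoods)) < OUTLIER_THRESHOLD

    def step(self, weights, likelihoods):
        # returns True when the particles should be re-sampled from the sensor model
        # and the active re-localization procedure should be triggered
        if self.observation_is_outlier(weights, likelihoods):
            self.consecutive_outliers += 1
        else:
            self.consecutive_outliers = 0
        return self.consecutive_outliers >= MAX_OUTLIERS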

The existence of an error recovery mechanism and the fact that we never sample uniformly over the whole configuration space allow us to reduce the number of particles, increasing the performance of the system. However, a trade-off must be reached since the use of too few particles would result in a constant activation of the active re-localization mechanism and in a continuous interruption of the normal behavior of the robot.

3. Active Localization

In general, the localization model outlined in the previous section is implemented as a background process: the localization module keeps track of the position of the robot using the observations obtained along the robot's path decided by a navigation module. However, when the robot is uncertain about its position it makes little sense to continue with the navigation task and it is more advisable to first reduce the uncertainty in the robot position and then continue with the navigation. Therefore, in some cases, the robot's actions have to be determined by localization criteria and not by navigation-related ones. The question in that case is how to determine the appropriateness of actions as far as localization is concerned. Next, we describe a general criterion to compute this appropriateness for each action and how to implement this criterion in our localization framework.

3.1. Entropy-Based Action Selection

We assume that, at a given moment, the robot can execute a discrete set of actions {u_1, ..., u_n}. The usefulness of one of these actions u, as far as localization is concerned, can be determined examining the probability p(x_{t+1} | u, y_{t+1}). An ideal action would allow the robot to find out its position without any doubt, that is, would produce a probability p(x_{t+1} | u, y_{t+1}) with a single peak centered at the correct position of the camera. Such a probability distribution would have a very low entropy H(u, y_{t+1}), defined as

H(u, y_{t+1}) = − ∫ p(x_{t+1} | u, y_{t+1}) log p(x_{t+1} | u, y_{t+1}) dx_{t+1}.


To compute the entropy of a given action u, we integrate over all possible observations:

H(u) = ∫ H(u, y_{t+1}) p(y_{t+1} | u) dy_{t+1}.

Following the reasoning presented in Fox et al. (1998a), the posterior involved in H(u, y_{t+1}) can be written as

p(x_{t+1} | u, y_{t+1}) = p(y_{t+1} | x_{t+1}) p(x_{t+1} | u) / p(y_{t+1} | u),

and, consequently, the entropy for a given action becomes

H(u) = − ∫∫ p(y_{t+1} | x_{t+1}) p(x_{t+1} | u) log [ p(y_{t+1} | x_{t+1}) p(x_{t+1} | u) / p(y_{t+1} | u) ] dx_{t+1} dy_{t+1}.

(7)

At any moment, the best action u to be executed next (as far as localization is concerned) is the one with the lowest entropy H(u) since, the lower H(u), the more informative the action u is likely to be.

3.2. Implementation

The localization framework presented in Section 2 allows us to efficiently implement the action-selection formalism just described. The basic idea is to exploit the particle filter and the appearance-based training set to approximate the double integral of Eq. (7). We have to discretize the distribution on the camera poses p(x_{t+1} | u) after executing action u and the distribution on the observations the robot will make when it reaches its new placement, p(y_{t+1} | x_{t+1}). Additionally, we have to approximate the probability of each observation y after action u, p(y_{t+1} | u).

First, we discretize the probability p(x_{t+1} | u). Using Eq. (5) we have that

p(x_{t+1} | u) = ∑_{i=1}^{I} π^i_t p(x_{t+1} | u, x^i_t).

In the absence of any other information (i.e., new observations), the probability of the pose of the camera p(x_{t+1} | u) after the execution of action u can be approximated by sampling the action model p(x_t | u_t, x_{t−1}) for each one of the particles x^i_{t−1} approximating the current pose of the camera. We denote the state of the particle resulting from the sampling process on each p(x_t | u_t, x^i_{t−1}) as x^i_t(u). Using this new set of particles, we have that

p(x_{t+1} | u) = ∑_{i=1}^{I} π^i_t δ(x_{t+1} | x^i_t(u)).

Often, the above resampling is just a shift and a blur of the particles representing p(x_{t+1} | u, x^i_t). In the particular case in which we only move the head of the robot, the re-sampling process becomes a simple shift of the particles x^i_t since head movements do not add error to the position of the camera. Another particular case is that of actions that do not affect the pose of the camera/robot (for instance, movements around the tilt degree of freedom of the head). In this case we have that

x^i_t(u) = x^i_t,

and thus, particles at time t, x^i_t, can be used directly to approximate p(x_{t+1} | u), without any modification. Using the above we can re-write Eq. (7) as

H(u) ≈ − ∫ ∑_{i=1}^{I} π^i_t p(y_{t+1} | x^i_t(u)) log [ π^i_t p(y_{t+1} | x^i_t(u)) / p(y_{t+1} | u) ] dy_{t+1},

where the integral over dx_{t+1} vanishes. Now, we have to discretize the integration over the observations.

The set of states x^i_t(u) is a sample of the possible poses of the camera after executing action u. The set of features observed at each one of these poses can be inferred using the training set: the observation for each position x^i_t(u) would be similar to the observation y obtained at the training point x that is as close as possible to x^i_t(u). We take the set of views Y_u obtained in this way from all states x^i_t(u) as a representative sample of the likely observations after executing action u.

If the training set is sampled on a uniform grid over the configuration space of the robot, finding the closest training point to a given state x^i_t(u) is straightforward and can be done in constant time. If this is not the case, a KD-tree structure can be used to perform this search in logarithmic time in the number of training points.

With the above, we achieve a discretization of the possible observations. Now, for each one of the observations y ∈ Y_u we have to define p(y | x^i_t(u)). This can be done, as in Section 2.2, using a nearest-neighbor approach. So,

p(y | x^i_t(u)) = ∑_{j=1}^{J} λ_j φ(x^i_t(u) | x_j),   (8)

for x_j the J training points with observations most similar to y.

Observe that we only compute Eq. (8) for sets of features y stored in the training set. Consequently, the process of finding the nearest neighbors to y, x_j, and the corresponding weights, λ_j, can be pre-computed for all the observations in the training set, saving a large amount of time in the on-line execution of the entropy evaluation algorithm.

Finally, to complete the approximation of Eq. (7), we define the probability of observation y after action u (y ∈ Y_u) as

p(y | u) = ∑_{k=1}^{K} π^{i_k}_t,   (9)

with {i_1, ..., i_K} the set of particles that advocate for observation y. In a situation where particles are spread all over the configuration space of the robot, each particle is likely to propose a different observation y. However, in cases where particles concentrate in a few clusters, many particles can propose the same observation to be included into Y_u and the weights of all these coincident particles are used to define p(y | u) for that observation.

With the above approximations, the entropy-based evaluation of an action u becomes

H(u) ≈ − ∑_{y∈Y_u} ∑_{i=1}^{I} [ π^i_t ∑_{j=1}^{J} λ_j φ(x^i_t(u) | x_j) log ( π^i_t ∑_{j=1}^{J} λ_j φ(x^i_t(u) | x_j) / ∑_{k=1}^{K} π^{i_k}_t ) ].   (10)

The algorithm to evaluate this equation is shown in Table 1. In this algorithm, we identify the most informative action for localization purposes, u*, out of a set of candidate actions U. We compute the entropy h for each action (lines 14–19, Eq. (10)) and we select the one with the lowest entropy, h* (lines 20–23). To compute h we need to guess the possible observations after the action execution (the set Y_u that is initialized in line 3 and updated in line 10). For each possible observation y ∈ Y_u, we compute the probability of actually observing y (lines 7 and 9, Eq. (9)).

Table 1. Algorithm for entropy-based action selection.

Input:
  A set of candidate actions U.
  The current set of particles {(π^i_t, x^i_t) | i ∈ [1, I]}.
  The training set T = {(x_i, y_i) | i ∈ [1, N]}.
Output:
  The most informative action u*.

 1: h* ← ∞
 2: For each u ∈ U
 3:   Y_u ← ∅
 4:   For each particle (π^i_t, x^i_t)
 5:     (x, y) ∈ T with minimum ‖x − x^i_t(u)‖
 6:     If y ∈ Y_u then
 7:       p(y | u) ← p(y | u) + π^i_t
 8:     else
 9:       p(y | u) ← π^i_t
10:       Y_u ← Y_u ∪ {y}
11:     Endif
12:   Endfor
13:   h ← 0
14:   For each y ∈ Y_u
15:     For each particle (π^i_t, x^i_t)
16:       g ← π^i_t ∑_{j=1}^{J} λ_j φ(x^i_t(u) | x_j)
17:       h ← h − g log(g / p(y | u))
18:     Endfor
19:   Endfor
20:   If h < h* then
21:     h* ← h
22:     u* ← u
23:   Endif
24: Endfor
25: Propose u* for execution
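The following Python transcription of Table 1 is ours and only a sketch: the nearest-training-point lookup and the observation likelihood of Eq. (8) are passed in as placeholder functions, and observations are assumed to be returned as hashable identifiers (e.g., training-set indices).

import math
from collections import defaultdict

def select_action(actions, particles, predict, closest_training_obs, obs_likelihood):
    # particles: list of (pi_i, x_i); predict(x_i, u) -> x_i(u)
    # closest_training_obs(x) -> hashable id of the observation at the nearest training point
    # obs_likelihood(y, x) -> p(y | x) as in Eq. (8)
    best_u, best_h = None, math.inf                       # line 1
    for u in actions:                                     # line 2
        moved = [predict(x, u) for (_, x) in particles]
        p_y = defaultdict(float)                          # p(y | u), Eq. (9)
        for (pi, _), xu in zip(particles, moved):         # lines 4-12
            y = closest_training_obs(xu)                  # line 5
            p_y[y] += pi                                  # lines 6-10
        h = 0.0                                           # line 13
        for y in p_y:                                     # lines 14-19
            for (pi, _), xu in zip(particles, moved):
                g = pi * obs_likelihood(y, xu)            # line 16
                if g > 0:
                    h -= g * math.log(g / p_y[y])         # line 17
        if h < best_h:                                    # lines 20-23
            best_h, best_u = h, u
    return best_u                                         # line 25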

The cost of this algorithm is O(U I² J), with U the number of actions considered, I the number of particles, and J the number of nearest neighbors used to compute the sensor model. To speed up this procedure, we can replace the point x^i_t(u) by its closest point in the training set. In this way, Eq. (8) can be fully pre-computed and the cost reduces to O(U I²). In any case, the use of the particles to approximate the entropy provides a lower cost than other approaches that discretize the whole configuration space of the robot (Fox et al., 1998a) or the whole field of view of the cameras (Sujan and Dubowsky, 2002).

The only point that remains to be decided is when to use the action selection procedure just described. The particle filter allows us to devise a simple criterion for that, since particles not only estimate the robot's position but also provide an estimate of the localization error: the variance of the particle centers. Thus, when this variance grows above a given threshold, we consider that the robot is lost, we stop the robot services based on the robot's position and we trigger the action selection procedure to find out the action that should be executed next.
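A minimal sketch of this trigger (ours; the threshold is a placeholder, not a value reported in the paper) is:

import numpy as np

VARIANCE_THRESHOLD = 1.0   # placeholder value; tuned per environment in practice

def robot_is_lost(particles, weights):
    # weighted variance of the particle centers as an estimate of the localization error
    mean = np.average(particles, axis=0, weights=weights)
    var = np.average((particles - mean) ** 2, axis=0, weights=weights)
    return bool(var.sum() > VARIANCE_THRESHOLD)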

4. Localization Using Sparse Disparity Maps

Besides the active vision strategy, the other enhancement to the classical appearance-based localization that can be made thanks to the use of a pan-and-tilt device with a stereo camera is to use disparity (i.e., depth) maps for localization. Next, we introduce how disparity maps are defined, how to extract features from them, and how to use these features in our localization framework.

4.1. Disparity Maps

We can determine a depth map by matching points in images taken by a pair of calibrated cameras mounted on the robot. Given a single image I, the three-dimensional location of any visible object point Q must lie on the line that passes through the center of projection of the camera c and the image of the object point p (see Fig. 2). The determination of the intersection of two such lines generated from two independent images is called triangulation and provides the 3-D position of Q w.r.t. the cameras. For a review of scene reconstruction from images taken from different views we refer to Faugeras and Luong (2001).

To apply triangulation, for each pixel p in one of the images I we have to search for a corresponding point in the other image I′. Usually, the correspondence is established by comparing areas around pixel p with areas around each candidate pixel p′. Epipolar geometry allows us to reduce the search space to a small segment on I′. The pixels p and p′ with the most similar surroundings are assumed to correspond to different projections of the same point Q in the scene.

Figure 2. Elements in disparity computation.

If the image planes of the two cameras are co-planar, the distance r from the scene point Q to the cameras can be computed as

r = b f / (d − d′),

where b is the baseline (distance between the two viewpoints), f is the focal length of the cameras, d is the horizontal distance from the projected point p to the center of one of the images, and d′ is the same for the other image. The difference d − d′ is called the disparity. With constant baseline b and focal length f, the disparity is inversely proportional to the depth of point Q. For this reason, instead of working with depth maps it is enough to use disparity maps.
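A small Python helper (ours, with illustrative camera parameters) makes the relation explicit:

def depth_from_disparity(baseline_m, focal_px, d_px, d_prime_px):
    # r = b f / (d - d'); returns None when the disparity is missing or non-positive
    disparity = d_px - d_prime_px
    if disparity <= 0:
        return None
    return baseline_m * focal_px / disparity

# e.g., a 9 cm baseline and a 400-pixel focal length (illustrative values only)
print(depth_from_disparity(0.09, 400.0, 52.0, 40.0))   # -> 3.0 meters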

The stereo algorithm we use (Konolige, 1997) applies many filters in the process of determining the disparity map, both to speed up the process and to ensure the quality of the results. For instance, if the area around pixel p is not textured enough it would be very difficult to find a single corresponding point p′ (we are likely to find many points p′ with almost the same probability of being the corresponding point of p). For this reason, pixels in low-textured areas are not even considered in the matching process. The result of this and other filtering processes is a sparse disparity map: a disparity map where many pixels do not have a disparity value (see Fig. 3). This makes the use of standard PCA to determine the projection matrix (see Section 2.2) unfeasible and we have to use more elaborate techniques such as the Expectation-Maximization (EM) algorithm introduced in the following sections (for a general overview of the EM algorithm see Appendix I and the references therein).


Figure 3. Plain image (left) and the corresponding disparity map (right). In the disparity map, light gray areas are missing values and dark areas correspond to points that are far from the robot.

4.2. Principal Component Analysis of Data Sets with Missing Values via EM

Given a D × N matrix Z, representing a set of disparity maps Z = {z_1, ..., z_N}, our aim is to find a projection that maps Z onto its d-dimensional principal component space (d ≪ D). For each disparity map z_i, we have a subset of elements that are observed, z_{i,o}, and a set of missing values, z_{i,h}. Thus, we want to define the probability

p(y_i | z_{i,o}),

with y_i the set of principal component features derived from the observed values in disparity map z_i. We do so by first defining p(y) as a Gaussian with zero mean and identity covariance, and p(z | y) = N(z; C^T y, σ² I). The projection density p(y_i | z_{i,o}) then follows a Gaussian distribution N(y_i; µ_{y_i}, Σ_{y_i}), where Σ_{y_i}^{−1} = I + C_o C_o^T / σ² and µ_{y_i} = Σ_{y_i} C_o z_o / σ².

Our model is described by two parameters, C and σ, which are the ones we want to estimate via EM. Here, C is the d × D projection matrix from the space of disparity maps to the space of principal components and C_o contains only the columns of C that correspond to the observed elements in a given map z_i. Observe that C_o is different for each disparity map and hence so is the covariance matrix Σ_{y_i}.

To find the maximum likelihood projection matrix C via EM, we have to define the E and M steps of the algorithm (see Appendix I). This can be done paying attention only to the observed values z_o, taking the projected features as the hidden variables, and not making inference about the missing pixels. However, the EM algorithm can be made more efficient by estimating y and z_h at the same time.

In the E step the KL divergence to be minimized is

KL = ∫ q(y, z_h) log [ q(y, z_h) / p(z_h, y | z_o; C, σ) ].

In order to obtain an efficient E step we factor q over y and z_h. Thus,

KL = ∫ q(y) [ log ( q(y) / p(y | z_o; C, σ) ) + ∫ q(z_h) log ( q(z_h) / p(z_h | y; C, σ) ) ].

Using free-form optimization, we have

q(z_h) ∝ exp ∫ q(y) log p(z_h | y; C, σ),

q(y) ∝ p(y | z_o; C, σ) exp ∫ q(z_h) log p(z_h | y; C, σ).

Both of these densities are Gaussian. The covariance of q(z_h) is given by σ² I and the mean for the missing values of disparity map i is given by z_{i,h} = C_h^T y_i. The covariance and the means (collected as columns in Y) of the q(y) for the disparity maps are given respectively by

Σ_y = [I + σ^{−2} C C^T]^{−1},

Y = σ^{−2} Σ_y C Z,

where Z is the data matrix with missing values filled in with the new z_{i,h}.

Observe that the covariance matrix Σ_y is common to all data points when computing the set of projections Y. Therefore, we only have to perform one matrix inversion and not one per data point.


With respect to the M step, the relevant part of the likelihood function to be optimized is

ℓ_M = ∫_{y, z_h} q(y, z_h) log p(z_o, z_h, y; C, σ).

Analytical manipulation leads to the effective objective to be maximized,

ℓ_M = N D log σ² + σ^{−2} [ ∑_{i=1}^{N} ‖z_i − C^T y_i‖² + Tr{C^T Σ_y C} ] + σ^{−2} D_h σ²_old,

with D_h the total amount of missing values in Z and σ_old the previous value of σ. From the above, we get the simple updates

C = Z Y^T (N Σ_y + Y Y^T)^{−1},

σ² = (1 / (N D)) [ N Tr{C^T Σ_y C} + ∑_{i=1}^{N} ‖z_i − C^T y_i‖² + D_h σ²_old ].

The E and M steps have to be iterated while there is a large (relative) change in the lower bound of the log-likelihood function that, after the updates, reads

ℓ(C, σ) = − (N/2) (D log σ² + Tr{Σ_y} − log |Σ_y|) − (1/2) Tr{Y Y^T} + (1/2) D_h log σ²_old.   (11)

A reasonable initialization of C would be the one containing d randomly selected disparity maps (with zeros in the missing values) and σ² equal to the initial reconstruction error.
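The sketch below (our own Python code on synthetic data, not the authors' Matlab implementation) iterates the E and M updates derived above; for dimensional convenience the variable Ct holds C^T (the D × d loading matrix), the missing entries are refilled with the current reconstruction in every E step, and the initial σ² is an arbitrary placeholder rather than the initial reconstruction error.

import numpy as np

rng = np.random.default_rng(3)
D, N, d = 200, 60, 5
Z = rng.normal(size=(D, d)) @ rng.normal(size=(d, N)) + 0.1 * rng.normal(size=(D, N))
Z[rng.random(size=Z.shape) < 0.3] = np.nan      # ~30% missing values, as in a sparse disparity map
obs = ~np.isnan(Z)

Ct = np.nan_to_num(Z[:, :d].copy())             # init: d data vectors with zeros at missing entries
sigma2 = 1.0                                    # placeholder initial value
Zf = np.nan_to_num(Z.copy())                    # data with missing values filled in
for it in range(50):
    # E step
    Sigma_y = np.linalg.inv(np.eye(d) + Ct.T @ Ct / sigma2)
    Y = (Sigma_y @ Ct.T @ Zf) / sigma2          # posterior means of the features (columns)
    Zf = np.where(obs, Zf, Ct @ Y)              # z_{i,h} = (C^T y_i)_h for the missing entries
    # M step
    Ct = Zf @ Y.T @ np.linalg.inv(N * Sigma_y + Y @ Y.T)
    err = np.sum((Zf - Ct @ Y) ** 2)
    Dh = int(np.sum(~obs))
    sigma2 = (N * np.trace(Ct @ Sigma_y @ Ct.T) + err + Dh * sigma2) / (N * D)

print("final sigma^2:", sigma2)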

Each iteration of the EM steps scales with order O(dDN), assuming d < N. Observe that the EM is applied off-line on a training set and, thus, this cost does not have to be taken into account in the on-line localization phase. Our approach to finding the maximum likelihood C and σ has the advantage that the guess of the missing values comes essentially for free, while in other approaches they were obtained by extending the E step with an optimization process in the space of images for every image separately (Roweis, 1998). This is an important saving, especially when working with data sets such as sparse disparity maps where each point z contains a large proportion of missing values.

A Matlab version of the EM algorithm just describedcan be downloaded from Verbeek (2002).

Once C and σ are determined, we can compute on-line the set of features y corresponding to a new disparity map z as

y = (σ² I + C_o C_o^T)^{−1} C_o z_o.   (12)

The matrix (σ² I + C_o C_o^T)^{−1} C_o maps from (observed) disparity values to features and plays the same role as the matrix W in Eq. (3). For vanishing σ, (σ² I + C_o C_o^T)^{−1} C_o is the pseudo-inverse of C_o. The most expensive step in the evaluation of Eq. (12) is the product C_o C_o^T, which is O(D d²). Because of this, the on-line determination of the features is only feasible for relatively small d (and moderate D).
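A short Python sketch of Eq. (12) (ours; the learned parameters are replaced by random stand-ins) restricts C to the observed pixels and solves the small d × d system instead of forming an explicit inverse:

import numpy as np

def project_disparity(z, Ct, sigma2):
    # Eq. (12): y = (sigma^2 I + C_o C_o^T)^{-1} C_o z_o, using only the observed pixels
    # Ct holds C^T (D x d); its rows at the observed pixels give C_o^T
    o = ~np.isnan(z)
    Co = Ct[o].T                                          # d x |o|
    d = Co.shape[0]
    return np.linalg.solve(sigma2 * np.eye(d) + Co @ Co.T, Co @ z[o])

rng = np.random.default_rng(4)
D, d = 200, 5
Ct = rng.normal(size=(D, d))                              # stand-in for the learned C^T
z = rng.normal(size=D)
z[rng.random(D) < 0.4] = np.nan                           # simulated sparse disparity map
print(project_disparity(z, Ct, sigma2=0.05))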

4.3. Sensor Fusion

Once we have a way to obtain features from disparity maps, the question remains of how to combine this information with that obtained from intensity to define a unified sensor model.

In ideal conditions (no sensor noise, static environment) the two types of information are redundant, or correlated, since the intensity image and the disparity map are taken at the same position and at the same time. Thus, in this ideal case we have that

p(y_i, y_d | x) ≈ p(y_i | x) ≈ p(y_d | x).

If the correlation between the two sources of information is high, only the information provided by one of the sensor modalities, for instance intensity, needs to be used in the particle update. In this case, the information provided by disparity just reinforces the information given by the intensity images. On the other hand, if the correlation is low, this indicates that there is a mismatch between the two sources of information. In this situation, it would be hard to decide which one of the sources is more reliable. In the case of changes in illumination, intensity is likely to be an outlier and, in general, it is better to use disparity. If the environment has been locally modified (some objects moved, people standing in front of the robot, etc.) intensity tends to be more reliable than disparity. In the worst case, none of the sensor modalities would be correct.


Fortunately, we do not need to identify which situation we are in since the particle filter does it for us. As mentioned, the particle filter can deal with outliers and it only uses the sensory hypotheses that are most consistent over time. So, instead of trying to select the correct sensory hypothesis a priori, we just take all of them into account in the particle filter update and the most coherent ones are automatically used. Thus, we define the global sensor model as a linear opinion pool (Genest and Zidek, 1986; Clemen and Winkler, 1999)

p(y_i, y_d | x) = w p(y_i | x) + (1 − w) p(y_d | x),   (13)

that, using Eq. (4), reads

p(y_d, y_i | x) = w ∑_{j=1}^{J} λ_j φ(x | x_j) + (1 − w) ∑_{k=1}^{K} λ_k φ(x | x_k),

with w ∈ [0, 1] and x_j and x_k the closest training points to the features of the current intensity and disparity images, respectively. The weight w can be used to balance the importance of the information obtained with each type of sensor. If both information sources are assumed to be equally reliable, we can set w = 1/2. The linear pool combination of two probability distributions is a widely used technique. It is one of the simplest methods to combine different judgments but, as suggested by some empirical studies (Clemen, 1989), simpler methods perform better than more complex ones in practice.
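As a minimal sketch (ours), the fusion of Eq. (13) is a one-line combination of the two modality-specific sensor models, each evaluated as in Eq. (4):

def fused_sensor_model(x, p_intensity, p_disparity, w=0.5):
    # Eq. (13): linear opinion pool of the intensity- and disparity-based sensor models
    # p_intensity(x) and p_disparity(x) are callables implementing Eq. (4) for each modality
    return w * p_intensity(x) + (1.0 - w) * p_disparity(x)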

The expression in Eq. (13) for the sensor model is used both in the update of the particles (see Eq. (6) in Section 2.3) and in the active vision strategy described in Section 3 (Eq. (8)).

Observe that, using Eq. (13), if the two sensors arehighly correlated, we have that p(yd | x) ≈ p(yi | x)and, as desired, p(yi , yd | x) ≈ p(yi | x). If the sensormodels are uncorrelated and only p(yi | x) or p(yd | x)is correct, the particle filter will work without problem:the outliers are not used and a constant scale factor inthe correct part of the likelihood does not affect theupdate procedure detailed in Section 2.3. This workseven if the information provided by intensity and thatfrom disparity are partially correct, but none of them istotally right. If both intensity and disparity are outliers,the robot’s pose is updated using only odometry andevidence is accumulated on a possible failure of thelocalization system (caused, for instance, by a kidnap).

A failure is more likely when both sensor readings are outliers but coherent with each other.
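To make the pooled update concrete, the sketch below (our own function and variable names, not code from the paper) re-weights a particle set with the linear opinion pool of Eq. (13), given the per-particle likelihoods p(y_i | x) and p(y_d | x) evaluated with the two sensor models:

```python
import numpy as np

def opinion_pool_likelihood(lik_intensity, lik_disparity, w=0.5):
    """Linear opinion pool of Eq. (13): w*p(yi|x) + (1-w)*p(yd|x),
    evaluated element-wise for every particle."""
    lik_intensity = np.asarray(lik_intensity, dtype=float)
    lik_disparity = np.asarray(lik_disparity, dtype=float)
    return w * lik_intensity + (1.0 - w) * lik_disparity

def update_particle_weights(weights, lik_intensity, lik_disparity, w=0.5):
    """Re-weight the particles with the pooled likelihood and renormalize."""
    pooled = opinion_pool_likelihood(lik_intensity, lik_disparity, w)
    new_w = weights * pooled
    total = new_w.sum()
    # If both modalities assign near-zero likelihood everywhere, keep the
    # previous weights; this is the kind of evidence the text interprets
    # as a possible localization failure.
    if total <= 1e-12:
        return weights
    return new_w / total
```

The default w = 0.5 mirrors the equal-reliability choice discussed above; any other fixed value in [0, 1] simply biases the pool toward one modality.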

5. Experiments and Results

In this section, we describe the experiments we performed to validate our contributions in active localization and in localization using disparity maps. We also report on the global performance of our localization system including the two enhancements presented in this paper.

5.1. Experiments on Active Localization

The objective of the first experiment was to investigate the relation between the localization error and the action selection criterion introduced in Section 3. We tested the localization system in an office environment. We mapped a room of 800 × 250 cm with obstacles, taking images in the free space every 75 cm and every 15 degrees. This makes a total of about 400 training images. In the experiments, we compress the images using PCA, keeping 5 features that preserve up to 70% of the variance of the original images (see Section 2.2), we use 10 nearest neighbors to approximate the sensor model p(y | x), and we define the initial distribution p(x_0) as uniform over the configuration space of the robot.

We tested the system by placing the robot at positions not included in the training set, rotating the camera, and measuring the error as ‖c − a‖, with c the correct position and a the position estimated by the particle filter (the weighted sum of the particle centers, see Section 2.3).

An example of the difference between images at a testing point and images at the closest training point is shown in Fig. 4: although the main elements of the scene are similar (the wall with the poster, etc.), items on the desk changed between the moment the training set was collected and the moment the testing was performed. At this X-Y point, the differences are not as large for other camera orientations, and this is why rotating the camera helps to improve the localization of the robot.

Figure 4. An image taken at a testing position (left) and the image taken at the closest training point (right).

We considered 22 different actions (i.e., different orientations for the camera) and we used up to 150 particles to approximate p(x_t | u_t, y_t). With these figures, the entropy-based action evaluation for all actions takes less than 0.2 seconds on a Pentium 4 at 1.8 GHz.
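The evaluation criterion itself is Eq. (8) in Section 3, which we do not reproduce here. As a rough illustration of the kind of computation that fits in this 0.2-second budget, the sketch below scores each candidate camera orientation by the expected entropy of the particle weights after a hypothetical update; the discrete set of candidate observations and all names are our own assumptions, not the paper's code:

```python
import numpy as np

def entropy(w):
    """Shannon entropy of a normalized weight vector."""
    w = w[w > 0]
    return -np.sum(w * np.log(w))

def expected_entropy(weights, obs_likelihoods):
    """Expected entropy of the particle weights after an action.

    obs_likelihoods: array of shape (n_candidate_obs, n_particles); row o
    holds p(y_o | x_s, a) for every particle x_s.  The predictive probability
    of observing y_o is approximated by sum_s weights[s] * p(y_o | x_s, a).
    """
    exp_h, total = 0.0, 0.0
    for lik in obs_likelihoods:
        p_obs = float(np.dot(weights, lik))
        if p_obs <= 0.0:
            continue
        posterior = weights * lik / p_obs      # hypothetical Bayes update
        exp_h += p_obs * entropy(posterior)
        total += p_obs
    return exp_h / total if total > 0 else entropy(weights)

def select_action(weights, lik_per_action):
    """Pick the camera orientation whose hypothetical update minimizes
    the expected entropy of the belief."""
    scores = [expected_entropy(weights, lik) for lik in lik_per_action]
    return int(np.argmin(scores))
```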

Figure 5 shows the decrease in the average positioning error as new actions are issued, compared with the error decrease obtained when actions are selected at random. The results correspond to the average (and the standard deviation) over ten runs, placing the camera at different testing positions. We can see that the entropy-based action selection reduces the localization error faster than random action selection. However, the difference between the two strategies depends strongly on the environment: in highly aliased environments the entropy-based strategy performs much better than the random one, whereas in non-ambiguous environments the two perform almost the same. If we consider the estimate a to be correct when the closest training point to a coincides with the closest training point to the correct position c, the success ratio in localization after 3 camera movements is over 95%.

Figure 5. Evolution of the average error (and the standard deviation) w.r.t. the correct position as we get new images.

5.2. Experiments on Localization Using Disparity Maps

To test the sensitivity of the features obtained from disparity maps to changes in illumination, we acquired three sets of images in a 900 × 500 cm environment under three different lighting conditions: using tube lights, using bulb lights, and using natural light (opening the curtains of the windows placed all along one wall of the lab). For each illumination setup, we collected images every 50 cm (both along X and Y) and every 10 degrees. This makes a total of about 4000 images per illumination setup. We used the set of images obtained with tube lights to determine a projection matrix W with 20 projection vectors (that retain more than 80% of the variation of the training set). The two other sets of images (the one obtained with bulb lights and the one with natural light) were used to assess the effect of the illumination conditions on the features obtained using W. These two test sets introduce changes both in the global intensity of the images and in the distribution of light sources in the scene (which is the situation encountered in real applications).

Each disparity map has D = 160 × 120 = 19200 pixels, but about 20% of the pixels in the disparity maps are undefined. We use N = 100 randomly sampled images to compute the d = 20 principal components of the training set, applying the algorithm described in Section 4.2.

The principal components of the set of disparity maps are computed in 15 iterations of our EM-based algorithm with a convergence threshold (relative change of two consecutive maxima of the function Φ, Eq. (11)) of 0.0001. The execution of this process takes 150 seconds on a Pentium 4 at 1.8 GHz. The on-line computation of the features for a given disparity image involves inverting a 20 × 20 matrix (see Section 4.2), but it takes less than 0.1 seconds. Therefore, the use of disparity maps for real-time appearance-based localization is completely feasible.
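The feature computation itself is detailed in Section 4.2 and is not reproduced here. A least-squares projection restricted to the observed pixels is one way to realize it that is consistent with the 20 × 20 matrix inversion mentioned above; the sketch below illustrates this reading only (the explicit mean-centering and all names are our assumptions):

```python
import numpy as np

def features_from_sparse_map(x, W, mean, observed):
    """Project a disparity map with missing values onto d principal directions.

    x        : (D,) disparity map (missing entries may hold any value)
    W        : (D, d) matrix of principal directions
    mean     : (D,) mean disparity map estimated during training
    observed : (D,) boolean mask, True where disparity is defined

    The features are the least-squares solution restricted to the observed
    pixels, which only requires solving a d x d system (d = 20 here).
    """
    Wo = W[observed]                      # (n_obs, d) rows for observed pixels
    xo = x[observed] - mean[observed]     # centered, observed pixels only
    A = Wo.T @ Wo                         # (d, d)
    b = Wo.T @ xo                         # (d,)
    return np.linalg.solve(A, b)          # feature vector y in R^d
```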

We now need to determine which modality (disparity or intensity) is best for localization. The ideal input for localization would be one with maximal sensitivity w.r.t. translations and rotations, but with minimal sensitivity to changes in illumination. We measure the sensitivity of the features to these three factors as

$s_{a,b} = \frac{\|y_a - y_b\|}{\|y_a\|},$

with y_a and y_b the sets of features corresponding to images a and b, respectively. To assess the sensitivity of the features to translations, we evaluated the average of s_{a,b} for each pair of images (a, b) taken with the same orientation and the same lighting conditions but at adjacent positions (50 cm apart). Considering only the difference in features for positions that are close to each other gives us an estimate of the minimum change (i.e., the minimum sensitivity) in the features due to translations. For rotations, we computed the average of s_{a,b} for each pair of images (a, b) taken at the same point and with the same illumination but with adjacent orientations (10 degrees apart). Finally, to measure the sensitivity to illumination conditions, we computed the average of s_{a,b} with a an image taken under the training illumination (i.e., using tube lights) and b the image taken at the same position and orientation but with a different lighting setup.
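As a small illustration (the helper names are ours), the averages and standard deviations reported below can be obtained by evaluating s_{a,b} over the appropriate sets of feature pairs:

```python
import numpy as np

def sensitivity(ya, yb):
    """Relative feature change s_{a,b} = ||ya - yb|| / ||ya||."""
    ya, yb = np.asarray(ya, float), np.asarray(yb, float)
    return np.linalg.norm(ya - yb) / np.linalg.norm(ya)

def mean_sensitivity(feature_pairs):
    """Mean and standard deviation of s_{a,b} over a list of (ya, yb) pairs,
    e.g. adjacent positions, adjacent orientations, or the same pose under
    training vs. test illumination."""
    s = np.array([sensitivity(ya, yb) for ya, yb in feature_pairs])
    return s.mean(), s.std()
```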

We computed these measures for the features obtained from plain images, for images processed with two techniques commonly used to deal with illumination-related problems, namely histogram equalization (Gonzalez and Woods, 1992; Murase and Nayar, 1994) and a gradient-based filter (the Sobel filter (Jogan et al., 2002; Sobel, 1990)), and, finally, for the features obtained from disparity maps.

Table 2 shows the results we obtained for the experiment just described. We can see that plain images provide the features that are least sensitive to translations/rotations and most sensitive to illumination changes. In the case of features obtained from disparity maps, the sensitivity to changes in illumination is the lowest one. Actually, this is the only case in which the sensitivity to illumination conditions is lower than that due to translations. This means that, in principle, disparity is the best of the image processing techniques we compare as far as independence of illumination is concerned. However, we observe that the standard deviation of the measures corresponding to disparity maps is, in general, larger than that of the other techniques. This is because the features obtained from disparity maps are sometimes quite affected by noise. Due to these sporadic noisy readings, the features obtained from disparity maps are usually, but not always, the best ones. For this reason, in our localization system disparity maps are not used as the only source of information, but as a complement to intensity images.

Table 2. Mean (and standard deviation in parentheses) of the relative change of the features using different image processing techniques and in different illumination conditions.

                        Sensitivity to
Image process           Translations    Rotations      Light changes
Plain images            0.42 (0.26)     0.95 (0.54)    1.57 (0.52)
Hist. equalization      0.63 (0.27)     1.40 (0.43)    1.10 (0.31)
Gradient filter         0.75 (0.28)     1.26 (0.42)    0.91 (0.21)
Disparity maps          0.85 (0.45)     1.37 (0.57)    0.56 (0.31)

To assess the contribution of disparity maps to the definition of the sensor model, we moved the robot along a pre-defined path under the three different illumination conditions mentioned above: tube lights, bulb lights and natural light. At regular distances along the test path, we took an image and computed the corresponding sensor model using the J training points with features most similar to those of the just-obtained image (see Section 2.2). The closer the training points used to define the sensor model are to the actual position of the robot, the better the sensor model and, thus, the better the update of the robot's position estimate.

Figure 6 shows the average (and the standard deviation) of the error in the sensor model, defined as

$e = \min_{j \in [1, J]} \|r - n_j\|,$

with r the pose of the robot at the test position and n_j the training points used to define the sensor model (which are different for each test position). An error in the range [25, 50] is quite reasonable, taking into account that the distance between training points in the X and Y dimensions is 50 cm.

Figure 6. Error in the sensor model in three different illumination conditions, using only intensity images (dotted line), only disparity maps (dashed line), and combining intensity and disparity images (solid line).

We repeated the test in three cases: (a) using only histogram-normalized intensity images (dotted line in Fig. 6), (b) using only disparity maps (dashed line in the figure), and (c) combining normalized images and disparity maps (solid line). In the first and second tests we use J = 10 (quite small compared with the total number of training points), and in the third case we use J = 5 for both intensity and disparity (so we also use 10 training points in the definition of the sensor model).

We can see that the use of disparity maps alone results in a reduction of the error w.r.t. using only normalized intensity images. However, when we combine the two types of information, the results are even better. As mentioned before, disparity maps are occasionally affected by noise, and it is in these noisy cases that the combination of normalized intensity images and disparity pays off. Consequently, we can conclude that the use of features computed from disparity maps combined with those from intensity provides a better sensor model and, thus, helps to obtain a more robust localization system.
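For reference, a minimal sketch of the error measure used in Fig. 6, assuming the distance is taken over the pose coordinates stored for the training points (names are ours):

```python
import numpy as np

def sensor_model_error(robot_pose, selected_training_poses):
    """Error of the sensor model at a test position: distance from the true
    robot pose r to the closest of the J training points used to define the
    sensor model, e = min_j ||r - n_j||."""
    r = np.asarray(robot_pose, dtype=float)
    n = np.asarray(selected_training_poses, dtype=float)   # shape (J, pose_dim)
    return float(np.min(np.linalg.norm(n - r, axis=1)))
```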

5.3. Global Performance

The last experiment we performed was a navigation exercise to analyze the combined operation of the active localization strategy and the use of disparity maps for localization. We instructed the robot to move along a closed circuit with 6 passing points (the numbered crosses in Fig. 7).

Initially (point A in Fig. 7), the robot has no information about its location and it assumes a uniform distribution over its configuration space. This automatically triggers the active vision mechanism described in Section 3. After rotating the head in three different directions, the robot finds out its position with good accuracy and starts to move along the desired path. We consider that the robot knows its location well enough when the variance of the set of particles is below a given threshold. The trajectory estimated by our localization system is shown as a solid line in Fig. 7. The black dots close to positions 1, 2, 5, and 6 show the real position of the robot when our system indicates that the robot is exactly in the middle of the corresponding cross. The difference between the correct robot position and the estimated one at these check points is always below 25 cm (the size of the dashed squares around the passing points is 40 cm). The dashed line in Fig. 7 is the position according to odometry, initialized using the robot position determined by our localization system at point A. The path determined using odometry clearly diverges from the correct robot position, even in the short trajectory shown in the figure. At the points marked as B in Fig. 7 the obstacle avoidance mechanisms are triggered, and this produces alterations in the path to the next passing points (2 and 4, respectively).

In Fig. 8, we can see the effect of a robot kidnap. A kidnap consists in lifting the robot and displacing it to a new pose (i.e., a new position, a new orientation, or both). By lifting the robot, the odometry information (on which the action model is based) is invalidated. Only global localization systems are able to overcome a robot kidnap (relative localization systems cannot deal with discontinuities in the robot trajectory). In the experiment depicted in Fig. 8 (which is a continuation of the experiment in Fig. 7), we lifted and rotated the robot 90 degrees anti-clockwise at point A. After the kidnap the robot moves toward point B, but the information provided by odometry erroneously indicates that the robot is going toward C. As described in Section 2.3, our localization system implements a simple kidnap detection mechanism that re-samples all particles using the sensor model for the current observation if there is no agreement between the position estimated by the particle filter p(x_t | u_t, y_t) and that provided by the sensor model p(y_t | x_t) for some time steps. In the example, the re-sampling is done at point B. After the re-sampling, the uncertainty on the robot position is very large and the active localization mechanism is triggered. Again, after 3 movements of the robot head, the position is determined with good accuracy and the robot moves to passing point 1 (its objective at the moment it was kidnapped). As before, the small black dots close to points 1 and 2 indicate the correct robot position when the localization system indicates that the robot is right on the passing points. The fact that the robot passes very close to points 1 and 2 means that the position estimated at point B is correct and that it is properly maintained over time.
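As a minimal sketch of the bookkeeping such a kidnap detector implies (the agreement test and both thresholds are illustrative choices of ours, not values from the paper), one can count consecutive steps in which the sensor model assigns low likelihood to the filter estimate and trigger a full re-sample when the count exceeds a limit:

```python
class KidnapMonitor:
    """Trigger a full re-sample from the sensor model when the particle
    estimate and the sensor model disagree for several consecutive steps."""

    def __init__(self, disagreement_steps=5, likelihood_threshold=1e-3):
        self.disagreement_steps = disagreement_steps    # illustrative value
        self.likelihood_threshold = likelihood_threshold  # illustrative value
        self.counter = 0

    def step(self, estimate_likelihood):
        """estimate_likelihood: p(y_t | x_t) evaluated at the filter estimate.
        Returns True when the particles should be re-sampled from the sensor
        model of the current observation."""
        if estimate_likelihood < self.likelihood_threshold:
            self.counter += 1
        else:
            self.counter = 0
        if self.counter >= self.disagreement_steps:
            self.counter = 0
            return True
        return False
```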

Tests including changes of illumination (switching on/off some lights while the robot is moving) show that, thanks to the use of disparity maps, the robot is able to keep a good estimate of its position and, consequently, the navigation task can be achieved without problems. When only intensity images are used, the sensor model is not always correct and the navigation is constantly stopped to perform active localization actions that, in many cases, converge to wrong positions (and, thus, the active localization is triggered again after a few moments, and so on).

Figure 7. Trajectory of the robot as estimated by our localization system (solid line) and the trajectory according to odometry (dashed line). Shadowed boxes are obstacles. The black dots indicate the real position of the robot when our system indicates the robot is exactly on the nearest cross.

Figure 8. Trajectory of the robot as estimated by our localization system after a kidnap at point A.

We let the robot move over the six passing points for more than 30 iterations, observing a stable behavior. A few changes in the environment (displacement/removal of some obstacles, people walking around or in front of the robot, etc.) do not have repercussions on the localization performance as long as the cameras eventually point in a direction where those changes are small. More drastic changes in the environment would prevent our system from converging to the correct robot location.

Videos of the experiments reported here can be downloaded from our web pages (http://www.science.uva.nl/research/ias).

Experiments in larger environments (a 12 × 25 meter area on the second floor of our building in Amsterdam, a 12 × 19 meter area at the Philips Research Lab in Eindhoven, and the Philips Home Lab, which is about 10 × 13 meters) always showed a similar localization precision and stability. As an example, the localization is accurate enough for Lino to autonomously dock in a charging station using position as the only feedback. This can only be achieved if the localization error is below ±25 cm and ±5 degrees, since the passive mechanism of the re-charging device cannot accommodate errors above these thresholds.

6. Conclusions

In this paper, we have introduced two extensions that make the traditional appearance-based robot localization framework more efficient and robust: the active selection of robot actions to speed up convergence to the correct location, and the use of disparity maps for localization to deal with illumination changes. These two extensions are possible thanks to the use of a stereo camera mounted on a pan-and-tilt device.

We have described an active vision strategy that uses an entropy-based action evaluation criterion that can be computed efficiently within our localization system. The experiments we report show that this mechanism effectively helps to find out the location of the robot. This is mainly useful in dynamic environments, where our previous passive localization system exhibited some problems. In the examples described in this paper, we only applied the action evaluation to head rotations in the pan degree of freedom, but the algorithm we have introduced can be applied without modifications to evaluate any possible movement of the robot. Note that the existence of an efficient re-localization mechanism allows us to reduce the number of particles used by the filter, also reducing the execution time per time slice.

Our second contribution is the use of sparse disparity maps to increase the robustness of appearance-based localization to changes in illumination. To compress sparse disparity maps into a reduced set of features, we have to deal with the problem of missing values. We have presented a novel EM-based algorithm to extract the principal components of a data set that is more efficient in the way it deals with missing values than previously existing methods. After the dimensionality reduction, there remains the open problem of how to merge the information provided by disparity maps with the information provided by intensity images. We have proposed to use a linear opinion pool to combine the models built separately for each type of sensor, since the particle filter update automatically takes care of filtering out the outliers.

The results we have presented show that disparity maps provide features that are less sensitive to changes in the lighting conditions than features obtained with other techniques: histogram equalization and gradient-based filters. These techniques work well when we have changes in the global illumination, but they do not deal properly with different distributions of light sources. On the other hand, disparity maps are more consistent over changes in the number and position of the light sources. This is because only reliable correspondences are taken into account when defining the disparity map, and those reliable matches are likely to be detected in almost all lighting conditions.

We have shown that, using features from disparity maps in addition to those obtained from intensity images, we can improve the quality of the sensor model when illumination conditions differ from those in which the training set was obtained. Thus, disparity maps are a good option to increase the robustness of appearance-based robot localization. The good results achieved using disparity maps come at the cost of more complex hardware (we need not one camera but two calibrated ones) and software (the disparity computation is quite a complex process). However, our experiments show that this extra cost is small enough to allow real-time localization. Additionally, the cost of each localization step in the presented system scales logarithmically with the size of the map, which matches the most efficient localization systems existing nowadays (Montemerlo et al., 2003).

There are a number of issues we have to consider in our future work. The main drawback of appearance-based methods is that localization is only possible in previously mapped areas. The construction of a map is a supervised process that can be quite time-consuming. This problem limits the applicability of appearance-based localization to environments smaller than those tackled by other localization methods (Dellaert et al., 1999a; Montemerlo et al., 2002). However, there are many possibilities to scale the method to larger environments. One is to use hybrid representations, that is, for instance, to use a geometric representation to obtain an appearance model by means of simulation (Fox et al., 1998a; Fox et al., 1998b; Thrun et al., 2001). We can also exploit the continuity properties of the appearance manifold (Crowley and Pourraz, 2001) to approximate it using just a few real samples. Another possibility we already explored is to generate artificial training points from a 3D reconstruction of the environment generated from a reduced set of images (Bunschoten and Krose, 2003; Crowley et al., 1998b). The possibility we are currently working on is to let the robot learn the appearance-based map by itself (Porta and Krose, 2004). We are developing a concurrent mapping and localization (CML) system based on the appearance-based framework rather than on the landmark-based one (as is usually done). Observe that the localization method presented in this paper can be used independently of whether the map is obtained in a supervised or in an automatic way; thus, the work presented here is used without modifications in our new CML system. This new system would allow us to combine the good features of appearance-based localization without having to deal with its limitations.

Appendix I: The EM Algorithm

The EM algorithm (Dempster et al., 1977; Minka, 1998) aims at determining the maximum likelihood model for a given set of data Z. Of course, not all possible models are considered, but only those inside a family of models parameterized by a set of parameters, collectively denoted as θ. The optimal model is determined by maximizing the likelihood or, equivalently, the log-likelihood¹

$L(\theta) = \log p(Z; \theta).$

In many cases, L can be defined using an auxiliary set of unobserved or hidden variables h. In this case, we have

$L(\theta) = \log \int_h p(Z, h; \theta).$

The function L might be maximized using a gradient ascent process. However, the EM algorithm provides a simple-to-implement and step-size-free alternative based on iterative lower-bound maximization.

For a given set of parameters θ, the EM algorithm first finds a lower bound Φ of L, such that Φ touches L at the current parameters θ. After that, the lower bound Φ is maximized with respect to the parameters θ. The definition of Φ for a fixed set of parameters θ is the E step of the algorithm, and the maximization of the resulting Φ is called the M step. The sequential iteration of E and M steps from an arbitrary initial set of parameters θ is guaranteed to find a (local) maximum of L.

The lower bound Φ used in the E step can be defined as

$\Phi(\theta) = L(\theta) - \mathrm{KL}(q(h) \,\|\, p(h \mid Z; \theta)) \le L(\theta),$

where KL denotes the (non-negative) Kullback-Leibler divergence between two probability distributions. To make Φ(θ) = L(θ) for the current parameters θ, the KL divergence should be zero. The KL divergence is zero iff the two compared probability distributions are equal. Thus, in the E step we set q(h) = p(h | Z; θ).

In the M step, we have to maximize Φ. We can rewrite Φ as

$\Phi(\theta) = L(\theta) - \int_h q(h) \log \frac{q(h)}{p(h \mid Z; \theta)}$
$= -E_{q(h)}\left[ \log \frac{q(h)}{p(h \mid Z; \theta)} \right] + \log p(Z; \theta)$
$= E_{q(h)}\left[ \log \frac{p(Z, h; \theta)}{q(h)} \right]$
$= \int_h q(h) \log p(Z, h; \theta) - q(h) \log q(h).$

The relevant term of Φ to be maximized is

$\Phi_M = \int_h q(h) \log p(Z, h; \theta),$

since the other term of Φ does not depend on θ.

For a more detailed presentation of the EM algorithm, see Minka (1998), which we followed in the above brief description.
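To make the alternation concrete, the following generic sketch (our own structure; the callables are placeholders rather than the PCA-specific updates of Section 4.2) iterates the E and M steps until the relative change of the bound Φ between consecutive maximizations falls below a threshold, as done in the experiments of Section 5.2:

```python
def run_em(e_step, m_step, bound, theta, rel_tol=1e-4, max_iter=100):
    """Generic EM iteration.

    e_step(theta)   -> q      : posterior over hidden variables p(h | Z; theta)
    m_step(q)       -> theta  : parameters maximizing the bound for fixed q
    bound(q, theta) -> float  : value of the lower bound Phi
    rel_tol is the relative-change convergence threshold (0.0001 in the
    disparity-map experiments reported above).
    """
    prev = None
    for _ in range(max_iter):
        q = e_step(theta)          # E step: make the bound touch the likelihood
        theta = m_step(q)          # M step: maximize the bound in theta
        current = bound(q, theta)
        if prev is not None and abs(current - prev) <= rel_tol * abs(prev):
            break
        prev = current
    return theta
```

The loop structure is independent of the model family; only the three callables change when the algorithm is specialized, for instance, to PCA with missing values.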

Acknowledgments

This work has been partially supported by the European (ITEA) project "Ambience: Context Aware Environments for Ambient Services."

We would like to thank Bas Terwijn for helping us to perform the experiments reported in this paper. We would also like to express our gratitude to the anonymous reviewers of this paper, whose thorough work greatly helped to improve its quality.

Note

1. The log is used since it makes the resulting expressions simpler.

References

Arbel, T. and Ferrie, F.P. 1999. Entropy-based gaze planning. In Proceedings of the Second IEEE Workshop on Perception for Mobile Agents, in Association with the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Fort Collins, Colorado.

Arras, K.O., Castellanos, J.A., and Siegwart, R. 2002. Feature-based multi-hypothesis localization and tracking for mobile robots using geometric constraints. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1371–1377.

Bunschoten, R. and Krose, B. 2003. Robust scene reconstruction from an omnidirectional vision system. IEEE Transactions on Robotics and Automation, 19(2):351–357.

Burgard, W., Fox, D., Hennig, D., and Schmidt, T. 1996. Estimating the absolute position of a mobile robot using position probability grids. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 896–901.

Burgard, W., Fox, D., and Thrun, S. 1997. Active mobile robot localization. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI).

Chow, Y.H. and Chung, R. 2002. VisionBug: A hexapod robot controlled by stereo cameras. Autonomous Robots, 13:259–276.

Clemen, R.T. 1989. Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5:559–583.

Clemen, R.T. and Winkler, R.L. 1999. Combining probability distributions from experts in risk analysis. Risk Analysis, 19(2):187–203.

Cobzas, D. and Zhang, H. 2001. Mobile robot localization using planar patches and a stereo panoramic model. In Proceedings of the Vision Interface Annual Conference, Ottawa, Canada, pp. 94–99.

Cobzas, D., Zhang, H., and Jagersand, M. 2003. Image-based localization with depth-enhanced image map. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 1570–1575.

Cox, I.J. and Leonard, J.J. 1994. Modeling a dynamic environment using a multiple hypothesis approach. Artificial Intelligence, 66(2):311–344.

Crowley, J., Wallner, F., and Schiele, B. 1998. Position estimation using principal components of range data. Robotics and Autonomous Systems, 23(4):267–276.

Crowley, J., Wallner, F., and Schiele, B. 1998. Position estimation using principal components of range data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3121–3128.

Crowley, J.L. and Pourraz, F. 2001. Continuity properties of the appearance manifold for mobile robot position estimation. Image and Vision Computing, 19:741–752.

Davison, A.J. 1999. Mobile robot navigation using active vision. PhD thesis, Robotics Research Group, University of Oxford.

Dellaert, F., Burgard, W., Fox, D., and Thrun, S. 1999. Using the CONDENSATION algorithm for robust, vision-based mobile robot localization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).

Dellaert, F., Fox, D., Burgard, W., and Thrun, S. 1999. Monte-Carlo localization for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1322–1328.

Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, Series B, 39(1):1–38.

Elfes, A. 1987. Sonar-based real-world mapping and navigation. IEEE Journal of Robotics and Automation, 3(3):249–265.

Faugeras, O. and Luong, Q.-T. 2001. The geometry of multiple images. MIT Press.

Fox, D., Burgard, W., and Thrun, S. 1998. Active Markov localization for mobile robots. Robotics and Autonomous Systems, 25:195–207.

Fox, D., Burgard, W., and Thrun, S. 1999. Markov localization for mobile robots in dynamic environments. Journal of Artificial Intelligence Research, 11:391–427.

Fox, D., Burgard, W., Thrun, S., and Cremers, A.B. 1998. Position estimation for mobile robots in dynamic environments. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).

Gaspar, J., Winters, N., and Santos-Victor, J. 2000. Vision-based navigation and environmental representations with an omnidirectional camera. IEEE Transactions on Robotics and Automation, 16(6):890–898.

Genest, C. and Zidek, J.V. 1986. Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1(1):114–148.

Gonzalez, R. and Woods, R. 1992. Digital image processing. Addison-Wesley Publishing Company.

Horswill, I. 1998. Artificial Intelligence and Mobile Robots. Chapter The Polly System, MIT Press: Cambridge, MA, pp. 125–139.

Isard, M. and Blake, A. 1998. Condensation: Conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28.

Jensfelt, P. and Kristensen, S. 2001. Active global localization for a mobile robot using multiple hypothesis tracking. IEEE Transactions on Robotics and Automation, 17(5):748–760.

Jogan, M. and Leonardis, A. 2000. Robust localization using eigenspace of spinning-images. In Proceedings of the IEEE Workshop on Omnidirectional Vision, Hilton Head Island, South Carolina, pp. 37–44.

Jogan, M., Leonardis, A., Wildenauer, H., and Bischof, H. 2002. Mobile robot localization under varying illumination. In Proceedings of the International Conference on Pattern Recognition (ICPR), Quebec, Canada, pp. 741–744.

Jolliffe, I.T. 2002. Principal component analysis. Springer Verlag.

Kaelbling, L.P., Cassandra, A.R., and Kurien, J.A. 1996. Acting under uncertainty: Discrete Bayesian models for mobile-robot navigation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Kleinberg, J. 1994. The localization problem for mobile robots. In Proceedings of the IEEE Symposium on Foundations of Computer Science, pp. 521–531.

Konolige, K. 1997. Small vision system: Hardware and implementation. In Proceedings of the 8th International Symposium on Robotics Research, Japan.

Kristensen, S. and Jensfelt, P. 2003. An experimental comparison of localization methods, the MHL session. In Proceedings of the International Conference on Robotics and Intelligent Systems (IROS), Las Vegas, USA, pp. 992–997.

Krose, B.J.A. and Bunschoten, R. 1999. Probabilistic localization by appearance models and active vision. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 2255–2260.

Krose, B.J.A., Vlassis, N., and Bunschoten, R. 2002. Omnidirectional vision for appearance-based robot localization. Lecture Notes in Computer Science, 2238:39–50.

Krose, B.J.A., Vlassis, N., Bunschoten, R., and Motomura, Y. 2001. A probabilistic model for appearance-based robot localization. Image and Vision Computing, 19(6):381–391.

Kwok, C., Fox, D., and Meila, M. 2003. Adaptive real-time particle filters for robot localization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 2836–2841.

Lamon, P., Nourbakhsh, I., Jensen, B., and Siegwart, R. 2001. Deriving and matching image fingerprint sequences for mobile robot localization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Seoul, Korea, pp. 1111–1116.

Lenser, S. and Veloso, M. 2000. Sensor resetting localization for poorly modelled mobile robots. In Proceedings of the International Conference on Robotics and Automation (ICRA), pp. 1225–1232.

Leonard, J.J. and Durrant-Whyte, H.F. 1991. Mobile robot localization by tracking geometric beacons. IEEE Transactions on Robotics and Automation, 7(3):376–382.

Little, J., Lu, J., and Murray, D. 1998. Selecting stable image features for robot localization using stereo. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1072–1077.

Maeda, S., Kuno, Y., and Shirai, Y. 1997. Active vision based on eigenspace analysis. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1018–1023.

Menegatti, E., Zoccarato, M., Pagello, E., and Ishiguro, H. 2003. Image-based Monte-Carlo localisation without a map. In Proceedings of the Conference of the Italian Association of Artificial Intelligence.

Minka, T. 1998. Expectation-Maximization as lower bound maximization. http://www.stat.cmu.edu/~minka/papers/em.html.

Montemerlo, M., Thrun, S., Koller, D., and Wegbreit, B. 2002. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the 18th AAAI National Conference on Artificial Intelligence, pp. 593–598.

Montemerlo, M., Thrun, S., Koller, D., and Wegbreit, B. 2003. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).

Moravec, H.P. 1988. Sensor fusion in certainty grids for mobile robots. AI Magazine, 9(2):61–74.

Murase, H., Kimura, F., Yoshimura, M., and Miyake, Y. 1981. An improvement of the auto-correlation matrix in the pattern matching method and its application to handprinted HIRAGANA recognition. Transactions of the IECEJ, J64-D(3):276–283.

Murase, H. and Nayar, S.K. 1994. Illumination planning for object recognition using parametric eigenspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 16(12):1219–1227.

Murase, H. and Nayar, S.K. 1995. Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision, 14:5–24.

Murray, D. and Little, J. 2000. Using real-time stereo vision for mobile robot navigation. Autonomous Robots, 8(2):161–171.

Olson, C.F. 2000. Probabilistic self-localization for mobile robots. IEEE Transactions on Robotics and Automation, 16(1):55–66.

Pitt, M.K. and Shephard, N. 1999. Filtering via simulation: Auxiliary particle filters. J. Amer. Statist. Assoc., 94(446):590–599.

Porta, J.M. and Krose, B.J.A. 2004. Appearance-based concurrent map building and localization. In Proceedings of the 8th International Conference on Intelligent Autonomous Systems (IAS), pp. 1022–1029.

Roweis, S. 1998. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, vol. 10, The MIT Press, pp. 626–632.

Roy, N. and Gordon, G. 2003. Exponential family PCA for belief compression in POMDPs. In S. Becker, S. Thrun, and K. Obermayer (Eds.), Advances in Neural Information Processing Systems. MIT Press.

Sim, R. and Dudek, G. 2003. Comparing image-based localization methods. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico.

Sirovich, L. and Kirby, M. 1987. Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A, 4(3):519–524.

Sobel, I. 1990. Machine vision of three-dimensional scenes. Chapter An Isotropic 3 × 3 Image Gradient Operator, Academic Press, pp. 376–379.

Sujan, V.A. and Dubowsky, S. 2002. Visually built task models for robot teams in unstructured environments. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Washington D.C., USA, pp. 1782–1787.

Terwijn, B., Porta, J.M., and Krose, B.J.A. 2004. A particle filter to estimate non-Markovian states. In Proceedings of the 8th International Conference on Intelligent Autonomous Systems (IAS), pp. 1062–1069.

Thrun, S., Burgard, W., and Fox, D. 1998. A probabilistic approach to concurrent mapping and localization for mobile robots. Machine Learning, 31:29–53.

Thrun, S., Fox, D., and Burgard, W. 2000. Monte Carlo localization with mixture proposal distribution. In Proceedings of the AAAI National Conference on Artificial Intelligence, pp. 859–865.

Thrun, S., Fox, D., Burgard, W., and Dellaert, F. 2001. Robust Monte Carlo localization for mobile robots. Artificial Intelligence, 128(1/2):99–141.

van Breemen, A.J.N., Crucq, K., Krose, B.J.A., Nuttin, M., Porta, J.M., and Demeester, E. 2003. A user-interface robot for ambient intelligent environments. In Proceedings of the 1st International Workshop on Advances in Service Robotics (ASER), Bardolino, March 13–15.

Verbeek, J.J. 2002. Probabilistic PCA with missing values. Matlab source code. http://www.science.uva.nl/~jverbeek/software.

Vlassis, N., Terwijn, B., and Krose, B.J.A. 2002. Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Washington D.C., USA, pp. 7–12.

Wand, M.P. and Jones, M.C. 1995. Kernel smoothing. Chapman & Hall: London.

Josep M. Porta received the Engineer Degree in Computer Science in 1994 and the Ph.D. in Artificial Intelligence in 2001, both from the Universitat Politecnica de Catalunya. After that, he joined the IAS group of the University of Amsterdam and currently he holds a post-doc position at the Institut de Robotica i Informatica Industrial (CSIC-UPC) in Barcelona. He has carried out research on legged robots, machine learning, vision-based methods for autonomous robot localization, and computational kinematics.

Jakob Verbeek received a cum laude Masters degree in Artificial Intelligence (1998) and a cum laude Master of Logic degree (2000), both at the University of Amsterdam, The Netherlands. Since 2000 he has been a Ph.D. student in the Advanced School for Computing and Imaging, working in the Intelligent Autonomous Systems group at the University of Amsterdam. His research interests include nonlinear feature extraction, density estimation and machine learning in general.

Ben Krose is an Associate Professor at the University of Amsterdam, where he leads a group in Computational Intelligence and Autonomous Systems. He teaches courses on Computational Intelligence and Neurocomputing, and supervises 4 Ph.D. students and many M.Sc. students. As of 2004 he is a part-time lector at the Institute of Information Engineering, where he leads a group in the field of Digital Life and Smart Homes. He has published numerous papers in the fields of learning and autonomous systems. He is a member of the IEEE, the Dutch Pattern Recognition Association and the Dutch AI Association.