Vision Aided Inertial Navigation


Advanced Robotics, 2014
Vol. 28, No. 11, 717–729, http://dx.doi.org/10.1080/01691864.2013.870496

FULL PAPER

Tightly-coupled stereo vision-aided inertial navigation using feature-based motion sensors

E. Asadi (a) and C.L. Bottasso (a,b)
(a) Department of Aerospace Science and Technology, Politecnico di Milano, Milano 20156, Italy; (b) Wind Energy Institute, Technische Universität München, Garching bei München 85748, Germany

(Received 31 March 2013; revised 6 August 2013; accepted 22 September 2013)
Published online: 13 Jan 2014.
Corresponding author. Email: [email protected]

A tightly-coupled stereo vision-aided inertial navigation system is proposed in this work, as a synergistic incorporation of vision with other sensors. In order to avoid the loss of information possibly resulting from visual preprocessing, a set of feature-based motion sensors and an inertial measurement unit are directly fused together to estimate the vehicle state. Two alternative feature-based observation models are considered within the proposed fusion architecture. The first model uses the trifocal tensor to propagate feature points by homography, so as to express geometric constraints among three consecutive scenes. The second one is derived by using a rigid body motion model applied to three-dimensional (3D) reconstructed feature points. A kinematic model accounts for the vehicle motion, and a Sigma-Point Kalman filter is used to achieve a robust state estimation in the presence of non-linearities. The proposed formulation is derived for a general platform-independent 3D problem, and it is tested and demonstrated with a real dynamic indoor data-set alongside a simulation experiment. Results show improved estimates with respect to a classical visual odometry approach and to a loosely-coupled stereo vision-aided inertial navigation system, even in GPS (Global Positioning System)-denied conditions and when magnetometer measurements are not reliable.

Keywords: vision-aided inertial navigation; sensor fusion; tight-coupling; trifocal constraint

1. Introduction

It appears that some animal species have evolved sophisticated behavioral capabilities by, among other means, fusing together specialized multiple sensory systems, as for example in the case of some flying insects.[1] Mimicking this situation, the sensor fusion concept is being actively exploited in many robotics applications using both proprioceptive and exteroceptive sensors.[2]

Stereo vision systems [3] are amongst the most information-rich sensors available today, and perform in a manner similar to animal binocular vision.[4] Eyes provide the brain with two views of a scene, which are in turn combined and processed by the brain itself to build a complex representation of the environment, including features and depth information. Vision-based systems have been gaining increased attention in the field of autonomous control and navigation in ground,[5–7] underwater,[8] and also aerial [9,10] applications. Several attempts have already been documented in the design and implementation of robust visual odometry systems relying on monocular or binocular vision.[11–13] Egomotion estimation through the use of the trifocal tensor with monocular vision was described in [14], and then extended in [12] to stereo visual odometry by considering a simple constant velocity motion. Visual odometry approaches based on a quadrifocal relationship have also been proposed.[15]

However, reliance on the sole visual information is a potential weakness of visual odometry methods for many applications, as scenes may occasionally become not sufficiently information rich and vision may be impaired by several external factors. Some authors have addressed this problem by combining vision and inertial sensor information, as this may enhance the quality of state estimates, with a positive cascading effect on the control systems that make use of such estimates. Inertial measurements have been used as model inputs [16] or states,[17–19] using variants of the Kalman filtering approach to robustly estimate motion. For example, a combination of vision-based motion sensors with inertial navigation via three-dimensional (3D) reconstruction was presented in [20]. An interesting characteristic of the vision-aided inertial approach is that it may significantly improve estimation even in difficult situations, such as GPS (Global Positioning System)-denied conditions, for example under vegetation cover or indoors, or in the presence of disruptions in the tracking of visual features, for example when the vehicle undergoes rapid motions.

As described in [21], there are two possible architectures for the fusion of an inertial navigation system (INS) with a vision unit. The first, termed the loosely-coupled approach,[22–24] uses separate INS and structure from motion (SfM) blocks that exchange information. In the second, termed the tightly-coupled approach,[20,25] all information comprising the raw vision data and the inertial measurements is fused within a single high-order estimation filter.

In most tightly-coupled vision-aided systems, vehicle states as well as visual landmarks are jointly estimated,[17,25–28] similarly to simultaneous localization and mapping (SLAM) approaches.[29] These methods are capable of handling the displacement measurement provided by the vision system, while bounding the position estimation error; however, their computational requirements can be significant due to the state vector augmentation that they perform. This may hinder their use in real-time demanding applications.

There are also approximate SLAM solutions such as [30,31], providing less computational complexity and suboptimal estimates. Other SLAM approaches may reduce the amount of information that they process by incorporating only high-level features (e.g. features such as corners, junctions, and lines), in order to achieve real-time capabilities. This way, the problem of data association is simplified by applying the correspondence search over a smaller number of features. As considered in [32,33], revisiting landmarks that were previously tracked would not be of significant benefit for long-distance navigation. The idea explored in [32,33] is based on incorporating high-level landmarks that can be easily tracked for long periods of time using a track maintenance algorithm. This way only high-level landmarks are augmented to the state vector, while stale landmarks (i.e. correspondences that have been unsuccessful for a given period of time) are pruned by removing the associated filter states. However, the resulting estimates of the vehicle states are suboptimal compared to those obtained by incorporating all feature points, including both low- and high-level features.

There are some other approaches [19,34] incorporating only frame-to-frame measurements with a reduced computational complexity in comparison to SLAM. In this case, relative-state measurements are processed via a sole localization approach, while delayed states corresponding to previous poses are maintained by expanding the state vector. A formal approach to this problem was also proposed in [2], which developed a stochastic cloning framework based on the indirect form of the Kalman filter. This method also considers augmenting the state vector with two instances of the vehicle states together with a copy of the last measurements.

This work proposes a tightly-coupled stereo vision-aided inertial navigation system (TC-SVA-INS), resulting in an algorithm that is suitable for real-time applications with fast dynamics, as for example the flight of small autonomous vehicles. In order to avoid the loss of information resulting from the preprocessing of visual data,[35] this paper uses a tightly-coupled approach that relies on directly fusing tracked feature points together with inertial measurements to yield state estimates. In this work, two alternative feature-based observation models are embedded within the proposed tightly-coupled fusion architecture. The first model is derived via the trifocal tensor, which encapsulates geometric constraints amongst three different scenes. Some preliminary results concerning this model are presented in [36], while a comprehensive assessment of its performance is presented here in comparison to other possible alternative approaches. The second model is derived via rigid body motion applied to 3D reconstructed feature points. In order to reduce the computational cost of processing relative-state measurements,[2] here the standard state estimation method is used in conjunction with a simplified observation model. A comparison of these two tightly-coupled solutions with a loosely-coupled one is provided as well.

The proposed approach is shown to limit the degradation of position estimation in GPS-denied conditions. Differently from other visual odometry approaches, a kinematic motion model is used here to better reflect the actual dynamic behavior of the vehicle. The formulation, being derived for arbitrary 3D motion using a quaternion parameterization of finite rotations, is independent of the platform and can be applied to any vehicle. Outlier rejection is achieved via egomotion estimation, which is performed outside of the sensor fusion loop. The Sigma-Point Kalman filter (SPKF) is used to perform data fusion because of its robustness in the presence of non-linearities. The TC-SVA-INS method could also be used in parallel with low-frequency SLAM approaches in a decentralized architecture to enhance its precision, where information is distributed between two loops,[37,38] although this possibility is not explored here.

While efficiency in terms of computational effort and reduced complexity is of the utmost importance in a real-time multi-sensory application, a straightforward implementation of the standard estimator without any state augmentation is highly advisable. Our approach does not utilize any state vector augmentation, neither the one that SLAM methods perform nor the form of augmentation proposed in localization methods. This way, the approach allows for a rather high number of relative-state measurements to be fused with inertial information via a tightly-coupled architecture using a single standard estimator. This results in a reliable frame-to-frame displacement estimation and robust outlier rejection by benefiting from all available visual information (i.e. both low- and high-level features). Additionally, a reduced information loss is achieved in comparison with a loosely-coupled architecture. The computational cost of the method does not grow with the environment size. Data association is simple and readily provided by a tracking algorithm, as only frame-to-frame features are used. The tracking system in our work does not use any kind of maintenance algorithm. This way, the correlation between consecutive measurements is neglected, as previously assumed in [39,40] and as suggested in [2] to reduce computational complexity.

The effectiveness of the proposed formulation is demonstrated with the help of both an aerial simulation experiment and a real data-set gathered in a dynamic indoor case by the Rawseeds project.[41]

2. Classical inertial navigation

In this work, tracked feature points are fused via a tightly-coupled architecture with a standard inertial measurement unit based on triaxial accelerometers and gyros, in addition to a triaxial magnetometer, a sonar altimeter, and a GPS. The inertial measurement formulation is based on previously presented work,[20,42] which is briefly reviewed next.

2.1. Kinematics

The inertial frame of reference is centered at point O and denoted by a triad of unit vectors E := (e_1, e_2, e_3), pointing North, East, and Down (NED navigational system). A body-attached frame has origin in the generic material point B of the vehicle and has a triad of unit vectors B := (b_1, b_2, b_3) (see Figure 1).

The components of the acceleration in the body-attached frame are sensed by an accelerometer located at point A on the vehicle. The accelerometer yields a reading a_acc affected by noise n_acc. Gyroscopes measure the body-attached components of the angular velocity vector, yielding a reading ω_gyro affected by a noise disturbance n_gyro. The kinematic equations, describing the motion of the body-attached reference frame with respect to the inertial one, can be written as

\dot{v}_B^E = g^E - R \left( a_{acc} + \omega^B \times (\omega^B \times r_{BA}^B) + h(\omega_{gyro}, n_{gyro}) \times r_{BA}^B \right) + R\, n_{acc},   (1a)
\dot{\omega}^B = h(\omega_{gyro}, n_{gyro}),   (1b)
\dot{r}_{OB}^E = v_B^E,   (1c)
\dot{q} = T(\omega^B)\, q,   (1d)

where v_B is the velocity of point B, ω is the angular velocity, and ω̇ the angular acceleration. r_BA is the position vector from point B to point A, while r_OB is from point O to point B. In the previous expressions, g^E = (0, 0, g)^T indicates the components of the acceleration of gravity, R = R(q) are the components of the rotation tensor that brings triad E into triad B, while q are rotation parameters, which are chosen here as quaternions, so that matrix T can be written as

T(\omega^B) = \frac{1}{2} \begin{pmatrix} 0 & -\omega^{B\,T} \\ \omega^B & -\omega^B \times \end{pmatrix}.   (2)

The angular acceleration at time t_k is computed according to the following three-point stencil formula based on a parabolic interpolation

h(t_k) = \left( 3\,\omega_{gyro}(t_k) - 4\,\omega_{gyro}(t_{k-1}) + \omega_{gyro}(t_{k-2}) \right) / (2\,\Delta t),   (3)

where Δt = t_k − t_{k−1} = t_{k−1} − t_{k−2} (cf. [20]).
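To make the discrete-time use of Equations (1)–(3) concrete, the following minimal Python sketch shows one possible implementation of the quaternion kinematics and the three-point angular-acceleration stencil. The function names, the explicit-Euler integration, and the sign convention adopted for the lever-arm terms of Eq. (1a) are assumptions made for illustration only, not the authors' code.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix, so that skew(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def quat_rates(q, w_b):
    """Quaternion derivative q_dot = T(w^B) q of Eqs. (1d)-(2), with q = (q0, qx, qy, qz)."""
    T = 0.5 * np.block([[np.zeros((1, 1)), -w_b.reshape(1, 3)],
                        [w_b.reshape(3, 1), -skew(w_b)]])
    return T @ q

def angular_accel(w_k, w_km1, w_km2, dt):
    """Three-point backward stencil of Eq. (3) for the angular acceleration at t_k."""
    return (3.0 * w_k - 4.0 * w_km1 + w_km2) / (2.0 * dt)

def propagate(x, a_acc, w_dot, R, dt, r_BA=np.zeros(3), g=np.array([0.0, 0.0, 9.81])):
    """One explicit-Euler step of the rigid-body kinematics (1a)-(1d).
    x = (v^E (3), w^B (3), r^E (3), q (4)); R maps body components into NED components.
    Lever-arm and sign conventions follow the reconstruction of Eq. (1a) above (assumed)."""
    v, w, r, q = x[:3], x[3:6], x[6:9], x[9:13]
    v_dot = g - R @ (a_acc + np.cross(w, np.cross(w, r_BA)) + np.cross(w_dot, r_BA))
    x_new = np.concatenate([v + dt * v_dot, w + dt * w_dot, r + dt * v, q + dt * quat_rates(q, w)])
    x_new[9:13] /= np.linalg.norm(x_new[9:13])  # keep the quaternion normalized
    return x_new
```

In practice a higher-order integrator could replace the Euler step; the sketch only illustrates how the continuous model maps to a discrete filter prediction.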

2.2. Sensors

A GPS is located at point G on the vehicle (see Figure 1). The inertial components of the velocity and position vectors of point G, noted respectively v_G^E and r_OG^E, can be expressed in terms of v_B^E, ω^B, r_OB^E, and q as

v_G^E = v_B^E + R \left( \omega^B \times r_{BG}^B \right),   (4a)
r_{OG}^E = r_{OB}^E + R\, r_{BG}^B,   (4b)

where r_BG is the position vector from point B to point G. The GPS yields measurements of the position and velocity of point G affected by their respective noise vectors n_{r_gps} and n_{v_gps}, i.e.

v_{gps} = v_G^E + n_{v_{gps}},   (5a)
r_{gps} = r_{OG}^E + n_{r_{gps}}.   (5b)

A sonar altimeter measures the distance h along the body-attached vector b_3, between its location at point S on the vehicle and point T on the terrain, as shown in Figure 1, resulting in the expression

h = -\, r_{OB_3}^E / R_{33} - s,   (6)

where r_{OB_3}^E = r_{OB} \cdot e_3 and R = [R_{ij}], i, j = 1, 2, 3. The sonar altimeter yields a reading h_sonar affected by noise n_sonar, i.e.

h_{sonar} = h + n_{sonar}.   (7)

Figure 1. Reference frames and location of sensors.

Finally, we consider a magnetometer that senses the magnetic field m of the Earth in the body-attached system B. The inertial components m^E are assumed to be known and constant in the area of operation of the vehicle. Hence, we get

m^B = R^T m^E.   (8)

The magnetometer yields a measurement m_magn affected by noise n_magn, i.e.

m_{magn} = m^B + n_{magn}.   (9)
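As a companion to Equations (4)–(9), the sketch below illustrates how the GPS, magnetometer, and sonar observation models might be evaluated from the state. It is a simplified illustration with assumed names, and the sonar sign convention follows the reconstruction of Eq. (6) above rather than a verified source.

```python
import numpy as np

def h_gps(v_E, r_E, w_B, R, r_BG):
    """GPS observation model, Eqs. (4a)-(4b): velocity and position of the antenna point G."""
    v_G = v_E + R @ np.cross(w_B, r_BG)
    r_G = r_E + R @ r_BG
    return v_G, r_G

def h_magnetometer(R, m_E):
    """Magnetometer observation model, Eq. (8): known Earth field rotated into body axes."""
    return R.T @ m_E

def h_sonar(r_E, R, s=0.0):
    """Sonar altimeter model in the spirit of Eq. (6) (sign convention assumed):
    NED altitude above flat terrain is -r_E[2], divided by the b3 tilt factor R[2, 2],
    minus the sensor offset s along b3."""
    return -r_E[2] / R[2, 2] - s
```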

3. Vision system

Considering a pair of stereo cameras located on the vehicle (see Figure 1), a triad of unit vectors C := (c_1, c_2, c_3) has origin at the optical center C of the left camera, where c_1 is directed along the horizontal scanlines of the image plane and c_3 is parallel to the optical axis, pointing towards the scene. The right camera has its optical axis and scanlines parallel to those of the left camera, i.e. C′ ∥ C, where we use the symbol (·)′ to indicate quantities of the right camera. The origin C′ of triad C′ is at a distance b = b c_1 from C, where b is the stereo baseline.

The SVA-INS architecture investigated here is based on tracking scene points among stereo images and across time steps. The identification and matching of feature points begins with the acquisition of the images and their rectification. In this work, strong corners are extracted and then tracked using a dense disparity map obtained by the efficient large-scale stereo (ELAS) matching method.[43]

3.1. Outlier detection

Having a set of corresponding feature points between left and right images and from the previous frame to the current one, various methods can be used to detect and remove outliers. This process is here performed by an optimization approach based on the minimization of the sum of reprojection errors, merged into the random sample consensus (RANSAC) algorithm, similarly to what is done in [44]. However, differently from [44], this procedure is used here to reliably identify outliers in dynamic environments, instead of estimating camera motion.

Consider a feature point P, whose position vectors with respect to frames C and C′ are noted d and d′, respectively. The projections of P onto the image planes are noted p^C and p′^{C′}. Assuming an ideal pinhole camera model, the homogeneous components of the position vectors are indicated by the symbol (~) and noted \tilde{d}^C = (d_x, d_y, d_z, 1)^T and \tilde{p}^C = (p_u, p_v, 1)^T, and they are related by perspective projection.

Detection of outliers via egomotion estimation starts with a 3D reconstruction that implies the projection of each feature point from the stereo image planes to the 3D world, using the triangulation method [45] and a disparity map dp obtained by ELAS. This results in the expressions

d_z^{C_{k-1}} = f\, b / dp,   (10a)
d_x^{C_{k-1}} = p_v^{C_{k-1}}\, d_z^{C_{k-1}} / f,   (10b)
d_y^{C_{k-1}} = p_u^{C_{k-1}}\, d_z^{C_{k-1}} / f,   (10c)

where the camera calibration parameters, focal length f and optical center c_u and c_v, are assumed to be known.
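A minimal sketch of the triangulation step of Eq. (10) is given below. The pairing between image coordinates and camera axes is taken from the reconstructed equations and should be treated as an assumption, as are the function and parameter names.

```python
import numpy as np

def triangulate(p_u, p_v, disparity, f, b):
    """Stereo triangulation of one tracked feature following Eq. (10),
    returning its 3D position in the left-camera frame C_{k-1}."""
    d_z = f * b / disparity   # Eq. (10a): depth from disparity
    d_x = p_v * d_z / f       # Eq. (10b)
    d_y = p_u * d_z / f       # Eq. (10c)
    return np.array([d_x, d_y, d_z])
```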

Next, each projected 3D point in the camera frame C_{k-1} is transformed to the new frame C_k according to the camera translational (Δr_ego) and rotational (Δθ_ego) motion parameters. The projected 3D point in the new frame is then re-projected to its image plane to obtain \hat{p}^{C_k}. These steps can be expressed as

\begin{pmatrix} \hat{p}_u^{C_k} \\ \hat{p}_v^{C_k} \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & c_u \\ 0 & f & c_v \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} R(\Delta\theta_{ego}) & \Delta r_{ego} \end{pmatrix} \begin{pmatrix} d_x^{C_{k-1}} \\ d_y^{C_{k-1}} \\ d_z^{C_{k-1}} \\ 1 \end{pmatrix},   (11)

where R is the rotation matrix associated with the rotation parameters Δθ_ego. An identical relationship is obtained for the right camera.

The position of a feature obtained by the tracking procedure at the same time step t_k is noted p_{vsn}^{C_k}. Thus, the re-projection error Δp for all feature points in both views can be readily calculated as

\Delta p = \sum_{i=1}^{N} \left( \left\| p_{vsn}^{C_k}(i) - \hat{p}^{C_k} \right\|^2 + \left\| p_{vsn}^{\prime C_k}(i) - \hat{p}^{\prime C_k} \right\|^2 \right).   (12)

A Gauss–Newton approach is used to iteratively minimize (12). Motion parameters are independently estimated based on a given number (typically 50 here) of triple sample points selected at random, to find the best subset of inliers with reprojection errors smaller than a given threshold. Finally, a subset of inliers is selected to be incorporated in the data fusion algorithm. These inliers are chosen by looking at corresponding points in stereo views that satisfy the equation p^{\prime T} F_{ba}\, p = 0, up to a given tight threshold, where F_{ba} is the fundamental matrix between stereo frames.
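The following sketch outlines a RANSAC-style selection of inliers based on the re-projection error of Eqs. (11)–(12). It is simplified to a single (left) view, `estimate_motion` stands in for the Gauss–Newton egomotion refinement (not reproduced here), and all names as well as the sign convention of the translation are assumptions.

```python
import numpy as np

def reproject(K, R_ego, dr_ego, d_points):
    """Re-project 3D points from frame C_{k-1} into frame C_k, in the spirit of Eq. (11).
    d_points: (N, 3) array of triangulated points; returns (N, 2) pixel coordinates."""
    d_k = d_points @ R_ego.T + dr_ego           # rigid transform into C_k (sign of dr_ego assumed)
    p_hom = d_k @ K.T                           # apply the calibration matrix
    return p_hom[:, :2] / p_hom[:, 2:3]         # perspective normalization

def ransac_inliers(K, d_points, p_tracked, estimate_motion, n_trials=50, threshold=1.0):
    """RANSAC-style outlier rejection on the re-projection error of Eq. (12).
    `estimate_motion(idx)` is a user-supplied Gauss-Newton egomotion solver (hypothetical)
    returning (R_ego, dr_ego) from the three sampled correspondences."""
    best_inliers = np.zeros(len(d_points), dtype=bool)
    rng = np.random.default_rng(0)
    for _ in range(n_trials):
        idx = rng.choice(len(d_points), size=3, replace=False)   # triple sample
        R_ego, dr_ego = estimate_motion(idx)
        err = np.linalg.norm(reproject(K, R_ego, dr_ego, d_points) - p_tracked, axis=1)
        inliers = err < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```

The epipolar check p'ᵀ F_ba p = 0 mentioned above would then be applied to the surviving correspondences before passing them to the fusion algorithm.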

4. Tightly-coupled vision-aided inertial navigation

Each pair of tracked feature points across two consecutive frames provides an implicit measure of motion, this way effectively playing the role of a feature-based motion sensor. A tightly-coupled architecture is developed by incorporating in the process the selected subset of tracked feature points to yield an estimate of the vehicle states.

The estimator is based on the following state-space model:

\dot{x}(t) = f\left( x(t), u(t), \nu(t) \right),   (13a)
y(t_k) = h\left( x(t_k) \right),   (13b)
z(t_k) = y(t_k) + \mu(t_k),   (13c)

where the state vector x is defined as

x := \left( v_B^{E\,T}, \omega^{B\,T}, r_{OB}^{E\,T}, q^T \right)^T.   (14)

Function f(·,·,·) in (13a) represents in compact form the rigid body kinematics expressed by (1). The input vector u is defined by the measurements provided by the accelerometers a_acc and gyros ω_gyro, i.e. u := (a_acc^T, ω_gyro^T)^T, while ν := (n_acc^T, n_gyro^T)^T is its associated measurement noise vector. The observation models for GPS, sonar, and magnetometer are given by Equations (4), (6), and (8), respectively.

For the implementation of the vision system, a discrete model is needed that provides a relationship among feature points in three scenes. Hence, a set of feature points in two old scenes is propagated by an observation model to their respective positions in a new frame. Two alternative vision-based models are considered within the proposed tightly-coupled fusion architecture by using:

• a trifocal tensor to propagate feature points by geometric constraints among three scenes; or
• a rigid body model applied to 3D reconstructed feature points.

The two approaches are described next.
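As an illustration of the state-space structure of Eqs. (13)–(14), a possible layout of the 13-element state vector is sketched below; the class and field names are assumptions introduced for clarity.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class State:
    """State vector of Eq. (14): x = (v_B^E, w^B, r_OB^E, q)."""
    v: np.ndarray   # inertial (NED) velocity of point B, shape (3,)
    w: np.ndarray   # body-frame angular velocity, shape (3,)
    r: np.ndarray   # inertial position of B with respect to O, shape (3,)
    q: np.ndarray   # attitude quaternion bringing triad E into triad B, shape (4,)

    def pack(self):
        """Flatten into the 13-element vector handled by the filter."""
        return np.concatenate([self.v, self.w, self.r, self.q])

    @staticmethod
    def unpack(x):
        """Rebuild the structured state from a flat 13-element vector."""
        return State(x[0:3], x[3:6], x[6:9], x[9:13])
```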

4.1. Vision-based model via trifocal constraints

Given a pair of matched points in two scenes, p^{C_{k-1}} ↔ p′^{C_{k-1}}, and the camera projection matrices, it is possible to extract the position of the points in a third scene by the use of the trifocal tensor.[12,45] In order to implement this property as a vision-based observation model directly within the estimator, it is useful to derive the resulting projection matrices in terms of v_B^E, ω^B, and q, which are the elements of the estimator state vector x.

Consider a feature point P projected onto the image planes of one stereo camera at time t_{k-1} (Figure 2), where its position components with respect to the left camera local reference frame are d(t_{k-1})^{C_{k-1}}. The components can be transformed to the right frame at t_{k-1} by a translation along the stereo baseline, and to the left frame at t_k via translational (Δr_c) and rotational (R_c) transformations between the two consecutive frames, as

d(t_{k-1})^{C'_{k-1}} = d(t_{k-1})^{C_{k-1}} - b,   (15a)
d(t_k)^{C_k} = R_c\, d(t_{k-1})^{C_{k-1}} - \Delta r_c.   (15b)

Figure 2. Geometry for the derivation of the discrete vision-based model via trifocal constraints.

By expressing the velocity of the camera optical center C in terms of the velocity of point B, the translational components can be expressed in terms of the states x, i.e.

\Delta r_c = \Delta t\, C^T \left( R(t_k)^T v_B^E(t_k) + \omega^B(t_k) \times c^B \right),   (16)

where matrix C indicates the constant-in-time rotation tensor that brings triad B into C, while c := r_{BC} is the position vector of point C with respect to point B.

The rotational motion between two consecutive frames can be derived by composing the corresponding rotation tensors, as

R_c = C^T R(t_k)^T R(t_{k-1})\, C.   (17)

By introducing the angular velocity according to I - R(t_k)^T R(t_{k-1}) ≈ \Delta t\, \omega(t_k)^B \times, the rotational motion can be expressed in terms of the states x as

R_c = I - \Delta t\, C^T \omega(t_k)^B \times C.   (18)

Projections that map feature coordinates in local reference frames to the corresponding homogeneous image coordinates in each of the three scenes can be obtained via (15) as

\tilde{p}(t_{k-1})^{C_{k-1}} = K_L \left[ I \;\; 0 \right] \tilde{d}(t_{k-1})^{C_{k-1}},   (19a)
\tilde{p}(t_{k-1})^{C'_{k-1}} = K_R \left[ I \;\; -b \right] \tilde{d}(t_{k-1})^{C_{k-1}},   (19b)
\tilde{p}(t_k)^{C_k} = K_L \left[ R_c \;\; -\Delta r_c \right] \tilde{d}(t_{k-1})^{C_{k-1}}.   (19c)

Here, P_a = K_L [ I  0 ], P_b = K_R [ I  -b ], and P_c = K_L [ R_c  -Δr_c ] are projection matrices, while K_L and K_R are the left and the right camera calibration matrices, respectively.

The trifocal tensor, T, can be derived from the projection matrices,[45] using standard matrix-vector notation (see Note 1), yielding

T_i^{qr} = (-1)^{i+1} \det \begin{bmatrix} {\sim}P_a^i \\ P_b^q \\ P_c^r \end{bmatrix}.   (20)

Next, point p_a in the first frame is transferred to the third one, \hat{p}_c, by a homography mapping via a line l_b passing through the corresponding point p_b in the second frame, i.e.

\hat{p}_c^r = p_a^i\, l_b^q\, T_i^{qr}.   (21)

A convenient choice for l_b was suggested in [45], as the line perpendicular to F_{ba}\, p(t_{k-1}). Equations (16)–(21) provide a non-linear discrete observation model, which can be used to predict the output of the feature-based motion sensors p(t_k)^{C_k}, as described later on in Section 4.3.
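The computation of the trifocal tensor of Eq. (20) and the point transfer of Eq. (21) can be written compactly as below. This is a sketch of the standard construction from Hartley and Zisserman [45] under the notation reconstructed above; the choice of the line l_b and the homogeneous normalization of the transferred point are omitted, and all names are assumptions.

```python
import numpy as np

def trifocal_tensor(Pa, Pb, Pc):
    """Trifocal tensor from the three 3x4 projection matrices, following Eq. (20):
    T[i, q, r] = (-1)^(i+1) det([Pa without row i; row q of Pb; row r of Pc]),
    with the 1-based row index i of the equation mapped to 0-based indexing here."""
    T = np.zeros((3, 3, 3))
    for i in range(3):
        Pa_wo_i = np.delete(Pa, i, axis=0)                           # 2x4 block: Pa with row i removed
        for q in range(3):
            for r in range(3):
                M = np.vstack([Pa_wo_i, Pb[q:q + 1], Pc[r:r + 1]])   # 4x4 matrix
                T[i, q, r] = (-1.0) ** i * np.linalg.det(M)          # (-1)^(i+1) for the 1-based index
    return T

def transfer_point(T, p_a, l_b):
    """Homography point transfer of Eq. (21): p_c^r = p_a^i l_b^q T_i^{qr} (summation implied).
    p_a is the homogeneous point in the first view, l_b a line through its match in the second."""
    return np.einsum('i,q,iqr->r', p_a, l_b, T)
```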

4.2. Vision-based model via 3D reconstruction

An alternative to the trifocal tensor approach is discussed here. In this model, the reconstructed 3D points, obtained by (10), are propagated across time by directly using the equations of rigid body motion. A preliminary version of this model was previously presented in [20] and evaluated via a different framework and simulation experiments. The input/output structure of the model is modified in this work from 3D–3D to 3D–two-dimensional (2D), which not only reduces the computational effort but also enhances the quality of the estimates.

Considering two consecutive time instants t_{k-1} and t_k, the following vector closure relationship (see Figure 3) can be derived for both the left and right cameras

r(t_k) + c(t_k) + d(t_k) = r(t_{k-1}) + c(t_{k-1}) + d(t_{k-1}),   (22)

where r := r_OB is the position vector of the reference point B on the vehicle with respect to the origin O of the inertial frame.

Figure 3. Geometry for the derivation of the discrete vision-based model via 3D reconstruction.

By expressing each of the terms of Equation (22) in convenient reference frames, we obtain

d(t_k)^{C_k} = -C^T \left( R(t_k)^T \left( r(t_k)^E - r(t_{k-1})^E \right) + \left( I - R(t_k)^T R(t_{k-1}) \right) \left( c^B + C\, d(t_{k-1})^{C_{k-1}} \right) \right) + d(t_{k-1})^{C_{k-1}}.   (23)

This expression depends on the absolute position vector r, a quantity that however cannot be observed by a vision system, which only senses relative distances. Hence, since in the absence of GPS measurements the observed absolute position will drift away from the true one, it is advisable to rewrite the above equation in terms of velocities. By setting

I - R(t_k)^T R(t_{k-1}) ≈ \Delta t\, \omega(t_k)^B \times,   (24a)
r(t_k)^E - r(t_{k-1})^E ≈ \Delta t\, v^E(t_k),   (24b)

we have

d(t_k)^{C_k} = -\Delta t\, C^T \left( R(t_k)^T v^E(t_k) + \omega^B(t_k) \times \left( c^B + C\, d(t_{k-1})^{C_{k-1}} \right) \right) + d(t_{k-1})^{C_{k-1}}.   (25)

The left-hand side of Equation (25), d(t_k)^{C_k}, may be computed by stereo reconstruction from tracked feature points at time t_k using the triangulation approach, as proposed in [20]. In other words, the 2D feature points are first mapped to the 3D world and then used as visual observations within the filter. This way, however, unknown errors are propagated to the observations.

To avoid this problem, the model is modified here by re-projecting the output of Equation (25), i.e. the predicted 3D feature points, to the 2D image plane of the current scene, so as to compute \hat{p}(t_k)^{C_k} as the final output of the vision-based model, giving

\begin{pmatrix} \hat{p}_u(t_k)^{C_k} \\ \hat{p}_v(t_k)^{C_k} \end{pmatrix} = \begin{pmatrix} f / d_z(t_k)^{C_k} & 0 & c_u \\ 0 & f / d_z(t_k)^{C_k} & c_v \end{pmatrix} \begin{pmatrix} d_x(t_k)^{C_k} \\ d_y(t_k)^{C_k} \\ 1 \end{pmatrix}.   (26)

In this case, the left-hand side of Equation (26), p(t_k)^{C_k}, is directly available from the tracking system at time t_k.

Equations (25) and (26) provide an alternative non-linear discrete model to predict the output of the feature-based motion sensors p(t_k)^{C_k}, and its use for multi-sensor fusion is described next.
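A compact sketch of the resulting observation model is given below: Eq. (25) propagates a reconstructed point into the current camera frame from the filter states, and Eq. (26) re-projects it onto the image plane. Names and sign conventions follow the reconstruction above and are assumptions, not the authors' implementation.

```python
import numpy as np

def predict_feature(d_prev, v_E, w_B, R_k, C, c_B, dt, f, c_u, c_v):
    """Predicted image position of a tracked feature via Eqs. (25)-(26).
    d_prev: 3D point in the previous left-camera frame C_{k-1};
    R_k maps body axes to inertial axes at t_k, C maps camera axes to body axes."""
    # Eq. (25): rigid-body propagation of the reconstructed point into frame C_k
    d_k = d_prev - dt * C.T @ (R_k.T @ v_E + np.cross(w_B, c_B + C @ d_prev))
    # Eq. (26): perspective re-projection onto the current left image plane
    p_u = f * d_k[0] / d_k[2] + c_u
    p_v = f * d_k[1] / d_k[2] + c_v
    return np.array([p_u, p_v])
```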

4.3. Vision-augmented fusion

The two alternative vision-based models presented above are separately incorporated in the sensor fusion algorithm. The first implementation, termed tightly-coupled stereo vision-aided inertial navigation via trifocal constraints (TC-SVA-INS-TC), is based on Equations (16)–(21), which can be appended to Equations (4), (6), and (8), defining a vision-augmented observation model, expressed in compact form by function h(·) in (13b). The other, called tightly-coupled stereo vision-aided inertial navigation via 3D reconstruction (TC-SVA-INS-3D), is based on Equations (10) and (25)–(26). As for the previous method, these equations can be appended to Equations (4), (6), and (8) to form the vision-augmented observation model.

For each feature point in the subset of best inliers, we include in the augmented vector a new output for both the left and right cameras:

y = \left( v_G^{E\,T}, r_{OG}^{E\,T}, h, m^{B\,T}, \ldots, p(t_k)^{C_k\,T}, p'(t_k)^{C_k\,T}, \ldots \right)^T.   (27)

Considering the tracking results at time t_k, these yield visual measurements p_vsn affected by noise n_vsn:

p_{vsn} = p(t_k)^{C_k} + n_{vsn}.   (28)

Accordingly, (28) is appended to the measurement equations, where the vision-augmented measurement and noise vectors are defined as

z := \left( v_{gps}^T, r_{gps}^T, h_{sonar}, m_{magn}^T, \ldots, p_{vsn}^T, p_{vsn}^{\prime T}, \ldots \right)^T,   (29a)
\mu := \left( n_{v_{gps}}^T, n_{r_{gps}}^T, n_{sonar}, n_{magn}^T, \ldots, n_{vsn}^T, n_{vsn}^{\prime T}, \ldots \right)^T.   (29b)

The SPKF [46,47] is here preferred to the extended Kalman filter, and it is used for handling non-linearities in the observation model.
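The sketch below shows how the vision-augmented output vector of Eq. (27) might be assembled for the SPKF update. Here `x_outputs` collects the GPS, sonar, and magnetometer predictions and `predicted_pixels` the per-feature predictions of either vision-based model; both names are assumptions. The measurement vector z of Eq. (29a) would be stacked in the same order from the actual GPS, sonar, magnetometer, and tracked-feature readings.

```python
import numpy as np

def h_augmented(x_outputs, predicted_pixels):
    """Vision-augmented output vector of Eq. (27):
    y = (v_G, r_OG, h, m_B, ..., p_i, p'_i, ...) over the current subset of best inliers."""
    v_G, r_G, h_alt, m_B = x_outputs
    parts = [v_G, r_G, np.atleast_1d(h_alt), m_B]
    for p_left, p_right in predicted_pixels:
        parts.extend([p_left, p_right])       # one left/right pixel pair per inlier feature
    return np.concatenate(parts)
```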

5. Loosely-coupled vision-aided inertial navigation

In contrast to the TC-SVA-INS method, one may estimate the motion parameters directly using a separate SfM block and subsequently fuse its output with the INS, obtaining a loosely-coupled stereo vision-aided inertial navigation system (LC-SVA-INS). Although less computational effort is required by the loosely-coupled approach, the preprocessing of visual measurements results in the approximation of statistical distributions,[35] in turn leading to a loss of information and accuracy, as will be shown later on in the results section. Nonetheless, this method is considered here for comparison purposes, and it is therefore briefly reviewed next.

The camera motion parameters are computed at the end of the outlier detection procedure by refining the egomotion estimation based on the best subset of inliers (the same used in TC-SVA-INS). These estimated parameters are then fed to the Kalman filter as visual information.

The estimator is based on the state-space model expressed by Equations (13). The relative camera motion between two consecutive frames, noted, respectively, Δr_c for the displacement and Δθ_c for the rotation, can be expressed in terms of the states x as

\Delta r_c = \Delta t\, C^T \left( R^T v_B^E + \omega^B \times c^B \right),   (30a)
\Delta\theta_c = \Delta t\, C^T \omega^B.   (30b)

These equations represent a vision-based observation model that can be appended to the GPS observation model (5), defining h(·) in Equation (13b). Accordingly, the vector of outputs y is defined as

y = \left( v_G^{E\,T}, r_{OG}^{E\,T}, \ldots, \Delta r_c^T, \Delta\theta_c^T \right)^T.   (31)

The egomotion estimation yields measurements of the relative motion of the camera affected by noise, i.e.

\Delta r_{ego} = \Delta r_c + n_{r_{ego}},   (32a)
\Delta\theta_{ego} = \Delta\theta_c + n_{\theta_{ego}}.   (32b)

The definition of model (13) is complemented by the vector of measurements z and the associated noise vector μ,

z := \left( v_{gps}^T, r_{gps}^T, \ldots, \Delta r_{ego}^T, \Delta\theta_{ego}^T \right)^T,   (33a)
\mu := \left( n_{v_{gps}}^T, n_{r_{gps}}^T, \ldots, n_{r_{ego}}^T, n_{\theta_{ego}}^T \right)^T.   (33b)
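For comparison with the tightly-coupled models, the loosely-coupled observation model of Eqs. (30a)–(30b) reduces to the short sketch below (names assumed), which predicts the relative camera displacement and rotation from the filter states so that they can be compared with the egomotion measurements of Eq. (32).

```python
import numpy as np

def h_relative_motion(v_E, w_B, R, C, c_B, dt):
    """Loosely-coupled vision observation model, Eqs. (30a)-(30b): predicted relative
    camera displacement and rotation between consecutive frames, from the filter states."""
    dr_c = dt * C.T @ (R.T @ v_E + np.cross(w_B, c_B))   # Eq. (30a)
    dtheta_c = dt * C.T @ w_B                            # Eq. (30b)
    return dr_c, dtheta_c
```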

6. Experiments and results

6.1. Simulation test

A Matlab/Simulink simulator was developed that includes a flight mechanics model of a small RUAV, models of the inertial navigation sensors, magnetometer, GPS, and their noise models. The simulator is used in conjunction with the OGRE graphics engine, for rendering a virtual environment scene and simulating the image acquisition process. All sensor measurements are simulated (see Table 1) as the helicopter flies at an altitude of 2 m following a rectangular path at a constant speed of 2 m/s within a small village, composed of houses and several other objects with realistic textures (see Figure 4). Navigation measurements are provided at a rate of 100 Hz, and stereo images at a rate of 2 Hz. The GPS is available at a rate of 1 Hz. To show the capability of the system in ensuring an accurate estimation of the vehicle states even without GPS, four temporary GPS signal losses are simulated to occur at t = 15, 40, 65, 90 s, with the signal regained after 10 s at times t = 25, 50, 75, 100 s.

Table 1. Sensors and vibration noise levels.

Sensors          Noise level
Gyro             50 (deg/s)^2
Accelerometer    0.5 (m/s^2)^2
Magnetometer     1 x 10^-4 (Gauss)^2
GPS              2 (m)^2

Figure 4. Simulated small village environment and flight trajectory.

The estimated position and velocity obtained with the TC-SVA-INS-3D method are shown in Figures 5(a) and (c), respectively, using a thick solid line, while the estimated position and velocity obtained with the TC-SVA-INS-TC method are plotted using a thin solid line. Similar plots are also provided for the errors of the position and velocity estimates in Figures 5(b) and (d), respectively. The two vertical dashed and solid lines indicate the time instants when the GPS signal is lost and reacquired, respectively. The results show a successful performance of both approaches in case of temporary loss of the GPS signal. The error of the position estimates, as shown in Figure 5(b), smoothly increases while the GPS signal is masked, but the outage does not appear to cause any appreciable degradation in the quality of the velocity and position estimates.

Figure 5. Results for the simulation experiment using two approaches: TC-SVA-INS-TC and TC-SVA-INS-3D. The vertical dashed lines and vertical solid lines indicate the time instants when the GPS signal is lost and reacquired, respectively.

6.2. Real case study

The formulation given above was derived for a general 3D motion, such as that of an aerial vehicle. However, lacking a 3D data-set, a ground robot data-set was used here to validate the concept and illustrate the basic performance of the proposed method. The example is taken from the Rawseeds project,[41] which provides several multi-sensor data-sets with their associated ground truths. Specifically, a dynamic indoor data-set was used to assess the performance of a pure vision-aided inertial system in GPS-denied conditions and with unreliable magnetometer readings. The location explored by the vehicle includes two nearby buildings, shown together with the reference path in Figure 6. The data-set is illustrative of practical realistic conditions, since the number of tracked points decreases in a very marked way when the vehicle reaches the end of corridors, because of textureless walls, and tracking is occasionally disrupted when the vehicle turns rapidly. In all subsequent examples, stereo images are processed and fused at 5 Hz, while inertial measurements are incorporated at 120 Hz.

Figure 6. Plan view of the building at the Università di Milano-Bicocca explored by the ground vehicle, together with the matched reference path.

The estimated trajectory obtained with the visual odometry method is depicted in Figure 7. The method exhibits a significant degradation of the estimation quality caused by the loss of tracking continuity.

Figure 7. Results for the visual odometry approach.[12] Thin line: ground truth; thick line: trajectory estimate; stars: points where feature tracking is disrupted.

Several views of the dynamic environment captured by the left camera are depicted in Figure 8, which also reports tracked feature points across time and detected outliers. The figure shows that the outlier detection method is capable of successfully identifying tracked feature points on moving objects.

Figure 8. Performance of the outlier rejection algorithm in a dynamic environment. Tracked feature points are colored based on their distance with respect to the camera, while detected outliers are shown in blue.

The estimated trajectory obtained with the LC-SVA-INS method is shown in Figure 9(a) using a thick solid line, while the ground truth is plotted using a thin solid line. Since the reconstruction was performed without absolute position information, trajectories are depicted after realignment with the ground truth. Figure 9(b) shows a time history plot of the velocity estimates in the x direction (thick solid line), together with the error of the velocity estimates with respect to the ground truth (thin solid line). Despite locally weak tracking at difficult points in the path, and particularly at the corners of corridors, estimates are significantly better than in the visual odometry case, because of the incorporation of inertial measurements. However, the position estimation errors are considerable in parts of the path where the quality of the egomotion estimation is low.

Figure 9. Results for the loosely-coupled stereo vision-aided INS.

Figures 10(a) and 11(a) show the trajectories obtained with the proposed tightly-coupled methods, respectively TC-SVA-INS-TC and TC-SVA-INS-3D. Similarly, Figures 10(b) and 11(d) show time history plots of the x velocity component. In both cases, path and velocity estimates are reasonably reliable and significantly better than in the visual odometry and LC-SVA-INS cases, because of the use of inertial measurements and thanks to the tightly-coupled fusion process. The position estimate errors keep growing throughout the navigation, which is not an unexpected outcome because of the lack of absolute positions in the sensor fusion process. Nevertheless, the quality of the state estimates, as well as of the velocities, is significantly improved with respect to the other methods considered in this work. Furthermore, it appears that TC-SVA-INS-3D is slightly superior to TC-SVA-INS-TC.

Figure 10. Results for the tightly-coupled stereo vision-aided inertial navigation via trifocal constraints.

Figure 11. Results for the tightly-coupled stereo vision-aided inertial navigation via 3D reconstruction.

7. Conclusions

A tightly-coupled vision-aided INS was proposed in this work, which synergistically incorporates vision with other sensors. The approach reduces the loss of information by processing all effective visual data directly within the fusion algorithm. Two alternative feature-based observation models were described and included in the proposed fusion architecture; one uses the geometric constraints encapsulated by the trifocal tensor, while the other uses rigid body motion applied to 3D reconstructed feature points. The formulation was derived for an arbitrary 3D motion, making the algorithm platform-independent.

The proposed approach was demonstrated in the case of temporary loss of the GPS signal using a simulation of an autonomous helicopter flying in an urban environment. Furthermore, a real data-set from the Rawseeds project was used to assess the performance of the approach in an indoor 2D case, in GPS-denied conditions and without reliable magnetometer measurements. Results indicate that the proposed TC-SVA-INS methods significantly improve the quality of state estimation with respect to a classical visual odometry approach and also with respect to a loosely-coupled implementation. Of the two alternative feature-based observation models, the one using 3D reconstruction and rigid body motion appears to offer the best results in terms of accuracy.

Additional work is being conducted to further optimize the performance of the proposed method by the implementation of exact models via stochastic cloning. These efforts will also consider the mitigation of the effects on the estimator performance caused by delays due to the online processing of visual information.

If the computational burden is not of particular concern, a combination of the proposed TC-SVA-INS approach with SLAM can be considered, in order to enhance the precision of the SLAM method. This combination may be developed within a decentralized or a centralized architecture. In a decentralized form, the TC-SVA-INS method can be used in parallel with a low-frequency SLAM approach, where the information is distributed between two loops. A decentralized implementation is more suitable for real-time applications compared to a centralized one. In a centralized combination, high-level features can be included in the state vector in a similar way to the SLAM approach, while low-level features are incorporated in the same estimator via the proposed TC-SVA-INS formulation. These ideas are currently being investigated and will be reported in future publications.

Note

1. The ij entry of matrix A is denoted by a_i^j, index i being the row number and index j the column number. a^i indicates the i-th row of matrix A, and ∼a^i represents matrix A without row i.

Notes on contributors

E. Asadi received the BSc degree from Kashan University, Kashan, Iran, in 2002, and the MSc degree from Yazd University, Yazd, Iran, in 2005, both in Mechanical Engineering. Currently, he is a PhD student in Aerospace Engineering at Politecnico di Milano, Milan, Italy, where he works with the POLI-Rotorcraft research group. His current research interests include sensor fusion, Kalman filtering, vision-aided inertial navigation, and simultaneous localization and mapping (SLAM).

Carlo L. Bottasso received a PhD degree in Aerospace Engineering from the Politecnico di Milano, Italy, in 1993. Currently, he is the Chair of Wind Energy at TUM, Technische Universität München, Germany, and Professor of Flight Mechanics with the Department of Aerospace Science and Technology, Politecnico di Milano, Italy. He has held visiting positions at various institutions, including Rensselaer Polytechnic Institute, Georgia Institute of Technology, Lawrence Livermore National Laboratory, NASA Langley, and NREL, among others. His research interests and areas of expertise include the flight mechanics and aeroelasticity of rotorcraft vehicles, aeroelasticity and active control of wind turbines, and flexible multibody dynamics.

References

[1] Campolo D, Schenato L, Pi L, Deng X, Guglielmelli E. Attitude estimation of a biologically inspired robotic housefly via multimodal sensor fusion. Adv. Robotics. 2009;23:955–977.
[2] Mourikis AI, Roumeliotis SI, Burdick JW. SC-KF mobile robot localization: a stochastic cloning-Kalman filter for processing relative-state measurements. IEEE Trans. Robotics. 2007;23:717–730.
[3] Konolige K, Agrawal M, Bolles R, Cowan C, Fischler M, Gerkey B. Experimental robotics: outdoor mapping and navigation using stereo vision. Berlin: Springer; 2008. p. 179–190.
[4] Jung BS, Choi SB, Ban SW, Lee M. A biologically inspired active stereo vision system using a bottom-up saliency map model. In: Artificial Intelligence and Soft Computing – ICAISC; 2004. p. 730–735.
[5] Bonin-Font F, Ortiz A, Oliver G. Visual navigation for mobile robots: a survey. J. Intell. Rob. Syst. 2008;53:263–296.
[6] Jia S, Sheng J, Shang E, Takase K. Robot localization in indoor environments using radio frequency identification technology and stereo vision. Adv. Robotics. 2008;22:279–297.
[7] Irie K, Yoshida T, Tomono M. Outdoor localization using stereo vision under various illumination conditions. Adv. Robotics. 2012;26:327–348.
[8] Dalgleish FR, Tetlow JW, Allwood RL. Vision-based navigation of unmanned underwater vehicles: a survey. Part 2: Vision-based station-keeping and positioning. IMarEST Proc., Part B: Marine Design and Operat. 2005;8:13–19.
[9] Liu YC, Dai QH. Vision aided unmanned aerial vehicle autonomy: an overview. Proceedings of the 3rd International Congress on Image and Signal Processing; Yantai, China; 2010. p. 417–421.
[10] Moore RJD, Thurrowgood S, Bland D, Soccol D, Srinivasan MV. A bio-inspired stereo vision system for guidance of autonomous aircraft. Advances in Theory and Applications of Stereo Vision, ISBN: 978-953-307-516-7; 2011.
[11] Nister D, Naroditsky O, Bergen J. Visual odometry for ground vehicle applications. J. Field Robotics. 2006;23:3–20.
[12] Kitt B, Geiger A, Lategahn H. Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme. Proceedings of the IEEE Intelligent Vehicles Symposium; San Diego (CA), USA; 2010. p. 486–492.
[13] Iida F. Biologically inspired visual odometer for navigation of a flying robot. Robot. Auton. Syst. 2003;44:201–208.
[14] Yu YK, Wong KH, Chang MY, Or SH. Recursive camera-motion estimation with the trifocal tensor. IEEE Trans. Syst. Man Cybernet. B. 2006;36:1081–1090.
[15] Comport AI, Malis E, Rives P. Real-time quadrifocal visual odometry. Int. J. Robot. Res. 2010;29:245–266.
[16] Roumeliotis SI, Johnson AE, Montgomery JF. Augmenting inertial navigation with image-based motion estimation. Proceedings of the IEEE International Conference on Robotics and Automation; Washington, DC, USA; 2002. p. 4326–4333.
[17] Qian G, Chellappa R, Zheng Q. Robust structure from motion estimation using inertial data. Opt. Soc. Am. 2001;18:2982–2997.
[18] Veth MJ, Raquet JF, Pachter M. Stochastic constraints for efficient image correspondence search. IEEE Trans. Aero. Electron. Syst. 2006;42:973–982.
[19] Mourikis AI, Roumeliotis SI. A multi-state constraint Kalman filter for vision-aided inertial navigation. Proceedings of the IEEE International Conference on Robotics and Automation; Roma, Italy; 2007. p. 3565–3572.
[20] Bottasso CL, Leonello D. Vision-aided inertial navigation by sensor fusion for an autonomous rotorcraft vehicle. Proceedings of the AHS International Specialists Meeting on Unmanned Rotorcraft; Scottsdale (AZ), USA; 2009. p. 324–334.
[21] Corke P, Lobo J, Dias J. An introduction to inertial and visual sensing. Int. J. Robot. Res. 2007;26:519–535.
[22] Weiss S, Siegwart R. Real-time metric state estimation for modular vision-inertial systems. Proceedings of the IEEE International Conference on Robotics and Automation; Shanghai, China; 2011. p. 4531–4537.
[23] Tardif JP, George M, Laverne M, Kelly A, Stentz A. A new approach to vision-aided inertial navigation. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems; 2010. p. 4161–4168.
[24] Goulding JR. Biologically-inspired image-based sensor fusion approach to compensate gyro sensor drift in mobile robot systems that balance. Proceedings of the Multisensor Fusion and Integration for Intelligent Systems; Arizona, USA; 2010. p. 102–108.
[25] Strelow D, Singh S. Motion estimation from image and inertial measurements. Int. J. Robot. Res. 2004;23:1157–1195.
[26] Veth M, Anderson R, Webber F, Nielsen M. Tightly-coupled INS, GPS, and imaging sensors for precision geolocation. Proceedings of the Institute of Navigation National Technical Meeting; San Diego, California; 2008.
[27] Ong LL, Ridley M, Kim JH, Nettleton E, Sukkarieh S. Six DoF decentralised SLAM. Proceedings of the Australasian Conference on Robotics and Automation; 2003. p. 1016.
[28] Chen J, Pinz A. Structure and motion by fusion of inertial and vision-based tracking. Proceedings of the 28th OAGM/AAPR Conference, Digital Imaging in Media and Education; 2004. p. 55–62.
[29] Durrant-Whyte H, Bailey T. Simultaneous localisation and mapping (SLAM): Part I the essential algorithms. IEEE Robotics Autom. Mag. 2006;13:99–110.
[30] Newman P, Leonard J, Tardos JD, Neira J. Explore and return: experimental validation of real-time concurrent mapping and localization. Proceedings of the IEEE International Conference on Robotics and Automation; Washington, DC; 2002. p. 1802–1809.
[31] Julier SJ, Uhlmann JL. Simultaneous localisation and map building using split covariance intersection. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems; Maui, HI; 2001. p. 1257–1262.
[32] Veth M, Raquet J. Two-dimensional stochastic projections for tight integration of optical and inertial sensors for navigation. Proceedings of the Institute of Navigation National Technical Meeting; 2006. p. 587–596.
[33] Ebcin S, Veth M. Tightly-coupled image-aided inertial navigation using the unscented Kalman filter. Proceedings of the 20th International Technical Meeting of the Satellite Division of The Institute of Navigation; Fort Worth, TX; 2007. p. 1851–1860.
[34] Roumeliotis SI, Burdick JW. Stochastic cloning: a generalized framework for processing relative state measurements. Proceedings of the IEEE International Conference on Robotics and Automation; 2002. p. 1788–1795.
[35] Dijkstra F, Luinge H, Schon TB. Tightly coupled UWB/IMU pose estimation. Proceedings of the IEEE International Conference on Ultra-Wideband; 2009. p. 688–692.
[36] Asadi E, Bottasso CL. Tightly-coupled vision-aided inertial navigation via trifocal constraints. Proceedings of the IEEE International Conference on Robotics and Biomimetics; Guangzhou, China; 2012. p. 85–90.
[37] Asadi E, Bozorg M. A decentralized architecture for simultaneous localization and mapping. IEEE/ASME Trans. Mech. 2009;14:64–71.
[38] Weiss S, Achtelik M, Lynen S, Chli M, Siegwart R. Real-time onboard visual-inertial state estimation and self-calibration of MAVs in unknown environments. Proceedings of the IEEE International Conference on Robotics and Automation; St. Paul, Minnesota, USA; 2012. p. 957–964.
[39] Konolige K. Large-scale map-making. Proceedings of the AAAI National Conference on Artificial Intelligence; San Jose, CA; 2004. p. 457–463.
[40] Hoffman BD, Baumgartner ET, Huntsberger TL, Schenker PS. Improved state estimation in challenging terrain. Auton. Robot. 1999;6:113–130.
[41] Bonarini A, Burgard W, Fontana G, Matteucci M, Sorrenti DG, Tardos JD. RAWSEEDS: robotics advancement through web-publishing of sensorial and elaborated extensive data sets. Proceedings of the IROS'06 Workshop on Benchmarks in Robotics Research; 2006.
[42] Willner D, Chang CB, Dunn KP. Kalman filter algorithms for a multi-sensor system. Proceedings of the IEEE Conference on Decision and Control; 1976. p. 570–574.
[43] Geiger A, Roser M, Urtasun R. Efficient large-scale stereo matching. Lecture Notes in Computer Science. Springer; 2011. p. 25–38.
[44] Geiger A, Ziegler J, Stiller C. StereoScan: dense 3D reconstruction in real-time. Proceedings of the IEEE Intelligent Vehicles Symposium; Baden-Baden, Germany; 2011. p. 963–968.
[45] Hartley R, Zisserman A. Multiple view geometry in computer vision. Cambridge University Press; 2004.
[46] Van Der Merwe R, Wan E, Julier SJ. Sigma-Point Kalman filters for nonlinear estimation and sensor fusion: applications to integrated navigation. Proceedings of the AIAA GNC Conference and Exhibition; Providence, RI; 2004. p. 1735–1764.
[47] Xiong K, Wei CL, Liu LD. Robust unscented Kalman filtering for nonlinear uncertain systems. Asian J. Control. 2010;12:426–433.