Feature Tracking Evaluation for Pose Estimation in Underwater Environments

Florian Shkurti, Ioannis Rekleitis, and Gregory Dudek
School of Computer Science

McGill University, Montreal, QC, Canada
{florian, yiannis, dudek}@cim.mcgill.ca

Abstract—In this paper we present the computer vision component of a 6DOF pose estimation algorithm to be used by an underwater robot. Our goal is to evaluate which feature trackers enable us to accurately estimate the 3D positions of features, as quickly as possible. To this end, we perform an evaluation of available detectors, descriptors, and matching schemes, over different underwater datasets. We are interested in identifying combinations in this search space that are suitable for use in structure from motion algorithms, and more generally, vision-aided localization algorithms that use a monocular camera. Our evaluation includes frame-by-frame statistics of desired attributes, as well as measures of robustness expressed as the length of tracked features. We compare the fit of each combination based on the following attributes: number of extracted keypoints per frame, length of feature tracks, average tracking time per frame, number of false positive matches between frames. Several datasets were used, collected in different underwater locations and under different lighting and visibility conditions.

Keywords-Underwater feature tracking; Sensor fusion; Mapping; State Estimation; Performance evaluation techniques;

I. INTRODUCTION

The task of localization is one of the fundamental problems in mobile robotics. It is particularly important in any scenario where robots are required to exhibit autonomous behavior, since many navigation and mapping algorithms require, or benefit from, the existence of some prior belief about the robot's pose. In the underwater domain the problem of localization is particularly challenging, as it is a GPS-denied, highly unstructured natural environment, without man-made landmarks. This paper presents an overview of a six degree of freedom (6DOF) state estimation algorithm for an underwater vehicle, and in particular, it presents an analysis of the computer vision components that it relies on. The described algorithm combines measurements from an Inertial Measurement Unit (IMU) sensor and a camera in order to accurately track the pose of the vehicle. An essential part of this algorithm is feature tracking using a monocular camera, and estimation of 3D feature positions with respect to the camera frame. Our goal in this paper is to evaluate which feature detectors, descriptors, and matching strategies are most suitable for integration with the rest of the state estimation algorithm. While our focus is directed towards this particular algorithm, our evaluation is sufficiently informative for other closely related problems, such as visual odometry and structure from motion.

Figure 1. The Aqua vehicle starting a dataset collection run

The target vehicle is a robot belonging to the Aqua family [1] of amphibious robots; see Fig. 1. It is a six-legged vehicle capable of reaching depths of 35 m in remotely-controlled or autonomous operation. It is equipped with three IEEE-1394 IIDC cameras and an IMU. One of the primary uses of this vehicle is to conduct visual mapping over coral reefs. The state estimation algorithm, first proposed by Mourikis and Roumeliotis [2], uses visual input from a downward-facing camera to correct the errors resulting from the double integration of the IMU signal.

II. RELATED WORK

Corke et al. [3] presented experimental results of an underwater robot performing stereo visual odometry using the Harris feature detector and the normalized cross-correlation similarity measure (ZNCC). They reported that, after outlier rejection, 10 to 50 features were tracked from frame to frame, and those were sufficient for the stereo reconstruction of the robot's 3D trajectory, which was a 30 m by 30 m square.

Barfoot [4] described a 6DOF pose estimation technique, adapted from FastSLAM 2.0 [5], whose input was the odometry and stereo images of a terrestrial mobile robot. SIFT features were used as observations in the filter and were matched via best-bin-first search in a kd-tree. The author demonstrated through field experiments that the proposed approach led to 0.5% to 4% localization errors over distances of 40-120 meters.


Newcombe and Davison [6], as well as Klein and Murray [7], have presented online, accurate 3D reconstruction of small environments using a monocular camera. Their tracking system uses the FAST detector and performs search along epipolar lines in order to find matching pixel neighborhoods.

The landscape of performance evaluations of feature detectors and descriptors is certainly rich, as there has been extensive related work in trying to experimentally assess the degree of invariance of said detectors to affine transformations. For example, Mikolajczyk and Schmid [8] compared the following detectors: Harris corners, Harris-Laplace, Hessian-Laplace, Harris-Affine and Hessian-Affine, based on their repeatability, in other words, their ability to consistently detect the same keypoints in two different images of the same scene. As for feature descriptors, their evaluation included SIFT [9], Gradient Location and Orientation Histogram (GLOH), cross-correlation of neighborhood pixel values, PCA-SIFT, moment invariants, shape context, steerable filters, spin images, and differential invariants. Some of these descriptors are low dimensional while others are high dimensional. The evaluation criterion they used in their experimental setup was precision and recall of feature matches, where a 'true' match was defined as having small vector distance between the corresponding descriptors. Their dataset included scenes that had different lighting and geometric transformations; however, the images were not successive video frames of the same scene, and they did not capture underwater environments. The main conclusion of the above work was that the high dimensional descriptors, like GLOH and SIFT, were robust and most distinctive, regardless of the detector used.

Closely related to our work is the quantitative comparison of feature extractors for visual SLAM done by Klippenstein and Zhang [10]. Their experimental methodology included comparison of precision and recall curves of the Harris corner detector, the Kanade-Lucas-Tomasi (KLT) tracker [11], and the SIFT detector. Their conclusion was that all detectors could be tuned to perform well for visual SLAM, but in particular, the KLT tracker was better suited for image sequences with small baseline, compared to SIFT.

III. 6DOF STATE ESTIMATION OF AN AUTONOMOUS UNDERWATER VEHICLE

The localization algorithm [2], [12] used is based on the extended Kalman filter (EKF) formulation. It integrates IMU measurements [13], thus providing direct (but noisy) estimates of vehicle dynamics. Furthermore, by performing feature correspondence it estimates motion parameters, and obtains constraints, from the camera data, thereby correcting the drifting IMU integration estimates; see Fig. 2 for a schematic of the algorithm. The innovation of this algorithm is that the state contains the latest pose of the IMU and a list of m camera poses, instead of the vehicle pose and the detected feature positions. As such, this approach does not suffer from the dimensionality explosion common to traditional SLAM algorithms. More formally, the state vector that we want to estimate is the following:

$$x = \begin{bmatrix} q^I_G & b_g & v^G_I & b_a & p^G_I & q^{C_1}_G & p^G_{C_1} & \cdots & q^{C_m}_G & p^G_{C_m} \end{bmatrix} \qquad (1)$$

where $q^I_G$ is the quaternion that expresses the rotation from a global frame of reference to the IMU frame, which is attached on the robot¹, $v^G_I$ is the velocity, and $p^G_I$ is the position of the IMU frame in coordinates of the global frame. $b_g$ and $b_a$ are the bias vectors of the gyroscope and the accelerometer, respectively. The pairs $\{q^{C_i}_G, p^G_{C_i}\}$ represent the transformations between the global frame and the i-th camera frame. The number m of such transformations embedded in the state vector is bounded above for practical purposes, and because features are expected to be tracked in a relatively small number of images.

The main idea behind fusing information from these two sources is that, provided we are able to track features in the world and correctly estimate their 3D position with respect to the camera, we can sufficiently correct the drifting IMU pose estimates, which are obtained by integration. More specifically, the above vector will be modified in three phases: (i) The propagation phase, where IMU measurements are integrated so as to predict the IMU pose. (ii) The augmentation phase, which occurs when an image is recorded. The transformation from the global frame to the camera frame is appended to the right of the state vector. (iii) The update phase, which occurs whenever a feature track terminates. Then we can estimate its 3D position and correct the state vector, specifically the IMU integration drift, based on the constraints set by the motion of the feature.
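As a rough illustration of how these three phases interleave, the following sketch shows only the control flow; all names are hypothetical placeholders and the EKF algebra itself is omitted, so this is not the authors' implementation.

```python
# Minimal control-flow sketch of the propagate / augment / update phases.
# Hypothetical placeholder code: the actual EKF covariance and update
# equations of [2] are omitted.
import numpy as np

class MsckfState:
    def __init__(self, max_camera_poses=30):
        # IMU-related part of Eq. (1)
        self.q_IG = np.array([0.0, 0.0, 0.0, 1.0])  # rotation global -> IMU
        self.b_g = np.zeros(3)                       # gyroscope bias
        self.v_GI = np.zeros(3)                      # IMU velocity in global frame
        self.b_a = np.zeros(3)                       # accelerometer bias
        self.p_GI = np.zeros(3)                      # IMU position in global frame
        # Sliding window of camera poses {q_G^Ci, p_Ci^G}
        self.camera_poses = []
        self.max_camera_poses = max_camera_poses     # upper bound m

    def propagate(self, gyro, accel, dt):
        """(i) Propagation: integrate an IMU reading (integration details omitted)."""
        self.p_GI = self.p_GI + self.v_GI * dt       # placeholder for full integration

    def augment(self, camera_pose):
        """(ii) Augmentation: append the current camera pose when an image arrives."""
        self.camera_poses.append(camera_pose)
        if len(self.camera_poses) > self.max_camera_poses:
            self.camera_poses.pop(0)

    def update(self, feature_track):
        """(iii) Update: triangulate a terminated track (Eq. 8) and correct the drift."""
        pass                                          # EKF update omitted in this sketch
```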

In this paper we focus our attention on (iii), and in particular, on the vision component of the system. Our goal is to evaluate which detectors, descriptors, and matching strategies enable us to estimate the 3D positions of features, as quickly and accurately as possible, using a monocular camera. We aim to use this estimator in underwater environments, which are generally more challenging than most of their indoor counterparts for the following reasons:

(a) They are prone to rapid changes in lighting conditions. The canonical example that illustrates this is the presence of caustic patterns, which are due to refraction, and the non-uniform reflection and penetration of light when it transitions from air to the rippling surface of the water.

(b) They often have limited visibility. This is due to many factors, some of which include light scattering from suspended plankton and other matter, which cause blurring and “snow effects”. Another reason involves the incident angle at which light rays hit the surface of the water. Smaller angles lead to less visibility.

(c) They impose loss of contrast and colour information with depth. As light travels at increasing depths, different parts of its spectrum are absorbed. Red is the first color that is seen as black, and eventually orange, yellow, green and blue follow. This absorption sequence refers to clear water, and is not necessarily true for other types.

¹Recall that quaternions provide a well-behaved parameterization of rotation. We are following the JPL formulation of quaternions; see [13], [14].

Figure 2. An overview of the 6DOF state estimation algorithm. The state vector appears in Eq. (1). The green-colored components are examined in this paper.

A. Feature Detectors And Descriptors

Our evaluation compares the following feature detectors: SURF, Shi-Tomasi, FAST, and CenSurE, together with the SURF descriptor.

The SURF [15] detector is designed to be scale and (partially) rotation invariant. It detects keypoints at local extrema of the determinant of the image's approximate Hessian. This filter response is evaluated at different scales so as to achieve scale invariance. The scale space is divided into overlapping octaves, each of which consists of filter responses at increasing scales. Rotation invariance is due to the assignment of a dominant orientation of Haar wavelet responses around the detected keypoint. Each keypoint is characterized by a SURF descriptor, which is a vector that consists of sums of Haar wavelet responses around its oriented neighborhood. Neither the detector nor the descriptor use any color information.

The Shi-Tomasi detector [16] is invariant under affine transformations, but not under scaling. Its keypoints are image locations where the second moment matrix has two large eigenvalues, indicating image intensity change in two directions, i.e. a corner.

The FAST (Features from Accelerated Segment Test) [17] detector, on the other hand, focuses on speed of keypoint detection rather than robustness to noise and invariance properties. It uses a decision tree to classify a pixel as a keypoint if a circle of pixels around it has arcs that are much brighter or darker than the center. The authors mention that while this classifier is not very robust to noise, scaling and illumination variations, it has high repeatability and rotation invariance.

Finally, the CenSurE (Center Surround Extrema) [18] detector aims to identify keypoints at any scale, without resorting to image subsampling like SIFT, or filter upsampling like SURF. It does this by searching for extrema of the image's approximate Laplacian, using polygonal (e.g. octagonal) center-surround Haar wavelet filters. By comparison, SURF uses box-shaped filters for its descriptor, resulting in less symmetry and partial rotation invariance. Finally, among the detected extrema, only the ones with high Harris response are kept, so as to eliminate keypoints detected along edges, which might be unstable for tracking.
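For concreteness, the snippet below shows one way to instantiate these detectors with OpenCV's Python bindings (the evaluation itself used the OpenCV C++ implementations); the parameter values are illustrative rather than the ones used in the paper, and SURF and CenSurE (exposed as STAR in OpenCV) require the contrib xfeatures2d module.

```python
# Illustrative instantiation of the evaluated detectors with OpenCV (Python).
# Parameter values are placeholders, not the tuned values from the evaluation.
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# SURF: detector + 64-D descriptor (requires an opencv-contrib, non-free build)
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
surf_kp, surf_desc = surf.detectAndCompute(gray, None)

# FAST: keypoints only, no descriptor
fast = cv2.FastFeatureDetector_create(threshold=20)
fast_kp = fast.detect(gray, None)

# CenSurE is exposed as the STAR detector in opencv-contrib
censure = cv2.xfeatures2d.StarDetector_create()
censure_kp = censure.detect(gray, None)

# Shi-Tomasi corners ("good features to track")
shi_tomasi_pts = cv2.goodFeaturesToTrack(gray, maxCorners=300,
                                         qualityLevel=0.01, minDistance=7)
```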

B. Feature Matching

The feature matching component of our evaluation is based on Muja and Lowe's work [19]. More specifically, we match feature pairs based on Approximate Nearest Neighbor search among the respective SURF descriptors. We accept a match if the distance ratio of the nearest neighbor over the second nearest neighbor is below a certain threshold, and if the feature correspondence is bi-directional. For the Approximate Nearest Neighbor search we experimented with both hierarchical K-means and randomized kd-trees. For combinations that use FAST, CenSurE, or Shi-Tomasi keypoints, we use normalized cross-correlation as a measure of similarity between image patches around said keypoints.
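A minimal sketch of these two matching strategies is given below, assuming SURF descriptors are already computed; the ratio threshold, FLANN parameters, and helper names are illustrative assumptions, not the values used in the evaluation.

```python
# Sketch: ratio-test matching with FLANN randomized kd-trees (algorithm=2
# would select hierarchical k-means instead), a bi-directional check, and a
# ZNCC score for image patches. Thresholds are illustrative assumptions.
import cv2
import numpy as np

def ann_matches(desc1, desc2, ratio=0.7):
    """Ratio-test matches from desc1 to desc2 (both float32 descriptor arrays)."""
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=4), dict(checks=32))
    good = {}
    for pair in flann.knnMatch(desc1, desc2, k=2):
        if len(pair) < 2:
            continue
        m, n = pair
        if m.distance < ratio * n.distance:      # nearest vs. second nearest
            good[m.queryIdx] = m.trainIdx
    return good

def mutual_matches(desc1, desc2):
    """Keep correspondences that survive the ratio test in both directions."""
    fwd = ann_matches(desc1, desc2)
    bwd = ann_matches(desc2, desc1)
    return [(i, j) for i, j in fwd.items() if bwd.get(j) == i]

def zncc(patch1, patch2):
    """Zero-mean normalized cross-correlation between two equal-sized patches."""
    a = patch1.astype(np.float64) - patch1.mean()
    b = patch2.astype(np.float64) - patch2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```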

C. Feature Position In Camera Coordinates

Consider a single feature, $f$, that has been tracked in $n$ consecutive camera frames, $C_1, C_2, \ldots, C_n$. In the state vector mentioned previously, each camera frame $C_i$ is represented by $\{q^{C_i}_G, p^G_{C_i}\}$, where $G$ is a global reference frame, $q^{C_i}_G$ is the unit quaternion that expresses the rotation from the global frame $G$ to the camera frame $C_i$, and $p^G_{C_i}$ expresses the position of the origin of camera frame $C_i$ in global frame coordinates.

Now, let us denote $p^{C_i}_f = [X^{C_i}_f \; Y^{C_i}_f \; Z^{C_i}_f]^t$ to be the 3D position of feature $f$ expressed in camera frame $C_i$ coordinates. This is what we are interested in estimating as accurately as possible. Also, let $R(q^{C_i}_G)$ denote the rotation matrix that represents the same rotation expressed by the quaternion $q^{C_i}_G$. Then, we can write the following:

$$R(q^{C_i}_{C_1}) = R(q^{C_i}_G) R(q^G_{C_1}) \qquad (2)$$
$$p^{C_i}_{C_1} = R(q^{C_i}_G)(p^G_{C_1} - p^G_{C_i}) \qquad (3)$$

about the transformation between the first frame in which the feature was seen and the rest. So, we have:


$$p^{C_i}_f = R(q^{C_i}_{C_1}) p^{C_1}_f + p^{C_i}_{C_1} \qquad (4)$$
$$\phantom{p^{C_i}_f} = Z^{C_1}_f \left( R(q^{C_i}_{C_1}) \left[ \tfrac{X^{C_1}_f}{Z^{C_1}_f} \;\; \tfrac{Y^{C_1}_f}{Z^{C_1}_f} \;\; 1 \right]^t + \tfrac{1}{Z^{C_1}_f} \, p^{C_i}_{C_1} \right)$$

If we let $\alpha_f = \tfrac{X^{C_1}_f}{Z^{C_1}_f}$, $\beta_f = \tfrac{Y^{C_1}_f}{Z^{C_1}_f}$, and $\gamma_f = \tfrac{1}{Z^{C_1}_f}$, then

$$p^{C_i}_f = Z^{C_1}_f \left( R(q^{C_i}_{C_1}) [\alpha_f \;\; \beta_f \;\; 1]^t + \gamma_f \, p^{C_i}_{C_1} \right) \qquad (5)$$
$$\phantom{p^{C_i}_f} = Z^{C_1}_f \begin{bmatrix} h_{i1}(\alpha_f, \beta_f, \gamma_f) \\ h_{i2}(\alpha_f, \beta_f, \gamma_f) \\ h_{i3}(\alpha_f, \beta_f, \gamma_f) \end{bmatrix} \qquad (6)$$

Now, assuming a simple pinhole camera model, and disregarding the effects of the camera's calibration matrix, we can model the projection of feature $f$ on the projection plane $Z^{C_i} = 1$ as follows:

$$z^{C_i}_f = \frac{1}{h_{i3}(\alpha_f, \beta_f, \gamma_f)} \begin{bmatrix} h_{i1}(\alpha_f, \beta_f, \gamma_f) \\ h_{i2}(\alpha_f, \beta_f, \gamma_f) \end{bmatrix} + n^{C_i}_f \qquad (7)$$

where $z^{C_i}_f$ is the $2 \times 1$ measurement, and $n^{C_i}_f$ is noise associated with the process, due to miscalibration of the camera, motion blur, and other factors. If we stack the measurements from all cameras into a single $2n \times 1$ vector $z_f$ and similarly for the projection functions $h_i$ into a $2n \times 1$ vector $h_f$, then we will have expressed the problem of estimating the 3D position of a feature as a nonlinear least-squares problem with 3 unknowns:

$$\operatorname*{argmin}_{\alpha_f, \beta_f, \gamma_f} \; \| z_f - h_f(\alpha_f, \beta_f, \gamma_f) \| \qquad (8)$$

Provided we have at least 3 measurements of feature $f$, i.e. provided we track it in at least 3 frames, we use the Levenberg-Marquardt nonlinear optimization algorithm in order to get an estimate $\hat{p}^{C_1}_f$ of the true solution. One problem with this approach is that nonlinear optimization algorithms do not guarantee a global minimum, only a local one, provided they converge. Another problem is that if feature $f$ is recorded around the same pixel location in all the frames in which it appears, then the measurements we will get are going to be linearly dependent, thus providing no information about the depth of $f$. In other words, feature tracks that have small baseline have to be considered outliers, unless we exploit other local information around the feature to infer its depth. It is worth considering whether we would get better 3D feature position estimates if we treated $\gamma_f$ as the only unknown in the system, and we fixed $[\alpha_f \; \beta_f] = z^{C_1}_f$. Our analysis of empirical data suggests that this did not contribute to a significant improvement, and the reason is that Eq. (8) is the optimal solution given the noise characteristics in Eq. (7). Finally, another potential pitfall is that the transformations in Eq. (3) might themselves be noisy, which will also affect the solution of the least squares problem.
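As a sketch of how Eq. (8) can be solved in practice, the following uses SciPy's Levenberg-Marquardt implementation; the relative rotations and translations of Eqs. (2)-(3) and the normalized measurements of Eq. (7) are assumed to be given, and all function names are hypothetical.

```python
# Sketch of the nonlinear least-squares estimation of (alpha_f, beta_f, gamma_f)
# in Eq. (8) with Levenberg-Marquardt. Inputs are assumed given:
#   R_list[i]: R(q^{Ci}_{C1}) from Eq. (2),  p_list[i]: p^{Ci}_{C1} from Eq. (3),
#   z[i]: the 2x1 normalized measurement of Eq. (7) in frame Ci.
import numpy as np
from scipy.optimize import least_squares

def residuals(params, R_list, p_list, z):
    alpha, beta, gamma = params
    res = []
    for R, p, z_i in zip(R_list, p_list, z):
        h = R @ np.array([alpha, beta, 1.0]) + gamma * p   # Eq. (5) up to Z^{C1}_f
        res.extend(z_i - h[:2] / h[2])                      # reprojection error, Eq. (7)
    return np.asarray(res)

def estimate_feature_position(R_list, p_list, z):
    """Return the estimated 3D position of the feature in the first camera frame."""
    # Initialize alpha, beta from the first measurement and gamma with a nominal
    # inverse depth (illustrative choice, needs at least 3 measurements in practice).
    x0 = np.array([z[0][0], z[0][1], 0.5])
    sol = least_squares(residuals, x0, args=(R_list, p_list, z), method="lm")
    alpha, beta, gamma = sol.x
    return np.array([alpha, beta, 1.0]) / gamma             # p^{C1}_f = [X, Y, Z]
```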

D. Outlier Detection

In order to identify false positive matches between two consecutive frames we use epipolar constraints imposed by the fundamental matrix between said frames. We rely on two different methods of estimating the fundamental matrix: one is the RANSAC method [20], which works well as long as there is a significant number of inliers among the matches. The other one is via direct computation [21], which is made possible due to the presence of the IMU:

$$F = K^{-t} \left[ p^{C_i}_{C_{i+1}} \right]_\times R(q^{C_i}_{C_{i+1}}) K^{-1} \qquad (9)$$
$$R(q^{C_{i+1}}_{C_i}) = R(q^{C_{i+1}}_G) R(q^G_{C_i})$$
$$p^{C_i}_{C_{i+1}} = R(q^{C_i}_G)(p^G_{C_{i+1}} - p^G_{C_i})$$

where $[\,]_\times$ is the cross-product matrix, and $K$ is the calibration matrix of the camera. The latter estimation method is useful as a complement to the former, when the presence of outlier matches is significant. We classify a pair of keypoints, $x_1$ and $x_2$, as an outlier match if for either of the two estimates of the fundamental matrix:

$$|x_2^t F x_1| > \tau \qquad (10)$$

where $\tau$ is a threshold (in our experiments $\tau = 1.0$ pixels). We classify a feature track as an outlier if any of its constituent matching pairs are outliers.
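A sketch of this two-estimate outlier test is shown below; the RANSAC estimate comes from OpenCV, the direct estimate follows Eq. (9) from an IMU-derived relative pose, and the threshold and helper names are assumptions.

```python
# Sketch of the epipolar outlier test of Eqs. (9)-(10). pts1, pts2 are Nx2
# arrays of matched pixel coordinates; R_rel, p_rel are the IMU-derived
# relative rotation and translation; K is the camera calibration matrix.
import cv2
import numpy as np

def skew(v):
    """Cross-product (skew-symmetric) matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_imu(R_rel, p_rel, K):
    """Direct computation of F as in Eq. (9)."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(p_rel) @ R_rel @ K_inv

def epipolar_outliers(pts1, pts2, R_rel, p_rel, K, tau=1.0):
    """Flag matches with |x2^T F x1| > tau for either estimate of F (Eq. 10)."""
    F_ransac, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    F_imu = fundamental_from_imu(R_rel, p_rel, K)
    h1 = np.hstack([pts1, np.ones((len(pts1), 1))])   # homogeneous coordinates
    h2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    outlier_mask = np.zeros(len(pts1), dtype=bool)
    for F in (F_ransac, F_imu):
        err = np.abs(np.sum(h2 * (h1 @ F.T), axis=1))  # |x2^T F x1| per match
        outlier_mask |= err > tau
    return outlier_mask
```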

IV. EXPERIMENTAL RESULTS

Figure 3. The Aqua vehicle over the marked area.

Our experimental comparison included the following combinations of detectors, descriptors, and matching strategies: SURF keypoints annotated with the SURF descriptor and matched using Approximate Nearest Neighbours, SURF keypoints matched using ZNCC, FAST keypoints matched using ZNCC, CenSurE keypoints matched using ZNCC, and Shi-Tomasi keypoints matched using ZNCC. For each of the experiments conducted we attempted to tune the parameters of individual combinations in such a way that each of them performs as well as possible. The outlier detection parameters were fixed for all experiments, so that all combinations are treated in the same way.

A. Underwater Camera And IMU Datasets

We rely on 3 different datasets of synchronized camera and IMU input in order to conduct our evaluation. The datasets represent distinct underwater environments that are relatively feature rich, and have been recorded under varying illumination conditions. All datasets were recorded at approximately 30 feet depth. The frame rate of the downward-looking camera was set at 15 Hz and each image is (after undistortion and cropping) 870 × 520 pixels. The camera is a PointGrey Dragonfly Hi-Color with a resolution of 1024 × 768 pixels. The IMU sensor we used was a MicroStrain 3DM-GX1, and its sampling rate was set at 50 Hz. Both sensors are on board one of the Aqua family of amphibious robots [1]. More specifically:

Figure 4. Sample images from each dataset

Dataset1 features a straight 30 meter-long trajectory, where the robot moves at approximately 0.2 meters/sec forward, while preserving its depth. The sea bottom is mostly flat, and the robot moves about 2 meters over it. A white 30 meter-long tape has been placed on the bottom, both to provide ground truth for distance travelled, and to facilitate the straight line trajectory. Its 2600 greyscale camera frames were collected in approximately 3 minutes.

Dataset2 is an L-shaped trajectory, where the robot goes straight for about 7 meters, turns left, and then continues straight for about 3 meters, while preserving depth. Again, the sea bottom is mostly flat, and the robot moves about 2 meters over it, at speed 0.3 meters/sec. The difference compared to Dataset1 is that in this case we have placed ARToolKit [22] tags on the sea bottom, in order to be able to get estimates of relative rotations and translations between the robot and the tags, which will be useful as ground truth when we try to estimate the 3D positions of the features tracked. Its 600 greyscale images were collected in approximately 45 seconds.

Figure 5. Distribution of feature track lengths for two representative combinations. The figure on the left shows the length distribution for the SURF detector, and Approximate Nearest Neighbor matching. The one on the right shows it for the FAST detector, matched by the normalized cross-correlation score.

Dataset3 features a remotely controlled trajectory over a coral reef, where the robot moves for 20 seconds at a speed of 0.2 m/s. It is the most feature-rich and blur-free among the datasets we have used; it consists of 300 images and it differs from the previous two datasets in that the robot changes depth.

B. Feature Track Lengths

The distribution of feature track lengths is shown in Fig. 5, and it is a decaying curve for all combinations. The SURF detector seems to track most features for an average of 5 camera frames, and it is virtually unable to do it for more than 25 frames, using Approximate Nearest Neighbors as the matching method. From the datasets we observed that, given the almost constant forward velocity of the robot, each feature is visible in the camera's field of view for about 25 frames. This means that the combination mentioned above can normally track approximately 1/5th of the real trajectory of a feature. Fig. 6 shows that this tracking is very accurate.

C. False Positives

Figure 6 demonstrates the difference in accuracy between the combinations involving SURF, and all the rest. Matching FAST keypoints with the normalized cross-correlation similarity score results in as many as 9% false matches, while the same quantity appears to be 5% for SURF. This difference is pronounced when the image has motion blur, since the normalized cross-correlation similarity measure does not take into account image intensity gradients of any sort, so one should apply very aggressive outlier detection schemes when using it.


Figure 6. Average ratio of false positive matches / all matches. High value implies matching that is prone to outliers. SURF feature matching, and Shi-Tomasi features matched with ZNCC, appear to be overall the most robust. All detectors did well on datasets 2 and 3, which were not blurry. FAST and CenSurE were more prone to outliers on Dataset2, which had the highest motion blur.

D. 3D Feature Position Estimation

Offline processing of the ARToolKit tags that we placed underwater on the bottom of the experiment area showed that the robot was swimming approximately 2.5 meters over the bottom, mostly parallel to it, so we will treat that as ground truth for the depth of each feature. After removing outlier paths using the epipolar constraints implied by Eq. (10), we rejected 3D feature position estimates that were outside the field of view of the robot, or that were less than 0.5 meters or more than 10 meters away from it. The distribution of depths from the camera frame for the remaining inliers is shown in Fig. 7, which suggests that the majority of inliers had depths that fell in the [1, 4] meter range.

Figure 7. Distribution of estimated feature depths from the first camera frame in which they were seen.
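A minimal sketch of this gating step is given below; the image size comes from the dataset description above, while the camera matrix, the function name, and the exact form of the field-of-view test are assumptions for illustration.

```python
# Sketch of gating a triangulated feature position p^{C1}_f = [X, Y, Z] before
# it is counted in the depth histogram of Fig. 7. The field-of-view test here
# simply reprojects into the first frame; this is an assumed formulation.
import numpy as np

def keep_estimate(p_c1_f, K, width=870, height=520, near=0.5, far=10.0):
    X, Y, Z = p_c1_f
    if not (near <= Z <= far):                      # reject too-close / too-far estimates
        return False
    u, v, w = K @ np.array([X, Y, Z])               # pinhole reprojection
    u, v = u / w, v / w
    return 0.0 <= u < width and 0.0 <= v < height   # inside the image
```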

E. Running Time

We compared the average time spent on keypoint extraction and matching, normalized by the total number of keypoints processed. The results were obtained on an Intel Pentium Core 2 Duo at 1.6 GHz and 4 GB of RAM. The OpenCV C++ library implementations were used for each detector, descriptor and matching scheme. The results in Fig. 8(a) clearly show that SURF and the Shi-Tomasi detector spend approximately 1 millisecond on each feature, while CenSurE spends 3 milliseconds. The latter is most likely due to the fact that CenSurE computes keypoints at all scales. Fig. 8(b) depicts a very significant difference between the running times of Approximate Nearest Neighbor matching through either randomized kd-trees or K-means, and normalized cross-correlation matching. This is not surprising, as the latter has complexity O(nm), where n, m are the number of features extracted from the previous and the current image, respectively. On the other hand, kd-trees are constructed in O(n log n) and queried in O(log m). Therefore, if real-time processing of images is a requirement, limiting the number of features extracted by the detectors becomes necessary.
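The quadratic cost of exhaustive comparison versus the logarithmic query cost of a kd-tree can be illustrated with the short timing sketch below; it uses an L2 brute-force matcher as a stand-in for the patch-based ZNCC comparison (which the paper applies to image patches, not descriptors), and the sizes and reported numbers are machine-dependent assumptions.

```python
# Rough timing sketch: exhaustive O(nm) matching vs. FLANN kd-tree matching
# on synthetic 64-D descriptors (the length of a SURF descriptor). This only
# illustrates the asymptotic argument; absolute numbers are machine-dependent.
import time
import cv2
import numpy as np

n, m = 600, 600
desc1 = np.random.rand(n, 64).astype(np.float32)
desc2 = np.random.rand(m, 64).astype(np.float32)

brute = cv2.BFMatcher(cv2.NORM_L2)                  # exhaustive comparison, O(nm)
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=4), dict(checks=32))

t0 = time.perf_counter()
brute.knnMatch(desc1, desc2, k=2)
t1 = time.perf_counter()
flann.knnMatch(desc1, desc2, k=2)
t2 = time.perf_counter()

print(f"brute force: {(t1 - t0) / n * 1e3:.3f} ms per feature")
print(f"kd-tree:     {(t2 - t1) / n * 1e3:.3f} ms per feature")
```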

Figure 8. (a) Average feature extraction times per feature. CenSurE is the slowest among the detectors that we evaluated. (b) Average feature matching times per feature. The normalized cross-correlation similarity score (ZNCC) is computed by a brute force comparison of all possible feature pairs, hence it requires more computation time than Approximate Nearest Neighbors. The time spent on computing the fundamental matrix is not included in these matching times.

F. State Estimation

Figure 9 presents two illustrative examples of the state estimation algorithm applied to Dataset3, in one case using the SURF detector adjusted so that it detects approximately 500-600 features per image, and in the other case using the Shi-Tomasi detector, tuned so that it detects approximately 200-300 features. Less than half of those features were matched from frame to frame, yet the robot's trajectory was reproduced at 1 Hz camera frame rate for SURF and 3 Hz frame rate for Shi-Tomasi. The estimated trajectory and 3D structure of the coral is shown in Fig. 9.

Figure 9. (a) Estimated trajectory for Dataset3, using SURF and Approximate Nearest Neighbor matching. (b) 3D coral structure, estimated via Eq. (8) along with the trajectory. (c) Estimated trajectory for Dataset3, using Shi-Tomasi features and ZNCC matching. (d) 3D coral structure.

V. CONCLUSION

We performed an evaluation of combinations of different feature detectors, descriptors and matching strategies for the purpose of assessing their suitability for our state estimation algorithm, which we want to use in underwater environments, in full 6DOF motion. We tested these combinations based on four criteria of interest: length of feature tracks, running time, false positive matches, and overall ability to facilitate estimation of the 3D positions of features from the camera frame. We used three underwater datasets that: (a) are representative of the environments that we want to run this algorithm in, (b) have sufficient lighting and motion blur variations, and (c) provided us with some notion of ground truth. The main conclusion of our evaluation is that the most suitable combination for real-time processing would involve either SURF keypoints matched via Approximate Nearest Neighbor search, or Shi-Tomasi features matched via ZNCC. On the other hand, the FAST and CenSurE detectors were shown to be inaccurate in image sequences with high levels of motion blur.

ACKNOWLEDGMENTS

The authors would like to thank Philippe Giguere for assistance with robot navigation; Yogesh Girdhar and Chris Prahacs for their support as divers; Marinos Rekleitis and Rosemary Ferrie for their invaluable help with the experiments. The authors would also like to thank NSERC for its generous support.

REFERENCES

[1] J. Sattar, G. Dudek, O. Chiu, I. Rekleitis, P. Giguere, A. Mills, N. Plamondon, C. Prahacs, Y. Girdhar, M. Nahon, and J.-P. Lobos, "Enabling autonomous capabilities in underwater robotics," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nice, France, September 2008.

[2] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proceedings of the IEEE International Conference on Robotics and Automation, Rome, Italy, April 10-14 2007, pp. 3565–3572.

[3] P. Corke, C. Detweiler, M. Dunbabin, M. Hamilton, D. Rus, and I. Vasilescu, "Experiments with underwater robot localization and tracking," in Proc. of IEEE International Conference on Robotics and Automation (ICRA), 2007, pp. 4556–4561.

[4] T. Barfoot, "Online visual motion estimation using FastSLAM with SIFT features," in Intelligent Robots and Systems, 2005 (IROS 2005), 2005 IEEE/RSJ International Conference on, 2005, pp. 579–585.

[5] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges," in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI). Acapulco, Mexico: IJCAI, 2003.

[6] R. A. Newcombe and A. J. Davison, "Live dense reconstruction with a single moving camera," in CVPR, 2010, pp. 1498–1505.

[7] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), Nara, Japan, November 2007.

[8] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.

[9] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, pp. 91–110, November 2004.

[10] J. Klippenstein and H. Zhang, "Quantitative evaluation of feature extractors for visual SLAM," in Computer and Robot Vision, 2007 (CRV '07), Fourth Canadian Conference on, May 2007, pp. 157–164.

[11] S. Baker and I. Matthews, "Lucas-Kanade 20 years on: A unifying framework: Part 1," Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-02-16, July 2002.

[12] A. I. Mourikis and S. I. Roumeliotis, "A dual-layer estimator architecture for long-term localization," in Proceedings of the Workshop on Visual Localization for Mobile Platforms, Anchorage, AK, June 2008, pp. 1–8.

[13] N. Trawny and S. I. Roumeliotis, "Indirect Kalman filter for 3D attitude estimation," University of Minnesota, Dept. of Comp. Sci. and Eng., Tech. Rep. 2005-002, March 2005.

[14] W. G. Breckenridge, "Quaternions proposed standard conventions," JPL, Tech. Rep. Interoffice Memorandum IOM 343-79-1199, 1999.

[15] H. Bay, T. Tuytelaars, and L. V. Gool, "SURF: Speeded up robust features," in ECCV, 2006, pp. 404–417.

[16] J. Shi and C. Tomasi, "Good features to track," in 1994 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'94), 1994, pp. 593–600.

[17] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in European Conference on Computer Vision, vol. 1, May 2006, pp. 430–443.

[18] M. Agrawal, K. Konolige, and M. R. Blas, "CenSurE: Center surround extremas for realtime feature detection and matching," in ECCV (4), 2008, pp. 102–115.

[19] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in International Conference on Computer Vision Theory and Applications (VISSAPP'09). INSTICC Press, 2009, pp. 331–340.

[20] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Comm. of the ACM, vol. 24, pp. 381–395, 1981.

[21] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, ISBN: 0521540518, 2004, pp. 245–246.

[22] [Online]. Available: http://www.hitl.washington.edu/artoolkit/