
8th November 2013

MonoSLAM using SURF Experimentation Project

Student Rik Bosma

3864669

Supervisor Robby Tan


MonoSLAM using SURF November 2013

Rik Bosma University of Utrecht Page 1 of 28

Abstract

Augmented reality systems require positional and orientation information of the used

camera. Robotics studies refer to this problem as localization and multiple researchers

solved this problem using simultaneous localization and mapping methods. Since

augmented reality commonly uses cameras it is worth taking a look at visual simultaneous

localization and mapping methods to acquire more knowledge about localization and

computer vision.

The method considered during the experimentation project is called MonoSLAM. The distinctive part of this method is that a single camera is used to locate recognizable points in 3D. This is achieved by estimating the depth of a feature over multiple frames. However, this process of feature tracking and depth estimation fails after some time, because the feature detection and matching are not robust enough.

To deal with this problem, speeded-up robust features (SURF) are implemented in an attempt to achieve a more robust feature detector and matcher, which results in a more robust localization. The conclusions are positive because the feature detection and matching are more robust, but mismatches might occur when lots of features are considered for matching.





Contents

Chapter 1: Introduction ........................................................................ 4

Chapter 2: Related work ....................................................................... 5

Chapter 3: Theory .............................................................................. 6

MonoSLAM ..................................................................................... 6

Limits ........................................................................................ 11

Hypotheses .................................................................................. 12

SURF.......................................................................................... 13

Chapter 4: Experimental results and evaluation ......................................... 15

MonoSLAM using SURF Implementation ................................................. 15

Observations ................................................................................ 16

Performance ................................................................................ 17

Experiments ................................................................................. 20

Results ....................................................................................... 21

Evaluation ................................................................................... 23

Chapter 5: Conclusion ........................................................................ 24

Appendix ....................................................................................... 27

References ..................................................................................... 28


Chapter 1: Introduction

In the field of computer vision one of the trends is mixed reality. Mixed reality is about merging the real world and a virtual world: either handling real-world user input in virtual worlds (augmented virtuality), like the PS Move and Kinect, or merging a virtual environment into the real world (augmented reality), like Layar applications. The ideal situation for augmented reality is that a user is able to wander around freely in an augmented environment. To do this an absolute positioning system with six degrees of freedom is required. If such a positioning system worked accurately, the possibilities would be limited only by computing power and human creativity, because we would be able to create our own reality within the real world. Imagine the possibilities!

The main goal of the project was to acquire more knowledge of computer vision with a focus on visual simultaneous localization and mapping (SLAM). The paper and source code of MonoSLAM (Davison, Reid, Molton, & Stasse, 2007) have been used as a basis to analyze this specific method. During the project the Good Features to Track (GFTT) features have been replaced by SURF features to analyze whether those features perform better in detection and matching.

If this method were improved to handle larger environments, AR applications using a single camera could be realized. This could contribute a lot to mobile applications like games and navigation apps.

First the details of the MonoSLAM algorithm are analyzed, especially the part about the depth retrieval. Once this was done, the major weaknesses of the algorithm have been identified. To improve the algorithm a more advanced feature detection method has been implemented.

The experiments focus on the SURF features in order to test the hypotheses. First of all, the robustness in terms of repeatability of the features is examined. In addition, the robustness of the feature matching during axial translations is analyzed.

In the next chapter related work is considered. Chapter 3 covers the theory behind the MonoSLAM algorithm and the SURF features; the implementation of the SURF features in the algorithm is explained there as well. In chapter 4 the experiments are described and the results are shown. The limits of MonoSLAM and the conclusions of this report are discussed in chapter 5.


Chapter 2: Related work

For several decades navigation has been a central problem within the field of robotics. A number of solutions have been developed over the years; one of the directions a lot of researchers are looking into is simultaneous localization and mapping (SLAM), as proposed by (Leonard & Durrant-Whyte, 1991).

The idea is that measured points are used to build a map of the environment, and that those points and their changes relative to the robot are used to approximate the robot's position.

MonoSLAM focuses on visual SLAM, where the measured points are provided by an algorithm that uses images as a sensor to determine points to measure. The points in an image are selected using a feature detector as described by (Shi & Tomasi, 1994), with additional prediction algorithms to predict where a feature should be based on an approximated velocity (Davison, 2003).

However, the feature detector used in the MonoSLAM algorithm is relatively old, and several other feature detectors might outperform it. One of them, the speeded-up robust features (SURF) detector (Bay, Tuytelaars, & van Gool, 2006), is analyzed in this report.


Chapter 3: Theory

The MonoSLAM algorithm and the SURF features are discussed in this chapter. MonoSLAM is the main reference used during the experimentation project and will be explained in full detail. To deal with some of its limitations, SURF features are implemented; these are briefly discussed here as well.

MonoSLAM

In this paragraph the MonoSLAM algorithm and each of its steps are discussed. Each feature's depth is estimated; therefore there are two types of states for features: partially initialized features, whose depth has not yet been properly estimated, and fully initialized features, which do have a properly estimated depth.

Algorithm

The algorithm's main part is an endless loop in which an image is retrieved from a webcam, processed by the MonoSLAM sub-algorithm and finally rendered. The focus here is on the MonoSLAM sub-algorithm as shown in #REF#.

When an image frame is retrieved from the webcam, a MonoSLAM step is performed. First the prediction step is executed to predict the next state of the camera, using a constant velocity, constant angular velocity motion model. The camera represents our position in the world.

Then, based on the prediction just made, a number of features is selected for the measurements of the SLAM part of the algorithm. Each fully initialized feature is tested on whether it should be visible; if so, the feature is selected, until a specified number of features has been selected.

For the selected features the image is searched within a certain search radius. If a feature is found, its position is updated. When all selected features have been processed, the map is updated.

Then the Kalman filter is updated and the camera state is estimated and normalized. After that, new features are initialized if the number of visible features is too sparse. Finally, the partially initialized features are updated.
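The per-frame loop described above can be sketched as a minimal runnable skeleton. All names below are hypothetical placeholders for the stages just described, not the names used in Davison's or Kim's code; the stubs only exist so the skeleton executes.

```python
# Hypothetical skeleton of one MonoSLAM step; stage names are placeholders.

def monoslam_step(state, frame, min_visible=10):
    state = kalman_predict(state)                 # constant-velocity motion model
    selected = select_visible_features(state)     # fully initialized features only
    matches = measure_features(selected, frame)   # search within predicted regions
    state = kalman_update(state, matches)         # refine camera pose and map
    if len(selected) < min_visible:
        state = init_new_features(state, frame)   # add features when map is sparse
    return update_partial_features(state, frame)  # refine depth estimates

# Trivial stubs so the skeleton runs; a real system replaces each of these.
def kalman_predict(s): return s
def select_visible_features(s): return s["features"]
def measure_features(f, img): return f
def kalman_update(s, m): return s
def init_new_features(s, img): s["features"].append(len(s["features"])); return s
def update_partial_features(s, img): return s

state = {"features": []}
for frame in range(3):                            # three dummy frames
    state = monoslam_step(state, frame, min_visible=1)
print(len(state["features"]))                     # → 1
```

After the first frame one feature is initialized; from then on the minimum is met and the loop only measures and updates.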


Virtual representation

The virtual representation of the real world consists of a camera model, which represents the webcam, and a map, which is represented by a state vector filled with the camera state and the features, together with a covariance matrix.

Camera

The camera object contains some attributes to be able to do calculations for projecting

and unprojecting operations. The camera is represented by a state defined as:

y_i^C = R^{CW} (y_i^W − r^W)   (1)

[Figure 1: the MonoSLAM algorithm — Kalman filter prediction step; select features to do measurement; predict locations of selected features; Kalman filter update step; if not enough features are visible, initialize new feature(s); update partially initialized features.]


Where:

y_i^W = initial camera state,

r^W = displacement w.r.t. y_i^W,

R^{CW} = camera orientation.

The initial camera state is the first camera state of an entire sequence. All camera

states are determined relative to this state.

The camera localization is based on image plane measurements. In #REF# a possible case is shown; the numbers on the red triangles represent the sequenced camera states. During both states the depth of the features is known; taking into account the displacement of the features, the new position of the camera can then be estimated. The accuracy of the estimation heavily relies on the error in depth, but the impact of this error might be reduced by using more features for these calculations. The image plane measurements are executed during the Kalman filter update, and the features used are the selected fully initialized features as provided by the second step of the algorithm.

For the localization a calibrated pinhole camera model is used, which means the camera state is the sum of all previous camera displacements, including a noise term. The mathematical definition of a measurement is: z_i = proj(y_i^C) + n_z, where proj denotes the camera projection.
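The projection in the measurement model can be illustrated with a minimal pinhole sketch. The focal length and principal point values below are illustrative, not the calibration of the camera used in the project.

```python
# A sketch of pinhole projection for the image-plane measurement
# z_i = proj(y_i^C) + n_z; intrinsics here are made-up example values.

def project(point_c, fx=320.0, fy=320.0, cx=160.0, cy=120.0):
    """Project a 3D point in camera coordinates onto the image plane."""
    x, y, z = point_c
    return (fx * x / z + cx, fy * y / z + cy)   # perspective divide by depth z

u, v = project((0.5, -0.25, 2.0))
print(u, v)   # 240.0 80.0
```

The noise term n_z would be added to (u, v) to model measurement uncertainty.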


State

The state which represents the camera (position, orientation, velocity and angular velocity) in the state vector and covariance matrix is represented by:

x̂_v = (r^W, q^{WC}, v^W, ω^C)^T   (2)

Where:

q^{WC} = R^{CW} represented as a quaternion,

v^W = velocity vector,

ω^C = angular velocity vector.

The state represents the camera state and adds a velocity and angular velocity

component to it. Those components are predicted by the motion model during the Kalman

filter prediction step.

Map

The probabilistic 3D map is represented by a vector and a covariance matrix. The state

vector consists of the estimated camera state and all fully initialized features. The

mathematical representation is:

x̂ = (x̂_v, ŷ_1, ŷ_2, …, ŷ_n)^T   (3)

ŷ_i is a 3D position vector representing a feature's position.

Feature tracking

An important part of the MonoSLAM algorithm is feature tracking. The accuracy of the localization heavily relies on the robustness of the tracking. To improve the robustness of the feature detection and matching, a motion model is designed to predict where the features “should” be. Robustness also increases when the time step gets smaller; therefore map management is required.


Motion model

The motion model is a constant velocity, constant angular velocity motion model. This means the current average velocities are considered, and accelerations are taken into account by a Gaussian noise term.

The mathematical representation of this model is as follows:

(r_new, q_new, v_new, ω_new)^T = (r + (v + n_r)Δt, q ⊗ q((ω + n_ω)Δt), v + n_r, ω + n_ω)^T   (4)

Where 𝑛𝑟 and 𝑛𝜔 are noise terms.

The main goal of the motion model is to make the feature matching more robust. However, because the model uses the previous state to estimate the next step, a cumulative error occurs which results in drift. When running an implementation of this algorithm, this noise can be noticed very quickly, resulting in a lot of unmatched features. Unmatched features mean less accurate localization, and a vicious circle is entered because this in turn results in even more unmatched features. At this point the localization is corrupt and no longer usable. This phenomenon might occur within a minute, especially when the calibration target is only partially visible.
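Equation (4) with the noise terms n_r and n_ω set to zero can be sketched as follows; the time step and velocity values are illustrative, and the quaternion helpers are a standard construction, not code from the MonoSLAM implementation.

```python
import math

# A sketch of one constant-velocity prediction step (equation (4), zero noise).

def quat_from_rotvec(w, dt):
    """Quaternion for rotating by angular velocity w over time dt."""
    wx, wy, wz = (c * dt for c in w)
    angle = math.sqrt(wx * wx + wy * wy + wz * wz)
    if angle < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)            # no rotation: identity quaternion
    s = math.sin(angle / 2) / angle
    return (math.cos(angle / 2), wx * s, wy * s, wz * s)

def quat_mul(a, b):
    """Hamilton product a ⊗ b."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def predict(r, q, v, w, dt):
    """r_new = r + v*dt, q_new = q ⊗ q(w*dt); v and w stay constant."""
    r_new = tuple(ri + vi * dt for ri, vi in zip(r, v))
    q_new = quat_mul(q, quat_from_rotvec(w, dt))
    return r_new, q_new, v, w

r, q = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)
v, w = (0.1, 0.0, 0.0), (0.0, 0.0, 0.0)
r, q, v, w = predict(r, q, v, w, dt=1 / 30)
print(r)                                        # camera advanced by v * dt along x
```

With nonzero noise terms the same step would add Gaussian samples to v and ω, which is exactly how the drift described above accumulates.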

Feature detection

The feature detector is based on the work of (Shi & Tomasi, 1994), extended with an identifier and the motion model for more robust matching. However, several adjustments have been applied.

The corner detection works exactly the same as with the GFTT detector, but instead of taking the whole image into account, a small selection is processed. The selection has a size of 80x60 pixels and is placed at a random position within the image. Within this selected area the best feature is detected. This process is repeated to detect the best n features when new features are required by the algorithm. n is a number given by the user, representing the number of features the algorithm is allowed to initialize during one frame.

When a feature is detected it is initialized: its identifier is retrieved and it becomes a partially initialized feature for as long as its depth still needs to be estimated.


Feature matching

The matching is performed by searching for the patches' identifiers within a search region. Those identifiers are image patches of 11x11 pixels. To avoid mismatches, the feature's new position within the image is estimated using the motion model. Then a small area is searched for the feature using the identifier.

The correlation of the identifier to the search region is determined to relocate the

feature. Using the relocation data of all tracked features the new location of the camera

is estimated.
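The relocation step can be sketched as a patch search over a region. The sum-of-squared-differences score below is an illustrative stand-in for the correlation measure the text describes, and the synthetic image is random; patch and region sizes are the ones mentioned above.

```python
import numpy as np

# A sketch of relocating a feature by searching for its 11x11 identifier
# patch inside a region (SSD score; a stand-in for the real correlation).

def best_match(region, patch):
    """Return (row, col) of the best patch position inside the region."""
    ph, pw = patch.shape
    rh, rw = region.shape
    best, best_pos = None, None
    for r in range(rh - ph + 1):
        for c in range(rw - pw + 1):
            window = region[r:r + ph, c:c + pw]
            score = np.sum((window - patch) ** 2)   # sum of squared differences
            if best is None or score < best:
                best, best_pos = score, (r, c)
    return best_pos

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(40, 40)).astype(float)
patch = image[12:23, 15:26].copy()                  # 11x11 identifier patch
print(best_match(image, patch))                     # (12, 15)
```

In the real algorithm the search region is not the whole image but the small ellipse predicted by the motion model, which is what keeps this exhaustive search cheap.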

Feature depth estimation

During the initialization of the feature the position and direction relative to the camera

are stored. Along the direction of the feature a set of 1D particles is distributed over a

certain range. The 1D particles represent a probability density in one dimension. By

default there are 100 particles distributed between 0.5m and 5.0m. The identifier is

retrieved from the image and stored in the feature object.

During the next few frames the algorithm tries to track the feature and updates each particle's probability. After a few frames a peak arises, and when the ratio of the standard deviation of the depth to the estimated depth drops below a certain threshold, the feature is transformed from a partially initialized feature into a fully initialized feature. The fully initialized feature's depth is still uncertain, but during measurements this uncertainty is reduced.
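The particle update can be sketched as follows. The Gaussian likelihood is an illustrative stand-in for the real image-plane matching score, and the "true" depth is a made-up value; the particle count and depth range match the defaults stated above.

```python
import math

# A sketch of 1D particle depth estimation: 100 depth hypotheses along the
# feature's ray are reweighted each frame by a (stand-in) measurement score.

N = 100
depths = [0.5 + i * (5.0 - 0.5) / (N - 1) for i in range(N)]   # 0.5 m .. 5.0 m
weights = [1.0 / N] * N                                        # uniform prior

def update(weights, true_depth, sigma=0.3):
    """Reweight particles by a stand-in Gaussian measurement likelihood."""
    w = [wi * math.exp(-(d - true_depth) ** 2 / (2 * sigma ** 2))
         for wi, d in zip(weights, depths)]
    total = sum(w)
    return [wi / total for wi in w]

for _ in range(5):                        # a few frames of observations
    weights = update(weights, true_depth=2.0)

mean = sum(w * d for w, d in zip(weights, depths))
var = sum(w * (d - mean) ** 2 for w, d in zip(weights, depths))
ratio = math.sqrt(var) / mean             # std-dev-to-depth ratio from the text
print(round(mean, 2), ratio < 0.3)        # 2.0 True
```

Once the ratio drops below the threshold, the feature would be promoted to fully initialized at the peak depth.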

Map management

An important aspect of the MonoSLAM algorithm is map management, especially because a sparse set of features is required to keep the algorithm real-time. To manage the map and decide which features can stay and which should be deleted, “natural selection” is applied to the features: when a fully initialized feature is selected to be measured but fails a number of attempts, the feature is deleted.
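This deletion rule can be sketched as follows; the failure threshold of five attempts is an illustrative assumption, not the value used in MonoSLAM.

```python
# A sketch of "natural selection" map management: a feature that fails its
# measurement too many times is deleted. MAX_FAILED is an assumed value.

MAX_FAILED = 5

class Feature:
    def __init__(self, fid):
        self.fid = fid
        self.attempts = 0
        self.successes = 0

    def record(self, matched):
        self.attempts += 1
        self.successes += matched          # True counts as 1

def prune(features, max_failed=MAX_FAILED):
    """Keep features whose failed measurement count stays below the limit."""
    return [f for f in features
            if f.attempts - f.successes < max_failed]

good, bad = Feature(0), Feature(1)
for _ in range(6):
    good.record(True)                      # always matched
    bad.record(False)                      # never matched
kept = prune([good, bad])
print([f.fid for f in kept])               # [0]
```

Pruning like this keeps the state vector small, which is what keeps the Kalman filter update affordable per frame.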

Limits

Theoretical limits of the applications could be:

• Use of monochrome images instead of color images (ignoring information)

• Good Features to Track detector (robustness in repeatability and speed)

• Depth estimation (uncertainty might be crucial for localization)

Monochrome images

The features are matched by an identifier which is an image patch of 11x11 pixels. This

identifier contains more information than a default corner identifier. However only the

pixel intensity can be taken into account while matching identifiers, while color

information is also available. (Davison, Reid, Molton, & Stasse, 2007) note color patches

could also be used, but grayscale images are used for performance.

Feature detector

As stated by (Davison, Reid, Molton, & Stasse, 2007), robustness in repeatability is essential for estimating the depth of a feature. However, the robustness of the matching algorithm is also very important, as unmatched features can ruin the localization and thus the algorithm. The whole algorithm heavily relies on the localization, even the feature matching, since the motion model uses the current state to predict the feature locations. This means that if the localization fails, the prediction fails and many more features become unmatched. Any error in the localization accumulates over time, resulting in a drift which leads to even more unmatched features. The robustness of the feature detection and matching is therefore of extreme importance.

Depth estimation

When a partially initialized feature is converted to a fully initialized feature, the uncertainty in the feature is still large. Moreover, the accuracy of the estimated depth is based on the matched feature positions, and those are projected from the camera. Projecting pixels is inaccurate, and combined with unmatched features due to the motion model's prediction, which takes depth into account, this inaccuracy can cause the same kind of failure as the feature detector does.

Hypotheses

The major limitation of the MonoSLAM algorithm shows when mismatches occur. This happens when either the depth is not estimated properly or the feature matching does not work robustly enough.

Depth estimation is based on matching the feature's identifier, so when mismatches occur the depth estimation is wrong as well. Therefore, considering another feature detector is most beneficial for the algorithm.


The hypotheses are:

1. The SURF detector is more robust in repeatability than MonoSLAM's customized GFTT detector.

2. SURF matching results in fewer unmatched features than MonoSLAM's matching algorithm.

The GFTT detector doesn't treat a whole frame at once to initialize new features. It randomly chooses a non-overlapping region and picks the “best” spot to become a feature. This means not all features can be initialized at once, and strong features might be skipped during the initialization process because the region containing the feature happens not to be selected.

The SURF detector treats a whole frame each time and describes all features above a given threshold. This results in a much more deterministic and robust detection, without features being skipped.

The SURF detector might also resolve unmatched-feature issues for the feature matching. The current matching method doesn't take advantage of the feature detector, but searches for correlations with the feature's identifier in the image. A possibly better method is to match all features of two images using their descriptors and calculate a distance from there. For matching the SURF features to test the second hypothesis, the L1-norm distance is used.

SURF

In the years after the publication of GFTT, researchers developed several new feature detectors. One of them is the SURF detector (Bay, Tuytelaars, & van Gool, 2006), which is inspired by SIFT features. The strength of this detector is that the features are invariant to scale, rotation and contrast. The detector is also very fast because of the use of an integral image and the Fast-Hessian algorithm.

Integral image

The integral image is built from the given image. If, for example, the picture is in RGB format, all three channel values are multiplied by a constant and summed to obtain an intensity; each entry of the integral image is then that intensity plus the value of the entry above and the entry to the left of the current pixel (minus the diagonal entry, which would otherwise be counted twice).


Also, sums over smaller boxes within the integral image can be calculated very efficiently; the sum of pixel intensities within a box can be calculated by:

Σ = A − B − C + D   (5)

Where A, B, C and D are the integral values at the corners bounding the box as shown in #REF#.
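Equation (5) can be verified with a small sketch of the integral image. The corner handling below follows one common convention, with A above-left of the box and D at its bottom-right; this is an assumed reading of the A/B/C/D labeling, not code from SURF.

```python
import numpy as np

# A sketch of the integral image and the four-corner box sum of equation (5).

def integral_image(img):
    """Each entry is the sum of all pixels above and to the left, inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] using four integral lookups."""
    a = ii[r0 - 1, c0 - 1] if r0 > 0 and c0 > 0 else 0   # A (above-left of box)
    b = ii[r0 - 1, c1] if r0 > 0 else 0                  # B (above-right)
    c = ii[r1, c0 - 1] if c0 > 0 else 0                  # C (below-left)
    d = ii[r1, c1]                                       # D (bottom-right)
    return a - b - c + d                                 # equation (5)

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())      # 30 30
```

The box sum costs the same few operations regardless of the box size, which is what makes the Fast-Hessian responses cheap.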

Fast-Hessian detector

The detector determines interest points using the integral image. Using the integral boxes, only 8 operations are required to compute a box value, independent of the box's size, instead of N×M operations. To calculate a response value for a potential interest point, Haar wavelets are computed using several integral box operations. Haar wavelets are calculated very often and would be a serious bottleneck in the algorithm if not computed efficiently.

[Figure 2: Integral image — a box bounded by the corner points A, B (top row) and C, D (bottom row), with origin O.]


Chapter 4: Experimental results and evaluation

In this chapter the implementation of the code, the experiments and the results are discussed, and the hypotheses are evaluated. The computer used to test the algorithm has an Intel Core i7-2760M CPU @ 2.4 GHz; since only one core is used under Ubuntu 12.04 LTS, a lot of CPU power is wasted.

MonoSLAM using SURF Implementation

The source code did not need to be built from scratch, because an open source demo implemented by Davison himself is available for download. However, Davison recommends using an updated version of his code implemented by Hanme Kim (Kim, 2013). This implementation is also available for download.

The main reason to use the updated version for the experimentation project is that it supports USB cameras instead of FireWire cameras. The older Linux libraries are updated as well. The implementation has also been tested on an Ubuntu 12.04 LTS distribution; therefore all code is implemented and tested on that same distribution, and the experiments are executed on it too, to be sure the code works as described by (Davison, Reid, Molton, & Stasse, 2007).

SURF

During the experimentation project a major mistake was made by implementing the SURF detector from scratch. Getting it to work properly took two weeks, while OpenCV has a very efficient implementation ready to use with only a few lines of code. Two lessons were learned here: first, check OpenCV when doing computer vision; second, the inner details of SURF are now fully understood.

Using OpenCV's SURF detector and descriptor, the features of each frame are extracted and provided to the MonoSLAM algorithm as a list of key points and a matrix of descriptors. For matching, OpenCV's Fast Library for Approximate Nearest Neighbors (FLANN) based matcher is used. Note that the CPU versions of the SURF detector, SURF descriptor and FLANN-based matcher are used, not the GPU versions.

Within the MonoSLAM algorithm a list of initialized SURF features is stored. When new features should be initialized, all initialized features are matched to the key points and descriptors provided by the SURF detector. The matching is done using the FLANN matcher implementation in OpenCV, which uses L1-norm distances by default and matches the descriptors of each feature. If available, the best three matches are selected, and the identifier of each patch is matched by a very simple method to avoid mismatches.
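The descriptor matching under the L1 norm can be sketched with a brute-force search; a real system would use OpenCV's FLANN matcher, but the distance criterion is the same. The 64-dimensional random vectors below stand in for real SURF descriptors.

```python
import numpy as np

# A brute-force sketch of k-nearest-neighbor descriptor matching under the
# L1 norm (the FLANN matcher's default distance, per the text).

def l1_matches(query, train, k=3):
    """For each query descriptor, return indices of the k nearest train
    descriptors by L1 distance, best first."""
    out = []
    for q in query:
        d = np.abs(train - q).sum(axis=1)    # L1 distance to every train row
        out.append(list(np.argsort(d)[:k]))
    return out

rng = np.random.default_rng(1)
train = rng.random((10, 64))                 # 10 stored feature descriptors
query = train[[4, 7]] + 0.001                # near-copies of rows 4 and 7
best = l1_matches(query, train, k=3)
print(best[0][0], best[1][0])                # 4 7
```

Keeping the best three matches, as the text describes, allows a second cheap check (the patch identifier) to reject ambiguous candidates.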

Observations

Several (re)implementations of MonoSLAM are available. The first implementation tested was a C# reimplementation ([email protected], 2008). However, during runtime strange artifacts appeared as a result of mutilated input frames. The mutilated frames are most likely caused by a bug in the code, but to be sure, the code was verified against the implementation of Hanme Kim. Multiple major differences were found; therefore the C++ reimplementation by Hanme Kim is used for further analysis of the algorithm and for running the experiments.

Image input

The input image of the C++ implementation does not get mutilated. However, the implementation is not suitable for real augmented reality applications. The image grabber runs in a separate thread, which is a good thing, but it fills a buffer and each frame in this buffer should be processed, which on the other hand is better for the localization algorithm. Each image is converted to grayscale and resized to 320x240 pixels.

Localization

The major goal of the MonoSLAM algorithm is to provide a method to do localization. However, when the demo program is run as provided, the camera jumps along the axis of its own view direction. This can easily be resolved by tweaking the MonoSLAM parameters; especially the number of features that should be selected affects the positioning a lot. The difference between a normal USB webcam (Logitech C210) and a wide-angle camera (Logitech C930e) is also noticeable, because fewer features seem to be needed.

After a few seconds of motion it is clearly visible that the motion model no longer estimates the new location correctly and fails to measure a feature that is just a little outside of its search ellipse region. Those search ellipses can be shown on screen if the corresponding option is selected in the user interface. However, the search ellipse drawing method seems to be very expensive, since the frame rate drops below 2 frames per second. A low frame rate leads to skipped frames, which the algorithm is not able to deal with, so the localization gets corrupted very easily.

There are more ways to corrupt the localization. For example, when the camera slowly turns around and the calibration target is lost, the matching algorithm seems to keep trying to match its corners on the edge of the image. It seems the motion model then tries to predict the next location, which obviously is wrong since the feature is not visible at all.

At the point where the localization gets corrupt, all features start spinning around more and more rapidly each time step. The only explanation for this is the motion model, which predicts features to be somewhere they are not, resulting in all the features being unmatched. The angular velocity is estimated incorrectly and the error magnifies itself each time step. This phenomenon probably has something to do with the depth estimation, which gets corrupted by the unmatched and wrongly predicted features.

Depth estimation

The depth estimation is essential to the localization and heavily relies on the matching capabilities of the algorithm. As stated in (Davison, Reid, Molton, & Stasse, 2007), any mismatch is critical to the localization, because the position of the feature is wrong, leaving both the depth estimation and the motion model with a significant error. It is then likely the localization becomes corrupt, as described in the paragraph above.

Experimentation scene

During several test rounds it became clear that the MonoSLAM algorithm is not suited for every arbitrary space limited to a range of 5 m. The scene should contain enough potential features and also enough deviation in depth between the visible surfaces. Both constraints seem to be very important for doing the localization properly while the camera is translating. Rotating the camera as in the kitchen video (imperialrobotvision, 2010) referenced by (Davison, Reid, Molton, & Stasse, 2007) seems almost impossible.

Performance

The experiment shown in the kitchen video (imperialrobotvision, 2010) referenced by (Davison, Reid, Molton, & Stasse, 2007) uses a small environment with a lot of variation in depth between surfaces, and many recognizable objects are placed within the scene. For the experiments done during the project a room has been redecorated to obtain similar variations in depth. A picture of the room is shown in #REF#. There are two calibration targets in the scene; this serves as a test case for both algorithms, to assess their repeatability and the effects on the localization.

Feature detection

The implementation of the MonoSLAM algorithm by Hanme Kim confirms the limitations of the feature detector and tracker. #REF# shows the feature detection and matching algorithm.

First a region is selected by randomly positioning it within the image. The region is valid if no features lie inside it. If the region contains a feature, it is randomly repositioned to another location, up to five times. If no valid region is found within five tries, no feature is initialized.
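The region-selection step above can be sketched as follows. The image and region sizes are illustrative assumptions, not values taken from the SceneLib2 code.

```python
import random

# Assumed sizes for illustration only; the actual implementation may differ.
IMAGE_W, IMAGE_H = 320, 240
REGION_W, REGION_H = 40, 40

def contains_feature(region, features):
    """True if any existing feature falls inside the candidate region."""
    x, y = region
    return any(x <= fx < x + REGION_W and y <= fy < y + REGION_H
               for fx, fy in features)

def select_region(features, tries=5, rng=random):
    """Randomly place a region; give up after `tries` attempts."""
    for _ in range(tries):
        region = (rng.randrange(IMAGE_W - REGION_W),
                  rng.randrange(IMAGE_H - REGION_H))
        if not contains_feature(region, features):
            return region
    return None  # no feature is initialized this frame
```

Note that when the image is densely covered by existing features, all five attempts fail and no new feature is initialized for that frame.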

If a valid region is selected, the algorithm checks each pixel within the region for a change in gradient with respect to the pixels to its left and above. The greatest value marks the strongest corner, and that position is selected as the key point.

When the key point is selected, an identifier in the form of an 11x11 pixel image patch is stored for matching purposes. This identifier is comparable to the descriptor of a SURF feature.
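A minimal sketch of the key-point selection and the patch identifier, assuming a simple squared-gradient saliency measure; the exact measure used in the original code may differ.

```python
import numpy as np

PATCH = 11  # the 11x11 pixel identifier described above
HALF = PATCH // 2

def pick_keypoint(gray, x0, y0, w, h):
    """Pick the pixel with the strongest gradient response in the region."""
    best, best_pos = -1.0, None
    for y in range(max(y0, 1), min(y0 + h, gray.shape[0])):
        for x in range(max(x0, 1), min(x0 + w, gray.shape[1])):
            # change in intensity with the pixels to the left and above
            gx = float(gray[y, x]) - float(gray[y, x - 1])
            gy = float(gray[y, x]) - float(gray[y - 1, x])
            score = gx * gx + gy * gy
            if score > best:
                best, best_pos = score, (x, y)
    return best_pos

def extract_identifier(gray, x, y):
    """Store an 11x11 patch around the key point as its identifier."""
    return gray[y - HALF:y + HALF + 1, x - HALF:x + HALF + 1].copy()
```

On a synthetic image with a single intensity corner, the strongest response falls on the corner pixel, and the identifier is the surrounding 11x11 patch.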

Figure 3 The experimentation room


Note that at most one feature can be retrieved per region selection. This is done to reduce the time needed to detect points within the image and keep the algorithm real-time. Consequently, the algorithm is very slow if several identifiers have to be detected at once. As stated in (Davison, Reid, Molton, & Stasse, 2007), a PC with a 1.6 GHz Pentium M processor needs 4 ms for one feature initialization search; on the PC with the 2.4 GHz CPU, one feature initialization search takes about 1 ms. The SURF detector and descriptor, applied to an input image of 320x240 pixels, perform the detection within 6 ms for ~200 features with a minimal Hessian threshold of 800. However, the MonoSLAM algorithm is too slow to handle 200 features in real time. Therefore a threshold of 8000 is used, resulting in detecting and describing ~30 features within 4 ms.

Feature matching

An image correlation search takes 3 ms for 12 features. With an image correlation search the algorithm looks for the identifier of each feature within a small search region around the location predicted by the motion model. For matching 12 known features against ~30 detected features, the FLANN based matcher takes less than 1 ms, and it also considers the entire image.

The FLANN based matcher can be extended with some methods to reduce mismatches. First of all, a radius match can be applied, which only considers two features a match if the distance between their descriptors is less than a certain threshold. Another method is to determine the ratio between the distances of the best two matches and consider the best one a valid match only when this ratio is below a certain threshold.
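Both mismatch filters can be sketched over a list of descriptor distances for one query feature. The threshold values here are illustrative assumptions, not values taken from the implementation.

```python
def radius_match(distances, max_distance=0.25):
    """Keep the best candidate only if its descriptor distance is within the radius."""
    if not distances:
        return None
    best = min(range(len(distances)), key=distances.__getitem__)
    return best if distances[best] <= max_distance else None

def ratio_match(distances, max_ratio=0.7):
    """Keep the best candidate only if it is clearly better than the runner-up."""
    if len(distances) < 2:
        return None
    order = sorted(range(len(distances)), key=distances.__getitem__)
    best, second = order[0], order[1]
    if distances[second] == 0.0:
        return best  # both distances are zero: treat as an unambiguous match
    if distances[best] / distances[second] <= max_ratio:
        return best
    return None
```

The ratio test rejects ambiguous cases where two candidates are nearly equally close, which is exactly where mismatches tend to occur.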

Figure 4 (a) Selecting a region (b) Picking the feature key point (c) Retrieving the identifier


Experiments

Experiments have been conducted to determine the performance of both the GFTT and SURF detectors and of the two matching algorithms. The goal of the experiments is explained and motivated, the results are shown, and the hypotheses are evaluated.

Goal

The goal of the experiments is to find out whether SURF features are superior to GFTT features in the MonoSLAM application. Therefore the GFTT feature detector and the SURF feature detector are compared. The customized GFTT feature matching algorithm is also compared with a FLANN based matching algorithm.

Validation

The experiments are conducted in a controlled room, shown in #REF#. There is variation in contrast and depth, and the calibration target is available at a distance of 60 cm. The initialization file for the MonoSLAM algorithm stores the corners of this calibration target together with their identifiers. Those are matched in the first frame, so the camera relocates correctly in the virtual scene.

The first experiment concerns the detectors' performance. The question here is how deterministic the detectors are. Therefore the algorithm runs for 30 frames and the positions of the features are output by the program. For the SURF detector one frame might produce the same results as 30 frames, but the GFTT detector in the MonoSLAM algorithm only detects one feature each frame.

The test sequence has been executed 10 times for each detector. The results are stored in an Excel sheet, which can be found in Appendix #REF#. The data is sorted by position so that unique features are arranged per column.

The second experiment concerns the matching performance. The question here is how well the matching algorithm performs when subjected to viewport changes, which are most common in an augmented reality application. Therefore the same experiment ran 3 times for each degree of freedom and 3 times without changing the viewport at all.

The test sequence starts as soon as 12 features are initialized. Those 12 features are matched each frame. MonoSLAM might predict that a feature is not visible and will then not try to match it. To deal with this, the number of attempts and the number of successful matches are output by the program. Based on those values, the ratio of successful and unsuccessful matching attempts can be measured.

However, counting matching attempts is a problem for the FLANN based matcher. The FLANN based matcher is not fully implemented yet, due to difficulties with the depth retrieval in the MonoSLAM code and the very limited time. This means the MonoSLAM algorithm does not predict for each feature whether it is visible; if a feature is not visible, it counts as an unsuccessful match. There are also several ways to cope with mismatches for the FLANN based matcher.

The simplest way to deal with mismatches is the radius match, as used in the experiments. Radius matching searches for the best match within a certain threshold; if none is found, an unsuccessful attempt is counted. Another common way to reduce mismatches is to calculate the ratio of the two best matches and subject it to a threshold value. The latter method eliminates weak matches and is more reliable, but time did not allow implementing it properly.

The current SURF implementation uses the detector to determine the new position of a feature and partially bypasses the motion model; the identifiers are then searched at the locations given by the SURF detector.

Results

First the data of the experiments is discussed; then the results are shown and discussed as well.

Detector data

When analyzing the data, the first thing to notice is that the GFTT algorithm picks features which already exist, resulting in doubles. Analysis of the code confirms the existence of features which are detected but only partially initialized; the code does not take partially initialized features into account when searching for a salient feature.

Another issue arises here as well, because it is not possible to know which of the doubles were detected in the test sequences. The benefit goes to the GFTT detector, as "a match" is attributed to the first detected instance of the feature.

The SURF detector also has its limits. Due to the performance of the depth initialization, the threshold at which features pass as detected is very high: the minimum Hessian threshold is set to 8000. As a result, some features are clearly only detected a few times. If the 12 features had been initialized with a threshold of 8000 and relocated with a threshold of 7500, the detector would most likely have detected each feature in each frame of this experiment. However, this is how it is currently implemented, and therefore the extremely high threshold is used.

The main difference between the data sets is that the features detected by GFTT are assembled over 30 frames, while the features of the SURF detector are detected in the 30th frame only.

Matching data

The matching data provides the number of attempts to match a feature and the number of successful attempts measured during one sequence. Each sequence might have a different length, so all measurements are normalized and expressed as the percentage of successfully matched features.

Mismatches are not considered, because the identifier of the feature can be used to exclude mismatches for the FLANN based matcher. Also, during one sequence one feature was never attempted to be tracked at all. This feature can be identified in #REF# as (B7 Run 2 F11), both in the "Analysis" and "Data – B – GFTT" tabs. This results in the undefined calculation 0/0, and thus this feature is not considered in the statistics for that specific run.
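The normalization and the handling of the undefined 0/0 case can be sketched as:

```python
def match_rate(successes, attempts):
    """Success percentage for one feature, or None for the undefined 0/0 case."""
    if attempts == 0:
        return None  # feature was never attempted: excluded, not counted as 0%
    return 100.0 * successes / attempts

def average_rate(pairs):
    """Average the per-feature rates, skipping undefined ones."""
    rates = [match_rate(s, a) for s, a in pairs]
    rates = [r for r in rates if r is not None]
    return sum(rates) / len(rates) if rates else None
```

Excluding the never-attempted feature, rather than scoring it as 0%, avoids unfairly penalizing the matcher for an attempt that was never made.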

Detector statistics

The GFTT detector shows doubles; SURF does not. #REF# and #REF# show the number of detected features for each detector in each run. The total number over all runs is 12 unique features for the GFTT detector. For the SURF detector there are 27 unique

Figure 6 (left) GFTT features detected during 30 frames; (right) SURF features detected in the 30th frame


features. More importantly, on average (considering unique features only) the GFTT detector has a detection rate of 74.2% and the SURF detector has a detection rate of 79.6%, which is significantly better even though the SURF measurement is taken over a single frame.

The average count is 8.9 features with a maximum deviation of 4.9 for the GFTT detector. For the SURF detector the average count is 21.5 with a maximum deviation of 4.5. This gives a relative deviation of 55.1% for GFTT against 20.9% for SURF.
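The deviation percentages above can be reproduced from the per-run counts (which are in the appendix); this sketch only shows how the figures are computed.

```python
def relative_deviation(counts):
    """Maximum deviation from the mean, as a percentage of the mean."""
    mean = sum(counts) / len(counts)
    max_dev = max(abs(c - mean) for c in counts)
    return 100.0 * max_dev / mean
```

For example, GFTT's maximum deviation of 4.9 against a mean of 8.9 gives 4.9 / 8.9 = 55.1%, while SURF's 4.5 against 21.5 gives 20.9%.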

… to be continued…

… e. Show the results of the quantitative evaluation

c. Discuss why the results are good or bad

Evaluation

Discuss the hypotheses regarding the algorithm supported by the experiments


Chapter 5: Conclusion

a. State what the thesis is all about
c. State the drawbacks of the algorithm
d. State and discuss your hypotheses in analyzing the algorithm/drawbacks.


Chapter 6: Future work

Future work: resolve the mismatches and fix the depth retrieval.



Appendix


References

Bay, H., Tuytelaars, T., & van Gool, L. (2006). SURF: Speeded up robust features. Computer Vision – ECCV 2006 (pp. 404-417). Springer Berlin Heidelberg.

Davison, A. J. (2003). Real-time simultaneous localisation and mapping with a single camera. Proceedings of the Ninth IEEE International Conference on Computer Vision (pp. 1403-1410, vol. 2). IEEE.

Davison, A. J., Reid, I., Molton, N., & Stasse, O. (2007). MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1052-1067.

imperialrobotvision. (2010, November 29). MonoSLAM: Real-Time Single Camera SLAM [video]. Retrieved from YouTube: http://youtu.be/mimAWVm-0qA

Kim, H. (2013). SceneLib2 – MonoSLAM open-source library. Retrieved from hanmekim.blogspot.com: http://hanmekim.blogspot.com/2012/10/scenelib2-monoslam-open-source-library.html

Leonard, J., & Durrant-Whyte, H. (1991). Simultaneous map building and localization for an autonomous mobile robot. Proceedings IROS '91: IEEE/RSJ International Workshop on Intelligent Robots and Systems (pp. 1442-1447). IEEE.

Shi, J., & Tomasi, C. (1994). Good features to track. Proceedings of the 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '94) (pp. 593-600). IEEE.