TEL AVIV UNIVERSITY
Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
Keypoint Filtering using Machine Learning
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in
Computer Science
by
Shahar Jamshy
Thesis Supervisor:
Professor Yehezkel Yeshurun
May 2009
Abstract
Keypoints are high dimensional descriptors of local features of an image or an object. Keypoint extraction is the first task in various computer vision algorithms, where the keypoints are then stored in a database that serves as the basis for comparing images or image features. Keypoints may be based on image features extracted by feature detection operators or on a dense grid of features. Both approaches produce a large number of keypoints per image, causing both time and space performance challenges as the problem is scaled up.

In this thesis I propose a novel framework for reducing the number of keypoints an application has to process by learning which keypoints are beneficial for the specific application and using this knowledge to filter out a large portion of the keypoints. I demonstrate this approach on an object recognition application that uses a keypoint database. I perform numerous experiments, trying to reduce both the size of the database and the number of queries required for each test image. I show that the number of keypoints can be significantly reduced with relatively small reduction in performance.
Table of Contents
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Background
    1.1.1 Saliency Operators
    1.1.2 Keypoint Representation
    1.1.3 Metafeatures and Feature Selection
    1.1.4 Related Work
  1.2 The Approach
  1.3 Structure
2 Experimental Setting
  2.1 SIFT Saliency Operator
    2.1.1 Scale-Space Extrema Detection
    2.1.2 Accurate Keypoint Localization
    2.1.3 Orientation Assignment
    2.1.4 Keypoint Descriptor
  2.2 Amsterdam Library of Object Images
  2.3 Application Structure
    2.3.1 Training
    2.3.2 Testing
  2.4 Application Performance
3 Experiments and Results
  3.1 Filtering Keypoints using K Nearest Neighbors
    3.1.1 Experiment
    3.1.2 Results
  3.2 Reducing Keypoint Database Size using Additional Training Data
    3.2.1 Experiment
    3.2.2 Results
  3.3 Reducing Keypoint Database Size using Leave One Out
    3.3.1 Experiment
    3.3.2 Results
4 Conclusions and Future Work
  4.1 Conclusions
  4.2 Future Work
Bibliography
List of Tables
2.1 General Statistics of Test Databases
3.1 Selected Meta-features
List of Figures
1.1 General scheme of an application that uses keypoints
1.2 The Approach in the Training Stage
1.3 The Approach in the Filtering Stage
2.1 Example operation of SIFT on an image
2.2 Example of SIFT keypoint orientation
2.3 Example of SIFT descriptor representation
2.4 Example objects from ALOI data set
2.5 Object Viewpoint Collection for Object Number 307
2.6 Example of correct matching
3.1 Experiment 1: Meta-features density, location based meta-features
3.2 Experiment 1: Meta-features density, value distribution meta-features
3.3 Experiment 1: Percent of matched images vs. filtering of the database for 500 objects
3.4 Experiment 1: Percent of matched images vs. filtering of the database when filtering 70%-90%
3.5 Experiment 1: Percent of matched images vs. filtering of the database for 200 objects
3.6 Experiment 2: Percent of matched images vs. filtering of the database for 500 objects
3.7 Experiment 2: Percent of matched images vs. filtering of the database when filtering 70%-95%
3.8 Experiment 2: Percent of matched images vs. filtering of the database for 100 and 200 objects
3.9 Experiment 2: Percent of matched descriptors relative to the random reference vs. filtering of the database
3.10 Experiment 2: Percent of matched descriptors relative to the random reference vs. filtering of the database, with databases of different size
3.11 Experiment 3: Percent of matched images vs. filtering of the database
3.12 Experiment 3: Percent of matched images vs. filtering of the database when filtering 70%-95%
3.13 Experiment 3: Percent of matched descriptors relative to the random reference vs. filtering of the database
3.14 Experiment 3: Percent of matched images vs. filtering of the database for 100 and 200 objects
3.15 Experiment 3: Percent of matched descriptors relative to the random reference vs. filtering of the database, with databases of different size
Acknowledgments
First, I would like to thank my supervisor, Hezy Yeshurun, for guiding me through this new, frustrating, wonderful, and fascinating experience in scientific research, giving me both direction when I was lost and the freedom to find my own way the rest of the time.
I would also like to thank Eyal Krupka and Ariel Tankus for their valu-
able suggestions, and for introducing me to many methods and secrets of research in computer science.
In addition, I would like to thank Eddie Aronovich and the CS System
Team for installing and supporting the Condor System [12] which made all
the computations needed for this research possible.
Last but not least, I would like to thank Tania Barski-Kopilov for the
moral support, countless coffee breaks, and for not letting me panic. Tania,
I am lucky to have you as my friend.
This thesis is dedicated to my grandparents. My mother’s parents, the late
Eti and Yechiel Veinrib, and my father’s parents, Hedva and Moshe Jamshy, who have gone through many hardships immigrating from Poland
and Iraq in order to build a new home in Israel. You have built a loving,
caring and supporting home for me to grow in. I love you very much.
Chapter 1
Introduction
1.1 Background
Computer Vision is a subfield of Artificial Intelligence, whose purpose is to
program the computer to perceive the world through vision. There are many
different problems related to this field, such as:
1. Object class recognition [5, 6] is the task of finding what objects are
present in an image, and is based on building some model of the object
class (for example: house, car, bicycle, or horse). Object class recogni-
tion remains challenging in large part due to the significant variations
exhibited by real-world images. Partial occlusions, viewpoint changes,
varying illumination, cluttered backgrounds, and intra-category ap-
pearance variations all make it necessary to develop exceedingly robust
models of the different categories.
2. Object recognition [13, 22] is the task of finding specific given ob-
jects in an image or a video sequence. In this field of computer vision
we are interested in finding specific, a priori known objects, such as a
specific person or persons. Applications in this field usually try to find
distinctive features of the specific object in order to identify it in sub-
sequent images. This field, much like object class recognition, faces the
challenges of dealing with real-world images in which the object may
appear cluttered, occluded, or poorly illuminated.
3. Object tracking [28] is the task of locating a moving object (or
several ones) in time using a camera. The main difficulty in video
tracking is to associate target locations in consecutive video frames,
especially when the objects are moving fast relative to the frame rate.
Here, video tracking systems usually employ a motion model which
describes how the image of the target might change for different possible motions of the object being tracked.
4. Matching of stereo pairs [23, 16] is the task of matching between
pairs of points projected from the same physical position onto two
different images taken from different positions, usually in order to gain
some knowledge about the three dimensional position of the original
point in the physical world.
All of these problems may benefit from the selection of salient areas in
the image as the first step of their processing. The selection of salient areas
focuses the task on similar areas in different images, thus reducing compu-
tational complexity and increasing accuracy. These salient areas are often
referred to as keypoints.
1.1.1 Saliency Operators
There is no universal or exact definition of what constitutes a salient area, and
the exact definition often depends on the problem or the type of application.
Given that, a salient area is loosely defined as an "interesting" part of an image, and salient areas are used as a starting point for many computer vision algorithms.
Saliency detection is a low-level image processing operation. That is, it is
usually performed as the first operation on an image. If this is part of a
larger algorithm, then the algorithm will typically only examine the image in
the salient areas. Therefore, the overall algorithm will often only be as good
as its saliency detector.
Various algorithms have been suggested in order to find salient areas
in the image. These algorithms use different techniques in order to detect
interest points in an image, such as:
1. Edge Density Detector [4] - This algorithm uses edge detection in
order to find interest points. It calculates the average edge density in
the picture and then selects areas where the edge density varies the
most from the average of the picture. It is motivated by biological studies showing that edge density is one of the features that human vision computes preattentively.
The algorithm works by creating an edge map, and then convolving the
edge map with a Gaussian filter in order to create an edge density map.
The mean density of the image is then calculated and areas where the
edge density differs the most from the mean are considered interesting.
2. Corner detection [9] - Corners are commonly used as salient areas.
For example, a corner based saliency algorithm may use an estimate of the intensity gradient in order to find corners in the picture, using
the fact that corners exhibit strong gradient changes in two orthogonal
directions.
For each pixel the algorithm first calculates the directional derivatives of the intensity along the x and y axes and then computes the covariance matrix of the two values over a local window. If the smallest eigenvalue of the covariance matrix is large enough (exceeds a certain threshold), this implies strong gradient changes in two orthogonal directions and therefore a corner; a minimal sketch of this detector appears after this list.
3. Local symmetry detection [25, 15] - This algorithm tries to find
areas in the picture that are centers of local symmetries. The more symmetric features are found around a point in the picture, the more interesting that point is deemed. The radial symmetry detector works on the
intensity differences in the picture.
The algorithm works by calculating for each point in the picture the
size and direction of the gradient of the intensity in that point. Then,
looking inside a small circle around each point in the picture, it tries
to find pairs of matching points in the same distance and opposite
directions from that point, which exhibit similar size and direction of
the gradient of the intensity. A complex measure is then calculated
taking into account the distance between the paired points, the size
and direction of the gradient of the intensity of both points in the pair.
The higher the sum of the measures over all pairs found inside a small circle around a point, the more interesting that point is deemed.
4. Convexity estimation [27, 26] - This algorithm finds areas of con-
vexity in the picture, using the fact that many salient areas (the hu-
man face for example) feature a three dimensional convex structure.
By managing to detect three dimensional convex structures from a two
dimensional image, this algorithm can find interest points even in noisy or strongly textured backgrounds.
The algorithm works by estimating the argument of the gradient of the
intensity. Since a convex object exhibits a continuous gradient change
there must be a discontinuous ray in the argument of the gradient of the
intensity for that object. The algorithm uses Gaussian filters in order
to estimate the argument of the gradient and find the discontinuous
rays. By rotating the original picture and repeating the calculation, a discontinuous ray is created somewhere else along the convex object. Since all the discontinuous rays start from the center of the object, repeating this process several times (four repetitions, with rotations of 0, 90, 180, and 270 degrees, were shown to be sufficient) and summing the results gives the algorithm a strong response in the middle of the
convex object.
5. Blob detection [16, 17, 18, 14] - Blobs are points or regions in the image that are either brighter or darker than their surroundings. There are two main classes of blob detectors: (i) differential methods based on derivative expressions and (ii) methods based on local extrema in the intensity landscape. In Sec. 2.1 I give a detailed description of the SIFT [14] operator, which uses the Difference of Gaussians differential method.
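To make the corner detector of item 2 concrete, the following is a minimal sketch of a minimum-eigenvalue corner detector in Python. The use of NumPy and SciPy, the smoothing scale, and the threshold are illustrative choices of mine, not details taken from [9]:

import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def corner_response(image, sigma=1.5):
    """Smallest eigenvalue of the local gradient covariance matrix."""
    ix = sobel(image, axis=1)  # directional derivative along x
    iy = sobel(image, axis=0)  # directional derivative along y
    # Entries of the 2x2 covariance matrix, averaged over a local window.
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    # Closed-form smallest eigenvalue of [[ixx, ixy], [ixy, iyy]].
    trace = ixx + iyy
    delta = np.sqrt((ixx - iyy) ** 2 + 4 * ixy ** 2)
    return 0.5 * (trace - delta)

def detect_corners(image, threshold):
    """Pixels whose smallest eigenvalue exceeds the threshold are corners."""
    response = corner_response(image.astype(float))
    ys, xs = np.nonzero(response > threshold)
    return list(zip(xs, ys))

A large smallest eigenvalue guarantees a strong gradient change in both orthogonal directions, which is exactly the corner criterion described above.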
While some of these algorithms are intended for general keypoint detection, others specialize in a specific class of interest points (human faces, for example). Nevertheless, even amongst the general purpose algorithms, each has been shown to more easily detect a different set of regions.
For example, edge based algorithms will usually locate more areas that have
strong edges and will likely miss convex objects in which the edges are usually
weak or blurred, while a convexity estimation algorithm will locate convex
objects easily but will encounter difficulties locating square objects such as
tables or doors.
These saliency algorithms are required to be invariant to changes in illumination, scale, rotation, affine transformations, perspective, and viewing angle (see [20] for a comparison). Because achieving all of these requirements is very hard (and computationally complex), in practice one usually makes do with only some of them. Modern saliency
operators achieve at least invariance to changes in illumination, scale and
rotation.
1.1.2 Keypoint Representation
Since most applications compare keypoints from different images, where the
same object may appear with different illumination, scale, orientation, or
background, keypoints must be represented in a way that will be invariant
to these differences. This representation is called a keypoint descriptor (see [14, 1, 29] and [19] for a comparison). For example, SIFT [14] describes the
keypoint using a weighted histogram of the orientation of the gradient in the
area of the keypoint. In order to compare keypoints from different objects
and images the keypoints are stored in a labeled database and are then used
as the basis for comparing, recognizing and tracking objects.
Even though there are many saliency operators intended to focus an image
processing algorithm on salient areas in the image, state of the art operators
(when used with the parameters recommended by their authors) produce hundreds of keypoints per image, which does not simplify the problem enough.
Since different operators have different strengths and weaknesses, it is com-
mon practice to use a combination of two or more operators [24], yielding
more keypoints. Furthermore, another common practice is to use a dense grid of keypoints (see [2, 3]) instead of a saliency operator, yielding an even larger number of keypoints.
When the number of images and objects grows, the keypoint database becomes very large, which causes both time and space performance problems.
In practice, a large number of the discovered keypoints are not actually helpful to the application (for example, those that belong to the background or to features common to many images and objects). Filtering the database
and removing these redundant features will reduce both time and space com-
plexity of the application.
1.1.3 Metafeatures and Feature Selection
In some of my experiments I used feature selection and feature extraction
techniques. Feature selection is the task of choosing a small subset of fea-
tures that is sufficient to predict the target labels well. The main motivations
for feature selection are computational complexity, reducing the cost of mea-
suring features, improved classification accuracy and problem understanding
(see [8] for an introduction to feature selection). Feature selection is also a
crucial component in the context of feature extraction (see [10]).
In feature extraction the original input features (for example, keypoint
descriptor vector values) are used to generate new, more complicated fea-
tures, referred to as meta-features (for example, the logical AND of subsets of the descriptor vector). Feature extraction is a very useful tool for producing
sophisticated classification rules using simple classifiers. One main problem
here is that the potential number of additional meta-features one can extract
is huge, and the learner needs to decide which of them to include in the
model.
1.1.4 Related Work
Most applications deal with the problem of large keypoint databases either by using a small scale implementation (on the order of hundreds of objects)
to demonstrate their approach [22, 6], or by reducing the complexity of the
keypoint itself. A common approach (see [5]) uses Vector Quantization and K-Means in order to reduce each keypoint to a single word in a relatively small dictionary.
Another approach, described in [14], uses a hash function to approximate
nearest-neighbor lookup. While this approach improves the time performance
of the nearest neighbor search it does not reduce the memory required for a
large database of keypoints.
Despite the large amount of literature on finding and describing keypoints, little attention has yet been given to the problem of directly reducing the number of keypoints or to working with databases that contain thousands of images.
1.2 The Approach
I introduce a framework for filtering keypoints which is suitable for many
computer vision tasks. The main notion of this framework is that an appli-
cation can rank individual keypoints based on their usefulness. I use these
ranks in order to learn the characteristics of keypoints useful to the ap-
plication. Figure 1.1 shows a general scheme of an application that uses
keypoints. First, keypoints are extracted from the image, usually by using a
saliency operator. The keypoints are then coded into descriptors, and then
some application specific processing is done.
Figure 1.1: General scheme of an application that uses keypoints
The framework works in two stages: a training stage and a filtering stage.
In the training stage the target application ranks each keypoint according to
its usefulness. The ranks and the keypoints are used in order to train a
keypoint filter, as shown in Fig. 1.2. For example, in an object recognition
application, the application can rank the keypoints according to their ability
to distinguish between the different objects. Highly distinctive keypoints will
receive high grades and less distinctive keypoints will receive lower grades.
Figure 1.2: The Approach in the Training Stage
In the filtering stage I use the rank based keypoint filter I have built in
the training stage in order to filter out less useful keypoints, which reduces the
number of keypoints the application needs to process, as shown in Fig. 1.3.
Figure 1.3: The Approach in the Filtering Stage
1.3 Structure
The rest of the thesis is structured as follows: Chapter 2 describes in detail
the experimental setting I used, the specific algorithms for keypoint extrac-
tion and representation, the target object recognition application and the
datasets used. Chapter 3 describes my experiments with filtering keypoints
from test images as well as filtering the database itself and thus reducing its
size. I conclude in Chapter 4, summarizing my work and giving some further
research directions.
Chapter 2
Experimental Setting
In order to demonstrate the approach described in Section 1.2 I created an object recognition application roughly based on Lowe's SIFT application described in [14]. In this application keypoints are matched using a nearest neighbor database of keypoint descriptors, where the ratio between the distance to the nearest neighbor and the distance to the nearest neighbor from any other class is used to assess the distinctiveness of the match. For each test image I used a majority vote over the distinctive keypoints in order to find its label.
I used the ALOI dataset [7] in order to train a database of labeled de-
scriptors. I used four training images for each object, taken at 90 degree intervals, as the training set, and another image for each object, taken at 45
degrees from one of the training images as the test set.
2.1 SIFT Saliency Operator
In this section I describe in detail the SIFT algorithm (this description is adapted from http://en.wikipedia.org/wiki/Scale-invariant_feature_transform). SIFT is an algorithm for local feature extraction and descriptor representation; the keypoints are invariant to image scale and rotation. They are also robust to changes in illumination, noise, and minor changes in viewpoint. In addition to these properties, they are highly distinctive and relatively easy to extract, allowing for correct object identification with low probability of mismatch, and are easy to match against a (large) database of local features. Object description by a set of SIFT features is also robust to partial occlusion; it has been shown that as few as 3 SIFT features from an object are enough to compute its location and pose.
The algorithm has four stages.
2.1.1 Scale-Space Extrema Detection
In the first stage keypoints are detected. In order to do this, the image is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred images are taken, producing a Difference of Gaussians (DoG) filter which reacts to blobs in the image. Keypoints are then taken as maxima/minima of the DoG that occur at multiple scales. A DoG image $D(x, y, \sigma)$ is given by

$$D(x, y, \sigma) = L(x, y, k_i\sigma) - L(x, y, k_j\sigma) \qquad (2.1)$$
where $L(x, y, k\sigma)$ is the original image $I(x, y)$ convolved with the Gaussian blur $G(x, y, k\sigma)$ at scale $k\sigma$, i.e.,

$$L(x, y, k\sigma) = G(x, y, k\sigma) * I(x, y) \qquad (2.2)$$
This means that a DoG image between scales $k_i\sigma$ and $k_j\sigma$ is just the difference of the Gaussian-blurred images at scales $k_i\sigma$ and $k_j\sigma$. For scale-space extrema detection in the SIFT algorithm, the image is first convolved with Gaussian blurs at different scales. The convolved images are grouped by octave (an octave corresponds to doubling the value of $\sigma$), and the value of $k_i$ is selected so that we obtain a fixed number of convolved images per octave. Then the Difference-of-Gaussian images are taken from adjacent Gaussian-blurred images in each octave.
Once DoG images have been obtained, keypoints are identified as local
minima/maxima of the DoG images across scales. This is done by comparing
each pixel in the DoG images to its eight neighbors at the same scale and
nine corresponding neighboring pixels in each of the neighboring scales. If
the pixel value is the maximum or minimum among all compared pixels, it
is selected as a candidate keypoint.
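The following sketch illustrates this stage. It is a simplified, single-octave version in Python (the choice of NumPy/SciPy and the parameter values are mine; a real implementation, including Lowe's, processes several octaves and downsamples between them):

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, sigma=1.6, k=2 ** (1 / 3), num_scales=5):
    """Return (x, y, scale index) of local extrema of the DoG stack."""
    # Gaussian-blurred copies at geometrically spaced scales (Eq. 2.2).
    blurred = [gaussian_filter(image, sigma * k ** i) for i in range(num_scales)]
    # Difference-of-Gaussians between adjacent scales (Eq. 2.1).
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(num_scales - 1)])
    keypoints = []
    # Compare each interior pixel to its 26 neighbors in space and scale.
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                v = dog[s, y, x]
                if v == cube.max() or v == cube.min():
                    keypoints.append((x, y, s))
    return keypoints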
This first stage in the algorithm is an approximation of earlier blob de-
tection methods that work by detecting scale-space extrema of the scale nor-
malized Laplacian [11], that is, detecting points that are local extrema with respect to both space and scale. The difference of Gaussians operator can
be seen as an approximation to the Laplacian. Figure 2.1 shows a sample
object with keypoints extracted by SIFT from two different angles.
Figure 2.1: Example operation of SIFT on an image; circles mark the keypoints, showing their scales
2.1.2 Accurate Keypoint Localization
Scale-space extrema detection produces too many keypoint candidates, some
of which are unstable. The second step in the algorithm is to perform a
detailed fit to the nearby data for accurate location, scale, and ratio of prin-
cipal curvatures. This information allows points to be rejected that have low
contrast (and are therefore sensitive to noise) or are poorly localized along
an edge.
Interpolation of nearby data for accurate position
First, for each candidate keypoint, interpolation of nearby data is used to
accurately determine its position. The algorithm calculates the interpolated
location of the maximum, which substantially improves matching and stabil-
ity [14]. The interpolation is done using the quadratic Taylor expansion of
the Difference-of-Gaussian scale-space function $D(x, y, \sigma)$, with the candidate keypoint as the origin. This Taylor expansion is given by

$$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^{T}\mathbf{x} + \frac{1}{2}\mathbf{x}^{T}\frac{\partial^2 D}{\partial \mathbf{x}^2}\mathbf{x} \qquad (2.3)$$
where $D$ and its derivatives are evaluated at the candidate keypoint and $\mathbf{x} = (x, y, \sigma)$ is the offset from this point. The location of the extremum, $\hat{\mathbf{x}}$, is determined by taking the derivative of this function with respect to $\mathbf{x}$ and setting it to zero. If the offset $\hat{\mathbf{x}}$ is larger than 0.5 in any dimension, this is an indication that the extremum lies closer to another candidate keypoint. In this case, the candidate keypoint is changed and the interpolation is performed about that point instead. Otherwise the offset is added to the candidate keypoint to get the interpolated estimate for the location of the extremum.
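A minimal sketch of one such interpolation step, using central differences to estimate the derivatives of $D$ over a DoG stack like the one built in the previous sketch (the array layout is an assumption of this sketch):

import numpy as np

def refine_keypoint(dog, s, y, x):
    """One interpolation step of Eq. 2.3: solve for the offset that
    zeroes the derivative of the quadratic fit around (x, y, scale)."""
    # Central-difference gradient of D with respect to (x, y, scale).
    g = 0.5 * np.array([
        dog[s, y, x + 1] - dog[s, y, x - 1],
        dog[s, y + 1, x] - dog[s, y - 1, x],
        dog[s + 1, y, x] - dog[s - 1, y, x],
    ])
    # Central-difference Hessian (second derivatives and mixed terms).
    h = np.empty((3, 3))
    h[0, 0] = dog[s, y, x + 1] - 2 * dog[s, y, x] + dog[s, y, x - 1]
    h[1, 1] = dog[s, y + 1, x] - 2 * dog[s, y, x] + dog[s, y - 1, x]
    h[2, 2] = dog[s + 1, y, x] - 2 * dog[s, y, x] + dog[s - 1, y, x]
    h[0, 1] = h[1, 0] = 0.25 * (dog[s, y + 1, x + 1] - dog[s, y + 1, x - 1]
                                - dog[s, y - 1, x + 1] + dog[s, y - 1, x - 1])
    h[0, 2] = h[2, 0] = 0.25 * (dog[s + 1, y, x + 1] - dog[s + 1, y, x - 1]
                                - dog[s - 1, y, x + 1] + dog[s - 1, y, x - 1])
    h[1, 2] = h[2, 1] = 0.25 * (dog[s + 1, y + 1, x] - dog[s + 1, y - 1, x]
                                - dog[s - 1, y + 1, x] + dog[s - 1, y - 1, x])
    offset = -np.linalg.solve(h, g)             # the extremum offset
    value = dog[s, y, x] + 0.5 * g.dot(offset)  # D at the offset, used by the 0.03 test below
    return offset, value

If the resulting offset exceeds 0.5 in some dimension, the procedure is repeated around the neighboring candidate, as described above.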
Discarding low-contrast keypoints
To discard keypoints with low contrast, the value of the second-order Taylor expansion $D(\hat{\mathbf{x}})$ is computed at the offset $\hat{\mathbf{x}}$. If this value is less than 0.03 in magnitude, the candidate keypoint is discarded. Otherwise it is kept, with final location $\mathbf{y} + \hat{\mathbf{x}}$ and scale $\sigma$, where $\mathbf{y}$ is the original location of the keypoint at scale $\sigma$.
Eliminating edge responses
The DoG function will have strong responses along edges, even if the can-
didate keypoint is unstable to small amounts of noise. Therefore, in order
to increase stability, we need to eliminate the keypoints that have poorly
determined locations but have high edge responses.
For poorly defined peaks in the DoG function, the principal curvature across the edge would be much larger than the principal curvature along it. Finding these principal curvatures amounts to solving for the eigenvalues of the second-order Hessian matrix, $H$:

$$H = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix} \qquad (2.4)$$

The eigenvalues of $H$ are proportional to the principal curvatures of $D$. It turns out that the ratio of the two eigenvalues, say $\alpha$ the larger one and $\beta$ the smaller one, with ratio $r = \alpha/\beta$, is sufficient for SIFT's purposes. The trace of $H$, i.e. $D_{xx} + D_{yy}$, gives us the sum of the two eigenvalues, while its determinant, i.e. $D_{xx}D_{yy} - D_{xy}^2$, yields their product. The ratio $R = \mathrm{Tr}(H)^2/\mathrm{Det}(H)$ can be shown to be equal to $(r+1)^2/r$, which depends only on the ratio of the eigenvalues rather than their individual values. $R$ is minimal when the eigenvalues are equal to each other. Therefore the higher the absolute difference between the two eigenvalues, which is equivalent to a higher absolute difference between the two principal curvatures of $D$, the higher the value of $R$. It follows that, for some threshold eigenvalue ratio $r_{th}$, if $R$ for a candidate keypoint is larger than $(r_{th}+1)^2/r_{th}$, that keypoint is poorly localized and hence rejected. The current approach uses $r_{th} = 10$.
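In code, this test reduces to a few lines; the following sketch assumes the second derivatives of the DoG image at the keypoint have already been estimated (for example, by the central differences of the previous sketch):

def passes_edge_test(dxx, dyy, dxy, r_th=10.0):
    """Reject keypoints on edges using the curvature ratio of Eq. 2.4."""
    trace = dxx + dyy
    det = dxx * dyy - dxy ** 2
    if det <= 0:  # curvatures of different signs: not a stable extremum
        return False
    # Keep the keypoint only if Tr(H)^2 / Det(H) < (r_th + 1)^2 / r_th.
    return trace ** 2 / det < (r_th + 1) ** 2 / r_th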
2.1.3 Orientation Assignment
In this step, each keypoint is assigned one or more orientations based on
local image gradient directions. This is the key step in achieving invariance to rotation: the keypoint descriptor can be represented relative to this orientation and thereby becomes invariant to image rotation.
First, the Gaussian-smoothed image $L(x, y, \sigma)$ at the keypoint's scale $\sigma$ is taken so that all computations are performed in a scale-invariant manner. For an image sample $L(x, y)$ at scale $\sigma$, the gradient magnitude $m(x, y)$ and orientation $\theta(x, y)$ are precomputed using pixel differences:

$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2} \qquad (2.5)$$

$$\theta(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right) \qquad (2.6)$$
The magnitude and direction calculations for the gradient are done for
every pixel in a neighboring region around the keypoint in the Gaussian-
blurred image L. An orientation histogram with 36 bins is formed, with each
bin covering 10 degrees. Each sample in the neighboring window added to
a histogram bin is weighted by its gradient magnitude and by a Gaussian-
weighted circular window with a σ that is 1.5 times that of the scale of the
keypoint. The peaks in this histogram correspond to dominant orientations.
Once the histogram is filled, the orientations corresponding to the highest
peak and local peaks that are within 80% of the highest peak are assigned
to the keypoint. In the case of multiple orientations being assigned, an
additional keypoint is created having the same location and scale as the
original keypoint for each additional orientation.
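The following sketch computes such an orientation histogram for a single keypoint, using Eqs. 2.5 and 2.6. For simplicity it returns the centers of all bins within 80% of the highest peak rather than interpolated local peaks, and the window radius is an illustrative choice:

import numpy as np
from scipy.ndimage import gaussian_filter

def keypoint_orientations(image, x, y, scale, radius=8):
    """36-bin gradient-orientation histogram around one keypoint."""
    L = gaussian_filter(image.astype(float), scale)
    hist = np.zeros(36)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]   # pixel differences (Eq. 2.5)
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            magnitude = np.hypot(gx, gy)
            theta = np.degrees(np.arctan2(gy, gx)) % 360  # Eq. 2.6
            # Gaussian weight with sigma = 1.5 times the keypoint scale.
            weight = np.exp(-(dx ** 2 + dy ** 2) / (2 * (1.5 * scale) ** 2))
            hist[int(theta // 10) % 36] += magnitude * weight
    # Orientations of all bins within 80% of the highest peak.
    peak = hist.max()
    return [10.0 * b + 5.0 for b in range(36) if hist[b] >= 0.8 * peak]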
Figure 2.2 shows the same sample object from Figure 2.1, this time show-
ing the orientation of the keypoints.
Figure 2.2: Example of SIFT keypoint orientation; arrows mark the scale and orientation of the keypoint
2.1.4 Keypoint Descriptor
The previous stage found keypoint locations at particular scales and assigned
orientations to them. This ensured invariance to image location, scale and
rotation. The final stage computes descriptor vectors for these keypoints
such that the descriptors are highly distinctive and partially invariant to the remaining variations, such as illumination and 3D viewpoint. This step is quite similar to the orientation assignment step.
The feature descriptor is computed as a set of orientation histograms on
(4 x 4) pixel neighborhoods. The orientation histograms are relative to the
keypoint orientation and the orientation data comes from the Gaussian image
closest in scale to the keypoint’s scale. Just like before, the contribution of
each pixel is weighted by the gradient magnitude, and by a Gaussian with
σ 1.5 times the scale of the keypoint. Histograms contain 8 bins each, and
each descriptor contains a 4x4 array of 16 histograms around the keypoint.
This leads to a SIFT feature vector with 4 x 4 x 8 = 128 elements. This
vector is normalized to enhance invariance to changes in illumination.
Although the dimension of the descriptor, i.e. 128, seems high, descriptors
with lower dimension than this do not perform as well across the range of
matching tasks. Longer descriptors continue to do better but not by much
and there is an additional danger of increased sensitivity to distortion and
occlusion. It is also shown that feature matching accuracy is above 50%
for viewpoint changes of up to 50 degrees. Therefore SIFT descriptors are
invariant to minor affine changes.
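A much simplified sketch of the descriptor computation follows. It omits the rotation of the sampling grid and the trilinear interpolation between histogram bins that a full SIFT implementation performs, but it shows how the 4 x 4 x 8 = 128-element vector arises:

import numpy as np
from scipy.ndimage import gaussian_filter

def sift_descriptor(image, x, y, scale, orientation_deg):
    """Simplified 4x4x8 descriptor around a keypoint; illustrative only."""
    L = gaussian_filter(image.astype(float), scale)
    hist = np.zeros((4, 4, 8))
    for dy in range(-8, 8):
        for dx in range(-8, 8):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            mag = np.hypot(gx, gy)
            # Orientation relative to the keypoint orientation (rotation invariance).
            theta = (np.degrees(np.arctan2(gy, gx)) - orientation_deg) % 360
            weight = np.exp(-(dx ** 2 + dy ** 2) / (2 * (1.5 * scale) ** 2))
            cell_y, cell_x = (dy + 8) // 4, (dx + 8) // 4  # one of 16 areas
            hist[cell_y, cell_x, int(theta // 45) % 8] += mag * weight
    vec = hist.ravel()                      # 4 x 4 x 8 = 128 elements
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec  # normalized for illumination invariance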
Figure 2.3 shows an example of a SIFT descriptor and the corresponding
image patch it was calculated from.
Figure 2.3: Example of SIFT descriptor representation. Top left image shows the descriptor, arrows marking the magnitude of descriptor values in each area and direction. Bottom left image shows the original image with the keypoint marked as a circle. Top right image shows the image patch used to calculate the descriptor. Bottom right image shows the same image patch after blurring with the relevant Gaussian.
2.2 Amsterdam Library of Object Images
Amsterdam Library of Object Images (ALOI) [7] is a color image collection
of one-thousand small objects, recorded for scientific purposes. In order to
capture the sensory variation in object recordings, ALOI systematically varies
viewing angle, illumination angle, and illumination color for each object, and
additionally captures wide-baseline stereo images. ALOI consists of over a
hundred images of each object, yielding a total of 110,250 images for the
collection, occupying 140GB of disk space (uncompressed tiff, 60GB lossless
compressed png). Figure 2.4 shows some example objects from the ALOI
dataset.
The data set offers a testing and evaluation ground for a variety of com-
puter vision algorithms, amongst others: object recognition, pose estimation,
color constancy, invariant feature extraction, stereo algorithms, super reso-
lution from multiple recordings, and image retrieval systems.
In my research I used the object viewpoint collection part of the ALOI
data set. Object viewpoint collection consists of 72 images of each object
taken with the same illumination from 72 directions. A frontal camera was
used to record 72 aspects of the objects by rotating the object in the plane
at 5 degree resolution. This collection is similar to the COIL [21] collection.
Figure 2.5 shows an example of the object view point collection for one object.
Since my application expects gray scale images I used the gray scale
Figure 2.4: Example objects from ALOI data set
variant of the ALOI dataset. For SIFT performance and computational reasons I found the images with a resolution of 364 x 288 pixels best suited to my needs.
Figure 2.5: Object Viewpoint Collection for Object Number 307
2.3 Application Structure
2.3.1 Training
The object recognition application I used works as follows. Let $I_{i,j}$ be the $j$-th training image of object $i$. I use SIFT in order to extract keypoints from the image and represent them as keypoint descriptors. I denote by $D(I_{i,j})$ the descriptors extracted from image $I_{i,j}$. For each descriptor $d \in D(I_{i,j})$ in the training set I define the correct label of the keypoint, $l(d)$, by

$$l(d) = i \iff d \in D(I_{i,j}) \text{ for some } j \qquad (2.7)$$
Let $n$ be the number of objects and $k$ the number of training images per object. I define

$$T = \bigcup_{i=1..n,\, j=1..k} D(I_{i,j}) \qquad (2.8)$$

to be the training set, and

$$DB_T = \{(d, l(d)) \mid d \in T\} \qquad (2.9)$$

to be the training database, a database of keypoint descriptors and their respective labels.
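A minimal sketch of this training stage as defined by Eqs. 2.7-2.9; here extract_descriptors is a hypothetical stand-in for the SIFT pipeline of Sec. 2.1, assumed to return one 128-dimensional row per keypoint:

import numpy as np

def build_training_database(extract_descriptors, training_images):
    """Build DB_T: every training descriptor paired with its object label.
    training_images[i][j] is image I_{i,j}."""
    descriptors, labels = [], []
    for i, images_of_object in enumerate(training_images):
        for image in images_of_object:
            d = extract_descriptors(image)   # D(I_{i,j})
            descriptors.append(d)
            labels.extend([i] * len(d))      # l(d) = i  (Eq. 2.7)
    return np.vstack(descriptors), np.array(labels)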
2.3.2 Testing
Given a test image $I^t_i$ I use SIFT in order to extract keypoints from the image and represent them as keypoint descriptors. I denote by $D(I^t_i)$ the descriptors extracted from $I^t_i$ and find the label of $I^t_i$ in the following way:

1. For each $d \in D(I^t_i)$ I denote by $l(T, d)$ the label of the nearest neighbor of $d$ in $T$, and find it by searching $DB_T$.

2. For each $d \in D(I^t_i)$ I calculate the distinctiveness ratio suggested in [14]:

$$r(d) = \frac{\text{distance to the nearest descriptor in } T \text{ not labeled } l(T, d)}{\text{distance to the nearest descriptor in } T \text{ labeled } l(T, d)} \qquad (2.10)$$

3. I calculate the label of $I^t_i$:

$$l(I^t_i) = \text{majority}\{l(T, d) \mid d \in D(I^t_i) \text{ and } r(d) > 1.5\} \qquad (2.11)$$
For each descriptor d in the test set, if r(d) > 1.5 I say that d was
matched. If d was matched and l(T, d) = l(d) I say that d was matched
correctly. Figure 2.6 shows an example of two correctly matching descriptors
and their corresponding images.
Figure 2.6: Example of correct matching. Top images show the descriptors. Bottom images show the original images with the keypoints marked as circles.
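The testing stage of Eqs. 2.10-2.11 can be sketched as follows; a brute-force nearest neighbor search is used for clarity, although any exact search structure would give the same result:

import numpy as np
from collections import Counter

def classify_image(test_descriptors, db_descriptors, db_labels, threshold=1.5):
    """Label a test image by majority vote over distinctive matches.
    db_descriptors: (N, 128) array; db_labels: (N,) array of object labels."""
    votes = []
    for d in test_descriptors:
        dist = np.linalg.norm(db_descriptors - d, axis=1)
        label = db_labels[np.argmin(dist)]         # l(T, d)
        same = dist[db_labels == label].min()      # nearest, labeled l(T, d)
        other = dist[db_labels != label].min()     # nearest with any other label
        if same > 0 and other / same > threshold:  # r(d) > 1.5: d is matched
            votes.append(label)
    # Majority vote over the distinctive (matched) descriptors (Eq. 2.11).
    return Counter(votes).most_common(1)[0][0] if votes else None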
2.4 Application Performance
Let’s look at the performance of the application without filtering or reducing
the database size. Table 2.1 shows general statistics of the databases for 100,
200, and 500 objects. We can see that the percentage of matched keypoints is 16%-21% and the percentage of correctly matched keypoints is 13%-19%. This shows that a large percentage of the keypoints that were matched had a correct match. The distinctiveness ratio described in Eq. 2.10 is responsible for this high accuracy. We can also see that at most about 21% of the keypoints were matched, showing the potential of using filtering in order to improve the performance of the application.
Table 2.1: General statistics of the test databases. Columns 1-3: quantities of objects, images, and descriptors. Columns 4-6: percent of matched descriptors, percent of correctly matched descriptors, and percent of correctly matched images, using a distinctiveness ratio of 1.5.

# of objects   # of images   # of descriptors   % descriptors   % descriptors       % images
                                                matched         matched correctly   matched correctly
100            400           71214              19.7            17.6                78
200            800           122252             21.2            18.6                68.5
500            2000          278571             16.3            13.6                61.4
Chapter 3
Experiments and Results
3.1 Filtering Keypoints using K Nearest Neighbors
In this section I describe the first experiment I performed. In this experiment I used extra training data in order to train a K nearest neighbor database that I then used to filter out keypoints from subsequent test data. Although one can argue about the usefulness of using one nearest neighbor search in order to save work for another, all I intend to show in this experiment is the potential of filtering keypoints. I show that the extra filtering step can decrease the number of keypoints the original application has to deal with, with only a small reduction in performance.
3.1.1 Experiment
I start with the application described in Sec. 2.3. After finishing the training stage (and building the training set $T$ and the training database $DB_T$) I introduce an additional stage. I select one additional image $I^+_i$ for each object and use SIFT in order to extract keypoints from the image and represent them as keypoint descriptors.

For each descriptor $d \in D(I^+_i)$ I find $l(T, d)$, the label of the nearest neighbor of $d$ in $T$, and define the following correctness measure:

$$m_c(d) = \begin{cases} 1 & l(T, d) = l(d) \\ -1 & l(T, d) \neq l(d) \end{cases} \qquad (3.1)$$
First, I tried to identify meta-features of the descriptor that highly correlate with the correctness measure $m_c(d)$. I did this by calculating correlations between the correctness measure and different properties of the descriptor, such as the average, median, or variance of different subsets of the descriptor values. I combined several meta-features into a meta-descriptor $\vec{e}(d)$; notice that with $\vec{e}(d) = d$ the descriptor itself can be considered a specific case of a trivial meta-descriptor.

I then use the set $\{(\vec{e}(d), m_c(d))\}$ of meta-descriptors and their correctness measures from all additional training images $\{I^+_i\}_{i=1..n}$ as the basis for a K nearest neighbor regression database $DB^+$ (I tested the algorithm with databases of 3 and 5 nearest neighbors).
Now, for each test image $I^t_i$ I do the following:

1. For each $d \in D(I^t_i)$ I calculate $\vec{e}(d)$ and find the regression value $m^+(d)$ using $DB^+$.

2. I filter out all the descriptors whose regression value is lower than some constant:

$$D^+(I^t_i) = \{d \in D(I^t_i) \mid m^+(d) > \phi\} \text{ for some } \phi \qquad (3.2)$$

3. I then perform the rest of the testing stage described in Sec. 2.3, substituting $D^+(I^t_i)$ for $D(I^t_i)$ and calculating $l(T, d)$ and $r(d)$ for each $d \in D^+(I^t_i)$ in order to find $l(I^t_i)$.
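A minimal sketch of this filtering scheme follows. The use of scikit-learn's K nearest neighbor regressor is my own illustrative choice, not necessarily the implementation used in this work, and the threshold phi is arbitrary:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def train_keypoint_filter(meta_descriptors, correctness, k=3):
    """DB+: regression from meta-descriptors e(d) to the correctness
    measure m_c(d) in {-1, +1} of Eq. 3.1."""
    return KNeighborsRegressor(n_neighbors=k).fit(meta_descriptors, correctness)

def filter_test_descriptors(knn, test_descriptors, phi=0.0):
    """Keep only descriptors whose regression value exceeds phi (Eq. 3.2).
    The trivial meta-descriptor e(d) = d is assumed here; otherwise apply
    e() to the descriptors before predicting."""
    scores = knn.predict(test_descriptors)  # m+(d) for each descriptor
    return test_descriptors[scores > phi]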
3.1.2 Results
In order to identify meta-features that correlate with the correctness measure, I systematically looked at the correlation for the following properties of descriptors:
1. Descriptor values with intensity bigger or smaller than some value.

2. Descriptor values whose values are in some range (between a and b, for some fixed a and b).

3. Descriptor values which represent the gradient in a specific direction, or a combination of directions.

4. Descriptor values which were taken from a specific area, or a combination of areas (from the 16 subareas the descriptor is composed of, see Sec. 2.1).
I used these properties in combination with various statistical properties of the selected descriptor values:
1. Number of descriptor values with the properties.
2. Sum of the intensities of descriptor values with the properties.
3. Sum of the square of the intensities of descriptor values with the prop-
erties.
4. Median of the intensities of descriptor values with the properties.
These combinations produced thousands of meta-feature candidates, from which I used those with the highest correlation values. Unfortunately, the correlation values were not very high, but since the number of descriptors used to compute the correlation was very large, even lower correlation values were significant. Figures 3.1 and 3.2 show densities of meta-feature values for correctly and incorrectly matched descriptors; although two different distributions can be seen, the separation is not good enough for a single meta-feature. I hoped that using a K nearest neighbor database over a number of features would produce better results.
Although I used many combinations of meta-features as meta-descriptors, the results of using the descriptor values directly were generally better. Nevertheless, I believe meta-features may be used to reduce the dimensionality of the K nearest neighbor database (from the descriptor dimension of 128), but this is out of the scope of this thesis. In order to demonstrate the use of meta-descriptors when presenting the results of this experiment, I chose one specific meta-descriptor which showed good results. Table 3.1 details the specific meta-features used in the meta-descriptor (which are the same meta-features whose densities are shown in Fig. 3.1 and Fig. 3.2).

Figure 3.1: Experiment 1: Meta-features density of correctly and incorrectly matched descriptors. Left - median value of all descriptor values that represent gradient in directions up, down, left, or right, in all areas (Correlation = 0.23). Right - sum of all descriptor values in the 12 peripheral areas, in all directions of gradient (Correlation = 0.18).
Figure 3.2: Experiment 1: Meta-features density of correctly and incorrectly matched descriptors. Left - number of descriptor values in the range 0.02 to 0.24 (Correlation = 0.21). Right - sum of squares of descriptor values bigger than 0.12 (Correlation = -0.19).
Table 3.1: Meta-features with relatively high correlation to the correctness of the match, used in order to build the meta-descriptor. Corr. - correlation of the meta-feature with the correctness measure. P-Value - probability of the hypothesis of no correlation.

Function                                                                   Corr.   P-Value
Median of gradient in directions up, down, left, or right, in all areas    0.23    0.00
Sum of descriptor values in the 12 peripheral areas, in all directions
of gradient                                                                 0.18    0.00
Number of descriptor values in the range 0.02 to 0.24                      0.21    0.00
Sum of squares of descriptor values bigger than 0.12                      -0.19    0.00
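For concreteness, the following sketch computes these four meta-features from a 128-dimensional descriptor. The mapping of the vector onto 16 areas and 8 orientation bins (a 4x4 grid whose outer 12 cells are peripheral, with every second bin taken as an axis-aligned direction) is an assumption of this sketch about the descriptor layout:

import numpy as np

def meta_descriptor(d):
    """The four meta-features of Table 3.1; layout assumptions as noted above."""
    v = np.asarray(d, dtype=float).reshape(4, 4, 8)  # (row, col, orientation)
    # Bins 0, 2, 4, 6 taken as the four axis-aligned gradient directions.
    median_axis = np.median(v[:, :, ::2])
    # Peripheral areas: everything except the central 2x2 block.
    central = np.zeros((4, 4), dtype=bool)
    central[1:3, 1:3] = True
    peripheral_sum = v[~central].sum()
    in_range = np.count_nonzero((v >= 0.02) & (v <= 0.24))
    big_sq_sum = (v[v > 0.12] ** 2).sum()
    return np.array([median_axis, peripheral_sum, in_range, big_sq_sum])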
In order to assess my results I used an average of 20 random filterings of the image descriptors as a reference. When looking at the results of the object recognition application described in Sec. 3.1.1 for 500 objects, we can see in Fig. 3.3 that random filtering, on average, performs rather well by itself, losing less than 10% accuracy when filtering 70% of the database, showing the potential of filtering keypoints. We can also see that when filtering up to 60% of the database my approach gives similar results in a predictable way.
Figure 3.3: Experiment 1: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, using descriptor values and meta-features, compared to averaged random filtering
In Fig. 3.4 we can see that when filtering 70%-90% of the database using my approach, we achieve the same accuracy as the average random filtering with half as many descriptors (twice as much filtering). For example, in order to match 47% of the images, random filtering can filter 77% of the database, leaving 23% of the descriptors, while using my approach I can filter 88%, leaving only 12%.

Figure 3.5 shows similar results for 200 objects.
Figure 3.4: Experiment 1: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, using descriptor values and meta-features, compared to averaged random filtering, when filtering 70%-90% of the database, showing that I achieve the same accuracy as the average random filtering with less than half of the descriptors.
Figure 3.5: Experiment 1: Percent of matched images vs. filtering of the database for 200 objects, 3 and 5 nearest neighbors, using descriptor values and meta-features, compared to averaged random filtering
3.2 Reducing Keypoint Database Size using Additional Training Data
In this section I describe the second experiment I performed. I used the same extra training data as in the last experiment, but this time, instead of performing filtering on the test images, I filtered the original database itself. Filtering the database is performed only once, and therefore the complexity of the filtering stage itself is less important, making it suitable for the computationally heavy K nearest neighbor algorithm. Again, I show I can significantly reduce the database size with only a small reduction in performance.

Following the results of the last experiment I abandoned the use of meta-descriptors and used only the descriptor vector as the basis for the K nearest neighbor database.
3.2.1 Experiment
I start by building $DB^+$ as described in Sec. 3.1, this time using only $\vec{e}(d) = d$ as the database key. Then I build the reduced training set $T^+$ and training database $DB_{T^+}$ in the following way:

1. For each descriptor $d$ in the training set $T$ I find the regression value $m^+(d)$ using $DB^+$.

2. I build the reduced training set $T^+$:

$$T^+ = \{d \in T \mid m^+(d) > \phi\} \text{ for some } \phi \qquad (3.3)$$

3. I build the reduced training database $DB_{T^+}$:

$$DB_{T^+} = \{(d, l(d)) \mid d \in T^+\} \qquad (3.4)$$
I then perform the testing stage described in Sec. 2.3, substituting $T^+$ for $T$ and calculating $l(T^+, d)$ and $r(d)$ for each $d \in D(I^t_i)$ in order to find $l(I^t_i)$.
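A compact sketch of this database reduction step (again with scikit-learn as an illustrative choice of mine; phi is an arbitrary threshold):

from sklearn.neighbors import KNeighborsRegressor

def reduce_training_database(extra_descriptors, correctness,
                             db_descriptors, db_labels, phi=0.0, k=3):
    """Sketch of Eqs. 3.3-3.4: build DB+ on the raw descriptors (e(d) = d)
    and keep only training descriptors with regression value above phi.
    Done once, offline, so the cost of the extra search is amortized."""
    knn = KNeighborsRegressor(n_neighbors=k).fit(extra_descriptors, correctness)
    keep = knn.predict(db_descriptors) > phi      # m+(d) > phi
    return db_descriptors[keep], db_labels[keep]  # T+ and its labels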
3.2.2 Results
In order to assess my results I used an average of 20 random filterings of the descriptor database as a reference. When looking at the results of the object recognition application described in Sec. 3.2.1 for 500 objects, we can see in Fig. 3.6 that, again, random filtering, on average, performs rather well by itself, losing only 5% accuracy when filtering 70% of the database, showing the potential for reducing the database size. We can also see that when filtering up to 70% of the database my approach gives similar results in a predictable way.
In Fig. 3.7 we can see that when filtering 70%-95% of the database using my approach, we achieve the same accuracy as the average random filtering with 2/3 of the database size. For example, in order to match 48% of the images, random filtering can filter 81% of the database, leaving 19% of the descriptors, while using my approach I can filter 89%, leaving only 11%.

Figure 3.6: Experiment 2: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, compared to averaged random filtering
Figure 3.8 shows that for 200 objects the results were very similar to those for 500 objects, but when running the experiment with 100 objects I did not achieve any significant improvement over the random reference. This result concurs with my assumption that keypoint filtering is beneficial for large keypoint databases.

Next, let's look at how filtering has affected the number of correctly matched descriptors. Figure 3.9 shows the result of filtering for 500 objects in relative percentage over the random reference. The main result these figures show is that when filtering 70%-95% of the database, my filtering technique gives an increase of 5%-10% in correctly matched descriptors relative to random filtering. Figure 3.10 shows similar results for 100 and 200 objects.

Figure 3.7: Experiment 2: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, compared to averaged random filtering, when filtering 70%-95% of the database, showing that I achieve the same accuracy as the average random filtering with less than 2/3 of the database size.
Figure 3.8: Experiment 2: Percent of matched images vs. filtering of the database for 100 objects (left) and 200 objects (right), 3 and 5 nearest neighbors, compared to averaged random filtering
Figure 3.9: Experiment 2: Percent of matched descriptors relative to the random reference vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors
Figure 3.10: Experiment 2: Percent of matched descriptors relative to the random reference vs. filtering of the database for 100, 200, and 500 objects, 3 nearest neighbors
3.3 Reducing Keypoint Database Size using Leave One Out
Looking at the last experiment it can be argued that we don’t always have
the extra training data, or if we do, we had better use it to enrich the original
training set. In this section I describe an experiment that performs database
size reduction without using extra training data. I use leave one out cross
validation in order to train a K nearest neighbor database and then use it in
order to reduce the database size.
3.3.1 Experiment
I start by building the training set $T$ and training database $DB_T$ as described in Sec. 2.3. For each $j$ I then break $T$ into two parts: $R_j$, the descriptors extracted from the $j$-th training image of each object, and $T_j$, the rest of the descriptors in the training set. Formally:

$$R_j = \bigcup_{i=1..n} D(I_{i,j}) \quad \text{and} \quad T_j = T \setminus R_j$$

I calculate $m_c^j(d)$ for each $d \in R_j$, using $T_j$ as the training set, as described in (3.1). At this stage I use the set $\{(d, m_c^j(d))\}$ of descriptors in $R_j$ and their correctness measures as the basis for a K nearest neighbor regression database (again, I tested the algorithm with 3 and 5 nearest neighbors) and find the regression value $m_j(d)$ for each $d \in T_j$.

Finally, I calculate the mark $m(d)$ for each descriptor $d \in D(I_{i,j})$:

$$m(d) = \sum_{j' \neq j} m_{j'}(d) \qquad (3.5)$$

I then create the filtered training set $T'$ from $T$ in the following way:

$$T' = \{d \in T \mid m(d) > \phi\} \text{ for some } \phi \qquad (3.6)$$

And the filtered training database $DB_{T'}$:

$$DB_{T'} = \{(d, l(d)) \mid d \in T'\} \qquad (3.7)$$
I then perform the testing stage described in Sec. 2.3, substituting $T'$ for $T$ and calculating $l(T', d)$ and $r(d)$ for each $d \in D(I^t_i)$ in order to find $l(I^t_i)$.
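The following sketch outlines the leave one out procedure. Here image_ids[i] is assumed to hold the within-object index j of the training image that descriptor i was extracted from, so that the mask image_ids == j selects R_j; the names, array layout, and use of scikit-learn are assumptions of this sketch:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def leave_one_out_marks(descriptors, labels, image_ids, k=3):
    """Accumulate the marks m(d) of Eq. 3.5 over all held-out images."""
    marks = np.zeros(len(descriptors))
    for j in np.unique(image_ids):
        held = image_ids == j                 # R_j
        rest = ~held                          # T_j
        # Correctness of each held-out descriptor with respect to T_j (Eq. 3.1).
        mc = np.empty(held.sum())
        for n, d in enumerate(descriptors[held]):
            dist = np.linalg.norm(descriptors[rest] - d, axis=1)
            nearest_label = labels[rest][np.argmin(dist)]
            mc[n] = 1.0 if nearest_label == labels[held][n] else -1.0
        # Regress the correctness onto the remaining descriptors and
        # accumulate their marks; each d in image j' ends up summed over j != j'.
        knn = KNeighborsRegressor(n_neighbors=k).fit(descriptors[held], mc)
        marks[rest] += knn.predict(descriptors[rest])
    return marks

# The filtered training set T' of Eq. 3.6 is then descriptors[marks > phi].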
3.3.2 Results
Figure 3.11: Experiment 3: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, compared to averaged random filtering
Again, in order to assess my results I used the same average of 20 random filterings of descriptors as a reference. When looking at the results of the object recognition application described in Sec. 3.3.1 for 500 objects, we can see in Fig. 3.11 that this experiment shows results similar to the last one: when filtering up to 70% of the database my approach gives similar results in a predictable way. In Fig. 3.12 we can see that when filtering 70%-95%
Figure 3.12: Experiment 3: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, compared to averaged random filtering, when filtering 70%-95% of the database, showing that I achieve the same accuracy as the average random filtering with less than 2/3 of the database size.
of the database using my approach, we achieve, again, the same accuracy as the average random filtering with 2/3 of the database size. For example, in order to match 45% of the images, random filtering can filter 84% of the database, leaving 16% of the descriptors, while using my approach I can filter 91%, leaving only 9%.

Next, let's look at how filtering has affected the number of correctly matched descriptors. Figure 3.13 shows the result of filtering for 500 objects in relative percentage over the random reference. The main result these figures show is that when filtering 70%-95% of the database, my filtering technique gives an increase of 5%-20% in correctly matched descriptors relative to random filtering.

Figure 3.13: Experiment 3: Percent of matched descriptors relative to the random reference vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors
Finally, when looking at the effect of the method on databases of different sizes, Fig. 3.14 shows that for 100 and 200 objects this experiment did not produce results that were significantly better than those of the random reference. Figure 3.15 compares results for databases of different sizes, showing that when the database size was increased to 500 objects the experiment showed much better results than with a database size of 100 or 200 objects, emphasizing the benefit of my approach for large databases.
Figure 3.14: Experiment 3: Percent of matched images vs. filtering of the database for 100 objects (left) and 200 objects (right), 3 and 5 nearest neighbors, compared to averaged random filtering
Figure 3.15: Experiment 3: Percent of matched descriptors relative to the random reference vs. filtering of the database for 100, 200, and 500 objects, 3 nearest neighbors
Chapter 4
Conclusions and Future work
4.1 Conclusions
In this thesis I have proposed a new approach for reducing the complexity of keypoint-based computer vision applications by filtering keypoints, based on learning which descriptors are beneficial to the application. I performed three experiments demonstrating this approach, all based on an object recognition application that uses a keypoint database:
1. Filtering keypoints using a K nearest neighbors database that represents the ability of each keypoint to produce correct results for the application.
2. Filtering keypoints from the keypoint database, thus reducing its size,
using a K nearest neighbors database that requires additional training
data.
3. Filtering keypoints from the keypoint database, thus reducing its size, using a K nearest neighbors database based on a leave-one-out method, without using additional training data.
In all of the experiments I showed that I can significantly reduce the number of keypoints used, with only a small reduction in performance. I also showed that my approach can achieve the same performance as the average of random filterings with about 2/3 of the database size.
4.2 Future Work
In the future, this approach can be further developed in the following ways:
1. Adapting the approach to other applications in computer vision; two promising candidate areas are object class recognition and object tracking.
2. Using the approach with keypoint databases that serve more than one purpose, applying a different filter for each purpose. The purposes can be similar (such as two object recognition tasks that recognize different types of objects) or completely different (such as an object recognition task and an object tracking task). My assumption is that the two tasks will require different subsets of the database, which this approach can provide.
3. Upscaling the database size even further. It would be interesting to see how this application behaves with databases of millions of descriptors; in such cases filtering is crucial due to the large amount of space needed to store the descriptors.
4. Using more time- and space-efficient machine learning techniques. Most of my research concentrated on K nearest neighbor learning as a simple way to assess the potential of my approach. It would be interesting to see how more elaborate and efficient machine learning techniques, such as support vector machines, or a more elaborate use of meta-features, would perform with this approach (a speculative sketch follows this list).
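As a purely speculative sketch of this last direction, an SVM could replace the kNN usefulness predictor roughly as follows. Every name here is an assumption, including the choice of meta-features; scikit-learn is my choice of library and nothing in the thesis prescribes it.

```python
from sklearn.svm import SVC

def train_usefulness_svm(meta_features, was_useful):
    """Train an SVM to predict keypoint usefulness from meta-features.

    meta_features: (n, f) array of per-keypoint properties (e.g. scale,
                   contrast, edge response) - illustrative choices only
    was_useful:    (n,) boolean array marking keypoints that produced
                   correct matches during training
    """
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(meta_features, was_useful)
    return clf

# Filtering would then keep keypoints the SVM deems useful, replacing the
# score m(d) of Eq. (3.6) with the SVM's estimated probability:
#   keep = clf.predict_proba(new_meta_features)[:, 1] > phi
```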