TEL AVIV UNIVERSITY
Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
Keypoint Filtering using Machine Learning
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in
Computer Science
by
Shahar Jamshy
Thesis Supervisor:
Professor Yehezkel Yeshurun
May 2009
Abstract
Keypoints are high dimensional descriptors of local features of an image or an object. Keypoint extraction is the first task in various computer vision algorithms, where the keypoints are then stored in a database that serves as the basis for comparing images or image features. Keypoints may be based on image features extracted by feature detection operators or on a dense grid of features. Both approaches produce a large number of keypoints per image, causing both time and space performance challenges as the problem is scaled up.

In this thesis I propose a novel framework for reducing the number of keypoints an application has to process by learning which keypoints are beneficial for the specific application and using this knowledge to filter out a large portion of the keypoints. I demonstrate this approach on an object recognition application that uses a keypoint database. I perform numerous experiments, trying to reduce both the size of the database and the number of queries required for each test image. I show that the number of keypoints can be significantly reduced with relatively small reduction in performance.
Table of Contents
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Background
    1.1.1 Saliency Operators
    1.1.2 Keypoint Representation
    1.1.3 Metafeatures and Feature Selection
    1.1.4 Related Work
  1.2 The Approach
  1.3 Structure
2 Experimental Setting
  2.1 SIFT Saliency Operator
    2.1.1 Scale-Space Extrema Detection
    2.1.2 Accurate Keypoint Localization
    2.1.3 Orientation Assignment
    2.1.4 Keypoint Descriptor
  2.2 Amsterdam Library of Object Images
  2.3 Application Structure
    2.3.1 Training
    2.3.2 Testing
  2.4 Application Performance
3 Experiments and Results
  3.1 Filtering Keypoints using K Nearest Neighbors
    3.1.1 Experiment
    3.1.2 Results
  3.2 Reducing Keypoint Database Size using Additional Training Data
    3.2.1 Experiment
    3.2.2 Results
  3.3 Reducing Keypoint Database Size using Leave One Out
    3.3.1 Experiment
    3.3.2 Results
4 Conclusions and Future Work
  4.1 Conclusions
  4.2 Future Work
Bibliography
List of Tables
2.1 General Statistics of Test Databases
3.1 Selected Meta-features
List of Figures
1.1 General scheme of an application that uses keypoints
1.2 The Approach in the Training Stage
1.3 The Approach in the Filtering Stage
2.1 Example operation of SIFT on an image
2.2 Example of SIFT keypoint orientation
2.3 Example of SIFT descriptor representation
2.4 Example objects from ALOI data set
2.5 Object Viewpoint Collection for Object Number 307
2.6 Example of correct matching
3.1 Experiment 1: Meta-features density, location based meta-features
3.2 Experiment 1: Meta-features density, value distribution meta-features
3.3 Experiment 1: Percent of matched images vs. filtering of the database for 500 objects
3.4 Experiment 1: Percent of matched images vs. filtering of the database when filtering 70%-90%
3.5 Experiment 1: Percent of matched images vs. filtering of the database for 200 objects
3.6 Experiment 2: Percent of matched images vs. filtering of the database for 500 objects
3.7 Experiment 2: Percent of matched images vs. filtering of the database when filtering 70%-95%
3.8 Experiment 2: Percent of matched images vs. filtering of the database for 100 and 200 objects
3.9 Experiment 2: Percent of matched descriptors relative to the random reference vs. filtering of the database
3.10 Experiment 2: Percent of matched descriptors relative to the random reference vs. filtering of the database, with databases of different size
3.11 Experiment 3: Percent of matched images vs. filtering of the database
3.12 Experiment 3: Percent of matched images vs. filtering of the database when filtering 70%-95%
3.13 Experiment 3: Percent of matched descriptors relative to the random reference vs. filtering of the database
3.14 Experiment 3: Percent of matched images vs. filtering of the database for 100 and 200 objects
3.15 Experiment 3: Percent of matched descriptors relative to the random reference vs. filtering of the database, with databases of different size
Acknowledgments
First, I would like to thank my supervisor, Hezy Yeshurun, for guiding me through this new, frustrating, wonderful, and fascinating experience in scientific research, giving me both direction when I was lost and the freedom to find my own way the rest of the time.
I would also like to thank Eyal Krupka and Ariel Tankus for their valu-
able suggestions, and for introducing me to many methods and secrets of research in computer science.
In addition, I would like to thank Eddie Aronovich and the CS System
Team for installing and supporting the Condor System [12] which made all
the computations needed for this research possible.
Last but not least, I would like to thank Tania Barski-Kopilov for the
moral support, countless coffee breaks, and for not letting me panic. Tania,
I am lucky to have you as my friend.
This thesis is dedicated to my grandparents. My mother’s parents, the late
Eti and Yechiel Veinrib, and my father’s parents, Hedva and Moshe Jamshy, who have gone through many hardships immigrating from Poland
and Iraq in order to build a new home in Israel. You have built a loving,
caring and supporting home for me to grow in. I love you very much.
Chapter 1
Introduction
1.1 Background
Computer Vision is a subfield of Artificial Intelligence, whose purpose is to
program the computer to perceive the world through vision. There are many
different problems related to this field, such as:
1. Object class recognition [5, 6] is the task of finding what objects are
present in an image, and is based on building some model of the object
class (for example: house, car, bicycle, or horse). Object class recogni-
tion remains challenging in large part due to the significant variations
exhibited by real-world images. Partial occlusions, viewpoint changes,
varying illumination, cluttered backgrounds, and intra-category ap-
pearance variations all make it necessary to develop exceedingly robust
models of the different categories.
2. Object recognition [13, 22] is the task of finding specific given ob-
jects in an image or a video sequence. In this field of computer vision
we are interested in finding specific, a priori known objects, such as a
specific person or persons. Applications in this field usually try to find
distinctive features of the specific object in order to identify it in sub-
sequent images. This field, much like object class recognition, faces the
challenges of dealing with real-world images in which the object may
appear cluttered, occluded, or poorly illuminated.
3. Object tracking [28] is the task of locating a moving object (or
several ones) in time using a camera. The main difficulty in video
tracking is to associate target locations in consecutive video frames,
especially when the objects are moving fast relative to the frame rate.
Here, video tracking systems usually employ a motion model which
describes how the image of the target might change for different possible motions of the object being tracked.
4. Matching of stereo pairs [23, 16] is the task of matching between
pairs of points projected from the same physical position onto two
different images taken from different positions, usually in order to gain
some knowledge about the three dimensional position of the original
point in the physical world.
All of these problems may benefit from the selection of salient areas in
the image as the first step of their processing. The selection of salient areas
focuses the task on similar areas in different images, thus reducing compu-
tational complexity and increasing accuracy. These salient areas are often
referred to as keypoints.
1.1.1 Saliency Operators
There is no universal or exact definition of what constitutes a salient area, and
the exact definition often depends on the problem or the type of application.
Given that, a salient area is loosely defined as an "interesting" part of an image, and salient areas are used as a starting point for many computer vision algorithms.
Saliency detection is a low-level image processing operation. That is, it is
usually performed as the first operation on an image. If this is part of a
larger algorithm, then the algorithm will typically only examine the image in
the salient areas. Therefore, the overall algorithm will often only be as good
as its saliency detector.
Various algorithms have been suggested in order to find salient areas
in the image. These algorithms use different techniques in order to detect
interest points in an image, such as:
1. Edge Density Detector [4] - This algorithm uses edge detection in
order to find interest points. It calculates the average edge density in
the picture and then selects areas where the edge density varies the
most from the average of the picture. It is motivated by biological studies showing that edge density is one of the features that human vision computes preattentively.
The algorithm works by creating an edge map, and then convolving the
edge map with a Gaussian filter in order to create an edge density map.
The mean density of the image is then calculated and areas where the
edge density differs the most from the mean are considered interesting.
2. Corner detection [9] - Corners are commonly used as salient areas.
For example, a corner based saliency algorithm may use an estimate of the intensity gradient in order to find corners in the picture, using
the fact that corners exhibit strong gradient changes in two orthogonal
directions.
For each pixel the algorithm first calculates the directional derivatives of the intensity along the x and y axes and then computes the covariance matrix of the two values over a local window. If the smallest eigenvalue of the covariance matrix is large enough (exceeds a certain threshold), this implies strong gradient changes in two orthogonal directions and therefore a corner; a minimal sketch of this detector appears after this list.
3. Local symmetry detection [25, 15] - This algorithm tries to find
areas in the picture that are centers of local symmetries. The more symmetric features are found around a point in the picture, the more interesting that point is deemed. The radial symmetry detector works on the
intensity differences in the picture.
The algorithm works by calculating for each point in the picture the
size and direction of the gradient of the intensity in that point. Then,
looking inside a small circle around each point in the picture, it tries
to find pairs of matching points in the same distance and opposite
directions from that point, which exhibit similar size and direction of
the gradient of the intensity. A complex measure is then calculated
taking into account the distance between the paired points, the size
and direction of the gradient of the intensity of both points in the pair.
The higher the sum of the measures over all pairs found inside a small circle around a point, the more interesting that point is deemed.
4. Convexity estimation [27, 26] - This algorithm finds areas of con-
vexity in the picture, using the fact that many salient areas (the hu-
man face for example) feature a three dimensional convex structure.
By managing to detect three dimensional convex structures from a two
dimensional image, this algorithm can find interest points even in noisy or strongly textured backgrounds.
The algorithm works by estimating the argument of the gradient of the
intensity. Since a convex object exhibits a continuous gradient change
there must be a discontinuous ray in the argument of the gradient of the
intensity for that object. The algorithm uses Gaussian filters in order
to estimate the argument of the gradient and find the discontinuous
rays. By rotating the original picture and repeating the calculation, a discontinuous ray is created somewhere else along the convex object. Since all the discontinuous rays start from the center of the object, repeating this process several times (four repetitions, with rotations of 0, 90, 180, and 270 degrees, were shown to be sufficient) and summing the results gives the algorithm a strong response in the middle of the
convex object.
5. Blob detection [16, 17, 18, 14] - Blobs are points or regions in the image that are either brighter or darker than their surroundings. There are two main classes of blob detectors: (i) differential methods based on derivative expressions and (ii) methods based on local extrema in the intensity landscape. In Sec. 2.1 I give a detailed description of the SIFT [14] operator, which uses the Difference of Gaussians differential method.
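To make the corner detector of item 2 concrete, the following is a minimal sketch of a minimum-eigenvalue corner detector in Python. The use of NumPy and SciPy, the smoothing scale, and the threshold are illustrative choices of mine, not details taken from [9]:

import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def corner_response(image, sigma=1.5):
    """Smallest eigenvalue of the local gradient covariance matrix."""
    ix = sobel(image, axis=1)  # directional derivative along x
    iy = sobel(image, axis=0)  # directional derivative along y
    # Entries of the 2x2 covariance matrix, averaged over a local window.
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    # Closed-form smallest eigenvalue of [[ixx, ixy], [ixy, iyy]].
    trace = ixx + iyy
    delta = np.sqrt((ixx - iyy) ** 2 + 4 * ixy ** 2)
    return 0.5 * (trace - delta)

def detect_corners(image, threshold):
    """Pixels whose smallest eigenvalue exceeds the threshold are corners."""
    response = corner_response(image.astype(float))
    ys, xs = np.nonzero(response > threshold)
    return list(zip(xs, ys))

A large smallest eigenvalue guarantees a strong gradient change in both orthogonal directions, which is exactly the corner criterion described above.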
While some of these algorithms are intended for general keypoint detection, others specialize in a specific class of interest points (human faces, for example). Nevertheless, even amongst the general purpose algorithms, each has been shown to more easily detect a different set of regions.
For example, edge based algorithms will usually locate more areas that have
strong edges and will likely miss convex objects in which the edges are usually
weak or blurred, while a convexity estimation algorithm will locate convex
objects easily but will encounter difficulties locating square objects such as
tables or doors.
These saliency algorithms are required to be invariant to changes in illumination, scale, rotation, affine transformations, perspective, and viewing angle (see [20] for a comparison). Because achieving all of these requirements is very hard (and computationally complex), in practice one usually makes do with only some of them. Modern saliency
operators achieve at least invariance to changes in illumination, scale and
rotation.
1.1.2 Keypoint Representation
Since most applications compare keypoints from different images, where the
same object may appear with different illumination, scale, orientation, or
background, keypoints must be represented in a way that will be invariant
to these differences. This representation is called a keypoint descriptor (see [14, 1, 29] and [19] for a comparison). For example, SIFT [14] describes the
keypoint using a weighted histogram of the orientation of the gradient in the
area of the keypoint. In order to compare keypoints from different objects
and images the keypoints are stored in a labeled database and are then used
as the basis for comparing, recognizing and tracking objects.
Even though there are many saliency operators intended to focus an image
processing algorithm on salient areas in the image, state of the art operators
(when used with the parameters recommended by their authors) produce hundreds of keypoints per image, which does not simplify the problem enough.
Since different operators have different strengths and weaknesses, it is com-
mon practice to use a combination of two or more operators [24], yielding
more keypoints. Furthermore, another common practice is to use a dense grid of keypoints (see [2, 3]) instead of a saliency operator, yielding an even larger number of keypoints.
When the number of images and objects grows, the keypoint database becomes very large, which causes both time and space performance problems.
In practice, a large number of the discovered keypoints are not actually helpful to the application (for example, those that belong to the background or to features common to many images and objects). Filtering the database
and removing these redundant features will reduce both time and space com-
plexity of the application.
1.1.3 Metafeatures and Feature Selection
In some of my experiments I used feature selection and feature extraction
techniques. Feature selection is the task of choosing a small subset of fea-
tures that is sufficient to predict the target labels well. The main motivations
for feature selection are computational complexity, reducing the cost of mea-
suring features, improved classification accuracy and problem understanding
(see [8] for an introduction to feature selection). Feature selection is also a
crucial component in the context of feature extraction (see [10]).
In feature extraction the original input features (for example, keypoint
descriptor vector values) are used to generate new, more complicated fea-
tures, referred to as meta-features (for example, the logical AND of subsets of the descriptor vector). Feature extraction is a very useful tool for producing
sophisticated classification rules using simple classifiers. One main problem
here is that the potential number of additional meta-features one can extract
is huge, and the learner needs to decide which of them to include in the
model.
1.1.4 Related Work
Most applications deal with the problem of large keypoint databases either by using a small scale implementation (on the order of hundreds of objects)
to demonstrate their approach [22, 6], or by reducing the complexity of the
keypoint itself. A common approach (see [5]) uses Vector Quantization and K-Means in order to reduce each keypoint to a single word in a relatively small dictionary.
Another approach, described in [14], uses a hash function to approximate
nearest-neighbor lookup. While this approach improves the time performance
of the nearest neighbor search it does not reduce the memory required for a
large database of keypoints.
Despite the large amount of literature on finding and describing keypoints, little attention has yet been given to the problem of directly reducing the number of keypoints or to working with databases that contain thousands of images.
1.2 The Approach
I introduce a framework for filtering keypoints which is suitable for many
computer vision tasks. The main notion of this framework is that an appli-
cation can rank individual keypoints based on their usefulness. I use these
ranks in order to learn the characteristics of keypoints useful to the ap-
plication. Figure 1.1 shows a general scheme of an application that uses
keypoints. First, keypoints are extracted from the image, usually by using a
saliency operator. The keypoints are then coded into descriptors, and then
some application specific processing is done.
Figure 1.1: General scheme of an application that uses keypoints
The framework works in two stages: a training stage and a filtering stage.
In the training stage the target application ranks each keypoint according to
its usefulness. The ranks and the keypoints are used in order to train a
keypoint filter, as shown in Fig. 1.2. For example, in an object recognition
application, the application can rank the keypoints according to their ability
to distinguish between the different objects. Highly distinctive keypoints will
receive high grades and less distinctive keypoints will receive lower grades.
Figure 1.2: The Approach in the Training Stage
In the filtering stage I use the rank based keypoint filter I have built in
the training stage in order to filter out less useful keypoints, which reduces the
number of keypoints the application needs to process, as shown in Fig. 1.3.
Figure 1.3: The Approach in the Filtering Stage
1.3 Structure
The rest of the thesis is structured as follows: Chapter 2 describes in detail
the experimental setting I used, the specific algorithms for keypoint extrac-
tion and representation, the target object recognition application and the
datasets used. Chapter 3 describes my experiments with filtering keypoints
from test images as well as filtering the database itself and thus reducing its
size. I conclude in Chapter 4, summarizing my work and giving some further
research directions.
Chapter 2
Experimental Setting
In order to demonstrate the approach described in Section 1.2 I created an object recognition application roughly based on Lowe's SIFT application described in [14]. In this application keypoints are matched using a nearest neighbor database of keypoint descriptors, where the ratio between the distance to the nearest neighbor and the distance to the nearest neighbor from any other class is used to assess the distinctiveness of the match. For each test image I used a majority vote over the distinctive keypoints in order to find its label.
I used the ALOI dataset [7] in order to train a database of labeled de-
scriptors. I used four training images for each object, taken at 90 degree intervals, as the training set, and another image for each object, taken at 45
degrees from one of the training images as the test set.
2.1 SIFT Saliency Operator
In this section I describe in detail the SIFT algorithm (this description is adapted from http://en.wikipedia.org/wiki/Scale-invariant_feature_transform). SIFT is an algorithm for local feature extraction and descriptor representation; the keypoints are invariant to image scale and rotation. They are also robust to changes in illumination, noise, and minor changes in viewpoint. In addition to these properties, they are highly distinctive and relatively easy to extract, allowing for correct object identification with low probability of mismatch, and are easy to match against a (large) database of local features. Object description by a set of SIFT features is also robust to partial occlusion; it has been shown that as few as 3 SIFT features from an object are enough to compute its location and pose.
The algorithm has four stages.
2.1.1 Scale-Space Extrema Detection
In the first stage keypoints are detected. In order to do this, the image is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred images are taken, producing a Difference of Gaussians (DoG) filter which reacts to blobs in the image. Keypoints are then taken as maxima/minima of the DoG that occur at multiple scales. A DoG image $D(x, y, \sigma)$ is given by

$$D(x, y, \sigma) = L(x, y, k_i\sigma) - L(x, y, k_j\sigma) \qquad (2.1)$$
where $L(x, y, k\sigma)$ is the original image $I(x, y)$ convolved with the Gaussian blur $G(x, y, k\sigma)$ at scale $k\sigma$, i.e.,

$$L(x, y, k\sigma) = G(x, y, k\sigma) * I(x, y) \qquad (2.2)$$
This means that a DoG image between scales $k_i\sigma$ and $k_j\sigma$ is just the difference of the Gaussian-blurred images at scales $k_i\sigma$ and $k_j\sigma$. For scale-space extrema detection in the SIFT algorithm, the image is first convolved with Gaussian blurs at different scales. The convolved images are grouped by octave (an octave corresponds to doubling the value of $\sigma$), and the value of $k_i$ is selected so that we obtain a fixed number of convolved images per octave. Then the Difference-of-Gaussian images are taken from adjacent Gaussian-blurred images in each octave.
Once DoG images have been obtained, keypoints are identified as local
minima/maxima of the DoG images across scales. This is done by comparing
each pixel in the DoG images to its eight neighbors at the same scale and
nine corresponding neighboring pixels in each of the neighboring scales. If
the pixel value is the maximum or minimum among all compared pixels, it
is selected as a candidate keypoint.
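The following sketch illustrates this stage. It is a simplified, single-octave version in Python (the choice of NumPy/SciPy and the parameter values are mine; a real implementation, including Lowe's, processes several octaves and downsamples between them):

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, sigma=1.6, k=2 ** (1 / 3), num_scales=5):
    """Return (x, y, scale index) of local extrema of the DoG stack."""
    # Gaussian-blurred copies at geometrically spaced scales (Eq. 2.2).
    blurred = [gaussian_filter(image, sigma * k ** i) for i in range(num_scales)]
    # Difference-of-Gaussians between adjacent scales (Eq. 2.1).
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(num_scales - 1)])
    keypoints = []
    # Compare each interior pixel to its 26 neighbors in space and scale.
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                v = dog[s, y, x]
                if v == cube.max() or v == cube.min():
                    keypoints.append((x, y, s))
    return keypoints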
This first stage in the algorithm is an approximation of earlier blob de-
tection methods that work by detecting scale-space extrema of the scale nor-
malized Laplacian [11], that is, detecting points that are local extrema with respect to both space and scale. The difference of Gaussians operator can
be seen as an approximation to the Laplacian. Figure 2.1 shows a sample
object with keypoints extracted by SIFT from two different angles.
Figure 2.1: Example operation of SIFT on an image; circles mark the keypoints, showing their scales
2.1.2 Accurate Keypoint Localization
Scale-space extrema detection produces too many keypoint candidates, some
of which are unstable. The second step in the algorithm is to perform a
detailed fit to the nearby data for accurate location, scale, and ratio of prin-
cipal curvatures. This information allows points to be rejected that have low
contrast (and are therefore sensitive to noise) or are poorly localized along
an edge.
Interpolation of nearby data for accurate position
First, for each candidate keypoint, interpolation of nearby data is used to
accurately determine its position. The algorithm calculates the interpolated
location of the maximum, which substantially improves matching and stabil-
ity [14]. The interpolation is done using the quadratic Taylor expansion of
the Difference-of-Gaussian scale-space function $D(x, y, \sigma)$, with the candidate keypoint as the origin. This Taylor expansion is given by

$$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^{T}\mathbf{x} + \frac{1}{2}\mathbf{x}^{T}\frac{\partial^2 D}{\partial \mathbf{x}^2}\mathbf{x} \qquad (2.3)$$
where $D$ and its derivatives are evaluated at the candidate keypoint and $\mathbf{x} = (x, y, \sigma)$ is the offset from this point. The location of the extremum, $\hat{\mathbf{x}}$, is determined by taking the derivative of this function with respect to $\mathbf{x}$ and setting it to zero. If the offset $\hat{\mathbf{x}}$ is larger than 0.5 in any dimension, this is an indication that the extremum lies closer to another candidate keypoint. In this case, the candidate keypoint is changed and the interpolation is performed about that point instead. Otherwise the offset is added to the candidate keypoint to get the interpolated estimate for the location of the extremum.
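A minimal sketch of one such interpolation step, using central differences to estimate the derivatives of $D$ over a DoG stack like the one built in the previous sketch (the array layout is an assumption of this sketch):

import numpy as np

def refine_keypoint(dog, s, y, x):
    """One interpolation step of Eq. 2.3: solve for the offset that
    zeroes the derivative of the quadratic fit around (x, y, scale)."""
    # Central-difference gradient of D with respect to (x, y, scale).
    g = 0.5 * np.array([
        dog[s, y, x + 1] - dog[s, y, x - 1],
        dog[s, y + 1, x] - dog[s, y - 1, x],
        dog[s + 1, y, x] - dog[s - 1, y, x],
    ])
    # Central-difference Hessian (second derivatives and mixed terms).
    h = np.empty((3, 3))
    h[0, 0] = dog[s, y, x + 1] - 2 * dog[s, y, x] + dog[s, y, x - 1]
    h[1, 1] = dog[s, y + 1, x] - 2 * dog[s, y, x] + dog[s, y - 1, x]
    h[2, 2] = dog[s + 1, y, x] - 2 * dog[s, y, x] + dog[s - 1, y, x]
    h[0, 1] = h[1, 0] = 0.25 * (dog[s, y + 1, x + 1] - dog[s, y + 1, x - 1]
                                - dog[s, y - 1, x + 1] + dog[s, y - 1, x - 1])
    h[0, 2] = h[2, 0] = 0.25 * (dog[s + 1, y, x + 1] - dog[s + 1, y, x - 1]
                                - dog[s - 1, y, x + 1] + dog[s - 1, y, x - 1])
    h[1, 2] = h[2, 1] = 0.25 * (dog[s + 1, y + 1, x] - dog[s + 1, y - 1, x]
                                - dog[s - 1, y + 1, x] + dog[s - 1, y - 1, x])
    offset = -np.linalg.solve(h, g)             # the extremum offset
    value = dog[s, y, x] + 0.5 * g.dot(offset)  # D at the offset, used by the 0.03 test below
    return offset, value

If the resulting offset exceeds 0.5 in some dimension, the procedure is repeated around the neighboring candidate, as described above.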
Discarding low-contrast keypoints
To discard keypoints with low contrast, the value of the second-order Taylor expansion $D(\hat{\mathbf{x}})$ is computed at the offset $\hat{\mathbf{x}}$. If this value is less than 0.03 in magnitude, the candidate keypoint is discarded. Otherwise it is kept, with final location $\mathbf{y} + \hat{\mathbf{x}}$ and scale $\sigma$, where $\mathbf{y}$ is the original location of the keypoint at scale $\sigma$.
Eliminating edge responses
The DoG function will have strong responses along edges, even if the can-
didate keypoint is unstable to small amounts of noise. Therefore, in order
to increase stability, we need to eliminate the keypoints that have poorly
determined locations but have high edge responses.
For poorly defined peaks in the DoG function, the principal curvature across the edge would be much larger than the principal curvature along it. Finding these principal curvatures amounts to solving for the eigenvalues of the second-order Hessian matrix, $H$:

$$H = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix} \qquad (2.4)$$

The eigenvalues of $H$ are proportional to the principal curvatures of $D$. It turns out that the ratio of the two eigenvalues, say $\alpha$ the larger one and $\beta$ the smaller one, with ratio $r = \alpha/\beta$, is sufficient for SIFT's purposes. The trace of $H$, i.e. $D_{xx} + D_{yy}$, gives us the sum of the two eigenvalues, while its determinant, i.e. $D_{xx}D_{yy} - D_{xy}^2$, yields their product. The ratio $R = \mathrm{Tr}(H)^2/\mathrm{Det}(H)$ can be shown to be equal to $(r+1)^2/r$, which depends only on the ratio of the eigenvalues rather than their individual values. $R$ is minimal when the eigenvalues are equal to each other. Therefore the higher the absolute difference between the two eigenvalues, which is equivalent to a higher absolute difference between the two principal curvatures of $D$, the higher the value of $R$. It follows that, for some threshold eigenvalue ratio $r_{th}$, if $R$ for a candidate keypoint is larger than $(r_{th}+1)^2/r_{th}$, that keypoint is poorly localized and hence rejected. The current approach uses $r_{th} = 10$.
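In code, this test reduces to a few lines; the following sketch assumes the second derivatives of the DoG image at the keypoint have already been estimated (for example, by the central differences of the previous sketch):

def passes_edge_test(dxx, dyy, dxy, r_th=10.0):
    """Reject keypoints on edges using the curvature ratio of Eq. 2.4."""
    trace = dxx + dyy
    det = dxx * dyy - dxy ** 2
    if det <= 0:  # curvatures of different signs: not a stable extremum
        return False
    # Keep the keypoint only if Tr(H)^2 / Det(H) < (r_th + 1)^2 / r_th.
    return trace ** 2 / det < (r_th + 1) ** 2 / r_th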
2.1.3 Orientation Assignment
In this step, each keypoint is assigned one or more orientations based on
local image gradient directions. This is the key step in achieving invariance to rotation: the keypoint descriptor can be represented relative to this orientation and thereby becomes invariant to image rotation.
First, the Gaussian-smoothed image $L(x, y, \sigma)$ at the keypoint's scale $\sigma$ is taken so that all computations are performed in a scale-invariant manner. For an image sample $L(x, y)$ at scale $\sigma$, the gradient magnitude $m(x, y)$ and orientation $\theta(x, y)$ are precomputed using pixel differences:

$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2} \qquad (2.5)$$

$$\theta(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right) \qquad (2.6)$$
The magnitude and direction calculations for the gradient are done for
every pixel in a neighboring region around the keypoint in the Gaussian-
blurred image L. An orientation histogram with 36 bins is formed, with each
bin covering 10 degrees. Each sample in the neighboring window added to
a histogram bin is weighted by its gradient magnitude and by a Gaussian-
weighted circular window with a σ that is 1.5 times that of the scale of the
keypoint. The peaks in this histogram correspond to dominant orientations.
Once the histogram is filled, the orientations corresponding to the highest
peak and local peaks that are within 80% of the highest peak are assigned
to the keypoint. In the case of multiple orientations being assigned, an
additional keypoint is created having the same location and scale as the
original keypoint for each additional orientation.
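The following sketch computes such an orientation histogram for a single keypoint, using Eqs. 2.5 and 2.6. For simplicity it returns the centers of all bins within 80% of the highest peak rather than interpolated local peaks, and the window radius is an illustrative choice:

import numpy as np
from scipy.ndimage import gaussian_filter

def keypoint_orientations(image, x, y, scale, radius=8):
    """36-bin gradient-orientation histogram around one keypoint."""
    L = gaussian_filter(image.astype(float), scale)
    hist = np.zeros(36)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]   # pixel differences (Eq. 2.5)
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            magnitude = np.hypot(gx, gy)
            theta = np.degrees(np.arctan2(gy, gx)) % 360  # Eq. 2.6
            # Gaussian weight with sigma = 1.5 times the keypoint scale.
            weight = np.exp(-(dx ** 2 + dy ** 2) / (2 * (1.5 * scale) ** 2))
            hist[int(theta // 10) % 36] += magnitude * weight
    # Orientations of all bins within 80% of the highest peak.
    peak = hist.max()
    return [10.0 * b + 5.0 for b in range(36) if hist[b] >= 0.8 * peak]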
Figure 2.2 shows the same sample object from Figure 2.1, this time show-
ing the orientation of the keypoints.
Figure 2.2: Example of SIFT keypoint orientation; arrows mark the scale and orientation of the keypoint
2.1.4 Keypoint Descriptor
The previous stage found keypoint locations at particular scales and assigned
orientations to them. This ensured invariance to image location, scale and
rotation. The final stage computes descriptor vectors for these keypoints
such that the descriptors are highly distinctive and partially invariant to the remaining variations, such as illumination and 3D viewpoint. This step is quite similar to the orientation assignment step.
The feature descriptor is computed as a set of orientation histograms on
(4 x 4) pixel neighborhoods. The orientation histograms are relative to the
keypoint orientation and the orientation data comes from the Gaussian image
closest in scale to the keypoint’s scale. Just like before, the contribution of
each pixel is weighted by the gradient magnitude, and by a Gaussian with
σ 1.5 times the scale of the keypoint. Histograms contain 8 bins each, and
each descriptor contains a 4x4 array of 16 histograms around the keypoint.
This leads to a SIFT feature vector with 4 x 4 x 8 = 128 elements. This
vector is normalized to enhance invariance to changes in illumination.
Although the dimension of the descriptor, i.e. 128, seems high, descriptors
with lower dimension than this do not perform as well across the range of
matching tasks. Longer descriptors continue to do better but not by much
and there is an additional danger of increased sensitivity to distortion and
occlusion. It is also shown that feature matching accuracy is above 50%
for viewpoint changes of up to 50 degrees. Therefore SIFT descriptors are
invariant to minor affine changes.
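A much simplified sketch of the descriptor computation follows. It omits the rotation of the sampling grid and the trilinear interpolation between histogram bins that a full SIFT implementation performs, but it shows how the 4 x 4 x 8 = 128-element vector arises:

import numpy as np
from scipy.ndimage import gaussian_filter

def sift_descriptor(image, x, y, scale, orientation_deg):
    """Simplified 4x4x8 descriptor around a keypoint; illustrative only."""
    L = gaussian_filter(image.astype(float), scale)
    hist = np.zeros((4, 4, 8))
    for dy in range(-8, 8):
        for dx in range(-8, 8):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            mag = np.hypot(gx, gy)
            # Orientation relative to the keypoint orientation (rotation invariance).
            theta = (np.degrees(np.arctan2(gy, gx)) - orientation_deg) % 360
            weight = np.exp(-(dx ** 2 + dy ** 2) / (2 * (1.5 * scale) ** 2))
            cell_y, cell_x = (dy + 8) // 4, (dx + 8) // 4  # one of 16 areas
            hist[cell_y, cell_x, int(theta // 45) % 8] += mag * weight
    vec = hist.ravel()                      # 4 x 4 x 8 = 128 elements
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec  # normalized for illumination invariance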
Figure 2.3 shows an example of a SIFT descriptor and the corresponding
image patch it was calculated from.
Figure 2.3: Example of SIFT descriptor representation. Top left image shows the descriptor, arrows marking the magnitude of descriptor values in each area and direction. Bottom left image shows the original image with the keypoint marked as a circle. Top right image shows the image patch used to calculate the descriptor. Bottom right image shows the same image patch after blurring with the relevant Gaussian.
2.2 Amsterdam Library of Object Images
Amsterdam Library of Object Images (ALOI) [7] is a color image collection
of one-thousand small objects, recorded for scientific purposes. In order to
capture the sensory variation in object recordings, ALOI systematically varies
viewing angle, illumination angle, and illumination color for each object, and
additionally captures wide-baseline stereo images. ALOI consists of over a
hundred images of each object, yielding a total of 110,250 images for the
collection, occupying 140GB of disk space (uncompressed tiff, 60GB lossless
compressed png). Figure 2.4 shows some example objects from the ALOI
dataset.
The data set offers a testing and evaluation ground for a variety of com-
puter vision algorithms, amongst others: object recognition, pose estimation,
color constancy, invariant feature extraction, stereo algorithms, super reso-
lution from multiple recordings, and image retrieval systems.
In my research I used the object viewpoint collection part of the ALOI
data set. Object viewpoint collection consists of 72 images of each object
taken with the same illumination from 72 directions. A frontal camera was
used to record 72 aspects of the objects by rotating the object in the plane
at 5 degree resolution. This collection is similar to the COIL [21] collection.
Figure 2.5 shows an example of the object view point collection for one object.
Since my application expects gray scale images I used the gray scale
Figure 2.4: Example objects from ALOI data set
variant of the ALOI dataset. For SIFT performance and computational reasons I found the images with a resolution of 364 x 288 pixels best suited to my needs.
Figure 2.5: Object Viewpoint Collection for Object Number 307
2.3 Application Structure
2.3.1 Training
The object recognition application I used works as follows. Let $I_{i,j}$ be the $j$-th training image of object $i$. I use SIFT in order to extract keypoints from the image and represent them as keypoint descriptors. I denote by $D(I_{i,j})$ the descriptors extracted from image $I_{i,j}$. For each descriptor $d \in D(I_{i,j})$ in the training set I define the correct label of the keypoint, $l(d)$, by

$$l(d) = i \iff d \in D(I_{i,j}) \text{ for some } j \qquad (2.7)$$
Let $n$ be the number of objects and $k$ the number of training images per object. I define

$$T = \bigcup_{i=1..n,\, j=1..k} D(I_{i,j}) \qquad (2.8)$$

to be the training set, and

$$DB_T = \{(d, l(d)) \mid d \in T\} \qquad (2.9)$$

to be the training database, a database of keypoint descriptors and their respective labels.
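A minimal sketch of this training stage as defined by Eqs. 2.7-2.9; here extract_descriptors is a hypothetical stand-in for the SIFT pipeline of Sec. 2.1, assumed to return one 128-dimensional row per keypoint:

import numpy as np

def build_training_database(extract_descriptors, training_images):
    """Build DB_T: every training descriptor paired with its object label.
    training_images[i][j] is image I_{i,j}."""
    descriptors, labels = [], []
    for i, images_of_object in enumerate(training_images):
        for image in images_of_object:
            d = extract_descriptors(image)   # D(I_{i,j})
            descriptors.append(d)
            labels.extend([i] * len(d))      # l(d) = i  (Eq. 2.7)
    return np.vstack(descriptors), np.array(labels)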
2.3.2 Testing
Given a test image $I^t_i$ I use SIFT in order to extract keypoints from the image and represent them as keypoint descriptors. I denote by $D(I^t_i)$ the descriptors extracted from $I^t_i$ and find the label of $I^t_i$ in the following way:

1. For each $d \in D(I^t_i)$ I denote by $l(T, d)$ the label of the nearest neighbor of $d$ in $T$, and find it by searching $DB_T$.

2. For each $d \in D(I^t_i)$ I calculate the distinctiveness ratio suggested in [14]:

$$r(d) = \frac{\text{distance to the nearest descriptor in } T \text{ not labeled } l(T, d)}{\text{distance to the nearest descriptor in } T \text{ labeled } l(T, d)} \qquad (2.10)$$

3. I calculate the label of $I^t_i$:

$$l(I^t_i) = \text{majority}\{l(T, d) \mid d \in D(I^t_i) \text{ and } r(d) > 1.5\} \qquad (2.11)$$
For each descriptor d in the test set, if r(d) > 1.5 I say that d was
matched. If d was matched and l(T, d) = l(d) I say that d was matched
correctly. Figure 2.6 shows an example of two correctly matching descriptors
and their corresponding images.
Figure 2.6: Example of correct matching. Top images show the descriptors. Bottom images show the original images with the keypoints marked as circles.
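The testing stage of Eqs. 2.10-2.11 can be sketched as follows; a brute-force nearest neighbor search is used for clarity, although any exact search structure would give the same result:

import numpy as np
from collections import Counter

def classify_image(test_descriptors, db_descriptors, db_labels, threshold=1.5):
    """Label a test image by majority vote over distinctive matches.
    db_descriptors: (N, 128) array; db_labels: (N,) array of object labels."""
    votes = []
    for d in test_descriptors:
        dist = np.linalg.norm(db_descriptors - d, axis=1)
        label = db_labels[np.argmin(dist)]         # l(T, d)
        same = dist[db_labels == label].min()      # nearest, labeled l(T, d)
        other = dist[db_labels != label].min()     # nearest with any other label
        if same > 0 and other / same > threshold:  # r(d) > 1.5: d is matched
            votes.append(label)
    # Majority vote over the distinctive (matched) descriptors (Eq. 2.11).
    return Counter(votes).most_common(1)[0][0] if votes else None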
2.4 Application Performance
Let’s look at the performance of the application without filtering or reducing
the database size. Table 2.1 shows general statistics of the databases for 100,
200, and 500 objects. We can see that the percentage of matched keypoints is 16%-21% and the percentage of correctly matched keypoints is 13%-19%. This shows that a large percentage of the keypoints that were matched had a correct match. The distinctiveness ratio described in Eq. 2.10 is responsible for this high accuracy. We can also see that at most about 21% of the keypoints were matched, showing the potential of using filtering in order to improve the performance of the application.
Table 2.1: General statistics of the test databases. Columns 1-3: quantities of objects, images, and descriptors. Columns 4-6: percent of matched descriptors, percent of correctly matched descriptors, and percent of correctly matched images, using a distinctiveness ratio of 1.5.

# of objects   # of images   # of descriptors   % descriptors   % descriptors       % images
                                                matched         matched correctly   matched correctly
100            400           71214              19.7            17.6                78
200            800           122252             21.2            18.6                68.5
500            2000          278571             16.3            13.6                61.4
Chapter 3
Experiments and Results
3.1 Filtering Keypoints using K Nearest Neighbors
In this section I describe the first experiment I performed. In this experiment I used extra training data in order to train a K nearest neighbor database that I then used to filter out keypoints from subsequent test data. Although one can argue about the usefulness of using one nearest neighbor search in order to save work for another, all I intend to show in this experiment is the potential of filtering keypoints. I show that the extra filtering step can decrease the number of keypoints the original application has to deal with, with only a small reduction in performance.
3.1.1 Experiment
I start with the application described in Sec. 2.3. After finishing the training stage (and building the training set $T$ and the training database $DB_T$) I introduce an additional stage. I select one additional image $I^+_i$ for each object and use SIFT in order to extract keypoints from the image and represent them as keypoint descriptors.

For each descriptor $d \in D(I^+_i)$ I find $l(T, d)$, the label of the nearest neighbor of $d$ in $T$, and define the following correctness measure:

$$m_c(d) = \begin{cases} 1 & l(T, d) = l(d) \\ -1 & l(T, d) \neq l(d) \end{cases} \qquad (3.1)$$
First, I tried to identify meta-features of the descriptor that highly correlate with the correctness measure $m_c(d)$. I did this by calculating correlations between the correctness measure and different properties of the descriptor, such as the average, median, or variance of different subsets of the descriptor values. I combined several meta-features into a meta-descriptor $\vec{e}(d)$; notice that with $\vec{e}(d) = d$ the descriptor itself can be considered a specific case of a trivial meta-descriptor.

I then use the set $\{(\vec{e}(d), m_c(d))\}$ of meta-descriptors and their correctness measures from all additional training images $\{I^+_i\}_{i=1..n}$ as the basis for a K nearest neighbor regression database $DB^+$ (I tested the algorithm with databases of 3 and 5 nearest neighbors).
Now, for each test image $I^t_i$ I do the following:

1. For each $d \in D(I^t_i)$ I calculate $\vec{e}(d)$ and find the regression value $m^+(d)$ using $DB^+$.

2. I filter out all the descriptors whose regression value is lower than some constant:

$$D^+(I^t_i) = \{d \in D(I^t_i) \mid m^+(d) > \phi\} \text{ for some } \phi \qquad (3.2)$$

3. I then perform the rest of the testing stage described in Sec. 2.3, substituting $D^+(I^t_i)$ for $D(I^t_i)$ and calculating $l(T, d)$ and $r(d)$ for each $d \in D^+(I^t_i)$ in order to find $l(I^t_i)$.
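A minimal sketch of this filtering scheme follows. The use of scikit-learn's K nearest neighbor regressor is my own illustrative choice, not necessarily the implementation used in this work, and the threshold phi is arbitrary:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def train_keypoint_filter(meta_descriptors, correctness, k=3):
    """DB+: regression from meta-descriptors e(d) to the correctness
    measure m_c(d) in {-1, +1} of Eq. 3.1."""
    return KNeighborsRegressor(n_neighbors=k).fit(meta_descriptors, correctness)

def filter_test_descriptors(knn, test_descriptors, phi=0.0):
    """Keep only descriptors whose regression value exceeds phi (Eq. 3.2).
    The trivial meta-descriptor e(d) = d is assumed here; otherwise apply
    e() to the descriptors before predicting."""
    scores = knn.predict(test_descriptors)  # m+(d) for each descriptor
    return test_descriptors[scores > phi]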
3.1.2 Results
In order to identify meta-features that correlate with the correctness measure, I systematically looked at the correlation for the following properties of descriptors:
1. Descriptor values with intensity bigger or smaller than some value.

2. Descriptor values whose values are in some range (between a and b, for some fixed a and b).

3. Descriptor values which represent the gradient in a specific direction, or a combination of directions.

4. Descriptor values which were taken from a specific area, or a combination of areas (from the 16 subareas the descriptor is composed of, see Sec. 2.1).
I used these properties in combination with various statistical properties of the selected descriptor values:
1. Number of descriptor values with the properties.
2. Sum of the intensities of descriptor values with the properties.
3. Sum of the square of the intensities of descriptor values with the prop-
erties.
4. Median of the intensities of descriptor values with the properties.
These combinations produced thousands of meta-feature candidates, from which I used those with the highest correlation values. Unfortunately, the correlation values were not very high, but since the number of descriptors used to compute the correlation was very large, even lower correlation values were significant. Figures 3.1 and 3.2 show densities of meta-feature values for correctly and incorrectly matched descriptors; although two different distributions can be seen, the separation is not good enough for a single meta-feature. I hoped that using a K nearest neighbor database over a number of features would produce better results.
Although I used many combinations of meta-features as meta-descriptors, the results of using the descriptor values directly were generally better. Nevertheless, I believe meta-features may be used to reduce the dimensionality of the K nearest neighbor database (from the descriptor dimension of 128), but this is out of the scope of this thesis. In order to demonstrate the use of meta-descriptors when presenting the results of this experiment, I chose one specific meta-descriptor which showed good results. Table 3.1 details the specific meta-features used in the meta-descriptor (which are the same meta-features whose densities are shown in Fig. 3.1 and Fig. 3.2).

Figure 3.1: Experiment 1: Meta-features density of correctly and incorrectly matched descriptors. Left - median value of all descriptor values that represent gradient in directions up, down, left, or right, in all areas (Correlation = 0.23). Right - sum of all descriptor values in the 12 peripheral areas, in all directions of gradient (Correlation = 0.18).
Figure 3.2: Experiment 1: Meta-features density of correctly and incorrectly matched descriptors. Left - number of descriptor values in the range 0.02 to 0.24 (Correlation = 0.21). Right - sum of squares of descriptor values bigger than 0.12 (Correlation = -0.19).
Table 3.1: Meta-features with relatively high correlation to the correctness of the match, used in order to build the meta-descriptor. Corr. - correlation of the meta-feature with the correctness measure. P-Value - probability of the hypothesis of no correlation.

Function                                                                   Corr.   P-Value
Median of gradient in directions up, down, left, or right, in all areas    0.23    0.00
Sum of descriptor values in the 12 peripheral areas, in all directions
of gradient                                                                 0.18    0.00
Number of descriptor values in the range 0.02 to 0.24                      0.21    0.00
Sum of squares of descriptor values bigger than 0.12                      -0.19    0.00
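For concreteness, the following sketch computes these four meta-features from a 128-dimensional descriptor. The mapping of the vector onto 16 areas and 8 orientation bins (a 4x4 grid whose outer 12 cells are peripheral, with every second bin taken as an axis-aligned direction) is an assumption of this sketch about the descriptor layout:

import numpy as np

def meta_descriptor(d):
    """The four meta-features of Table 3.1; layout assumptions as noted above."""
    v = np.asarray(d, dtype=float).reshape(4, 4, 8)  # (row, col, orientation)
    # Bins 0, 2, 4, 6 taken as the four axis-aligned gradient directions.
    median_axis = np.median(v[:, :, ::2])
    # Peripheral areas: everything except the central 2x2 block.
    central = np.zeros((4, 4), dtype=bool)
    central[1:3, 1:3] = True
    peripheral_sum = v[~central].sum()
    in_range = np.count_nonzero((v >= 0.02) & (v <= 0.24))
    big_sq_sum = (v[v > 0.12] ** 2).sum()
    return np.array([median_axis, peripheral_sum, in_range, big_sq_sum])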
In order to assess my results I used an average of 20 random filterings of the image descriptors as a reference. When looking at the results of the object recognition application described in Sec. 3.1.1 for 500 objects, we can see in Fig. 3.3 that random filtering, on average, performs rather well by itself, losing less than 10% accuracy when filtering 70% of the database, showing the potential of filtering keypoints. We can also see that when filtering up to 60% of the database my approach gives similar results in a predictable way.
Figure 3.3: Experiment 1: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, using descriptor values and meta-features, compared to averaged random filtering
In Fig. 3.4 we can see that when filtering 70%-90% of the database using my approach, we achieve the same accuracy as the average random filtering with half as many descriptors (twice as much filtering). For example, in order to match 47% of the images, random filtering can filter 77% of the database, leaving 23% of the descriptors, while using my approach I can filter 88%, leaving only 12%.

Figure 3.5 shows similar results for 200 objects.
Figure 3.4: Experiment 1: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, using descriptor values and meta-features, compared to averaged random filtering, when filtering 70%-90% of the database, showing that I achieve the same accuracy as the average random filtering with less than half of the descriptors.
Figure 3.5: Experiment 1: Percent of matched images vs. filtering of the database for 200 objects, 3 and 5 nearest neighbors, using descriptor values and meta-features, compared to averaged random filtering
3.2 Reducing Keypoint Database Size using Additional Training Data
In this section I describe the second experiment I performed. I used the same extra training data as in the last experiment, but this time, instead of performing filtering on the test images, I filtered the original database itself. Filtering the database is performed only once, and therefore the complexity of the filtering stage itself is less important, making it suitable for the computationally heavy K nearest neighbor algorithm. Again, I show I can significantly reduce the database size with only a small reduction in performance.

Following the results of the last experiment I abandoned the use of meta-descriptors and used only the descriptor vector as the basis for the K nearest neighbor database.
3.2.1 Experiment
I start by building $DB^+$ as described in Sec. 3.1, this time using only $\vec{e}(d) = d$ as the database key. Then I build the reduced training set $T^+$ and training database $DB_{T^+}$ in the following way:

1. For each descriptor $d$ in the training set $T$ I find the regression value $m^+(d)$ using $DB^+$.

2. I build the reduced training set $T^+$:

$$T^+ = \{d \in T \mid m^+(d) > \phi\} \text{ for some } \phi \qquad (3.3)$$

3. I build the reduced training database $DB_{T^+}$:

$$DB_{T^+} = \{(d, l(d)) \mid d \in T^+\} \qquad (3.4)$$
I then perform the testing stage described in Sec. 2.3, substituting $T^+$ for $T$ and calculating $l(T^+, d)$ and $r(d)$ for each $d \in D(I^t_i)$ in order to find $l(I^t_i)$.
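A compact sketch of this database reduction step (again with scikit-learn as an illustrative choice of mine; phi is an arbitrary threshold):

from sklearn.neighbors import KNeighborsRegressor

def reduce_training_database(extra_descriptors, correctness,
                             db_descriptors, db_labels, phi=0.0, k=3):
    """Sketch of Eqs. 3.3-3.4: build DB+ on the raw descriptors (e(d) = d)
    and keep only training descriptors with regression value above phi.
    Done once, offline, so the cost of the extra search is amortized."""
    knn = KNeighborsRegressor(n_neighbors=k).fit(extra_descriptors, correctness)
    keep = knn.predict(db_descriptors) > phi      # m+(d) > phi
    return db_descriptors[keep], db_labels[keep]  # T+ and its labels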
3.2.2 Results
In order to assess my results I used an average of 20 random filterings of the descriptor database as a reference. When looking at the results of the object recognition application described in Sec. 3.2.1 for 500 objects, we can see in Fig. 3.6 that, again, random filtering, on average, performs rather well by itself, losing only 5% accuracy when filtering 70% of the database, showing the potential for reducing the database size. We can also see that when filtering up to 70% of the database my approach gives similar results in a predictable way.
In Fig. 3.7 we can see that when filtering 70%-95% of the database using my approach, we achieve the same accuracy as the average random filtering with 2/3 of the database size. For example, in order to match 48% of the images, random filtering can filter 81% of the database, leaving 19% of the descriptors, while using my approach I can filter 89%, leaving only 11%.

Figure 3.6: Experiment 2: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, compared to averaged random filtering
Figure 3.8 shows that for 200 objects the results were very similar to those for 500 objects, but when running the experiment with 100 objects I did not achieve any significant improvement over the random reference. This result concurs with my assumption that keypoint filtering is beneficial for large keypoint databases.

Next, let's look at how filtering has affected the number of correctly matched descriptors. Figure 3.9 shows the result of filtering for 500 objects in relative percentage over the random reference. The main result these figures show is that when filtering 70%-95% of the database, my filtering technique gives an increase of 5%-10% in correctly matched descriptors relative to random filtering. Figure 3.10 shows similar results for 100 and 200 objects.

Figure 3.7: Experiment 2: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, compared to averaged random filtering, when filtering 70%-95% of the database, showing that I achieve the same accuracy as the average random filtering with less than 2/3 of the database size.
Figure 3.8: Experiment 2: Percent of matched images vs. filtering of the database for 100 objects (left) and 200 objects (right), 3 and 5 nearest neighbors, compared to averaged random filtering
Figure 3.9: Experiment 2: Percent of matched descriptors relative to the random reference vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors
Figure 3.10: Experiment 2: Percent of matched descriptors relative to the random reference vs. filtering of the database for 100, 200, and 500 objects, 3 nearest neighbors
3.3 Reducing Keypoint Database Size using Leave One Out
Looking at the last experiment it can be argued that we don’t always have
the extra training data, or if we do, we had better use it to enrich the original
training set. In this section I describe an experiment that performs database
size reduction without using extra training data. I use leave one out cross
validation in order to train a K nearest neighbor database and then use it in
order to reduce the database size.
3.3.1 Experiment
I start by building the training set $T$ and training database $DB_T$ as described in Sec. 2.3. For each $j$ I then break $T$ into two parts: $R_j$, the descriptors extracted from the $j$-th training image of each object, and $T_j$, the rest of the descriptors in the training set. Formally:

$$R_j = \bigcup_{i=1..n} D(I_{i,j}) \quad \text{and} \quad T_j = T \setminus R_j$$

I calculate $m_c^j(d)$ for each $d \in R_j$, using $T_j$ as the training set, as described in (3.1). At this stage I use the set $\{(d, m_c^j(d))\}$ of descriptors in $R_j$ and their correctness measures as the basis for a K nearest neighbor regression database (again, I tested the algorithm with 3 and 5 nearest neighbors) and find the regression value $m_j(d)$ for each $d \in T_j$.

Finally, I calculate the mark $m(d)$ for each descriptor $d \in D(I_{i,j})$:

$$m(d) = \sum_{j' \neq j} m_{j'}(d) \qquad (3.5)$$

I then create the filtered training set $T'$ from $T$ in the following way:

$$T' = \{d \in T \mid m(d) > \phi\} \text{ for some } \phi \qquad (3.6)$$

And the filtered training database $DB_{T'}$:

$$DB_{T'} = \{(d, l(d)) \mid d \in T'\} \qquad (3.7)$$
I then perform the testing stage described in Sec. 2.3, substituting $T'$ for $T$ and calculating $l(T', d)$ and $r(d)$ for each $d \in D(I^t_i)$ in order to find $l(I^t_i)$.
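The following sketch outlines the leave one out procedure. Here image_ids[i] is assumed to hold the within-object index j of the training image that descriptor i was extracted from, so that the mask image_ids == j selects R_j; the names, array layout, and use of scikit-learn are assumptions of this sketch:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def leave_one_out_marks(descriptors, labels, image_ids, k=3):
    """Accumulate the marks m(d) of Eq. 3.5 over all held-out images."""
    marks = np.zeros(len(descriptors))
    for j in np.unique(image_ids):
        held = image_ids == j                 # R_j
        rest = ~held                          # T_j
        # Correctness of each held-out descriptor with respect to T_j (Eq. 3.1).
        mc = np.empty(held.sum())
        for n, d in enumerate(descriptors[held]):
            dist = np.linalg.norm(descriptors[rest] - d, axis=1)
            nearest_label = labels[rest][np.argmin(dist)]
            mc[n] = 1.0 if nearest_label == labels[held][n] else -1.0
        # Regress the correctness onto the remaining descriptors and
        # accumulate their marks; each d in image j' ends up summed over j != j'.
        knn = KNeighborsRegressor(n_neighbors=k).fit(descriptors[held], mc)
        marks[rest] += knn.predict(descriptors[rest])
    return marks

# The filtered training set T' of Eq. 3.6 is then descriptors[marks > phi].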
3.3.2 Results
Figure 3.11: Experiment 3: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, compared to averaged random filtering
Again, in order to assess my results I used the same average of 20 random filterings of descriptors as a reference. When looking at the results of the object recognition application described in Sec. 3.3.1 for 500 objects, we can see in Fig. 3.11 that this experiment shows results similar to the last one: when filtering up to 70% of the database my approach gives similar results in a predictable way. In Fig. 3.12 we can see that when filtering 70%-95%
Figure 3.12: Experiment 3: Percent of matched images vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors, compared to averaged random filtering, when filtering 70%-95% of the database, showing that I achieve the same accuracy as the average random filtering with less than 2/3 of the database size.
of the database using my approach, we achieve, again, the same accuracy as the average random filtering with 2/3 of the database size. For example, in order to match 45% of the images, random filtering can filter 84% of the database, leaving 16% of the descriptors, while using my approach I can filter 91%, leaving only 9%.

Next, let's look at how filtering has affected the number of correctly matched descriptors. Figure 3.13 shows the result of filtering for 500 objects in relative percentage over the random reference. The main result these figures show is that when filtering 70%-95% of the database, my filtering technique gives an increase of 5%-20% in correctly matched descriptors relative to random filtering.

Figure 3.13: Experiment 3: Percent of matched descriptors relative to the random reference vs. filtering of the database for 500 objects, 3 and 5 nearest neighbors
Finally, when looking at the effect of the method on databases of different sizes, Fig. 3.14 shows that for 100 and 200 objects this experiment did not produce results that were significantly better than those of the random reference. Figure 3.15 compares results for databases of different sizes, showing that when the database size was increased to 500 objects the experiment showed much better results than with a database size of 100 or 200 objects, emphasizing the benefit of my approach for large databases.
Figure 3.14: Experiment 3: Percent of matched images vs. filtering of the database for 100 objects (left) and 200 objects (right), 3 and 5 nearest neighbors, compared to averaged random filtering
Figure 3.15: Experiment 3: Percent of matched descriptors relative to the random reference vs. filtering of the database for 100, 200, and 500 objects, 3 nearest neighbors
Chapter 4
Conclusions and Future work
4.1 Conclusions
In this thesis I have proposed a new approach for reducing the complexity of keypoint-based computer vision applications by filtering keypoints, based on learning which descriptors are beneficial to the application. I performed three experiments demonstrating this approach, all based on an object recognition application that uses a keypoint database:
1. Filtering keypoints using a K nearest neighbors database that represents the ability of each keypoint to produce correct results for the application.
2. Filtering keypoints from the keypoint database, thus reducing its size,
using a K nearest neighbors database that requires additional training
data.
3. Filtering keypoints from the keypoint database, thus reducing its size, using a K nearest neighbors database based on a leave-one-out method, without using additional training data.
In all of the experiments I showed that I can significantly reduce the number of keypoints used, with only a small reduction in performance. I also showed that my approach can achieve the same performance as the average of random filterings with about 2/3 of the database size.
4.2 Future Work
In the future, this approach can be further developed in the following ways:
1. Adapting the approach to other applications in computer vision; two promising candidate areas are object class recognition and object tracking.
2. Using the approach with keypoint databases that serve more than one purpose, applying a different filter for each purpose. The purposes can be similar (such as two object recognition tasks that recognize different types of objects) or completely different (such as an object recognition task and an object tracking task). My assumption is that the two tasks will require different subsets of the database, which this approach can provide.
3. Upscaling the database size even further. It would be interesting to see how this application behaves with databases of millions of descriptors; in such cases filtering is crucial due to the large amount of space needed to store the descriptors.
4. Using more time- and space-efficient machine learning techniques. Most of my research concentrated on K nearest neighbor learning as a simple way to assess the potential of my approach. It would be interesting to see how more elaborate and efficient machine learning techniques, such as support vector machines, or a more elaborate use of meta-features, would perform with this approach (a speculative sketch follows this list).
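As a purely speculative sketch of this last direction, an SVM could replace the kNN usefulness predictor roughly as follows. Every name here is an assumption, including the choice of meta-features; scikit-learn is my choice of library and nothing in the thesis prescribes it.

```python
from sklearn.svm import SVC

def train_usefulness_svm(meta_features, was_useful):
    """Train an SVM to predict keypoint usefulness from meta-features.

    meta_features: (n, f) array of per-keypoint properties (e.g. scale,
                   contrast, edge response) - illustrative choices only
    was_useful:    (n,) boolean array marking keypoints that produced
                   correct matches during training
    """
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(meta_features, was_useful)
    return clf

# Filtering would then keep keypoints the SVM deems useful, replacing the
# score m(d) of Eq. (3.6) with the SVM's estimated probability:
#   keep = clf.predict_proba(new_meta_features)[:, 1] > phi
```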