An affine invariant salient region detector · 2004. 11. 26. · Timor Kadir, Andrew Zisserman, and Michael Brady, Department of Engineering Science, University of Oxford, Oxford, UK

An affine invariant salient region detector

Timor Kadir, Andrew Zisserman, and Michael Brady

Department of Engineering Science, University of Oxford,

Oxford, UK. timork,az,[email protected]

Abstract. In this paper we describe a novel technique for detecting salient regions in an image. The detector is a generalization to affine invariance of the method introduced by Kadir and Brady [10]. The detector deems a region salient if it exhibits unpredictability in both its attributes and its spatial scale. The detector has significantly different properties to operators based on kernel convolution, and we examine three aspects of its behaviour: invariance to viewpoint change; insensitivity to image perturbations; and repeatability under intra-class variation. Previous work has, on the whole, concentrated on viewpoint invariance. A second contribution of this paper is to propose a performance test for evaluating the two other aspects. We compare the performance of the saliency detector to other standard detectors including an affine invariant interest point detector. It is demonstrated that the saliency detector has comparable viewpoint invariance performance, but superior insensitivity to perturbations and intra-class variation performance for images of certain object classes.

1 Introduction

The selection of a set of image regions forms the first step in many computer vision algorithms, for example for computing image correspondences [2, 17, 19, 20, 22], or for learning object categories [1, 3, 4, 23]. Two key issues face the algorithm designer: the subset of the image selected for subsequent analysis and the representation of the subset. In this paper we concentrate on the first of these issues. The optimal choice for region selection depends on the application. However, there are three broad classes of image change under which good performance may be required:

1. Global transformations. Features should be repeatable across the expected class of global image transformations. These include both geometric and photometric transformations that arise due to changes in the imaging conditions. For example, region detection should be covariant with viewpoint as illustrated in Figure 1. In short, we require the segmentation to commute with viewpoint change.

2. Local perturbations. Features should be insensitive to classes of semi-local image disturbances. For example, a feature responding to the eye of a human face should be unaffected by any motion of the mouth. A second class of disturbance is


Fig. 1. Detected regions, illustrated by a centre point and boundary, should commute with viewpoint change – here represented by the transformation H.

where a region neighbours a foreground/background boundary. The detector can be required to detect the foreground region despite changes in the background.

3. Intra-class variations. Features should capture corresponding object parts under intra-class variations in objects. For example, the headlight of a car for different brands of car (imaged from the same viewpoint).

In this paper we make two contributions. First, in Section 2 we describe extensions to the region detector developed by Kadir and Brady [10]. The extensions include covariance to affine transformations (the first of the requirements above), and an improved implementation which takes account of anti-aliasing. The performance of the affine covariant region detector is assessed in Section 3 on standard test images, and compared to other state-of-the-art detectors.

The second contribution is in specifying a performance measure for the two other requirements above, namely tolerance to local image perturbations and to intra-class variation. This measure is described in Section 4 and, again, performance is compared against other standard region operators.

Previous methods of region detection have largely concentrated on the first requirement. So-called corner features or interest points have had wide application for matching and recognition [7, 21]. Recently, inspired by the pioneering work of Lindeberg [14], scale and affine adapted versions have been developed [2, 18–20]. Such methods have proved to be robust to significant variations in viewpoint. However, they operate with relatively large support regions and are potentially susceptible to semi-local variations in the image; for example, movements of objects in a scene. They fail on criterion 2.

Moreover, such methods adopt a relatively narrow definition of saliency and scale; scale is usually defined with respect to a convolution kernel (typically a Gaussian) and saliency to an extremum in filter response. While it is certainly the case that there are many useful image features that can be defined in such a manner, efforts to generalise such methods to capture a broader range of salient image regions have had limited success.


[Figure 2 plots: entropy over scale (axes: Scale vs Entropy) for an image patch and permutations of its pixels; each has entropy 2.8411 at the maximum scale. Panels (a) and (b).]

Fig. 2. (a) Complex regions, such as the eye, exhibit unpredictable local intensity, hence high entropy. Image from NIST Special Database 18, Mugshot Identification Database. However, entropy is invariant to permutations of the local patch (b).

Other methods have extracted affine covariant regions by analysing the image isocontours directly [17, 22] in a manner akin to watershed segmentation. Related methods have been used previously to extract features from mammograms [13]. Such methods have the advantage that they do not rely on excessive smoothing of the image and hence capture precise object boundaries. Scale here is defined in terms of the image isocontours rather than with respect to a convolution kernel or sampling window.

2 Information theoretic saliency

In this section we describe the saliency region detector. First, we review the approach of Kadir and Brady [10], then in Section 2.2 we extend the method to be affine invariant, and give implementation details in Sections 2.3 and 2.4.

2.1 Similarity invariant saliency

The key principle underlying the Kadir and Brady approach [10] is that salient image regions exhibit unpredictability, or 'surprise', in their local attributes and over spatial scale. The method consists of three steps: I. Calculation of Shannon entropy of local image attributes (e.g. intensity or colour) over a range of scales — HD(s); II. Select scales at which the entropy over scale function exhibits a peak — sp; III. Calculate the magnitude change of the PDF as a function of scale at each peak — WD(s). The final saliency is the product of HD(s) and WD(s) at each peak. The histogram of pixel values within a circular window of radius s is used as an estimate of the local PDF. Steps I and III measure the feature-space and the inter-scale predictability respectively, while step II selects optimal scales. We discuss each of these steps next.


[Figure 3 plots: (b) entropy HD(s, x) and (c) inter-scale saliency WD(s, x) as functions of scale; the two marked saliency values are YD(s, x) = 0.79 and YD(s, x) = 0.13. Panels (a), (b), (c).]

Fig. 3. The two entropy peaks shown in (a) correspond to the centre (in blue) and edge (in red) points in the top image. Both peaks occur at similar magnitudes.

The entropy of local attributes measures the predictability of a region with respect to an assumed model of simplicity. In the case of entropy of pixel intensities, the model of simplicity corresponds to a piecewise constant region. For example, in Figure 2(a), at the particular scales shown, the PDF of intensities in the cheek region is peaked. This indicates that most of these pixels are highly predictable, hence entropy is low. However, the PDF in the eye region is flatter, which indicates that here pixel values are highly unpredictable and this corresponds to high entropy.

In step II, scales are selected at which the entropy is peaked. Through searching for such extrema, the feature-space saliency is locally optimised. Moreover, since entropy is maximised when the PDF is flat, i.e. all present attribute values are in equal proportion, such peaks typically occur at scales where the statistics of two (or more) different pixel populations contribute equally to the PDF estimate. Figure 3(b) shows entropy as a function of scale for two points in Figure 3(a). The peaks in entropy occur at scales for which there are equal proportions of black and white pixels present. These significant, or salient, scales in the entropy function (analogous to the 'critical points' in Gaussian scale-space [11, 15]) serve as useful reference points since they are covariant with isotropic scaling, invariant to rotation and translation, and robust to small affine shears.
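The equal-proportion observation can be checked numerically: for a two-valued patch, the histogram entropy is the binary entropy −p log2 p − (1 − p) log2(1 − p), which is maximised when the two pixel populations are in equal proportion (p = 0.5). A minimal illustration (the helper name is ours, not part of the authors' implementation):

```python
import math

def binary_entropy(p):
    """Shannon entropy (bits) of a two-bin histogram with proportions p and 1-p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy grows as the window expands to admit equal black/white proportions,
# then falls again as one population dominates.
proportions = [0.1, 0.3, 0.5, 0.7, 0.9]
entropies = [binary_entropy(p) for p in proportions]
assert max(entropies) == binary_entropy(0.5) == 1.0
```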

Note, however, that the peaks for both points in Figure 3(b) attain an almost identical magnitude. This is to be expected since both patches contain almost identical proportions of black and white pixels. In fact, since histogramming destroys all local ordering information, all permutations of the local patch do not affect its entropy. Figure 2(b) shows the entropy over scale function for an image patch taken from 2(a) and three permutations of its pixels: a linear ramp, a random reordering and a radial gradient. The entropy at the maximum scale (that of the whole patch) is the same for all permutations. However, the shape of the entropy function is quite different for each case.

The role of Step III, the inter-scale unpredictability measure WD, is to weight the entropy value such that some permutations are preferred over others. It is


defined as the magnitude change of the PDF as a function of scale; therefore those orderings that are statistically self-dissimilar over scale are ranked higher than those that exhibit stationarity.

Figure 3(c) shows WD as a function of scale. It can be seen that the plot corresponding to the edge point has a much lower value than the one for the centre point at the selected scale value. In essence, it is a normalised measure of scale localisation. For example, in a noise image the pixel values are highly unpredictable at any one scale, but over scale the statistics are stationary. However, a noise patch against a plain background would be salient due to the change in statistics.

In the continuous case, the saliency measure YD, a function of scale s and position x, is defined as:

\[ Y_D(s_p, \mathbf{x}) \triangleq H_D(s_p, \mathbf{x}) \, W_D(s_p, \mathbf{x}) \tag{1} \]

i.e. for each point x the set of scales sp, at which entropy peaks, is obtained, then the saliency is determined by weighting the entropy at these scales by WD. Entropy, HD, is given by:

\[ \text{I.} \quad H_D(s, \mathbf{x}) \triangleq -\int p(I, s, \mathbf{x}) \log_2 p(I, s, \mathbf{x}) \, dI \tag{2} \]

where p(I, s, x) is the probability density of the intensity I as a function of scale s and position x. The set of scales sp is defined by:

\[ \text{II.} \quad s_p \triangleq \left\{ s \;:\; \frac{\partial H_D(s, \mathbf{x})}{\partial s} = 0, \;\; \frac{\partial^2 H_D(s, \mathbf{x})}{\partial s^2} < 0 \right\} \tag{3} \]

The inter-scale saliency measure, WD(s,x), is defined by:

\[ \text{III.} \quad W_D(s, \mathbf{x}) \triangleq s \int \left| \frac{\partial}{\partial s} p(I, s, \mathbf{x}) \right| dI \tag{4} \]

In this paper, entropy is measured for the grey-level image intensity but other attributes, e.g. colour or orientation, may be used instead; see [8] for examples.
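Steps I–III can be sketched discretely for a grey-level image, using circular windows and a finite-difference approximation of Eq. (4); the function and parameter names (`local_histogram`, `saliency_at`, `bins`) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def local_histogram(img, x, y, s, bins=16):
    """Normalised grey-level histogram over a circular window of radius s
    (the local PDF estimate used by the detector)."""
    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    mask = (xx - x) ** 2 + (yy - y) ** 2 <= s ** 2
    h, _ = np.histogram(img[mask], bins=bins, range=(0, 256))
    return h / h.sum()

def saliency_at(img, x, y, scales, bins=16):
    """Return (scale, Y_D) pairs: entropy H_D (Eq. 2) at its peak scales (Eq. 3),
    weighted by the inter-scale measure W_D (Eq. 4), as in Eq. (1)."""
    p = np.array([local_histogram(img, x, y, s, bins) for s in scales])
    H = np.array([-(q[q > 0] * np.log2(q[q > 0])).sum() for q in p])  # Eq. (2)
    dp = np.abs(np.diff(p, axis=0, prepend=p[:1])).sum(axis=1)        # |dp/ds| approx.
    W = np.asarray(scales, dtype=float) * dp                          # Eq. (4)
    peaks = [i for i in range(1, len(scales) - 1)
             if H[i] > H[i - 1] and H[i] > H[i + 1]]                  # Eq. (3)
    return [(scales[i], H[i] * W[i]) for i in peaks]                  # Eq. (1)
```

For a dark disc on a white background, this sketch selects a scale near the radius at which black and white pixels appear in roughly equal proportion, in line with the discussion of Figure 3.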

This approach has a number of attractive properties. It offers a more general model of feature saliency and scale compared to conventional feature detection techniques. Saliency is defined in terms of spatial unpredictability; scale by the sampling window and its parameterisation. For example, a blob detector implemented using a convolution of multiple scale Laplacian-of-Gaussian (LoG) functions [14], whilst responding to a number of different feature shapes, maximally responds only to the LoG function itself (or its inverse); in other words, it acts as a matched filter1. Many convolution based approaches to feature detection exhibit the same bias, i.e. a preference towards certain features. This specificity has a detrimental effect on the quality of the features and scales selected. In

1 This property is somewhat alleviated by the tendency of blurring to smooth image structures into LoG-like functions.


contrast, the saliency approach responds equally to the LoG and all other permutations of its pixels provided that the constraint on WD(s) is satisfied. This property enables the method to perform well over intra-class variations as is demonstrated in Section 4.

2.2 Affine invariant saliency

In the original formulation of [10], the method was invariant to the similarity group of geometric transformations and to photometric shifts. In this section, we develop the method to be fully affine invariant to geometric transformations. In principle, the modification is quite straightforward and may be achieved by replacing the circular sampling window by an ellipse: under an affine transformation, circles map onto ellipses. The scale parameter s is replaced by a vector s = (s, ρ, θ), where ρ is the axis ratio and θ the orientation of the ellipse. Under such a scheme, the major and minor axes of the ellipse are given by s/√ρ and s√ρ respectively.

Increasing the dimensionality of the sampling window creates the possibility of degenerate cases. For example, in the case of a dark circle against a white background (see Figure 3(a)), any elliptical sampling window that contains an equal number of black and white pixels (HD constraint) but does not exclude any black pixels at the previous scale (WD constraint) will be considered equally salient. Such cases are avoided by requiring that the inter-scale saliency, WD, is smooth across a number of scales. A simple way to achieve this is to apply a 3-tap averaging filter to WD over scale.
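The elliptical parameterisation and the 3-tap smoothing can be sketched as follows (a sketch under the paper's (s, ρ, θ) parameterisation; the function names are ours):

```python
import numpy as np

def ellipse_axes(s, rho):
    """Map the scale-vector components (s, rho) to the ellipse's
    major and minor axes, s/sqrt(rho) and s*sqrt(rho)."""
    return s / np.sqrt(rho), s * np.sqrt(rho)

def smooth_WD(WD_over_scale):
    """3-tap averaging of W_D over scale, so that degenerate elliptical
    windows (smooth only at isolated scales) are not marked salient."""
    return np.convolve(WD_over_scale, np.ones(3) / 3.0, mode="same")
```

With ρ = 1 the ellipse degenerates to the original circular window of radius s, recovering the similarity invariant detector.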

2.3 Local Search

The complexity of a full search can be significantly reduced by adopting a local strategy in the spirit of [2, 19, 20]. Our approach is to start the search only at seed points (positions and scales) found by applying the original similarity invariant search. Each seed circle is then locally adapted in order to maximise two criteria, HD (entropy) and WD (inter-scale saliency). WD is maximised when the ratio and orientation match those of the local image patch [9] at the correct scale, defined by a peak in HD. Therefore, we adopt an iterative refinement approach. The ratio and orientation are adjusted in order to maximise WD, then the scale is adjusted such that HD is peaked. The search is stopped when neither the scale nor the shape change (or a maximum iteration count is exceeded).

The final set of regions is chosen using a greedy clustering algorithm which operates from the most salient feature down (highest value of YD) and clusters together all features within the support region of the current feature. A global threshold on value or number is used.
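The greedy clustering step admits a direct sketch (the (x, y, scale, Y_D) tuple layout is an assumption for illustration):

```python
def greedy_cluster(features):
    """Greedy clustering: take the most salient feature first, then suppress
    every remaining feature whose centre falls within its support region.
    Each feature is a (x, y, scale, saliency) tuple."""
    remaining = sorted(features, key=lambda f: f[3], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [f for f in remaining
                     if (f[0] - best[0]) ** 2 + (f[1] - best[1]) ** 2 > best[2] ** 2]
    return kept
```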

The performance of this local method is compared to exhaustive search in Section 3.


[Figure 4 surfaces: WD(x, s) as a function of x and scale. Panels (a), (b), (c).]

Fig. 4. WD as a function of x and scale for the image shown in (a) at the y-position indicated by the dashed line using standard sampling (b), and anti-aliased sampling (c).

2.4 Anti-aliased sampling

The simplest method for estimating local PDFs from images is to use histogramming over a local neighbourhood, for example a circular region; pixels inside the region are counted whilst those outside are not. However, this binary approach gives rise to step changes in the histogram as the scale is increased. WD is especially sensitive to this since it measures the difference between two concentric sampling windows. For example, Figure 4(b) shows the variation of WD as a function of x and scale for the image shown in 4(a). The surface is taken at a point indicated by the dashed line. Somewhat surprisingly, the surface is highly irregular and noisy even for this ideal noise-free image; consequently, so is the saliency space. Intuitively, the solution to this problem lies with a smoother transition between the pixels that are included in the histogram and the ones that are not.

The underlying problem is, in fact, an instance of aliasing. Restated from a sampling perspective, the binary representation of the window is sampling without pre-filtering. Evidently, this results in severe aliasing. This problem has long been recognised in the Computer Graphics community and numerous methods have been devised to better represent primitives on a discrete display [5].

To overcome this problem we use a smooth sampling window (i.e. a filtered version of the ideal sampling window). However, in contrast to the CG application, here the window weights the contributions of the pixels to the histogram, not the pixel values themselves; pixels near the edge contribute less to the count than ones near the centre. It does not blur the image.

Griffin [6] and Koenderink and van Doorn [12] have suggested weighting histogram counts using a Gaussian window, but not in relation to anti-aliasing. However, for our purposes, the Gaussian poorly represents the statistics of the underlying pixels towards the edges due to the slow drop-off. Its long tails cause a slow computation, since more pixels will have to be considered, and also result in poor localisation. The traditional 'pro-Gaussian' arguments do not seem appropriate here.

Analytic solutions for the optimal sampling window are, in theory at least, possible to obtain. However, empirically we have found the following function works well:

\[ SW(z) = \frac{1}{1 + \left( \frac{z}{s} \right)^{n}}, \qquad z = \left( \frac{x'}{\sqrt{\rho}} \right)^{2} + \left( y' \sqrt{\rho} \right)^{2}. \tag{5} \]

with n = 42, where x′ = x cos θ + y sin θ and y′ = y cos θ − x sin θ achieve the desired rotation. We truncate for small values of SW(z). This sampling window gives scalar values as a function of distance, z, from the window centre, which are used to build the histogram. Figure 4(c) shows the same slice through WD space but generated using Equation 5 for the sampling weights. Further implementation details and analysis may be found in [9, 10].
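Equation 5 translates directly into a window-weight function (a sketch; the truncation threshold `trunc` is our assumed value, since the paper only states that small values of SW(z) are truncated):

```python
import numpy as np

def sampling_weight(x, y, s, rho=1.0, theta=0.0, n=42, trunc=1e-3):
    """Anti-aliased sampling window of Eq. (5): weights a pixel's histogram
    contribution by its elliptical distance z from the window centre."""
    xr = x * np.cos(theta) + y * np.sin(theta)   # x'
    yr = y * np.cos(theta) - x * np.sin(theta)   # y'
    z = (xr / np.sqrt(rho)) ** 2 + (yr * np.sqrt(rho)) ** 2
    w = 1.0 / (1.0 + (z / s) ** n)
    return np.where(w < trunc, 0.0, w)  # truncate small weights
```

The weight is 1 at the centre, 0.5 where z = s, and with n = 42 falls off sharply but smoothly beyond that, which is what removes the step changes seen in Figure 4(b).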

3 Performance under viewpoint variations

The objective here is to determine the extent to which detected regions commute with viewpoint. This is an example of the global transformation requirement discussed in the introduction.

For these experiments, we follow the testing methodology proposed in [18, 19]. The method is applied to an image set2 comprising different viewpoints of the same (largely planar) scene for which the inter-image homography is known. Repeatability is determined by measuring the area of overlap of corresponding features. Two features are deemed to correspond if their projected positions differ by less than 1.5 pixels. Results are presented in terms of the error in overlapping area between two ellipses μa and μb:

\[ \epsilon_S = 1 - \frac{\mu_a \cap (A^{\mathsf{T}} \mu_b A)}{\mu_a \cup (A^{\mathsf{T}} \mu_b A)} \tag{6} \]

where A defines a locally linearized affine transformation of the homography between the two images, and μa ∩ (AᵀμbA) and μa ∪ (AᵀμbA) represent the areas of intersection and union of the ellipses respectively.
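The overlap error of Eq. (6) can be estimated numerically by rasterising the two ellipses; this discrete approximation is our illustration, not the authors' implementation. Here each ellipse is a 2×2 matrix μ defining the region xᵀμx ≤ 1, and the mapping by A is assumed to have been applied to μb beforehand:

```python
import numpy as np

def ellipse_mask(mu, X, Y):
    """Boolean mask of grid points x with x^T mu x <= 1."""
    return mu[0, 0] * X * X + 2 * mu[0, 1] * X * Y + mu[1, 1] * Y * Y <= 1.0

def overlap_error(mu_a, mu_b, extent=5.0, n=801):
    """Rasterised estimate of Eq. (6): 1 - area(intersection) / area(union)."""
    t = np.linspace(-extent, extent, n)
    X, Y = np.meshgrid(t, t)
    a, b = ellipse_mask(mu_a, X, Y), ellipse_mask(mu_b, X, Y)
    return 1.0 - np.logical_and(a, b).sum() / np.logical_or(a, b).sum()
```

For example, a unit circle against a circle of radius √2 gives an error of about 0.5, comfortably above the εS < 0.4 threshold used in the experiments.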

Figure 5(a) shows the repeatability performance as a function of viewpoint of three variants of the affine invariant salient region detector: exhaustive search without anti-aliasing (FS Affine ScaleSal), exhaustive search with anti-aliasing (AA FS Affine ScaleSal), and local search with anti-aliasing (AA LS Affine ScaleSal). The performance is compared to the detector of Mikolajczyk and Schmid [19], denoted Affine MSHar. Results are shown for εS < 0.4.

It can be seen that the full search Affine Saliency and Affine MSHar features have a similar performance over the range of viewpoints. However, from 40◦ the

2 Graffiti6 from http://www.inrialpes.fr/lear/people/Mikolajczyk/


[Figure 5 plots: (a) Repeatability vs viewpoint angle (degrees) for FS Affine ScaleSal, AA FS Affine ScaleSal, Affine MSHar and AA LS Affine ScaleSal; (b) Average matching proportion (%) vs number of salient regions for ScaleSal, DoG and MSHar, each with (BG) and without (NoBG) background clutter; (c) the same for the Cars and Faces sets; (d) the same for Affine ScaleSal and Affine MSHar with and without background clutter.]

Fig. 5. Repeatability results under (a) viewpoint changes, (b,d) background perturbations and intra-class variations for Bike images, (c) intra-class variations for car and face images. Plots (b,c) are for similarity invariant and (d) for affine invariant detectors.

anti-aliased sampling provides some gains, though curiously diminishes performance at 20◦. The local search anti-aliased Affine Saliency performs reasonably well compared to the full search methods but of course takes a fraction of the time to compute.

4 Performance under intra-class variation and image perturbations

The aim here is to measure the performance of a region detector under intra-class variations and image perturbations – the other two requirements specified in the introduction. In the following subsections we develop this measure and then compare the performance of the salient region detector to other region operators. In these experiments we used similarity invariant versions of three detectors: similarity Saliency (ScaleSal), the Difference-of-Gaussian (DoG) blob detector [16]


and the multi-scale Harris (MSHar) with Laplacian scale selection — this is Affine MSHar without the affine adaptation [19]. We also used the affine invariant detectors Affine ScaleSal and Affine MSHar. An affine invariant version of the DoG detector was not available.

4.1 The performance measure

We will first discuss measuring repeatability over intra-class variation. Suppose we have a set of images of the same object class, e.g. motorbikes. A region detection operator which is unaffected by intra-class variation will reliably select regions on corresponding parts of all the objects, say the wheels, engine or seat for motorbikes. Thus, we assess performance by measuring the (average) number of correct correspondences over the set of images.

The question is: what constitutes a correct corresponding region? To determine this, we use a proxy for the true intra-class transformation by assuming that an affinity approximately maps one imaged object instance to another. The affinities are estimated here by manually clicking on corresponding points in each image, e.g. for motorbikes the wheels and the seat/petrol tank join. We consider a region to match if it fulfils three requirements: its position matches within 10 pixels; its scale is within 20%; and the normalised mutual information3 between the appearances is > 0.2. For the affine invariant detectors, the scale test is replaced with the overlap error, εS < 0.4 (Eq. 6), and the mutual information is applied to elliptical patches transformed to circles. These are quite generous thresholds since the objects are different and the geometric mapping approximate.
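The three match tests, together with the normalised mutual information of footnote 3, can be sketched as follows (the 20% scale tolerance is interpreted here as relative to the larger scale, an assumption on our part; names are illustrative):

```python
import numpy as np

def entropy_bits(counts):
    """Shannon entropy (bits) of a histogram given as raw counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def normalised_mi(a, b, bins=16):
    """NMI(A,B) = 2(H(A)+H(B)-H(A,B)) / (H(A)+H(B)), as in footnote 3."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    ha, hb = entropy_bits(joint.sum(axis=1)), entropy_bits(joint.sum(axis=0))
    hab = entropy_bits(joint.ravel())
    return 2.0 * (ha + hb - hab) / (ha + hb)

def regions_match(ra, rb, patch_a, patch_b):
    """Similarity-detector match test: position within 10 px, scale within 20%,
    NMI between appearances > 0.2. A region is an (x, y, s) tuple."""
    close = np.hypot(ra[0] - rb[0], ra[1] - rb[1]) <= 10.0
    scale_ok = abs(ra[2] - rb[2]) <= 0.2 * max(ra[2], rb[2])
    return bool(close and scale_ok and normalised_mi(patch_a, patch_b) > 0.2)
```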

In detail, we measure the average correspondence score S as follows. N regions are detected on each image of the M images in the dataset. Then for a particular reference image i the correspondence score Si is given by the proportion of detected regions that correspond, taken over all the other images in the dataset, i.e.:

\[ S_i = \frac{\text{Total number of matches}}{\text{Total number of detected regions}} = \frac{N_i^{M}}{N(M-1)} \tag{7} \]

The score Si is computed for M/2 different selections of the reference image, and averaged to give S. The score is evaluated as a function of the number of detected regions N. For the DoG and MSHar detectors the features are ordered on Laplacian (or DoG) magnitude strength, and the top N regions selected.
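Assuming per-image match counts are available, Eq. (7) and the averaging over reference selections reduce to (a sketch; the argument names are ours):

```python
def correspondence_score(matches_to_others, N):
    """S_i of Eq. (7): total matches over total detected regions N(M-1),
    where matches_to_others holds one match count per non-reference image."""
    M = len(matches_to_others) + 1
    return sum(matches_to_others) / (N * (M - 1))

def average_score(per_reference_scores):
    """S: the mean of S_i over the chosen reference images."""
    return sum(per_reference_scores) / len(per_reference_scores)
```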

In order to test insensitivity to image perturbation the data set is split into two parts: the first contains images with a uniform background and the second, images with varying degrees of background clutter. If the detector is robust to background clutter then the average correspondence score S should be similar for both subsets of images.

4.2 Intra-class variation results

The experiments are performed on three separate data-sets, each containing different instances from an object class: 200 images from Caltech Motorbikes

3 MI(A, B) = 2(H(A) + H(B) − H(A, B))/(H(A) + H(B))


(a) (b)

Fig. 6. Example images from (a) the two parts of the Caltech motorbike data set without background clutter (top) and with background clutter (bottom), and (b) Caltech cars (top) and Caltech faces (bottom).

(Side), 200 images from Caltech Human face (Front), and all 126 Caltech Cars (Rear) images. Figure 6 shows examples from each data set4.

The average correspondence score S results for the similarity invariant detectors are shown in Figure 5(b) and (c). Figure 5(d) shows the results for the affine detectors on the motorbikes. For all three data sets and at all thresholds the best results are consistently obtained using the saliency detector. However, the repeatability for all the detectors is lower for the faces and cars compared to the motorbike case. This could be due to the appearances of the different object classes; motorbikes tend to appear more complex than cars and faces.

Figure 7 shows smoothed maps of the locations at which features were detected in all 200 images in the motorbike image set. All locations have been back projected onto a reference image. Bright regions are those at which detections are more frequent. The map for the saliency detector indicates that most detections are near the object with a few high detection points near the engine, seat, wheel centres and headlamp. In contrast, the DoG and MSHar maps show a much more diffuse pattern over the entire area caused by poor localisation and false responses to background clutter.

4.3 Image perturbation results

The motorbike data set is used to assess insensitivity to background clutter. There are 84 images with a uniform background, and 116 images with varying degrees of background clutter; see Figure 6(a).

Figure 5(b) shows separate plots for motorbike images with and without background clutter at N = 10 to 40. The saliency detector finds, on average, approximately 25% of 30 features within the matching constraints; this corresponds

4 Available from http://www.robots.ox.ac.uk/~vgg/data/.


[Figure 7 maps: normalised detection frequency (scale 0–1); panels: MultiScale Harris, Difference-of-Gaussian, Saliency.]

Fig. 7. Smoothed map of the detected features over all 200 images in the motorbike set, back-projected onto one image. The colour indicates the normalised number of detections in a given area (white is highest). Note the relative 'tightness' of the bright areas of the saliency detector compared to the DoG and MSHar.

to about 7 features per image on average. In contrast, the MSHar and DoG detectors select 2-3 object features per image at this threshold. Typical examples of the matched regions selected by the saliency detector on this data set are shown in Figure 8.
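A per-image correspondence fraction of this kind can be sketched as below. This is a hedged illustration, not the paper's exact matching constraint: it assumes each detection is reduced to an (x, y) centre, that detections are sorted by detector response so the top N can be taken by slicing, and that a match means lying within a hypothetical pixel tolerance `tol` of some reference detection.

```python
import numpy as np

def correspondence_score(test_feats, ref_feats, n, tol=5.0):
    """Fraction of the top-n test detections that lie within
    `tol` pixels of some reference detection.  Sketch only:
    the paper's matching constraints also involve region shape
    and scale, which are omitted here."""
    ref = np.asarray(ref_feats, dtype=float)
    matched = 0
    for x, y in test_feats[:n]:
        d = np.hypot(ref[:, 0] - x, ref[:, 1] - y)
        if d.min() <= tol:
            matched += 1
    return matched / float(n)
```

Averaging this fraction over all image pairs in a set, for a range of N, yields curves of the form plotted in Figure 5: a score of 0.25 at N = 30 corresponds to roughly 7 matched features per image.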

There is also a marked difference in the way the various detectors are affected by clutter. It has little effect on the ScaleSal detector, whereas it significantly reduces the DoG performance and similarly that of MSHar. Similar trends are obtained for the affine invariant detectors applied to the motorbike images, shown in Figure 5(d).

Local perturbations due to changes in the scene configuration, background clutter or changes within the object itself can be mitigated by ensuring compact support of any probing elements. Both the DoG and MSHar methods rely on relatively large support windows, which cause them to be affected by non-local changes in the object and background; compare the cluttered and uncluttered background results for the motorbike experiments.

There may be several other relevant factors. First, both the DoG and MSHar methods blur the image, hence causing a greater degree of similarity between objects and background. Second, in most images the objects of interest tend to be in focus while backgrounds are out of focus and hence blurred. Blurred regions tend to exhibit slowly varying statistics, which result in relatively low entropy and inter-scale saliency in the saliency detector. Third, the DoG and MSHar methods define saliency with respect to specific properties of the local surface geometry. In contrast, the saliency detector uses a much broader definition.
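The second point can be made concrete with the grey-level histogram entropy that underlies the Kadir and Brady saliency measure [10]. The sketch below computes only the single-scale attribute entropy (the full detector also weights by inter-scale saliency); the bin count and the assumption of intensities in [0, 1] are our choices.

```python
import numpy as np

def patch_entropy(patch, bins=16):
    """Shannon entropy (bits) of a patch's grey-level histogram --
    the local-attribute unpredictability at the core of the
    saliency measure.  Sketch: intensities assumed in [0, 1]."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                      # discard empty bins
    return float(-(p * np.log2(p)).sum())
```

A heavily blurred (near-constant) patch concentrates its mass in one or two bins and so scores near zero entropy, while a textured in-focus patch spreads across many bins and scores highly, which is why out-of-focus backgrounds attract few saliency detections.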

5 Discussion and Future Work

In this paper we have presented a new region detector which is comparable to the state of the art [19, 20] in terms of co-variance with viewpoint. We have also demonstrated that it has superior performance on two further criteria: robustness to image perturbations, and repeatability under intra-class variability. The new detector extends the original method of Kadir and Brady to affine invariance; we


Fig. 8. Examples of the matched regions selected by the similarity saliency detector from the motorbike images: whole front wheels; forks; seats; handle-bars.

have developed a properly anti-aliased implementation and a fast optimisation based on a local search.

We have also proposed a new method to test detectors under intra-class variations and background perturbations. Performance under this extended criterion is important for many applications, for example part detectors for object recognition.

The intra-class experiments demonstrate that defining saliency in the manner of the saliency detector is, on average, a better search heuristic than those used by the other region detectors tested, at least on the three data sets used here.

It is interesting to consider how the design of feature detectors affects performance. Many global effects, such as viewpoint, scale or illumination variations, can be modelled mathematically, and as such can be tackled directly, provided the detector also lends itself to such analysis. Compared to diffusion-based scale-spaces, relatively little is currently known about the properties of the spaces generated by statistical methods such as that described here. Further investigation of its properties seems an appealing line of future work.

We plan to compare the saliency detector to other region detection approaches which are not based on filter response extrema, such as [17, 22].

Acknowledgements

We would like to acknowledge Krystian Mikolajczyk for supplying the MSHar and embedded (David Lowe) DoG feature detector code. Thanks to Mark Everingham and Josef Sivic for many useful discussions and suggestions. This work was funded by EC project CogViSys.

References

1. S. Agarwal and D. Roth. Learning a sparse representation for object detection. In Proc. European Conf. Computer Vision, pages 113–130, 2002.
2. A. Baumberg. Reliable feature matching across widely separated views. In Proc. Computer Vision Pattern Recognition, pages 774–781, 2000.
3. E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In Proc. European Conf. Computer Vision, pages 109–124, 2002.
4. R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. Computer Vision Pattern Recognition, pages II: 264–271, 2003.
5. J.A. Foley and A. Van Dam. Fundamentals of Interactive Computer Graphics. Addison-Wesley, 1982.
6. L.D. Griffin. Scale-imprecision space. Image and Vision Computing, 15:369–398, 1997.
7. C. Harris and M. Stephens. A combined corner and edge detector. In Proc. Alvey Vision Conf., pages 189–192, 1988. Manchester.
8. T. Kadir. Scale, Saliency and Scene Description. PhD thesis, University of Oxford, 2002.
9. T. Kadir, D. Boukerroui, and J.M. Brady. An analysis of the scale saliency algorithm. Technical Report OUEL No: 2264/03, University of Oxford, 2003.
10. T. Kadir and J.M. Brady. Scale, saliency and image description. Intl. J. of Computer Vision, 45(2):83–105, 2001.
11. J.J. Kœnderink and A.J. van Doorn. Representation of local geometry in the visual system. Biological Cybernetics, 63:291–297, 1987.
12. J.J. Kœnderink and A.J. van Doorn. The structure of locally orderless images. Intl. J. of Computer Vision, 31(2/3):159–168, 1999.
13. S. Kok-Wiles, M. Brady, and R. Highnam. Comparing mammogram pairs for the detection of lesions. In Proc. Intl. Workshop on Digital Mammography, pages 103–110, 1998.
14. T. Lindeberg. Detecting salient blob-like image structures and their scales with a scale-space primal sketch: A method for focus-of-attention. Intl. J. of Computer Vision, 11(3):283–318, 1993.
15. T. Lindeberg and B.M. ter Haar Romeny. Linear scale-space: I. basic theory, II. early visual operations. In B.M. ter Haar Romeny, editor, Geometry-Driven Diffusion. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1994.
16. D.G. Lowe. Object recognition from local scale-invariant features. In Proc. Intl. Conf. on Computer Vision, pages 1150–1157, 1999.
17. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. British Machine Vision Conf., pages 384–393, 2002.
18. K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In Proc. Intl. Conf. on Computer Vision, 2001.
19. K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. European Conf. Computer Vision, 2002.
20. F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or "How do I organize my holiday snaps?". In Proc. European Conf. Computer Vision, pages 414–431, 2002.
21. C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(5):530–535, 1997.
22. T. Tuytelaars and L. Van Gool. Wide baseline stereo based on local, affinely invariant regions. In Proc. British Machine Vision Conf., pages 412–422, 2000.
23. M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In Proc. European Conf. Computer Vision, June 2000.