

Rotationally Invariant Descriptors Using Intensity Order Pooling

Bin Fan, Member, IEEE Computer Society, Fuchao Wu, and Zhanyi Hu

Abstract—This paper proposes a novel method for interest region description which pools local features based on their intensity orders in multiple support regions. Pooling by intensity orders is not only invariant to rotation and monotonic intensity changes, but also encodes ordinal information into a descriptor. Two kinds of local features are used in this paper, one based on gradients and the other on intensities; hence, two descriptors are obtained: the Multisupport Region Order-Based Gradient Histogram (MROGH) and the Multisupport Region Rotation and Intensity Monotonic Invariant Descriptor (MRRID). Thanks to the intensity order pooling scheme, the two descriptors are rotation invariant without estimating a reference orientation, which appears to be a major error source for most of the existing methods, such as Scale Invariant Feature Transform (SIFT), SURF, and DAISY. Promising experimental results on image matching and object recognition demonstrate the effectiveness of the proposed descriptors compared to state-of-the-art descriptors.

Index Terms—Local image descriptor, rotation invariance, monotonic intensity invariance, image matching, intensity orders, SIFT.


1 INTRODUCTION

Local image descriptors computed from interest regions have been widely studied in computer vision. They have become more and more popular and useful for a variety of visual tasks, such as structure from motion [1], [2], [3], [4], [5], object recognition [6], and classification [7], as well as panoramic stitching [8].

A good local image descriptor is expected to have high discriminative ability so that the described point can be easily distinguished from other points. Meanwhile, it should also be robust to a variety of possible image transformations, such as scale, rotation, blur, illumination, and viewpoint changes, so that corresponding points can be easily matched across images captured under different imaging conditions. Improving distinctiveness while maintaining robustness is the main concern in the design of local image descriptors. In this paper, we focus on designing local image descriptors for interest regions.

Interest regions are usually detected as affine invariant regions (e.g., Hessian-Affine and Harris-Affine regions [9]) and represented by ellipses, which can be normalized to a canonical region by mapping the corresponding ellipses onto a circle. Such an affine normalization procedure has a rotation ambiguity. To alleviate this problem, researchers have either designed rotationally invariant descriptors [10], [11], [12], [13] or rotated the normalized region according to an estimated reference orientation of the interest region, as in [6], [14], [15], [16]. The former achieves rotation invariance at the expense of discriminative ability, since this kind of descriptor usually loses some important spatial information, while the latter suffers from the potential instability of orientation estimation.

This paper proposes a novel method for descriptor construction. By using gradient-based and intensity-based local features, two local image descriptors are obtained: the Multisupport Region Order-Based Gradient Histogram (MROGH) and the Multisupport Region Rotation and Intensity Monotonic Invariant Descriptor (MRRID). They are rotation invariant without relying on a reference orientation and still have high discriminative ability. Therefore, they possess the advantages of both existing kinds of descriptors while effectively avoiding their disadvantages. Our main contributions include:

. A comprehensive reliability analysis of orientation estimation is reported in Section 3. We found that orientation estimation error is one of the major sources of false negatives (true corresponding points that are not matched by their descriptors). Since our proposed descriptors calculate local features without resorting to a reference orientation, they are more reliable than descriptors which rely on a reference orientation for rotation invariance.

. The rotation invariant local features are pooled together based on their intensity orders. Since such a pooling scheme is rotation invariant, the constructed descriptors are rotation invariant. In addition, ordinal information is encoded into the descriptor by this pooling scheme. Therefore, the constructed descriptors can be highly discriminative while being rotation invariant. Fig. 1 shows the matching results of corresponding points whose orientation estimation errors are larger than 20° (orientations are estimated by the method used in [6]). Although these corresponding points can hardly be matched by SIFT due to large orientation

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. 10, OCTOBER 2012 2031

. The authors are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Automation Building, 95# Zhongguancun East Road, Beijing 100190, China. E-mail: {bfan, fcwu, huzy}@nlpr.ia.ac.cn.

Manuscript received 5 Mar. 2011; revised 26 Nov. 2011; accepted 11 Dec. 2011; published online 22 Dec. 2011. Recommended for acceptance by S. Belongie. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2011-03-0145. Digital Object Identifier no. 10.1109/TPAMI.2011.277.

0162-8828/12/$31.00 © 2012 IEEE Published by the IEEE Computer Society


estimation errors, many of them can be correctly matched by our proposed descriptor (MROGH).

. Multiple support regions are used for descriptor construction to further improve its discriminative ability. Although two noncorresponding points may have similar appearances in a certain size of support region, they usually can be easily distinguished in a support region of a different size. Therefore, by constructing our descriptors from multiple support regions, their discriminative abilities are further enhanced.

All these factors contribute to the good performance of our proposed descriptors. Their effectiveness and superiority have been evaluated on image matching and object recognition tasks.

This paper is an extension of our previous work [17]. Specifically, the extensions include:

1. An in-depth analysis of the rotation invariance of local descriptors.

2. The MRRID descriptor, introduced to tackle large illumination changes.

3. A detailed description of the construction of our descriptors as well as a discussion about their properties.

4. Experiments on object recognition and many more results on image matching.

The rest of this paper is organized as follows: Section 2 gives a brief overview of related work. Section 3 presents an experimental study on the rotation invariance of local descriptors. Construction of our proposed descriptors is elaborated in Section 4, followed by the experiments in Section 5. Finally, conclusions are presented in Section 6.

2 RELATED WORK

Generally speaking, there are three main steps in matching points by local image descriptors. The first step is to detect points in images. The detected points should be detectable and matchable across images which are captured under different imaging conditions. Such points are called interest points or feature points in the literature. Harris [18] and Difference of Gaussian (DoG) [6] points are two types of popular interest points. Feature point detection is usually followed by an additional step of detecting an affine invariant region around the interest point in order to deal with large viewpoint changes. Many methods in the literature have been proposed for this purpose. Widely used methods include the Harris-Affine/Hessian-Affine detector [9], Maximally Stable Extremal Regions (MSERs) [19], and intensity-based and edge-based detectors [20]. Please see [21] for a comprehensive study of these affine region detectors. Once interest regions have been extracted, they can be normalized to a canonical region which remains the same under affine transformations of the original local patches. Second, feature descriptors are constructed from these (affine normalized) interest regions in order to distinguish them from each other. The final step of point matching is to compute the distance between the descriptors of two candidate points and to decide whether they are a match or not. Popular decision-making strategies are Nearest Neighbor (NN) and Nearest Neighbor Distance Ratio (NNDR) [22]. It has been shown in [22] that the rank of different descriptors is not influenced by the matching strategy. This makes the comparison of local descriptors convenient, as they do not need to be compared with all possible matching strategies.
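The NNDR strategy described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper; the 0.8 ratio threshold is an illustrative choice following common practice.

```python
import numpy as np

def match_nndr(desc1, desc2, ratio=0.8):
    """Match rows of desc1 to rows of desc2 by the nearest-neighbor
    distance ratio (NNDR) test; 'ratio' is an illustrative threshold."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)  # Euclidean distances to all candidates
        j, j2 = np.argsort(dists)[:2]              # nearest and second-nearest neighbors
        if dists[j] < ratio * dists[j2]:           # accept only distinctive matches
            matches.append((i, j))
    return matches
```

Plain NN matching corresponds to dropping the ratio test and always accepting the nearest neighbor.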

Local image descriptors have received a lot of attention in the computer vision community. Many local descriptors have been developed since the 1990s [6], [14], [15], [23], [24], [25], [26], [27], [28], [29]. Perhaps one of the most famous and popular descriptors is the Scale Invariant Feature Transform (SIFT) [6]. According to the comparative study of Mikolajczyk and Schmid [22], SIFT and its variant, the Gradient Location and Orientation Histogram (GLOH), outperform other local descriptors, including shape context [23], steerable filters [30], spin images [10], differential invariants [11], and moment invariants [31]. Inspired by the high discriminative ability and robustness of SIFT, many researchers have developed various local descriptors following the way of SIFT. Ke and Sukthankar [24] applied Principal Component Analysis (PCA) [32] to the gradient patch of a keypoint and introduced the PCA-SIFT descriptor, which was said to be more compact and distinctive than SIFT. Bay et al. [27] proposed an effective implementation of SIFT with the integral image technique, achieving threefold to sevenfold speedups. Tola et al. [16] developed a fast descriptor named DAISY for dense matching. Winder and Brown [33] proposed a framework to learn local


Fig. 1. Matching results of corresponding points which have large orientation estimation errors (≥ 20°) by (a) SIFT and (b) our proposed descriptor. The corresponding points that are matched by their descriptors are marked with red lines, while yellow lines indicate those corresponding points that are unmatchable by their descriptors. Most of these corresponding points can be correctly matched by our proposed descriptor (MROGH), but few of them can be correctly matched by SIFT.


descriptors with different combinations of local features and feature pooling schemes. SIFT and many other descriptors can be incorporated into their framework. A DAISY-like descriptor was reported to have the best performance among all configurations, and the best DAISY configuration was then picked in [34]. Of course, there are many other local descriptors besides these variants of SIFT. Berg and Malik [35] proposed geometric blur for template matching. Forssen and Lowe [36] used the shape of the detected MSER to construct the local descriptor. Chen et al. [26] proposed a descriptor based on Weber's Law [37].

In order to deal with the problem of illumination changes, some researchers have utilized intensity orders, since they are invariant to monotonic intensity changes. Mittal and Ramesh [38] proposed a change detection method which penalizes an order flip of two pixels according to their intensities. Gupta and Mittal [39] proposed an illumination and affine invariant point matching method which penalizes flips of the intensity order of certain point pairs near keypoints. Gupta and Mittal [25] proposed a monotonic change invariant feature descriptor based on the intensity order of point pairs in the interest region. The point pairs are carefully chosen from extremal regions in order to be robust to localization error as well as to intensity noise. Matching this kind of descriptor is based on a distance function that penalizes order flips. Tang et al. [28] used a 2D histogram of position and intensity order to construct a feature descriptor to deal with complex brightness changes. Heikkila et al. [14] proposed using a variant of the Local Binary Pattern (LBP) [40], i.e., CS-LBP, which encodes intensity ordinal information locally for feature description; the obtained descriptor was reported to have better performance than SIFT, especially for matching image pairs with illumination changes. Gupta et al. [29] generalized the CS-LBP descriptor with a ternary coding style and proposed to incorporate a histogram of relative intensities in their work. Therefore, their proposed descriptor captures both local orders and the overall distribution of pixel orders in the entire patch.

Besides the low-level features (e.g., the histogram of gradient orientations in SIFT) used for descriptor construction, choosing an optimal support region size is also critical for feature description. Some researchers have reported that a single support region is not enough to distinguish incorrect matches from correct ones. Mortensen et al. [15] proposed combining SIFT with a global context computed from curvilinear shape information in a much larger neighborhood to improve the performance of SIFT, especially for matching images with repeated textures. Harada et al. [41] proposed a framework for embedding both local and global spatial information to improve the performance of local descriptors for scene classification and object recognition. Cheng et al. [42] proposed using multiple support regions of different sizes to construct a feature descriptor that is robust to general image deformations. In their work, a SIFT descriptor is computed for each support region; these are then concatenated together to form their descriptor. Moreover, they further proposed a similarity measure model, the Local-to-Global Similarity model, to match points described by their descriptors.

Our work is fundamentally different from the previous ones. We calculate local features in a rotation invariant way, whereas many previous methods are not strictly rotation invariant since they need to assign a reference orientation to each interest point, such as SIFT, DAISY, and CS-LBP. As shown in our experiments in Section 3, orientation estimation is an error-prone process. Since our proposed descriptors do not rely on a reference orientation, they should potentially be more robust. In fact, the need for a reference orientation is also a drawback and bottleneck of the previous methods which utilize multiple support regions, which largely differentiates our method from them. Although some local descriptors, such as the spin image and the Rotation Invariant Feature Transform (RIFT) [10], achieve rotation invariance without requiring a reference orientation, they are less distinctive since some spatial information is lost due to their feature pooling schemes. In our work, local features are pooled together according to their intensity orders. Such a feature pooling scheme is inherently rotation invariant, and also invariant to monotonic intensity changes.
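The invariance that motivates intensity order pooling can be checked directly: the rank order of pixel intensities is unchanged by any monotonically increasing intensity transform (and, for samples in a circular region, the set of ranks is unchanged by rotation). A small NumPy check of the first property; the gamma transform is just one illustrative monotonic change:

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.random(100)           # intensities of 100 sample points

def ranks(v):
    """Rank of each element under nondescending intensity order."""
    return np.argsort(np.argsort(v))

transformed = patch ** 0.4        # a monotonically increasing intensity change

# Intensity orders, and hence any order-based pooling, are unaffected.
assert np.array_equal(ranks(patch), ranks(transformed))
```

Any descriptor built purely from these ranks therefore inherits invariance to monotonic intensity changes for free.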

3 AN ANALYSIS OF THE ROTATION INVARIANCE OF LOCAL DESCRIPTORS

Local descriptors in the literature can be roughly divided into two categories concerning rotation invariance: 1) assigning an estimated orientation to each interest point and computing the descriptor relative to the assigned orientation, such as SIFT, SURF, and CS-LBP; 2) designing the local descriptor in an inherently rotation invariant way, such as the spin image and RIFT. Currently, it is commonly believed that local descriptors belonging to the second category are less distinctive than those in the first category, since their rotation invariance is achieved at the expense of some spatial information loss. Therefore, many existing local descriptors first rotate the support region according to a reference orientation and then divide the support region into subregions to encode spatial information.

Although SIFT, SURF, etc., can achieve satisfactory results in many applications by assigning an estimated orientation for reference, we would claim that orientation estimation based on local image properties is an error-prone process. We show experimentally that orientation estimation error makes many true corresponding points unmatchable by their descriptors. To this end, we collected 40 image pairs with rotation transformations (some of them also involving scale changes) from the Internet [43], each of which is related by a homography that is supplied along with the image pair. They are captured from five types of scenes, as shown in Fig. 1 in the supplemental material, which can be found in the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2011.277. For each image pair, we extracted SIFT descriptors of the SIFT keypoints and matched them by the nearest neighbor of the distances of their descriptors. We focus on the orientation estimation errors between corresponding points. For a pair of corresponding points $(x, y, \theta)$ and $(x', y', \theta')$, the orientation estimation error is computed by

" ¼ �0 � fð�;HÞ; ð1Þ

where $f(\theta, H)$ is the ground-truth orientation of $\theta$ when warping from the first image to the second image according to the homography $H$ between the two images. Fig. 2



presents some statistical results. Fig. 2a is the histogram of orientation estimation errors among all corresponding points. A similar histogram was obtained by Winder and Brown [33] by applying random synthetic affine warps; here, we use real image pairs with mainly rotation transformations. Fig. 2b shows the histogram of orientation estimation errors among those corresponding points that are matched by their SIFT descriptors. Fig. 2b indicates that, for the SIFT descriptor, an orientation estimation error of no more than 20° is required in order to match corresponding points correctly. However, it can be clearly seen from Fig. 2a that many corresponding points have errors larger than 20°. Only 63.77 percent of the corresponding points have orientation estimation errors in the range of [−20°, 20°]. Those corresponding points with large orientation estimation errors (≥20°) may not be correctly matched by comparing their descriptors. In other words, 36.23 percent of the corresponding points will be incorrectly matched mainly due to their large orientation estimation errors. Therefore, orientation estimation has a significant impact on distinctive descriptor construction.
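The error of Eq. (1) can be evaluated numerically by propagating the orientation through the homography: warp the keypoint and a point displaced along its orientation, and take the angle of the warped displacement as $f(\theta, H)$. A sketch in NumPy; the finite-difference step and the angle wrapping are our implementation choices, not specified in the paper:

```python
import numpy as np

def warp(H, x, y):
    """Apply a 3x3 homography H to the point (x, y)."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

def orientation_error(H, x, y, theta, theta_prime, step=1.0):
    """Orientation estimation error (degrees) per Eq. (1),
    epsilon = theta' - f(theta, H), wrapped to [-180, 180)."""
    x0, y0 = warp(H, x, y)
    x1, y1 = warp(H, x + step * np.cos(theta), y + step * np.sin(theta))
    f_theta = np.arctan2(y1 - y0, x1 - x0)      # ground-truth warped orientation
    err = np.degrees(theta_prime - f_theta)
    return (err + 180.0) % 360.0 - 180.0        # wrap to [-180, 180)
```

For a pure rotation homography, the recovered $f(\theta, H)$ is exactly the keypoint orientation plus the rotation angle.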

In order to give more insight into the influence of the estimated orientation on the matching performance of local descriptors, we conducted some image matching experiments. The experimental images, downloaded from the Internet [43], are shown in Fig. 3. The tested image pairs were selected as the most challenging ones in the data set for experiments on image blur, JPEG compression, and illumination change, i.e., the first and sixth images in the data set. For each image pair, there exists a global rotation between the two images: 0° for the image pairs in Figs. 3d, 3e, and 3f, and a certain degree for the image pairs in Figs. 3a, 3b, and 3c. We used two local descriptors (SIFT and DAISY) with two different orientation assignment methods for image matching.

1. The orientation is assigned by the ground truth orientation, and the constructed local descriptors are denoted as Ori-SIFT and Ori-DAISY.

2. The orientation is assigned by the method suggested in [6], and the constructed local descriptors are denoted as SIFT and DAISY.

By comparing the performance of Ori-SIFT (respectively, Ori-DAISY) with SIFT (respectively, DAISY), we can get a clear understanding of the influence of orientation estimation on constructing local descriptors for image matching. Experimental results are shown below the tested image pairs in Fig. 3. It can be seen from Fig. 3 that Ori-SIFT

(Ori-DAISY) significantly outperforms SIFT (DAISY). Since the only difference between Ori-SIFT (Ori-DAISY) and SIFT (DAISY) is the orientation assignment method, this demonstrates that a more accurate orientation estimation will largely improve the performance of the local descriptor. The currently used orientation estimation method, based on the histogramming technique, is still an error-prone procedure and will adversely affect the performance of the local descriptor. Therefore, there is still a long way to go for a local descriptor to be inherently rotation invariant¹ while maintaining high distinctiveness. In this paper, we make an effort along this way and propose constructing local descriptors by pooling local features according to intensity orders. The two proposed descriptors are not only inherently rotation invariant, but also more distinctive than state-of-the-art local descriptors, as shown by the experiments in Section 5.

4 THE PROPOSED METHOD

The key idea of our method is to pool rotation invariant local features based on intensity orders. Instead of assigning a reference orientation to each interest point to make the computation of local features rotation invariant, we calculate local features in a locally rotation invariant coordinate system; thus, they are inherently rotation invariant. Meanwhile, sample points are adaptively partitioned into several groups based on their intensity orders. Then, the rotation invariant local features of the sample points in these groups are pooled together separately to construct a descriptor. Since the intensity orders of sample points are rotation invariant, such a feature pooling scheme is also rotation invariant. Therefore, no reference orientation is required in our method. In this paper, we pool two kinds of local features


Fig. 2. Distributions of the orientation estimation errors between corresponding points. (a) The distribution among all the corresponding points; only 63.77 percent of the corresponding points have errors in the range of [−20°, 20°]. (b) The distribution among those corresponding points that are matched by their SIFT descriptors. See text for details.

1. It means achieving rotation invariance without resorting to an estimated orientation for reference.


based on intensity orders to construct local image descriptors: one based on gradients and the other on intensities; hence, two descriptors are obtained, named MROGH and MRRID, respectively. The details are described next.

4.1 Affine Normalized Regions

The detected regions for calculating descriptors are either circular or elliptical regions of different sizes, depending on the region detector used. For example, the Hessian-Affine/Harris-Affine detector detects elliptical regions that are affine invariant up to a rotation transformation, while the Hessian-Laplace/Harris-Laplace detector detects circular regions. Please see [21] and [22] for more details about region detectors. To obtain scale or affine invariance, the detected region is usually normalized to a canonical region. Similarly to many other local descriptors, this work designs a local descriptor on the normalized region, which is a circular region of radius 20.5 pixels. Thus, the minimal patch that contains the normalized region is 41 × 41 pixels in size. A similar patch size is also used in [22] and [24]. If the detected region is larger than the normalized region, the image of the detected region is smoothed by a Gaussian kernel before region normalization. The standard deviation of the Gaussian used for smoothing is set to the size ratio of the detected region and the normalized region [22].

Given a detected region denoted by a symmetric matrix $A \in \mathbb{R}^{2 \times 2}$, any point $X$ in the region satisfies

$$X^T A X \le 1. \quad (2)$$

If $A = \frac{1}{c^2}E$, where $E$ is the identity matrix, then the region is a circular one and $c$ is its radius; otherwise it is an elliptical region. The normalization aims to warp the detected region into a canonical circular region, as shown in Fig. 4. A sample point $X'$ belonging to the normalized region satisfies

$$X'^T X' \le r^2, \quad (3)$$


Fig. 4. The affine normalization of a detected region to the canonical circular region (normalized region).

Fig. 3. Image matching results of SIFT and DAISY with two different orientation assignment methods. See text for details.


where $r$ is the radius of the normalized region, which is set to 20.5 pixels in this paper. Combining (2) and (3), we have

$$X = \frac{1}{r} A^{-1/2} X' = T^{-1} X'. \quad (4)$$

Therefore, for each sample point $X'$ in the normalized region, we calculate its corresponding point $X$ in the detected region and take the intensity of $X$ as the intensity of $X'$ in the normalized region, i.e., $I(X') = I(X)$. Usually, $X$ is not exactly located at a grid point, so $I(X)$ is obtained by bilinear interpolation. Fig. 4 gives an example of the normalized region.

In the remainder of this paper, support regions are such normalized circular regions, and the intensities of points in the support regions are obtained by bilinear interpolation as described above.
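The normalization step, $X = \frac{1}{r}A^{-1/2}X'$ followed by bilinear interpolation, can be sketched directly in NumPy. This is a minimal illustration under our own conventions (keypoint center $(cx, cy)$ in image coordinates, row-major `img[y, x]` indexing), not the authors' implementation:

```python
import numpy as np

def normalize_region(img, A, cx, cy, r=20.5):
    """Warp the detected region X^T A X <= 1, centered at (cx, cy) in the
    image, into the canonical circular region of radius r via Eq. (4):
    X = (1/r) A^{-1/2} X'. Intensities are bilinearly interpolated."""
    w, V = np.linalg.eigh(A)                    # A is symmetric positive definite
    A_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T   # A^{-1/2}

    n = 41                                      # minimal patch containing the circle
    c0 = (n - 1) / 2.0                          # patch center
    patch = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            Xp = np.array([j - c0, i - c0])     # sample point X', relative to center
            if Xp @ Xp > r * r:
                continue                        # outside the normalized region
            x, y = A_inv_sqrt @ Xp / r          # Eq. (4): corresponding point X
            x, y = x + cx, y + cy               # back to absolute image coordinates
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            if 0 <= x0 < img.shape[1] - 1 and 0 <= y0 < img.shape[0] - 1:
                dx, dy = x - x0, y - y0         # bilinear interpolation weights
                patch[i, j] = ((1 - dx) * (1 - dy) * img[y0, x0]
                               + dx * (1 - dy) * img[y0, x0 + 1]
                               + (1 - dx) * dy * img[y0 + 1, x0]
                               + dx * dy * img[y0 + 1, x0 + 1])
    return patch
```

For $A = \frac{1}{c^2}E$ the mapping reduces to uniform rescaling of a circular region by $c/r$, as expected from (2).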

4.2 Support Region Partition Based on Intensity Orders

Given a support region, one can divide it into several rings and pool together the calculated local features of the sample points in each ring, in a similar way to spin images or RIFT [10], so as to achieve a rotation invariant description of the region. However, such an approach reduces the distinctiveness of the descriptor since some spatial information is lost. In other words, pooling local features circularly achieves a rotation invariant representation at the expense of the descriptor's discriminative ability. Therefore, many popular and state-of-the-art methods divide the support region into subregions in order to encode more spatial information, such as SIFT [6], DAISY [16], CS-LBP [14], OSID [28], and others [15], [27], [29]. Unfortunately, these predefined subregions need an assigned reference orientation in order to be rotation invariant. As we analyzed in Section 3, orientation estimation is not stable enough, and here we adopt a different approach. Rather than geometrically dividing the support region into subregions, we partition the sample points into different groups based on their intensity orders. In our case, the sample points in each group are not necessarily spatial neighbors, and such an adaptive partition does not require assigning a reference orientation.

We denote by $R = \{X_1, X_2, \ldots, X_n\}$ a support region with $n$ sample points, and by $I(X_i)$ the intensity of sample point $X_i$. Our goal is to partition $R$ into $k$ groups according to the intensity orders of the sample points. First, the intensities of the sample points are sorted in nondescending order, giving a set of sorted sample points

$$\{X_{f(1)}, X_{f(2)}, \ldots, X_{f(n)} : I(X_{f(1)}) \le I(X_{f(2)}) \le \cdots \le I(X_{f(n)})\},$$

where $f(1), f(2), \ldots, f(n)$ is a permutation of $1, 2, \ldots, n$. Then, we can take $k+1$ intensities from them as follows:

$$t_i = I(X_{f(s_i)}) : t_0 \le t_1 \le \cdots \le t_k, \qquad (5)$$

where

$$s_i = \begin{cases} \left\lceil \dfrac{ni}{k} \right\rceil, & i = 1, 2, \ldots, k, \\ 1, & i = 0. \end{cases} \qquad (6)$$

Finally, the $n$ sample points are partitioned into $k$ groups as

$$R_i = \{X_j \in R : t_{i-1} \le I(X_j) \le t_i\}, \quad i = 1, 2, \ldots, k. \qquad (7)$$

It is worth noting that only the intensity orders of the sample points are used. Therefore, the partitioned groups of sample points are invariant to monotonic intensity changes. The top of Fig. 7 gives an example of such an intensity order-based partition of a support region, where different point groups are indicated by different colors.
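The partition of (5)-(7) can be sketched as follows (a minimal illustration of ours, not the authors' code); it returns index groups and, since only the intensity order is used, any monotonic intensity change leaves the grouping unchanged.

```python
import math
import numpy as np

def partition_by_intensity_order(intensities, k):
    """Partition n sample points into k groups by intensity order (eqs. (5)-(7)).
    Returns a list of k lists of point indices."""
    intensities = np.asarray(intensities)
    n = len(intensities)
    order = np.argsort(intensities, kind="stable")  # f(1), ..., f(n)
    sorted_vals = intensities[order]
    # threshold intensities t_0 <= t_1 <= ... <= t_k, with s_i = ceil(n*i/k), s_0 = 1
    t = [sorted_vals[0]] + [sorted_vals[math.ceil(n * i / k) - 1] for i in range(1, k + 1)]
    groups = [[] for _ in range(k)]
    for j, v in enumerate(intensities):
        for i in range(1, k + 1):
            if t[i - 1] <= v <= t[i]:
                groups[i - 1].append(j)  # boundary values go to the lowest matching group
                break
    return groups
```

For example, with distinct intensities and `k = 2`, the points with the `n/2` smallest intensities form the first group and the rest form the second, regardless of any monotonic remapping of the intensities.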

4.3 The Computation of Rotation Invariant Local Features

4.3.1 The Rotation Invariant Coordinate System

The key to computing rotation invariant local features is to construct a rotation invariant coordinate system for each sample point. As shown in Fig. 5, suppose $P$ is an interest point and $X_i$ is one of the sample points in its support region. Then, a local xy coordinate system can be established from $P$ and $X_i$ by setting $\overrightarrow{PX_i}$ as the positive y-axis for the sample point $X_i$. Since such a local coordinate system is rotation invariant, we calculate local features in this coordinate system to obtain rotation invariance and accumulate them in the partitioned point groups to construct our descriptors. Here, we propose two kinds of local features: one based on gradients and the other on intensities.

4.3.2 Gradient-Based Local Feature

For each sample point $X_i$, its rotation invariant gradient can be calculated in this local coordinate system by using pixel differences as follows:

$$D_x(X_i) = I(X_i^1) - I(X_i^5), \qquad (8)$$

$$D_y(X_i) = I(X_i^3) - I(X_i^7), \qquad (9)$$

where $X_i^j$, $j = 1, 3, 5, 7$, are $X_i$'s neighboring points along the x-axis and y-axis of the local xy-coordinate system, as shown in Fig. 5, and $I(X_i^j)$ stands for the intensity at $X_i^j$. Then, the gradient magnitude $m(X_i)$ and orientation $\theta(X_i)$ can be computed as

$$m(X_i) = \sqrt{D_x(X_i)^2 + D_y(X_i)^2}, \qquad (10)$$

$$\theta(X_i) = \tan^{-1}\!\big(D_y(X_i)/D_x(X_i)\big). \qquad (11)$$

Note that $\theta(X_i)$ is then mapped into the range $[0, 2\pi)$ according to the signs of $D_x(X_i)$ and $D_y(X_i)$. In order to make an informative representation of $X_i$, its gradient is transformed into a $d$-dimensional vector, denoted by

2036 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. 10, OCTOBER 2012

Fig. 5. The locally rotation invariant coordinate system used for calculating local features of a sample point Xi. P is the interest point (center of the normalized region).


$F_G(X_i) = (f_1^G, f_2^G, \ldots, f_d^G)$. To this end, $[0, 2\pi)$ is split into $d$ equal bins as $dir_i = (2\pi/d) \times (i-1)$, $i = 1, 2, \ldots, d$; then $\theta(X_i)$ is linearly allocated to the two adjacent bins according to its distances to them, weighted by $m(X_i)$:

$$f_j^G = \begin{cases} m(X_i)\, \dfrac{2\pi/d - \delta(\theta(X_i), dir_j)}{2\pi/d}, & \text{if } \delta(\theta(X_i), dir_j) < 2\pi/d, \\ 0, & \text{otherwise}, \end{cases} \qquad (12)$$

where $\delta(\theta(X_i), dir_j)$ is the angle between $\theta(X_i)$ and $dir_j$.

Note that RIFT [10] calculates a rotation invariant gradient in a similar way for its feature description. However, since it accumulates the cells of the gradient orientation histogram in rings around the interest point to achieve rotation invariance, some spatial information is lost, degrading its distinctiveness. Our proposed descriptors are instead constructed using intensity order pooling, which encodes ordinal information into the descriptors while maintaining their rotation invariance. Experiments (Section 5) show that pooling by intensity orders is more informative than pooling by rings.
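The mapping from a local-frame gradient to the $d$-dimensional vector of (10)-(12) can be sketched as follows (our own minimal illustration; the function name is ours). Each gradient contributes its magnitude, split linearly between the two adjacent orientation bins.

```python
import numpy as np

def gradient_feature(Dx, Dy, d=8):
    """Map a rotation invariant gradient (Dx, Dy), computed in the local
    coordinate frame of eqs. (8)-(9), to a d-dimensional vector via the
    linear allocation of eq. (12)."""
    m = np.hypot(Dx, Dy)                      # gradient magnitude, eq. (10)
    theta = np.arctan2(Dy, Dx) % (2 * np.pi)  # orientation in [0, 2*pi), eq. (11)
    width = 2 * np.pi / d                     # angular width of one bin
    f = np.zeros(d)
    for j in range(d):
        dir_j = width * j
        # smallest angular distance between theta and the bin direction dir_j
        delta = abs((theta - dir_j + np.pi) % (2 * np.pi) - np.pi)
        if delta < width:
            f[j] = m * (width - delta) / width
    return f
```

By construction, at most two adjacent bins are nonzero and their entries sum to the gradient magnitude, so no magnitude is lost in the allocation.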

4.3.3 Intensity-Based Local Feature

Our intensity-based feature is similar to CS-LBP [14], but it is calculated in a rotation invariant way, i.e., in the locally rotation invariant coordinate system shown in Fig. 5. Since our method aims to skip the rotation alignment of the support region, so as to achieve rotation invariance without resorting to a reference orientation, such a modification is essential. For each sample point $X_i$, suppose that $X_i^j$, $j = 1, 2, \ldots, 2m$, are its $2m$ neighboring points regularly sampled along a circle. To obtain rotation invariance, $X_i^1$ is located at the intersection of this circle and the positive x-axis of the local xy-coordinate system. By comparing the intensities of opposite sample points, we get an $m$-dimensional binary vector: $\big(\mathrm{sign}(I(X_i^{m+1}) - I(X_i^1)), \mathrm{sign}(I(X_i^{m+2}) - I(X_i^2)), \ldots, \mathrm{sign}(I(X_i^{2m}) - I(X_i^m))\big)$. Then, a local feature $F_I(X_i) = (f_1^I, f_2^I, \ldots, f_{2^m}^I)$ of $X_i$ is obtained by mapping the $m$-dimensional binary vector into a $2^m$-dimensional vector:

$$f_j^I = \begin{cases} 1, & \text{if } \displaystyle\sum_{k=1}^{m} \mathrm{sign}\big(I(X_i^{k+m}) - I(X_i^k)\big) \cdot 2^{k-1} = j - 1, \\ 0, & \text{otherwise}, \end{cases} \qquad \mathrm{sign}(x) = \begin{cases} 1, & x > 0, \\ 0, & \text{otherwise}. \end{cases} \qquad (13)$$

Note that $F_I(X_i)$ has only one element equal to 1; all the others are 0.
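Equation (13) amounts to a one-hot encoding indexed by the $m$ comparison bits. A minimal sketch (our own illustration; the function name and input convention are assumptions, with the $2m$ neighbor intensities given in circular order):

```python
import numpy as np

def intensity_feature(neighbor_intensities):
    """Map the intensities of the 2m circular neighbors of a sample point to
    the 2^m-dimensional one-hot vector of eq. (13)."""
    vals = np.asarray(neighbor_intensities, dtype=float)
    m = len(vals) // 2
    # sign(I(X^{k+m}) - I(X^k)) for k = 1..m, with sign(x) = 1 if x > 0 else 0
    bits = (vals[m:] - vals[:m] > 0).astype(int)
    index = int(np.dot(bits, 2 ** np.arange(m)))  # binary code -> integer j - 1
    f = np.zeros(2 ** m, dtype=int)
    f[index] = 1                                  # exactly one element is 1
    return f
```

Because only intensity comparisons enter the code, this feature, like the pooling scheme itself, is unchanged by any monotonic intensity transformation.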

4.4 Local Image Descriptor Construction

As our experiments show, a single support region is in general not enough to distinguish incorrect matches from correct ones. Two noncorresponding interest points may accidentally have similar appearances in a certain local region. However, it is less likely that two noncorresponding interest points have similar appearances in several local regions of different sizes. In contrast, two corresponding interest points should have similar appearances in a local region of any size, although small differences may exist due to localization errors of interest point and region detection. That is to say, using multiple support regions, one can handle the mismatching problem better than using a single support region. Therefore, we utilize multiple support regions to construct our descriptors.

As shown in Fig. 6, we choose the support regions as $N$ nested regions centered at the interest point with an equal increment of radius. Suppose that the detected region is denoted by $A \in \mathbb{R}^{2\times 2}$. It is used as the minimal support region, and the other support regions are defined as $A_i = \frac{1}{r_i^2} A$, $i = 1, 2, \ldots, N$, in which $r_i$ indicates the size of the $i$th support region. In this paper, we set $r_i = 1 + 0.5 \times (i-1)$ so that the support regions have an equal increment of radius, as plotted in Fig. 6a. In each support region, the rotation invariant local features of all the sample points are then pooled by their intensity orders. As shown in Fig. 7, they are first pooled together to form a vector in each partition, which is obtained based on the intensity orders of the sample points, and then the accumulated vectors of the different partitions are concatenated to represent the support region. We denote it as $D(R) = (F(R_1), F(R_2), \ldots, F(R_k))$, where $F(R_i)$ is the accumulated vector of partition $R_i$, i.e.,

$$F(R_i) = \sum_{X \in R_i} F_G(X), \qquad (14)$$


Fig. 6. The selection of four support regions and their normalization. Each color corresponds to the boundary of one support region, and all the support regions are normalized to a circular region with unified radius. (a) The selected support regions and (b)-(e) the normalized regions.


if the gradient-based feature is used (the constructed descriptor is then called MROGH), or

$$F(R_i) = \sum_{X \in R_i} F_I(X), \qquad (15)$$

if the intensity-based feature is used (the constructed descriptor is then called MRRID). Finally, all the vectors calculated from the $N$ support regions are concatenated to form our final descriptor: $\{D_1, D_2, \ldots, D_N\}$. Fig. 8 gives an overview of our proposed method.
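The accumulate-and-concatenate steps of (14)-(15) and the final multiregion concatenation can be sketched as follows (our own minimal illustration, not the released implementation); `features` is an n × d array of per-point local features and `groups` holds the index sets from the intensity order partition.

```python
import numpy as np

def support_region_descriptor(features, groups):
    """Accumulate per-point local features within each intensity-order group
    (eq. (14)/(15)) and concatenate the k group vectors into D(R)."""
    return np.concatenate([np.sum(features[idx], axis=0) for idx in groups])

def multi_region_descriptor(per_region_descriptors):
    """Concatenate the vectors of the N support regions: {D_1, D_2, ..., D_N}."""
    return np.concatenate(per_region_descriptors)
```

The same two functions cover both descriptors: MROGH feeds in $d$-dimensional gradient features, MRRID feeds in $2^m$-dimensional one-hot intensity features.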

4.5 Properties

Our two proposed descriptors (MROGH and MRRID) have the following characteristics, making them both distinctive and robust to many image transformations:

1. Local features are calculated in a rotation invariant way without resorting to an estimated reference orientation. Therefore, the resulting rotation invariance is more stable than that obtained by aligning the positive x-axis with an estimated orientation, which tends to be unreliable according to our experimental study (see Section 3).

2. By partitioning the sample points of the support region according to their intensity orders, the obtained partitions are rotation invariant and hence suitable for pooling rotation invariant local features. Unlike ring-shaped partitions, which lose some spatial information, the intensity order-based partitions encode ordinal information into the descriptor. Therefore, the proposed descriptors can have higher discriminative ability.

3. Since intensity orders are invariant to monotonic intensity changes, our proposed descriptors provide a higher degree of illumination invariance, not merely invariance to linear illumination changes. Thus, they can deal with large illumination changes. MRRID in particular achieves much better results than MROGH and the other evaluated descriptors when the matched images exhibit large illumination changes (see Section 5.3.2). This is because, for MRRID, not only is the feature pooling scheme based on intensity orders, but the local feature is also based on the relative intensity relationships of the sample points.

Fig. 8. The workflow of our proposed method.

Fig. 7. The procedure of pooling local features based on their intensity orders for a support region. See text for details.

4. The proposed descriptors are constructed on the basis of multiple support regions, further enhancing their discriminative ability. Utilizing multiple support regions also avoids, to some extent, the problem of selecting an optimal region size for constructing the descriptor of a detected interest region.

5 EXPERIMENTS

5.1 Parameter Evaluation

There are several parameters in the proposed descriptors: the number of partitions $k$, the number of support regions $N$, the number of orientation bins $d$, and the number of binary codes $m$. As listed in Table 1, MROGH and MRRID share two parameters, the number of partitions and the number of support regions, while the number of orientation bins is needed only in MROGH and the number of binary codes only in MRRID. In order to evaluate their influence on the performance of the proposed descriptors, we conducted image matching experiments on 142 pairs of images² with the different parameter settings listed in Table 1. These 142 image pairs are mainly selected from the data set of zoom and rotation transformations. Note that they do not contain image pairs from the standard Oxford data set [44], because those image pairs are used for descriptor evaluation at a later stage.

Fig. 9 shows the average recall versus average 1-precision curves of MROGH and MRRID under different parameter settings. The definitions of a correct match and a correspondence are the same as in [22], determined with the overlap error [9]. The matching strategy used here is the nearest neighbor distance ratio [22]. In the remaining experiments, we used the same definitions of a match, a correct match, and a correspondence.

It can be seen from Fig. 9 that the performance of MROGH and MRRID improves as the number of support regions increases. When a single support region is used, MROGH performs best with the parameter setting "$d = 8, k = 8$," closely followed by "$d = 8, k = 6$," while MRRID performs best with "$m = 4, k = 8$," followed by "$m = 4, k = 4$." Taking both performance and complexity (dimension) into consideration, we use the parameter setting "$d = 8, k = 6, N = 4$" for MROGH and "$m = 4, k = 4, N = 4$" for MRRID in the remaining experiments. Thus, the dimensionality is 192 for MROGH and 256 for MRRID.
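The stated dimensionalities follow directly from the construction: each support region contributes $k$ accumulated vectors of length $d$ (MROGH) or $2^m$ (MRRID), concatenated over $N$ regions. A small arithmetic sketch (ours, for checking):

```python
# MROGH pools a d-dimensional gradient vector in each of k groups over N regions;
# MRRID pools a 2^m-dimensional one-hot vector in each of k groups over N regions.
d, k_mrogh, N = 8, 6, 4
m, k_mrrid = 4, 4
dim_mrogh = d * k_mrogh * N           # 8 * 6 * 4 = 192
dim_mrrid = (2 ** m) * k_mrrid * N    # 16 * 4 * 4 = 256
print(dim_mrogh, dim_mrrid)
```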

5.2 Multisupport Regions versus Single Support Region

This experiment aims to show the superiority of using multiple support regions over a single support region. We used the same data set as in the parameter evaluation experiment (Section 5.1). Since four support regions are used in our method, for each of the proposed descriptors (MROGH or MRRID) we can obtain five descriptors for one interest point: SR-$i$ denotes the descriptor calculated from the $i$th support region, and MR is the descriptor concatenating SR-1, SR-2, SR-3, and SR-4. For each image pair, we calculated these five descriptors separately to perform point matching and obtained the average recall versus average 1-precision curves as before. The comparative results are shown in Fig. 10, in which Fig. 10a shows the results of MROGH and Fig. 10b the results of MRRID. It can be seen that SR-2, SR-3, and SR-4 perform better than SR-1, but perform almost equally well among themselves. Thus, when using a single support region, the performance of the descriptor improves as the size of the support region increases, but the effect becomes negligible after a certain size. Figs. 10a and 10b demonstrate that by combining multiple support regions for descriptor construction, the performance of MROGH and MRRID improves significantly over the best performance obtained with a single support region. This indicates that using multiple support regions improves the discriminative ability of our proposed descriptors. For comparison, the performance of SIFT is plotted too. It is worth noting that even the single-region MROGH (SR-1) is superior to SIFT, while the dimension of SR-1 is much lower (48 versus 128), demonstrating the effectiveness of the proposed method. Fig. 10c shows the results of calculating SIFT with multiple support regions. Although the performance of SIFT is improved with multiple support regions, the improvement is much less significant than for MROGH and MRRID. This is because the error and uncertainty induced by incorrect orientation estimation hamper its further improvement. In contrast, MROGH and MRRID do not have such limitations, as they do not involve any orientation estimation.

5.3 Image Matching

To evaluate the performance of the proposed descriptors, we conducted extensive experiments on image matching, following the evaluation procedure proposed by Mikolajczyk and Schmid [22]. The evaluation code was downloaded from their website [44]. Three other local descriptors were evaluated in our experiments for comparison: RIFT, SIFT, and DAISY. SIFT was downloaded from the Oxford University website [44] and is the same as the one used by Mikolajczyk and Schmid [22], while RIFT and DAISY (the T1-8-2r8s configuration [34] of DAISY is used) are our own implementations. Note that the downloaded SIFT only


TABLE 1
Parameters of Our Proposed Descriptors

2. They are real images downloaded from [43]. The ground truth homography is supplied along with each image pair.


uses one dominant orientation (the orientation corresponding to the largest peak of the orientation histogram) for each interest point. To be consistent with it, our implementation of DAISY also uses one dominant orientation per interest point. Therefore, all the evaluated methods (DAISY, SIFT, RIFT, MROGH, and MRRID) have the same number of features, and the obtained numbers of correspondences are the same too. In our experiments (Intel Core2 CPU, 1.86 GHz), the time for constructing the descriptor of a feature point is 1.6 ms for RIFT, 12.3 ms for DAISY, 4.6 ms for SIFT, 8.3 ms for MROGH, 1.9 ms for MROGH with a single region, 18.5 ms for MRRID, and 4.8 ms for MRRID with a single region.

5.3.1 Performance on the Oxford Data Set

The proposed descriptors were evaluated on the standard Oxford data set, in which image pairs undergo various image transformations, including viewpoint change, scale and rotation changes, image blur, JPEG compression, and illumination change. They are shown in Fig. 2, available in the online supplemental material. In each row, the images from the second to the fifth columns are matched to the first image separately. As in [22], we evaluated the performance of descriptors using both the Hessian-Affine and Harris-Affine detectors [9]. Due to the paper length limit, we only give the results with Hessian-Affine regions, in Fig. 11; please see Fig. 3 in the online supplemental material for the results with Harris-Affine regions. Although the performance of each descriptor varies with the feature detector (Hessian-Affine or Harris-Affine in this experiment), the relative performance among the descriptors is consistent across detectors. The results show that MROGH and MRRID outperform the other evaluated local descriptors in all tested cases. Such good performance of MROGH and MRRID may be attributed to their well-designed properties. They not only use multiple support regions to improve discriminative ability, but also combine local features (the histogram of gradient orientations in MROGH and CS-LBP in MRRID) with ordinal information. Moreover, they are rotation invariant without relying on a reference orientation, further improving their robustness. Since MROGH uses a local feature similar to the one used in RIFT, the significant performance improvement of MROGH over RIFT demonstrates the effectiveness and advantage of our proposed feature pooling scheme, i.e., pooling by intensity orders is more informative than pooling by rings. In most cases, MROGH performs better than MRRID. When images exhibit blur or illumination changes, MRRID is better, especially under large illumination changes.

5.3.2 Performance on Images with Large Illumination Changes

In order to investigate the robustness of the proposed descriptors to monotonic intensity changes, we conducted image matching on three additional image pairs with large illumination changes, shown at the top of Fig. 12. They were captured by changing the exposure time of the camera; the matching results are shown below the tested images. The Hessian-Affine detector was used for interest region detection. It can be seen from Fig. 12 that MRRID performs best in such cases and achieves much higher performance than the other evaluated descriptors, due to its invariance to monotonic intensity changes. The more severe the brightness change, the larger the performance gap between MRRID and the other evaluated descriptors.

5.3.3 Performance Evaluation Based on 3D Objects

Moreels and Perona [45] evaluated different combinations of feature detectors and descriptors based on 3D objects. We


Fig. 10. Performance comparison between multisupport regions and a single support region when using (a) MROGH, (b) MRRID, and (c) SIFT. SR-i is the result using the ith support region and MR is the result using multisupport regions. The performances of MROGH and MRRID are significantly improved when using multisupport regions, while the improvement of SIFT is much less significant.

Fig. 9. The average performance of the proposed descriptors with different parameter settings.



Fig. 11. Experimental results under various image transformations in the Oxford data set for Hessian-Affine regions.


evaluated the proposed descriptors following their work.³ The Hessian-Affine detector was used for feature extraction, since it was reported to give the best results in combination with SIFT in [45]. As in [45], interest point matching was conducted on a database containing both the target features and a large number of features from unrelated images, so as to mimic the process of image retrieval/object recognition. $10^5$ features were randomly chosen from 500 unrelated images obtained from Google by typing "things." Please refer to [45] for more details about the data set and experimental setup.

Fig. 13 shows the comparative results. In Fig. 13a, the ROC curves of "detection rate versus false alarm rate" were obtained by varying the threshold on the nearest neighbor distance ratio, which defines a match. The detection rate is the number of detections divided by the number of tested matches, while the false alarm rate is the number of false alarms divided by the number of tested matches. A tested match is classified as a nonmatch, a false alarm, or a correct match (a detection) according to the distance between the descriptors and whether the geometric constraints are satisfied [45]. Fig. 13b shows the detection rate as a function of viewpoint change at a fixed false alarm rate. A false alarm rate of $10^{-6}$ implies one false alarm every 10 attempts on average, since the false alarm rate is normalized by the number of database features ($10^5$). It can be seen from Fig. 13 that the proposed descriptors and DAISY outperform SIFT and RIFT, while MROGH is slightly better than DAISY.

5.3.4 Performance on the Patch Data Set

The proposed descriptors were also evaluated on the Patch Data Set [34], [46]. Since the matched patches in this data set have only very small rotations, we varied the rotation of the patches before constructing descriptors, in order to mimic the real situation of image matching in which the rotation between the matched images is usually unknown. The matching results are shown in Fig. 14. From Figs. 14a-14c and Figs. 14e-14g, MROGH and MRRID outperform SIFT, DAISY, and RIFT on image patches taken from the same objects under different lighting and viewing conditions. Figs. 14d and 14h show the performance degradation of SIFT and DAISY when they need to assign a reference orientation. OriDAISY and OriSIFT refer to constructing descriptors on the patches without rotation, while DAISY and SIFT refer to constructing descriptors on the patches according to a reference orientation. It can be seen from Figs. 14d and 14h that the


Fig. 13. Experimental results on the data set of 3D objects. (a) Detection rate versus false alarm rate. (b) Detection rate versus viewpoint changes at a false alarm rate of $10^{-6}$.

Fig. 12. Matching results of images with large illumination changes, which were captured by changing the exposure time of the camera. The more severe the brightness change, the larger the performance gap between MRRID and the other evaluated descriptors.

3. The data set was downloaded from http://www.vision.caltech.edu/pmoreels/Datasets/TurntableObjects. The downloadable data set actually contains 77 different objects, so our experiments used 77 different objects for evaluation, not 100 as described in [45].


matching performances of SIFT and DAISY degrade when the orientation is unknown (in which case a reference orientation is required for descriptor construction).

5.4 Object Recognition

We further conducted experiments on object recognition to show the effectiveness of the proposed descriptors. In our experiments, the similarity between images is defined as follows. Suppose that $I_1$ and $I_2$ are two images, and $\{f_1^1, f_2^1, \ldots, f_m^1\}$, $\{f_1^2, f_2^2, \ldots, f_n^2\}$ are the two sets of feature descriptors extracted from $I_1$ and $I_2$. Then, the similarity between $I_1$ and $I_2$ is defined as

$$\mathrm{Sim}(I_1, I_2) = \frac{\sum_{i,j} g\big(f_i^1, f_j^2\big)}{m \times n}, \qquad (16)$$

where

$$g\big(f_i^1, f_j^2\big) = \begin{cases} 1, & \text{if } \mathrm{dist}\big(f_i^1, f_j^2\big) \le T, \\ 0, & \text{otherwise}, \end{cases} \qquad (17)$$

in which $\mathrm{dist}(f_i^1, f_j^2)$ is the Euclidean distance between descriptors, and $T$ is a threshold tuned to give the best result for each evaluated descriptor. Such a definition of similarity relies only on the distinctiveness of the local descriptor.
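Under (16) and (17), the similarity is simply the fraction of cross-image descriptor pairs whose Euclidean distance is at most $T$. A small numpy sketch (ours, not the paper's code), where `F1` and `F2` are m × d and n × d descriptor matrices:

```python
import numpy as np

def image_similarity(F1, F2, T):
    """Sim(I1, I2) of eqs. (16)-(17): the fraction of descriptor pairs
    (one from each image) whose Euclidean distance is at most T."""
    # pairwise Euclidean distances between the two descriptor sets (m x n matrix)
    dists = np.linalg.norm(F1[:, None, :] - F2[None, :, :], axis=2)
    # mean of the 0/1 indicator g over all m*n pairs
    return np.mean(dists <= T)
```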

Three test data sets were downloaded from the Internet [47], [48]. The first data set (named 53 Objects) contains 53 objects, each of which has five images taken from different viewpoints. The second data set (named ZuBuD) contains 201 different buildings, each with five images taken from different viewpoints. The third is the Kentucky data set [48]; it contains 2,550 objects, each with four images taken from different viewpoints. We used the first 4,000 images from 1,000 objects in our experiments. Some images from these data sets are shown in Fig. 4 in the online supplemental material. For each image in the 53 Objects and ZuBuD data sets, we calculated its similarities to the remaining images in the data set and returned the four images with the largest similarities, while for each image in the Kentucky data set we returned the top three images. We recorded the ratio of the number of correctly returned images to the total number of returned images as the recognition accuracy. Table 2 gives the recognition results, and some recognition examples are shown in Fig. 5 in the online supplemental material. MROGH performs the best among all the evaluated descriptors. Since MRRID is primarily designed to deal with large illumination changes and the tested data sets mainly contain viewpoint changes, MRRID does not perform as well as MROGH. However, it is still better than SIFT and RIFT.

6 CONCLUSION

This paper presents a novel method for constructing interest region descriptors. The key idea is to pool rotation invariant local features based on intensity orders. It has the following important properties:

1. Unlike many popular descriptors, where an estimated orientation is assigned for obtaining rotation invariance, our proposed descriptors are inherently rotation invariant, thanks to a rotation invariant way of computing local features and an adaptive feature pooling scheme.


TABLE 2
Recognition Results on the Three Tested Data Sets with Different Local Descriptors

Fig. 14. ROC curves of five different local descriptors on the Liberty, Notre Dame, and Yosemite data sets (20,000 matches and 20,000 nonmatches are used). (a)-(c) are the results on DoG interest points and (e)-(g) are the results on multiscale Harris interest points. (d) and (h) show the performance degradation of SIFT and DAISY when they need to assign a reference orientation for rotation invariance.


2. The proposed pooling scheme partitions the sample points into several groups based on their intensity orders rather than their geometric locations. Pooling by intensity orders makes our two descriptors rotation invariant without resorting to the conventional practice of orientation estimation, which is an error-prone process according to our experimental analysis. In addition, the intensity order pooling scheme encodes ordinal information into the descriptor.

3. Multiple support regions are used to further improve discriminative ability.

By pooling two different kinds of local features (gradient-based and intensity-based) based on intensity orders, this paper obtained two descriptors: MROGH and MRRID. The former combines intensity order information with gradients, while the latter is based entirely on intensity orders, which makes it particularly suitable for large illumination changes. Extensive experiments show that both MROGH and MRRID achieve better performance than state-of-the-art descriptors.

ACKNOWLEDGMENTS

This work is supported by the National Science Foundation of China under grant Nos. 60835003 and 61075038. The authors thank the anonymous reviewers for their suggestions, and Prof. Shiming Xiang and Prof. Michael Werman for helpful discussions.

REFERENCES

[1] S. Agarwal, N. Snavely, I. Simon, S.M. Seitz, and R. Szeliski, “Building Rome in a Day,” Proc. 12th IEEE Int’l Conf. Computer Vision, pp. 72-79, 2009.

[2] Y. Furukawa and J. Ponce, “Accurate, Dense, and Robust Multiview Stereopsis,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 8, pp. 1362-1376, Aug. 2010.

[3] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys, “Building Rome on a Cloudless Day,” Proc. 11th European Conf. Computer Vision, pp. 368-381, 2010.

[4] N. Snavely, S.M. Seitz, and R. Szeliski, “Photo Tourism: Exploring Photo Collections in 3D,” ACM Trans. Graphics, vol. 25, pp. 835-846, 2006.

[5] S. Agarwal, Y. Furukawa, N. Snavely, B. Curless, S.M. Seitz, and R. Szeliski, “Reconstructing Rome,” Computer, vol. 43, pp. 40-47, 2010.

[6] D.G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int’l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[7] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study,” Int’l J. Computer Vision, vol. 73, no. 2, pp. 213-238, 2007.

[8] R. Szeliski, “Image Alignment and Stitching: A Tutorial,” Foundations and Trends in Computer Graphics and Vision, vol. 2, pp. 1-104, 2006.

[9] K. Mikolajczyk and C. Schmid, “An Affine Invariant Interest Point Detector,” Proc. Seventh European Conf. Computer Vision, pp. 128-142, 2002.

[10] S. Lazebnik, C. Schmid, and J. Ponce, “A Sparse Texture Representation Using Local Affine Regions,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1265-1278, Aug. 2005.

[11] J.J. Koenderink and A.J.V. Doorn, “Representation of Local Geometry in the Visual System,” Biological Cybernetics, vol. 55, pp. 367-375, 1987.

[12] A. Baumberg, “Reliable Feature Matching across Widely Separated Views,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, 2000.

[13] F. Schaffalitzky and A. Zisserman, “Multi-View Matching for Unordered Image Sets, or How Do I Organize My Holiday Snaps?” Proc. Seventh European Conf. Computer Vision, pp. 414-431, 2002.

[14] M. Heikkila, M. Pietikainen, and C. Schmid, “Description of Interest Regions with Local Binary Patterns,” Pattern Recognition, vol. 42, pp. 425-436, 2009.

[15] E. Mortensen, H. Deng, and L. Shapiro, “A SIFT Descriptor with Global Context,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 184-190, 2005.

[16] E. Tola, V. Lepetit, and P. Fua, “A Fast Local Descriptor for Dense Matching,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.

[17] B. Fan, F. Wu, and Z. Hu, “Aggregating Gradient Distributions into Intensity Orders: A Novel Local Image Descriptor,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2377-2384, 2011.

[18] C. Harris and M. Stephens, “A Combined Corner and Edge Detector,” Proc. Fourth Alvey Vision Conf., pp. 147-151, 1988.

[19] J. Matas, O. Chum, M. Urba, and T. Pajdla, “Robust Wide Baseline Stereo from Maximally Stable Extremal Regions,” Proc. British Machine Vision Conf., pp. 384-396, 2002.

[20] T. Tuytelaars and L.V. Gool, “Matching Widely Separated Views Based on Affine Invariant Regions,” Int’l J. Computer Vision, vol. 59, no. 1, pp. 61-85, 2004.

[21] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L.V. Gool, “A Comparison of Affine Region Detectors,” Int’l J. Computer Vision, vol. 65, nos. 1/2, pp. 43-72, 2005.

[22] K. Mikolajczyk and C. Schmid, “A Performance Evaluation of Local Descriptors,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, Oct. 2005.

[23] S. Belongie, J. Malik, and J. Puzicha, “Shape Context: A New Descriptor for Shape Matching and Object Recognition,” Proc. Neural Information Processing Systems Conf., pp. 831-837, 2001.

[24] Y. Ke and R. Sukthankar, “PCA-SIFT: A More Distinctive Representation for Local Image Descriptors,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 511-517, 2004.

[25] R. Gupta and A. Mittal, “SMD: A Locally Stable Monotonic Change Invariant Feature Descriptor,” Proc. 10th European Conf. Computer Vision, pp. 265-277, 2008.

[26] J. Chen, S. Shan, G. Zhao, X. Chen, W. Gao, and M. Pietikainen, “A Robust Descriptor Based on Weber’s Law,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.

[27] H. Bay, A. Ess, T. Tuytelaars, and L.V. Gool, “SURF: Speeded Up Robust Features,” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.

[28] F. Tang, S.H. Lim, N.L. Change, and H. Tao, “A Novel Feature Descriptor Invariant to Complex Brightness Changes,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2631-2638, 2009.

[29] R. Gupta, H. Patil, and A. Mittal, “Robust Order-Based Methods for Feature Description,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 334-341, 2010.

[30] W.T. Freeman and E.H. Adelson, “The Design and Use of Steerable Filters,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 9, pp. 891-906, Sept. 1991.

[31] L.V. Gool, T. Moons, and D. Ungureanu, “Affine/Photometric Invariants for Planar Intensity Patterns,” Proc. Fourth European Conf. Computer Vision, pp. 642-651, 1996.

[32] I.T. Joliffe, Principal Component Analysis. Springer-Verlag, 1986.

[33] S. Winder and M. Brown, “Learning Local Image Descriptors,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.

[34] S. Winder, G. Hua, and M. Brown, “Picking the Best DAISY,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 178-185, 2009.

[35] A.C. Berg and J. Malik, “Geometric Blur for Template Matching,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 607-614, 2001.

[36] P.-E. Forssen and D.G. Lowe, “Shape Descriptor for Maximally Stable Extremal Regions,” Proc. IEEE 11th Int’l Conf. Computer Vision, 2007.

[37] A.K. Jain, Fundamentals of Digital Image Signal Processing. Prentice-Hall, 1989.

[38] A. Mittal and V. Ramesh, “An Intensity-Augmented Ordinal Measure for Visual Correspondence,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 849-856, 2006.

2044 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. 10, OCTOBER 2012


[39] R. Gupta and A. Mittal, “Illumination and Affine-Invariant Point Matching Using an Ordinal Approach,” Proc. IEEE 11th Int’l Conf. Computer Vision, 2007.

[40] T. Ojala, M. Pietikainen, and D. Harwood, “A Comparative Study of Texture Measures with Classification Based on Feature Distributions,” Pattern Recognition, vol. 29, pp. 51-59, 1996.

[41] T. Harada, H. Nakayama, and Y. Kuniyoshi, “Improving Local Descriptors by Embedding Global and Local Spatial Information,” Proc. 11th European Conf. Computer Vision, pp. 736-749, 2010.

[42] H. Cheng, Z. Liu, N. Zheng, and J. Yang, “A Deformable Local Image Descriptor,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.

[43] http://lear.inrialpes.fr/people/mikolajczyk/, 2011.

[44] http://www.robots.ox.ac.uk/~vgg/research/affine/, 2012.

[45] P. Moreels and P. Perona, “Evaluation of Features Detectors and Descriptors Based on 3D Objects,” Int’l J. Computer Vision, vol. 73, no. 3, pp. 263-284, 2007.

[46] http://www.cs.ubc.ca/~mbrown/patchdata/patchdata.html, 2012.

[47] http://www.vision.ee.ethz.ch/datasets/index.en.htm, 2011.

[48] D. Nister and H. Stewenius, “Scalable Recognition with a Vocabulary Tree,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2161-2168, 2006.

Bin Fan received the BS degree in automation from Beijing University of Chemical Technology, China, in 2006, and the PhD degree in pattern recognition and intelligent systems from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2011. In July 2011, he joined the National Laboratory of Pattern Recognition, where he works as an assistant professor. His research interests cover image matching and feature extraction. He is a member of the IEEE Computer Society.

Fuchao Wu received the BS degree in mathematics from Anqing Teacher’s College, China, in 1982. From 1984 to 1994, he served first as a lecturer and then as an associate professor at Anqing Teacher’s College. From 1995 to 2000, he was a professor at Anhui University, Hefei, China. Since 2000, he has been with the Institute of Automation, Chinese Academy of Sciences, where he is now a professor. His research interests include computer vision, in particular 3D reconstruction, active vision, and image-based modeling and rendering.

Zhanyi Hu received the BS degree in automation from the North China University of Technology, Beijing, in 1985, and the PhD degree in computer vision from the University of Liege, Belgium, in January 1993. Since 1993, he has been with the Institute of Automation, Chinese Academy of Sciences, where he is now a professor. His research interests include robot vision, in particular camera calibration, 3D reconstruction, and vision-guided robot navigation. He was the local chair of ICCV ’05, an area chair of ACCV ’09, and is a PC chair of ACCV ’12.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
