
Object Recognition by Discriminative Combinations of Line Segments, Ellipses, and Appearance Features

Alex Yong-Sang Chia, Deepu Rajan, Member, IEEE, Maylor Karhang Leung, Member, IEEE, and Susanto Rahardja, Fellow, IEEE

Abstract—We present a novel contour-based approach that recognizes object classes in real-world scenes using simple and generic shape primitives of line segments and ellipses. Compared to commonly used contour fragment features, these primitives support more efficient representation since their storage requirements are independent of object size. Additionally, these primitives are readily described by their geometrical properties and hence afford very efficient feature comparison. We pair these primitives as shape-tokens and learn discriminative combinations of shape-tokens. Here, we allow each combination to have a variable number of shape-tokens. This, coupled with the generic nature of primitives, enables a variety of class-specific shape structures to be learned. Building on the contour-based method, we propose a new hybrid recognition method that combines shape and appearance features. Each discriminative combination can vary in the number and the types of features, where these two degrees of variability empower the hybrid method with even more flexibility and discriminative potential. We evaluate our methods across a large number of challenging classes, and obtain very competitive results against other methods. These results show the proposed shape primitives are indeed sufficiently powerful to recognize object classes in complex real-world scenes.

Index Terms—Shape primitives, appearance features, image classification, category-level object detection.


1 INTRODUCTION

Recognizing object classes in real-world images is a long-standing goal in computer vision. Conceptually, this is challenging due to large appearance variations of object instances belonging to the same class. Additionally, distortions from background clutter, scale, and viewpoint variations can render appearances of even the same object instance to be vastly different. Further challenges arise from interclass similarity, in which instances from different classes can appear very similar. Consequently, models for object classes must be flexible enough to accommodate class variability, yet discriminative enough to sieve out true object instances in cluttered images. These seemingly paradoxical requirements of an object class model make recognition difficult.

This paper addresses two goals of recognition: image classification and object detection. The task of image classification is to determine if an object class is present in an image, while object detection localizes all instances of that class from an image. Toward these goals, the main contribution in this paper is an approach for object class recognition that employs edge information only. The novelty of our approach is that we represent contours by very simple and generic shape primitives of line segments and ellipses, coupled with a flexible method to learn discriminative primitive combinations. These primitives are complementary in nature: the line segment models straight contours and the ellipse models curved contours. We choose an ellipse as it is one of the simplest circular shapes, yet is sufficiently flexible to model curved shapes [1]. These shape primitives possess several attractive properties. First, unlike edge-based descriptors [2], [3], they support abstract and perceptually meaningful reasoning like parallelism and adjacency. Also, unlike contour fragment features [4], [5], [6], storage demands by these primitives are independent of object size, and they are efficiently represented with four parameters for a line and five parameters for an ellipse. Additionally, matching between primitives can be efficiently computed (e.g., with geometric properties), unlike contour fragments, which require comparisons between individual edge pixels. Finally, as geometric properties are easily scale normalized, they simplify matching across scales. In contrast, contour fragments are not scale invariant, and one is forced either to rescale fragments, which introduces aliasing effects (e.g., when edge pixels are pulled apart), or to resize an image before extracting fragments, which degrades image resolution.

In recent studies [1], [7], [8], it is shown that the generic nature of line segments and ellipses affords them an innate ability to represent complex shapes and structures. While individually less distinctive, by combining a number of these primitives, we empower a combination to be sufficiently discriminative. Here, each combination is a two-layer abstraction of primitives: pairs of primitives (termed shape-tokens) at the first layer, and a learned number of shape-tokens at the second layer. We do not constrain a combination to have a fixed number of shape-tokens, but allow it to automatically and flexibly adapt to an object class. This number influences a combination's ability to represent shapes, where simple shapes favor fewer shape-tokens than complex ones. Consequently, discriminative combinations of varying complexity can be exploited to represent an object class. We learn this combination by exploiting distinguishing shape, geometric, and structural constraints of an object class. Shape constraints describe the visual aspect of shape-tokens, while geometric constraints describe their spatial layout (configurations). Structural constraints enforce possible poses/structures of an object by the relationships (e.g., XOR relationships) between shape-tokens. Correspondingly, these combinations attempt to strike a winning tradeoff: be flexible and hence bring tolerance toward intraclass variation, while also being discriminative enough to be robust to background clutter and interclass similarity.

A.Y.-S. Chia and S. Rahardja are with the Institute for Infocomm Research, 1 Fusionopolis Way, #21-01, Connexis (South Tower), Singapore 138632. E-mail: {ysachia, rsusanto}@i2r.a-star.edu.sg.

D. Rajan is with the School of Computer Engineering, Nanyang Technological University, N4-2c-78, Nanyang Avenue, Singapore 639798. E-mail: [email protected].

M.K. Leung is with the Faculty of Information and Communication Technology, Department of Computer Science, University Tunku Abdul Rahman, Block A, Jln Universiti, Bandar Barat, Kampar 31900, Perak, Malaysia. E-mail: [email protected].

Manuscript received 1 Nov. 2010; revised 19 July 2011; accepted 30 Sept. 2011; published online 2 Nov. 2011. Recommended for acceptance by T. Darrell, D. Hogg, and D. Jacobs. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMISI-2010-11-0833. Digital Object Identifier no. 10.1109/TPAMI.2011.220.

An important feature of our contour-based recognition approach is that it affords us substantial flexibility to incorporate additional image information. Specifically, we extend the contour-based recognition method and propose a new hybrid recognition method which exploits shape-tokens and SIFT features as recognition cues. Shape-tokens and SIFT features are largely orthogonal, where the former correspond to shape boundaries and the latter to sparse salient image patches. Here, each learned combination can comprise features that are either 1) purely shape-tokens, 2) purely SIFT features, or 3) a mixture of shape-tokens and SIFT features. The number and types of features to be combined together are learned automatically from training images, and represent the more discriminative ones based on the training set. Consequently, by imparting these two degrees of variability (in both the number and the types of features) to a combination, we empower it with even greater flexibility and discriminative potential.

A shorter version of this paper appeared in [9].

1.1 System Overview

We give an overview of the contour-based recognition method in Fig. 1a. Shape primitives of line segments and ellipses are first extracted from an edge image. We pair connected primitives to form shape-tokens, and leverage the shape and positional cues of these shape-tokens to learn a class-specific codebook of shape-tokens. Combinations of shape-token codewords which are discriminative of the object category are next learned, where each combination contains a variable number x of codewords and is defined as an x-codeword-combination. These combinations are then exploited to detect objects in test images.

Fig. 1b shows an overview of the hybrid recognition method, which is a straightforward extension of the contour-based recognition method. Here, SIFT features are extracted alongside shape-tokens, and we learn codebooks of shape-tokens and SIFT features separately from training images. Combinations of codewords which are discriminative for the object class are learned, where a combination can comprise a mixture of codewords from different codebooks and is defined as an x-mixture-codeword-combination. These x-mixture-codeword-combinations are then used by the hybrid recognition method to detect objects in test images.

1.2 Related Work

Shape primitives have been used previously for object recognition. Line segments were used in [7], [8] to detect objects in cluttered scenes. While good performance was achieved, their method did not model curved object boundaries, which inhibits its ability to learn complex class models. Jurie and Schmid [3] extracted circular arcs in edge images, and described the spatial distribution of edge pixels in a thin neighborhood of the circle. As one weakness, a circle primitive is not affine invariant: it readily deforms into an ellipse under slight viewpoint variations. Consequently, their work is unlikely to be robust to viewpoint changes. Recently, Ferrari et al. [10] used connected straight line segments as features. This is similar to our work, where line segments (and ellipses) are used. However, a number of important differences exist. They used only lines to model local shape structures, while we harness both lines and ellipses to yield a richer representation. Also, their work required the powerful but slow Berkeley boundary detector [11] to extract meaningful object boundaries and to filter noisy background contours before training or testing. In contrast, our method extracts contours with the traditional Canny detector and is robust to noisy edge pixels. Another difference is that they represented shapes by features extracted from class-neutral images. We depart from this framework, and instead explicitly tailor primitive combinations to a specific object class.

Fig. 1. Overview of the contour-based recognition method is shown in (a), and that for the hybrid recognition method in (b).

Our choice for learning class-specific features is motivated by the substantial success of the contour-based methods of Shotton et al. [6] and Opelt et al. [5]. A difference between our work and theirs is that they represented local shapes by contour fragments while we employ generic shape primitives. Consequently, they suffered the shortcomings highlighted in Section 1. More importantly, to ensure tractable learning of useful features, they followed a similar framework as [10], [12], in which each feature was comprised of a fixed number of codewords (a single codeword in [6], [10], [12] and two codewords in [5]). Our approach imposes no such limitation, and instead learns class-specific features that have a variable number of codewords. Rather than adopting local shapes of contour fragments as features, Leordeanu et al. [13] explicitly used the pairwise relations between contour fragments as features. They showed that these relations alone can yield good recognition results even without incorporating local shapes of contour fragments.

The above methods exploited only a single source of image information (either shape or appearance information) for recognition. A number of recent approaches have exploited both shape and appearance features as recognition cues. Kumar et al. [4] employed contour fragments and the texture enclosed within fragments to detect articulated object classes. Good detection results were demonstrated, though they require tracked video sequences or hand-labeled parts for training; we only require bounding boxes. Fergus et al. [14] used both curved contour fragments and PCA coefficients that describe salient interest regions for object recognition. As one weakness, edge information is not fully exploited in that work as they used only clean planar curved fragments between bitangent edge points.

More closely related to our hybrid method, Opelt et al. [5] extended their contour-based method, and proposed another method in the same paper which used SIFT features and contour fragments. In another related work, Shotton et al. [15] used dense texton features and contour fragments as recognition cues. Both methods employ boosting to learn discriminative features. As limitations, each boosted feature of [5], [15] is restricted to being either shape-based or appearance-based; their boosted feature cannot simultaneously contain both contour fragments and SIFT/texton features. Additionally, to avoid an exponential explosion in computational cost, they constrain each boosted feature to be comprised of a fixed number of contour fragments (two fragments in [5] and a single fragment in [15]), or SIFT/texton features (two SIFT features in [5] and a single texton in [15]). Our hybrid method imposes neither restriction; each discriminative feature combination can be comprised of varied numbers and types of features. Such flexibility empowers our method to learn discriminative combinations that naturally evolve to best represent an object class.

2 SHAPE-TOKENS

In this section, we present our contour features of shape-tokens, discussing how they are described and matched at a single scale and across multiple scales. First, however, we detail our method to construct shape-tokens.

2.1 Constructing Shape-Tokens

We extract shape primitives of line segments and ellipses from an edge image using the methods in [16] and [17], respectively. A shape-token is constructed by pairing a reference primitive to its neighboring primitive. In this work, given two primitives of different types, we always consider an ellipse to be the reference primitive. For two primitives of the same type, we consider each primitive in turn as the reference primitive. This gives the following three types of shape-tokens: line-line, ellipse-line, and ellipse-ellipse.

We first present our method to identify neighboring lines of a reference line to construct line-line shape-tokens. Here, our aim is to find neighboring lines that are likely to lie on the same boundary as the reference line. In the absence of higher-level information of the boundary, connectivity is a natural and intuitive mechanism for selecting neighbors, where we identify lines that are adjacently connected at either endpoint of a reference line as its neighbors. To counter the problem of broken edge contours, we adopt the approach in [18], and define a small isosceles trapezium (search area) at both endpoints of the reference line (see Fig. 2a). A line segment which has any of its points within either of the two trapeziums is considered connected to the reference line. This bridges small breaks between line segments and provides robustness to broken edge contours. Note that our scheme to find neighboring lines defines a trapezium at both ends of a reference line, and is a generalization of [10], where trapeziums are defined only at edge discontinuities. This generalization provides tolerance to some poorly fitted lines (as neighbors are found from a small search area and are not restricted only to adjacently connected lines) and affords a larger pool of line-line shape-tokens modeling the same boundary to be constructed. We center the minor base of each trapezium on an endpoint of the reference line, and orient its height in the direction of the reference line. For all experiments, we used the same fixed-size trapezium of minor base 5 pixels, major base 9 pixels, and height 11 pixels.
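
To make the neighbor test concrete, the following Python sketch builds the trapezoidal search area at the endpoints of a reference line and checks whether a candidate line has a point inside it. This is a minimal illustration under the stated dimensions (minor base 5, major base 9, height 11 pixels); the function names and the point-sampling strategy are our own and not from the paper.

```python
import numpy as np

# Trapezium dimensions used in the paper (pixels).
MINOR_BASE, MAJOR_BASE, HEIGHT = 5.0, 9.0, 11.0

def trapezium_at_endpoint(endpoint, direction):
    """Corners of the isosceles trapezium whose minor base is centred on
    `endpoint` and whose height points along `direction`."""
    d = np.asarray(direction, float)
    d = d / np.linalg.norm(d)
    n = np.array([-d[1], d[0]])               # normal to the line direction
    p = np.asarray(endpoint, float)
    far = p + HEIGHT * d                      # centre of the major base
    return np.array([p - 0.5 * MINOR_BASE * n, p + 0.5 * MINOR_BASE * n,
                     far + 0.5 * MAJOR_BASE * n, far - 0.5 * MAJOR_BASE * n])

def inside_convex(poly, q):
    """True if point q lies inside the convex polygon `poly` (either winding)."""
    signs = []
    for i in range(len(poly)):
        ax, ay = poly[i]
        bx, by = poly[(i + 1) % len(poly)]
        signs.append((bx - ax) * (q[1] - ay) - (by - ay) * (q[0] - ax))
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)

def is_connected(ref_line, other_line, samples=20):
    """A line is a neighbour of the reference line if any of its points falls
    inside either endpoint trapezium of the reference line."""
    p0, p1 = np.asarray(ref_line[0], float), np.asarray(ref_line[1], float)
    traps = [trapezium_at_endpoint(p1, p1 - p0), trapezium_at_endpoint(p0, p0 - p1)]
    q0, q1 = np.asarray(other_line[0], float), np.asarray(other_line[1], float)
    for t in np.linspace(0.0, 1.0, samples):  # sample points along the other line
        q = (1 - t) * q0 + t * q1
        if any(inside_convex(tr, q) for tr in traps):
            return True
    return False

if __name__ == "__main__":
    ref = ((0, 0), (50, 0))
    print(is_connected(ref, ((55, 2), (80, 10))))   # near the right endpoint -> True
    print(is_connected(ref, ((0, 40), (50, 40))))   # far away -> False
```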

Next, we describe the search for the neighbors of a reference ellipse. Unlike a line segment, an ellipse does not have any endpoint, and is possibly detected from disconnected elliptical fragments. While one can define a trapezium search area at the endpoints of these elliptical fragments, such trapeziums will be oriented along the reference ellipse, and hence are likely to collect neighboring primitives that are already described by the ellipse. Instead, given a reference ellipse, we first center a circular search area on this ellipse. A circular search area is independent of the orientation of the reference ellipse, and avoids missing neighbors when the pose of an object changes. We fix the radius of the circular search area to be three times the half-length of the major axis of the reference ellipse. We consider a primitive which has any points within this search area and is weakly connected to the reference ellipse as a neighbor. Our concept of weak connectivity is described as follows: First, we bridge breaks in the line edge map (LEM) [17] by defining trapeziums at edge discontinuities. A line segment is considered weakly connected to a reference ellipse if a path can be traced from it to the reference ellipse by following along line segments and across bridges between line segments of the LEM. Similarly, we identify an ellipse to be weakly connected to a reference ellipse if a path can be traced between any of their underlying line segments. For illustration, Fig. 2b shows a reference ellipse in bold black, and its neighboring line segments in bold brown. As observed, black line segments (that form the large octagon) lying outside the search area are not considered as neighboring primitives, even though they are weakly connected to the reference ellipse. Similarly, we show the same reference ellipse in Fig. 2c and its neighboring ellipse as the bold brown outline.

2.2 Describing Shape-Tokens

A low-dimension descriptor comprising simple geometrical attributes is used to describe a shape-token. Let $\theta$ denote the orientation of a primitive, $\theta \in [0, \pi]$. For an ellipse whose eccentricity is more than $\bar{\epsilon}$ (fixed at 0.8), $\theta$ is assigned to be the orientation of its major axis; the orientation of a circle or an ellipse whose eccentricity is less than $\bar{\epsilon}$ is fixed as $\pi$. We define $[v_x\ v_y]^T$ as the unit vector from the center of a reference primitive to the center of its neighbor, and $h$ as the distance between their centers. The midpoint between their centers is defined to be the shape-token centroid. We denote the length and width of a primitive as $l$ and $w$, respectively. We define the length and width of an ellipse by its major and minor axis, respectively, and fix the width of a line segment to be one pixel. Given these notations, a shape-token is described as $A = [l^r\ w^r\ \theta^r\ l^n\ w^n\ \theta^n\ h\ v_x\ v_y]^T$, where the superscripts $r$ and $n$ differentiate attributes of a reference primitive from its neighbor. These geometric attributes are similar to those used by Leordeanu et al. [13] for modeling the pairwise relations between parts. Fig. 3 visualizes these geometrical attributes used for describing shape-tokens.

The shape descriptor can easily be made rotation invariant by defining the orientation of the neighboring primitive with respect to the reference primitive and removing $\theta^r$ from the descriptor. Similarly, it can also be made scale invariant by normalizing the length and width of the primitives by the spatial separation between the primitives (i.e., $h$), and removing $h$ from the descriptor. We note that such modifications would also make the descriptor more compact at the cost of reducing its distinctiveness, although we do not explore them further in this paper.
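
As an illustration of the descriptor just defined, the sketch below assembles $A = [l^r, w^r, \theta^r, l^n, w^n, \theta^n, h, v_x, v_y]$ from two primitives. The dict-based primitive representation is an assumption made for this example, not the authors' data structure.

```python
import math

ECC_THRESHOLD = 0.8   # eccentricity below which an ellipse's orientation is fixed to pi

def primitive_attrs(prim):
    """Return (length, width, orientation) for a line or ellipse primitive."""
    if prim['type'] == 'line':
        (x0, y0), (x1, y1) = prim['p0'], prim['p1']
        length = math.hypot(x1 - x0, y1 - y0)
        theta = math.atan2(y1 - y0, x1 - x0) % math.pi
        return length, 1.0, theta                      # line width fixed to one pixel
    theta = prim['angle'] % math.pi if prim['ecc'] > ECC_THRESHOLD else math.pi
    return prim['major'], prim['minor'], theta          # ellipse: major/minor axes

def center(prim):
    if prim['type'] == 'line':
        (x0, y0), (x1, y1) = prim['p0'], prim['p1']
        return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
    return prim['center']

def shape_token_descriptor(ref, nbr):
    """A = [l_r, w_r, theta_r, l_n, w_n, theta_n, h, v_x, v_y] and the token centroid."""
    lr, wr, tr = primitive_attrs(ref)
    ln_, wn, tn = primitive_attrs(nbr)
    (cx, cy), (nx, ny) = center(ref), center(nbr)
    h = math.hypot(nx - cx, ny - cy)                    # separation between centres
    vx, vy = (nx - cx) / h, (ny - cy) / h               # unit vector reference -> neighbour
    centroid = ((cx + nx) / 2.0, (cy + ny) / 2.0)
    return [lr, wr, tr, ln_, wn, tn, h, vx, vy], centroid

if __name__ == "__main__":
    line = {'type': 'line', 'p0': (0, 0), 'p1': (40, 0)}
    ell = {'type': 'ellipse', 'center': (60, 20), 'major': 30.0, 'minor': 10.0,
           'angle': 0.4, 'ecc': 0.94}
    A, c = shape_token_descriptor(ell, line)            # ellipse is the reference primitive
    print([round(v, 3) for v in A], c)
```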

2.3 Matching Shape-Tokens

We now present our approach to compare different shape-tokens, in which a shape-token is compared only with similar typed shape-tokens. We first describe comparisons at a single scale. For this purpose, we define a symmetrical distance measure between two shape-tokens with descriptors $A_i$ and $A_j$ as

$$D(A_i, A_j) = \sum_{p \in \{r,n\}} w_l D_l(l_i^p, l_j^p) + \sum_{p \in \{r,n\}} w_l D_l(w_i^p, w_j^p) + \sum_{p \in \{r,n\}} w_\theta D_\theta(\theta_i^p, \theta_j^p) + w_l D_l(h_i, h_j) + w_v D_v(v_i^x, v_i^y, v_j^x, v_j^y), \quad (1)$$

where

$$D_l(l_i, l_j) = \min(|\ln(l_i / l_j)|, 1),$$
$$D_\theta(\theta_i, \theta_j) = \min(|\theta_i - \theta_j|,\ \pi - |\theta_i - \theta_j|) \big/ \tfrac{\pi}{2},$$
$$D_v(v_i^x, v_i^y, v_j^x, v_j^y) = \sqrt{(v_i^x - v_j^x)^2 + (v_i^y - v_j^y)^2},$$

and $D_l \in [0, 1]$ measures the difference in lengths, $D_\theta \in [0, 1]$ the difference in orientations, and $D_v \in [0, 2]$ the difference in relative primitive positions. $w_l$, $w_\theta$, and $w_v$ are weight parameters. Given that no term in (1) has substantial dominance over the other terms, we assign a constant and equal value of 1 to these weight parameters. In this aspect, (1) combines the differences in geometrical attributes of the shape-tokens into a single useful dissimilarity measure (matching cost), in which the distance ranges for comparing line-line, ellipse-line, and ellipse-ellipse shape-tokens are $[0, 7]$, $[0, 8]$, and $[0, 9]$, respectively. This formulation in (1) is similar to that used by Ferrari et al. [10] for comparing between their features of connected line segments, but has been generalized to evaluate geometrical differences between ellipse primitives and to account for the difference in spatial separation between primitives.

It is easy to extend the above matching to multiple scales by simply rescaling the descriptors. Specifically, the descriptor can be normalized against an object scale $b_s$ as $\bar{A} = f(A, 1/b_s)$, where

$$f(A, 1/b_s) = \left[ \frac{l^r}{b_s}\ \ \frac{w^r}{b_s}\ \ \theta^r\ \ \frac{l^n}{b_s}\ \ \frac{w^n}{b_s}\ \ \theta^n\ \ \frac{h}{b_s}\ \ v_x\ \ v_y \right]^T.$$

Matching at scale $s$ between a scale-normalized $\bar{A}_i$ and an unscaled $A_j$ is then computed by rescaling $\bar{A}_i$ with $f(\bar{A}_i, s)$, and then evaluating $D(f(\bar{A}_i, s), A_j)$ with (1).
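
The following sketch implements the dissimilarity of Eq. (1) with all weights set to 1, together with the rescaling function used for multiscale matching. It is a minimal reading of the formulas above; the descriptor layout follows the 9-element vector defined in Section 2.2, and the variable names are ours.

```python
import math

def d_len(a, b):
    return min(abs(math.log(a / b)), 1.0) if a > 0 and b > 0 else 1.0

def d_theta(a, b):
    return min(abs(a - b), math.pi - abs(a - b)) / (math.pi / 2.0)

def d_vec(vx1, vy1, vx2, vy2):
    return math.hypot(vx1 - vx2, vy1 - vy2)

def token_distance(Ai, Aj):
    """Eq. (1) with w_l = w_theta = w_v = 1.
    A = [l_r, w_r, theta_r, l_n, w_n, theta_n, h, v_x, v_y]."""
    d = 0.0
    for off in (0, 3):                                  # reference then neighbour primitive
        d += d_len(Ai[off], Aj[off])                    # lengths
        d += d_len(Ai[off + 1], Aj[off + 1])            # widths
        d += d_theta(Ai[off + 2], Aj[off + 2])          # orientations
    d += d_len(Ai[6], Aj[6])                            # spatial separation h
    d += d_vec(Ai[7], Ai[8], Aj[7], Aj[8])              # relative position unit vector
    return d

def rescale(A, s):
    """f(A, s): scale lengths, widths and h by s; angles and unit vector unchanged."""
    return [A[0] * s, A[1] * s, A[2], A[3] * s, A[4] * s, A[5], A[6] * s, A[7], A[8]]

if __name__ == "__main__":
    A1 = [40, 1, 0.0, 30, 10, 1.2, 25, 0.80, 0.60]
    A2 = [44, 1, 0.1, 28, 12, 1.1, 27, 0.75, 0.66]
    A1_norm = rescale(A1, 1.0 / 120.0)                  # normalise by object scale b_s = 120
    print(token_distance(rescale(A1_norm, 130.0), A2))  # match against A2 at scale s = 130
```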

3 CODEBOOK OF SHAPE-TOKENS

This section presents our method to learn a codebook of representative shape-tokens of the target object class. The outline of our method is as follows: We first extract an initial set of shape-tokens from within the bounding boxes of all training objects. Candidate codewords are next found by a two-step clustering of the descriptors and relative positions of these shape-tokens. For efficiency, we order the two-step clustering procedure so that a faster clustering step is first applied to divide the initial set of shape-tokens into smaller clusters, before a second and more robust clustering step is applied to find subclusters whose members have similar shape and position. Finally, a radial ranking method is employed to select, as codewords, the best set of candidates which covers a large spatial extent of the object model.

Fig. 2. Toy examples for finding the neighboring primitives of a reference primitive for constructing (a) line-line, (b) ellipse-line, and (c) ellipse-ellipse shape-tokens. Reference primitives are shown in bold black, neighboring primitives in bold brown, and search areas as the shaded green regions.

Fig. 3. Geometric attributes used to describe (a) line-line, (b) ellipse-line, and (c) ellipse-ellipse shape-tokens. Reference primitives are shown in bold black and neighboring primitives in bold brown.

3.1 Clustering by Shape

We start with an initial set of shape-tokens that are extracted from within the bounding boxes bb of all training objects. We normalize the descriptors of each shape-token by the object scale $b_s$ (defined to be the diagonal length of bb). In this initial clustering step, we adapt the efficient bisecting k-medoid method to find clusters whose members have similar shapes. Specifically, we apply 2-medoid clustering to similar typed shape-tokens (i.e., line-line, ellipse-line, and ellipse-ellipse shape-tokens are separately clustered), where distances between shape-tokens are computed using their scale-normalized descriptors with (1). For each cluster, we evaluate its intracluster shape dissimilarity value by the average shape distance between the medoid and its members. We repartition a cluster by 2-medoid clustering if its dissimilarity value is more than th (fixed at 20 percent of the maximum of the range of $D(\cdot)$ in (1)), but retain those whose dissimilarity values are less than th. This first-step clustering procedure terminates when the intracluster shape dissimilarity value for every cluster is less than th. This method improves on the k-medoid clustering used by Shotton et al. [6] in that it avoids specifying the number of clusters as input, which is unknown a priori and varies between data sets.
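
A minimal sketch of this bisecting k-medoid step is given below, using a generic distance function in place of the shape-token distance of Eq. (1). The recursion-termination details (e.g., how degenerate splits are handled) are our assumptions, not specified in the paper.

```python
import random

def two_medoid(items, dist, iters=10):
    """Split `items` into two clusters by a simple 2-medoid iteration."""
    medoids = random.sample(items, 2)
    for _ in range(iters):
        clusters = ([], [])
        for it in items:
            clusters[0 if dist(it, medoids[0]) <= dist(it, medoids[1]) else 1].append(it)
        if not clusters[0] or not clusters[1]:
            return clusters, medoids
        # medoid = member minimising the summed distance to the rest of its cluster
        new_medoids = [min(c, key=lambda m: sum(dist(m, o) for o in c)) for c in clusters]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters, medoids

def bisecting_kmedoid(items, dist, th):
    """Recursively bisect clusters until every cluster's average distance to its
    medoid (intracluster dissimilarity) falls below the threshold `th`."""
    done, queue = [], [items]
    while queue:
        c = queue.pop()
        medoid = min(c, key=lambda m: sum(dist(m, o) for o in c))
        dissim = sum(dist(medoid, o) for o in c) / len(c)
        if dissim <= th or len(c) < 3:
            done.append((medoid, c))
            continue
        (c0, c1), _ = two_medoid(c, dist)
        if not c0 or not c1:               # degenerate split: accept the cluster as-is
            done.append((medoid, c))
        else:
            queue.extend([c0, c1])
    return done

if __name__ == "__main__":
    random.seed(0)
    pts = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(8, 1) for _ in range(50)]
    clusters = bisecting_kmedoid(pts, dist=lambda a, b: abs(a - b), th=1.0)
    print(len(clusters), [round(m, 2) for m, _ in clusters])
```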

3.2 Clustering by Relative Positions

A second clustering step is applied separately to each of the above clusters on the positions of its members to find subclusters whose members agree in shape and position. Here, for each shape-token, we define vector x as the positional vector that is directed from the object centroid (i.e., the center of the bounding box from which the shape-token is extracted) to the shape-token centroid, and is normalized by the object scale $b_s$. The very robust mean-shift clustering method [19] is then applied to each of the above clusters on the vectors x of its members. Given that vector x is two-dimensional and we apply mean-shift separately to each of the above clusters (as opposed to applying mean-shift directly on all shape-tokens), subclusters can thus be efficiently found. Shape-tokens within each subcluster now have similar shape and are located at similar positions relative to the object centroid. We identify the medoid (with the distance measure in (1)) in each subcluster as a candidate codeword $\varphi$, and associate it with a shape distance threshold $\tau$. This threshold indicates the range of shapes the candidate represents, and is computed as the mean shape distance between the candidate and its subcluster members plus one standard deviation. Each candidate is also parameterized with a scale-normalized circular window specifying where the candidate is expected to be found relative to an object centroid. We compute the relative center of this window, c, as the mean of the vectors x of the subcluster members, and its radius, r, as the mean Euclidean distance between c and x of each subcluster member plus one standard deviation.
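
The sketch below runs this positional subclustering with scikit-learn's MeanShift (one possible implementation of the mean-shift step; the bandwidth value and dict field names are assumptions) and derives the candidate parameters ($\tau$, c, r) described above.

```python
import numpy as np
from sklearn.cluster import MeanShift

def subcluster_by_position(tokens, positions, dist, bandwidth=0.05):
    """Split one shape cluster by the scale-normalised offsets `positions`
    (object centroid -> token centroid), then parameterise each subcluster."""
    X = np.asarray(positions, float)
    labels = MeanShift(bandwidth=bandwidth).fit(X).labels_
    candidates = []
    for lab in np.unique(labels):
        idx = np.where(labels == lab)[0]
        members = [tokens[i] for i in idx]
        # medoid under the shape distance of Eq. (1)
        medoid = min(members, key=lambda m: sum(dist(m, o) for o in members))
        shape_d = np.array([dist(medoid, o) for o in members])
        tau = shape_d.mean() + shape_d.std()            # shape distance threshold
        c = X[idx].mean(axis=0)                         # expected relative position
        radii = np.linalg.norm(X[idx] - c, axis=1)
        r = radii.mean() + radii.std()                  # positional tolerance
        candidates.append({'medoid': medoid, 'tau': tau, 'c': c, 'r': r, 'size': len(idx)})
    return candidates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toks = list(rng.normal(size=60))                    # toy 1-D "shapes"
    pos = np.vstack([rng.normal([0.2, 0.1], 0.01, (30, 2)),
                     rng.normal([-0.3, 0.2], 0.01, (30, 2))])
    cands = subcluster_by_position(toks, pos, dist=lambda a, b: abs(a - b))
    print([(round(float(c['tau']), 3), c['size']) for c in cands])
```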

3.3 Selecting Candidates into Codebook

We show in Fig. 4a all candidate codewords that are obtained from the initial set of shape-tokens for the Weizmann horse data set [20], in which a candidate is shown darker if it is from a more populated mean-shift subcluster. Note that there are a considerable number of candidates from the background since candidates from every mean-shift subcluster (even those that are very weakly populated) are visualized. While one can consider all candidates as codewords, this results in an unnecessarily large codebook where many entries belong to the background. To reduce redundancy in the codebook, our aim here is to select, as codewords, a subset of high-quality candidates that are representative of the object class. A simple heuristic to select candidates based on cluster size can inadvertently pick nonsalient candidates. For example, candidates from the 350 most populated mean-shift subclusters for the horse class are shown in Fig. 4b, where many candidates are identified from the background, with few or no candidates from the neck and the crest of a horse model. Here, rather than identifying codewords by cluster size, we first compute a score for each candidate codeword based on its shape and geometric qualities, and then employ a radial ranking method to select the best-scoring candidates as codewords.

We score each candidate as a product of 1) its intracluster shape similarity value, 2) the number of unique training bounding boxes its members are extracted from, and 3) its value of $1/r$. We compute the intracluster shape similarity value as $d - \tau$, where $d$ is the maximum of the range of shape distance for the type of candidate currently considered, and $\tau$ is the shape distance threshold (as explained before). Note that the first two terms in the product seek candidates that have distinctive shape and are flexible enough to accommodate intraclass variations, while the last term seeks candidates that estimate a stable and precise location for their members, and measures the geometric quality of the candidates (unlike the first two terms, which measure their shape quality).
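
A direct transcription of this score is shown below; the dict keys are the same hypothetical candidate fields used in the earlier sketches.

```python
def score_candidate(cand, max_shape_dist, n_unique_boxes):
    """Score = (d - tau) * (#unique training boxes) * (1 / r).
    `max_shape_dist` is d, the top of the shape-distance range for the token type
    (7, 8 or 9 for line-line, ellipse-line and ellipse-ellipse tokens)."""
    shape_quality = max_shape_dist - cand['tau']        # distinctive yet tolerant shape
    geom_quality = 1.0 / cand['r']                      # stable, precise location estimate
    return shape_quality * n_unique_boxes * geom_quality

if __name__ == "__main__":
    cand = {'tau': 1.4, 'r': 0.08}
    print(score_candidate(cand, max_shape_dist=9, n_unique_boxes=23))  # ellipse-ellipse case
```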

Fig. 4. Candidate codewords from (a) all mean-shift subclusters, (b) the 350 most populated subclusters, and (c) the 350 subclusters selected by our method for the Weizmann horse data set. Line segments are shown in blue, ellipses in magenta, and the object centroid by a yellow +.

To ensure selected candidates represent a large spatial extent of an object model, as opposed to a localized portion, we select high-scoring candidates by a radial ranking scheme. We emit a pair of rays from the object centroid to delineate a sector. Candidates within each sector are identified, and we collect the top-scoring t candidates of each sector. Fig. 5 illustrates this radial method for collecting candidates, in which we show two example sectors for the horse class. Centroid positions of the t candidates in each sector are shown as green × around an object centroid (yellow +), and we also visualize the t candidates of each sector. Observe that the spatial layout of each sector has selected salient shape structures corresponding to the head and crest of a horse. For all experiments, we use 30 nonoverlapping sectors and fix t to be 20. Finally, instead of using all collected candidates as codewords, we retain the 350 highest-scoring candidates as codewords. This provides robustness against collecting poor candidates when a sector has substantial overlap with the background.
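
The radial selection can be sketched as follows, assuming each candidate carries its relative window centre c (offset from the object centroid) and a precomputed score; the sector geometry and function name are our own.

```python
import math

def radial_select(candidates, scores, n_sectors=30, top_per_sector=20, keep=350):
    """Collect the `top_per_sector` best-scoring candidates in each of
    `n_sectors` sectors around the object centroid, then keep the `keep`
    highest-scoring collected candidates as codewords (returned as indices)."""
    collected = []
    for s in range(n_sectors):
        lo = 2 * math.pi * s / n_sectors
        hi = 2 * math.pi * (s + 1) / n_sectors
        in_sector = [i for i, cand in enumerate(candidates)
                     if lo <= math.atan2(cand['c'][1], cand['c'][0]) % (2 * math.pi) < hi]
        in_sector.sort(key=lambda i: scores[i], reverse=True)
        collected.extend(in_sector[:top_per_sector])
    collected = sorted(set(collected), key=lambda i: scores[i], reverse=True)
    return collected[:keep]

if __name__ == "__main__":
    import random
    random.seed(1)
    cands = [{'c': (random.uniform(-1, 1), random.uniform(-1, 1))} for _ in range(2000)]
    scores = [random.random() for _ in range(2000)]
    print(len(radial_select(cands, scores)))   # at most 350 codeword indices
```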

Fig. 4c shows the final 350 candidates selected into the horse codebook. Note that both this visualization and the one based on cluster size in Fig. 4b are obtained with 350 codewords. Compared to Fig. 4b, Fig. 4c shows substantially fewer codewords corresponding to the background. In addition, a large spatial extent of a horse model comprising salient shape structures (e.g., crest and neck) is accounted for, with several codewords typically representing different poses of the same object part. This provides tolerance toward intraclass variations, small pose changes, and partial occlusion.

4 CODEWORD COMBINATIONS

The previous section presents our method to learn class-specific codewords. In the simplest form, one can use a single codeword that is matched in the test image to predict object locations. Such a scheme was adopted by Gall and Lempitsky [12], who used image patch codewords. However, unlike image patches, contour codewords are less discriminative and often are matched in the background of complex images. Instead, a combination of several codewords can be more discriminative for an object class, and thus find fewer false matches [5]. In this section, we present our method to learn discriminative codeword combinations. We start by describing how a combination of different codewords is matched in an image. Next, we present an efficient method to find the exhaustive set of codeword combinations that are matched in training images. We then discuss our method, which capitalizes on distinguishing shape, geometric, and structural constraints of an object class to learn discriminative codeword combinations. Finally, we describe how these combinations are exploited to recognize object classes in test images. Each discriminative combination comprises a variable number of shape-token codewords, which we term an x-codeword-combination or xCC.

4.1 Matching a Codeword Combination

A codeword combination is matched at scale $s$ in an image $I$ if 1) the shape distance at scale $s$ between each codeword in that combination and a shape-token in $I$ is within the shape distance threshold $\tau$ of the codeword (shape constraint), and 2) the centroid predictions by all codewords in the combination concur (geometric constraint). A codeword which satisfies the shape constraint with a shape-token of position $x$ in $I$ will predict an object centroid by a circular window with center $\bar{x} = x - sc$ and radius $sr$, where $c$ and $r$ are defined in Section 3.2. These centroid predictions concur if there is a common region among their windows.

Fig. 6 exemplifies these constraints for matching a combination of four shape-token codewords, where codewords, matched shape-tokens, and centroid predictions are shown color coded. Fig. 6a depicts a 4-codeword combination. Using the same scale $s$, shape-tokens from positive and negative images which satisfy the shape constraints with the codewords are shown in Fig. 6b, where colored × denote shape-token centroids. Centroid windows predicted by these codewords are shown in Fig. 6c, and colored arrows denote the scaled vectors $-sc$. These windows share a common region in the positive image but not the negative image, and hence this codeword combination is matched only in the positive image.
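
A small sketch of these two constraints follows. The codeword/token fields and the distance function are hypothetical, and the common-region test here is a crude approximation (it only checks whether some window centre lies inside every window); the hypothesis-based formulation in Section 4.2 sidesteps this test entirely.

```python
import math

def predicted_window(codeword, token_centroid, s):
    """A codeword matched to a shape-token at `token_centroid` predicts the
    object centroid inside a circle of centre x - s*c and radius s*r."""
    cx = token_centroid[0] - s * codeword['c'][0]
    cy = token_centroid[1] - s * codeword['c'][1]
    return (cx, cy), s * codeword['r']

def windows_concur(windows):
    """Approximate common-region test: some window centre lies in every window."""
    for (px, py), _ in windows:
        if all(math.hypot(px - cx, py - cy) <= r for (cx, cy), r in windows):
            return True
    return False

def combination_matched(codewords, tokens, token_dist, s):
    """Shape constraint: each codeword must match some token within its threshold
    tau at scale s.  Geometric constraint: the predicted windows must concur."""
    windows = []
    for cw in codewords:
        matches = [t for t in tokens if token_dist(cw, t, s) <= cw['tau']]
        if not matches:
            return False
        best = min(matches, key=lambda t: token_dist(cw, t, s))
        windows.append(predicted_window(cw, best['centroid'], s))
    return windows_concur(windows)

if __name__ == "__main__":
    dist = lambda cw, t, s: abs(cw['shape'] - t['shape'])      # toy shape distance
    cws = [{'shape': 1.0, 'tau': 0.3, 'c': (0.2, 0.0), 'r': 0.1},
           {'shape': 2.0, 'tau': 0.3, 'c': (-0.2, 0.0), 'r': 0.1}]
    toks = [{'shape': 1.10, 'centroid': (120, 100)},
            {'shape': 2.05, 'centroid': (80, 100)}]
    print(combination_matched(cws, toks, dist, s=100))          # both predict ~(100, 100)
```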

4.2 Finding All Matched Codeword Combinations

In the previous section, we have described how a codeword combination is matched in an image. In this section, we discuss an efficient method (linear in codebook size) to find the exhaustive set of codeword combinations that are matched across all training (both positive and negative) images. This exhaustive set of matched codeword combinations will then be used later in Section 4.3 to learn an ensemble of discriminative xCC for the object class.

The following theorem states the basic idea of our method to find all matched codeword combinations.

Theorem 1. For a scale $s$ and location $x$ in image $I$, all codewords of a target object class which satisfy the shape matching constraint with a shape-token located within its estimated window of center $x + sc$ and radius $sr$ in $I$ will also satisfy the geometric matching constraint.

Proof. Consider Fig. 7a. Let $x = [0\ 0]^T$ be the Cartesian coordinate of the origin in image $I$, where we depict this origin as a yellow + in the figure. For scale $s$ and location $x$ in $I$, let the scaled vector $sc$ of a codeword be $[a\ b]^T$. Thus, this codeword estimates a circular window of center $[a\ b]^T$ and radius $sr$, which we depict as the magenta circle in Fig. 7a. Suppose a shape-token in $I$ satisfies the shape matching constraint with this codeword, and is located at position $x' = [a + dx\ \ b + dy]^T$ in this window. We depict the position of this shape-token as a small green × in the figure. Since it is within the window, it follows that:

$$\sqrt{dx^2 + dy^2} \le sr. \quad (2)$$

From Section 4.1, this codeword also predicts an object centroid in $I$ by a circular window of radius $sr$ and center

$$x' - sc = [a + dx\ \ b + dy]^T - [a\ b]^T = [dx\ \ dy]^T, \quad (3)$$

which we depict as the magenta circle in Fig. 7b. From (2), the point $x = [0\ 0]^T$ must be within this window. As such, at the same scale $s$ and location $x$ in $I$, all codewords which satisfy the shape matching constraints with a shape-token located in their estimated windows will predict centroids that are guaranteed to contain the common point $x$, and thus also satisfy the geometric matching constraint. ∎

Fig. 5. Radial method of selecting candidate codewords. We show two example sectors that are defined between pairs of rays (red lines) emitted from the object centroid (yellow +). Green × depicts the centroid positions of the top-scoring t candidates of the sector. Candidate codewords in each sector are also visualized and represent the head (left) and crest (right) of a horse model.

Fig. 6. Matching a codeword combination in positive and negative images.

For a scale $s$ and location $x$ in $I$, we use a numerical value $\Re_i(s, x)$ to indicate if a codeword $\varphi_i$ finds matching shape-tokens that satisfy the shape matching constraint and that are also located within its estimated window. Let $A'$ be the shape descriptor of a shape-token $t'$ located at position $x'$ in $I$. The shape-token $t^*$ which best matches codeword $\varphi_i$ is defined as one whose shape is most similar to $\varphi_i$ (and within the shape distance threshold $\tau_i$ of the codeword), and whose position is closest to its expected position (and within the estimated window of the codeword), i.e., $t^* = \arg\min_{t'} \big( d_{shape}(\varphi_i, t') + d_{geo}(\varphi_i, t') \big)$, where

$$d_{shape}(\varphi_i, t') = \begin{cases} \dfrac{D(f(A_i, s),\ A')}{\tau_i} & \text{if } D(f(A_i, s),\ A') \le \tau_i \\ \infty & \text{otherwise,} \end{cases} \quad (4)$$

$$d_{geo}(\varphi_i, t') = \begin{cases} \dfrac{\| x + s c_i - x' \|_2}{s r_i} & \text{if } \| x + s c_i - x' \|_2 \le s r_i \\ \infty & \text{otherwise,} \end{cases} \quad (5)$$

and $\| \cdot \|_2$ is the L2 norm. $\Re_i(s, x)$ for codeword $\varphi_i$ at scale $s$ and location $x$ in an image is then

$$\Re_i(s, x) = d_{shape}(\varphi_i, t^*) + d_{geo}(\varphi_i, t^*). \quad (6)$$

It is easily verified that a codeword which finds a shape-token that satisfies its shape matching constraint and that is in its estimated window has an $\Re(\cdot)$ value in the range $[0, 2]$, where a lower value indicates better matching. In contrast, a nonmatching codeword has an infinite $\Re(\cdot)$ value. Then, from Theorem 1, at $s$ and $x$ in image $I$, every combination of codewords whose $\Re(s, x)$ values are less than infinity is matched in $I$. By iterating through all scales and locations across all training images, the exhaustive set of matched xCC can thus be found. The computational complexity of this search is $O(l \sigma N)$, where $N$ is the codebook size and $l$ and $\sigma$ are, respectively, the number of locations and scales being searched. For greater efficiency, we sample locations at every 15 pixels in the horizontal and vertical directions, and use a number of scales that covers the range of object scales in training images. This reduces computation overheads, and is similar in concept to the efficient multiscale sliding-window technique.
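
The sketch below evaluates $\Re_i(s, x)$ of Eq. (6) for one codeword at one hypothesis, and stacks these values into one row of the training matrix used in the next section. As before, the codeword/token fields and the shape-distance callable are assumptions for the example.

```python
import math

INF = float('inf')

def R_value(codeword, tokens, x, s, token_dist):
    """R_i(s, x) of Eq. (6): d_shape + d_geo for the best-matching shape-token,
    or infinity when no token satisfies both constraints."""
    cx = (x[0] + s * codeword['c'][0], x[1] + s * codeword['c'][1])  # expected window centre
    best = INF
    for t in tokens:
        d = token_dist(codeword, t, s)
        d_shape = d / codeword['tau'] if d <= codeword['tau'] else INF
        sep = math.hypot(t['centroid'][0] - cx[0], t['centroid'][1] - cx[1])
        d_geo = sep / (s * codeword['r']) if sep <= s * codeword['r'] else INF
        best = min(best, d_shape + d_geo)
    return best                                  # in [0, 2] for a matched codeword

def training_row(codeword, images_tokens, hypotheses, token_dist):
    """One row of the N_c x M training matrix: R values of one codeword over all
    sampled object hypotheses (image index, scale, location)."""
    return [R_value(codeword, images_tokens[img], x, s, token_dist)
            for (img, s, x) in hypotheses]

if __name__ == "__main__":
    dist = lambda cw, t, s: abs(cw['shape'] - t['shape'])            # toy shape distance
    cw = {'shape': 1.0, 'tau': 0.5, 'c': (0.2, 0.0), 'r': 0.1}
    toks = [{'shape': 1.2, 'centroid': (118, 101)}]
    print(R_value(cw, toks, x=(100, 100), s=100, token_dist=dist))   # ~0.62
```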

4.3 Learning Discriminative xCC

We seek an xCC which models discriminative shape, geometric, and structural constraints of an object class to reliably predict object locations. The formulation for $\Re(\cdot)$ in (6) provides a mathematically convenient method to find such an xCC. We take as input the $\Re(\cdot)$ values of all shape-token codewords at every sampled scale and location $(s, x)$ for all training images, where each $(s, x)$ is an object hypothesis. Consider first an xCC which comprises two codewords $\varphi_i$ and $\varphi_j$. From Section 4.2, the matching of this xCC at $(s, x)$ in image $I$ can be mathematically represented as

$$\Re_i(s, x) \le \vartheta_i \ \text{ and } \ \Re_j(s, x) \le \vartheta_j, \quad (7)$$

where $\vartheta$ is a threshold in the range $[0, 2]$. Note that since this xCC is matched at the hypothesis, it therefore implicitly models the shape and geometric configurations of shape-tokens at this hypothesis. This representation can be further generalized to the following form:

$$p_i \Re_i(s, x) \le p_i \vartheta_i \ \text{ and } \ p_j \Re_j(s, x) \le p_j \vartheta_j, \quad (8)$$

where $p$ has value $+1$ or $-1$ to indicate the direction of the inequality, and $\vartheta$ is relaxed to take on any real number value. With this representation, structural configurations, e.g., an XOR relationship between codewords at this hypothesis (matching of $\varphi_i$ implies $\varphi_j$ is weakly matched or unmatched), can be integrated with the shape and geometric configurations of shape-tokens at the same hypothesis. This representation can be modeled by an xCC, with each of its codewords having a $p$ and a $\vartheta$ value. Additionally, by allowing an xCC to have a variable number of codewords, we can impart greater flexibility to this xCC to model shapes and structures of varying complexity. Our aim here is to learn such a discriminative xCC (i.e., the codewords in the xCC, and its $p$ and $\vartheta$ values) which can reliably predict the presence/absence of an object instance at an object hypothesis $(s, x)$.

Fig. 7. Illustration for finding matched codeword combinations.

We exploit a binary decision tree [21] to learn such an xCC. For our purpose, the input to learning a decision tree is an $N_c \times M$ training matrix, a $1 \times M$ binary label vector, and a $1 \times M$ weight vector, where $N_c$ is the number of shape-token codewords and $M$ is the number of object hypotheses. An entry $(i, j)$ in the training matrix is the $\Re(\cdot)$ value of the $i$th codeword at the $j$th object hypothesis, and each entry of the label vector has value $+1$ or $-1$ to indicate the presence or absence of an object at a hypothesis. The weight of the hypothesis (initialized according to the number of positive and negative hypotheses) is given in its corresponding entry of the weight vector. Given these inputs, we learn a decision tree with $k$ decision nodes. It is easily shown that a path from the root node to any leaf node of the decision tree encounters at most $k$ decision nodes, where each decision node is a predicate of the form $p\, \Re(s, x) \le p\, \vartheta$, and the predicates in a path can be combined into the mathematical representation of (8). Each path (i.e., the number of decision nodes and their predicates, and the predicted label at the leaf node) automatically adapts to an object class, and is discriminatively learned to predict the presence/absence of an object at an object hypothesis. Hence, by learning a binary decision tree with $k$ decision nodes, discriminative xCC can be found by simple path traversal from the root node to each leaf node, where each xCC has a variable number of between 1 and $k$ codewords, and a hypothesis is assigned a predicted label by exactly one xCC of the tree. The prediction by an xCC at hypothesis $(s, x)$ is

$$xCC[s, x] = \begin{cases} \ell & \text{if } p_i \Re_i(s, x) \le p_i \vartheta_i \text{ and } p_j \Re_j(s, x) \le p_j \vartheta_j \ldots \\ 0 & \text{otherwise,} \end{cases}$$

where $\ell \in \{-1, +1\}$ is the predicted label given by the leaf node in the path, and a prediction of 0 implies that the xCC does not vote for or against an object presence at $(s, x)$.

We learn an ensemble of discriminative xCC by AdaBoost, where every boosting round outputs a binary decision tree with $k$ decision nodes. Fig. 8a shows the first three xCC that are learned automatically for the horse class. Codewords in each xCC are shown in separate colors. We denote the estimated windows of each codeword (i.e., where it is expected to find matching shape-tokens) by colored circles. A green-colored circle depicts a requirement that the codeword must find matching shape-tokens and corresponds to $p = +1$ of the predicate $p\, \Re(s, x) \le p\, \vartheta$. Conversely, a red-colored circle illustrates the requirement for the codeword to be unmatched (or be weakly matched) and corresponds to $p = -1$. For each xCC, we show an example positive hypothesis $(s, x)$ which satisfies the shape, geometric, and structural constraints modeled by the xCC. We indicate the location $x$ of a hypothesis by a blue "+" and represent its scale $s$ by the diameter of a blue-outlined circle in the training image. Codewords of the xCC and their matching shape-tokens in training images are shown by the same color. All xCC learned by AdaBoost are then combined to form a boosted ensemble, whose detection confidence on an object hypothesis is

$$H[s, x] = \sum_{j} \alpha_j \cdot xCC_j[s, x], \quad (9)$$

with $\alpha_j$ as the weight of $xCC_j$, which is learned during boosting. For clarity of presentation, the xCC shown in Fig. 8a are learned with decision trees having three decision nodes (i.e., $k = 3$), and thus each xCC has between 1 and 3 shape-token codewords. However, rather than fixing the same number $k$ for all classes, we learn a $k$ value separately for each class by threefold cross validation over the values 1 to 10 on the training hypotheses. This approach to tailoring a parameter value to a class was used in [6], and renders the ensemble less prone to overfitting. After finding the optimal $k$ value, we learn the ensemble of xCC from the entire set of training hypotheses and use it to recognize object classes in test images, as described next.
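
To show the overall shape of this learning step, the sketch below boosts size-limited decision trees over the $\Re$-value matrix using scikit-learn (1.2+ API) as a stand-in for the authors' boosting implementation. Two mappings are our assumptions: `max_leaf_nodes = k + 1` approximates "k decision nodes" for a binary tree, and a large finite sentinel replaces the infinite $\Re$ values since scikit-learn requires finite inputs.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

def learn_xcc_ensemble(R_matrix, labels, k=3, rounds=300):
    """Boost depth-limited decision trees over the R-value training data.
    Rows of `R_matrix` are object hypotheses, columns are codewords (the
    transpose of the N_c x M matrix in the text); `labels` are +/-1.
    Each root-to-leaf path of a tree plays the role of one xCC."""
    X = np.where(np.isinf(R_matrix), 1e6, R_matrix)       # finite sentinel for "unmatched"
    base = DecisionTreeClassifier(max_leaf_nodes=k + 1)   # at most k decision nodes
    ens = AdaBoostClassifier(estimator=base, n_estimators=rounds)
    return ens.fit(X, labels)

def detection_confidence(ens, R_row):
    """H[s, x] of Eq. (9): the weighted vote of all boosted trees."""
    x = np.where(np.isinf(R_row), 1e6, R_row).reshape(1, -1)
    return float(ens.decision_function(x)[0])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R = rng.uniform(0, 2, size=(400, 50))
    R[rng.random(R.shape) > 0.6] = np.inf                 # many codewords unmatched
    y = np.where(R[:, 0] + R[:, 1] < 2.5, 1, -1)          # toy labels
    model = learn_xcc_ensemble(R, y, k=3, rounds=50)
    print(detection_confidence(model, R[0]))
```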

4.4 Object Detection and Image Classification

We detect objects in a test image with a multiscale sliding-window approach using the boosted ensemble learned in the previous section. The sliding and scale steps for each class are identical to those used during training in Section 4.2, with the aspect ratio of the window equal to that of the average training bounding box of the class. Each window is an object hypothesis $(s, x)$, in which $s$ is the diagonal length of the window and $x$ its center. We evaluate the detection confidence of each window by (9), which gives a 3D confidence map for the scale and location of an object instance in the test image. We consider local maxima as candidate detections, and, to avoid multiple detections of the same object instance, apply a postfiltering step to remove candidate detections whose $x$ are within the window of a more confident detection. The retained detections yield the final set of detections of the test image. For image classification, we use the confidence of the strongest detection as the classification confidence of the test image.
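
A simplified detection loop in this spirit is sketched below. It scans scales and locations, scores each window with a supplied confidence function standing in for $H[s, x]$, and then postfilters overlapping detections; the local-maxima step is collapsed into greedy suppression, and the square window used for the containment test is an approximation of the class-specific window.

```python
def detect(conf_fn, image_size, scales, step=15):
    """Multiscale sliding-window detection.  `conf_fn(s, x)` returns the boosted
    ensemble confidence for a window of diagonal length s centred at x."""
    W, H = image_size
    dets = [(conf_fn(s, (x, y)), s, (x, y))
            for s in scales
            for x in range(0, W, step)
            for y in range(0, H, step)]
    dets = [d for d in dets if d[0] > 0]                  # keep positive-confidence windows
    dets.sort(reverse=True)                               # strongest first
    kept = []
    for conf, s, (x, y) in dets:
        # postfiltering: drop a detection whose centre lies inside the window
        # of an already-kept, more confident detection
        if all(max(abs(x - kx), abs(y - ky)) > ks / 2 for _, ks, (kx, ky) in kept):
            kept.append((conf, s, (x, y)))
    return kept

if __name__ == "__main__":
    # toy confidence surface peaking near (150, 100) at scale ~200
    conf = lambda s, x: 3.0 - 0.01 * (abs(x[0] - 150) + abs(x[1] - 100)) - 0.005 * abs(s - 200)
    print(detect(conf, image_size=(320, 240), scales=[150, 200, 250])[:2])
```

For image classification, the confidence of the strongest kept detection would serve as the image's classification confidence, as described above.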

5 FUSION OF SHAPE AND APPEARANCE

The previous sections discussed our contour-based method, which exploits only shape information as recognition cues. Given the flexible nature of the method, incorporating complementary cues of appearance information proves particularly straightforward. In this section, we present a hybrid method which jointly exploits shape and appearance cues for recognition. Here, we employ the very powerful and popular SIFT features as appearance cues, although we note our method supports the use of other features like Gabor and MSER. We first learn a codebook of shape-tokens and a codebook of SIFT features separately from training images of the class. Decision trees and AdaBoost are then exploited to learn discriminative combinations comprising different codeword types. We term such a combination an x-mixture-codeword-combination or xMCC.

5.1 Learning a Codebook of SIFT Features

We follow a similar framework as Section 3 to learn a SIFT codebook. Specifically, we first extract SIFT features [22] from within the bounding boxes of training objects and cluster them by the adapted bisecting k-medoid method. We compute dissimilarity between SIFT features by the Euclidean distance measure (in line with [23]), and the range of this measure is $[0, 2]$. As this measure differs from that used for shape-tokens in (1) and affects only this step, we experimentally determine the best value for the clustering parameter, giving a dissimilarity threshold th of 0.50. SIFT candidate codewords are then extracted with the same second-step clustering for relative positions of SIFT features. Finally, we use the same radial ranking scheme to select candidate codewords into a SIFT codebook.

Fig. 8. (a) Three example discriminative xCC and (b) three example discriminative xMCC. Examples shown are for the horse class.

5.2 Learning Discriminative xMCC

The learning of discriminative xMCC follows a straightforward adaptation of the method in Section 4.3. At each object hypothesis, we find the $\Re(\cdot)$ values of all shape-token codewords by (6) and the $\Re(\cdot)$ values of all SIFT codewords as

$$\Re_i(s, x) = d_{app}(\varphi_i, t^*) + d_{geo}(\varphi_i, t^*), \quad (10)$$

where

$$t^* = \arg\min_{t'} \big( d_{app}(\varphi_i, t') + d_{geo}(\varphi_i, t') \big), \quad (11)$$

$$d_{app}(\varphi_i, t') = \begin{cases} \dfrac{\| A_i - A' \|_2}{\kappa_i} & \text{if } \| A_i - A' \|_2 \le \kappa_i \\ \infty & \text{otherwise.} \end{cases} \quad (12)$$

Here, $\varphi_i$ is a SIFT codeword, $A_i$ its SIFT descriptor values, $\kappa_i$ its appearance distance threshold, and $t'$ is an image SIFT feature with descriptor values $A'$. We collect the $\Re(\cdot)$ values of all shape-token and SIFT codewords into an $N \times M$ training matrix, where $N$ is the total number of shape-token and SIFT codewords and $M$ is the number of object hypotheses. This training matrix is then exploited to learn an ensemble of discriminative xMCC as done previously in Section 4.3 to learn the ensemble of xCC. We show the first few xMCC that are learned with $k = 3$ for the horse class in Fig. 8b. The layout in this figure follows that of Fig. 8a. Notice that, unlike the xCC in Fig. 8a, each xMCC comprises codewords of types 1) purely shape-token, 2) purely SIFT, or 3) a mixture of SIFT and shape-token.
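
A sketch of the SIFT-codeword term follows, mirroring the shape-token case but with the appearance distance of Eq. (12). The codeword fields (including the threshold name `kappa`) are hypothetical names for this example; the resulting rows would simply be stacked with the shape-token rows to form the $N \times M$ matrix used to learn xMCC.

```python
import numpy as np

INF = float('inf')

def R_sift(codeword, feats, x, s):
    """R(s, x) for a SIFT codeword (Eqs. 10-12): the appearance term is the
    Euclidean distance between descriptors normalised by the codeword's
    appearance threshold; the geometric term matches the shape-token case."""
    cx = np.array(x) + s * np.array(codeword['c'])        # expected window centre
    best = INF
    for f in feats:
        d = np.linalg.norm(codeword['desc'] - f['desc'])
        d_app = d / codeword['kappa'] if d <= codeword['kappa'] else INF
        sep = np.linalg.norm(np.array(f['pos']) - cx)
        d_geo = sep / (s * codeword['r']) if sep <= s * codeword['r'] else INF
        best = min(best, d_app + d_geo)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cw = {'desc': rng.random(128), 'kappa': 0.5, 'c': (0.1, -0.05), 'r': 0.08}
    feats = [{'desc': cw['desc'] + rng.normal(0, 0.02, 128), 'pos': (112, 94)}]
    print(R_sift(cw, feats, x=(100, 100), s=100))         # small value: well matched
```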

6 EXPERIMENTAL EVALUATION

We structure the evaluation of our contour-based and hybrid recognition methods as follows: In Section 6.1, we perform an extensive evaluation of the classification and detection performance of our contour-based method (i.e., shape-tokens only) on several challenging data sets. These results show that shape-tokens alone afford powerful recognition cues, where our contour-based method performs comparably to or better than recent state-of-the-art contour-based methods and other related methods. Next, in Section 6.2, we investigate the performance of our hybrid method (i.e., SIFT + shape-tokens), and compare it against our contour-based method (i.e., shape-tokens only) and an appearance-based method (i.e., SIFT only). Additionally, we also compare our hybrid method against other recent hybrid methods.

We adhere to the evaluation criteria and experimental protocols of previous methods. As closely as possible, we use the same training and test object images as previous methods for comparing performances. An object is correctly detected if the overlap of the ground truth and detected bounding boxes is above 50 percent, and multiple detections of the same object count as false positives. We compare detection performance by two scores of a recall-precision (RP) curve: the equal error rate (RP-EER) and the area under the curve (RP-AUC). RP-EER reports recall at a single precision value, while RP-AUC measures detection performance across all precision levels and so gives a more representative score for comparison purposes. For image classification, we consider an image to be correctly classified if it contains the object class. We compare classification performance by two scores of an ROC curve: the ROC-EER and ROC-AUC scores.

6.1 Contour-Based Recognition Method

We first investigate important aspects of our contour-based recognition method (namely its tolerance to viewpoint changes and its ability to discriminate between similar-shaped object classes) before evaluating on two challenging data sets [5], [20]. All experiments, unless otherwise stated, are performed with the following parameter settings. We use $\bar{\epsilon} = 0.8$ to describe shape-tokens. To learn a shape-token codebook, we set th at 20 percent of the maximum range of $D(\cdot)$, collect the 20 best candidates in each of 30 sectors, and retain the best 350 candidates as codewords. Training hypotheses are taken with a sliding step of 15 pixels in the horizontal and vertical directions, with the number of scales optimized against the training data. Evaluation uses the same sliding and scale steps. We learn the boosted ensemble with 300 boosting rounds, and select the optimal number of decision nodes by threefold cross validation on the training data over the values $\{1, 2, \ldots, 10\}$.

6.1.1 Detection under Viewpoint Changes

We exploit shape, geometric, and structural constraints of an object class to learn discriminative xCC for object detection. These constraints are largely dependent on the viewing angle of the training instances. For example, changes to object contours due to viewpoint variations can alter visual aspects of the extracted shape-tokens, and thus shape constraints that are valid for a specific viewpoint may no longer be valid for other viewpoints. In this respect, a detector which is learned for one viewpoint may lack the flexibility to detect objects seen at other viewing angles.

To investigate the tolerance of our method toward viewpoint changes, we learn detectors from training objects seen in side view and apply them to test objects that are rotated about a vertical axis. We perform this experiment on the ETH-80 data set [24], which comprises eight classes with 10 different object instances per class. We consider the first five instances of each class as training instances and the remaining instances as test instances. Here, we focus the evaluation on the car and horse classes, whose instances exhibit large shape variations when rotated about the vertical axis (unlike some other classes, e.g., pear). Given a targeted class, e.g., car, we use all training car instances seen at a 0 degree viewing angle as positive instances, and training instances from each of the remaining classes (i.e., dog, horse, pear, . . . ), also seen at a 0 degree viewing angle, as negative instances. At test time,


the detector is applied to positive and negative test instances seen at various vertically rotated viewing angles.

Fig. 9a plots the average detection confidence of test cars (bold green curve) and that of all noncar instances (bold red curve) across viewpoints. Additionally, we plot the average confidence of each noncar class in the same graph. As shown, the car detection confidence is above that of every noncar class for all viewpoints. This demonstrates that the car detector can discriminate against other test classes across significant viewpoint changes. In Fig. 9b, we observe that the horse detector has a weaker ability to discriminate against other classes across viewpoints, especially against the dog and cow classes. This weaker performance against the dog and cow classes is not surprising since these classes share similar shape boundaries with a horse. In contrast, classes (e.g., tomato) which have vastly different shape boundaries from a horse can be robustly discriminated across all tested viewpoints. Overall, these results show that our approach possesses some robustness to viewpoint changes, where detection is better for the car class than for the horse class (possibly due to larger shape variations in the horse class).

6.1.2 Discriminating Similar Shaped Object Classes

Next, we investigate the ability of our method to discriminate between classes whose objects share similar shape boundaries. We evaluate on both rigid (motorbike versus bike-side) and articulated (cow-side versus horse-side) classes, and use images from the Graz-17 data set [5]. For rigid classes, we learn detectors on the same training set (90 motorbike + 90 bike-side images), and evaluate the detectors on the same (53 + 53) test images. Specifically, to learn a detector for a target class, e.g., motorbike, we use the 90 motorbike training images as object images and the 90 bike-side images as background images. To evaluate the motorbike detector, we consider the 53 motorbike test images as object images and the 53 bike-side images as background images. For the articulated classes, we learn detectors on the same training images (45 cow-side + 45 horse-side) and evaluate on the same (65 + 65) test images.

Table 1 gives the RP-AUC detection scores of these detectors. To investigate the ability of a detector to discriminate between similarly shaped object classes, for each detector (e.g., the motorbike detector) we report within brackets the number of background instances (i.e., bike-side) that are localized within the most confident P detections, where P is the number of object instances in the test images and equals 53 and 65 for the rigid and articulated classes, respectively. Considering the rigid classes, we observe that the bike-side detector has weaker detection and discriminative abilities compared to the motorbike detector. This weaker performance is a consequence of the greater challenges posed by the bike-side class, where objects are characterized by holes and thin skeletal structures and are embedded in substantial background clutter. For the articulated classes, both the cow-side and horse-side detectors have equivalent discriminative ability. However, the cow-side detector has better detection performance, owing to the low intraclass variability of this class. On the whole, these results demonstrate that our method is good at discriminating between classes with similar shapes, but would benefit from either more consistent or a larger set of training instances.
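The bracketed confusion counts reported in Table 1 can be computed as in the following sketch (the data layout is a hypothetical choice, not the authors' code):

```python
def confusions_in_top_P(detections, P):
    """detections: list of (confidence, is_background) pairs over the test set.
    Returns how many of the P most confident detections fall on instances of the
    background class (e.g., bike-side for the motorbike detector)."""
    top = sorted(detections, key=lambda d: d[0], reverse=True)[:P]
    return sum(1 for _, is_background in top if is_background)
```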

6.1.3 Evaluation on the Weizmann Horse Data Set [20]

We next evaluate our contour-only method on the challenging Weizmann horse data set. We pair horse images from this data set with the Caltech-256 background images [25], which comprise a diverse collection of indoor and outdoor real-world scenes. Many cluttered edges are present in these images and pose a formidable challenge to our contour-only method. Following the protocols of [6], images are downsampled to a maximum of 320 pixels in height or width, with the first 100 horse and 100 background images used for training, and the remaining 228 horse and 228 background images for testing.

We report quantitative results in Fig. 10. Additionally, the figure also includes the performance of the recent state-of-the-art contour-based method of Shotton et al. [6], which we denote as Shotton-I and Shotton-II. While Shotton-I used the same number of training images as us, Shotton-II augmented a set of object hypotheses from the test images to the training data and retrained their system on the augmented data. In this respect, Shotton-II learned from a larger and more diverse set of training images. Nevertheless, we achieve better classification and detection


Fig. 9. Detecting (a) car and (b) horse classes under viewpoint changes.

TABLE 1
Discriminating Similarly Shaped Object Classes

Fig. 10. Image classification and object detection results by our contour-based method on the Weizmann horse test set, with comparisons to [6].

performances, as evident from the higher ROC-AUC and RP-AUC scores (given in the legends). The plot of recall against false positives per image (fppi) in Fig. 10b further reveals the better detection performance of our method, in which 93 percent of test objects are correctly detected at a low fppi of 0.15 (around one false positive every seven images). In contrast, Shotton-I and Shotton-II managed a recall of 87 and 90 percent, respectively. Our detection result is also superior to that achieved very recently by the contour-based method of Bai et al. [26], which obtained 0.8032 RP-AUC. While their method used a smaller training subset, this is due to their need to use segmented object instances for training. In contrast, since our method needs only bounding boxes, we fully exploit all training images, which likely accounts for some part of the improvement.
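The recall-at-fppi figures quoted above can be read off a scored detection list as in this sketch (variable names are illustrative, and detections are assumed to be pre-labeled as true or false positives):

```python
import numpy as np

def recall_at_fppi(scores, is_tp, n_objects, n_images, fppi=0.15):
    """Highest recall achievable while keeping false positives per image <= fppi."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(~labels)
    within_budget = fp / float(n_images) <= fppi
    return float(tp[within_budget].max()) / n_objects if within_budget.any() else 0.0
```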

We show example detections by our method in Fig. 11. Codewords which contribute positively to detections are back-projected at their detected positions and scales and shown at the bottom of the test images. Bounding boxes denote detections, where green indicates true positives, red false positives, and yellow the ground truth for false negatives. As observed, our contour-only method gives excellent detections, where horses are accurately localized despite extensive background clutter and significant intraclass variation. Turning to incorrect detections, we observe that false positives arise when the pattern of cluttered edges appears very similar to the object model. For example, the outline of the box placed on a grass patch is substantially similar to the neck and front legs of a horse, leading to a false detection. We show in Section 6.2.2 that the fusion of appearance cues with shape cues can be exploited to improve detection performance. False negatives are due either to considerable pose changes (second to last test image) or to viewpoint variations (last test image) which exceed the reasonable tolerance of our detector, which is learned for only a single object viewpoint. Using a more extensive training set covering different poses and viewpoints of the object class may mitigate such false negatives. Additionally, for robust detection across unconstrained viewpoints, one can combine a battery of detectors, each trained for a specific viewpoint of the object, though we leave these improvements for future work.

6.1.4 Evaluation on the Graz-17 Data Set [5]

This demanding data set features 17 diverse classes. We follow the experimental setup of [5], [6] and evaluate on the same training and test sets as used by their methods. We show example detections in Fig. 12. Table 2 reports quantitative results, with comparison to the recent state-of-the-art contour-only methods of Opelt et al. [5] and Shotton et al. [6]. Classification and detection scores are averaged over the 17 classes and shown in the last row of the table.


Fig. 11. Example detections on the Weizmann test set. The top row shows detections and the bottom row the shape-token codewords that contribute positively to detections. Codewords are visualized at their detected positions and scales.

Fig. 12. Example detections by our contour-based method for all Graz-17 classes. Note accurate localization despite challenges like intraclass appearance variations, occlusion, background clutter, and multiple objects.

TABLE 2
Classification and Detection Results on the Graz-17 Data Set

*Average ROC-EER for [5] is calculated from the four classes whose classification results are reported by the authors.

Even though AUC scores are more representative, we compare against [5] by EER scores (since the authors only provide EER scores). For classification, we achieve an average (across the first four classes) ROC-EER of 4.4 percent, which is not as good as that of their state-of-the-art method. Nevertheless, it is still quite competitive and attains a better average ROC-EER score compared to the other published results shown in Table 3. It has to be mentioned that even though we use stronger supervision than the methods of Table 3, our results are obtained with 100 positive and 100 negative training images, which is less than half of what was used by these methods. For object detection, we perform better on some classes (e.g., motorbike and face), but considerably worse for classes like car-2/3-rear and bike-rear, which have very few test images. The number of test images, m, can sharply affect an RP-EER score and hence directly influences the significance of the RP-EER measure. Specifically, RP-EER reports performance at a single precision value, and hence even one false positive or missed detection can impact RP-EER by as much as 100/m percent, as also pointed out in [6]; for instance, for a class with only 20 test objects, a single error can shift the RP-EER by 100/20 = 5 percentage points. Consequently, even though a higher average RP-EER is obtained by our method, much more significant are the detection results for classes with more test images. In particular, considering classes with more than 200 test images (i.e., the first four classes), we attain 4.425 percent average RP-EER, exactly matching that of [5] (see Table 2).

We compare against [6] by the more representative AUC scores. Overall, we have better classification and detection performances, as observed from the higher ROC-AUC and RP-AUC scores. Importantly, these improvements are obtained with a smaller training set compared to [6], where hypotheses from the test data are augmented with the original training data for retraining (as in the Shotton-II method of the previous Weizmann horse experiment).

6.2 Hybrid Recognition Method

We evaluate our hybrid method on the challenging Weizmann horse data set and the first four classes of the Graz-17 data set. Both image classification and object detection results are reported, and we compare against 1) our contour-based method, 2) an appearance-based benchmark method (obtained by replacing the shape-tokens of our contour-based method with SIFT features), and 3) other recent hybrid methods [5], [15]. We adhere to the experimental protocols described in the previous section. Additionally, for fair comparisons against our contour-based method, all experiments for each object class follow the exact parameter settings as in Section 6.1.

6.2.1 Image Classification Performance

Table 4 shows the image classification ROC-AUC and ROC-EER scores of our methods with comparisons to [15]. We make several interesting observations. First, our hybrid method is more effective at classifying images than our appearance-based method, attaining better classification results on all five tested classes. Additionally, we observe that our hybrid method attains a higher (admittedly small) average ROC-AUC score compared to that obtained by our contour-based method. Turning to the classification results of the hybrid method of Shotton et al. [15], which used dense texton and contour features, our hybrid method obtains a lower average ROC-AUC score. We believe the better classification results of their method are due to their use of dense texton features; our contour method (third column) attains a slightly higher average ROC-AUC score than their contour method (sixth column), but our appearance method, which uses sparse SIFT features (second column), has a much lower ROC-AUC score than their appearance method, which used dense texton features (fifth column). Consequently, it does appear that the dense texton features have contributed substantially to the improved performance of their hybrid method, and we postulate that the classification results of our hybrid method would also benefit from replacing sparse SIFT features with dense texton features. Finally, we note that the classification results of our hybrid method are superior to the other


TABLE 3
Comparison of Classification ROC-EER Scores on the First Four Graz-17 Classes to Other Results

*Average ROC-EER scores for [31], [32] are calculated from the three classes whose classification results are reported by the authors.

TABLE 4
Classification Performance Using Different Feature Types

published results of Table 3, in which its average ROC-EER number (across the four Graz-17 classes) is 3.8 percent.

6.2.2 Object Detection Performance

We now evaluate the detection performance of our hybrid method on the same five object classes. We show example detections by our hybrid method in Fig. 13, where shape-token and SIFT codewords which contribute positively to detections are back-projected at their detected positions and scales. Table 5 reports the RP-AUC and RP-EER detection scores of our hybrid method, with comparison against our appearance-based and contour-based methods, and the hybrid methods of Shotton et al. [15] and Opelt et al. [5] (which used sparse SIFT and contour features).

We draw several conclusions. First, we observe that our hybrid method obtains better detection results than our appearance-based method on all five tested classes, with an average RP-EER improvement of 9.5 percent. Compared to our contour-based method, we also observe better performance by our hybrid method, where it attains improved detection RP-EER scores on four of the five tested classes and an equal detection score on the remaining class. These results demonstrate that our approach of combining complementary feature types can improve detection performance.

In comparison with [15], our hybrid method attains a higher RP-AUC score on the Weizmann horse data set, but obtains weaker detection performance on the other four Graz-17 classes. Turning to the plane class, for which we have a much weaker RP-EER score than [15], we note that plane instances typically contain extensive regions of homogeneity. Given that the DoG detector fires only at salient regions, there are often few SIFT features extracted from plane instances, and hence our hybrid recognition method cannot fully capitalize on the appearance information. Several other authors [33], [34] have also independently concluded that a single interest region detector cannot provide sufficient regions to represent an object class and have instead resorted either to using a wide plethora of region detectors or to densely sampling image regions at every pixel location as done in [15] (a brief sketch of this dense sampling alternative is given below). As future work, we will explore how the choice of appearance features can affect recognition performance. We also compare the detection performance of our hybrid method against [5], which used sparse SIFT features (with contour fragments) as recognition cues. Overall, we attain a 4.1 percent average RP-EER score (computed across the four Graz-17 classes), which is better than the 5.2 percent obtained by their method.
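For reference, the dense sampling alternative mentioned above can be sketched as follows (illustrative only; the stride and patch sizes are assumptions): descriptor patches are placed on a regular grid, so even texture-poor regions such as plane fuselages contribute features, unlike a DoG detector that fires only at blob-like salient points.

```python
def dense_sample_positions(img_w, img_h, stride=8, patch_sizes=(16, 24, 32)):
    """Yield (x, y, size) for descriptor patches placed on a regular grid."""
    for size in patch_sizes:
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield (x, y, size)
```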

7 CONCLUSION

We have presented a contour-based method that exploits very simple and generic shape primitives of line segments and ellipses for image classification and object detection. Primitive combinations which reliably predict object locations are learned by exploiting discriminative shape, geometric, and structural constraints of an object class in a principled and unified framework. A novelty of our method is that we do not restrict combinations to have a fixed number of primitives, but allow them to automatically adapt to an object class. This, coupled with the generic nature of the primitives, imparts great flexibility to a combination to represent discriminative shapes of varying complexity. Extensive evaluation shows the effectiveness of our method, where a wide variety of classes is successfully recognized. Comparison against state-of-the-art contour-based methods and related methods also yields very competitive recognition performance for our contour-based method.

Building on the contour-based method, we proposed a hybrid method which capitalizes on shape and appearance information as recognition cues. The novelty of this method is that each discriminative combination of features that is learned can contain either only shape-based features, only appearance-based features, or a mixture of both shape- and appearance-based features. Both the number and the types of features to be combined are learned from training images. It is shown that our method of combining shape and appearance features can improve recognition results, and it also compares very favorably with other recent hybrid methods.

As future work, we plan to incorporate additional cues like dense appearance features and semantic cues (i.e., contextual features) to improve recognition performance. Further improvement of our approach toward viewpoint changes is also desirable. Perhaps affine invariant regions can be exploited for this purpose to guide the search for test object instances across varying viewpoints. Finally, we plan


TABLE 5
Object Detection Performance Using Different Feature Types

*Average RP-AUC and RP-EER scores for [5], [15] are calculated from the classes whose detection scores are reported by the authors.

Fig. 13. Example detections on the Weizmann horse and Graz-17 plane, motorbike, face, and car-rear classes. Note accurate localization across scale and space despite substantial challenges like cluttered background and occlusion.

to investigate other methods of learning discriminative feature combinations, e.g., using separable function networks with rectangular component functions.

ACKNOWLEDGMENTS

The authors thank the Associate Editor and all reviewers for their valuable input. This work is supported by the Institute for Infocomm Research.

REFERENCES

[1] A.-S. Chia, S. Rahardja, D. Rajan, and M. Leung, "Structural Descriptors for Category Level Object Detection," IEEE Trans. Multimedia, vol. 11, no. 8, pp. 1407-1421, Dec. 2009.
[2] K. Mikolajczyk, A. Zisserman, and C. Schmid, "Shape Recognition with Edge-Based Features," Proc. British Machine Vision Conf., pp. 779-788, 2003.
[3] F. Jurie and C. Schmid, "Scale Invariant Shape Features for Recognition of Object Categories," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 90-96, 2004.
[4] P.M. Kumar, P.H.S. Torr, and A. Zisserman, "Extending Pictorial Structures for Object Recognition," Proc. British Machine Vision Conf., pp. 789-798, 2004.
[5] A. Opelt, A. Pinz, and A. Zisserman, "Learning an Alphabet of Shape and Appearance for Multi-Class Object Detection," Int'l J. Computer Vision, vol. 80, no. 1, pp. 16-44, 2008.
[6] J. Shotton, A. Blake, and R. Cipolla, "Multi-Scale Categorical Object Recognition Using Contour Fragments," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 7, pp. 1270-1281, July 2008.
[7] P. David and D. DeMenthon, "Object Recognition in High Clutter Images Using Line Features," Proc. Int'l Conf. Computer Vision, pp. 1581-1588, 2005.
[8] X. Ren, "Learning and Matching Line Aspects for Articulated Objects," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[9] A.-S. Chia, S. Rahardja, D. Rajan, and M. Leung, "Object Recognition by Discriminative Combinations of Line Segments and Ellipses," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2225-2232, 2010.
[10] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, "Groups of Adjacent Contour Segments for Object Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 1, pp. 36-51, Jan. 2008.
[11] D.R. Martin, C.C. Fowlkes, and J. Malik, "Learning to Detect Natural Image Boundaries Using Local Brightness, Color and Texture Cues," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 5, pp. 530-549, May 2004.
[12] J. Gall and V. Lempitsky, "Class-Specific Hough Forests for Object Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1022-1029, 2009.
[13] M. Leordeanu, M. Hebert, and R. Sukthankar, "Beyond Local Appearance: Category Recognition from Pairwise Interactions of Simple Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[14] R. Fergus, P. Perona, and A. Zisserman, "A Visual Category Filter for Google Images," Proc. European Conf. Computer Vision, pp. 242-256, 2004.
[15] J. Shotton, A. Blake, and R. Cipolla, "Efficiently Combining Contour and Texture Cues for Object Recognition," Proc. British Machine Vision Conf., 2008.
[16] M. Leung and Y. Yang, "Dynamic Two-Strip Algorithm in Curve Fitting," Pattern Recognition, vol. 23, nos. 1/2, pp. 69-79, 1990.
[17] A.-S. Chia, S. Rahardja, D. Rajan, and M. Leung, "A Split and Merge Based Ellipse Detector with Self-Correcting Capability," IEEE Trans. Image Processing, vol. 20, no. 7, pp. 1991-2006, July 2011.
[18] V. Ferrari, T. Tuytelaars, and L. Gool, "Object Detection by Contour Segment Networks," Proc. European Conf. Computer Vision, pp. 14-28, 2006.
[19] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603-619, May 2002.
[20] E. Borenstein and S. Ullman, "Class-Specific, Top-Down Segmentation," Proc. European Conf. Computer Vision, pp. 639-641, 2002.
[21] L. Breiman, J. Friedman, C. Stone, and R. Olshen, Classification and Regression Trees. Wadsworth and Brooks, 1984.
[22] D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[23] G. Csurka, C.R. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual Categorization with Bags of Keypoints," Proc. European Conf. Computer Vision Workshop, pp. 1-22, 2004.
[24] B. Leibe and B. Schiele, "Interleaved Object Categorization and Segmentation," Proc. British Machine Vision Conf., pp. 759-768, 2003.
[25] G. Griffin, A. Holub, and P. Perona, "Caltech-256 Object Category Data Set," technical report, California Inst. of Technology, no. 24, 2007.
[26] X. Bai, X. Wang, L.J. Latecki, W. Liu, and Z. Tu, "Active Skeleton for Non-Rigid Object Detection," Proc. Int'l Conf. Computer Vision, 2009.
[27] J. Sivic, B.C. Russell, A.A. Efros, A. Zisserman, and W.T. Freeman, "Discovering Objects and Their Location in Images," Proc. Int'l Conf. Computer Vision, pp. 370-377, 2005.
[28] D. Crandall and D. Huttenlocher, "Weakly Supervised Learning of Part-Based Spatial Models for Visual Object Recognition," Proc. European Conf. Computer Vision, pp. 16-29, 2006.
[29] R. Fergus, P. Perona, and A. Zisserman, "Weakly Supervised Scale-Invariant Learning of Models for Visual Recognition," Int'l J. Computer Vision, vol. 71, no. 3, pp. 273-303, 2007.
[30] A. Bar-Hillel and D. Weinshall, "Efficient Learning of Relational Object Class Models," Int'l J. Computer Vision, vol. 7, nos. 1-3, pp. 175-198, 2008.
[31] Y. Chen, L.L. Zhu, A. Yuille, and H. Zhang, "Unsupervised Learning of Probabilistic Object Models (POMs) for Object Classification, Segmentation and Recognition Using Knowledge Propagation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1747-1761, Oct. 2009.
[32] L.L. Zhu, Y. Chen, and A. Yuille, "Unsupervised Learning of Probabilistic Grammar-Markov Models for Object Categories," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 114-128, Jan. 2009.
[33] E. Nowak, F. Jurie, and B. Triggs, "Sampling Strategies for Bag-of-Features Image Classification," Proc. European Conf. Computer Vision, pp. 490-503, 2006.
[34] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study," Int'l J. Computer Vision, vol. 73, no. 2, pp. 213-238, 2007.

Alex Yong-Sang Chia received the BEng degree in computer engineering with first-class honors and the PhD degree from Nanyang Technological University, Singapore, in 2005 and 2010, respectively. Currently, he is a scientist with the Institute for Infocomm Research. He was awarded the Tan Sri Dr. Tan Chin Tuan Scholarship in 2004, the A*STAR Graduate Scholarship in 2006, and the Tan Kah Kee Young Inventors' Merit and Silver Awards (open category) in 2009 and 2012, respectively. His research interests are in image processing, computer vision, and machine learning.

Deepu Rajan received the bachelor of engineering degree in electronics and communication engineering from the Birla Institute of Technology, Ranchi, India, the MS degree in electrical engineering from Clemson University, and the PhD degree from the Indian Institute of Technology Bombay, India. He is an associate professor in the School of Computer Engineering at Nanyang Technological University, Singapore. From 1992 until 2002, he was a lecturer in the Department of Electronics at Cochin University of Science and Technology, India. His research interests include image processing, computer vision, and multimedia signal processing. He is a member of the IEEE.


Maylor Karhang Leung received the BSc degree in physics from the National Taiwan University in 1979, and the BSc, MSc, and PhD degrees in computer science from the University of Saskatchewan, Canada, in 1983, 1985, and 1992, respectively. Currently, he is a professor at Universiti Tunku Abdul Rahman, Malaysia. His research interests are in the areas of computer vision, pattern recognition, and image processing. Particular interests are in object recognition, video surveillance for human behavior detection, robot navigation, line pattern analysis, and computer aids for the visually impaired. He is a member of the IEEE and the IEEE Computer Society.

Susanto Rahardja received the BEng degree in electrical engineering from the National University of Singapore in 1991, and the MEng and PhD degrees in electrical engineering from Nanyang Technological University in 1993 and 1997, respectively. He is currently the deputy executive director (Research) and head of the Signal Processing Department at the Institute for Infocomm Research in the Agency for Science, Technology and Research, Singapore. He was involved in multimedia standardization activities, in which he contributed technologies for scalable-to-lossless audio compression and lossless-only coding that were adopted and published as the normative international standards ISO/IEC 14496-3:2005/Amd.3:2006 and ISO/IEC 14496-3:2005/Amd.2:2006, respectively. He has published numerous internationally refereed journal and conference papers in the areas of multimedia signal processing and digital communications. He served as an associate editor for the IEEE Transactions on Multimedia and the IEEE Transactions on Audio, Speech, and Language Processing from 2007 to 2011, and is currently serving as an associate editor for the Elsevier Journal of Visual Communication and Image Representation. He has received several awards, including the IEE Hartree Premium Award in 2002, the Tan Kah Kee Young Inventors' Open Category Gold award in 2003, and the National Technology Award in 2007. He is currently the president of the SIGGRAPH Singapore Chapter (SSC), a member of the board of governors of the Asia-Pacific Signal and Information Processing Association (APSIPA), a member of the Management Board of the Interactive Digital Media Institute at the National University of Singapore, and a council member of the National IT Standards Committee in Singapore. He holds an adjunct appointment as a full professor at the National University of Singapore and is a fellow of the IEEE.

