


www.elsevier.com/locate/patrec

Pattern Recognition Letters 28 (2007) 2227–2237

Generic object recognition with regional statistical models and layer joint boosting

Jun Gao a, Zhao Xie a,*, Xindong Wu a,b

a School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China
b Department of Computer Science, University of Vermont, Burlington, VT 05405, USA

Received 6 February 2007; received in revised form 27 June 2007; available online 24 July 2007

Communicated by J.A. Robinson

Abstract

This paper presents novel regional statistical models for extracting object features, and an improved discriminative learning method, called layer joint boosting, for generic multi-class object detection and categorization in cluttered scenes. Regional statistical properties on intensities are used to find sharing degrees among features in order to recognize generic objects efficiently. Based on boosting for multi-classification, the layer characteristic and two typical weights in sharing-code maps are taken into account to keep the maximum Hamming distance between categories, and heuristic search strategies are provided in the recognition process. Experimental results reveal that, compared with interest point detectors in representation and multi-boost in learning, layer joint boosting with statistical feature extraction can enhance the recognition rate consistently, with a similar detection rate.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Generic object recognition; Regional statistical models; Layer joint boosting; Sharing-code maps; ECOC matrix

1. Introduction

A long-standing goal of computer vision has been to build systems that are able to detect and categorize many different kinds of objects in a cluttered world. Although the general problem remains unsolved, progress has been made on a restricted version of this goal. Generally speaking, generic object recognition contains three key issues – object representation, learning from training data, and recognition on test data – which have been studied in many research efforts.

Feature extraction is an initial task for object recognition. Currently, many useful detectors have been designed, ranging from simple pixel difference operators to complex visual attention mechanisms (Navalpakkam and Itti, 2006), according to biological visual selection. The use of local features, like scaled-Harris (Mikolajczyk and Schmid, 2001, 2003), affine-invariance (Mikolajczyk and Schmid, 2002; Kadir et al., 2004), and DoG (Lowe, 1999) interest points with SIFT descriptors, has been widely adopted for feature extraction due to scale and affine invariance. A combination of these vectors forms an effective object representation for learning at a later stage. Object representation takes the assumption that objects have to be segmented properly from the cluttered background; otherwise the false-positive vectors from the background will cause inferior recognition results.

0167-8655/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.patrec.2007.07.006
* Corresponding author. Fax: +86 551 2901393. E-mail address: [email protected] (Z. Xie).

There are two common ways to deal with the learning issue, namely generative and discriminative methods. The generative way emphasizes building probabilistic parametric models by maximizing posterior probabilities. The Bayesian model (Li and Perona, 2005) and the constellation model (Fergus et al., 2003, 2004) are good examples of generative object recognition using distinct model structures. The discriminative way, like boosted decision stumps, is to find the best classifiers for minimum errors, which is the main topic of this paper.

Boosting (Friedman et al., 2000; Schapire and Singer, 2000) has been widely used for learning in object detection. It converts the object detection problem into binary classification with a sliding window in images to distinguish between the object class and the background. Viola and Jones (2001) introduced a new cascade structure to reject negative examples with little additional processing, which can detect facial regions rapidly. Opelt et al. (2004) applied multi-boosting to recognize four kinds of objects, and this approach can be extended to other applications. As each categorization needs to be boosted separately, multi-boosting requires more space and more time for training. Torralba et al. (2004) designed shared features of different categories with less training data in generic object detection, which can benefit categorization.

In this paper, we study the performance of multi-class boosting with shared features and propose an improved learning method, called layer joint boosting, for both detection and categorization. Novel statistical models are also designed to guide feature extraction and learning. Compared with existing research efforts, our proposed method has a better performance in categorization, with a similar detection rate.

The rest of the paper is organized as follows. Section 2 introduces the overall framework of the proposed method, which consists of regional statistical models and layer joint boosting for generic object recognition. The regional statistical models and layer joint boosting are discussed in detail in Sections 3 and 4, respectively. Section 5 describes the steps in our proposed algorithm, Section 6 provides our experimental results, and Section 7 concludes the paper.

2. Overall framework

The framework of our proposed method for generic object recognition is given in Fig. 1. First, statistical properties are modeled to classify detected regions into four different types with distinct intra-class and extra-class entropies, which can describe both generic and specific shared features. After the statistical modeling, a layer joint boosting method is proposed based on multi-class categorization to come up with heuristic search strategies for a better performance in categorization. Finally, strong classifiers and an ECOC (Error-Correcting Output Code) matrix are obtained for generic object recognition, including detection and categorization.

Fig. 1. Framework for generic object recognition.

3. Regional statistical models for shared features

In this section, we discuss regional statistical models to extract features from segmented objects in cluttered scenes automatically. Different types of models with different entropies form the shared feature codes for weight initialization in learning.

3.1. Visual regional statistical models

Visual patterns indicate that the pixel intensities in a discrete space usually have typical distributions (Zhu, 2003), such as uniform, cluttered, repetitive or shading ones, according to selectivity and relativity in human vision. Therefore, we describe these statistical properties with different probabilistic models as follows.

1. The flat regions r1, which show smooth variations of inner objects or a large-scale background, have uniform structures. The intensities in these regions are subject to an independently and identically distributed Gaussian distribution,

$$p_1(r_1 \mid I_R) = \prod_{v \in R} G(I_v - \mu;\ \sigma^2) \qquad (1)$$

where $I_R$ indicates a region, $I_v$ represents the intensity of pixel $v$ in a flat region, $\mu$ is the Gaussian mean, and $\sigma^2$ is the Gaussian variance.

2. The cluttered regions r2 have no distinct distribution, and the probability of their appearance is calculated by local histograms,

$$p_2(r_2 \mid I_R) = \prod_{v \in R} h(I_v) = \prod_{j=0}^{255} h_j \qquad (2)$$



where $h_j$ represents the probability of intensity $j$, computed as the number of pixels with intensity $j$ over the total number of pixels in these regions.

3. The texture regions r3 in a gray space present visual repetitions and orientations. To facilitate the computation, we choose a set of eight filters and get the statistics in these regions. Let $v \in I_R$ denote the center of a filter, $\partial v$ indicate the neighborhood of $v$, $h(I_v)$ represent the local histogram in the regions of Gabor-filter images (Zhu et al., 1998), and $h(I_{\partial v})$ be the vector consisting of eight local histograms of filter responses in the neighborhood $\partial v$ of pixel $v$. Each of the filter histograms counts the filter responses at pixels whose filter windows cover $v$,

$$p_3(r_3 \mid I_R) = \prod_{v \in R} \exp\{-\langle \Theta,\ h(I_v \mid I_{\partial v})\rangle\}/Z_v \qquad (3)$$

where $\langle\cdot,\cdot\rangle$ is the inner product between two vectors, $\Theta$ denotes the parameter of the local histograms $h(I_v \mid I_{\partial v}) = h(I_v)/\sum h(I_{v'})$, $v' \in \partial v$, and $Z_v$ is used for normalization.

4. The shading regions r4 can reveal shading effects, such as sky, lake, wall, and perspective textures. In (Zhu et al., 1998), lower-order Markov random fields often model such smooth and inhomogeneous regions. In our experiments, we adopt a 2D Bezier-spline model with sixteen equally spaced control points on $I_R$. This is a generative model. Let $B(x, y)$ be the Bezier surface. For any $v = (x, y) \in I_R$,

$$B(x, y) = U^{T}(x) \cdot M \cdot U(y) \qquad (4)$$

where $U(x) = ((1-x)^3,\ 3x(1-x)^2,\ 3x^2(1-x),\ x^3)^{T}$ and $M = (m_{11}, m_{12}, m_{13}, m_{14}; \ldots; m_{41}, \ldots, m_{44})$. Therefore, the image model for the shading regions r4 is

$$p_4(r_4 \mid I_R) = \prod_{v \in R} G(I_v - B_v;\ \sigma^2) \qquad (5)$$

where $B_v$ is the value of the Bezier surface at $v$ and $\sigma^2$ is the Gaussian variance.

Fig. 2 shows several patches from the four different typical regions. Considering the above proposed models, there are several advantages over existing ones for shared feature extraction, which will be discussed below.

Fig. 2. Different types of patches from different objects.

3.2. Feature extraction for segmented objects

Although many local detectors can extract patches with scale, affine and transform invariance information, there are some constraints for objects in cluttered scenes. In Fig. 3b, with SIFT detectors, many unrelated regions in the background are still detected without hand-segmented labels, which leads to misclassifications. Little research (Sivic et al., 2005) concerns feature extraction from automatically segmented objects in an unsupervised environment. Based on our previous work on statistical regional models in large experiments (Xie and Gao, 2007), we can infer that texture and uniform patches with high probabilities in a small-scale space often appear at the edge of objects or inside objects, and the others can be regarded as the background. A simplified description of the object appearance probability $p(\text{object} \mid I_R)$ in region $I_R$ can be written as follows:

$$p(\text{object} \mid I_R) = \sum_{i=1}^{4} (-1)^{i+1} w_i \prod_{\sigma \in \Sigma} p_i(r_i \mid I_R, \sigma) \qquad (6)$$

where $\sigma$ is a parameter in a discrete scale space $\Sigma$, $p_i$ denotes the conditional probability of the four typical regions described in Section 3.1, and $w_i$ is the corresponding weight computed in Eq. (7):

$$w_i = k_i \cdot \frac{\bar{p}_i}{\sum_{i=1}^{4} \bar{p}_i} \qquad (7)$$

where $\bar{p}_i$ is the typical probabilistic mean in a given region, and $k_i$ represents the influence factor, which is calculated in Eqs. (8)–(10), respectively.

$$k_1 = \sum_{\sigma_{i-1},\,\sigma_i \in \Sigma} \frac{1}{1 + \exp\!\big(-(p_1(r_1 \mid I_R, \sigma_i) - p_1(r_1 \mid I_R, \sigma_{i-1}))^2\big)}\; p_1(r_1 \mid I_R, \sigma_i) \qquad (8)$$

$$k_{2,4} = \max_{\sigma_i \in \Sigma}\big(p_{2,4}(r_{2,4} \mid I_R, \sigma_i)\big) \qquad (9)$$

$$k_3 = \sum_{\sigma_{i-1},\,\sigma_i \in \Sigma} \exp\!\big(-(p_3(r_3 \mid I_R, \sigma_i) - p_3(r_3 \mid I_R, \sigma_{i-1}))^2\big)\, p_3(r_3 \mid I_R, \sigma_i) \qquad (10)$$
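The combination in Eqs. (6)–(10) can be sketched as follows, under the assumption that the per-model probabilities have already been evaluated on a discrete set of scales and stored row-wise in an array `P` (the array layout and function names are ours):

```python
import numpy as np

def influence_factors(P):
    """P[i, s]: probability of model i (0..3) at scale index s. Eqs. (8)-(10)."""
    k = np.empty(4)
    d1 = np.diff(P[0])                                   # scale-to-scale change, flat model
    k[0] = np.sum(P[0][1:] / (1.0 + np.exp(-d1 ** 2)))   # Eq. (8)
    k[1] = P[1].max()                                    # Eq. (9), cluttered
    k[3] = P[3].max()                                    # Eq. (9), shading
    d3 = np.diff(P[2])
    k[2] = np.sum(np.exp(-d3 ** 2) * P[2][1:])           # Eq. (10), texture
    return k

def object_probability(P):
    """Eqs. (6)-(7): alternating weighted combination over the four models."""
    k = influence_factors(P)
    pbar = P.mean(axis=1)                                # typical probabilistic means
    w = k * pbar / pbar.sum()                            # Eq. (7)
    signs = np.array([1.0, -1.0, 1.0, -1.0])             # (-1)^{i+1}
    return float(np.sum(signs * w * P.prod(axis=1)))     # Eq. (6)
```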

Fig. 3c reveals that a higher probability in appearance corresponds to more brightness in a cluttered scene, and Fig. 3d is the result of feature extraction with the proposed models, in which many patches in the background are eliminated.



Fig. 3. (a) Original scene. (b) Scene with SIFT detectors. (c) Foreground probabilities with brightness. (d) A combination of (b) and (c).


3.3. Entropy analysis for shared features in typical regions

In this subsection, mutual entropies are first applied to measure the intra-class and extra-class information in the four typical models. Given a set of images that contain different objects from class $C$, a Boolean variable $X_k$ represents the appearance of patches. The maximum similarity probability $p_k$ can be calculated by normalized cross-correlation (Goshtasby et al., 1984), and $\theta_k$ is the threshold. If $p_k < \theta_k$, $X_k = 0$; otherwise $X_k = 1$. The conditional probabilities $p(X_k(\theta_k) = 0 \mid C)$ and $p(X_k(\theta_k) = 1 \mid C)$, and the prior probabilities $p(C = 0)$ and $p(C = 1)$, can be easily obtained from training data.
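The Boolean indicator above rests on thresholded normalized cross-correlation; a minimal sketch follows (function names are ours, and a small epsilon guards against constant patches):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.mean(a * b))

def patch_indicator(patch, prototype, theta):
    """Boolean X_k: 1 when the similarity p_k reaches the threshold theta."""
    return 1 if ncc(patch, prototype) >= theta else 0
```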

We define typical region sets with high probabilities as $S_i = \{S_{i1}, S_{i2}, \ldots, S_{ij}\}$ for each category, where $i \in \{\mathrm{I}, \mathrm{II}, \mathrm{III}, \mathrm{IV}\}$ denotes the four typical regions, and $j$ is the number of categories. Eq. (11) describes the intra-class information of multiple instances. On the other hand, we can also compute the extra-class information by Eq. (12):

$$I^1_{ij}(X_k(\theta), X_{k'}(\theta'); C) = H(C) - H(C \mid X_k(\theta), X_{k'}(\theta')),\quad X_k, X_{k'} \in S_{ij} \qquad (11)$$

$$I^2_{ij}(X_k(\theta), X_{k'}(\theta'); C) = H(C) - H(C \mid X_k(\theta), X_{k'}(\theta')),\quad X_k \in S_{ij},\ X_{k'} \in S_{ij'},\ j \neq j' \qquad (12)$$

where $H(C) = -\sum_{x \in C} p(x) \log p(x)$ and $H(C \mid X_k(\theta_k)) = -\sum_{x \in C} p(x, X_k(\theta_k)) \log p(x \mid X_k(\theta_k))$. The threshold $\theta_k$ can be obtained by $\theta_k = \arg\max_{\theta} \big(H(C) - H(C \mid X_k(\theta))\big)$.
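The mutual-information quantity $H(C) - H(C \mid X_k)$ used in Eqs. (11)–(12) can be computed from a joint count table; the following is an illustrative sketch (our own helper, expressed in bits):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(X; C) = H(C) - H(C | X) from a joint table joint[x, c]."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    pc = joint.sum(axis=0)                  # marginal over categories
    px = joint.sum(axis=1)                  # marginal over the Boolean variable
    h_c_given_x = sum(px[i] * entropy(joint[i] / px[i])
                      for i in range(len(px)) if px[i] > 0)
    return entropy(pc) - h_c_given_x
```

An independent joint table gives zero information, while a perfectly predictive one recovers the full class entropy.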

Fig. 4 shows the results of the information on the four typical regions in twenty categories. The normalized curves in Figs. 4a and b represent inner-class and extra-class entropies, which are complementary. From these two figures, flat and shading regions have higher extra-class and lower intra-class entropies, with more generic shared features. On the contrary, texture regions are more specific, and the entropies in cluttered regions vary dramatically.

Fig. 4. (a) Curves of inner-class entropies. (b) Curves of extra-class entropies.

Considering shared features (Torralba et al., 2004) in different categories, generic ones have a more compact representation and specific ones yield a more distinct classification. Therefore, two kinds of weights, namely $w_1$ for specificity and $w_2$ for generality, are introduced for this dilemma and for weight initialization in boosting.

1. Since texture regions usually correspond to specific shared features, we use Eq. (13) for $w_1$. A higher value of $w_1$ indicates a stronger specific description, and such a patch is rarely shared.

$$w_1(X_k^j) = a \sum_{X_{k'} \in S_{\mathrm{III}j}} I^1_{\mathrm{III}j}(X_k, X_{k'}; C) - b \sum_{X_{k'} \in S_{\mathrm{III}j'},\ j' \neq j} I^2_{\mathrm{III}j}(X_k, X_{k'}; C) \qquad (13)$$

where

$$a = \sum_{X_k, X_{k'} \in S_{\mathrm{III}j}} I^1_{\mathrm{III}j}(X_k(\theta), X_{k'}(\theta'); C) \Big/ \sum_{X_k \in S_{\mathrm{III}j},\ X_{k'} \in S_{\mathrm{III}j'},\ j' \neq j} I^2_{\mathrm{III}j}(X_k(\theta), X_{k'}(\theta'); C), \qquad b = 2/a


We set $a$ to increase the intra-class information and $b$ to decrease the extra-class information.

2. The weight $w_2$ for generic shared features from flat and shading regions can be computed by Eq. (14); a higher value of $w_2$ results in a stronger general description, and such a patch is usually shared.

$$w_2(X_k^j) = a \sum_{X_{k'} \in S_{ij'},\ j' \neq j} I^2_{ij}(X_k, X_{k'}; C) - b \sum_{X_{k'} \in S_{ij}} I^1_{ij}(X_k, X_{k'}; C),\quad i \in \{\mathrm{I}, \mathrm{IV}\} \qquad (14)$$

where

$$a = \sum_{X_k \in S_{ij},\ X_{k'} \in S_{ij'},\ j' \neq j} I^2_{ij}(X_k(\theta), X_{k'}(\theta'); C) \Big/ \sum_{X_k, X_{k'} \in S_{ij}} I^1_{ij}(X_k(\theta), X_{k'}(\theta'); C)$$

in order to increase the extra-class information, and $b = 2/a$ in order to decrease the intra-class information.

3. In cluttered regions, both weights need to be computed separately due to uncertain changes in the information entropy.

4. Rarely, one patch with a high probability in more than one typical region can be treated by the feature sharing code discussed in Section 4.2 for the choice of $w_1$ or $w_2$.
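The two weights in Eqs. (13)–(14) can be sketched in simplified form, assuming the pairwise intra-class and extra-class information values have already been collected into flat arrays (this collapses the set-indexed sums of the paper into plain sums; names are ours):

```python
import numpy as np

def specificity_weight(I_intra, I_extra):
    """Simplified Eq. (13): reward intra-class information, penalise sharing."""
    a = np.sum(I_intra) / np.sum(I_extra)
    b = 2.0 / a
    return float(a * np.sum(I_intra) - b * np.sum(I_extra))

def generality_weight(I_intra, I_extra):
    """Simplified Eq. (14): reward extra-class information instead."""
    a = np.sum(I_extra) / np.sum(I_intra)
    b = 2.0 / a
    return float(a * np.sum(I_extra) - b * np.sum(I_intra))
```

A patch whose intra-class information dominates gets a large $w_1$ and a small (here negative) $w_2$, matching the intended roles of the two weights.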

4. Layer joint boosting in machine learning

Our layer joint boosting method is proposed in this section, based on basic boosting for multi-class object recognition. Layer properties are presented in sharing-code maps, and heuristic search strategies are provided to keep the Hamming distances between the codes in each row of the ECOC matrix.

4.1. Boosting for multi-class object detection

Boosting is well known for dealing with classification in machine learning by combining several weak classifiers into a final decision. In each boosting round, the classifier with the least error is selected and the weights on samples are updated iteratively. Joint boosting with shared features is widely applied to detect multiple objects with less training. However, several constraints remain in the recognition performance.

1. An exhaustive search needs to be performed in the whole candidate space in each round with $O(2^C)$ complexity, where $C$ is the number of categories. This will lead to an exponential explosion in running time with the increase of $C$.

2. Although the selected features provide the least-error classification, the maximum Hamming distance in each category cannot be guaranteed.

3. Boosting with shared features can possibly choose a weak classifier with the same classification effect, which causes high redundancy.

From the above viewpoints, we will discuss ECOC matrix properties and propose an improved joint boosting method.

4.2. Layer analysis by ECOC matrix

An ECOC matrix $\mu$, which is the output of boosting, contains $C \times K$ elements with 0/+1 values, where $C$ is the number of categories and $K$ is the number of selected features. Let $F_{i*}$ represent the vector in the $i$th row and $F_{*j}$ denote the sharing feature code in the $j$th column, which can be obtained from its appearance in categories with the normalized cross-correlation mentioned in Section 3.2. The Hamming distance between categories $i$ and $j$ is defined by Eq. (15):

$$D_{ij} = \lVert F_{i*} - F_{j*} \rVert^2 = \sum_{k=1}^{K} (f_{ik} - f_{jk})^2 \qquad (15)$$

where $f_{ik}$ is the value of the $i$th category on the $k$th feature. An empty matrix is initialized before boosting, and each shared feature code $f_{*k}$ fills in a column entry in the $k$th round, which satisfies the following two properties (Oriol et al., 2006).

Property 1. The elements in $f_{*k}$ are neither all ones nor all zeros.

If the elements in $f_{*k}$ are all zeros or all ones, the corresponding linear classifier has no classification effect.

Property 2. Any $F_{*k'}$ ($k' < k$) can neither be the same as nor complementary with $F_{*k}$.

If $F_{*k'}$ and $F_{*k}$ are the same, the outputs of an ideal two-value function would be equal, and so feature $k$ would have a redundant effect for error correction. Similarly, if $F_{*k'}$ and $F_{*k}$ are complementary, their results are just opposite to each other and would also have no impact on classification.
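Properties 1–2 and the row distance of Eq. (15) amount to simple checks on candidate ECOC columns; a minimal sketch (function names ours):

```python
import numpy as np

def valid_new_column(F_new, F_selected):
    """Properties 1-2: not constant, and neither equal to nor complementary
    with any previously selected column."""
    F_new = np.asarray(F_new)
    if F_new.min() == F_new.max():          # all zeros or all ones (Property 1)
        return False
    for F_old in F_selected:                # Property 2
        F_old = np.asarray(F_old)
        if np.array_equal(F_new, F_old) or np.array_equal(F_new, 1 - F_old):
            return False
    return True

def row_distance(Fi, Fj):
    """Eq. (15): squared Hamming distance between two category rows."""
    return int(np.sum((np.asarray(Fi) - np.asarray(Fj)) ** 2))
```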

With these two properties, a sharing-code map needs to be introduced for the layer characteristic. Given a number of categories $C$, there are $2^C$ nodes, as shown in Fig. 5a, and Fig. 5b shows the entire map when $C = 4$. Each $F_{*k} = (b_1 b_2 \ldots b_C)$ is an element in the sharing-code set, where $b_C$ is one bit in the $C$th category. Nodes in the $l$th layer show that $l$ categories share features. Apparently, the number of layers is $C$ and the number of nodes in the $l$th layer is $\binom{C}{l}$. The 0th and $C$th layers are excluded due to Property 1, and only $2^{C-1} - 1$ sharing nodes are considered based on Property 2. An edge $E_{ij} = \langle i, j\rangle$ in the sharing-code map indicates that the Hamming distance between nodes $i$ and $j$ is 1.

When selecting k features in a single round, a layer-related heuristic search is used by Theorem 1 as follows.


Fig. 5. (a) A generic sharing-code map. (b) An example with C = 4.


Theorem 1. The ordinal number of the sharing-node layer satisfying the maximum Hamming distance, with a given $C$ ($C > 1$), is $\lfloor C/2 \rfloor$.

Proof 1. The bit exchangeability of each node in the same layer is supposed to be order-independent in Hamming distance. If $C$ is even, there are $C/2$ different Hamming distances and the sharing nodes are located in layers from the 1st to the $(C/2)$th. The Hamming distance $D_i$ of nodes in the $i$th layer can be computed by

$$D_i = (C - i) \cdot i = Ci - i^2$$

$D_i$ reaches the maximum value $C^2/4$ when $i = C/2$. The same result can be obtained by a similar process when $C$ is odd. We conclude that when $i = \lfloor C/2 \rfloor$, the Hamming distance reaches its maximum. □

In Theorem 1, the sharing nodes in the $\lfloor C/2 \rfloor$th layer are considered only to keep the Hamming distance in categories in a single round. We then define the average sharing node after $k$ rounds in Eq. (16):

$$\bar{e}_j = \begin{cases} 1 & \text{if } \sum_{i=1}^{k} e_j^i / k \geq 0.5 \\ 0 & \text{if } \sum_{i=1}^{k} e_j^i / k < 0.5 \end{cases} \qquad j = 1, 2, \ldots, C \qquad (16)$$

where $e^k = (e_1^k e_2^k \ldots e_C^k)$ represents the selected sharing node in the $k$th round. In the $(k+1)$th round, we get a heuristic search in sharing features by Theorem 2.

Theorem 2. If the ordinal number of the average sharing node is $l$ after $k$ rounds, the selected features in the $(k+1)$th round satisfying the maximum Hamming distance in $C$ categories are located in the $(l + \lfloor l/2 \rfloor - \lfloor (C - l)/2 \rfloor)$th layer.

Proof 2. When $k = 1$, the selected feature is located in the $l_1$th layer. The numbers of appearances of ‘1’ and ‘0’ are $l_1$ and $C - l_1$, respectively, in the corresponding binary code. Assuming that there are $n_1$ bits with ‘0’ and $n_2$ bits with ‘1’ in the next round, we can conclude that the numbers of appearances of ‘00’, ‘01’, ‘10’ and ‘11’ are $n_1$, $l_1 - n_1$, $C - l_1 - n_2$ and $n_2$, respectively. Then the Hamming distance $D$ can be obtained by

$$\begin{aligned} D &= n_1(l_1 - n_1 + C - l_1 - n_2 + 2n_2) + (l_1 - n_1)(n_1 + 2C - 2l_1 - 2n_2 + n_2) \\ &\quad + (C - l_1 - n_2)(n_1 + 2l_1 - 2n_1 + n_2) + n_2(2n_1 + l_1 - n_1 + C - l_1 - n_2) \\ &= -2n_1^2 + (4l_1 - 2C)n_1 - 2n_2^2 + (2C - 4l_1)n_2 + 4n_1n_2 + 4l_1(C - l_1) \end{aligned}$$

Setting $\partial D/\partial n_1 = 0$ and $\partial D/\partial n_2 = 0$, we get

$$-4n_1 + 4l_1 - 2C + 4n_2 = 0$$

The Hamming distance is maximal when $n_1 = \lfloor l_1/2 \rfloor$ and $n_2 = \lfloor (C - l_1)/2 \rfloor$; in this case, the sharing code has $(l_1 + \lfloor (C - l_1)/2 \rfloor - \lfloor l_1/2 \rfloor)$ bits with ‘1’, which is located in the $(l + \lfloor l/2 \rfloor - \lfloor (C - l)/2 \rfloor)$th layer. When $k > 1$, we first get the average sharing code by Eq. (16) and then prove Theorem 2 by the same process. □

We consider shared features in several layers only, where the Hamming distance is large, and we can reduce the search space from a complexity of $O(2^C)$ to $O(C)$ by Theorem 2, which addresses the first and second constraints in Section 4.1 for multi-object detection and categorization.
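The quadratic maximised in the proof of Theorem 2 can be checked by brute force; the snippet below (our own verification, not from the paper) confirms that $n_1 = \lfloor l_1/2 \rfloor$, $n_2 = \lfloor (C - l_1)/2 \rfloor$ attains the maximum for small $C$:

```python
def pair_distance(C, l1, n1, n2):
    """The quadratic from the proof of Theorem 2 (total pairwise Hamming distance)."""
    return (-2 * n1 ** 2 + (4 * l1 - 2 * C) * n1
            - 2 * n2 ** 2 + (2 * C - 4 * l1) * n2
            + 4 * n1 * n2 + 4 * l1 * (C - l1))

# Brute-force check of the maximiser n1 = floor(l1/2), n2 = floor((C - l1)/2).
for C in range(2, 9):
    for l1 in range(1, C):
        best = max(pair_distance(C, l1, n1, n2)
                   for n1 in range(l1 + 1) for n2 in range(C - l1 + 1))
        assert best == pair_distance(C, l1, l1 // 2, (C - l1) // 2)
```

Note that the quadratic depends on $n_1$ and $n_2$ only through their difference, so the maximum is attained on a whole ridge; the floor-based choice is one point on it.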

4.3. Layer joint boosting

With the above discussion on heuristic search, we propose a layer joint boosting method in this subsection. In each round, an inner-layer search can be performed instead of an exhaustive search, following the maximum Hamming distance principle. We take the following points in our search process:

1. The ordinal number of the initial search layer, $l$, is $\lfloor C/2 \rfloor$, and the feature code cannot be the same as or complementary with the selected features.

2. The average sharing code $\bar{e} = (e_1 e_2 \ldots e_C)$ after $m$ rounds can be computed through Eq. (16) to guide the layer search into the $(l' + \lfloor l'/2 \rfloor - \lfloor (C - l')/2 \rfloor)$th layer, where $l'$ is the number of bits with ‘1’. Additionally, an extension of the search scope to $[l - 2, l + 2]$ is taken to avoid layer looping.

3. In each round, weight factors $D_{ie}/C$ can be introduced, where $D_{ie}$ represents the Hamming distance between sharing feature nodes $e_i$ and $e$. We increase the weights of features in case of a long Hamming distance to reinforce training.
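The first two search-strategy points can be sketched together, combining Eq. (16) with the layer update of Theorem 2 and the $[l - 2, l + 2]$ widening (helper names and the clipping to valid layers $1..C-1$ are ours):

```python
def average_code(selected):
    """Eq. (16): bitwise majority over the sharing codes chosen so far."""
    k = len(selected)
    C = len(selected[0])
    return [1 if sum(e[j] for e in selected) / k >= 0.5 else 0
            for j in range(C)]

def next_search_layers(selected, C):
    """Target layer from Theorem 2, widened to [l - 2, l + 2] (clipped to the
    valid layers 1..C-1) to avoid layer looping."""
    lp = sum(average_code(selected))        # number of '1' bits in the average code
    l = lp + lp // 2 - (C - lp) // 2
    return list(range(max(1, l - 2), min(C - 1, l + 2) + 1))
```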


The initialization of weights can be performed by learning from the regional statistical models mentioned in Section 3. $w_i^{c1}$ and $w_i^{c2}$ represent the weights of specific and general features: $w_i^{c1}$ are the weights of nodes in the $[1, C/2]$ layers, and $w_i^{c2}$ are the weights of nodes in the $(C/2, C]$ layers. Different weights can be computed in order to describe feature information and get a better performance.

5. Our proposed algorithm

We combine the feature extraction in Section 3 and the layer joint boosting in Section 4.3 to design a new method for generic object recognition. The description of our method is given below.

Feature extraction

(1) Compute the probabilities of the four typical models, given detected patches with SIFT detectors, by Eqs. (1)–(5), and choose the patches with the highest probabilities in Eq. (6) as the candidates.

(2) Compute the weights $w_1(X_i^c)$ for specificity in the given texture and some cluttered regions by Eq. (13), and compute the weights $w_2(X_i^c)$ for generality in shading, flat and other cluttered regions by Eq. (14).

(3) Apply normalized cross-correlation to calculate sharing codes for extracting features.

Fig. 6. Cropped examples of twenty categories.

Learning

Define $N$ training samples $X = \{x_1, x_2, \ldots, x_N\}$ and a label vector $Z = \{z_1^c, z_2^c, \ldots, z_N^c\}$ with $z_i^c \in \{0, +1\}$, where $c = 1, \ldots, C$, $C$ denotes the number of categories, and the dimensionality of the feature vectors $V = \{v_1, v_2, \ldots, v_K\}$ is $K$. We can get a linear classifier $h(v_i, c) = a\,\delta(v_i^f > \theta) + b$ for each feature $f$ of sample $x_i$, where $\delta$ is the indicator function, $\theta$ is the threshold, and $a$ and $b$ are regression parameters. The learning algorithm is given as follows.

(1) Initialize the weights $w_i^{c1} = w_1(X_i^c)$ and $w_i^{c2} = w_2(X_i^c)$ of the $i$th sample and the $c$th category, and set the strong classifier $H(v_i, c) = 0$, where $i = 1, \ldots, N$ and $c = 1, \ldots, C$;

(2) Initialize the number of the search layer as $l = \lfloor C/2 \rfloor$, and set the average sharing code $\bar{e}$ randomly;

(3) Repeat for $m = 1, 2, \ldots, M$ ($M$ is the number of rounds):
Repeat for $n = 1, 2, \ldots, \binom{C}{l-1} + \binom{C}{l} + \binom{C}{l+1}$ (candidates not the same as or complementary with the selected features):
(i) If $l < \lfloor C/2 \rfloor$, $w_i^c = w_i^{c1}$; otherwise $w_i^c = w_i^{c2}$. Search for a proper threshold $\theta$ to get $a$ and $b$, and fit the shared stump: if $c$ belongs to the sharing feature set $S$, $h_m(v_i, c) = a\,\delta(v_i^f > \theta) + b$, where


Fig. 7. Some examples for correct detection.


$$b = \frac{\sum_{c \in S} \sum_i w_i^c z_i^c\, \delta(v_i^f \leq \theta)}{\sum_{c \in S} \sum_i w_i^c\, \delta(v_i^f \leq \theta)}, \qquad a + b = \frac{\sum_{c \in S} \sum_i w_i^c z_i^c\, \delta(v_i^f > \theta)}{\sum_{c \in S} \sum_i w_i^c\, \delta(v_i^f > \theta)};$$

otherwise $h_m(v, c) = \sum_i w_i^c z_i^c \big/ \sum_i w_i^c$, for $c \notin S$.

(ii) Evaluate the errors $J(n) = \sum_{c=1}^{C} \sum_{i=1}^{N} w_i^c (z_i^c - h_m(v_i, c))^2$ and find the best subset $n^* = \arg\min_n J(n)$ for one weak classifier $h_m(v, c)$;

(iii) Compute the average sharing code $\bar{e}$ by Eq. (16) and determine the number of the search layer in the sharing-code map, $l = l' + \lfloor l'/2 \rfloor - \lfloor (C - l')/2 \rfloor$, where $l'$ is the number of bits with ‘1’ in $\bar{e}$;

(4) Combine the weak classifier into the strong classifier: $H(v_i, c) = H(v_i, c) + h_m(v_i, c)$;

(5) Update the weights: $w_i^c = D_{ie}\, w_i^c\, e^{-z_i^c h_m(v_i, c)} / C$, where $D_{ie}$ is the Hamming distance between the selected feature code and $\bar{e}$;

(6) Get a strong classifier for object detection and an ECOC matrix for object categorization.
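The core of step (3)(i)–(ii), the shared regression stump, can be sketched as follows; this is an illustrative simplification (array shapes, epsilon guards and function names are ours, and the threshold search over $\theta$ is assumed to happen outside):

```python
import numpy as np

def fit_shared_stump(v, z, w, share, theta):
    """Fit the regression stump of step (3)(i) for one feature and one
    candidate sharing set `share` (boolean mask over the C classes).
    v: (N,) feature responses; z: (C, N) 0/+1 labels; w: (C, N) weights."""
    above = v > theta
    ws, zs = w[share], z[share]
    b = (ws * zs * ~above).sum() / ((ws * ~above).sum() + 1e-12)
    a_plus_b = (ws * zs * above).sum() / ((ws * above).sum() + 1e-12)
    # Classes outside the sharing set get a constant prior response k_c.
    k_c = (w * z).sum(axis=1) / (w.sum(axis=1) + 1e-12)

    def h(c, x):
        if share[c]:
            return a_plus_b if x > theta else b
        return k_c[c]
    return h

def weighted_error(h, v, z, w):
    """Step (3)(ii): J = sum_c sum_i w_i^c (z_i^c - h(v_i, c))^2."""
    C, N = z.shape
    return sum(w[c, i] * (z[c, i] - h(c, v[i])) ** 2
               for c in range(C) for i in range(N))
```

The full training loop would evaluate `weighted_error` for every candidate sharing node in the searched layers and keep the minimizer, as in step (3).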

Fig. 8. RPC curves.

6. Experiments

The above proposed method is applied to generic object detection and categorization on the MIT-CSAIL image base, where each category object is view-variant under different illumination or clutter conditions, and, especially, objects in the same category are different instances of the category. Since the research of Torralba et al. (2004), a state-of-the-art approach, also targets generic object detection with only a few training samples, we select this approach for a comparative study with our proposed method. We not only place a bounding box around objects within images as in (Torralba et al., 2004), but also recognize these objects.

Fig. 9. The ECOC matrix after twenty rounds.

There are twenty categories of objects in


the image base, including bottles, cans, frontal cars, side cars, chairs, computers, ‘DO NOT ENTER’ signs, ‘STOP’ signs, ‘ONE WAY’ signs, faces, keyboards, lights, mice, mugs, posters, pedestrians, frontal screens, side screens, speakers, and traffic lights; some cropped images of them are shown in Fig. 6.

Fig. 10. Curves for object recognition rates in twenty categories.

The initial feature pools are generated by our method for feature extraction, including binary spatial masks to show object locations. During training, we select twenty samples for each class. For performance considerations, 8000 (rather than up to 2^20) feature candidates are extracted for our comparative study. As in (Torralba et al., 2004), the feature v_f is computed by Eq. (17):

v_f = |I ⊗ S ⊗ g_f| * w_f    (17)

where I is an image, S is an image filter, g_f is a fragment, w_f is a spatial mask, ⊗ represents the normalized correlation, and * denotes the convolution operator. The following experiments are conducted step by step using the proposed algorithm in Section 5.
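A minimal sketch of how a feature response in the spirit of Eq. (17) might be computed. Plain cross-correlation stands in for the normalized correlation of the paper, and the function names are assumptions for illustration:

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid'-mode cross-correlation (a simplified stand-in for
    the paper's normalized correlation)."""
    H, W = image.shape
    h, w = kernel.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = (image[y:y+h, x:x+w] * kernel).sum()
    return out

def feature_response(I, S, g_f, w_f):
    """Sketch of Eq. (17): v_f = |I (x) S (x) g_f| * w_f.
    Correlate the image with the filter S and fragment g_f, take the
    magnitude, then convolve with the spatial mask w_f."""
    r = np.abs(correlate2d_valid(correlate2d_valid(I, S), g_f))
    k = w_f[::-1, ::-1]  # convolution = correlation with the flipped mask
    return correlate2d_valid(r, k)
```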

After 100 rounds in layer joint boosting, we get a final classifier for detection and an ECOC matrix with 100 sharing codes of selected features for categorization. In detection, we regard objects as the foreground and everything else as the background, and a strong classifier is applied for binary classification. Fig. 7 shows some correct detection examples containing the ground-truth location with solid squares and object localization with dashed squares under a proper scale assumption. The algorithm performance is evaluated using recall–precision curves (RPC). In RPC, the x-axis (recall) represents the number of objects with correct detection over the number of true objects, and the y-axis (precision) represents the number of objects with correct detection over the number of objects actually detected. To be considered a correct detection, the overlapping ratio a_0 between the estimated bounding box B and the ground-truth bounding box B_gt must exceed 0.5 according to the following criterion:

a_0 = area(B ∩ B_gt) / area(B ∪ B_gt)    (18)
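The overlap criterion of Eq. (18) is the standard intersection-over-union test and is straightforward to implement; `overlap_score` is a hypothetical name used here for illustration:

```python
def overlap_score(B, Bgt):
    """Eq. (18): a_0 = area(B ∩ Bgt) / area(B ∪ Bgt).
    Boxes are (x1, y1, x2, y2); a detection counts as correct when the
    score exceeds 0.5."""
    ix1, iy1 = max(B[0], Bgt[0]), max(B[1], Bgt[1])
    ix2, iy2 = min(B[2], Bgt[2]), min(B[3], Bgt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(B) + area(Bgt) - inter
    return inter / union if union > 0 else 0.0
```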

Fig. 8 shows RPC with the blue¹ curve representing our method and the red one for Torralba et al.'s approach. We can see that the two methods have a similar performance when the recall rate is higher than 0.05. When the recall rate is higher than 0.25, Torralba et al.'s method has a slightly higher precision rate than ours, due to the indiscernible layer properties of sharing codes in two categories (the foreground and background). Overall, recall rates from the two methods are similar to each other.

At the same time, the ECOC matrix is applied for categorization. Fig. 9 shows the matrix after twenty rounds, including both general and specific features. The final decision is made based on the minimum Euclidean distance to the row vectors of the matrix.
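The categorization decision described above (nearest ECOC codeword row under a Euclidean, or alternatively Hamming, distance) can be sketched as follows; `ecoc_decode` is an assumed name:

```python
import numpy as np

def ecoc_decode(M, code, metric="euclidean"):
    """Assign a class by the nearest row of the ECOC matrix M (C x m)
    to the observed code of weak-classifier outputs."""
    M = np.asarray(M, float)
    code = np.asarray(code, float)
    if metric == "euclidean":
        d = np.linalg.norm(M - code, axis=1)
    else:  # Hamming distance, for binary codes
        d = (M != code).sum(axis=1)
    return int(np.argmin(d))
```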

Although the selected patches may not reach the global minimum misclassification rate in each round, the final average Hamming distance of the ECOC matrix for the proposed method is 11.638, compared to 9.832 for Torralba et al.'s approach. In each round, only a small fraction of the features in the given layers are considered as candidates, rather than all possible features. The recognition rate curve of each category is shown in Fig. 10. The x-axis is the number of boosting rounds and the y-axis is the recognition rate. The blue and red curves show the proposed method and Torralba et al.'s approach, respectively. We can conclude from this figure that the recognition rate generally increases with the number of rounds, and our method outperforms Torralba et al.'s approach for the classification of all 20 categories. The running time of our method is also generally less than that of the rival approach in our experiments. With a similar detection rate (in Fig. 8), our proposed method achieves a better performance in categorization.

¹ For interpretation of color in figures, the reader is referred to the Web version of this article.
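The average pairwise Hamming distance between codeword rows, the quantity compared above (11.638 vs. 9.832), can be computed as in this short sketch (function name assumed):

```python
import numpy as np
from itertools import combinations

def avg_row_hamming(M):
    """Average pairwise Hamming distance between the row codewords of an
    ECOC matrix M (one row per category)."""
    M = np.asarray(M)
    pairs = list(combinations(range(M.shape[0]), 2))
    return sum((M[i] != M[j]).sum() for i, j in pairs) / len(pairs)
```

A larger average row separation means more weak-classifier errors can be corrected at decoding time, which is why the layer heuristic tries to keep the rows far apart.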

There are several points to be mentioned regarding false recognition:

1. some objects are possibly deeply rotated and therefore present different view-points; and

2. an extreme illumination or scale can cause a high variance on the object surface and an indistinct appearance in the sample.

We can possibly regard objects under deep viewpoints as different categories, like frontal cars, side cars and rear cars, or model the range information for parameter estimation. Other influences can possibly be formed by external conditions for image acquisition, which are out of the scope of this paper.

7. Conclusion

Based on traditional ways of feature extraction and boosting in generic object recognition, we have presented regional statistical models and an improved multi-boosting, called layer joint boosting, for generic multi-class object detection and categorization in cluttered scenes. Layer properties in sharing feature codes with different statistical types can guarantee the Hamming distance in categories, which benefits our boosting in search and further categorization. Our proposed method can improve the recognition rate consistently with a similar performance in the detection rate.

Acknowledgements

This research is partly supported by the National Natural Science Foundation of China (No. 60375011 and No. 60575028), the Natural Science Foundation of Anhui Province (No. 04042044), and China's Program for New Century Excellent Talents in Universities (NCET-04-0560). We would also like to thank the anonymous reviewers, whose constructive comments and advice have helped improve this paper.


References

Fergus, R., Perona, P., Zisserman, A., 2003. Object class recognition by unsupervised scale-invariant learning. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 264–272.

Fergus, R., Perona, P., Zisserman, A., 2004. A visual category filter for Google images. In: Proc. European Conf. on Computer Vision (ECCV), pp. 242–256.

Friedman, J., Hastie, T., Tibshirani, R., 2000. Additive logistic regression: A statistical view of boosting. Ann. Statist. 28 (2), 337–374.

Goshtasby, A., Gage, S.H., Bartholic, J.F., 1984. A two-stage cross-correlation approach to template matching. IEEE Trans. Pattern Anal. Machine Intell. 6 (3), 374–378.

Kadir, T., Zisserman, A., Brady, M., 2004. An affine invariant salient region detector. In: Proc. European Conf. on Computer Vision (ECCV), pp. 345–457.

Li, F.F., Perona, P., 2005. A Bayesian hierarchical model for learning natural scene categories. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 524–531.

Lowe, D.G., 1999. Object recognition from local scale-invariant features. In: Proc. Internat. Conf. on Computer Vision (ICCV), pp. 1150–1157.

Mikolajczyk, K., Schmid, C., 2001. Indexing based on scale invariant interest points. In: Proc. Internat. Conf. on Computer Vision (ICCV), pp. 525–531.

Mikolajczyk, K., Schmid, C., 2002. An affine invariant interest point detector. In: Proc. European Conf. on Computer Vision (ECCV), pp. 128–142.

Mikolajczyk, K., Schmid, C., 2003. A performance evaluation of local descriptors. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 257–263.

Navalpakkam, V., Itti, L., 2006. An integrated model of top-down and bottom-up attention for optimal object detection. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2049–2056.

Opelt, A., Fussenegger, M., Pinz, A., Auer, P., 2004. Weak hypotheses and boosting for generic object detection and recognition. In: Proc. European Conf. on Computer Vision (ECCV), vol. 2, pp. 71–84.

Pujol, O., Radeva, P., Vitrià, J., 2006. Discriminant ECOC: A heuristic method for application dependent design of error correcting output codes. IEEE Trans. Pattern Anal. Machine Intell. 28 (6), 1007–1012.

Schapire, R., Singer, Y., 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning 39 (2/3), 135–168.

Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T., 2005. Discovering objects and their location in images. In: Proc. Internat. Conf. on Computer Vision (ICCV), vol. 1, pp. 370–377.

Torralba, A., Murphy, K.P., Freeman, W.T., 2004. Sharing features: Efficient boosting procedures for multiclass object detection. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 762–769.

Viola, P., Jones, M., 2001. Rapid object detection using a boosted cascade of simple features. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 511–518.

Xie, Z., Gao, J., 2007. Object localization based on visual probabilistic models. Chinese J. Image and Graphics 12 (7), 1234–1243.

Zhu, S.C., 2003. Statistical modeling and conceptualization of visual patterns. IEEE Trans. Pattern Anal. Machine Intell. 25 (6), 1–22.

Zhu, S.C., Wu, Y.N., Mumford, D., 1998. Filters, random fields and maximum entropy: Towards a unified theory for texture modeling. Internat. J. Computer Vision 27 (2), 107–126.