

Pattern Recognition 60 (2016) 793–801






    Object detection using boosted local binaries

Haoyu Ren, Ze-Nian Li
Vision and Media Lab, School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada V5A 1S6

Article info

Article history:
Received 23 October 2014
Received in revised form 4 July 2016
Accepted 4 July 2016
Available online 5 July 2016

Keywords: Binary descriptor; Boosted Local Binary; Object detection; RealAdaBoost; Structure-aware

http://dx.doi.org/10.1016/j.patcog.2016.07.010
0031-3203/© 2016 Elsevier Ltd. All rights reserved.

Corresponding author. E-mail addresses: [email protected] (H. Ren), [email protected] (Z.-N. Li).

Abstract

This paper presents a novel binary descriptor, Boosted Local Binary (BLB), for object detection. The proposed descriptor encodes variable local neighbour regions in different scales and locations. Each region pair of the proposed descriptor is selected by the RealAdaBoost algorithm with a penalty term on the structural diversity. As a result, confident features that are good at describing specific characteristics will be chosen. Moreover, the encoding scheme is applied in the gradient domain in addition to the intensity domain, which is complementary to standard binary descriptors. The proposed method was tested using three benchmark object detection datasets: the CalTech pedestrian dataset, the FDDB face dataset, and the PASCAL VOC 2007 dataset. Experimental results demonstrate that the detection accuracy of the proposed BLB clearly outperforms traditional binary descriptors. It also achieves comparable performance with some state-of-the-art algorithms.

© 2016 Elsevier Ltd. All rights reserved.

    1. Introduction

Object detection is one of the most important tasks in computer vision. It is widely used in human–computer interaction, multimedia applications, and medical imaging. The task is relatively difficult because numerous factors affect the performance, such as illumination variation, occlusion, and background clutter. All these factors increase the difficulty of distinguishing the target object from the surrounding background.

To solve this issue, some researchers focus on designing effective descriptors, e.g., Haar [1], Histogram of Oriented Gradient (HOG) [2], the covariance matrix [3], or their combinations, e.g., heterogeneous features [4], HOG-LBP [5], and feature fusion [6]. Others utilize more powerful machine learning algorithms, e.g., latent SVM [7], multiple instance learning [8], and Hough forests [9]. Among these algorithms, the cascade boosted structure proposed by Viola and Jones has shown its efficiency and effectiveness on object categories such as faces [1], pedestrians [10], and cars [11]. In most cases, the performance of the boosted classifier mainly depends on the features. To address the accuracy and efficiency issues simultaneously, boosting with appropriate features to construct the cascade classifier is the key step.

In consideration of efficiency, binary descriptors are among the most commonly used descriptors in object detection. The Local Binary Pattern (LBP) [12] is a local descriptor based on binary coding of adjacent pixel pairs, which is widely used for many pattern recognition tasks. Despite its simplicity, a number of LBP modifications and extensions have been proposed. Some of them work on the post-processing steps [13] or the co-occurrence structure [14], which improve the discriminative ability of binary coding. Others focus on the definition of the locations where the gray value measurement is taken [15,16]. Unfortunately, these descriptors have drawbacks when employed to encode general object appearance. A notable disadvantage is insufficient discriminative ability. Most traditional binary descriptors depend on the intensities of particular locations, so they are easily influenced by illumination, occlusion, and noise. In addition, although the size of some binary descriptors is flexible, the patterns of the local pixels and adjacent rectangles are still fixed. They might not have sufficient ability to describe the objects in some complicated detection tasks, e.g., pedestrians and multi-view cars.

This paper is an extension of our previous work [17] with the following contributions. First, we introduce the Boosted Local Binary (BLB) descriptor, where the variable local region pairs are selected by the RealAdaBoost algorithm considering both the discriminative ability and the feature structural diversity. In addition, we show that using the gradient image and the intensity image together for binary coding is more effective than using the intensity image alone. As a result, the proposed BLB descriptor is more discriminative and robust compared to commonly used binary descriptors such as Haar, LBP, and LAB. We evaluate the performance by employing it on three commonly used datasets: the FDDB face dataset, the CalTech pedestrian dataset, and the PASCAL VOC 2007 dataset. Experimental results show that BLB has superior performance in comparison with traditional binary descriptors.



The rest of the paper is organized as follows: related work is introduced in Section 2. Section 3 gives the details of the proposed BLB. The structure-aware RealAdaBoost for BLB is described in Section 4. The next section presents the experimental results. Conclusions are given in Section 6.

    Fig. 1. Traditional LBP descriptor.

    2. Related work

There have been a wide variety of approaches developed for object detection. Most of them focus on designing more discriminative local descriptors and using appropriate machine learning methods.

There are many local features and descriptors proposed for various detection tasks. Most of them reflect the characteristics of some pre-defined local patterns. Template descriptors based on intensity values are widely used for object detection [1,12,16,18–25]. Among these template features, the Local Binary Pattern (LBP) is widely used. After the pioneering work [12], Tan and Triggs [18] propose a local ternary pattern (LTP) for robust face recognition. Yan et al. [16] utilize rectangles instead of single pixels to generate the Local Assembled Binary (LAB) for face detection. Guo et al. [19] propose a weighted LBP method, where the variance that characterizes the local contrast information is used to weight the one-dimensional LBP histogram. Vu and Caplier propose oriented edge patterns [20] and patterns of dominant orientations [22] for face recognition. Guo and Zhang [21] propose a completed LBP (CLBP) feature to incorporate the sign and magnitude information into the final descriptor. Ahonen et al. [24] propose the LBP Fourier histogram (LBPHF) to achieve rotation invariance. Zhao et al. [25] propose an extension of LBPHF by combining the sign and magnitude information. Although these LBP variants improve the accuracy considerably compared to traditional LBP, they do not have good generalization power due to the artificially designed local patterns. For general object detection tasks such as the PASCAL VOC challenge, where the object appearance varies greatly against complex backgrounds, these features will not work well.

Besides binary descriptors, more complicated features and descriptors have been utilized, such as SIFT [26], HOG [2], and the covariance matrix [3]. Lowe [26] first proposes the scale-invariant feature transform (SIFT). Several variants have been designed in recent years for object detection and recognition [27–30]. Dalal and Triggs [2] propose the basic form of the HOG descriptor with 2×2 cells. Multi-size versions are developed in [10,31,32], and further extended to a pyramid structure [33–35]. Tuzel et al. [3] utilize the covariance matrix projected on Riemannian manifolds for detection. A heterogeneous version based on the covariance matrix is further proposed in [36]. Sometimes these features are combined with each other to increase the discriminative power. For instance, Levi and Silberstein [37] utilize an accelerated version of the feature synthesis method applied on multiple object parts respectively. Bar-Hillel et al. [38] design an iterative process including feature generation and pruning using multiple operators for part localization. Chen et al. [39] propose the Multi-Order Contextual co-Occurrence (MOCO) to implicitly model high-level context using solely the detection responses of an object detector based on the combination of HOG and LBP. Using complicated features clearly improves the accuracy compared to binary descriptors, but the efficiency is reduced substantially. Most of these features cannot satisfy the requirements of real-time object detection systems. As a result, the development of effective binary descriptors is necessary.

The boosting framework is widely used in training cascade classifiers for fast object detection. Zhu et al. [10] apply a linear SVM with the HOG descriptor as the weak classifier to build a cascade detector. This procedure is revisited through properly designing the feature pooling, feature selection, preprocessing, and training methods using a single rigid component [40]. Wu and Nevatia [41] propose the cluster boosted tree method, in which the sample space is divided by unsupervised clustering based on discriminative image features selected by a boosting algorithm. Tu [42] develops the probabilistic boosting-tree, where each node combines a number of weak classifiers (evidence, knowledge) into a strong classifier (a conditional posterior probability).

    3. Boosted binary patterns

    3.1. Traditional binary descriptors

The traditional LBP was developed for texture classification, and its success is due to its robustness under illumination variations, computational simplicity, and discriminative power on specific patterns. Fig. 1 shows an example of the traditional LBP, which is a binary coding of the intensity contrast between the center pixel and its 8 neighbouring pixels. If the intensity of a neighbouring pixel is higher than that of the center one, the corresponding bit is assigned 1; otherwise it is assigned 0. Given a center pixel, the LBP feature response is defined by

$$\mathrm{LBP}_{d,r} = \sum_{i=1}^{d} \operatorname{sign}(I_i - I_{\mathrm{center}}) \times 2^{i-1}, \qquad (1)$$

where d is the number of neighbouring pixels, r is the distance between the neighbouring pixels and the center pixel, I is the intensity, and

$$\operatorname{sign}(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0. \end{cases}$$

Different from LBP, which reflects the intensity pattern of pixel pairs, LAB [16] utilizes rectangles instead, as shown in Fig. 2. LAB combines 8 locally adjacent 2-rectangle binary Haar features of the same size. These Haar features share a common center rectangle. LAB's encoding scheme is similar to LBP's: if the intensity sum of an adjacent rectangle is higher than that of the center one, the corresponding bit is assigned 1; otherwise it is assigned 0. Given a center rectangle C_0, the LAB feature response is

$$\mathrm{LAB} = \sum_{i=1}^{8} \operatorname{sign}(I_{C_i} - I_{C_0}) \times 2^{i-1}, \qquad (2)$$

where C_i is the i-th adjacent rectangle and I_{C_i} is the intensity sum of all the pixels in C_i.

The calculation of LBP is efficient because the feature response in Eq. (1) is based on binary comparisons and bit-shift operations. Although LAB uses regions instead of pixels, the computation cost does not increase, because the intensity sum of any rectangle can be efficiently calculated using the integral image [1].
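As a sketch of this encoding, the LBP response of Eq. (1) can be computed with plain comparisons and bit shifts. The circular sampling of neighbours at radius r (with rounding to the nearest pixel) and the function name are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def lbp_response(image, cx, cy, d=8, r=1):
    """Compute the LBP_{d,r} response of Eq. (1) at center pixel (cx, cy).

    Neighbours are sampled on a circle of radius r; coordinates are
    rounded to the nearest pixel for simplicity.
    """
    center = image[cy, cx]
    code = 0
    for i in range(d):
        angle = 2.0 * np.pi * i / d
        nx = int(round(cx + r * np.cos(angle)))
        ny = int(round(cy - r * np.sin(angle)))
        # sign(I_i - I_center): the bit is 1 when the neighbour >= center
        if image[ny, nx] >= center:
            code |= 1 << i  # the weight 2^{i-1} becomes a bit shift
    return code
```

The response is an integer in [0, 2^d), so it can be used directly as a histogram bin index.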

Fig. 2. LAB descriptor.

Fig. 4. Constraint of the neighbouring pairs. (a) For BLB-8, the overlap ratio is 50%. (b) For BLB-6, the overlap ratio is 25%. (For interpretation of the references to color in this figure, the reader is referred to the web version of this paper.)

Fig. 5. Neighbouring pairs in BLB-4, BLB-6, and BLB-8. (For interpretation of the references to color in this figure, the reader is referred to the web version of this paper.)


    3.2. Boosted local binaries

Instead of using artificially designed binary patterns, we introduce the proposed Boosted Local Binaries (BLB) descriptor. A BLB-n descriptor consists of n non-overlapping surrounding rectangles C_1, C_2, …, C_n and a center rectangle C_0 of the same size. As in Eq. (3), (x_i, y_i) is the left-top corner of the rectangle C_i, and (w_i, h_i) is its width and height:

$$\mathrm{BLB}_n = \{C_0, C_1, \ldots, C_n\},\qquad C_i = \{x_i, y_i, w_i, h_i\},\quad w_i = w_j,\ h_i = h_j,\quad i, j \in \{0, 1, \ldots, n\},\ i \neq j. \qquad (3)$$

These rectangles compose n pairs {C_0, C_i}, i = 1, …, n. In our case, n can be 4, 6, or 8. Fig. 3 illustrates examples of the BLB-4, BLB-6, and BLB-8 descriptors.

In BLB, the surrounding rectangles C_1, C_2, …, C_n are numbered clockwise starting from the one at the top of the center rectangle C_0. None of these surrounding rectangles overlap with each other. Fig. 4(a) shows a BLB-8 descriptor, where the center rectangle C_0, top rectangle C_1, right-top rectangle C_2, and left-top rectangle C_8 are highlighted. In general, the surrounding rectangles C_1, C_2, …, C_8 could spread anywhere. To describe meaningful patterns and reduce the size of the feature pool, we define the neighbouring pairs illustrated as the rectangles connected by green lines in Fig. 5. There are 4 neighbouring pairs in BLB-4, 12 in BLB-6, and 12 in BLB-8. We add a constraint that the two rectangles in each neighbouring pair should overlap either on their width or their height. The overlap ratio of BLB-4, BLB-6, and BLB-8 is set to 0.5, 0.25, and 0.5 respectively. Based on this rule, C_1 in Fig. 4(a) should overlap at least 50% with C_0, C_2, and C_8, either on their width or on their height. This is illustrated by the dot-dashed lines around C_0, C_2, and C_8, which reflect the possible regions of their neighbours. As a result, the possible region for C_1 is the cyan region intersected by the corresponding dot-dashed lines of C_0, C_2, and C_8. Similarly, Fig. 4(b) illustrates a BLB-6 descriptor. C_1 is included in three neighbouring pairs {C_0, C_1}, {C_6, C_1}, and {C_2, C_1}. Then the possible region for C_1 is the green region intersected by the dot-dashed lines.

Fig. 3. BLB descriptors. The black rectangle is the center rectangle, and the white ones are the surrounding rectangles.

If the surrounding rectangles lie far away from the center one, the feature response is easily influenced by noise, so the classification confidence of the descriptor is lower. We define the structural diversity as the average ratio of the distance between each surrounding rectangle and the center rectangle to the rectangle size:

$$D = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,\mathrm{dist}(C_i, C_0)}{w_i + h_i}, \qquad (4)$$

where dist(x, y) is the Euclidean distance between x and y. If the surrounding rectangles are far away from the center one, D will be larger. As shown in Fig. 6, the surrounding regions of the left BLB-6 descriptor are closer to the center one, so its diversity is smaller. In contrast, the surrounding rectangles of the right BLB-8 descriptor spread farther, so its diversity is relatively larger.
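A minimal sketch of Eq. (4) follows, assuming dist(C_i, C_0) is measured between the rectangle centers (the reference points are not pinned down above, so this is our assumption); the (x, y, w, h) tuple layout and the function name are also ours:

```python
import math

def structural_diversity(center, surroundings):
    """Structural diversity D of Eq. (4).

    Each rectangle is (x, y, w, h) with (x, y) the left-top corner;
    all rectangles of a BLB descriptor share the same width and height.
    """
    x0, y0, w0, h0 = center
    c0 = (x0 + w0 / 2.0, y0 + h0 / 2.0)
    total = 0.0
    for (xi, yi, wi, hi) in surroundings:
        ci = (xi + wi / 2.0, yi + hi / 2.0)
        # Euclidean distance between the two rectangle centers
        dist = math.hypot(ci[0] - c0[0], ci[1] - c0[1])
        # ratio of the center-to-center distance to the rectangle size
        total += 2.0 * dist / (wi + hi)
    return total / len(surroundings)
```

For a 4×4 center rectangle with an abutting 4×4 neighbour, D = 2·4/(4+4) = 1; rectangles placed farther out yield larger D, matching Fig. 6.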

Fig. 6. Structural diversity of the BLB descriptor. The left BLB-6 has smaller diversity, while the right BLB-8 has a larger one.

Fig. 7. Selecting BLB descriptors using RealAdaBoost.


This diversity is designed to reflect the BLB's structural confidence. It is considered as a penalty term in the RealAdaBoost procedure.

Besides applying the above patterns to the intensity image, we also utilize them on gradient images. In consideration of efficiency, the x-direction gradient image and the y-direction gradient image are generated respectively. Given a BLB-n descriptor, the final feature response is

$$f(\mathrm{BLB}_n, idx) = \sum_{i=1}^{n} \operatorname{sign}\big(g(C_i, idx) - g(C_0, idx)\big) \times 2^{i-1}, \qquad (5)$$

    where

$$g(C_i, idx) = \begin{cases} \mathrm{IntensitySum}(C_i), & idx = 0 \\ \mathrm{GradientxSum}(C_i), & idx = 1 \\ \mathrm{GradientySum}(C_i), & idx = 2. \end{cases} \qquad (6)$$

Compared to conventional LBP or LAB, the proposed BLB has clear advantages. We know that if we average several pedestrians in the intensity domain, the result will be meaningless. On the other hand, if we average them in the gradient domain, we may easily find some common characteristics. This indicates that gradient information is quite helpful in detecting some object categories. In addition, the variable binary patterns are effective in capturing specific object structures that fixed patterns cannot describe, such as the contrast of window, body, and wheel in a car. Moreover, the computation of BLB is as efficient as LBP and LAB. The sum of a region in both the intensity image and the gradient images can be easily extracted using integral images [1]. So the efficiency is similar to traditional binary descriptors.
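The response of Eq. (5) on one channel (intensity, x-gradient, or y-gradient) can be sketched with integral images, so each region sum costs four table lookups. The helper names and data layout here are our assumptions, not the authors' code:

```python
import numpy as np

def integral_image(img):
    # Summed-area table with a zero row and column prepended,
    # so region sums need no boundary checks.
    return np.pad(img.astype(np.int64), ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def region_sum(ii, x, y, w, h):
    # Sum of pixels in the rectangle [x, x+w) x [y, y+h) in O(1).
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def blb_response(ii, center, surroundings):
    """Feature response of Eq. (5) on one channel, given that
    channel's integral image `ii`; rectangles are (x, y, w, h)."""
    s0 = region_sum(ii, *center)
    code = 0
    for i, rect in enumerate(surroundings):
        if region_sum(ii, *rect) >= s0:  # sign(g(C_i) - g(C_0))
            code |= 1 << i               # weight 2^{i-1}
    return code
```

Running the same function on the intensity and gradient integral images yields the idx = 0, 1, 2 variants of Eq. (6).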

    4. RealAdaBoost with BLB

We use RealAdaBoost to select meaningful BLB descriptors to describe the target object. In RealAdaBoost, the BLB descriptor can be seen as a function from the image space to a real-valued range, f: x → [f_min, f_max]. For the binary object/background classification problem, suppose the input data is (x_1, y_1), …, (x_n, y_n), where x_i is a training sample and y_i ∈ {−1, 1} is the class label. We first divide the sample space into N_b equal-sized sub-ranges B_j, and the weak classifier is defined as a piecewise function:

$$h(\mathbf{x}) = \frac{1}{2} \ln \frac{W_+^j + \epsilon}{W_-^j + \epsilon}, \quad \mathbf{x} \in B_j, \qquad (7)$$

where ε is the smoothing factor and W_±^j are the probability distributions of the feature response for positive/negative samples, implemented as a histogram:

$$W_\pm^j = P(\mathbf{x} \in B_j, y = \pm 1), \quad j = 1, \ldots, N_b. \qquad (8)$$

    The best descriptor is selected according to the classificationerror Z of the piecewise function

$$Z = 2 \sum_j \sqrt{W_+^j W_-^j}. \qquad (9)$$

Considering the structural diversity, we add a structure-aware criterion into RealAdaBoost, which is similar to the complexity factor used for evaluating different features [43]. The updated discriminative criterion is shown in Eq. (10), where a is a constant factor, D is the structural diversity of Eq. (4), and fp is the false positive rate of the current stage:

$$Z = 2 \sum_j \sqrt{W_+^j W_-^j} + a \cdot fp \cdot D. \qquad (10)$$

Eq. (10) can be explained as follows: in the beginning stages of RealAdaBoost, the false positive rate is larger and the target object is still easy to distinguish from the background, so RealAdaBoost will prefer smaller-diversity descriptors with more confident patterns. In the later stages, when the false positive rate is smaller, the classification problem becomes more difficult, so descriptors with more diverse patterns might be utilized. This strategy makes sense because the overall performance and robustness of a cascade boosted classifier is strongly influenced by the beginning stages, which filter most of the candidate windows. Using confident features in the beginning stages contributes to the generalization power of the resulting classifier, which is important in a real object detection system.

The detailed algorithm is illustrated in Fig. 7. To learn the best feature, the most intuitive way is to search the whole feature pool, which is rather time-consuming. So we resort to a sampling method to speed up the feature selection process. Both the sampling number M and the bin number N_b influence the overall performance. Further experiments on these parameters are given in Section 5.1.
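The per-feature score of Eq. (10) can be sketched as follows. The histogram binning follows Eqs. (7)–(9); the parameter names and the linear response-to-bin mapping over [f_min, f_max] are our assumptions:

```python
import numpy as np

def structure_aware_z(responses, labels, weights, diversity, fp_rate,
                      n_bins=32, f_min=0.0, f_max=1.0, a=0.2):
    """Structure-aware selection criterion Z of Eq. (10); lower is better.

    `responses` are one feature's responses over the training samples,
    `labels` are in {-1, +1}, `weights` are the current boosting weights,
    `diversity` is D from Eq. (4), `fp_rate` the current false positive rate.
    """
    responses = np.asarray(responses, dtype=float)
    # map each response into one of N_b equal-sized sub-ranges B_j
    bins = np.clip(((responses - f_min) / (f_max - f_min) * n_bins).astype(int),
                   0, n_bins - 1)
    w_pos = np.zeros(n_bins)
    w_neg = np.zeros(n_bins)
    for b, y, w in zip(bins, labels, weights):
        if y > 0:
            w_pos[b] += w  # W_+^j of Eq. (8)
        else:
            w_neg[b] += w  # W_-^j of Eq. (8)
    # Z = 2 * sum_j sqrt(W_+^j W_-^j) + a * fp * D
    return 2.0 * np.sum(np.sqrt(w_pos * w_neg)) + a * fp_rate * diversity
```

A feature whose responses separate the classes into disjoint bins drives the first term to zero, leaving only the diversity penalty, so compact descriptors win early in training exactly as described above.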

Fig. 8. Experiments on different BLB descriptors. (a) Different numbers of surrounding rectangles and feature extraction methods. (b) Different sampled feature numbers.


    5. Experiments

In this section, we utilize the CalTech pedestrian dataset, the FDDB face dataset, and the PASCAL VOC 2007 dataset to show the effectiveness of the proposed method.

    5.1. Experiments on Caltech pedestrian dataset

We first evaluate the proposed BLB descriptor using the CalTech pedestrian dataset [44]. This dataset is one of the largest publicly available pedestrian datasets. It offers a large number of samples, consisting of approximately 10 h of 640×480, 30 Hz video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated. The individuals in this dataset appear in many positions and orientations, against varied backgrounds.

We follow the training and evaluation protocol proposed in [44]. Pedestrians at least 50 pixels tall under no or partial occlusion are evaluated. A detection window is considered correct if and only if its intersection-over-union ratio with the ground truth is at least 50%. Miss rate versus FPPI (False Positives Per Image) curves are used to evaluate the performance of our algorithms. 64×128 samples are used in all experiments on this dataset. BLB descriptors with variable block sizes from 4×4 to 20×40 are evaluated in each iteration.
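The matching rule above (a window counts as correct iff its intersection-over-union with the ground truth is at least 50%) can be sketched as; the box representation and function names are ours:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # width and height of the overlap region (zero if the boxes are disjoint)
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def is_correct(detection, ground_truth, threshold=0.5):
    # the CalTech protocol counts a window as correct iff IoU >= 50%
    return iou(detection, ground_truth) >= threshold
```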

The first experiment evaluates the performance of different BLB descriptors. 4 classifiers using different BLB feature pools are trained respectively: BLB-4, BLB-6, BLB-8, and the combination of all of them, BLB-All. In each iteration, we keep the same number of sampled features, M = 60, and set the RealAdaBoost bin number to N_b = 32. The structure-aware process is removed in order to evaluate the influence of the BLB patterns alone. First, we extract the above descriptors in both the gradient domain (gra) and the intensity domain (int). Fig. 8(a) shows the miss rate versus FPPI of these classifiers. It clearly shows that the accuracy improves as the number of BLB surrounding rectangles increases. If all BLB descriptors are integrated together, the accuracy improves by a further 2% or so. This shows that the variable BLB patterns contribute to the overall performance of the detector. The variety of the feature response increases with the number of rectangle pairs, which contributes to the description ability of the resulting detector. We also train two classifiers using only the intensity information for BLB-8 and BLB-All. It can be seen that there is a significant accuracy decrease without the gradient information. This result makes sense because in some complicated object detection tasks such as pedestrian detection, few characteristics in the intensity domain are able to reflect the difference between the pedestrian and the background. The advantage of using gradient information is clear.

The next experiment concerns the number of features sampled in each iteration. 4 classifiers are trained by setting the sampled feature number to 30, 60, 200, and 500. All BLB descriptors are utilized, and the RealAdaBoost bin number is set to 32. From Fig. 8(b) we can see a clear gap between the 30 and 60 curves. In addition, evaluating 200 or 500 features per iteration only leads to slight improvement. Compared to the increased training cost, this improvement is not worthwhile. According to [10], a random sub-sample of size N_f = log 0.05 / log 0.95 ≈ 59 will guarantee that we find one of the best 5% of features with a probability of 95%. So using 60 features per iteration ensures we obtain a detector with top-level features.
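The sub-sample size quoted from [10] follows from requiring that N uniform draws from the feature pool miss the top 5% of features with probability at most 5%; a quick check of the arithmetic:

```python
import math

# The probability that N independent uniform draws from the feature pool
# all miss the top 5% of features is 0.95**N.  Requiring this to be at
# most 5% gives N >= log(0.05) / log(0.95) ~= 58.4, i.e. 59 draws.
n_required = math.log(0.05) / math.log(0.95)
n_samples = math.ceil(n_required)

miss_prob = 0.95 ** n_samples  # chance of missing the top 5% entirely
```

With n_samples = 59, miss_prob falls just below 0.05, which is why evaluating 60 features per iteration is enough in practice.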

Then the influence of the structure-aware process is evaluated. Based on the previous experiments, 60 BLB descriptors are sampled per iteration. The RealAdaBoost bin number is fixed at 32. Fig. 9(a) gives several curves based on different structure-aware factors a. It can be seen that the structure-aware process contributes at least a 2% improvement in detection accuracy. The best setting, a = 0.2, contributes almost 5% accuracy improvement compared to the classifier without the structure-aware process. We know that in image-based object detection, the total accuracy is decided not only by the discriminative ability of the detector, but also by its generalization power. Although BLB descriptors with larger structural diversity might capture some complicated characteristics, they are also easily influenced by noise. As a result, they carry a higher risk of generating false positives, especially on a large database. Using the structure-aware process to balance them improves the total performance of the resulting classifier. This is similar to the efficiency-accuracy trade-off in some sense.

The last parameter is the RealAdaBoost bin number N_b. From Fig. 9(b) we find that using 64 bins achieves similar accuracy to using 32 bins. More bins, such as 128, do not increase the accuracy much. This is another trade-off problem. More bins means a finer division of the feature space, which leads to better discriminative power but less robustness. On the other hand, fewer bins harm the overall accuracy, as shown by the 2.4% gap between the 16-bin and 32-bin curves. Based on our experiments, 32 bins seems to be a good choice for BLB.

    In order to compare our method with traditional binary de-scriptors and other state-of-the-art algorithms, we use the optimalparameters (BLB-All, 60 features per iteration, structure-awarefactor 0.2, 32 RealAdaBoost bins) to train a final classifier. Whenthe false positive rate decreases to 10�6, 6671 BLB descriptors areutilized. 4117 of them are BLB-8, 1701 are BLB-6, while the re-maining parts are BLB-4. 2739 of these 6671 descriptors are ex-tracted on the intensity domain, 1871 are in x-gradient domain,

  • Fig. 9. Experiments on the parameters of RealAdaBoost algorithm. (a) With and without structure-aware process. (b) Different bin number.

    Fig. 10. Experiments with traditional binary features on the Caltech dataset.

    Fig. 11. FPPI evaluation on the CalTech pedestrian dataset.

    H. Ren, Z.-N. Li / Pattern Recognition 60 (2016) 793–801

    and 2061 in the y-gradient domain. The LBP and LAB descriptors are also implemented. For LBP, three kinds of LBP descriptors, LBP(8,1), LBP(8,2), and LBP(16,2), are integrated together to train a cascade detector. For the LAB descriptor, variable block sizes from 4×4 to 20×40 are used. We also evaluate 60 descriptors per iteration and set the bin number to 32 for these two descriptors. In Fig. 10, we illustrate the results of the traditional Haar feature, the LBP(8,1) feature, the combination of all LBP features, and the LAB feature. The "int" curves show the feature extracted on the intensity image only, while the "int+gra" curves indicate features extracted from both the intensity and gradient domains. It can be seen that the best group among these traditional features, LBP(int+gra), achieves a 53% miss rate, which is still 6% worse than the proposed BLB (47%). This result shows the effectiveness of our variable patterns compared to fixed patterns. In addition, using more patterns for LBP (All LBP) achieves better accuracy than a single pattern (LBP(8,1)), which indicates the advantage of variable binary patterns. We also notice that using both the intensity and gradient information is more effective than using the intensity information alone, which indicates that the gradient characteristic is important for pedestrian detection.
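For reference, the fixed pattern that LBP(8,1) contributes to this comparison can be sketched as follows (a minimal version of the standard operator, sampling a square 3×3 neighbourhood instead of interpolating the radius-1 circle, and with an assumed bit ordering):

```python
def lbp_8_1(img, x, y):
    # Classic LBP(8,1): threshold the 8 neighbours of (x, y) against the
    # centre pixel and pack the comparison results into an 8-bit code.
    c = img[y][x]
    nbrs = [img[y-1][x-1], img[y-1][x], img[y-1][x+1], img[y][x+1],
            img[y+1][x+1], img[y+1][x], img[y+1][x-1], img[y][x-1]]
    code = 0
    for bit, p in enumerate(nbrs):
        if p >= c:
            code |= 1 << bit
    return code
```

The comparison locations are fixed relative to the centre pixel, which is precisely the rigidity that BLB's variable region pairs are designed to overcome.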

    Fig. 11 illustrates the experimental results of our method compared to other approaches. Even compared to methods built on the most commonly used HOG feature (DBN-ISOL, MultiResc, DBM-Mut, MultiResc+2Ped), the performance is still comparable. Although the intensity and gradient contrast used in BLB is not as strong as the weighted oriented gradients used in HOG, this gap can be compensated by the more flexible rectangle arrangement. Using the

    structure-aware term, the accuracy is further improved, outperforming some multiple-feature methods (Multiftr+Motion, MOCO). These results show the effectiveness of our method on a large-scale database. Although the accuracy of the proposed binary feature is comparable to some traditional deep learning methods (DBN-ISOL 53%, DBM-Mut 48%), it is lower than the state-of-the-art deep features (e.g., Fast R-CNN 9.3% [57], DeepParts 11.8% [58]). The reason is that the discriminative ability of the proposed binary features is clearly lower than that of the deep features generated by CNNs.

    Moreover, we compare the convergence speed of the training process for different descriptors. Fig. 12 plots the false positive rate against the number of weak classifiers. RealAdaBoost without the structure-aware process is utilized in this experiment. It shows that whether or not the gradient information is used, BLB always converges faster, at roughly twice the rate of LAB and LBP. We also notice that using gradient information contributes to significantly faster convergence for LBP, LAB, and BLB alike. So both the variable feature patterns and the gradient information improve the discriminative ability. In addition, the performance of the boosted classifier is shown to be positively related to the convergence speed of the training process, which signifies that the proposed BLB performs better in both the training accuracy and the training speed of boosted classifiers. We test the resulting classifier on a desktop PC with a 2.8 GHz i7 dual-core CPU and 8 GB memory. It takes only 17 ms to detect all pedestrians in a 640×480 input image when the minimum pedestrian size is 30×60. Fig. 13(a) shows the first

    Fig. 12. Convergence speed of different descriptors.

    Fig. 13. The first two selected descriptors for pedestrians and faces.

    Fig. 14. Experiments on FDDB face database.


    two selected descriptors for pedestrians. They clearly capture the body structure of the pedestrian.

    5.2. Experiments on FDDB dataset

    Next, we tested the detectors on the publicly available FDDB dataset [45]. This dataset contains 5171 faces in 2845 images. Most images include faces in various poses and sizes against complicated backgrounds. We collect 2400 faces from the web and resize them to 24×24 as training images. Variable-size BLB descriptors with block sizes from 2×2 to 6×6 are generated and further selected using the algorithm in Fig. 7. Fig. 14 plots the curves of our method as well as popular binary descriptors, including Haar, LBP, and LAB, and the state-of-the-art algorithms [46–53], in terms of the number of false positives with respect to the detection rate. As shown in Fig. 14, our detector achieves an 85.4% detection rate at 200 false positives. To our knowledge, this is the best result at 200 false positives on this dataset. The performance of our method is also similar to state-of-the-art algorithms with more complicated classification schemes such as the joint cascade [52] or the deformable part model [51]. Different from general object detection, there are more binary characteristics in the human face, such as the eyebrow against the forehead, or the pupil against the white of the eye, and these binary relationships persist even in multi-view faces. Unfortunately, it is difficult for traditional binary descriptors such as Haar or LBP to capture these relationships due to their limited feature patterns. In contrast, BLB successfully captures these meaningful patterns: as illustrated in Fig. 13(b), the contrast of the eyes with the surrounding parts is well reflected in the first two selected BLBs.

    5.3. Experiments on PASCAL VOC 2007 dataset

    Finally, we employed the standard benchmark object detection dataset, PASCAL VOC 2007, to test our detector. The PASCAL dataset contains images from 20 different categories, with around 5000 images for training and validation and a test set of around 5000 images. We measure the object detection performance on the test images using the standard protocol: average precision (AP) per class, as well as the mean AP (mAP) across all classes. For both measures, a window is considered correct if it has an intersection-over-union ratio of at least 50% with a ground-truth object instance.
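The correctness criterion above amounts to the usual intersection-over-union test on axis-aligned boxes, for example:

```python
def iou(a, b):
    # Boxes given as (x1, y1, x2, y2); returns intersection-over-union.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_correct(det, gt):
    # PASCAL VOC criterion: a detection is correct at >= 50% overlap.
    return iou(det, gt) >= 0.5
```

Two unit squares offset by half their width, for instance, overlap with IoU 1/3 and would not count as a correct detection.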

    The sample size differs across object categories in our training process. For the aeroplane, bird, bottle, chair, diningtable, person, pottedplant, sofa, and tvmonitor categories, all the samples are placed together to train a single detector. For all other categories, the training samples are divided into front/rear-view samples and side-view samples, which are used to train two detectors respectively; the final detection result is based on the voting of these two detectors. The sample sizes (w, h) used to train these classifiers are listed in the second column of Table 1. The label 'f' marks the sample size for front/rear images; only one detector is trained for a category if there is no 'f' label. The size of the BLB rectangles used in these categories ranges from 4×4 to w′ × h′, where w′ and h′ are the largest multiples of 4 smaller than w/3 and h/3, respectively.
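The bound on the BLB rectangle size can be computed per dimension as the largest multiple of 4 strictly smaller than one third of the sample size (the helper name is ours):

```python
def max_block(dim):
    # Largest multiple of 4 strictly smaller than dim / 3.
    m = (dim // 3) // 4 * 4
    if 3 * m >= dim:  # dim / 3 was itself a multiple of 4; step down once
        m -= 4
    return m

# For a 128x64 pedestrian sample this gives a 40x20 upper bound
# on the BLB rectangle size.
```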

    In Table 1, we compare our method with state-of-the-art algorithms [39,54–56] in terms of detection AP on the test set. It can be seen that BLB still achieves 5% better mAP than LBP and LAB. We first notice that BLB performs poorly on some categories with complicated appearance. For instance, in bus detection, because bus bodies are often painted with advertisements in various colors, there are few meaningful binary patterns to capture in either the intensity domain or the gradient domain. Although the detector does select a few BLB descriptors that capture the contrast of the windows to the bus body, the classification margins are relatively small, which shows that the detector does not have sufficient confidence in this decision. A similar circumstance arises in motorbike and horse detection, where a large number of pictures include drivers and riders wearing various clothes. For some object categories with simple appearance, such as car, BLB works as well as complicated features such as SIFT Fisher vectors [54], pyramid HOG [56], covariance matrices [55], and co-occurrence features [39]. In

    Table 1
    Experimental results on PASCAL VOC 2007 dataset.

    Category | Sample size | LBP-gra | LAB-gra | BLB | SA-BLB | Chen [39] | Cinbis [54] | Wang [55] | Felzenszwalb [56]
    aeroplane | 128×64 | 30.8 | 30.5 | 43.5 | 43.7 | 41.0 | 56.1 | 54.2 | 36.6
    bicycle | 128×64, 40×100(f) | 55.8 | 42.9 | 52.7 | 60.4 | 64.3 | 56.4 | 52.0 | 62.2
    bird | 100×100 | 22.4 | 22.2 | 24.8 | 24.9 | 15.1 | 21.8 | 20.3 | 12.1
    boat | 100×100, 40×100(f) | 20.3 | 15.1 | 23.3 | 24.3 | 19.5 | 26.8 | 24.0 | 17.6
    bottle | 40×120 | 21.2 | 22.7 | 26.5 | 29.2 | 33.0 | 19.9 | 20.1 | 28.7
    bus | 128×64, 100×100(f) | 33.5 | 39.2 | 46.2 | 46.6 | 57.9 | 49.5 | 55.5 | 54.6
    car | 100×60, 100×100(f) | 51.4 | 49.1 | 64.7 | 67.7 | 63.2 | 57.9 | 68.7 | 60.4
    cat | 100×60, 100×100(f) | 26.5 | 25.0 | 43.1 | 44.8 | 27.8 | 46.2 | 42.6 | 25.5
    chair | 100×100 | 21.1 | 17.9 | 28.7 | 29.4 | 23.2 | 16.4 | 19.2 | 21.1
    cow | 100×60, 60×100(f) | 26.4 | 30.2 | 29.1 | 33.3 | 28.2 | 41.4 | 44.2 | 25.6
    dining table | 100×60 | 24.1 | 25.1 | 38.7 | 40.2 | 29.1 | 47.1 | 49.1 | 26.6
    dog | 100×60, 80×100(f) | 23.2 | 24.6 | 24.4 | 29.7 | 16.9 | 29.2 | 26.6 | 14.6
    horse | 100×100, 60×100(f) | 40.0 | 41.4 | 44.0 | 44.5 | 63.7 | 51.3 | 57.0 | 60.9
    motorbike | 128×64, 40×100(f) | 39.0 | 37.3 | 41.3 | 42.6 | 53.8 | 53.6 | 54.5 | 50.7
    person | 60×100 | 35.1 | 39.1 | 44.2 | 47.9 | 47.1 | 28.6 | 43.4 | 44.7
    plant | 60×100 | 20.1 | 17.3 | 21.4 | 24.3 | 18.3 | 20.3 | 16.4 | 14.3
    sheep | 100×100, 60×100(f) | 26.5 | 31.8 | 29.1 | 34.1 | 28.1 | 40.5 | 36.6 | 21.5
    sofa | 100×100 | 22.0 | 17.5 | 27.3 | 29.9 | 42.2 | 39.6 | 37.7 | 38.2
    train | 128×64, 100×100(f) | 45.5 | 48.0 | 46.2 | 49.0 | 53.1 | 53.5 | 59.4 | 49.3
    tv | 128×64 | 41.3 | 47.8 | 44.0 | 46.3 | 49.3 | 54.3 | 52.3 | 43.6
    mAP | – | 32.1 | 32.8 | 37.2 | 41.7 | 38.7 | 40.5 | 41.7 | 35.4


    addition, if the target has a stable structure, such as the pottedplant, which always has a base, or the chair, which consists of several rigid parts, so that the major characteristics of the object are binary relationships or contours, BLB is usually able to capture this information, which might be ignored by other high-level gradient features. Unfortunately, this is not the case for the aeroplane and sofa categories. In aeroplane detection, many pictures are truncated and include only a small part of the aeroplane; in sofa detection, at least half of the sofa images are occluded by people or other objects. As a result, part-based models [55,56] have an advantage over a single boosted classifier in dealing with such cases.

    6. Conclusions

    In this paper, we propose an object detector that achieves both high accuracy and fast speed. A large binary feature pool with variable-location and variable-size blocks in both the intensity domain and the gradient domain is built. The RealAdaBoost algorithm is then utilized to select informative features by evaluating different block pairs while considering the structural diversity. As a result, both the discriminative ability and the generalization power are taken into account in the feature selection procedure. Experimental results show that our approach achieves high accuracy on face, pedestrian, and general object detection tasks. It also speeds up convergence compared to standard binary descriptors.

    Acknowledgment

    This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grant RGP36726.

    References

    [1] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: IEEE Conference on Computer Vision and Pattern Recognition, 2001, pp. 511–518.

    [2] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.

    [3] O. Tuzel, F. Porikli, P. Meer, Human detection via classification on Riemannian manifolds, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.

    [4] L. Nanni, A. Lumini, Heterogeneous bag-of-features for object/scene recognition, Appl. Soft Comput. 13 (4) (2013) 2171–2178.

    [5] X. Wang, T.X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 32–39.

    [6] S. Oh, S. McCloskey, I. Kim, A. Vahdat, K.J. Cannons, H. Hajimirsadeghi, G. Mori, A.A. Perera, M. Pandey, J.J. Corso, Multimedia event detection with multimodal feature fusion and temporal concept localization, Mach. Vis. Appl. 25 (1) (2014) 49–69.

    [7] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.

    [8] C. Zhang, J.C. Platt, P.A. Viola, Multiple instance boosting for object detection, in: Advances in Neural Information Processing Systems, 2005, pp. 1417–1424.

    [9] J. Gall, V. Lempitsky, Class-specific Hough forests for object detection, in: Advances in Computer Vision and Pattern Recognition, 2013, pp. 143–157.

    [10] Q. Zhu, M.-C. Yeh, K.-T. Cheng, S. Avidan, Fast human detection using a cascade of histograms of oriented gradients, in: IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 1491–1498.

    [11] B. Li, T. Wu, S.-C. Zhu, Integrating context and occlusion for car detection by hierarchical and-or model, in: European Conference on Computer Vision, 2014, pp. 652–667.

    [12] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.

    [13] D.-Y. Kim, J.-Y. Kwak, B. Ko, J.-Y. Nam, Human detection using wavelet-based CS-LBP and a cascade of random forests, in: IEEE International Conference on Multimedia and Expo, 2012, pp. 362–367.

    [14] R. Nosaka, Y. Ohkawa, K. Fukui, Feature extraction based on co-occurrence of adjacent local binary patterns, in: Advances in Image and Video Technology, 2012, pp. 82–91.

    [15] D.T. Nguyen, Z. Zong, P. Ogunbona, W. Li, Object detection using non-redundant local binary patterns, in: IEEE International Conference on Image Processing, 2010, pp. 4609–4612.

    [16] S. Yan, S. Shan, X. Chen, W. Gao, Locally assembled binary (LAB) feature with feature-centric cascade for fast and accurate face detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.

    [17] H. Ren, Z.-N. Li, Boosted local binaries for object detection, in: IEEE International Conference on Multimedia and Expo, 2014, pp. 1–6.

    [18] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, IEEE Trans. Image Process. 19 (6) (2010) 1635–1650.

    [19] Z. Guo, L. Zhang, D. Zhang, Rotation invariant texture classification using LBP variance (LBPV) with global matching, Pattern Recognit. 43 (3) (2010) 706–719.

    [20] N.-S. Vu, A. Caplier, Face recognition with patterns of oriented edge magnitudes, in: European Conference on Computer Vision, 2010, pp. 313–326.

    [21] Z. Guo, D. Zhang, A completed modeling of local binary pattern operator for texture classification, IEEE Trans. Image Process. 19 (6) (2010) 1657–1663.

    [22] N.-S. Vu, A. Caplier, Mining patterns of orientations and magnitudes for face recognition, in: IEEE International Joint Conference on Biometrics, 2011, pp. 1–8.

    [23] S. Zhang, C. Bauckhage, A. Cremers, Informed Haar-like features improve pedestrian detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 947–954.



    [24] T. Ahonen, J. Matas, C. He, M. Pietikäinen, Rotation invariant image description with local binary pattern histogram Fourier features, in: Scandinavian Conference on Image Analysis, 2009, pp. 61–70.

    [25] G. Zhao, T. Ahonen, J. Matas, M. Pietikainen, Rotation-invariant image and video description with local binary pattern features, IEEE Trans. Image Process. 21 (4) (2012) 1465–1477.

    [26] D.G. Lowe, Object recognition from local scale-invariant features, in: IEEE International Conference on Computer Vision, 1999, pp. 1150–1157.

    [27] F. Shahbaz Khan, R.M. Anwer, J. van de Weijer, A.D. Bagdanov, M. Vanrell, A.M. Lopez, Color attributes for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3306–3313.

    [28] M. Heikkilä, M. Pietikäinen, C. Schmid, Description of interest regions with local binary patterns, Pattern Recognit. 42 (3) (2009) 425–436.

    [29] L.-J. Li, H. Su, Y. Lim, L. Fei-Fei, Object bank: an object-level image representation for high-level visual recognition, Int. J. Comput. Vis. 107 (1) (2014) 20–39.

    [30] H.-H. Yeh, K.-H. Liu, C.-S. Chen, Salient object detection via local saliency estimation and global homogeneity refinement, Pattern Recognit. 47 (4) (2014) 1740–1750.

    [31] R. Benenson, M. Mathias, R. Timofte, L. Van Gool, Pedestrian detection at 100 frames per second, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2903–2910.

    [32] P. Dollár, S. Belongie, P. Perona, The fastest pedestrian detector in the west, in: British Machine Vision Conference, 2010, pp. 1–11.

    [33] S. Maji, A.C. Berg, J. Malik, Classification using intersection kernel support vector machines is efficient, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.

    [34] J. Yan, X. Zhang, Z. Lei, S. Liao, S.Z. Li, Robust multi-resolution pedestrian detection in traffic scenes, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3033–3040.

    [35] P. Dollár, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object detection, IEEE Trans. Pattern Anal. Mach. Intell. 36 (8) (2014) 1532–1545.

    [36] B. Wu, R. Nevatia, Optimizing discrimination-efficiency tradeoff in integrating heterogeneous local features for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.

    [37] D. Levi, S. Silberstein, A. Bar-Hillel, Fast multiple-part based object detection using KD-ferns, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 947–954.

    [38] A. Bar-Hillel, D. Levi, E. Krupka, C. Goldberg, Part-based feature synthesis for human detection, in: European Conference on Computer Vision, 2010, pp. 127–142.

    [39] G. Chen, Y. Ding, J. Xiao, T.X. Han, Detection evolution with multi-order contextual co-occurrence, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1798–1805.

    [40] R. Benenson, M. Mathias, T. Tuytelaars, L. Van Gool, Seeking the strongest rigid detector, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3666–3673.

    [41] B. Wu, R. Nevatia, Cluster boosted tree classifier for multi-view, multi-pose object detection, in: IEEE International Conference on Computer Vision, 2007, pp. 1–8.

    [42] Z. Tu, Probabilistic boosting-tree: learning discriminative models for classification, recognition, and clustering, in: IEEE International Conference on Computer Vision, 2005, pp. 1589–1596.

    [43] H. Ren, Z.-N. Li, Gender recognition using complexity-aware local features, in: International Conference on Pattern Recognition, 2014.

    [44] P. Dollár, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2012) 743–761.

    [45] V. Jain, E. Learned-Miller, FDDB: A Benchmark for Face Detection in Unconstrained Settings, Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.

    [46] X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2879–2886.

    [47] X. Shen, Z. Lin, J. Brandt, Y. Wu, Detecting and aligning faces by image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3460–3467.

    [48] H. Li, G. Hua, Z. Lin, J. Brandt, J. Yang, Probabilistic elastic part model for unsupervised face detector adaptation, in: IEEE International Conference on Computer Vision, 2013, pp. 793–800.

    [49] J. Li, Y. Zhang, Learning SURF cascade for fast and accurate object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3468–3475.

    [50] H. Li, Z. Lin, J. Brandt, X. Shen, G. Hua, Efficient boosted exemplar-based face detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014.

    [51] J. Yan, Z. Lei, L. Wen, S.Z. Li, The fastest deformable part model for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014.

    [52] D. Chen, S. Ren, Y. Wei, X. Cao, J. Sun, Joint cascade face detection and alignment, in: European Conference on Computer Vision, 2014, pp. 109–122.

    [53] M. Mathias, R. Benenson, M. Pedersoli, L. Van Gool, Face detection without bells and whistles, in: European Conference on Computer Vision, 2014, pp. 720–735.

    [54] R.G. Cinbis, J. Verbeek, C. Schmid, Segmentation driven object detection with Fisher vectors, in: IEEE International Conference on Computer Vision, 2013, pp. 2968–2975.

    [55] X. Wang, M. Yang, S. Zhu, Y. Lin, Regionlets for generic object detection, in: IEEE International Conference on Computer Vision, 2013, pp. 17–24.

    [56] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1627–1645.

    [57] J. Li, X. Liang, S. Shen, T. Xu, S. Yan, Scale-aware Fast R-CNN for pedestrian detection, arXiv preprint arXiv:1510.08160, 2015.

    [58] Y. Tian, P. Luo, X. Wang, X. Tang, Deep learning strong parts for pedestrian detection, in: IEEE International Conference on Computer Vision, 2015, pp. 1904–1912.

    Haoyu Ren received his B.Sc. degree from Tsinghua University in 2007, and M.Sc. degree from the Institute of Computing Technology, Chinese Academy of Sciences in 2010. He worked as a Staff Researcher in the multimedia group of Beijing Samsung Telecom R&D from 2011 to 2013. He is currently a Ph.D. candidate at the School of Computing Science, Simon Fraser University. His research interests include object detection, tracking, and recognition based on machine learning methods.

    Ze-Nian Li is a Professor in the School of Computing Science at Simon Fraser University, British Columbia, Canada. Dr. Li received his undergraduate education in Electrical Engineering from the University of Science and Technology of China, and M.Sc. and Ph.D. degrees in Computer Sciences from the University of Wisconsin-Madison under the supervision of the late Professor Leonard Uhr. His current research interests include computer vision, multimedia, pattern recognition, image processing, and artificial intelligence. Dr. Li has published over 150 refereed papers in journals and conference proceedings. He is the co-Director of the Vision and Media Lab, and the co-author of the book "Fundamentals of Multimedia", 2nd ed., published by Springer.

