CENTER FOR MACHINE PERCEPTION
CZECH TECHNICAL UNIVERSITY IN PRAGUE

MASTER THESIS

Pictorial Structural Models for Human Detection in Videos

Rostislav Prikner
{priknr1,svobodat}@fel.cvut.cz

CTU–CMP–2008–11
May 22, 2008

Available at ftp://cmp.felk.cvut.cz/pub/cmp/articles/svoboda/Prikner-TR-2008-11.pdf

Thesis Advisor: Tomáš Svoboda

The work has been supported by the Czech Academy of Sciences under Project 1ET101210407. Tomáš Svoboda acknowledges support from the Czech Ministry of Education under Project 1M0567.

Research Reports of CMP, Czech Technical University in Prague, No. 11, 2008

Published by Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague 6, Czech Republic
fax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz

Pictorial Structural Models for Human Detection in Videos



Acknowledgment

I would like to thank Tomáš Svoboda for supervising my diploma thesis and for his theoretical advice and comments. I would like to thank my family for the support they have given me during my education.


Abstract

This thesis describes the detection of a human body in images. Two different approaches are used. The first detects a human body with a single detection window based on histograms of oriented image gradients (HoGs) and uses a cascade of classifiers to reduce the computing time.

The second approach is based on matching of pictorial structures. An articulated model of the human body is assembled from individual parts (head, torso, limbs, etc.). The human body is represented as a collection of parts arranged in a deformable configuration. Individual body parts are detected using color characteristics learned from training examples. We extend the standard implementation to detect figures at multiple scales. The method is accelerated by exploiting the vertical symmetry of the human body, and the windowed human detector is applied first in order to reduce the state space.

We propose a method for unsupervised learning of the color appearance of human body parts. This makes detection by pictorial structures matching more robust. The method integrates the fast human detector, pictorial structures matching, and image segmentation based on graph cuts.

All the used methods are tested on real datasets.





Contents

1 Introduction
  1.1 State of the art
  1.2 Our approach
  1.3 Goals of the diploma thesis
  1.4 Structure of the document

2 Fast human detector
  2.1 Histograms of Oriented Gradients
  2.2 Cascade of HoGs for fast human detection
  2.3 Our implementation
    2.3.1 Dataset
    2.3.2 HoG features
    2.3.3 Computing features
    2.3.4 Variable block size
    2.3.5 Cascade of rejectors
    2.3.6 Training the cascade
    2.3.7 Final cascade
    2.3.8 Selection of one detection window

3 Pictorial structures
  3.1 Probability model of pictorial structure
  3.2 Detector of parts
  3.3 Learning process

4 Unsupervised learning of appearance
  4.1 Introduction to the Graph Cut theory
  4.2 Graph Cut segmentation
  4.3 Learning of a color model of a human body

5 Experiments
  5.1 Fast human detector
  5.2 Pictorial structures
  5.3 Unsupervised learning of appearance
    5.3.1 Our dataset
    5.3.2 Stickman data set

6 Conclusions

7 Appendixes
  7.1 Implementation
    7.1.1 Fast human detector
    7.1.2 Pictorial structures matching
    7.1.3 Unsupervised learning of appearance
  7.2 Enclosed CD


Chapter 1

Introduction

Detection and tracking of objects in video sequences and images is an important topic; detecting humans in particular is a challenging task for today's computer systems. It is central to a wide range of applications, including visual surveillance, content-based image storage and retrieval, virtual reality, telepresence, intelligent robotics, human activity analysis, and machine perception. This field of study also has many commercial applications.

1.1 State of the art

There are two leading approaches to the problem of human detection. One uses a single detection window with a binary result. Such methods include, for example, a Haar-based representation [17], comparing edge images to an exemplar dataset using the chamfer distance [8], and extended Haar-like wavelets that handle space-time information for moving-human detection [16]. Dalal and Triggs [5] presented a human detection algorithm using Histograms of Oriented Gradients with excellent detection results.

Other methods take a parts-based approach that aims at dealing with the great variability in appearance due to body articulation. Each part (head, torso, arms, legs, etc.) is detected separately, and a human is detected if some or all of its parts are present in a geometrically plausible configuration. In [10], parts are represented as projections of straight cylinders, and efficient ways are proposed to incrementally assemble these segments into a full body. In [15], parts are represented as co-occurrences of local orientation features. Felzenszwalb and Huttenlocher [6] presented a human detector based on pictorial structures. The idea is to represent a body as a


collection of body parts arranged in a deformable configuration.

All methods rely on parameters learned from sets of positive and negative training examples. These sets must be large enough for sufficiently robust detection. The learning process can be time consuming, but the main goal of a detector is to detect a human quickly, because industrial applications require near real-time evaluation.

1.2 Our approach

Figure 1.1: Schema of detection. The human body is detected first; pictorial structures matching is then used to find the body configuration.

We concentrate on three methods of human detection in this work. The first is a fast human detector that finds a single window containing a person. We implement the detector of Zhu, Avidan, Yeh and Cheng [26], which is based on the HoG descriptors of Dalal and Triggs [5]; they speed up the algorithm using a cascade of classifiers. The image is covered by overlapping detection windows, and each detection window yields a binary result. Finding a rectangular window containing a person is sufficient for many applications. This detector is also useful as a starting point for other detectors, because it reduces the search state space.

In the next step we connect the single-window detector with a detector based on pictorial structures matching. We use the implementation of [14], which realizes the pictorial structures matching presented by Felzenszwalb and Huttenlocher [6].

The goal of this method is to find people in video sequences using learned models of both the appearance of body parts (head, torso, arm, leg, etc.) and the geometry of their structure. Three different part detectors, each returning the probability that a part is located at a given location, are implemented in


[14], but in our work we use only one, based on the color properties of each part (color-based segmentation). The probabilities from the part detector are used when the entire structural model is put together. The best structural model is found using an efficient minimization algorithm; for the best location we use the maximum a posteriori probability (MAP estimate) implemented in [14].

Finding a person with the single-window detector is useful for this method because it shrinks the image area that must be searched; searching the whole image would be too expensive.

We combine the fast human detector with another method that uses a minimum cut of a graph, Graph Cut [2]. This detector finds the contour of a foreground region, which can be a human silhouette. For good segmentation, the algorithm needs to know where foreground and background pixels may lie; this information is initialized by the single-window human detector. We propose a method that allows unsupervised learning of the appearance of a human body, which is useful when the color model of body parts is unknown to the pictorial structures matching. The method integrates the fast human detector, Graph Cut segmentation, and pictorial structures matching using sampling of a posterior probability. Learning the appearance model is useful when tracking video sequences and for individualising persons.

Close to the end of work on this thesis we realized that a very similar approach had been independently proposed by [7].

1.3 Goals of the diploma thesis

The main goals of this diploma thesis are the detection of a human body, finding its configuration, and learning the color appearance model of human body parts in images. The sub-goals are listed in the following points:

• To implement the fast human detector [26].

• Integrate this detector with a detector based on the pictorial structuresmatching [6].

• Integrate the fast human detector, the Graph Cut segmentation and thepictorial structures matching to unsupervised learning of appearancesof a human body.

• Test the algorithms on real datasets.


1.4 Structure of the document

The diploma thesis is divided into the following sections:

• Fast human detector (Chapter 2) describes the implementation of afast human detector using HoG descriptors and its learning process.

• Pictorial structures (Chapter 3) provides the review of pictorialstructures matching [14] and describes our innovation.

• Unsupervised learning of appearance (Chapter 4) describes theintegration of the fast human detector, the Graph Cut segmentationand the pictorial structures matching.

• Experiments (Chapter 5) contains an overview of results of all usedmethods.

• Conclusions (Chapter 6) summarise all the work done during thisdiploma thesis.


Chapter 2

Fast human detector

In this chapter we describe a fast human detector using a cascade of classifiersbased on histograms of oriented gradients. In section 2.3 we explain ourimplementation of the detector and we analyse the learning process.

Several approaches exist for describing object appearance in images. The method based on Histograms of Oriented Gradients (HoG) describes the appearance of a human body by the statistical distribution of the orientation and magnitude of the image gradients [5]. Figure 2.1 shows magnitudes of the image gradients.

Figure 2.1: Left: original image with two persons; right: normalized magnitudes of the image gradient.


2.1 Histograms of Oriented Gradients for human detection

We start with a short description of the Dalal and Triggs algorithm [5] in order to explain the essentials. They propose a human detection algorithm using HoG with excellent detection results. The method is based on evaluating a dense grid of normalised local histograms of image gradient orientations over the image window. The hypothesis is that local object appearance and shape can often be characterised rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions.

The method uses a dense grid of HoGs, computed over blocks of size 16 × 16 pixels, to represent a detection window. This representation proves powerful enough to classify humans using a linear SVM [12].

Each detection window is divided into cells of size 8 × 8 pixels, and each group of 2 × 2 cells is integrated into a block in a sliding fashion, so blocks overlap each other. Each cell consists of a 9-bin HoG, and each block contains a concatenated vector of all its cells. Each block is thus represented by a 36-D feature vector that is normalized to L2 unit length. Each 64 × 128 detection window is represented by 7 × 15 blocks, giving a total of 3780 features per detection window. These features are then used to train a linear SVM classifier.
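The descriptor dimensionality follows directly from these numbers; a quick sketch of the arithmetic (purely illustrative):

```python
# Arithmetic of the 64 x 128 Dalal-Triggs detection window (values from the text).
CELL = 8           # cell size in pixels
BLOCK_CELLS = 2    # a block is 2 x 2 cells
BINS = 9           # orientation bins per cell

win_w, win_h = 64, 128
cells_x, cells_y = win_w // CELL, win_h // CELL      # 8 x 16 cells
blocks_x = cells_x - BLOCK_CELLS + 1                 # blocks slide one cell at a time
blocks_y = cells_y - BLOCK_CELLS + 1
block_dim = BLOCK_CELLS * BLOCK_CELLS * BINS         # 36-D vector per block

print(blocks_x, blocks_y)                   # 7 15  -> 105 blocks
print(blocks_x * blocks_y * block_dim)      # 3780 features per window
```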

The Dalal and Triggs algorithm makes use of three key components:

1. The use of HoG as a basic building block.

2. The use of a dense grid of HoGs across the entire detection window toprovide a good description of the detection window.

3. A normalization step within each block that emphasizes relative behav-ior, with respect to the neighbouring cells, as opposed to the absolutevalues.

An important factor missing from their approach is the use of blocks at multiple scales. They use a fairly small block size of 16 × 16 pixels, which might miss the "big picture", or global features, of the entire detection window. Indeed, they report that adding blocks/cells of different scales would somewhat improve the results but would also significantly increase the computation cost. Capturing the "big picture" thus depends on the dense set of small-scale blocks across the entire detection window.


2.2 Cascade of HoGs for fast human detection

Zhu, Avidan, Yeh and Cheng [26] improved the HoG detector by using a cascade of classifiers, which sped up the original algorithm. The cascade of classifiers is a decision tree that eliminates non-human detection windows at each stage. They used AdaBoost [20] to choose which features to evaluate in each stage. However, the small size of the blocks proved to be a major obstacle: none of the small blocks was informative enough to reject a sufficient number of patterns to accelerate the detection process. They therefore enlarged the feature space to include blocks of different sizes, locations and aspect ratios. As a result they have 5031 blocks to choose from, compared to the 105 blocks used in the Dalal-Triggs algorithm. Moreover, they found that the first few stages of the cascade, which reject the majority of detection windows, actually use large blocks, while the small blocks are used much later in the cascade.

To support fast evaluation of the specific blocks chosen by the AdaBoost-based feature selection algorithm, they used the integral image representation to efficiently compute the HoG of each block.

2.3 Our implementation

In this section we describe details of our implementation of the fast humandetector.

2.3.1 Dataset

The INRIA data set [5] is used for learning and testing the fast human detector. It contains training and testing images. The training set contains 2418 positive examples, normalised to size 64 × 128 pixels. The negative training part contains 1218 images of variable sizes without persons. We use all positive training examples, and from each negative training image we randomly generate 73 negative examples of variable size, normalized to 64 × 128 pixels, for the learning process. In total we have 2418 positive examples and 88914 negative examples for training the cascade. The INRIA dataset is available at http://pascal.inrialpes.fr/data/human/. Figure 2.2 shows some positive examples and Figure 2.3 some negative examples.


Figure 2.2: Some positive examples.

Figure 2.3: Some negative examples.

2.3.2 HoG features

The histograms of oriented gradients are descriptors of image areas. A HoG describes the characteristics of a sub-area of an image and characterizes a feature of


an object in images.

The gradients are computed using the simple mask [−1 0 1] in the x and y directions. The gradient angle and magnitude are computed at each pixel of the image. The angles of the gradients are computed from 0° to 180° and discretized into 9 discrete directions. Figure 2.4 shows an original training example and its gradients. Figure 2.5 shows the average magnitude over all positive and negative training data.

Figure 2.4: Gradients of a training example. Left: a training example; middle: magnitudes of the gradients; right: angles of the gradients discretized into 9 orientation bins, each bin represented by a shade of gray.
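The gradient step described above can be sketched in NumPy; this is an illustrative fragment, not the thesis implementation:

```python
import numpy as np

def gradient_orientation_bins(img, n_bins=9):
    """Per-pixel gradient magnitude and orientation bin, using the [-1, 0, 1] mask.

    `img` is a 2-D float array (grayscale). Orientations are folded into
    [0, 180) degrees and discretized into `n_bins` bins, as described above.
    """
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # [-1 0 1] mask, x direction
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # [-1 0 1] mask, y direction
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0            # unsigned orientation
    bins = np.minimum((ang / 180.0 * n_bins).astype(int), n_bins - 1)
    return mag, bins
```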

2.3.3 Computing features

An integral histogram [18] is used for fast evaluation; it allows very fast computation of Haar-wavelet type features, known as rectangular filters, and efficiently computes histograms over arbitrary rectangular image regions.

The orientation and magnitude of the gradient are computed for each pixel in the image. The orientation of the gradient is weighted by its magnitude: for each orientation bin one integral image is computed, in which the magnitudes of the gradients are stored.

To compute a region's features we thus have 9 integral images, one per HoG bin, which are used to efficiently compute the HoG for any rectangular image region. Each block is divided into 2 × 2 sub-regions; this requires 4 × 9 image access operations to compute the feature of one block.


Figure 2.5: Image of the average magnitude over all data; bright positions correspond to high magnitude values. Left: positive examples; right: negative examples.

For each bin we have 4 numbers that characterise its features. Figure 2.6 shows the 9 access points to the integral histogram used to compute the 4 features of one bin. The features of one bin are computed as:

feature1 = a + e − b − d,   feature2 = b + f − c − e,
feature3 = d + h − e − g,   feature4 = e + i − f − h.    (2.1)

Each block is characterised by a 4 × 9 = 36 dimensional vector, which is normalized by L1 normalization. Let v be the unnormalized 36-dimensional descriptor vector and ε a small constant; the L1-norm is defined as v → v/(|v|_1 + ε), where |v|_1 = Σ_{i=1}^{36} |v_i|. We do not use the L2-norm, because the L1-norm is faster, and [26] demonstrates that the L1-norm and L2-norm give similar results.
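Under the conventions above, the per-bin integral images and the four sub-region features of equation (2.1) can be sketched as follows; this is a minimal NumPy illustration with function names of our own choosing:

```python
import numpy as np

def integral_images(mag, bins, n_bins=9):
    """One integral image per orientation bin; entry (y, x) holds the sum of
    gradient magnitudes falling into that bin over the rectangle [0, y) x [0, x)."""
    h, w = mag.shape
    ii = np.zeros((n_bins, h + 1, w + 1))
    for b in range(n_bins):
        masked = np.where(bins == b, mag, 0.0)
        ii[b, 1:, 1:] = masked.cumsum(axis=0).cumsum(axis=1)
    return ii

def block_feature(ii, y, x, bh, bw, eps=0.1):
    """36-D feature of the block at (y, x) of size bh x bw: the four 2 x 2
    sub-region sums per bin, read off the nine access points a..i of Eq. (2.1),
    then L1-normalized."""
    ys = (y, y + bh // 2, y + bh)
    xs = (x, x + bw // 2, x + bw)
    feat = []
    for b in range(ii.shape[0]):
        s = ii[b][np.ix_(ys, xs)]                          # 3x3 grid: a b c / d e f / g h i
        feat += [s[0, 0] + s[1, 1] - s[0, 1] - s[1, 0],    # a + e - b - d
                 s[0, 1] + s[1, 2] - s[0, 2] - s[1, 1],    # b + f - c - e
                 s[1, 0] + s[2, 1] - s[1, 1] - s[2, 0],    # d + h - e - g
                 s[1, 1] + s[2, 2] - s[1, 2] - s[2, 1]]    # e + i - f - h
    v = np.array(feat)
    return v / (np.abs(v).sum() + eps)                     # L1 normalization
```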


Figure 2.6: Histogram back-projection. The block and the 9 access points to the integral histogram used to compute the features of one bin. The feature vector of one bin has length 4. Nine integral images like this are used to compute all features of one block.

2.3.4 Variable block size

The use of HoG features in the Dalal and Triggs approach was restricted to a single scale (105 blocks of size 16 × 16 pixels). Moreover, they report that using blocks and cells at multiple scales improves the results while the computational cost greatly increases. Zhu, Avidan, Yeh and Cheng circumvent this problem by using feature selection. Specifically, for a 64 × 128 detection window they have 5031 blocks of different sizes and positions.

In our implementation we have 7206 blocks of variable size and position for a detection window of size 64 × 128 pixels. The minimal block size is 12 × 12 pixels and the maximal 64 × 128 pixels. The height of the blocks is 12, 16, 20, ..., 128 pixels and the width 12, 16, 20, ..., 64 pixels (height and width change in steps of 4 pixels). For each block the ratio between height and width is greater than 0.5 and smaller than 2. The blocks are placed on the detection window with a step of 8 pixels.
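The enumeration rules above can be sketched as follows; the boundary and aspect-ratio conventions here are our guesses, so the sketch does not assert the exact total of 7206:

```python
def enumerate_blocks(win_w=64, win_h=128, min_side=12, size_step=4, stride=8):
    """Enumerate candidate blocks of variable size and position inside the
    detection window, following the rules stated in the text. Boundary and
    aspect-ratio conventions are assumptions, so no exact total is claimed."""
    blocks = []
    for bh in range(min_side, win_h + 1, size_step):
        for bw in range(min_side, win_w + 1, size_step):
            if not (0.5 <= bw / bh <= 2.0):   # keep roughly 1:2 to 2:1 blocks
                continue
            for y in range(0, win_h - bh + 1, stride):
                for x in range(0, win_w - bw + 1, stride):
                    blocks.append((x, y, bw, bh))
    return blocks

print(len(enumerate_blocks()))   # thousands of candidate blocks
```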

The advantage of using a set of variable-size blocks is twofold. First, for a specific object category, useful patterns tend to spread over different scales; the original 105 fixed-size blocks encode only very limited information. Second, some of the blocks in this large set of 7206 might correspond to a semantic part of a person, such as the torso; a small number of fixed-size blocks is less likely to establish such mappings. The number of blocks affects the computing time of learning the cascade, but a greater number of blocks improves the chance of finding good classification blocks.

Another way to view this is as an implicit way of doing parts-based detection using a single-window approach. The most informative parts, i.e. blocks, are automatically selected using the AdaBoost algorithm [20]. The HoG


features are robust to small local changes, while the variable-size blocks can capture the "global picture".

2.3.5 Cascade of rejectors

A series of classifiers is applied to every sub-window. At each stage, classifiers eliminate a large number of negative examples with little processing time. Only the sub-windows with a complete human body pass through all stages of the cascade. Figure 2.7 shows a schematic depiction of the detection cascade. For more details refer to [25].

Figure 2.7: Schematic depiction of the detection cascade. The figure is takenfrom [25].

Classifiers that evaluate the blocks are used as weak classifiers in each stage of the cascade. Blocks of variable location and size are selected at each level of the cascade, and a linear SVM classifier is trained for each block. For training the SVM classifiers we use the SVMlight implementation, which is described in [12] and available at http://svmlight.joachims.org/.

Linear SVM classifiers are simple classifiers that minimize the structural risk using a separating hyperplane. Each block feature corresponds to a 36-D vector (4 sub-regions and 9 bins). The output of a linear SVM classifier is binary. Let x be an n-dimensional vector; Equation 2.2 shows the evaluation of a linear SVM classifier.

R(x) = sign(w1x1 + w2x2 + w3x3 + . . . + wnxn + b), (2.2)


where w1, w2, w3, ..., wn, b are parameters learned by the linear SVM.

The main classifier of a cascade level is assembled from weak classifiers selected by the AdaBoost algorithm [20]. AdaBoost selects the best classifiers from a large set of weak classifiers. Each selected weak classifier has its own weight, and the final classifier is a sum of the weighted results of the weak classifiers. For more details refer to [11]. Figure 2.8 describes the AdaBoost algorithm.

Input: (x_1, y_1), ..., (x_m, y_m); x_i ∈ X, y_i ∈ {−1, +1}
       H — set of weak classifiers

Initialise weights D_1(i) = 1/m.

For t = 1, ..., T:

  1. Find
       h_t = arg min_{h_j ∈ H} ε_j = Σ_{i=1}^{m} D_t(i) [[y_i ≠ h_j(x_i)]]

  2. If ε_t ≥ 1/2, then stop.

  3. Set α_t = (1/2) log((1 − ε_t)/ε_t).

  4. Update
       D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t,
     where
       Z_t = Σ_{i=1}^{m} D_t(i) exp(−α_t y_i h_t(x_i)).

Output:

  Final classifier: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x))

Figure 2.8: The AdaBoost algorithm. In our case x_i is a positive or negative example and H is the set of all linear SVM block classifiers.
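The procedure of Figure 2.8 can be sketched as a minimal discrete AdaBoost; the weak classifiers below are arbitrary callables standing in for the block SVMs of the text:

```python
import numpy as np

def adaboost(X, y, weak_classifiers, T=20):
    """Discrete AdaBoost as in Figure 2.8. Each weak classifier maps an
    example to {-1, +1}. Returns the selected classifiers with their weights."""
    y = np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    chosen = []
    for _ in range(T):
        # 1. pick the weak classifier with the smallest weighted error
        errs = [np.sum(D[np.array([h(x) for x in X]) != y]) for h in weak_classifiers]
        j = int(np.argmin(errs))
        eps = errs[j]
        if eps >= 0.5:                           # 2. stop: no better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # 3. classifier weight
        preds = np.array([weak_classifiers[j](x) for x in X])
        D = D * np.exp(-alpha * y * preds)       # 4. re-weight ...
        D = D / D.sum()                          #    ... and normalize by Z_t
        chosen.append((alpha, weak_classifiers[j]))
    return chosen

def predict(chosen, x):
    """Final classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
    return 1 if sum(a * h(x) for a, h in chosen) >= 0 else -1
```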

We modify the AdaBoost algorithm: AdaBoost is stopped when the error on the training set is smaller than a defined value. A threshold is added to the final


classifier because at each level of the cascade all positive examples must pass, while the level must reject at least a small percentage of the negative examples. The final classifier is:

H(x) = sign(th + Σ_{t=1}^{T} α_t h_t(x)),    (2.3)

where x is the input window, th is the added threshold, T is the number of selected weak classifiers, α_t is the weight of classifier t, and h_t(x) is the binary result of the linear SVM weak classifier.
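Evaluating the cascade on one detection window can be sketched as below, with Eq. (2.2) and Eq. (2.3) supplying the arithmetic; the data structures (`cascade`, `features_of`) are illustrative assumptions, not the thesis data layout:

```python
def svm_sign(w, b, x):
    """Weak classifier of Eq. (2.2): the sign of a linear SVM on a block feature."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

def cascade_accepts(cascade, features_of):
    """Evaluate one detection window against a rejection cascade.

    `cascade` is a list of levels (th, weak), where weak is a list of tuples
    (alpha, w, b, block); `features_of(block)` returns the window's 36-D
    feature for that block, e.g. via the integral histogram.
    """
    for th, weak in cascade:
        score = th + sum(alpha * svm_sign(w, b, features_of(block))
                         for alpha, w, b, block in weak)
        if score < 0:       # Eq. (2.3): negative sign -> reject immediately
            return False
    return True             # survived every level -> human detected
```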

2.3.6 Training the cascade

We train the cascade on all 2418 positive and 88914 negative examples, constructing a rejection cascade similar to the one proposed in [26]. We set the minimum detection rate to 0.9975 and the maximum false positive ratio to 0.8 for each level. This means that at most 0.25% of positive examples may be classified as negative, and at least 20% of the negative examples must be removed from the negative example set at each level of the cascade.

The negative example set is bootstrapped across the cascade levels. At each level of the cascade, the whole set of positive examples and the same number of negative examples are used. The first 2418 negative examples are used in the first level. At least 20% of the negative examples are rejected at each level of the cascade, and new negative examples are added; only negative examples from the negative set that the existing cascade classifies as containing a human are added. In the later cascade levels it can happen that the negative example set contains fewer examples than the positive example set; in that case we randomly duplicate some negative examples, so that the same numbers of negative and positive examples are still used.

Finally, because evaluating each of the 7206 possible blocks in each stage is very time consuming, the sampling method suggested in [21] is adopted. They show that one can find, with high probability, the maximum of m random variables in a small number of trials. In practice 360 blocks (5% of all 7206 blocks) are sampled at random in each round. Figure 2.9 describes the learning process for one stage of the cascade.


Input:
  F_target: target overall false positive rate
  f_max:    maximum acceptable false positive rate per cascade level
  d_min:    minimum acceptable detection rate per cascade level
  Pos:      set of positive samples
  Neg:      set of negative samples

Initialize: i = 0, F_i = 1.0.

While F_i > F_target:

  • i = i + 1
  • f_i = 1.0

  While f_i > f_max:

    1. Train 360 (5%, chosen at random) linear SVMs using the Pos and Neg samples.
    2. Add the best SVM selected by AdaBoost, with its threshold, into the strong classifier.
    3. Evaluate Pos and Neg with the current strong classifier.
    4. Compute f_i.

  • Compute F_i over all negative examples in Neg.
  • Remove correctly classified samples from Neg.
  • Add misclassified samples from the large set of negative samples to Neg.
  • If the number of Neg is smaller than the number of Pos, duplicate some Neg at random.

Output:
  An i-level cascade; each level has a threshold and weak classifiers with their own weights.

Figure 2.9: Training of the cascade.


2.3.7 Final cascade

In the previous section we described the process of learning the cascade. The learned cascade has 50 levels, each with a different number of classifiers. Figure 2.10 shows the number of SVM classifiers in each level of the cascade, and Figure 2.11 shows the blocks selected in some levels of the cascade.

Figure 2.10: The number of SVM classifiers in each level of the cascade.

Figure 2.11: Selected blocks in some levels of the cascade. From left: the 1st, 10th, 20th, 30th, 40th and 50th level.

We compute the Detection Error Tradeoff (DET) curve on the testing data from the INRIA dataset. We use the normalised positive data and images


without persons as negative data.

Detection Error Tradeoff (DET) curves form a natural criterion for binary classification tasks, as they measure the proportion of true detections against the proportion of false positives. The miss rate is plotted versus false positives (False Positives Per Window tested, FPPW) on a log-log scale. The miss rate is defined as:

    MissRate = FalseNeg / (TruePos + FalseNeg).    (2.4)

Lower values of the DET curve denote better classifier performance. DET curves present the same information as Receiver Operating Characteristic (ROC) curves but allow small probabilities to be distinguished more easily. A false positive rate of 10^−4 FPPW is often used as a reference point for results. Figure 2.12 shows the DET curve for our detector; to compute it we shift the threshold in each level by the same amount.

Figure 2.12: DET curve.

The working point set on the cascade depends on the application. For an FPPW of 10^-4 we subtract the value 1.3 from the threshold in each cascade level.
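The DET computation described above can be sketched as a threshold sweep over per-window detector scores. This is a sketch under the assumption that each window yields a single real-valued score; it is not the cascade code itself.

```python
import numpy as np

def det_points(scores_pos, scores_neg, thresholds):
    """(FPPW, miss rate) pairs for a sweep of decision thresholds.
    Miss rate = FN / (TP + FN); FPPW is the fraction of negative
    windows accepted at the given threshold."""
    points = []
    for t in thresholds:
        tp = int(np.sum(scores_pos >= t))   # positives accepted
        fn = int(np.sum(scores_pos < t))    # positives missed
        fppw = float(np.mean(scores_neg >= t))
        points.append((fppw, fn / (tp + fn)))
    return points
```

Plotting the resulting pairs on a log-log scale gives a DET curve like Figure 2.12.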


2.3.8 Selection of one detection window

The fast human detector can find many detection windows for one human body (shown as blue rectangles in Figure 2.13). Only one detection window is needed for the pictorial structure matching and the Graph Cut segmentation. We select one positive detection window by simple averaging: the center of the window is computed by averaging the centers of all positive detection windows, and the height and width are computed in the same way. Figure 2.13 shows the selection of one window for some testing images. This is possible because only one human body can be found in each image of our dataset.

Figure 2.13: Selection of one positive window from all positive detections. Blue are all positive detections, red is the one selected window.
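The averaging of Section 2.3.8 amounts to a component-wise mean of the detection windows; a minimal sketch follows, where the `(x_center, y_center, width, height)` window format is an assumption.

```python
import numpy as np

def select_one_window(detections):
    """Merge all positive detection windows into one by averaging their
    centers and sizes, as in Section 2.3.8. Valid only because at most
    one person appears per image in our dataset."""
    return np.asarray(detections, dtype=float).mean(axis=0)
```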


Chapter 3

Pictorial structures

In this chapter we explain the method for human detection based on the pictorial structures matching [6]. We review [14]. The method detects parts of the human body and searches for the best mutual position of the detected parts. Detection of individual parts is difficult, because many objects in the image can look like a body part. The mutual position of body parts helps make the detection correct.

To speed up the computing time of this algorithm, a human is first detected in the image by the fast human detector described in Chapter 2, because the selected window is smaller than the whole image. We exploited the symmetry of the human body, too.

3.1 Probability model of pictorial structure

For the sake of completeness we re-use [14] in this section. The human body is represented as a collection of parts arranged in a deformable configuration. Local visual properties are encoded in models of individual parts; the deformable configuration is represented by spring-like connections between certain pairs of parts. Parts correspond to the large rigid parts of a real human body, and connections correspond to joints. Figure 3.1 shows the model of the human body.

The model is described as an undirected graph G = (V, E). The vertices V = {v1, v2, ..., vn} correspond to individual parts, and (vi, vj) ∈ E only when a joint between vertices vi and vj exists. An instance of the object is given by a configuration L = {l1, l2, ..., ln}, where li = (xi, yi, si, θi) specifies the location of a part vi. Here xi, yi are the coordinates of the center of the part in the picture, θi is the orientation of the part, and si is the foreshortening of the part along its longer side, caused by scaled orthographic projection.


Figure 3.1: The model of the human body. The body is assembled from rectangular parts, which are connected with flexible joints in a deformable configuration. The size of the rectangles is deformed by the scale and foreshortening parameters.

The best match is found as the one which minimizes the energy function of observing an object at the location L given image I:

L∗ = arg min_L ( Σ_{i=1..n} mi(li) + Σ_{(vi,vj)∈E} dij(li, lj) ), (3.1)

where n is the number of parts, mi(li) measures the degree of mismatch when part vi is placed at location li in the picture, and dij(li, lj) measures the degree of deformation of the model when part vi is placed at location li and part vj is placed at location lj in the image.

While searching for the best match, the function (3.1) is not minimised directly; instead the equivalent posterior probability, of which (3.1) is the negative logarithm, is maximised:

p(L|I, θ) ∝ Π_{i=1..n} p(I|li, ui) · Π_{(vi,vj)∈E} p(li, lj|cij), (3.2)

where p(I|li, ui) is the probability of observing the picture given the part placement, and p(li, lj|cij) is the probability of the mutual position of two parts.

The entire task can be divided into three independent partial tasks according to equation (3.2):

1. Computing the probability p(I|li, ui) and learning its parameters from training examples.

2. Computing the probability p(li, lj|cij) and learning its parameters from training examples.

3. Efficiently searching the state space and finding the match which maximizes the probability (3.2). The MAP estimate is used.
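For a tree-structured graph, subtask 3 can be solved exactly by dynamic programming from the leaves to the root, which is the core of the efficient method of [6]. A minimal sketch over a small discretised state space follows; the cost tables and their layout are hypothetical.

```python
def min_tree_energy(children, match_cost, deform_cost, n_states, root=0):
    """Minimum of the energy (3.1) on a tree: for each part i and state l,
    B_i(l) = m_i(l) + sum over children j of min_{l'} (d_ij(l, l') + B_j(l')).
    `children[i]` lists the children of part i, `match_cost[i][l]` is m_i(l),
    and `deform_cost(i, j, l, l2)` is d_ij for the two states."""
    def table(i):
        child_tables = {j: table(j) for j in children[i]}
        return [match_cost[i][l]
                + sum(min(deform_cost(i, j, l, l2) + child_tables[j][l2]
                          for l2 in range(n_states))
                      for j in children[i])
                for l in range(n_states)]
    return min(table(root))
```

Keeping backpointers at each `min` recovers the minimising configuration L∗ rather than only its energy.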

3.2 Detector of parts, color based segmentation

For the sake of completeness we re-use [14] in this section. Color based segmentation [3] suggests segmenting individual parts using information about their color. The projection of one part is approximated as a rectangle which should cover the part in the given image. The width of the rectangle is fixed and corresponds to the width of a limb observed in the image. The maximal length of the part corresponds to the length of a limb seen in the picture when it is oriented perpendicularly to the optical axis; the length can vary due to foreshortening. The projection of a part is modeled as a rectangle parametrized by (x, y, s, θ), where (x, y) are the coordinates of the centre in the image, s ∈ ⟨0, 1⟩ is the amount of foreshortening and θ is the angle of orientation of the part. Each part can have its unique size, which is given to our detector by a human before the learning procedure.

Figure 3.2: Model of one rectangular part with surrounding area.

Color histograms of the foreground and background are learned from training examples. The histograms are three-dimensional in RGB space. The probability that a pixel of a given colour belongs to a given part is evaluated with a quadratic logistic regression classifier [9], a special case of generalised linear classifiers.
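A quadratic logistic regression classifier over RGB can be sketched as ordinary logistic regression on a quadratic feature expansion of the color. This is a minimal gradient-descent sketch, not the solver of [9]; the learning rate and step count are arbitrary assumptions.

```python
import numpy as np

def quadratic_features(rgb):
    """Expand an RGB color into quadratic features:
    bias, linear terms, and all degree-2 products."""
    r, g, b = rgb
    return np.array([1.0, r, g, b, r*r, r*g, r*b, g*g, g*b, b*b])

def train_logreg(X, y, lr=0.1, steps=2000):
    """Fit logistic regression weights by plain gradient descent
    on the negative log-likelihood (a sketch)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def part_probability(w, rgb):
    """P(pixel belongs to the part | its color)."""
    return 1.0 / (1.0 + np.exp(-quadratic_features(rgb) @ w))
```

Training one such classifier per part from the foreground/background histograms yields the per-pixel part probabilities used below.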

Parts are represented by rectangles consisting of a foreground rectangle and a surrounding area, as shown in Figure 3.2. Each pixel in the image has a probability that it is a component of the given part. The probability of observing a part at location li in the given image is computed as:

p(I|li, ui) = e^(−[(area1 − count1) + count2]/s), (3.3)

where count1 is the sum of the pixels masked by the part, weighted by the probabilities that they belong to the foreground, and count2 is the sum of the pixels masked by the surrounding of the part, weighted by the probabilities that they belong to the foreground. In the smaller rectangle, the misclassified pixels are summed as (area1 − count1) (the probabilities that they belong to the background). Division by the scale s is used to normalise the results.
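Given a per-pixel foreground probability map and the two rectangle masks placed at a candidate location, equation (3.3) can be evaluated directly; a sketch, with the masks assumed to be precomputed boolean arrays.

```python
import numpy as np

def part_likelihood(prob_map, inner_mask, border_mask, s):
    """Evaluate p(I | l_i, u_i) = exp(-[(area1 - count1) + count2] / s)
    from eq. (3.3). `prob_map[y, x]` is the probability that a pixel
    belongs to the foreground."""
    area1 = inner_mask.sum()             # pixels inside the part rectangle
    count1 = prob_map[inner_mask].sum()  # foreground evidence inside
    count2 = prob_map[border_mask].sum() # foreground evidence in the surround
    return np.exp(-((area1 - count1) + count2) / s)
```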

3.3 Learning process

We use and improve the implementation of [14]. Some training examples from the dataset are hand labeled. The color model of the parts and the human model are learned from the labeled training images. The height of a person changes as they move toward or away from the camera. With the changing height of a figure, the proportions of each part change, too. Therefore we use multiple scales.

When the distance of a human body from the camera changes, the color of the body parts changes, too. Therefore we must learn a separate color model for each scale value. Figure 3.3 shows some labeled training examples with various heights of a person.

We suppose that the foreshortening parameter of the human torso varies only gently, because in the training data the persons stand only in an upright position. Therefore we compute the scale value of a training example from the height of the torso. The foreshortening of the human torso is set to the value 1. The proportions of the other parts are recomputed from the scale of the whole human body. The scale of one training example is computed as the nearest value in a predefined set of scale values. For each scale a color model is learned.

To speed up the algorithm we presume that the human body and clothing color are symmetric along the vertical axis: the probabilities of part location are the same for the left and right body parts. This means that we compute probabilities for 6 body parts (torso, head, arm, forearm, calf, thigh) instead of 10 parts (torso, head, left/right arm, left/right forearm, left/right calf, left/right thigh). Figure 3.4 shows the probability that each pixel of the image contains each part. Figure 3.5 shows the probability of the location of some parts over positions of the discretized state space.

Sometimes it is impossible to distinguish parts only by their colours, and many false detections are found. If the segmentation is wrong, it is almost impossible to find parts like the head (the head is a leaf of the graph). The location of these parts can be found by using the pictorial structures matching.

Figure 3.3: Some of the interactively labelled training examples with multi-scale height of a person. The person has symmetric clothing, which is used to speed up the algorithm.

A deformable human body model is learned from all training examples over all scales, because the pose of the human body is the same whether it is far from or near the camera. Figure 3.6 shows the learned human body model.

Using multiple scales slows down the computing time, since the probabilities must be computed separately for each scale value. The MAP estimate then selects the maximal configuration over all positions and scales.


Figure 3.4: Example of colour based segmentation for each part. The color of a pixel represents the probability that the pixel belongs to the part. White is probability 1, black is probability 0.

Figure 3.5: Probabilities of the location of parts over positions of the discretized state space. From left: torso, arm and forearm. The red color represents the probabilities.


Figure 3.6: Structural model learned from labelled training examples. All parts are in the most probable mutual position.


Chapter 4

Unsupervised learning of appearance

In this chapter we explain the method for unsupervised learning of an appearance model of human body parts. Chapter 3 describes the method for human detection based on the pictorial structures matching and color based segmentation of the body parts. The disadvantage of that method is the need to know the color model of the human body parts before the detection starts.

The new approach is based on an image segmentation using the minimal cut of a graph. The Graph Cut [2] algorithm finds the optimal foreground and background labeling as an energy minimization problem. The minimum cut can be computed very efficiently by max flow algorithms. In our work we use the implementation of the Graph Cut available at http://www.csd.uwo.ca/faculty/olga/code.html and the MATLAB wrapper available at http://www.wisdom.weizmann.ac.il/~bagon/matlab.html.

The learning of the appearance of body parts is divided into four steps:

1. A human body is detected in an image by using the human detector based on the HoG descriptors.

2. The selected area is segmented by the Graph Cut algorithm. The color appearance of the whole body is learned from the segmentation result.

3. A configuration of the human body in the detection window is found by the pictorial structure matching.

4. The color appearance of each body part is learned using the results of the pictorial structure matching.
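The four steps above can be sketched as a pipeline; every stage is passed in as a callable, since all stage names and interfaces here are hypothetical placeholders for the components described in this chapter.

```python
def learn_part_appearance(image, detect_human, graph_cut_segment,
                          learn_color_model, match_pictorial_structure,
                          learn_part_color_models):
    """Unsupervised learning of the part appearance model:
    a sketch of the four-step pipeline."""
    window = detect_human(image)                    # 1. HoG-based detector
    fg_mask = graph_cut_segment(image, window)      # 2. Graph Cut segmentation
    body_color = learn_color_model(image, fg_mask)  #    whole-body color model
    config = match_pictorial_structure(image, window, body_color)  # 3. matching
    return learn_part_color_models(image, config)   # 4. per-part color models
```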


4.1 Introduction to the Graph Cut theory

The goal of the Graph Cut minimisation is to label each pixel either as background or as foreground. The model of an image is represented by the grid graph G(V, E). Pixels are nodes v ∈ V and pairs of neighboring pixels are edges vv′ ∈ E of the graph G. xv ∈ {F, B} denotes the pixel label, F foreground and B background. Each pixel is characterized by its color fv, and the image is described by the vector f = {fv | v ∈ V}. The segmentation is the computation of the "best" labeling x = (xv | v ∈ V).

Each pixel with color fv has its own probabilities p(B | fv) that it belongs to the background and p(F | fv) that it belongs to the foreground. The method uses the fact that two neighboring pixels are more likely to both belong to the background or both to the foreground; the case where one neighbouring pixel belongs to the background and the other to the foreground is much less probable. The probability for two neighbouring pixels is defined as:

p(v, v′) = a if xv = xv′, b if xv ≠ xv′, where a > b. (4.1)

The goal of the best labeling is to maximize the product Π_{v∈V} p(xv | fv) · Π_{vv′∈E} p(v, v′). The negative logarithm is taken for the computation. The image energy, computed as:

F(x | f) = Σ_{v∈V} g(xv | fv) + Σ_{vv′∈E} g(v, v′), (4.2)

is minimized, where g(xv | fv) is the negative logarithm of p(xv | fv) and g(v, v′) is the negative logarithm of p(v, v′). Figure 4.1 shows the schema of the minimisation of the image energy using the graph cut.

A max-flow algorithm is used for the efficient minimisation. For more details refer to [2].
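The energy (4.2) of a given labeling can be evaluated directly from the data and smoothness terms; a minimal sketch (the dictionary-based graph representation is an assumption, and the minimisation itself is left to the max-flow solver of [2]).

```python
import math

def labeling_energy(labels, probs, edges, a, b):
    """Energy F(x|f) of a labeling, eq. (4.2): data terms
    g(x_v|f_v) = -log p(x_v|f_v), plus smoothness terms
    g(v,v') = -log a for equal neighbor labels and -log b
    otherwise (a > b, eq. 4.1)."""
    data = sum(-math.log(probs[v][labels[v]]) for v in labels)
    smooth = sum(-math.log(a if labels[v] == labels[w] else b)
                 for v, w in edges)
    return data + smooth
```

A lower energy corresponds to a more probable labeling; the Graph Cut finds the global minimum of this function.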

4.2 Graph Cut segmentation

The model of the body parts is not learned from labeled training examples as in Section 3.2, because the algorithm must work for human bodies with various proportions and various colors of clothing. The background color may vary, too.

One way to solve this problem is to learn the color model from a supposed human body location. A human body is detected by the fast human detector using HoG descriptors, and the detection window is selected by averaging all detections, as described in Section 2.3.8. Figure 4.2 shows the detected human body in the image.


Figure 4.1: Schema of the minimisation of the image energy using the Graph Cut. Figure is taken from [22].

Figure 4.2: Human body detection. The blue rectangle is the selected detection window.

Forearms may extend outside the detection window. Therefore the detection window is expanded by multiplying its height and width by a constant greater than 1; in our implementation we use 1.2. This value was established from experimental results. Figure 4.3 shows the original and the resized detection window.

Figure 4.3: Human body detection. Blue is the original detection. Red is the resized detection window; the forearms are now inside the detection window.

The probabilities that a pixel belongs to the foreground and to the background are computed after the human body is detected in the image. A normalised color histogram is used to estimate the probabilities that a pixel with a given RGB color belongs to the foreground or the background. Two color histograms are computed, one for the foreground and one for the background. The RGB color was discretized into 8 × 8 × 8 bins for computing the color histograms.
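The 8 × 8 × 8 binning of RGB colors can be sketched as follows; the `(N, 3)` uint8 pixel-array input is an assumption.

```python
import numpy as np

def color_histogram(pixels, bins=8):
    """Normalised RGB color histogram with bins x bins x bins cells
    (8 x 8 x 8 as in the text). Each 0-255 channel value is mapped
    to one of `bins` equal-width intervals."""
    idx = (np.asarray(pixels) // (256 // bins)).astype(int)
    hist = np.zeros((bins, bins, bins))
    for r, g, b in idx:
        hist[r, g, b] += 1
    return hist / hist.sum()
```

One such histogram is computed from the foreground area and one from the background area defined below.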

The image is divided into three areas: the color histogram of the foreground is learned from the first area, the color histogram of the background is learned from the second area, and the third area is not used to learn any color histogram. We suppose that the foreground pixels are inside the detection window. However, background pixels may be inside the detection window, too. Therefore we learn the color histogram of the foreground from the inside of the detected area, but not from the inside area near the border of the detection window. We suppose that background pixels with color values similar to those of the background pixels inside the detection window lie near the border of the detection window; therefore the color histogram of the background is computed from the pixels near the border of the detection window. Figure 4.4 shows the areas assigned for learning the color histograms.

Figure 4.4: Areas for computing the color histograms of the foreground and background. Red pixels are used for learning the foreground color histogram, green pixels for learning the background color histogram. Blue pixels are not used to learn any color histogram. The black rectangle is the detection window.

It would be better to divide the detection window into more sophisticated areas for a more precise segmentation, for example taking a wider area for the upper part of the body (where the arms are located) and a narrower area for the lower part of the body when the color histogram is learned.

Figure 4.5 shows the initial probabilities that pixels belong to the foreground and background in a detection window.

The Graph Cut segmentation can start after learning the color histograms. Each pixel outside of the detection window is assigned to the background. For each inside pixel, the probabilities that it belongs to the foreground and to the background are computed from the learned color histograms. Figure 4.6 shows an example of the segmentation.


Figure 4.5: Initialization of the Graph Cut algorithm. Left is the original detection window, middle are the probabilities of pixels belonging to the foreground, right are the probabilities of pixels belonging to the background. Black is 0, white is the maximal value.

4.3 Learning of a color model of a human body

New color histograms of the foreground and background pixels are computed from the result of the Graph Cut segmentation. The Graph Cut refines the initial rectangular segmentation. The quadratic logistic regression classifier is learned from the color histograms. In contrast to the color based segmentation for the pictorial structures matching (Chapter 3), where each part has its own classifier, there is only one classifier for all the parts. Figure 4.7 shows the probabilities that pixels in the detection window belong to the foreground, computed by the quadratic logistic regression classifier.

The next step of the unsupervised learning of the appearance of human body parts is the search for a configuration of the human body in the detection window. A method based on the pictorial structures matching, very similar to the one in Chapter 3, is used, with a few differences:

• Only the deformable human body model is learned from labeled training examples.

• The color model is learned from the Graph Cut segmentation of the actual image. It is the same for all parts of the human body; each body part does not have its own color model, there is only one.


Figure 4.6: Example of a good segmentation using the Graph Cut. Left is the original image. Right is the detection window with a red contour around the foreground pixels.

• The scale value is estimated from the size of the detection window. This means that only one scale value could be used, but in practice we use the 3 closest scale values, because the estimate of the height of the human body need not be precise.

• The MAP estimate is not maximised as in the pictorial structures matching; instead, sampling from the posterior probability is used.

Sampling from the posterior probability is used because, by maximising the MAP estimate, it could happen that the best match is missed. Sometimes the probability distribution computed by the matching algorithm has more than one peak. This can have several causes: for example self-occlusion of the human body, the algorithm being confused by noise, or an imprecise Graph Cut segmentation, where some parts of the human body are segmented as background pixels. Figure 4.9 shows such a wrong segmentation. Therefore the stochastic method is used and the posterior probability is sampled according to the Monte Carlo method. This means that more probable locations in the space are sampled more frequently, while some locations have a minimal chance of being sampled. For more details on the sampling refer to [14]. Figure 4.8 shows some selected samples from the posterior probability.

Figure 4.7: Probabilities of pixels belonging to the foreground. Left is the original detection window, right is the probability of each pixel belonging to the foreground.

Unlike [14], one best sample is not selected. For each pixel of the image, the frequency of occurrence over the samples is computed (for each part of the human body). The result of this operation is a probability map (one for each part of the human body) where each pixel holds the frequency of occurrence of the part. The frequency map is normalised to the interval from 0 to 1. Figure 4.10 shows the frequency map for the torso and head; for the other parts the frequency maps are similar.
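Accumulating the posterior samples into a normalised frequency-of-occurrence map can be sketched as follows; the boolean per-sample part masks are an assumed representation.

```python
import numpy as np

def occurrence_map(shape, part_masks):
    """Frequency-of-occurrence map of one body part: count, per pixel,
    how many posterior samples cover it, then normalise to [0, 1].
    `part_masks` are boolean arrays, one per sample."""
    freq = np.zeros(shape)
    for mask in part_masks:
        freq += mask
    m = freq.max()
    return freq / m if m > 0 else freq
```

The map (and its inversion) then weights the color counts when the per-part foreground and background histograms are built.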

The last step of learning the human body appearance is to learn the color models of each part. A color histogram of the foreground and background is computed for each part of the human body. When the foreground histograms are computed, the counts of color values are weighted by the frequency-of-occurrence map of each part; the inverted frequency-of-occurrence map is used for computing the histograms of the background. The quadratic logistic regression classifiers are computed from the foreground and background histograms for each body part.


Figure 4.8: Examples of sampling from the posterior probability. The torso is often sampled in the same position in this case. The position of the head is different in each sample.


Figure 4.9: Example of a wrong segmentation. The forearms are assigned to the background.

Figure 4.10: Frequency of occurrence of the parts, normalised to the interval from 0 to 1. Left is the original image, middle is the frequency of occurrence of the torso, and right is the frequency of occurrence of the head. The torso is sampled in a similar position in all the samples.


Chapter 5

Experiments

In this chapter we show the results of the fast human detector, the pictorial structures matching and the unsupervised learning of the appearance of human parts. We first discuss the fast human detector using the HoG descriptor, because it is the initial step for the other methods. Then we present the results of the pictorial structures matching and the unsupervised learning of the appearance of human body parts.

5.1 Fast human detector

The fast human detector using a cascade of HoG is described in Chapter 2. The algorithm is trained using 2418 positive examples and 88914 negative examples. All the training examples are normalised to a size of 64 × 128 pixels. The final cascade has 50 levels and about 1000 weak classifiers. The training process took about ten days.

The algorithm is tested on several datasets. The height of a human body in a testing image is variable; an advantage of the detector based on the HoG descriptor is its robustness against human height changes. Human heights in the testing data range from 100 to 560 pixels. Several sizes of the detection window are used in the testing process. The blocks of the weak classifiers in the cascade levels are resized in the same ratio as the detection window relative to the training detection window with the fixed size of 64 × 128 pixels. The HoG features are computed in the resized detection window in the same way as in the window with a fixed size.

We compute an average rejection rate for each level of the cascade. Figure 5.1 shows the rejection rate as a cumulative sum over the cascade levels. The first five levels of the cascade reject about 90% of the detection windows. The average number of blocks to be evaluated for one detection window is about 20. Table 5.1 shows the computing time of the detection of a human body in one testing image, depending on the density of the scan. The time was measured on a PC with a 2.33 GHz CPU and 976 MB of RAM.

Figure 5.1: Rejection rate as a cumulative sum over the cascade levels.

Windows per image    780   1098   2954   4206   6650   11476
Required time [ms]    30     45    100    140    210     360

Table 5.1: Time required to evaluate a 240 × 320 image. The time is the average value over 100 runs.

The testing data from the INRIA [5] data set are very challenging. The people are usually standing, but appear in any orientation and against a wide variety of backgrounds, including crowds. Many are bystanders taken from the image backgrounds, so there is no particular bias in their pose. Figure 5.2 shows examples of good detections by the fast human detector on the INRIA testing data.

The algorithm has problems with objects like a front elevation, which has various shapes that can be similar to the human body. Figure 5.3 shows detections in a street scene.

Figure 5.2: Examples of good detections on the INRIA data set.

We also test the learned cascade on less challenging data. We use our own dataset, which is also used for testing the pictorial structures matching and the unsupervised learning of the body part color model. It contains more than 300 images of one person at various sizes. The pictures are only front or back views with a relatively limited range of poses, and the human body is well separated from the background. Our detector gives essentially perfect results on this data set. Examples of the detections are shown in Figure 2.13 in Chapter 2.

Figure 5.3: Examples of detections in a street scene. A front elevation is detected as a person, too. The human is detected correctly in the first two images and misclassified in the 3rd image.

Figure 5.4: Examples of wrong detections.

Figure 2.12 in Chapter 2 shows the Detection Error Tradeoff (DET) curve for our implementation of the fast human detector using a cascade of HoG. The DET curve gives a summary of our implementation. The measured curve is similar to the DET curves presented in [26] and in [5].


5.2 Pictorial structures

Chapter 3 describes the method for human detection based on the pictorial structures matching. We extend the implementation of [14]. We tested this algorithm on our own dataset containing more than 300 examples. There is only one person in each testing image, because our implementation does not support the presence of multiple persons in an image. The dataset contains human heights at multiple scales, from 180 to 560 pixels.

We speed up this implementation by using the vertical symmetry of the human body and clothing. Some parts of the human body, like the arms or calves, come in pairs: they have the same shape, and usually the clothing is symmetric, too. For this reason it is not necessary to compute the appearance probability for both parts of a pair. We compute probabilities for 6 body parts (torso, head, arm, forearm, calf, thigh) instead of 10 parts (torso, head, left/right arm, left/right forearm, left/right calf, left/right thigh) as in [14].

The second improvement, which speeds up the original implementation, is the use of the fast human detector. Detecting the rectangular area of the image where the person is situated reduces the search space. The output of the human detector is a rectangular detection; one positive detection window is selected by simple averaging of all positive detections, as described in Section 2.3.8.

We extend the original algorithm by using multiple scales of the person's height. We use 4 scale values from 0.4 to 1, where the scale value 1 corresponds to a human body height of 560 pixels. To the foreshortening s we add one value higher than 1, which can cover missing discrete values in the continuous change of the human body height. Using multiple scales slows down the computing time, because the appearance probabilities and the MAP estimate are computed for all 4 scales.

The color model is learned for each of the 4 scale values, because when the distance of a human body from the camera changes, the color of the human body parts changes, too. The deformable human body model for each scale value is learned from all the training examples, resized to the corresponding scale value.

The model of the human body is learned from 40 hand-labeled training examples; about 10 examples correspond to each of the 4 scale values. The learned human body model is shown in Figure 3.6 in Chapter 3.

The human body in the image is detected first. The results of the human detector are shown in Figure 2.13 in Chapter 2. The output of the fast human detector is correct for all the testing images in our dataset. The best configuration of the human body is selected by maximising the MAP estimate, but it could happen that the best match is missed. Another way to select the best configuration is to apply sampling from the posterior probability, which is used in Chapter 4. Figure 5.5 shows examples of good results.

Figure 5.5: Examples of good detections by the pictorial structures matching. The blue rectangle is the search area selected by the fast human detector.

Figure 5.6 shows examples of bad detections. In some cases there is only a little difference compared to the real configuration. Sampling from the posterior probabilities and selecting the best sample could help, but we did not implement a method for selecting the best sample. Another inaccuracy is the selection of a wrong scale value when choosing the maximal posterior probability. An unusual position can also prevent finding a good result: for example, when a person is in a knee-bend position or with their back to the camera, the algorithm can find a wrong human body position. Unusual positions are not included in the training set examples.

We inspected the results manually, because we do not have any machine evaluation tool available. The pictorial structure matching was badly localized on about 10% of the training data with a human in a usual position; 60% of the results were absolutely correct and the other 30% were correct with a small deviation. The pictorial structures matching algorithm is very slow compared to the fast human detector: the computing time for one image ranges from 0.5 to 5 minutes, depending on the size of the detection window.


Figure 5.6: Examples of bad detections by the pictorial structures matching.


5.3 Unsupervised learning of appearance

In this section we present the results of the unsupervised learning of the appearance of human body parts. We show outputs of the Graph Cut algorithm, which is the input of this detector. We test the implementation on the same dataset as the pictorial structures matching (the dataset is described in Section 5.2).

In addition, we test the algorithm on the more challenging "Stickman" data presented in [7]. Progressive Search Space Reduction for Human Pose Estimation [7] is not officially published yet; we received it within the frame of an informal cooperation. The objective of [7] is to estimate the 2D human pose as a spatial configuration of body parts in TV and movie video shots.

5.3.1 Our dataset

We test the algorithm on the same dataset as the pictorial structure matching. 40 training examples are hand labeled and the deformable human body model is learned for each scale value. We use 7 scale values, compared to the pictorial structures matching, where 4 values are used.

The human body in the image is detected first, in the same way as in the pictorial structures matching. The results of the human body detector are shown in Figure 2.13 in Chapter 2.

The Graph Cut segmentation on this dataset provides mostly good results; the background has completely different color characteristics from the foreground. Figure 5.7 shows examples of the Graph Cut segmentation on our dataset, and Figure 5.8 shows some results of the Graph Cut segmentation on the INRIA dataset.
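The essence of the Graph Cut segmentation can be sketched in a few dozen lines. The following self-contained Python illustration (our implementation is in MATLAB and uses the library of [2, 13, 1]) segments a toy 1-D intensity signal: the terminal link capacities encode the data costs under simple foreground/background intensity models, the neighbor links encode a smoothness penalty, and the minimum s-t cut yields the labeling. The intensity means, the smoothness weight, and the signal are illustrative values only:

```python
from collections import deque

def add_edge(g, u, v, cap):
    """Directed edge u->v with capacity cap plus a 0-capacity reverse edge."""
    g.setdefault(u, {})
    g.setdefault(v, {})
    g[u][v] = g[u].get(v, 0) + cap
    g[v].setdefault(u, 0)

def min_cut_source_side(graph, s, t):
    """Edmonds-Karp max-flow; returns the set of nodes on the source side."""
    g = {u: dict(nbrs) for u, nbrs in graph.items()}  # residual capacities
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:                  # BFS for augmenting path
            u = q.popleft()
            for v, c in g[u].items():
                if v not in parent and c > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t                               # reconstruct the path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(g[u][v] for u, v in path)             # bottleneck capacity
        for u, v in path:                             # augment along the path
            g[u][v] -= b
            g[v][u] += b
    reach, q = {s}, deque([s])                        # residual reachability
    while q:
        u = q.popleft()
        for v, c in g[u].items():
            if v not in reach and c > 0:
                reach.add(v)
                q.append(v)
    return reach

# Toy 1-D "image": dark background pixels, bright foreground pixels.
pixels = [40, 55, 60, 190, 210, 205]
fg_mean, bg_mean, smooth = 200.0, 50.0, 30.0
g = {}
for i, val in enumerate(pixels):
    add_edge(g, 's', i, abs(val - bg_mean))  # paid if i is labeled background
    add_edge(g, i, 't', abs(val - fg_mean))  # paid if i is labeled foreground
for i in range(len(pixels) - 1):             # smoothness between neighbors
    add_edge(g, i, i + 1, smooth)
    add_edge(g, i + 1, i, smooth)
fg = min_cut_source_side(g, 's', 't')
labels = [1 if i in fg else 0 for i in range(len(pixels))]
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

This is why the method works well on our dataset: when the foreground and background color statistics differ strongly, the data terms dominate and the cut falls cleanly on the object boundary.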

The color models of the body parts are learned from the Graph Cut segmentation and from sampling the posterior probability of the pictorial structures matching. We then run the pictorial structures algorithm with the learned appearance model, maximizing the posterior probability. The results are similar to those obtained with an appearance model learned from labeled data. In some cases the algorithm does not find a good body configuration: for example, the detected body is upside down, the leaf parts do not correspond to the real position in the image, or the body scale is wrongly estimated. Figure 5.9 shows results of our detector.
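Learning a part's color model from a segmentation mask amounts to accumulating a normalized color histogram over the pixels the mask selects. The following Python sketch is illustrative only (the helper names, the bin count, and the toy pixels are our inventions, not the thesis implementation):

```python
def rgb_bin(pixel, bins=4):
    """Quantize an RGB triple (0-255 per channel) into one of bins**3 bins."""
    step = 256 // bins
    r, g, b = (min(c // step, bins - 1) for c in pixel)
    return (r * bins + g) * bins + b

def learn_color_model(pixels, mask, bins=4):
    """Normalized color histogram over the pixels selected by the mask."""
    hist = [0.0] * bins ** 3
    n = 0
    for p, m in zip(pixels, mask):
        if m:
            hist[rgb_bin(p, bins)] += 1.0
            n += 1
    return [h / n for h in hist] if n else hist

def likelihood(pixel, model, bins=4, eps=1e-6):
    """How well a pixel matches the learned part appearance."""
    return model[rgb_bin(pixel, bins)] + eps

# Toy example: "torso" pixels are reddish; the mask plays the role of the
# foreground decided by the Graph Cut segmentation.
pixels = [(200, 30, 30), (210, 40, 35), (20, 200, 30), (205, 25, 45)]
mask = [1, 1, 0, 1]
model = learn_color_model(pixels, mask)
print(likelihood((198, 35, 40), model) > likelihood((25, 190, 35), model))  # True
```

In the full pipeline, one such histogram is learned per body part and then plugged back into the pictorial structures matching as the appearance term.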

We inspected the results manually. Our approach wrongly localized about 20% of the training data with a human in a usual pose. About 55% of the results were correct up to a small deviation and the remaining 25% were absolutely correct.


Figure 5.7: Examples of the Graph Cut segmentation on our dataset. The first five are good segmentations; in the sixth image, part of the background pixels is segmented as foreground.

Figure 5.8: Examples of the Graph Cut segmentation on the INRIA dataset. The results are mostly not good.

5.3.2 Stickman data set

The data comes from the TV show Buffy the Vampire Slayer and was prepared for [7]. We used the annotations to detect the rectangular area containing the human body; the coordinates of the body parts are part of the annotation data [7]. We do not use our implementation of the fast human detector, because the majority of the images contain only the upper half of the human body, while our implementation was trained only on full bodies. The rest of each body is either outside the image or occluded by another object in the scene. The dataset contains frames from 4 episodes; each episode contains around 100 images. The deformable human body model was learned from hand labeling of 20 images. The model consists of 6 body parts (torso, head, left/right arm and left/right forearm).

We do not run the whole algorithm on this dataset; only sampling from the posterior probability of the pictorial structures matching, using the learned color model of the whole body, is applied. Human bodies are hard to separate from the background, which is why the results of the Graph Cut segmentation are often incorrect. The data is not segmented well, and the learned color models of the human body are consequently incorrect, too. Figure 5.10 shows example results on the Stickman dataset.

We inspected the results manually. The detection was not successful on this dataset: in about 60% of the examples the detection was absolutely wrong, 20% were correct up to a small deviation, and the remaining 20% were detected correctly. The testing images contain persons with variable body part proportions, but our detection framework uses only one set of proportions for all detected persons, which makes the results worse.

The Graph Cut segmentation is not used in [7]; they use a more sophisticated method, GrabCut [19]. They also use spatio-temporal inference, which improves the results, and they do not use color-based segmentation for the body part detector. They presented better detection results than we achieved in our work: their method correctly estimates 56% of the body parts in the testing images.


Figure 5.9: Results of the unsupervised learning of the appearance of human body parts. The color model of the body parts is learned by our method. The final result is selected as the MAP estimate with maximal probability.


Figure 5.10: Some examples of the human part detector applied to the Stickman dataset. Most of the detections are incorrect. The images show manually selected samples.


Chapter 6

Conclusions

In this chapter we summarize the results of this thesis. The goal of the thesis was the implementation and description of methods for human detection in images.

We implemented the fast human detector using HoG descriptors [26]. The detection runs nearly in real time; this is achieved by integrating the concept of a cascade of rejectors.
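The cascade-of-rejectors idea can be summarized by this short illustrative Python sketch. The stage scores and thresholds are hypothetical stand-ins; the actual detector evaluates HoG-based features selected as in [26]:

```python
def evaluate_cascade(window, stages):
    """Run a detection window through the cascade; reject at the first failing stage."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # early rejection: later (more expensive) stages never run
    return True           # the window survived every stage

# Toy stages: cheap weak scores first, the expensive classifier last.
stages = [
    (lambda w: w["edge_energy"], 0.2),  # very cheap test, rejects most windows
    (lambda w: w["grad_hist"],   0.5),  # somewhat more expensive
    (lambda w: w["svm_score"],   0.0),  # full evaluation, reached rarely
]

windows = [
    {"edge_energy": 0.05, "grad_hist": 0.9, "svm_score": 1.0},  # rejected by stage 1
    {"edge_energy": 0.9,  "grad_hist": 0.3, "svm_score": 1.0},  # rejected by stage 2
    {"edge_energy": 0.9,  "grad_hist": 0.8, "svm_score": 0.7},  # accepted
]
detections = [evaluate_cascade(w, stages) for w in windows]
print(detections)  # → [False, False, True]
```

Because the vast majority of scanned windows contain no person, most of them are discarded by the first cheap stages, which is what makes near-real-time scanning possible.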

We tested the detection system on real images containing persons in various poses and with various cluttered backgrounds. The detector provides similar results to [5].

We extended the implementation of the pictorial structures matching described in [14], making it possible to use multiple scales of a person's height. The integration with the fast human detector makes the searched state space smaller and speeds up the computation. We tested this method on our own dataset containing one person in different poses; the different poses mean we could work with various heights of a person.
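The state-space reduction can be illustrated as follows. This Python sketch is ours for exposition only; the location grid, the bounding box, the reference person height of 200 px, and the tolerance are hypothetical values, not those of the implementation:

```python
def restrict_state_space(locations, scales, bbox, scale_tolerance=0.25):
    """Keep only root locations inside the detector bounding box and
    scales consistent with the height of the detected window."""
    x0, y0, x1, y1 = bbox
    expected = (y1 - y0) / 200.0  # hypothetical reference person height in pixels
    loc_ok = [(x, y) for (x, y) in locations if x0 <= x <= x1 and y0 <= y <= y1]
    scale_ok = [s for s in scales
                if abs(s - expected) / expected <= scale_tolerance]
    return loc_ok, scale_ok

# Toy numbers: the full image grid vs. one detector window.
all_locations = [(x, y) for x in range(0, 640, 20) for y in range(0, 480, 20)]
all_scales = [0.6, 0.8, 1.0, 1.2]
bbox = (200, 100, 320, 340)  # as reported by the fast human detector
locs, scales = restrict_state_space(all_locations, all_scales, bbox)
print(len(locs), len(all_locations), scales)
```

Only the surviving locations and scales are then passed to the pictorial structures matching, which is where the speed-up comes from.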

The results are good when the person faces the camera upright in the image. The detector often misses when the person is in an unusual pose. The method is limited because the body part appearance has to be learned first.

We proposed a method which learns the appearance of human body parts in an unsupervised way. The method integrates the fast human detector, the Graph Cut segmentation, and the pictorial structures matching. The color properties of human body parts prove useful when a video sequence is tracked or for the identification of persons in images.

We tested this method on our own dataset and on the more challenging data of [7], with various persons, colors of clothing, and cluttered backgrounds.


Chapter 7

Appendices

7.1 Implementation

We implemented the described detectors in MATLAB 7.5. The source codes are platform independent; we tested them under Linux and Windows XP.

The source code of the fast human detector is implemented as a MEX function, which is supported by MATLAB. This makes fast evaluation of the testing images possible.

We used external sources for some tasks. The software and datasets used are listed below.

• Pictorial structure matching [14].

• SVM light [12], available on http://svmlight.joachims.org/.

• Statistical Pattern Recognition Toolbox for Matlab [24], available on http://cmp.felk.cvut.cz/cmp/software/stprtool/.

• Graph Cut segmentation [2], [13] and [1], available on http://www.cs.cornell.edu/~rdz/graphcuts.html.

• GraphCut MATLAB wrapper, available on http://www.wisdom.weizmann.ac.il/~bagon/matlab.html.

• Image Processing, Analysis, and Machine Vision: A MATLAB Companion [23], available on http://cmp.felk.cvut.cz/.

• INRIA dataset [5], available on http://pascal.inrialpes.fr/data/human/.


7.1.1 Fast human detector

Learning of cascade

go compile hog.m - compilation of the MEX files.
configdata.m - settings needed for the learning process.
go prep neg.m - preparation of the negative example set from the INRIA dataset.
go prep pos.m - preparation of the positive example set from the INRIA dataset.
go train.m - training of the cascade.

Run detector

go compile hog.m - compilation of the MEX files.
configdata.m - settings needed for the classification.
go - runs the fast human detector.

7.1.2 Pictorial structures matching

go compile hog.m - compilation of the MEX files.
configdata.m - settings needed for the learning and matching.
go labeling training data.m - labeling of the training data.
go count histograms.m - computation of color histograms from the labeled training data.
go learn parameters.m - learning of parameters from the training data (the color appearance of the human body parts and the deformable configuration of the human body).
go ps.m - runs the pictorial structures matching.

7.1.3 Unsupervised learning of appearance

go compile hog.m - compilation of the MEX files.
configdata.m - settings needed for the unsupervised learning of appearance.
go labeling training data.m - labeling of the training data.
go learn parameters.m - learning of parameters from the training data (the deformable configuration of the human body).
go gc.m - runs the unsupervised learning of appearance.


7.2 Enclosed CD

The enclosed CD is divided into five directories.

• codes: This directory contains the source codes of this diploma thesis.

• data: This directory contains our datasets.

• results: This directory contains output images from implementeddetectors.

• text: This directory contains the text of this diploma thesis (including the source codes in LATEX), the scanned assignment of this diploma thesis, and the signed statement.

• wwwdemo: This directory contains the WWW demo of the diploma thesis.


Bibliography

[1] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell., 26(9):1124–1137, 2004.

[2] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energyminimization via graph cuts. In ICCV, pages 377–384, 1999.

[3] D. Ramanan, D. A. Forsyth, and A. Zisserman. Strike a pose: Tracking people by finding stylized poses. In CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 271–278, 2005.

[4] Navneet Dalal. Finding people in images and videos. PhD thesis, Institut National Polytechnique de Grenoble, 2006.

[5] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition, pages 886–893, 2005.

[6] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, pages 55–79, 2005.

[7] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[8] D. Gavrila and V. Philomin. Real-time object detection for smart vehicles. In Conference on Computer Vision and Pattern Recognition 1999, pages 87–93, 1999.


[9] Michael H. Kutner. Applied Linear Statistical Models. The McGraw-Hill Companies, Inc., New York, NY, 2005.

[10] S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. International Journal of Computer Vision, 2001.

[11] J. Sochman and J. Matas. AdaBoost. Unpublished lecture.

[12] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.

[13] Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via graph cuts? In ECCV '02: Proceedings of the 7th European Conference on Computer Vision, Part III, pages 65–81, London, UK, 2002. Springer-Verlag.

[14] Lukas Fajt. Pictorial structural models, learning and recognition in image sequences. Master's thesis, CVUT, FEL, 2007.

[15] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. In European Conference on Computer Vision, 2004.

[16] Paul Viola, Michael J. Jones, and Daniel Snow. Detecting pedestrians using patterns of motion and appearance. In ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision, page 734, Washington, DC, USA, 2003. IEEE Computer Society.

[17] Constantine Papageorgiou and Tomaso Poggio. A trainable system for object detection. International Journal of Computer Vision, pages 15–33, 2000.

[18] F. Porikli. Integral histogram: A fast way to extract histograms in Cartesian spaces. In CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Volume 1, pages 829–836. IEEE Computer Society, 2005.

[19] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. ”GrabCut”:interactive foreground extraction using iterated graph cuts. ACM Trans.Graph., 23(3):309–314, August 2004.

[20] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.


[21] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). The MIT Press, 2001.

[22] T. Werner. Image segmentation using minimum s-t cut. Unpublished lecture.

[23] Tomas Svoboda, Jan Kybic, and Vaclav Hlavac. Image Processing, Analysis, and Machine Vision: A MATLAB Companion. Cengage Learning, 2007.

[24] V. Franc and V. Hlavac. Statistical Pattern Recognition Toolbox for Matlab, 2004.

[25] Paul A. Viola and Michael J. Jones. Rapid object detection using aboosted cascade of simple features. In CVPR, pages 511–518, 2001.

[26] Qiang Zhu, Mei-Chen Yeh, Kwang-Ting Cheng, and Shai Avidan. Fast human detection using a cascade of histograms of oriented gradients. In CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1491–1498, Washington, DC, USA, 2006. IEEE Computer Society.