Towards a Video Annotation System using Face Recognition · The nal goal would be to try to combine face detection and recognition with object tracking forwards and backwards in time

Towards a Video AnnotationSystem using Face Recognition

Lucas Lindstrom

January 18, 2014Master’s Thesis in Computing Science, 30 credits

Supervisor at CS-UmU: Petter EricsonExaminer: Fredrik Georgsson

Umea UniversityDepartment of Computing Science

SE-901 87 UMEASWEDEN

Abstract

A face recognition software framework was developed to lay the foundation for a futurevideo annotation system. The framework provides a unified and extensible interface to mul-tiple existing implementations of face detection and recognition algorithms from OpenCVand Wawo SDK. The framework supports face detection with cascade classification usingHaar-like features, and face recognition with Eigenfaces, Fisherfaces, local binary patternhistograms, the Wawo algorithm and an ensemble method combining the output of thefour algorithms. An extension to the cascade face detector was developed that covers yawrotations. CAMSHIFT object tracking was combined with an arbitrary face recognitionalgorithm to enhance face recognition in video. The algorithms in the framework and theextensions were evaluated on several different test databases with different properties interms of illumination, pose, obstacles, background clutter and imaging conditions. The re-sults of the evaluation show that the algorithmic extensions provide improved performanceover the basic algorithms under certain conditions.

ii

Contents

1 Introduction 1

1.1 Report layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Introduction to face recognition and object tracking 5

2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Face detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.1 Categories of techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.2 Cascade classification with Haar-like features . . . . . . . . . . . . . . 7

2.3 Face identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.1 Difficulties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.2 Categories of approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.3 Studied techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.4 Other techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4 Face recognition in video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4.1 Multiple observations . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4.2 Temporal continuity/Dynamics . . . . . . . . . . . . . . . . . . . . . . 20

2.4.3 3D model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.5 Object tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5.1 Object representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5.2 Image features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.5.3 Object detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.5.4 Trackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Face recognition systems and libraries 29

3.1 OpenCV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1.1 Installation and usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2 Wawo SDK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

iii

iv CONTENTS


3.3 OpenBR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30


4 System description of standalone framework 31

4.1 Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1.1 CascadeDetector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1.2 RotatingCascadeDetector . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.2 Recognizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2.1 EigenFaceRecognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2.2 FisherFaceRecognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2.3 LBPHRecognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2.4 WawoRecognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2.5 EnsembleRecognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.3 Normalizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.4 Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.4.1 SimpleTechnique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.4.2 TrackingTechnique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.5 Other modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.5.1 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.5.2 Gallery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.5.3 Renderer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.6 Command-line interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.6.1 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5 Algorithm extensions 39

5.1 Face recognition/object tracking integration . . . . . . . . . . . . . . . . . . . 39

5.1.1 Backwards tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.2 Rotating cascade detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6 Performance evaluation 43

6.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

6.2 Testing datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.2.1 NRC-IIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.2.2 News . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.2.3 NR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6.3.1 Regular versus tracking recognizers . . . . . . . . . . . . . . . . . . . . 46

6.3.2 Regular detector versus rotating detector . . . . . . . . . . . . . . . . 46

6.3.3 Algorithm accuracy in cases of multiple variable conditions . . . . . . 46

6.4 Evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.4.1 Comparison of algorithm accuracy and speed over gallery size . . . . . 47

CONTENTS v

6.4.2 Regular detector versus rotating detector . . . . . . . . . . . . . . . . 48

6.4.3 Evaluation of algorithm accuracy in cases of multiple variable conditions 50

7 Conclusion 53

7.1 Limitations of the evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

8 Acknowledgements 55

References 57

vi CONTENTS

List of Figures

2.1 Example features relative to the detection window. . . . . . . . . . . . . . . . 8

2.2 Eigenfaces, i. e., visualizations of single eigenvectors. . . . . . . . . . . . . . . 12

2.3 The first four Fisherfaces from a set of 100 classes. . . . . . . . . . . . . . . . 14

2.4 Binary label sampling points at three different radiuses. . . . . . . . . . . . . 15

2.5 A given sampling point is labeled 1 if its intensity value exceeds that of the

central pixel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.6 A number of object shape representations. . . . . . . . . . . . . . . . . . . . . 23

2.7 CAMSHIFT in action. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.1 Conceptual view of a typical application. . . . . . . . . . . . . . . . . . . . . . 31

4.2 IDetector interface UML diagram. . . . . . . . . . . . . . . . . . . . . . . . . 32

4.3 IRecognizer interface UML diagram. . . . . . . . . . . . . . . . . . . . . . . . 33

4.4 INormalizer interface UML diagram. . . . . . . . . . . . . . . . . . . . . . . . 34

4.5 ITechnique interface UML diagram. . . . . . . . . . . . . . . . . . . . . . . . . 35

4.6 SimpleTechnique class UML diagram. . . . . . . . . . . . . . . . . . . . . . . 36

4.7 TrackingTechnique class UML diagram. . . . . . . . . . . . . . . . . . . . . . 36

4.8 Gallery class UML diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.9 Renderer class UML diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.1 Example illustrating the face recognition/tracking integration. . . . . . . . . . 40

5.2 Illustrated example of the rotating cascade detector in action. . . . . . . . . . 42

6.1 The performance of each algorithm as measured by subset accuracy, . . . . . . 47

6.2 The real time factor of each algorithm as the gallery size increases. . . . . . . . 48

6.3 The performance, as measured by subset accuracy, . . . . . . . . . . . . . . . . 49

6.4 The real time factor of each algorithm as the gallery size increases. . . . . . . . 49

vii

viii LIST OF FIGURES

List of Tables

2.1 Combining rules for assembly-type techniques. . . . . . . . . . . . . . . . . . 19

6.1 NRC-IIT test results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6.2 News test results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6.3 NR test results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

ix

x LIST OF TABLES

Chapter 1

Introduction

Face recognition is finding more and more applications in modern society as time and tech-nology progresses. Traditionally, the main application area has been biometrics for securityand law enforcement purposes, similar to fingerprints. Lately, it has also been used for crimeprevention by identifying suspects in live video feeds[4][41]. With the rise of the world wideweb and the Web 2.0 in particular, an application area that is more relevant to the generalpublic has emerged; namely, the automatic annotation of metadata for images and video.By automatically analyzing and attaching metadata to image and video files, end users aregiven the power to search and sort among them more intelligently and efficiently.

Codemill AB is an Umea-based software consultant, employing around 25 employeeswith an annual turnover of 12.9 million SEK as of 2011. They developed a face recognitionplugin for the media asset management platform of a client, Vidispine AB, which theyretained ownership of. Now they want to extract the face recognition functionality into aseparate product for sophisticated, automated annotation and searching of video content.The end goal would be to create a product that combines face and voice recognition forthe identification of individuals present in a video clip, speech recognition for automaticsubtitling and object recognition for the detection and identification of significant signs,e. g. letters or company logos. A product like this could have broad application areas,with everything from automatically annotating recordings of internal company meetings foreasy cataloguing to annotating videos uploaded to the web to increase the power of searchengines. A first step towards that goal is the extraction of the existing face recognitionfunctionality of the Vidispine platform into a standalone application that would serve as thebasis for the continued development of the future product.

The focus of this thesis lies foremost in the extraction of the Vidispine face recognitionmodule into a standalone software package. A secondary goal was to attempt to improve theaccuracy and/or performance of the existing face recognition system using existing softwarelibraries. In particular, possibilities for utilizing object tracking and profile face recognitionwere to be explored.

1.1 Report layout

Chapter one gives an introduction to the background of the project, the purpose and thegoals. The specific problem that is being addressed is described and an overview of themethods employed is given. A short summary of related work that has been investigatedover the course of the project is also presented.

1

2 Chapter 1. Introduction

Chapter two provides a quick introduction to the theory behind face recognition systems.The general problems of face detection and face recognition are described, as well as briefdescriptions of the most common approaches to solving them. This chapter also includes abrief introduction to object tracking.

Chapter three lists and describes the most common existing face recognition libraries andsystems. Special emphasis is given to OpenCV and Wawo, which are the libraries evaluatedin this report.

Chapter four gives a detailed system description of the face recognition system developedin the course of this project. In particular, the modular nature of the system is described,as well as how it can be extended with additional algorithms and techniques in the future.

Chapter five describes an original integration of face recognition algorithms and objecttracking, and discusses its merits and flaws. This chapter also describes an extension tobasic face detection techniques by rotating the input images prior to detection.

Chapter six describes the methods, metrics and test data used in the evaluation of thedifferent algorithms in the system implementation. The results are presented and discussed.

Chapter seven summarizes the conclusions drawn from the results of the evaluation,discusses problems encountered over the course of the project and suggestions for futurework.

1.2 Problem statement

The primary task of this project was to extract the Vidispine face recognition plugin moduleinto a standalone application. Possibilities for improving the accuracy and performance ofthe system was to be investigated and different options systematically evaluated. In practice,this would mainly consist of finding existing face recognition libraries and evaluating theirrelative accuracy and performance.

The research questions intended to be addressed in this report are:

1. Using currently available face recognition libraries, using standard test databases andoriginal test databases suitable for the intended application, what is the optimal trade-off between accuracy and performance of the task of face recognition?

2. Can frontal face recognition and profile face recognition be combined to improve thetotal accuracy, and at what performance cost?

3. Can face detection and recognition be combined with object tracking forwards andbackwards in time to improve accuracy, and at what performance cost?

1.3 Goals

The first goal of this project is to extract the Vidispine face detection and recognitionplugin into a standalone application. The system design of the application should be highlymodular, to allow for low-cost replacement of the underlying libraries. The applicationshould accept a gallery of face images or videos for a set of subjects, as well as a probevideo. The output will be an annotation of the probe video, describing at different pointsin time which subjects are present.

The second goal is to conduct an evaluation of the tradeoff between performance andaccuracy of a number of common libraries and algorithms, for different parameter configu-rations and under different scene and imaging conditions. The third goal is to investigate

1.4. Methods 3

the possibility of combining frontal face recognition with profile recognition to improve thetotal recognition accuracy and what the relative performance of such a method would be.

The final goal would be to try to combine face detection and recognition with objecttracking forwards and backwards in time to improve accuracy and to possibly cover partsof the video during which the face is partially or completely occluded.

1.4 Methods

A study of the literature on face detection, recognition and tracking is performed to gainunderstanding of the inner workings of the libraries, what the challenges to successful facedetection and recognition are, the significance of the parameters for different algorithmsand how they can be used to improve the accuracy and performance of the system. Thestandalone application is written in C++, for multiple different reasons. To start with,the original Vidispine plugin was written in C++, and using the same language makes itpossible to reuse some code. In addition, C++ is widely considered to be a good choicefor performance-intensive applications while still giving the programmer the tools to createscalable, high-level designs. Finally, C++ being a massively popular programming language,the likelihood of finding compatible face detection, face recognition, object tracking andimage processing libraries is high.

Existing test databases and protocols are investigated in order to produce results thatcan be compared with existing literature. To the extent that it is possible, the evaluation isperformed with standard methods, but when necessary, original datasets that resemble theintended use cases are created and used. The optimal configuration of libraries, algorithmsand parameters are implemented as the default of the resulting system, for presentation andlive usage purposes.

1.5 Related work

In his master’s thesis, Cahit Gurel presented a face recognition system including subsystemsfor image normalization, face detection and face recognition using a feed-forward artificialneural network[27]. Similarly to the present work, Gurel aimed at creating a completeintegrated software system for face identification instead of simply presenting a single algo-rithm. Unlike the present work, however, Gurel’s system does not support different choicesof method for each step in the face identification pipeline. Hung-Son Le, in his Ph. D. thesis,presented a scalable, distributed face database and identification system[35]. The systemprovides the entire face identification pipeline and spreads the various phases, such as stor-age, user interaction, detection and recognition over different processes, allowing differentphysical servers to handle different tasks. This system only implements Le’s original algo-rithms, while the present work interfaces with different underlying libraries implementing avariety of existing algorithms. These can easily be combined in a multitude of configurationsaccording to the requirements of the intended application. Acosta et al[15] presented an in-tegrated face detection and recognition system customized for video indexing applications.The face detection is performed by an original algorithm based on segmenting the inputimage into color regions and using a number of constraints such as shape, size, overlappings,texture and landmark features to distinguish face from non-face. The recognition stageconsists of a modified Eigenfaces approach based on storing multiple views of each gallerysubject. Acosta’s design and choice of algorithms are tuned to the task of face recognitionin video, but again provides only a single alternative.

4 Chapter 1. Introduction

Chapter 2

Introduction to face recognitionand object tracking

Face recognition is a field that deals with the problem of identifying or verifying the identityof one or more persons in either a static image or a sequence of video frames by making acomparison with a database of facial images. Research has progressed to the point wherevarious real-world applications have been developed and are in active use in different set-tings. The typical use case has traditionally been biometrics in security systems, similar tofingerprint or iris analysis, but the technology has also been deployed for crime preventionmeasures with limited success[4][41]. Recently, face recognition has also been used for websearching in different contexts[15][36].

The complexity of the problem varies greatly depending on the conditions imposed bythe intended application. In the case of identity verification, the user can be assumed to becooperative and makes an identity claim. Thus, the incoming probe image only needs to becompared to a small subset of the database, as opposed to the case of recognition, wherethe probe will be compared to a potentially very large database. On the other hand, anauthentication system will need to operate in near real-time to be acceptable to users whilesome recognition systems could operate during much longer time frames.

In general, face recognition can be divided into three main steps, although depending onthe application, not all steps may be required:

1. Detection, the process of detecting a face in a potentially cluttered image.

2. Normalization, which involves transforming, filtering and converting the probe im-age into whatever format the face database is stored in.

3. Identification, the final step where the normalized probe is compared to the facedatabase.

2.1 Preliminaries

The following general notation is used in this thesis: x and y represent image coordinates, Ian intensity image of dimensions r× c and I(x, y) is the intensity at position (x, y). ~Γ is therc-dimensional vector acquired by concatenating the rows of an image. i,j and k representgeneric sequence indices and l,m,n sequence bounds. indP is the indicator function, whichequals 1 if proposition P is true, and otherwise equals 0.

5

6 Chapter 2. Introduction to face recognition and object tracking

2.2 Face detection

In order to perform face recognition, the face must first be located in the probe image.The field of face detection deals with this problem. The main task of face detection canbe defined as follows: given an input image, determine if a face is present and if so, itslocation and boundaries. Many of the factors that complicate this problem are the same asfor recognition:

– Pose: The orientation of the face relative to the camera may vary.

– Structural components: Hairstyle, facial hair, glasses or other accessories can varygreatly between individuals.

– Facial expression: A person can wear a multitude of facial expressions like smiling,frowning, screaming, etc.

– Occlusion: Other objects, including other faces, can partially occlude the face.

– Imaging conditions: Illumination and camera characteristics can vary between images.

The following sections describe the different categories of face detection techniques andthe technique primarily used in this project, cascade classification with Haar-like features.

2.2.1 Categories of techniques

Techniques that deal with detecting faces in single intensity or color images can be roughlyclassified into the following four categories[40]:

– Knowledge-based methods: These methods utilize human knowledge of what consti-tutes a face. Formal rules are defined based on human intuitions of facial propertieswhich are used to differentiate regions that contain faces from those that do not.

– Feature invariant approaches: Approaches of this type attempt to first extract facialfeatures that are invariant under differing conditions from an image and then infer thepresence of a face based on those.

– Template matching methods: Standard face pattern templates are manually con-structed and stored, either of the entire face or separate facial features. Correlationsbetween input images and the stored patterns are computed which detection is basedon.

– Appearance-based methods: These methods differ from template matching methodsin that instead of manually constructing templates, they are learned from a set oftraining images in order to capture facial variability.

It should be noted that not all techniques fall neatly into a single category, but rather,some clearly overlap two or more categories. However, these categories still provide a usefulconceptual structure for thinking about face detection methods.

2.2. Face detection 7

2.2.2 Cascade classification with Haar-like features

A very popular method of face detection, and object detection in general, is the cascadeclassifier with Haar-like features introduced by Viola and Jones in 2001[61]. The concept is tocharacterize a subwindow in an image with a sequence of simple classifiers, each consistingof one or more features, described below. Each level in the cascade is constructed byselecting the most distinguishing features out of all possible features using the AdaBoostalgorithm. Each individual classifier in the cascade performs relatively poorly, but in concertthe cascade achieves very good detection rates. Numerous features of this algorithm makeit very efficient, such as immediately discarding subwindows that are rejected by a classifierearly in the sequence, as well as computing the value of a simple classifier on a specializedimage representation in constant time.

Haar-like features

The features used by the method are illustrated in figure 2.1. They are called ”Haar-like”because they are reminiscent of Haar basis functions which have been used previously[14].The value of each feature is the sum of the pixel intensities in the dark rectangles subtractedfrom the sum of the intensities in the white rectangles. The features can be of any size withina detection window of a fixed dimension, but the original paper used 24x24 pixels. In thiscase, the total number of features are approximately 180,000. A classifier based on a singlefeature is defined as

hj(W ) = indpjfj(W ) < pjθj

where fj is the feature, W is the detection window, θj the threshold and pj the parityindicating the direction of the inequality sign. The false negative and false positive rateof the classifier can be modulated by varying the threshold, which will become importantlater.

Integral image

The features described above can be computed in constant time using a specialized imagerepresentation called an integral image. The value of the integral image at location x, y issimply the sum of the rectangle above and to the left of the location or

II(x, y) =∑

x′≤x,y′≤y

I(x′, y′)

Using this image representation, the sum of pixels in an arbitrary rectangle can becomputed in only four array references. Due to the fact that the rectangles in the featuresare adjacent, a feature with two rectangles can be computed in six array references, a featurewith three rectangles can be computed in eight references and a feature with four rectanglescan be computed in only nine references. The integral image can be computed in a singlepass using the recurrence relations

s(x, y) = s(x, y − 1) + I(x, y)

II(x, y) = II(x− 1, y) + s(x, y)

where s(x, y) is the cumulative column sum, s(x,−1) = 0 and II(−1, y) = 0.


Figure 2.1: Example features relative to the detection window.

AdaBoost learning

Given a set of positive and negative training samples of the same size as the detectionwindow, in this case 24x24 pixels, we want to select a subset of the total 180,000 featuresthat best distinguishes between them. We do this using the generic machine learning meta-algorithm, AdaBoost[19], which is also used in conjunction with many other algorithms toimprove their performance. The general idea of the algorithm is to build classifiers that aretweaked in favor of samples misclassified by previous classifiers. This is done by assigning aweight to each sample, which is initially equal for all samples, and for each round select thefeature that is able to minimize the sum weighted error of the predictions. The weights arethen adjusted so that the samples that were misclassified by the selected classifier receive agreater weight, and in subsequent rounds classifiers that are able to correctly classify thosesamples become more likely to be selected. The resulting set of features are then integratedinto a composite classifier:

1. Given a set of sample images (I1, b1), . . . , (In, bn) where bi = 0, 1 for negative andpositive samples respectively.

2. Initialize weights w1,i = 12m ,

12l for bi = 0, 1 respectively, where m and l are the number

of negative and positive samples respectively.

3. For rounds t = 1, . . . , T :

(a) Normalize the weights

2.2. Face detection 9

wt,i =wt,i∑nj=1 wt,j

so that wt is a probability distribution.

(b) For each feature, j, train a classifier hj . The error is evaluated with respect towt,

εj =∑i

wt,i |hj(Ii)− bi|

(c) Choose the classifier ht with the lowest error εt.

(d) Update the weights:

wt+1,i = wt,iβ1−eit

where ei = 0 if sample Ii is classified correctly, ei = 1 otherwise, and βt = εt1−εt .

4. The final composite classifier is:

h(I) = indT∑t=1

αtht(I) ≥ 1

2

T∑t=1

αt

where αt = log 1βt

.

Training the cascade

As was previously mentioned, the cascade consists of a sequence of classifiers. Each classifieris applied in turn to the detection window, and if any one rejects it, the detection windowis immediately discarded. This is desirable because the large majority of detection windowswill not contain a face and a large amount of computation time can be saved by discardingtrue negatives early. For this reason, it is important for each individual stage to have avery low false negative rate, as this rate will be compounded as the window is passed downthe cascade. For example, in a 32-stage cascade, each stage will need a detection rate of99,7% to achieve a total detection rate of 90%. However, the reverse applies to the falsepositive rate, which means that each stage can have a fairly high false positive rate andstill achieve a low compounded rate. As previously stated, these rates can be modulatedby modifying the threshold parameter, and improved by adding additional features (i.e.running more AdaBoost rounds). However, the total performance of the cascade classifieris highly dependent on the number of features, so in order to maintain efficiency we wouldlike to keep this number low.

Thus, we select a desired final false positive rate and a required false positive rate γ perstage and run the AdaBoost method the required number of rounds to achieve close to 0%false negative rate and γ false positive rate when the θ has been modulated. The rates aredetermined by testing the classifier on a validation set. For the first stage, the entire sampleset is used and a very low number of features is likely to be needed. The samples used forthe next stage are those that the first stage classifier misclassified, which are likely to be”harder” and thus require more features to achieve the desired rates. This is acceptablebecause the large majority of detection windows will be discarded by the earliest stages,


which are also the fastest. We keep adding stages until the final desired detection/falsepositive rate has been achieved.

Since computing the value of a feature can be done in constant time regardless of the size,the resulting classifier has the interesting property of being scalable to any size. When weapply the detector in practice, we can scan the input image by placing the detection windowat different locations and scale it to different sizes. Thus, we can easily trade performancefor accuracy by doing a more or less coarse scan.

2.3 Face identification

In this section, the process of determining the identity of a detected face in an image, orverifying an identity claim, is introduced. First, the main obstacles to successful identifi-cation are discussed and the various categories of approaches are described. After that, adetailed technical description of the techniques used in this project is given. Finally, briefdescriptions of other techniques are listed.

2.3.1 Difficulties

There is a variety of factors that can make the problem of facial recognition or verificationmore difficult. The illumination of the probe image commonly varies greatly and this cancripple the performance of certain techniques. For some use cases the user can be assumedto look directly at the camera but in many others the view angle could be different, andalso vary. The performance of some techniques are dependent on the pose of the face beingin a certain angle, and are more or less sensitive to deviations from the preferred angle.For some scenarios the subject cannot be relied on to have a neutral facial expression, andsome techniques are very sensitive to this complication. It might also be of interest to allowfor variation in the style of the face, such as facial hair, hairstyle, sunglasses or articlesof clothing. Any combination of these factors might potentially need to be dealt with aswell. Many solutions to these issues have been proposed and some techniques are markedlybetter at dealing with some types of variation. In general, it seems that the performance offace recognition systems decreases significantly whenever multiple sources of variation arecombined in a single probe. When conditions are ideal, however, current techniques workvery well.

2.3.2 Categories of approaches

Techniques for face recognition can be classified in a multitude of ways. Some of the mostcommon categorizations are briefly described below[6].

Fully versus partially automatic

A system that performs all three steps listed earlier, detection, normalization and identifi-cation, is referred to as fully automatic. It is given only a facial image and performs therecognition process unaided. A system that assumes the detection and normalization stepshave already been performed is referred to as partially automatic. Commonly, it is given afacial image and the coordinates of the center of the eyes.

2.3. Face identification 11

Static versus video versus 3D

Methods can be subdivided by the type of input data they utilize. The most basic formof recognition is performed on a single static image. It is the most widespread approachboth in literature and in real-world applications. Recognition can also be applied to videosequences, which give the extra advantage of multiple perspectives and possibly imagingconditions, as well as temporal continuity. Some scanners, such as infrared cameras, caneven provide 3D geometric data which some techniques make use of.

Frontal versus profile versus view-tolerant

Some techniques are designed to handle only frontal images. This is the classical approachand the alternatives are more recent developments. View-tolerant techniques allow for avariety of poses and are often more sophisticated, taking the underlying geometry, physicsand statistics into consideration. Techniques that handle profile images are rarely used forstand-alone applications, but can be useful for coarse pre-searches to reduce the computa-tional load of a more sophisticated technique, or in combination with another technique toimprove recognition precision.

Global versus component-based approach

A global approach is one in which a single feature vector is computed based on the entire faceand fed as input to a classifier. These tend to be very good at classifying frontal images.However, they are not robust to pose changes since global features tend to be sensitiveto translation and rotation of the face. This weakness can be addressed by aligning theface prior to classification. The alternative to the global approach is to classify local facialcomponents independently of each other, and thus allowing a flexible geometrical relationbetween them. This makes component-based techniques naturally more robust to posechanges.

Invariant features versus canonical forms versus variation-modeling

As has been previously stated, variation in appearance depending on illumination, pose,facial hair, etc., is the central issue to performing face recognition. Approaches to dealingwith it can be divided into three main categories. The first focuses on utilizing features thatare invariant to the changes being studied. The second seeks to either normalize away thevariation using clever image processing or to synthesize a canonical or prototypical versionof the probe image and performing classification on that. The third attempts to create aparameterized model of the variation and estimating the parameters for a given probe.

2.3.3 Studied techniques

This section gives an overview of the major face recognition techniques that have beenevaluated in this report, and describes the advantages and disadvantages of each.

Eigenfaces

Eigenfaces are one of the earliest successful and most thoroughly investigated approaches toface recognition[59]. Also known as Karhunen-Loeve expansion or eigenpictures, it makesuse of principal component analysis (PCA) to efficiently represent pictures of faces. A setof eigenfaces are generated by performing PCA on a large set of images representing human


Figure 2.2: Eigenfaces, i. e., visualizations of single eigenvectors.

faces. Informally, the eigenfaces can be considered a set of ”standardized face ingredients”derived by statistical analysis of a set of real faces. For example, a real face could berepresented by the average face plus 7% of eigenface 1, 53% of eigenface 2 and -3% ofeigenface 3. Interestingly, only a few eigenfaces combined are required to arrive at a fairapproximation of a real human face. Since an individual face is represented only by a vectorof weights, one for each eigenface, this representation is highly space-efficient. Empiricalresults show that eigenfaces are robust to variations in illumination, less so to variations inorientation and even less to variations in size[33], but despite this illumination normalizationis usually required in practice[6].

Mathematically, we wish to find the principal components of the distribution of faces,represented by the covariance matrix of the face images. These eigenvectors can be thoughtof as the primary distinguishing features of the image. Each pixel element contributes to alesser or greater extent to each eigenvector, and this allows us to visualize each eigenvectoras a ghostly image, which we call eigenfaces (see figure 2.2). Each image in the gallery canbe represented exactly in terms of a linear combination of all eigenfaces, but can also beapproximated by combining only a subset of the eigenvectors. The ”best” approximation isachieved by using the eigenvectors with the largest eigenvalues, as they account for most ofthe variance in the gallery set. This feature can be used to improve computational efficiencywithout necessarily losing much precision. The bestM ′ eigenvectors span anM ′-dimensionalsubspace of all possible images, a ”face space”[59].

Algorithm The algorithm can be summarized as follows:

1. Acquire the gallery set and compute its eigenfaces, which define the face space.


2. When given a probe image, project it onto each of the eigenfaces in order to computea set of weights to represent it in terms of those eigenfaces.

3. Determine if the image contains a known face by checking if it is sufficiently close tosome gallery face class, or unknown if the distance exceeds some threshold.

Let the gallery set of face images be ~Γ1, ~Γ2, ~Γ3, . . . , ~Γn. The average face of the set isdefined by ~Ψ = 1

n

∑ni=1

~Γi. Each face differs from the average by the vector ~Φi = ~Γi − ~Ψ.This set of vectors is then subjected to principal component analysis, which seeks a setof M orthonormal vectors ~uj and their associated eigenvalues λj . The vectors ~uj andscalars λj are the eigenvectors and eigenvalues, respectively, of the covariance matrix C =1n

∑nk=1

~Φi ~ΦiT

= AAT where A = [ ~Φ1~Φ2 . . . ~Φn].The matrix C is rc × rc and computing

the eigenvectors and eigenvalues is intractable for typical images. However, this can beworked around by solving a smaller n × n matrix problem and taking linear combinationsof the resulting vectors (see [58] for details). An arbitrary number of eigenvectors M ′ withthe largest associated eigenvalues are selected. The probe image Γ is transformed into itseigenface components by the simple operation ωk = ~uk

T (~Γ − ~Ψ) for k = 1, . . . ,M ′. The

weights form a vector ~ΩT = [ω1ω2 . . . ωM ′ ] that describes the contribution of each eigenvectorin representing the input image. This vector is then used to determine which face class bestdescribes the probe. The simplest method is simply to select the class k that minimizes the

Euclidean distance εl =∣∣∣~Ω− ~Ωl

∣∣∣ where Ωl is a vector describing the lth face class, provided

it falls below some threshold θε. Otherwise the face is classified as ”unknown”.

Fisherfaces

The Eigenfaces method projects face images to a low-dimensional subspace with axes thatcapture the greatest variance of the input data. This is desirable, but not necessarilyoptimal for classification purposes. For example, the difference in facial features betweentwo individuals is a type of variance that one would like to capture, but the difference inillumination between two images of the same individual is not. A different but relatedapproach is to project the input image to a subspace which minimizes within-class variationbut maximizes inter-class variation. This can be achieved by applying linear discriminantanalysis (LDA), a technique that traces back to work by R. A. Fisher[17]. The resultingmethod is thus called Fisherfaces[44].

Given C classes, assume that the data in each class are of homoscedastic normal dis-tributions (i.e., each class is normally distributed with equal covariance matrices to each

other). We denote this ~Γi ∼ N(~µi, ~Σ) for a sample of class i. We want to find a subspace ofthe face space which minimizes the within-class variation and maximizes the between-classvariation. Within-class differences can be estimated by the within-class scatter matrix whichis given by

Sw =

C∑j=1

nj∑i=1

( ~Γij − ~µj)( ~Γij − ~µj)T

where ~Γij is the ith sample of class j, ~µj is the mean of class j, and nj is the number ofsamples in class j. Likewise, the between-class differences are computed using the between-class scatter matrix,


Figure 2.3: The first four Fisherfaces from a set of 100 classes.

Sb =

C∑j=1

( ~µj − ~µ)( ~µj − ~µ)T

where ~µ is the mean of all classes. We now want to find the matrix V for which |V TSbV ||V TSwV |

is maximized. The columns ~vi of V corresponds to the basis vectors of the desired subspace.This can be done by the generalized eigenvalue decomposition SbV = SwV Λ where Λ is thediagonal matrix of the corresponding eigenvalues of V . The eigenvectors of V associatedwith non-zero eigenvalues are the Fisherfaces.[16][44]

Local binary pattern histograms

The LBP histograms approach builds on the idea that a face can be viewed as a compositionof local subpatterns that are invariant to monotonic grayscale transformations[62]. Byidentifying and combining these patterns, a description of the face image which includesboth texture and shape information is obtained. The LBP operator labels each pixel in animage with a binary string of length P by selecting P sampling points evenly distributedaround the pixel at a specific radius r. If the sampling point exceeds the intensity of thecentral pixel, the corresponding bit in the binary string is 1, and otherwise 0. If the samplingpoint is not in the center of a pixel, bilinear interpolation is used to acquire the intensityvalue of the sampling point.

Let fl be the labeled image. We can define the histogram for the labeled image as

Hi =∑x,y

indfl(x, y) = i, i = 0, 1, . . . , n− 1


where n is the number of different labels produced by the LBP operator. This histogramcaptures the texture information of the subpatterns of the image. We can also capturespatial information as well by subdividing the image into regions R0, R1, . . . , Rm−1. Thespatially enhanced histogram becomes

Hi,j =∑x,y

indfl(x, y) = iind(x, y) ∈ Rj, i = 0, 1, . . . , n− 1, j = 0, 1, . . . ,m− 1.

We can classify a probe image by comparing the corresponding histograms of the probeand the gallery set using some dissimilarity measure. Several options exist, including

– Histogram intersection: D(S,M) =∑imin(S,M).

– Log-likelihood statistic: L(S,M) =∑i Silog(Mi).

– Chi-square statistic: χ2(S,M) =∑i(Si−Mi)

2

Si+Mi

Each of these can be extended to the spatially enhanced histogram by simply summingover both i and j.[8]

Figure 2.4: Binary label sampling points at three different radiuses.

Figure 2.5: A given sampling point is labeled 1 if its intensity value exceeds that of thecentral pixel.

Hidden Markov models

Hidden Markov models (HMMs) can be applied to the task of face recognition by treatingdifferent regions of the human face (eyes, nose, mouth, etc) as hidden states. HMMs requireone-dimensional observation sequences, and thus the two-dimensional facial images need tobe converted into either 1D temporal or spatial sequences. This way, an HMM is createdfor each subject in the database, the probe image is fed as an observation sequence to eachand the match with the highest likelihood is considered best.


Wawo The core face recognition algorithm of the Wawo system is based on an extendedHMM scheme called Joint Multiple Hidden Markov Models (JM-HMM)[35]. The primaryobjective of the algorithm is capturing the 2D nature of face images while only requiring asingle gallery image per subject to achieve good performance. The input image is treatedas a set of horizontal and vertical strips. Each strip consists of small rectangular blocks ofpixels and each strip is managed by an individual HMM. When an HMM subsystem of aprobe is to be compared to the corresponding one in a gallery image, the block strips of eachimage are first matched according to some similarity measure, and the observation sequenceis formed by the indices of the best-matching blocks.

2.3.4 Other techniques

These are approaches from the literature that have not been evaluated in this report.

Neural networks

A variety of techniques based on artificial neural networks have been developed. The reasonfor the popularity of artificial neural networks may be due to the non-linearity allowingfor more effective feature extraction than eigenface-based methods. The structure of thenetwork is essential to the success of the system, and which is suitable is dependent onthe application. For example, multilayer perceptrons and convolutional neural networkshave been applied to face detection, and a multi-resolution pyramid structure[52][49][32]to face verification. Some techniques combine multiple structures to increase precision andcounteract certain types of variation[49]. A probabilistic decision-based neural network(PDBNN) has been shown to function effectively as a face detector, eye localizer and facerecognizer[48]. In general, neural network approaches suffer from computational complexityissues as the number of individuals increases. They are also unsuitable for single modelimage cases due to the fact that they tend to require multiple model images to train tooptimal parameter settings.

Dynamic link architecture

Dynamic link architectures are an extension of traditional artificial neural networks[6]. Mem-orized objects are represented by sparse graphs, whose vertices are labeled with a multires-olution description in terms of a local power spectrum, and whose edges are geometricaldistance vectors. Distortion invariant object recognition can be achieved by employing elas-tic graph matching to find the closest stored graph. The method tends to be superiorto other methods in terms of coping with rotation variation, but the matching process iscomparatively expensive.

Geometrical feature matching

This technique is based on the computation of a set of geometrical features from the pictureof a face. The overall configuration can be represented by a vector containing the positionand size of a set of main facial features, such as eyes, eyebrows, mouth, face outline, etc.It has been shown to be successful for large face databases such as mug shot albums[30].However, it is dependent on the accuracy of automated feature location algorithms, whichgenerally do not achieve a high degree of accuracy and require considerable computationaltime.


3D model

The 3D face model is based on a vector representation of a face that is constructed suchthat any convex combination of shape and texture vectors represents describes a realistichuman face. Fitting the 3D model to, or extracting it from, images can be used in two waysfor recognition across different viewing conditions:

– After fitting the model, the comparison can be based on model coefficients thatrepresent intrinsic features of shape and texture that are independent of imagingconditions[25].

– 3D face reconstruction can be employed to generate synthetic views from differ-ent angles. The views are then transferred to a second view-dependent recognitionsystem[63].

3D morphable models have been combined with computer graphics simulations of illu-mination and projection[10]. Among other things, this approach allows for modeling moresophisticated lighting conditions such as specular lighting and cast shadows (most tech-niques only consider Lambertian illumination). Scene parameters in probe images, suchas head position and orientation, camera focal length and illumination direction, can beautomatically estimated.

Line edge map

Edge information is useful for face recognition because it is partially insensitive to illu-mination variation. It has been argued that face recognition in the human brain mightmake extensive use of early-stage edge detection without involving higher-level cognitivefunctions[53]. The Line Edge Map (LEM) approach extracts lines from a face edge mapas features. This gives it the robustness to illumination variation that is characteristic offeature-based approaches while simultaneously retaining low memory requirements and highrecognition performance. In addition, LEM is highly robust to size variation. It has beenshown to be less sensitive to pose changes than the eigenface method, but more sensitive tochanges in facial expression[24].

Support vector machines

Support vector machines (SVM) is considered to be an effective method for general purposepattern recognition due to its high generalization performance without the need to addother knowledge[60]. Intuitively, given a set of points belonging to two classes, an SVMfinds the hyperplane that separates the largest possible set of points of the same class onthe same side while maximizing the distance from either class to the hyperplane. A largevariety of SVM-based approaches have been developed with regards to a number of differentapplication areas[38][22][45][9][34][42]. The main features of SVM-based approaches is thatthey are able to extract relevant discriminatory information automatically, and are robust toillumination changes. However, they can become overtrained on data sanitized by featureextraction and/or normalization and they involve a large number of parameters so theoptimization space can become difficult to explore completely.

Multiple classifier systems

Traditionally, the approach used in the design of pattern recognition systems has been toexperimentally compare the performance of several classifiers in order to select the best one.


Recently, the alternative approach of combining the output of several classifiers has emerged,under various names such as multiple classifier systems (MCSs), committee or ensembleclassifiers, with the purpose of improved accuracy. A limited number of approaches of thiskind have been developed with good results for established face databases[54][55][28][47].

2.4 Face recognition in video

Since a video clip consists of a sequence of frame images, face recognition algorithms thatapply to single still images can be applied to video virtually unchanged. However, a videosequence possesses a number of additional properties that can potentially be utilized todesign face recognition techniques with improved accuracy and/or performance over singlestill image techniques. Three properties of major importance are:

– Multiple observations: A video sequence by its very nature will yield multiple obser-vations of any probe or gallery. Additional observations means additional constraintsand potentially increased accuracy.

– Temporal continuity/Dynamics: Successive frames in a video sequence are contin-uous in the temporal dimension. Geometric continuity related to changes in facialexpression or head/camera movement, or photometric continuity related to changesin illumination provide additional constraints. Furthermore, changes in head move-ment or facial expression obey certain dynamics that can be modeled for additionalconstraints.

– 3D model: We can attempt to reconstruct a 3D model of the face using a videosequence. This can be achieved both by treating the video as a set of multiple ob-servations or by making use of temporal continuity and dynamics. Recognition canthe be based on the 3D model, which, as previously described, has the potential to beinvariant to pose and illumination.

Below, these properties, and how they can be exploited to design better face recognitiontechniques, will be discussed in detail. We will also study some existing techniques thatmake use of these properties.

2.4.1 Multiple observations

This is the most commonly used feature of video sequences. Techniques exploiting thisproperty treat the video sequence as a set of related still images but ignores the temporaldimension. The discussion below assumes that images are normalized before being subjectedto further analysis.

Assembly-type algorithms

A simple approach to dealing with multiple observations is to apply a single still imagetechnique to each individual frame of a video sequence and combining the results by somerule. In many cases the combining rule is very simple, and some common examples aregiven in table 2.1.

Let Fi; i = 1, 2, . . . , n denote the sequence of probe video frames. Let Ij ; j =1, 2, . . . ,m denote the set of gallery images. Let d(Fi, Ij) denote the distance functionbetween the ith frame of a video sequence and the jth gallery image of some single still

2.4. Face recognition in video 19

image technique. Let Ai(Fi) denote the gallery image selected by the algorithm applied tothe ith frame of the probe video.

Table 2.1: Combining rules for assembly-type techniques.

Method Rule

Minimum arithmetic mean j = argminj=1,2,...,m1n

∑ni=1 d(Fi, Ij)

Minimum geometric mean j = argminj=1,2,...,mn√∏n

i=1 d(Fi, Ij)

Minimum median j = argminj=1,2,...,m[medi=1,2,...,nd(Fi, Ij)]

Minimum minimum j = argminj=1,2,...,m[mini=1,2,...,nd(Fi, Ij)]

Majority voting j = argmaxj=1,2,...,m

∑ni=1 indAi(Fi) = j

One image or several images

Multiple observations can be summarized into a smaller number of images. For example, onecould use the mean or median image of the probe sequence, or use clustering techniques toproduce multiple summary images. After that, single still image techniques or assembly-typealgorithms can be applied to the result.

Matrix

If each frame of the probe video is vectorized by some means, the video can be representedas a matrix V = [F1F2 . . . Fn]. This representation can make use of the various methods ofmatrix analysis. For example, matrix decompositions can be invoked to represent the datamore efficiently. Matrix similarity measures can be used for recognition[43].

Probability density function

Multiple observations F1, F2, . . . , Fn can be regarded as independent realizations drawnfrom the same underlying probability distribution. PDF estimation techniques can be uti-lized to learn this distribution[23]. If both the probe and the gallery consists of video footage,PDF distance measures can be used to perform recognition. If the probe consists of videoand the gallery of still images, recognition becomes a matter of determining which galleryimage is most likely to be generated from the probe distribution. In the reverse case, wherethe gallery consists of video and the probe of a still image, recognition tests which gallerydistribution is most likely to generate the probe.

Manifold

Face appearances of multiple observations form a highly nonlinear manifold. If we cancharacterize the manifold[18], recognition reduces to (i) comparing two manifolds if boththe probe and gallery is video, (ii) comparing the distance between a data point and variousmanifolds if the probe is a still image and the gallery is video or (iii) comparing the distancebetween various data points and a manifold if the probe is video and the gallery consists ofstill images.


2.4.2 Temporal continuity/Dynamics

Successive frames in a video clip are continuous along the temporal axis. Temporal continu-ity provides an additional constraint for modeling face appearance. For example, smoothnessof face movement can be used in face tracking. It was previously stated that these tech-niques assume that the probe and gallery have been prenormalized, but it can be noted thatin the case of video, face tracking can be used instead of face detection for the purposes ofnormalization due to the temporal continuity.

Simultaneous tracking and recognition

Zhou and Chellappa proposed[64] an approach that models tracking and recognition in a sin-gle probabilistic framework using time series analysis. A time series model is used, consistingof the state vector (at, θt), where at is the identity variable at time t and θt is the trackingparameter, as well as the observation yt (the video frame), the state transition probabilityp(at, θt|at−1, θt−1) and the observational likelihood p(yt|θt, nt). The task of recognition thusbecomes computing the posterior probability p(at|y0:t) where y0:t = y0, y1, . . . , yt.

Probabilistic appearance manifolds

A probabilistic appearance manifold[31] models each individual in the gallery as a set oflinear subspaces, each modelling a particular pose variation, called pose manifolds. Theseare generated by extracting samples from a training video which are divided into groupsthrough k-means clustering. Principal component analysis is performed on each group tocharacterize that subspace. Temporal continuity is captured by computing the transitionprobabilities between pose manifolds in the training video. Recognition is performed byintegrating the likelihood that an input frame is generated by a pose manifold and theprobability of transitioning to that pose manifold from the previous frame.

Adaptive hidden Markov model

Liu and Chen proposed[39] an HMM-based approach that captures temporal informationby using temporally indexed observation sequences. The approach makes use of principalcomponent analysis to reduce each gallery video to a sequence of low-dimensional featurevectors. These are then used as observation sequences in the training of the HMM models. Inaddition, the algorithm gradually adapts to probe videos by using unambiguously identifiedprobes to update the corresponding gallery model.

System identification

Aggarwal, Chowdury and Chellappa presented[21] a system identification approach to facerecognition in video. Each video sequence is represented by a first-order auto-regressive andmoving average (ARMA) model

θt+1 = Aθt + vt, It = Cθt + wt

where θ is a state vector characterizing the pose of the face, It the frame and v andw independent identically distributed white noise factors drawn from N(0, Q) and N(0, R)respectively. System identification is the process of estimating the model parameters A, C,Q and R based on the observations I1, I2, . . . , In. Recognition is performed by selecting thegallery model that is closest to the probe model by some distance function of the modelparameters.

2.4. Face recognition in video 21

2.4.3 3D model

We can attempt to reconstruct a 3D model of a face from a video sequence. One way todo this is by utilizing light field rendering, which involves treating each observation as a2D slice of a 4D function - the light field, which characterizes the flow of light throughunobstructed space. Another method is structure from motion (SfM), which attempts torecover 3D structure from 2D images coupled with local motion signals. The 3D model willpossess two components: geometric and photometric. The geometric component describesdepth information of the face and the photometric component depicts the texture map.Structure from motion is more focused on recovering the geometric component, and lightfield rendering on recovering the photometric component.

Structure from motion

There is a large body of literature on SfM, but despite this current SfM algorithms cannotreconstruct the 3D face model reliably. The difficulties are three-fold: (i) the ill-posed natureof the perspective camera model that results in instability of SfM solutions, (ii) the fact thatthe face is not a truly rigid object, especially when the face presents facial expressions andother deformations and (iii) the input to the SfM algorithm. This is usually a sparse setof feature points provided by a tracking algorithm with its own flaws. Interpolation froma sparse to a dense set of feature points is very inaccurate. The first difficulty can beaddressed by using an ortographic or paraperspective model to approximate the perspectivecamera model[56][46]. The second problem can often be resolved by imposing a subspaceconstraint on the face model[13]. A dense face model can be used to overcome the sparse-to-dense issue. However, the dense face model is generic and not appropriate for a specificindividual. Bundle adjustment has been used to adjust the generic model to accomodatevideo observation[20].

Light field rendering

The SfM algorithm mainly recovers the geometric component of the face model, i. e., thedepth value of every pixel. Its photometric component is naively set to the appearencein one reference video frame. An image-based rendering method recovers the photometriccomponent of the 3D model instead, and light field rendering bypasses even this stage byextracting novel views directly[37].


2.5 Object tracking

The field of object tracking deals with the combined problems of locating objects in videosequences, tracking their movement from frame to frame and analyzing object tracks torecognize behavior. In its simplest form, object tracking can be defined as the problem ofconsistently labeling a tracked object in each frame of a video. Depending on the tracker,additional information about the object can also be detected, such as area, orientation,shape, etc. Conditions that create difficulties in object tracking include:

– information loss due to projecting a 3D world onto a 2D image,

– noise and cluttered, dynamic backgrounds,

– complex rigid motion (drastic changes in velocity),

– nonrigid motion (deformation),

– occlusion,

– complex object shape,

– varying illumination.

Tracking can be simplified by imposing constraints on the conditions of the scene. Forexample, most object tracking methods assume smooth object motion, i. e., no abruptchanges in direction and velocity. Assuming constant illumination also increases the numberof potential approaches that can be used. The approaches to choose from mainly differ inhow they represent the objects to be tracked, which image features they use and how themotion is modeled. Which approach performs best depends on the intended application.Some trackers are even specifically tailored to the tracking of certain classes of objects, forexample humans.

2.5.1 Object representation

Objects can be represented both in terms of their shapes and their appearances. Someapproaches use only the shape of the object to represent them, but some also combineshape with appearance. Shape and appearance representations are usually chosen to fit acertain application domain[7]. Major categories of shape representations include:

– Points. The object is represented by one or more points. This is generally suitable forobjects that occupy a small image region.

– Geometric primitives. Objects are represented by geometric primitives, such as rect-angles, circles or ellipses. This representation is particularly suitable for rigid objectsbut can also be used to bound non-rigid ones.

– Silhouette and contour. The contour is the boundary of an object, and the area insideit is called the silhouette. Using this representation is suitable for tracking complexnon-rigid objects.

– Articulated shape models. These models consist of body parts held together withjoints. The human body, for example, consists of a head, torso, upper and lower arms,etc. The motion of the parts are constrained by kinematic models. The constituentparts can be modeled by simple primitives such as ellipses or cylinders.

2.5. Object tracking 23

– Skeletal models. The skeleton of an object can be extracted using medial axis transfor-mation. This model is commonly used as a shape representation in object recognitionand can be used to model both rigid and articulated objects.

Common appearance representations of objects are:

– Probability densities. Probability density appearance representations can be para-metric, such as a Gaussian, or non-parametric, such as histograms. The probabilitydensities of object appearance can be computed from features (color, texture or morecomplex features) of the image region specified by the shape representation, such asthe interior of a rectangle or a contour.

– Templates. Templates are formed from a composition of simple geometric objects.The main advantage of templates is that they carry both spatial and appearanceinformation, but they tend to be sensitive to pose changes.

– Active appearance models. Active appearance models simultaneously model shape andappearance by defining objects in terms of a set of landmarks. Landmarks are oftenpositioned on the object boundary but can also reside inside the object region. Eachlandmark is associated with an appearance vector containing, for example, color andtexture information. The models need to be trained using a set of samples by sometechnique, e.g. PCA.

Figure 2.6: A number of object shape representations. a) Single point. b) Multiple points.c) Rectangle. d) Ellipse. e) Articulated shape. f) Skeletal model. g) Control points oncontour. h) Complete contour. i) Silhouette.


2.5.2 Image features

The image features to use are an integral part of any tracking algorithm. The most desir-able property of a feature is how well it distinguishes between the object region and thebackground[7]. Features are usually closely related to the object representation. For exam-ple, color is mostly used for histogram representations while edges are more commonly usedfor contour-based representations. The most common features are:

– Color. The apparent color of an object is influenced both by the light source and thereflective properties of the object surface. Different color spaces, such as RGB, HSV,L*u*v or L*a*b, each with different properties, can be used, depending on applicationarea.

– Edges. Object boundaries generally create drastic changes in image intensity and edgedetectors identify these changes. These features are mostly used in trackers that usecontour-based object representations.

– Optical flow. Optical flow is a field of displacement vectors that describe the motion ofeach pixel in an image region. It is computed by assuming that the same pixel retainsthe same brightness between consecutive frames.

– Texture. Texture measures the variation of intensity across a surface, describing prop-erties like smoothness and regularity.

Methods for automatic feature selection have also been developed. These can mostlybe categorized as either filter or wrapper methods[11]. Filter methods derive a set of fea-tures from a much larger set (such as pixels) based on some general criteria, such as non-correlation, while wrapper methods select features based on their usefulness in a particularproblem domain.

2.5.3 Object detection

Every tracking method requires some form of detection mechanism. The most commonapproach is to use information from a single initial frame, but some methods utilize temporalinformation across multiple frames to reduce the number of false positives[7]. This is usuallyin the form of frame differencing, which highlights regions that change between frames. Inthis case, it is then the tracker’s task to establish correspondence between detected objectsacross frames. Some common techniques include:

– Point detectors. Detectors used to find points of interest whose respective loci haveparticular qualities[29]. Major advantages of point detectors are insensitivity to vari-ation in illumination and viewpoint.

– Background subtraction. Approach based on the idea of building a model for thebackground of the scene and detecting foreground objects based on deviations fromthis model[51].

– Segmentation. Segmentation aims to detect objects by partitioning the image intoperceptually similar regions and characterizing them[50].

– Supervised learning. Based on learning a mapping between object features and objectclass and then applying the trained model to different parts of an image. Theseapproaches include neural networks, adaptive boosting, decision trees and supportvector machines.


2.5.4 Trackers

The goal of an object tracker is to track the trajectory of an object over time by locatingit in a series of consecutive frames in a video. This can be done in two general ways[7].Firstly, the object can be located in each frame individually using a detector, the trackerbeing responsible for establishing a correspondence between the regions in separate frames.Secondly, the tracker can be provided with an initial region located by the detector and theniteratively update its location in each frame. The shape and appearance model limits thetypes of transformations it can undergo between frames. The main categories of trackingalgorithms are:

– Point tracking. With objects represented by points, tracking algorithms use the stateof the object in the previous frame to associate it to the next. This state can includethe position and motion of the object. This requires an external mechanism to detectthe object in each frame beforehand.

– Kernel tracking. The kernel refers to a combination of shape and appearance model.For example, a kernel can be the rectangular region of the object coupled with a colorhistogram describing its appearance. Tracking is done by computing the motion ofthe kernel across frames.

– Silhouette tracking. Tracking is performed by estimating the object region in eachframe. This is done by using information encoded in the object region from previousframes. This information usually takes the form of appearance density, or shapemodels such as edge maps.

CAMSHIFT

The Continuously Adaptive Mean Shift (CAMSHIFT) algorithm[12] is a color histogram-based object tracker based on a statistical method called mean shift. It was designed to beused in perceptual user interfaces and minimizing computational costs was thus a primarydesign criterion. In addition, it is relatively tolerant to noise, pose changes and occlusion,and to some extent also illumination changes. It tracks object movement along four degreesof freedom: x, y, z position as well as roll angle. x and y is given directly by the searchwindow, the z position can be derived by estimating the size of the object and relating itto the current size of the tracking window. Roll can be derived from the second momentsof the probability distribution in the tracking window. It was initially developed to trackfaces, but it can also be applied to other object classes.

Color probability distribution The first step of the CAMSHIFT algorithm is to createa probability distribution image of each frame, based on an externally selected initial trackwindow which contains exactly the object to track. This is done by generating a colorhistogram of the window and using it as a lookup table to convert an incoming frame intoa probability-of-object map. CAMSHIFT uses only the hue dimension of the HSV colorspace, and ignores saturation and brightness, which gives it some robustness to illuminationchanges. For the purposes of face tracking, it also minimizes the impact of differing skincolors. Problems with this approach can occur if the brightness value is too extreme, or ifthe saturation is too low, due to the nature of the HSV color space causing the hue valueto vary drastically. The solution is to simply ignore pixels to which these conditions apply.This means that very dim scenes need to be preprocessed to increase the brightness priorto tracking.


Mean shift The mean shift algorithm is a non-parametric statistical technique whichclimbs the gradient of a probability distribution to find the local mode/peak. It involvesfive steps:

1. Choose a search window size.

2. Choose the initial location of the search window.

3. Compute the location of the mode inside the search window. This is done as follows:Let p(x, y) be the probability at position (x, y) in the image, and x and y range overthe search window.

(a) Find the zeroth moment

M00 =∑x

∑y

p(x, y)

(b) Find the first moment for x and y

M10 =∑x

∑y

xp(x, y);M01 =∑x

∑y

yp(x, y)

(c) Then the mode of the search window is

xc =M10

M00; yc =

M01

M00

4. Center the search window on the mode.

5. Repeat steps 3 and 4 until convergence.

CAMSHIFT extension The mean shift algorithm only applies to a single static distri-bution, while CAMSHIFT operates on a continuously changing distribution from frame toframe. The zeroth moment, which can be thought of as the distribution’s ”area”, is used toadapt the window size each time a new frame is processed. This means that it can easilyhandle objects changing size when, for example, the distance between the camera and theobject changes. The steps to compute the CAMSHIFT extension are as follows:

1. Choose the initial search window.

2. Apply mean shift as described above and store the zeroth moment.

3. Set the search window size to a function of the zeroth moment found in step 2.

4. Repeat steps 2 and 3 until convergence.

When applying the algorithm to a series of video frames, the initial search window ofone frame is simply the computed region of the previous one. The window size functionused in the original paper is

s = 2

√M00

256


Figure 2.7: CAMSHIFT in action. The graph in each step shows a cross-section of thedistribution map, with red representing the probability distribution, yellow the trackingwindow and blue the current mode. In this example the algorithm converges after six steps.

This is arrived at by first dividing the zeroth moment by the maximum pixel intensityvalue to convert it into units of number of cells, which makes sense for a window size measure.In order to convert the 2D region into a 1D length, we take the square root. We desire anexpansive window that grows to encompass a connected distribution area, so we multiplythe result by two.


Chapter 3

Face recognition systems andlibraries

This chapter describes the face recognition systems and libraries examined in this reportand gives a brief review of the installation process for each of them.

3.1 OpenCV

OpenCV (open computer vision) is a free open source library for image and video processingwith support for a wide variety of different algorithms in many different domains. It hasextensive documentation and community support, with complete API documentation aswell as various guides and tutorials for specific topics and tasks[2].

3.1.1 Installation and usage

Using Ubuntu 12.04 LTS, OpenCV was very simple to install. There is a relatively newbinary release available through the package manager, but the latest version, which amongother things includes the face recognition algorithms, had to be downloaded separately andcompiled from source. However, the compilation required few external dependencies andthe process was relatively simple. The library itself has an intuitive interface and it is veryeasy to create quite powerful sample applications. A great advantage of using OpenCVfor building a face recognition system is that it supports many of the tasks that are notdirectly related to recognition, such as loading, representing and presenting video and imagedata, image processing and normalization, face detection and so on. The implemented facerecognition algorithms have a unified API which make them easily interchangable in anapplication. One minor problem in evaluating their performances is that they only returnthe rank-1 classification which prevents the methods from being evaluated using rank-basedmetrics. They also return a confidence value for the classification, but its definition is neverformally documented, and the only way to find this out is by reading the source code itself.

29

30 Chapter 3. Face recognition systems and libraries

3.2 Wawo SDK

Wawo Technology AB[3] is a Sweden-based company developing face recognition technolo-gies, with its main product being the Wawo face recognition system, which is based onoriginal algorithms and techniques presented in the doctoral dissertation of Hung-Son Le.It is advertised as being capable of performing rapid and accurate illumination-invariantface recognition based on a very small number of samples per subject. This is done using anoriginal HMM-based algorithm as well as original contrast-enhancing image normalizationprocedures[35].


The binary distribution of Wawo used in this work was acquired through Codemill and notdirectly from Wawo, and thus it might not be the most up-to-date version of the library.The distribution I was first given access to also lacked some files that I had to piece togethermyself from an old hard drive used in previous projects, so while I was able to get the libraryrunning, it is possible that some of the problems described below could have occurred be-cause of this. The documentation that came along was quite limited and mainly consistedof code samples with some rudimentary comments partially describing the API. Despitethis, producing functional code was not very difficult when Wawo was used in conjunc-tion with the basic facilities provided by OpenCV. I occasionally encountered segmentationfaults originating inside library calls, Again, this is possibly due to using an old, incompletedistribution. I also discovered that Wawo sets a relatively low upper limit to the size ofthe training gallery. This is reasonable given that Wawo’s strength is advertised to be goodperformance with very few training samples.

3.3 OpenBR

OpenBR (OpenBiometrics) is a collaborative research project initiated by the MITRE com-pany, a US-based non-profit company operating several federally-funded research centers.Its purpose is to develop biometrics algorithms and evaluation methodologies. The cur-rent system supports several different types of automated biometric analysis, including facerecognition, age estimation and gender estimation[1].


I could not find a binary release of OpenBR 0.4.0 on the official website, which is theadvertised version of OpenBR at the time of writing. The only available option seemedto be building from source. The only instructions available were specific to Ubuntu 13.04.Following them on Ubuntu 12.04 LTS did not work, and thus, I was unable to install and testthe library. There were a large number of different steps involved, which I suspect makesthe process error-prone even if the correct OS version is used. Overall, the installationprocedure seemed immature and documentation was limited.

Chapter 4

System description ofstandalone framework

This chapter gives a practical description of the framework developed over the course ofthis project. First, a conceptual overview is given and then each component is described indetail. The framework depends on OpenCV (≥2.4), cmake (tested with 2.8). The WawoSDK is included in the source tree but may not be the latest version.

There are four primary types of objects that are of concern to client programmers andframework developers: detectors, recognizers, normalizers and techniques. Detectors andrecognizers encapsulate elementary face detection and recognition algorithms. Normalizersperform image preprocessing to deal with varying imaging conditions in the data and algo-rithm requirements. Techniques integrate the lower-level components and perform high-levelalgorithmic functions. These components are interchangable and can be mixed and matchedto suit the intended application and the source data.

Figure 4.1: Conceptual view of a typical application.

31

32 Chapter 4. System description of standalone framework

4.1 Detectors

A detector is a class that wraps the functionality of a face detection algorithm. Every de-tector implements the IDetector interface. The interface specifies a detect method, whichaccepts an image represented by an OpenCV matrix and returns a list of rectangles repre-senting the image regions containing the detected faces. In order to deal with varying inputimage formats and imaging conditions, and the fact that different detection algorithms maybenefit from different types of image preprocessing, the IDetector interface also specifies amethod for setting a detector-specific normalizer. See below for details on normalizers.

Figure 4.2: IDetector interface UML diagram.

The framework currently supports two different detectors, but adding additional detec-tors to the framework is simply a matter of implementing the interface.

4.1.1 CascadeDetector

CascadeDetector implements the cascade classifier with Haar-like features described in detailin 2.2.2. Algorithm parameters can be configured using the following methods:

– CascadeDetector(std::string cascadeFileName) - cascadeFileName specifies thefile that contains the cascade training data.

– setScaleFactor(double) - Specifies the degree to which the image size is reduced ateach image scale.

– setMinNeighbours(int) - Specifies how many neighbors each candidate rectangleshould have to retain.

– setMinWidth(int), setMinHeight(int) - Minimum possible face dimensions. Facessmaller than this are ignored.

– setMaxWidth(int), setMaxHeight(int) - Maximum possible face size. Faceslarger than this are ignored.

– setHaarFlags(int) - Probably obsolete, see OpenCV documentation.

4.1.2 RotatingCascadeDetector

RotatingCascadeDetector inherits from CascadeDetector and also implements the rotationextension to the cascade classifier described in detail in 5.2. In addition to the ones providedby CascadeDetector, algorithm parameters are supplied through the constructor:

– RotatingCascadeDetector(std::string cascadeFileName, double maxAngle,double stepSize) - maxAngle is the angle of the maximum orientation deviationfrom the original upright position, and stepSize is the size of the step angle in eachiteration.

4.2. Recognizers 33

4.2 Recognizers

A recognizer wraps a face recognition algorithm. It is a class that implements the IRecognizerinterface. A face recognition algorithm is always a two-step process. In the first step, thealgorithm needs to be trained with a gallery of known subjects, and thus a recognizer needsto implement the train method. It accepts a list of gallery images, a corresponding list ofimage regions containing the faces of the subjects and a list of labels indicating the identityof the subject in the image. All three arguments need to be of equal length and a givenindex refers to the same subject in all three lists.

Figure 4.3: IRecognizer interface UML diagram.

The recognize method is responsible for actually performing the recognition. It acceptsan image and a face region as input arguments, and returns the estimated label indicatingthe identity of the subject in the image. As in the case with detectors, different imageformats and imaging conditions can require varying image preprocessing in order to optimizethe performance of the recognition algorithm, and so image normalization is required. Inaddition, since recognizers deal with images from two different sources, the gallery and theprobe, two different normalizers may be necessary. Thus, an implementing class needs toaccept a gallery normalizer and a separate probe normalizer.

The framework currently supports five different recognition algorithms:

4.2.1 EigenFaceRecognizer

EigenFaceRecognizer implements the Eigenfaces algorithm, described in detail in 2.3.3. Al-gorithm parameters can be configured using the following methods:

– setComponentCount(int) - Set the number of eigenvectors to use.

– setConfidenceThreshold(double) - Set the known/unknown subject threshold. Avalue including and between 0.0 and DBL MAX.

4.2.2 FisherFaceRecognizer

FisherFaceRecognizer implements the Fisherfaces algorithm, described in detail in 2.3.3.Algorithm parameters can be configured using the following methods:

– setComponentCount(int) - Set the number of eigenvectors to use.



4.2.3 LBPHRecognizer

LBPHRecognizer implements the local binary pattern histograms algorithm, described indetail in 2.3.3. Algorithm parameters can be configured using the following methods:

– setRadius(int) - The radius used for building the circular local binary pattern.

– setNeighbours(int) - The number of sample points to build a circular local binarypattern from.

– setGrid(int x, int y) - The number of cells in the horizontal and vertical directionrespectively.


4.2.4 WawoRecognizer

WawoRecognizer implements the Wawo algorithm, described briefly 2.3.3 and 3.2. Algo-rithm parameters can be configured using the following methods:

– setRecognitionThreshold(float) - Set the known/unknown subject threshold. Avalue including and between 0.0 and 1.0.

– setVerificationLevel(int) - A value including and between 1 and 6. A lower valueruns faster, but probably decreases accuracy.

– setMatchesUsed(int) - If set to greater than 1, the result returned is the mode ofthe n most likely candidates.

4.2.5 EnsembleRecognizer

The EnsembleRecognizer combines an arbitrary number of elementary recognizers, whichvote democratically amongst themselves about the final result. The setConfidenceThresh-old(double) method sets the minimum fraction of participating recognizers which need toagree to produce a result. Otherwise, the probe is considered unknown. Note that theparticipating recognizers can also explicitly vote for an unknown identity, if so configured.

4.3 Normalizers

Input imagery can vary greatly depending on camera equipment, lighting conditions duringthe shot, lossy processing since the image was taken, etc. In addition, different imageprocessing algorithms require different input preprocessing to achieve optimal performance.Normalizers are modules that perform the preprocessing steps for the other parts of therecognition system. This makes it easy to test a variety of normalization options in orderto figure out which one suits a particular algorithm in a particular context best.

Figure 4.4: INormalizer interface UML diagram.

4.4. Techniques 35

A normalizer is a class that implements the INormalizer interface. The interface definesonly a single method, normalize, which accepts an input image and returns an outputimage. Any number and combinations of image processing operations can be performed bya normalizer. The framework currently supports the following four, but adding more is verysimple:

– GrayNormalizer - Converts an RGB image to grayscale.

– ResizeNormalizer - Scale an image to the given dimensions.

– EqHistNormalizer - Enhance contrast by equalizing the image histogram.

– AggregateNormalizer - Utility class that lets the user create a custom normalizerby assembling a sequence of elementary normalizers. This circumvents the need tocreate a new normalizer for every conceivable combination of normalization steps.

4.4 Techniques

A technique is a top-level class that ties together the constituent detection and recognitionalgorithms in a particular way. While the IDetection and IRecognition interfaces deal solelywith individual images, a technique is responsible for loading the gallery and probe files andpotentially also iterating over the frames of a probe video and algorithmic tasks spanningmultiple sequential frames.

Figure 4.5: ITechnique interface UML diagram.

Every technique implements the ITechnique interface which specifies the train and recog-nize methods. The former accepts a Gallery object which specifies the files to use for trainingthe underlying recognition model. The latter accepts a string containing the filename of theprobe image or video and produces an Annotation object describing the prescence, identitiesand locations of recognized individuals in the probe.

4.4.1 SimpleTechnique

SimpleTechnique is the prototypical technique. It accepts one detector for the gallery andone for the probe, and a single recognizer. All gallery files are loaded from disk in turn.If a gallery file is an image, it applies the gallery detector and feeds the detected imageregion and the corresponding label to the recognizer for training. If a gallery file is a video,performs the same operations on each of the frames of the video in turn. After the trainingis complete, it loads the probe from file. If it is an image, the probe detector is applied toit and the detected image region is fed to the recognizer and the result is stored. If it is avideo file, the technique performs the same operations on each frame in turn.

4.4.2 TrackingTechnique

TrackingTechnique implements the recognition/object tracking integration described in de-tail in 5.1. The gallery face detection preprocessing and recognizer training is identical toSimpleTechnique (see above).


Figure 4.6: SimpleTechnique class UML diagram.

Figure 4.7: TrackingTechnique class UML diagram.

4.5 Other modules

4.5.1 Annotation

The Annotation class represents the annotation of the presence and location of individualsin a sequence of images, such as a video clip. An instance can be produced by the frameworkthrough the application of face recognition to a probe image or video, but also saved andloaded from disk.

In addition, an instance can be compared to another, ”true”, annotation by a number ofperformance measures, described in 6.1. When saved to file, it uses a simple ASCII-based fileformat. The first line contains exactly one positive integer, representing the total numberof individuals in the subject gallery. Each subsequent line represents a frame or image.

The present individuals in the frame/image can be specified in two ways, depending onwhether or not location data is included. Either, the individuals present in the frame arerepresented by a number of non-negative integer labels, separated by whitespace, or eachindividual is represented by a non-negative integer label followed by a start parenthesis’(’, four comma-separated non-negative integers representing the x, y, width, height of therectangle specifying the image region of the face of the individual in the frame/image,followed by an end parenthesis ’)’. Each such segment is separated by whitespace. Forexample, a file of the first type may look like this:

4

0

0

0 1

0 1

0 1 2

1 2

1 2

1 2 3

4.5. Other modules 37

And a file of the second type may look like this:

9

0(46,25,59,59)

0(46,25,61,61)

0(45,24,63,63)

0(47,25,61,61)

0(46,25,61,61)

0(45,24,62,62)

0(46,24,62,62) 1(146,124,41,41)

0(45,24,62,62) 1(146,124,41,41)

4.5.2 Gallery

The Gallery class is a simple abstraction of the gallery data used to train face recognitionmodels. An instance is created simply by providing the path to the gallery listing file. Anoptional parameter samplesPerSubject specifies the maximum number of samples to extractfrom each video file in the gallery listing. If the parameter is left out, this indicates to clientmodules that the maximum number of samples should be extracted. This commonly meansall detected faces in the video.

Figure 4.8: Gallery class UML diagram.

The gallery listing is an ASCII-format newline-separated list of gallery file/label pairs.Each pair consists of a string specifying the path to the gallery file and a non-negativeinteger label specifying the identity of the subject in the gallery file, separated by a colon.The gallery file can be either an image or a video clip. In both cases, it is assumed thatonly the face of the subject is present throughout the image sequence. For example:

/home/user1/mygallery/subj0.jpg:0

/home/user1/mygallery/subj0.avi:0

/home/user1/mygallery/subj1.png:1

/home/user1/mygallery/subj1.wmv:1

/home/user1/mygallery/subj2.bmp:2

/home/user1/mygallery/subj2.avi:2

4.5.3 Renderer

This class is used to play an annotated video back and display detection or recognitionresults visually. To render detection results, it accepts the video file and a list of lists ofcv::Rects, representing the set of detected face regions for each frame of the video. Torender recognition results, it simply accepts the video file and an associated Annotationobject. The framerate of the playback can also be configured.

Figure 4.9: Renderer class UML diagram.


4.6 Command-line interface

The framework includes a simple command-line interface to a subset of the functionalityprovided by the framework, as an example of what an application might look like. It acceptsa gallery listings file and probe video file and performs face recognition. The technique,detection and recognition algorithms used can be customized and the result can be eithersaved to file or rendered visually. The application can also be used to benchmark differentalgorithms. The syntax is as follows:

./[executable] GALLERY_FILE PROBE_FILE [-o OUTPUT_FILE] [-t TECHNIQUE] [-d DETECTOR]

[-c CASCADE_DATA] [-r RECOGNIZER] [-R] [-C CONFIDENCE_THRESHOLD] [-b BENCHMARKING_FILE]

[-n SAMPLES_PER_VIDEO]

4.6.1 Options

– -o - Specifies the file to write the resulting annotation to. If this option is left out, theoutput is not saved.

– -t - Specifies the technique to use. Can be either ”simple” or ”tracking”. The defaultis ”simple”.

– -d - Specifies the detector to use. Can be either ”cascade” or ”rotating”. The defaultis ”cascade”.

– -c - Specifies the cascade detector training data to use. The default is frontal facetraining data included in the source tree.

– -r - Specifies the recognizer to use. Can be either ”eigenfaces”, ”fisherfaces”, ”lbph”or ”wawo”. The default is ”eigenfaces”.

– -R - Indicates that the result should be rendered visually. The sequence of frames isplayed back at 30 frames per second and the recognition result is overlayed on eachframe.

– -C - The confidence threshold to set for the selected recognizer. The range of thisvalue depends on the algorithm. The default is 0.

– -D - Set the benchmarking annotation file to use. The result is compared to this fileand performance data is written to stdout when processing is complete.

– -n - Set the number of faces to extract from each gallery video file for training therecognizer model. If this option is not given, as many faces as possible will be used.

Chapter 5

Algorithm extensions

This chapter discusses the improvements made to the basic face recognition system of theVidispine plugin that have been added to the standalone framework. The improvementsare twofold: Firstly, an integration of an arbitrary face recognition algorithm and theCAMSHIFT object tracking algorithm and secondly, an extension of the cascade face de-tection algorithm. This chapter primarily describes the improvements and discusses theirpotential and weaknesses, while their performance is empirically evaluated in chapter six.

5.1 Face recognition/object tracking integration

The majority of face recognition approaches proposed in the literature operate on a singleimage. As discussed in 2.4, these kinds of techniques can be applied to face recognition invideo by applying them on a frame-by-frame basis. However, this purposefully disregardscertain information contained in a video clip that can be used to achieve better recognitionperformance. For example, geometric continuity in successive frames tends to imply thesame object. How can we take that into account when recognizing faces? In addition,a weakness of many popular face recognition techniques is that they are view-dependent.Either the model is trained exclusively with samples from a single view, and thus only able torecognize faces from this one perspective, or the model is trained with samples from multipleviews and often suffer a reduction in recognition performance for any one perspective.

A color-based object tracking algorithm has the advantage of not being dependent on thepose of the tracked object as long as the color distribution of the object does not radicallychange with the pose. In fact, an advertised strength of the CAMSHIFT tracking algorithmis that it is able to continue tracking an object as long as occlusion isn’t 100%[12]. It isthus a natural step to combine an elementary face recognition algorithm to identify facesin individual frames, and the CAMSHIFT algorithm in order to overcome issues with view-dependence and to associate the faces of the same subjects across multiple frames. Theproposed algorithm consists of the steps below. For details concerning face recognition, facedetection or the CAMSHIFT algorithm, see 2.3, 2.2 and 2.5.4 respectively.

39

40 Chapter 5. Algorithm extensions

1. For each frame in the video:

(a) Extend any existing CAMSHIFT tracks to the new frame, possibly terminatingthem. If two tracks intersect, select one and terminate it.

(b) Detect any faces in the frame using an arbitrary face detection algorithm. If aface is detected and it does not intersect with any existing tracks, use it as theinitial search region of a new CAMSHIFT track.

(c) For each existing track, uniformly expand the search region of the current frameby a fixed percentage, and apply face detection inside the expanded search region.If a face is detected, apply an arbitrary face recognition algorithm on the faceregion, and store the recognized ID.

2. Once the video has been processed, iterate over all tracks that were created:

(a) Compute the mode of all recognized IDs of the track and write it as output to eachframe the track covers. Write the CAMSHIFT search region as the correspondingface region to each frame the track covers.

Figure 5.1 illustrates an example of the algorithm in action visually.

Figure 5.1: Example illustrating the face recognition/tracking integration. In frame a, aface is detected using a frontal face detector and a CAMSHIFT track subsequently created.The face is recognized as belonging to subject ID #1. In frame b, the CAMSHIFT track isextended, but since the head is partially occluded, the detector is unable to recognize it as aface. In frame c, the CAMSHIFT track is once again extended, and the head is once againfully revealed, allowing the detector to once again detect and identify the subject as #1. Inframe d, the face is completely occluded and CAMSHIFT loses track of it. The final trackcovers frames 1-3, the mode of the identified faces in the track is computed and assigned toeach frame the track covers.

The main advantage of this approach is that weaknesses in the face detector or theface recognizer for temporary disadvantageous conditions are circumvented. As long asa majority of successful identifications in a track are correct, failed detections or invalidrecognitions are overlooked. This means that temporary pose changes that would normallyinterrupt a regular recognition algorithm are mediated. The approach could also deal withtemporary occlusions or shadows so long as they do not cause CAMSHIFT to lose track.

5.2. Rotating cascade detector 41

The approach could also be extended to using multiple detector/recognizer pairs for multipleviewpoints to increase the range of conditions that result in valid identifications, furtherincreasing the probability of achieving a majority of valid identifications in a single track. Aprobable weakness of the approach is that cluttered backgrounds could easily result in falsepositive face detections, depending on the quality of the face detector used, which couldproduce tracks tracking background noise. It might be possible to filter out many tracks ofthis type by finding criteria that are likely to be fulfilled by noise tracks, such as a shortlength or a relatively low number of identifications.

5.1.1 Backwards tracking

In theory, there is nothing that prevents the algorithm presented above from tracking bothforward and backward along the temporal axis. In the case where a face is introduced into thevideo under conditions that are disadvantageous to the underlying detector/recognizer andthen later become detectable, the algorithm would miss this initial segment unless tracking isdone in both temporal directions. In practice, however, video is usually compressed in such away that only certain key frames are stored in full, and the frames inbetween are representedas sequential modifications to the previous key frame. This means that a video can onlybe played back in reverse by first playing it forwards and buffering the frames as they arecomputed. If no upper limit is set on the size of this buffer, even short video clips of moderateresolution would require huge amounts of memory to process. The implementation of therecognition/tracking integration developed during this project supports backwards trackingwith a finite-sized buffer, the size of which can be configured to suit the memory resourcesand accuracy requirements of the intended application.

5.2 Rotating cascade detector

The range of face poses that can be detected by cascade detection with Haar-like features islimited by the range of poses present in the training data. In addition, training for multipleposes can limit the accuracy of detecting a face in any one particular pose. To partiallymitigate this issue, an extension to the basic cascade detector has been developed. The basicidea is to rotate the input image about its center in a stepwise manner and apply the regulardetector to each resulting rotated image in turn, in order to detect faces in the correct posebesides a rotation about the image z-axis. However, this approach is likely to detect thesame face multiple times due to the basic cascade detector being somewhat robust to minorpose changes. In order to handle multiple detections of the same face, the resulting detectedface regions are then rotated back to the original image orientation and merged, producinga single face region in the original image for each face present. This extended detector thusexpands the range of detectable poses. The steps of the algorithm are as follows:

1. For a given input image, a given maximum angle and a given step angle, start byrotating the image by the negative maximum angle. Increase the rotation angle bythe step angle for each iteration until the angle exceeds the positive maximum angle.

2. For each image orientation:

(a) Apply cascade face detection to the rotated image.

(b) For each detected face region, rotate the region back to the original orientationand compute an axis-aligned bounding box (AABB) around it.

42 Chapter 5. Algorithm extensions

3. For all AABBs from the previous step, find each set of overlapping rectangles.

4. For each set, compute the average rectangle, i. e., the average top-left/bottom-rightpoint defining each rectangle.

Figure 5.2 shows an example of the rotating detector in action. A downside of therotating detector versus the basic detector is that the stepwise rotation increases processingtime by a constant factor, which is the total number of orientations processed, in addition tothe relatively minor time it takes to merge the resulting face regions. Thus, this extensionis only a viable option in scenarios where accuracy is prioritised over speed, and where arotation about the z-axis is very likely to occur frequently. Another issue is that the riskof detecting false positives is increased as the number of orientations considered increases.For this reason, the approach may be less useful in scenes with cluttered backgrounds.

Figure 5.2: Illustrated example of the rotating cascade detector in action. In the first step,the image is rotated to a fixed set of orientations and face detection is applied in each. Inthe second step, the detected regions are rotated back to the original image orientation andan axis-aligned bounding box is computed for each. In the last step, the average of theoverlapping bounding boxes is computed as the final face region.

Chapter 6

Performance evaluation

In this chapter the accuracy and performance of the basic face detection and recognitionalgorithms, as well as the algorithmic extensions introduced in this report, are empiricallyevaluated. Also, the accuracy and performance of the basic algorithms are evaluated undera variety of scene and imaging conditions, in order to elucidate what their strengths andweaknesses are, and to develop recommendations for application areas and avenues forimprovement to be used in future work.

Firstly, the performance metrics used in the evaluation are introduced and explained indetail. Secondly, the sources and properties of the various datasets used are described, andthirdly, the setup of each individual test is explained. The final section includes both apresentation and explanation of the results as well as an analysis and discussion.

6.1 Metrics

The task of performing face detection and recognition on a probe video with respect to asubject gallery and producing an annotation of the identities and temporal locations of anyfaces present in the probe can be viewed as a multi-label classification problem applied toeach frame of the probe. In order to prevent optimization according to a potential bias ina certain metric, a number of different metrics will be used. Let L be the set of subjectlabels present in the gallery. Let D = x1, x2, . . . , x|D| be the sequence of frames of theprobe video and let Y = y1, y2, . . . , y|D| be the true annotation of the probe where yi ⊆ Lis the true set of subject labels for the ith frame. Let H be a face recognition system andH(D) = Z = z1, z2, . . . , z|D| the predicted annotation for the probe by the system. Let trbe the time it takes to play the video and tp be the time it takes to perform face recognitionon the video. The following metrics prominent in the literature[57][5] will be used:

– Hamming loss:

HL(H,D) =1

|D|

|D|∑i=1

|yi∆zi||L|

where ∆ is the symmetric difference of two sets, which corresponds to the XOR opera-tion in Boolean logic. This metric measures the average ratio of incorrect labelings andmissing labels to the total number of labels. Since this is a loss function, a Hammingloss equal to 0 corresponds to optimal performance.

43

44 Chapter 6. Performance evaluation

– Accuracy:

A(H,D) =1

|D|

|D|∑i=1

|yi ∩ zi||yi ∪ zi|

Accuracy symmetrically measures the similarity between yi and zi, averaged over allframes. A value of 1 corresponds to optimal performance.

– Precision:

P (H,D) =1

|D|

|D|∑i=1

|yi ∩ zi||zi|

Precision is the average percentage of identified true positives to the total number oflabels identified. A value of 1 corresponds to optimal performance.

– Recall:

R(H,D) =1

|D|

|D|∑i=1

|yi ∩ zi||yi|

Recall is the average percentage of identified true positives to the total number of truepositives. A value of 1 corresponds to optimal performance.

– F-measure:

F (H,D) =1

|D|

|D|∑i=1

2 |yi ∩ zi||zi|+ |yi|

The F-measure is the harmonic mean of precision and recall and gives an aggregatedescription of both metrics. A value of 1 corresponds to optimal performance.

– Subset accuracy:

SA(H,D) =1

|D|

|D|∑i=1

indzi = yi

Subset accuracy is the fraction of frames in which all subjects are correctly classifiedwithout false positives. A value of 1 corresponds to optimal performance.

– Real time factor:

RTF (H,D) =tptr

The real time factor is the ratio of the time it takes to perform recognition on thevideo to the time it takes to play it back. If this value is 1 or below, it is possible toperform face recognition in near-real time.

6.2. Testing datasets 45

6.2 Testing datasets

Several different test databases were used, for two reasons. Firstly, using several databaseswith different properties, such as clutter, imaging conditions and number of subjects givesa better overview of how different parameters affect the quality and speed of recognition.Secondly, using standard test databases allow the results to be compared with other resultsin the literature. This section lists the databases that were used along with a description oftheir properties.

6.2.1 NRC-IIT

This database contains 11 pairs of short video clips of 11 individuals respectively. One ormore of the files for two of the subjects could not be read and were excluded from thisevaluation. The resolution is 160x120 pixels, with the face occupying 1

4 to 18 of the frame

width. The average duration is 10-20 seconds. All of the clips were shot under approximatelyequal illumination conditions, which was uniformly distributed ceiling light and no sunlight.The subjects pose a variety of different facial expressions and head orientations. Only asingle face is present in each video and the faces are present in the video for the entireduration.[26]

6.2.2 News

Contains a gallery of six short video clips of the faces of news anchors, each 12-15 sec-onds long, and a probe clip containing outtakes from news reports featuring two of thesubjects which is 40 seconds long. The resolution of the gallery clips varies slightly, butis approximately 190x250 pixels. The face occupies the majority of the image with verylittle background and no clutter. The subjects are speaking but have mostly neutral facialexpressions. The probe contains outtakes featuring two of the anchors in a full frame fromthe original news reports. The resolution is 640x480 pixels. The background contains slightclutter and is mostly static, but varies slightly as imagery from news stories is sometimesdisplayed. In some cases, unknown faces are showed. The illumination is uniform studiolighting without significant shadows.

6.2.3 NR

Consists of a gallery of seven subjects, and for each subject, five video clips featuring thesubject in a frontal pose, and one to five video clips featuring the subject in a profile pose.The gallery clips contain only the subjects’ faces but also some background with varyingdegrees of clutter. Each clip is one to 10 seconds long. The database also contains one 90second probe video clip featuring a subset of the subjects in the gallery. All gallery andprobe clips were shot with the same camera. They are in color with a resolution of 640x480pixels. The illumination and facial expressions of the subjects vary across the gallery clips.The pose and facial expressions of the subjects in the probe vary, but the illumination isapproximately uniform. The probe features several subjects in a single frame as well asseveral unknown subjects. The background is dynamic with a relatively high degree ofclutter compared to the other datasets.


6.3 Experimental setup

Three different experiments were performed. The first compares the accuracy and processingspeed of the tracking extension to the basic framework face recognition algorithms as thegallery size increases, the second measures the performance of the rotation extension to thecascade detector and the third evaluates the impact of variations in multiple imaging andscene conditions on recognition algorithm performance. This section describes the purposeand setup of each experiment. All tests were performed on an Asus N56V laptop with eightIntel Core i7-3630QM CPUs at 2.40GHz, 8 GB of RAM and an NVIDIA Geforce GT 635Mgraphics card. The operating system used was Ubuntu 12.04 LTS.

6.3.1 Regular versus tracking recognizers

This test was performed in order to evaluate the performance and processing speed of thetracking extension compared to regular recognition systems. The algorithms evaluated wereEigenfaces, Fisherfaces, Local binary pattern histograms, Wawo and the ensemble method(see 4.2.5) using the other four algorithms. For each algorithm, both a frame-by-framerecognition approach, as well as the CAMSHIFT tracking approach, described in 5.1, wasused. All algorithms used the cascade classifier with Haar-like features for face detection.This test was performed on the NRC-IIT database. The gallery was extracted from the firstvideo clip of each subject. The second video clip of each subject was used as probes. Themean subset accuracy over all probes was computed for a number of gallery sizes rangingfrom 1 to 50. The real time factor (RTF) was computed for each gallery size by dividingthe total processing time for all probes, including retraining the algorithms with the galleryfor each probe, by the sum total length of all probe video clips.

6.3.2 Regular detector versus rotating detector

This test evaluates the recognition performance and processing speed of recognition systemsusing the rotating extension of the cascade classifier face detector (see 5.2) with respect to theregular classifier. The algorithms used for the evaluation are Eigenfaces, Fisherfaces, Localbinary pattern histograms, Wawo and the ensemble method (see 4.2.5) using the other fouralgorithms. For each algorithm, both the regular cascade classifier and the cascade classifierwith the rotating extension was used. In all other regards, the test was identical to the testdescribed in the previous section. The rotating detector used 20 different angle deviationsfrom the original orientation, with a maximum angle of ±40 and a step size of 4.

6.3.3 Algorithm accuracy in cases of multiple variable conditions

The purpose of this test is to illuminate what the obstacles to applying the system to reallife scenarios are, where many different scene and image conditions can be expected to varysimultaneously. For this reason, each of the basic algorithms was tested on each of the threedatasets, each of which has a different set of variable image and scene conditions (see above).

For each of the three datasets, Eigenfaces, Fisherfaces, LBPH and Wawo was tested,each using a cascade detector trained for detecting frontal faces using the default cascadedata included in the framework. For the NRC-IIT database, the gallery was extractedfrom the first video clip of each subject. The second video clip of each subject was used asprobes. The mean Hamming loss, accuracy, precision, recall, F-measure and subset accuracywas computed over all probes. For the News database, the same set of measurements wascomputed over its single probe. For the NR database, only the frontal gallery was used. The

6.4. Evaluation results 47

same set of measurements was computed over its single probe. In each test, the maximumnumber of usable samples were extracted from each gallery video, as specified by the Gallerymodule (see 4.5.2). In the case of Wawo and the ensemble method, this had to be limitedto 50 samples per video for the NRC-IIT test and 10 samples per video for the News andNR tests due to segmentation faults occurring inside Wawo library calls when using largergallery sizes.

6.4 Evaluation results

6.4.1 Comparison of algorithm accuracy and speed over gallery size

As figure 6.1 illustrates, the tracking extension vastly improves the accuracy of all fivealgorithms. Wawo and the ensemble approach quickly reach a near-optimal level of accuracyat 95-96% and the other algorithms catch up as the gallery size increases. Without thetracking extension, Wawo outperforms the other algorithms for all but the smallest gallerysizes. However, as figure 6.2 shows, the RTF of Wawo and the ensemble approach are heavilyimpacted by the gallery size while the other algorithms retain an essentially constant RTF asthe gallery size increases. The tracking extension adds a relatively minor, constant increaseto the RTF of all algorithms.

These results suggest that the tracking-extended Wawo algorithm may be a suitablechoice for applications where the gallery size is small to medium-sized (but not too small)and processing time is a non-critical factor. On the other hand, LBPH or Fisherfacesperform nearly as well, even for smaller sample sizes and vastly outperform Wawo in termsof processing time. It should be noted that these results are highly dependent on the datasetand this analysis should only be considered valid for applications that use data with similarconditions to the test data.

Figure 6.1: The performance of each algorithm as measured by subset accuracy, as thegallery size increases. The thin lines represent the basic frame-by-frame algorithms and thethick lines represent the corresponding algorithm using the tracking extension.


Figure 6.2: The real time factor of each algorithm as the gallery size increases. The thinlines represent the basic frame-by-frame algorithms and the thick lines represent the corre-sponding algorithm using the tracking extension.

6.4.2 Regular detector versus rotating detector

Figure 6.3 shows that the rotating extension invariably improves the accuracy of all algo-rithms. The degree of the improvement does vary slightly, but usually lies in 0.05-0.1 range.The degree of improvement does not seem to be affected by the gallery size. Figure 6.4illustrates that the extension adds a hefty constant cost to the processing time with respectto gallery size, but which is instead directly proportional to the total number of orientationsconsidered by the rotation extension.

These results indicate that the rotating extension purchases a slight improvement in ac-curacy for a large cost in performance. The performance cost can be reduced by consideringa smaller number of orientations. Depending on the application, this may or may not reduceaccuracy. For example, in a scenario where the subjects to be identified are unlikely to leantheir heads to either side by no more than a small amount, a large maximum angle may bewasteful. In addition, this test only considers the case where the step size is a tenth of themaximum angle. It is possible that a larger step size would result in the same accuracy, butat the time of writing this has not been tested.

Another consideration is the increased likelihood of false positives. Since the basic facedetection algorithm is performed once for each orientation, the basic probability of a falsepositive is compounded by the number of orientations considered. This factor does notappear to impair the algorithm for this particular dataset, but in cases where the basic falsepositive rate is relatively high, such as in scenes with cluttered backgrounds, it may becomea greater problem.


Figure 6.3: The performance, as measured by subset accuracy, of each algorithm using theregular cascade classifier (thin lines) compared to ones using the rotating extension (thicklines), as the gallery size increases.

Figure 6.4: The real time factor of each algorithm as the gallery size increases. The thinlines represent the basic frame-by-frame algorithms and the thick lines represent the corre-sponding algorithm using the tracking extension.


6.4.3 Evaluation of algorithm accuracy in cases of multiple variableconditions

Table 6.1 shows the results of the NRC-IIT test. The subset accuracy measurement indicatesthat the algorithms correctly label about 50-60% of frames. Visual inspection shows thatmost of the error comes from a failure to detect faces that have been oriented away from thecamera or distorted and/or occluded. This issue is overcome using the tracking techniqueas demonstrated above.

Table 6.1: NRC-IIT test results. The values of all measures besides Hamming loss are equalbecause by their definition, they equate to the subset accuracy when |zi|, |yi| ≤ 1.

Algorithm Hamming loss Accuracy Precision Recall F-measure Subset accuracy

Eigenfaces 0.104112 0.531498 0.531498 0.531498 0.531498 0.531498Fisherfaces 0.0875582 0.605988 0.605988 0.605988 0.605988 0.605988

LBPH 0.0975996 0.560802 0.560802 0.560802 0.560802 0.560802Wawo 0.0908961 0.590968 0.590968 0.590968 0.590968 0.590968

Ensemble 0.0933599 0.57988 0.57988 0.57988 0.57988 0.57988

Table 6.2 shows the result of testing on the News dataset. A noteworthy feature here isthat the recall is similar to the results for the NRC-IIT test, which means that about thesame fraction of true positives were identified. However, the precision is markedly lower,which indicates a greater number of false positives. This is most likely due to the morecluttered background in the News probe. This is corroborated by visual inspection of theresult. The error consists both of non-face background elements falsely classified as faces bythe detector and unknown faces falsely identified as belonging to the subject gallery by therecognizers. The ensemble method performs worst according to all measures in this test.This is most likely due to the fact that the Wawo component causes a segmentation faultfor sample sizes larger than 10 and this limitation is applied to the other algorithms as wellin the current implementation. The other algorithms perform comparatively worse at thisgallery size as demonstrated above and thus sabotage the overall performance.

Table 6.2: News test results. Accuracy and precision are equal when yi ⊆ zi whenever|yi ∩ zi| 6= 0. If yi = zi, accuracy and precision equals recall. This indicates a number offalse positives were present.



LBPH 0.309898 0.444169 0.444169 0.622002 0.50284 0.269644Wawo 0.351944 0.340433 0.340433 0.463193 0.381086 0.219189

Ensemble 0.368211 0.301213 0.301213 0.438379 0.34654 0.166253

Table 6.3 shows the results for the NR test. Despite the fact that the training data isof somewhat lower quality and the background is highly dynamic and cluttered, with manyunknown individuals present, Wawo and Fisherfaces performed on par with the results fromthe News test, although Eigenfaces and LBPH performed worse. To various degrees, theprecision measurements indicate all methods detected a large number of false positives.Visual inspection shows that the error is due to non-faces classified as faces, unknownindividuals identified as belonging to the gallery and, to a greater extent than for the Newstest, known subjects falsely identified as other known subjects. The last issue may arisepartly from the relatively lower quality of the subject gallery but also from the more dynamic


and varied poses and facial expressions all individuals in the probe assume. Despite the samegallery size limitation to the ensemble method as in the News test, strangely it outperformsall other algorithms except Wawo.

Table 6.3: NR test results. Accuracy and precision are equal when yi ⊆ zi for all i where|yi ∩ zi| 6= 0. If instead yi = zi, accuracy and precision would equal recall. This indicates anumber of false positives were present



LBPH 0.263492 0.244444 0.244444 0.388889 0.291667 0.105556Wawo 0.194444 0.389583 0.389583 0.625 0.465463 0.169444

Ensemble 0.21746 0.343981 0.343981 0.55 0.40963 0.155556

The above results indicate that a major obstacle to applying face recognition in reallife scenarios is the high rate of false positives detected in cluttered background conditionsand with unknown individuals present in the scene. This problem can be attacked fromtwo angles. The first is that the face detection algorithm can be improved so as to reducethe number of non-face regions falsely detected. Besides testing other algorithms than thecascade classifier, or using better training data, it may also be possible to preprocess theimage to optimize the conditions for face detection.

The second angle would be to improve the face recognition algorithms themselves soas to correctly label unknown individuals as such. While no formal evaluation has beenperformed in this project, rudimentary investigation has indicated that it is possible tooptimize the recognition performance for a particular dataset to a certain degree by findingan appropriate confidence level. However, as the dataset grows large, it seems likely thatthese gains would diminish.


Chapter 7

Conclusion

In this report, the implementation of a standalone version of the Vidispine face recognitionplugin was documented, the tradeoff between face recognition performance and accuracywas evaluated and the possibilities of integrating face recognition with object tracking wasinvestigated.

Among the algorithms evaluated, Wawo performs better than the others for all but thesmallest gallery sizes. However, this comes at a great performance cost as the recognitiontime scales linearly with the number of samples in the gallery. Eigenfaces outperformsWawo in terms of accuracy for small gallery sizes and Fisherfaces has an almost comparableaccuracy for large gallery sizes, and the processing time for both is constant with respect togallery size.

This implies that the best method to use depends on the requirements of the application.If maximizing accuracy with a limited, but not too limited (>5 samples per subject), gallerysize is the goal, at any computational costs, Wawo may be the best option. If the gallerysize is limited (<25 samples per subject) but processing time should also be kept in check,Eigenfaces would be better. Finally, if a relatively large gallery size (>30 samples persubject) can be acquired and processing time should be minimized, Fisherfaces seems to bethe best choice.

While acquiring a multitude of qualitative photos of a single individual, especially onethat is not directly available, can be very difficult, a large number of admissable samplescan easily be extracted from short video clips. In the modern era of the Web and real-timemedia streaming, this is a much more viable option than it used to be in the past.

Due to the nature of the evaluation dataset, these recommendations are conditional onthe assumption that the probe data has an uncluttered background and constant illumina-tion. A cluttered background in particular can be devastating for recognition performance,as the number of false positives for both face detection and recognition rises dramatically.The best recommendation to deal with these issues that can be given based on this evalu-ation is simply to restrict the application scope so as not to include scenes with clutteredbackgrounds and greatly variable illumination.

One of the original goals was to investigate the possibilities of profile face recognition.While the framework supports profile face recognition in principle, by supplying profilerecognition training data to the elementary detection and recognition algorithms, the per-formance of such approaches have not been evaluated in this report. This is mainly due tothe fact that there is hardly any suitable data for such an evaluation readily available, andgathering such data is a time-consuming process that did not fit into the project schedule.

53

54 Chapter 7. Conclusion

An approach for integrating image-based face recognition algorithms and the CAMSHIFTobject tracking algorithm was developed, the specifics of which are described in 5.1. Usingthis method, the performance of the basic face recognition algorithms was improved by ap-proximately 35-45 percentage points. In certain contexts, the method seems to be able toovercome some major obstacles to successful face detection and recognition, such as partialocclusion, face deformation, pose changes and illumination variability.

7.1 Limitations of the evaluation

The majority of tests in this evaluation was performed on the NRC-IIT dataset, whichcovers only a restricted set of possible scene and imaging conditions. As a consequence,the results, discussion and recommendations are mainly applicable to similar types of data.That is, scenes with static uncluttered backgrounds and constant illumination. Based onthe literature, this is a problem that affects the field as a whole, and many authors call formore standardized face recognition datasets covering wider ranges of variables[6]. If moretime was available, additional data would be gathered for a more informative evaluation.This would be made easier if the scope of the intended application area of the framework wasrestricted and specified in more detail, as the amount of necessary data would be reduced.

7.2 Future work

As the original aims of this project were quite broad, so are the possible future lines ofinvestigation. As previously mentioned, a primary issue throughout the project was thelack of useful data to use for evaluation of the developed solutions. Specifically, in order tosystematically address the performance of the various techniques under specific conditions,such as variability in illumination, pose, background clutter, facial expression, external facialfeatures such as beard, glasses or makeup, test data sets that introduce these factors oneby one, and in combination in small numbers, would be required. For a more restrictedapplication area, the amount of necessary test data would be limited to those conditionsthat appear in the real-world scenarios in which the framework would be used. There is alsoa lack of profile image and video test data relative to the amount of frontal face databasesavailable in the literature. An important future project could be to build an extensive testdataset, appropriate to the intended application, according to the above specifications, tobe used to evaluate the performance of new and existing algorithms under development.

The strength and applicability of the framework can always be enhanced by addingnew algorithms for face detection, face recognition and face tracking. If a more specificapplication area is selected, this would inform the choice of new algorithms to add, asdifferent algorithms have different strengths and weaknesses that may make them more orless suitable for a particular application. In addition, it would be interesting to see howthe ensemble method could be improved by adding new algorithms, from different facerecognition paradigms, that complement each other’s primary weaknesses.

Being able to distinguish between known and unknown individuals is relevant to manyapplications, so a future project could be to try to find more general solutions to thisproblem. This is yet another instance of a problem that most likely becomes easier if theproblem domain is restricted. Another possible direction could be to investigate imagepreprocessing methods to improve detection and recognition performance.

Chapter 8

Acknowledgements

I would like to thank my supervisor, Petter Ericson, for keeping me on track and providingsuggestions and insight throughout the project. I want to thank Johanna Bjorklund andEmil Lundh for providing the initial thesis concept and Codemill for providing a calm,quiet workspace. I’d also like to thank everyone at Codemill for helping me get set up andproviding feedback. Finally, I want to thank my girlfriend, my parents and my brother forsupporting me all the way.

55

56 Chapter 8. Acknowledgements

References

[1] OpenBR (Open Source Biometric Recognition). http://openbiometrics.org/ (visited2013-11-28).

[2] OpenCV (Open Source Computer Vision). http://opencv.org/ (visited 2013-11-28).

[3] Wawo Technology AB. http://www.wawo.com/ (visited 2013-11-28).

[4] R. Bolle A. K. Jain and S. Pankanti. Biometrics: Personal Identification in NetworkedSociety. Kluwer Academic Publishers, 1999.

[5] A. M. P. Canuto A. M. Santos and A. F. Neto. Evaluating classification methodsapplied to multi-label tasks in different domains. International Journal of ComputerInformation Systems and Industrial Management Applications, 3:218–227, 2011.

[6] A. H. El-baz A. S. Tolba and A. A. El-harby. Face recognition: A literature review.International Journal of Signal Processing, 2(2):88–103, 2005.

[7] O. Javed A. Yilmaz and M. Shah. Object tracking: A survey. ACM Comput. Surv.,38(4), 2006.

[8] T. Ahonen, Abdenour Hadid, and Matti Pietikainen. Face recognition with local binarypatterns. In In Proc. of 9th Euro15 We, pages 469–481, 2004.

[9] P. Ho B. Heisele and T. Poggio. Face recognition with support vector machines: Globalversus component-based approach. In In Proc. 8th International Conference on Com-puter Vision, pages 688–694, 2001.

[10] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEETransactions on Pattern Analysis and Machine Intelligence, 25:2003, 2003.

[11] A. L. Blum and P. Langley. Selection of relevant features and examples in machinelearning. Artificial Intelligence, 97:245–271, 1997.

[12] Gary R. Bradski. Computer vision face tracking for use in a perceptual user interface,1998.

[13] A. Hertzmann C. Bregler and H. Biermann. Recovering non-rigid 3d shape from imagestreams. In CVPR, pages 2690–2696. IEEE Computer Society, 2000.

[14] M. Oren C. P. Papageorgiou and T. Poggio. A general framework for object detection.In Proceedings of the Sixth International Conference on Computer Vision, pages 555–.IEEE Computer Society, 1998.

57

58 REFERENCES

[15] A. Albiol E. Acosta, L. Torres and E. J. Delp. An automatic face detection and recog-nition system for video indexing applications. Proceedings of the IEEE InternationalConference on Acoustics, Speech and Signal Processing, 4:3644–3647, 2002.

[16] K. Etemad and R. Chellappa. Discriminant analysis for recognition of human faceimages. Journal of Optical Society of America A, 14:1724–1733, 1997.

[17] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals ofEugenics, 7(7):179–188, 1936.

[18] A. W. Fitzgibbon and A. Zisserman. Joint manifold distance: A new approach toappearance based clustering. In Proceedings of the 2003 IEEE Computer Society Con-ference on Computer Vision and Pattern Recognition, CVPR’03, pages 26–33, Wash-ington, DC, USA, 2003. IEEE Computer Society.

[19] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learningand an application to boosting. In Proceedings of the Second European Conference onComputational Learning Theory, pages 23–37. Springer-Verlag, 1995.

[20] P. Fua. Regularized bundle adjustment to model heads from image sequences withoutcalibrated data. International Journal of Computer Vision, 38:154–157, 2000.

[21] A. K. Roy Chowdhury G. Aggarwal and R. Chellappa. A system identification approachfor video-based face recognition. In ICPR (4), pages 175–178, 2004.

[22] S. Z. Li G. Guo and K. Chan. Face recognition by support vector machines. In Pro-ceedings of the Fourth IEEE International Conference on Automatic Face and GestureRecognition 2000, FG ’00, pages 196–201, Washington, DC, USA, 2000. IEEE Com-puter Society.

[23] J. W. Fisher G. Shakhnarovich and T. Darrell. Face recognition from long-term obser-vations. In In Proc. IEEE European Conference on Computer Vision, pages 851–868,2002.

[24] Y. Gao and M. K. H. Leung. Face recognition using line edge map. IEEE Transactionson Pattern Analysis and Machine Intelligence, 24:764– 779, 2002.

[25] G. G. Gordon. Face recognition based on depth maps and surface curvature. In SPIEGeometric methods in Computer Vision, pages 234–247, 1991.

[26] D. O. Gorodnichy. Video-based framework for face recognition in video. In SecondWorkshop on Face Processing in Video (FPiV’05) in Proceedings of Second CanadianConference on Computer and Robot Vision (CRV’05), pages 330–338, 2005.

[27] C. Gurel. Development of a face recognition system. Master’s thesis, Atilim University,2011.

[28] M. R. Lyu H.-M. Tang and I. King. Face recognition committee machine. In ICME,pages 425–428. IEEE, 2003.

[29] C. Harris and M. Stephens. A combined corner and edge detector. In In Proc. of FourthAlvey Vision Conference, pages 147–151, 1988.

REFERENCES 59

[30] J. Ghosn I. J. Cox and P. N. Yianilos. Feature-based face recognition using mixture-distance. In Proceedings of the 1996 Conference on Computer Vision and PatternRecognition (CVPR ’96), CVPR ’96, pages 209–216, Washington, DC, USA, 1996.IEEE Computer Society.

[31] M.-H. Yang J. Ho and D. Kriegman. Video-based face recognition using probabilisticappearance manifolds. In In Proc. IEEE Conference on Computer Vision and PatternRecognition, pages 313–320, 2003.

[32] N. Ahuja J. Weng and T. S. Huang. Learning recognition and segmentation of 3dobjects from 2d images. Proc. IEEE Int’l Conf. Computer Vision, pages 121–128,1993.

[33] R. Jafri and H. R. Arabnia. A survey of face recognition techniques. Journal ofInformation Processing Systems, 5(2):41–68, 2009.

[34] Y. Li K. Jonsson, J. Kittler and J. Matas. Learning support vectors for face verificationand recognition. In Fourth IEEE International Conference on Automatic Face andGesture Recognition, pages 208–213. IEEE Computer Society, 2000.

[35] H.-S. Le. Face Recognition: A Single View Based HMM Approach. PhD thesis, UmeaUniversity, 2008.

[36] J.-H. Lee and W.-Y. Kim. Video summarization and retrieval system using face recog-nition and mpeg-7 descriptors. Image and Video Retrieval, 3115:179–188, 2004.

[37] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of the 23rd AnnualConference on Computer Graphics and Interactive Techniques, SIGGRAPH ’96, pages31–42, New York, NY, USA, 1996. ACM.

[38] C.-J. Lin. On the convergence of the decomposition method for support vector ma-chines. IEEE Transactions on Neural Networks, 12(6):1288–1298, 2001.

[39] X. Liu and T. Chen. Video-based face recognition using adaptive hidden markov models.In CVPR (1), pages 340–345. IEEE Computer Society, 2003.

[40] D. J. Kriegman M.-H. Yang and N. Ahuja. Detecting faces in images: A survey. IEEETransactions on Pattern Analysis and Machine Intelligence, 24(1):34–58, 2002.

[41] D. McCullagh. Call it super bowl face scan i.http://www.wired.com/politics/law/news/2001/02/41571 (visited 2013-11-26).

[42] M. C. Santana O. Deniz and M. Hernandez. Face recognition using independent com-ponent analysis and support vector machines. In AVBPA, volume 2091 of Lecture Notesin Computer Science, pages 59–64. Springer, 2001.

[43] K. Fukui O. Yamaguchi and K. Maeda. Face recognition using temporal image sequence.In Proceedings of the 3rd. International Conference on Face & Gesture Recognition, FG’98, pages 318–323, Washington, DC, USA, 1998. IEEE Computer Society.

[44] J. P. Hespanha P. N. Belhumeur and D. J. Kriegman. Eigenfaces vs. fisherfaces: Recog-nition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell.,19(7):711–720, 1997.

60 REFERENCES

[45] P. J. Phillips. Support vector machines applied to face recognition. In Advances inNeural Information Processing Systems 11, pages 803–809. MIT Press, 1999.

[46] C. J. Poelman and T. Kanade. A paraperspective factorization method for shape andmotion recovery. IEEE Trans. Pattern Anal. Mach. Intell., 19(3):206–218, 1997.

[47] V. Pavlovic R. Huang and D. N. Metaxas. A hybrid face recognition method usingmarkov random fields. In In Proceedings of ICPR 2004, pages 157–160, 2004.

[48] S.-Y. Kung S.-H. Lin and L.-J. Lin. Face recognition/detection by probabilistic decision-based neural network. IEEE Transactions on Neural Networks, 8(1):114–132, 1997.

[49] A. C. Tsoi S. Lawrence, C. L. Giles and A. D. Back. Face recognition: A convolutionalneural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997.

[50] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions onPattern Analysis and Machine Intelligence, 22:888–905, 1997.

[51] C. Stauffer and W. E. L. Grimson. Learning patterns of activity using real-time track-ing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:747–757,2000.

[52] K.-K. Sung and T. Poggio. Learning human face detection in cluttered scenes. ComputerAnalysis of Images and Patterns, 970:432–439, 1995.

[53] B. Takacs. Comparing face images using the modified hausdorff distance. PatternRecognition, 31(12):1873–1881, 1998.

[54] A. S. Tolba. A parameter-based combined classifier for invariant face recognition.Cybernetics and Systems, 31(8):837–849, 2000.

[55] A. S. Tolba and A. N. S. Abu-Rezq. Combined classifiers for invariant face recognition.Pattern Anal. Appl., 3(4):289–302, 2000.

[56] Carlo Tomasi. Shape and motion from image streams under orthography: A factoriza-tion method. International Journal of Computer Vision, 9:137–154, 1992.

[57] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. Int J DataWarehousing and Mining, 2007:1–13, 2007.

[58] M. A. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of CognitiveNeuroscience, 3(1):71–86, 1991.

[59] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. Computer Visionand Pattern Recognition, pages 586–591, 1991.

[60] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York,Inc., New York, NY, USA, 1995.

[61] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simplefeatures. In CVPR, volume 1, pages 511–518. IEEE Computer Society, 2001.

[62] P. Wagner. Face recognition with opencv. http://docs.opencv.org/trunk/modules/contrib/doc/facerec/facerec(visited2013-12-02).

REFERENCES 61

[63] W. Zhao and R. Chellappa. Sfs based view synthesis for robust face recognition. In Pro-ceedings of the Fourth IEEE International Conference on Automatic Face and GestureRecognition, pages 285–292, 2000.

[64] S. K. Zhou and R. Chellappa. Probabilistic human recognition from video. In ECCV(3), volume 2352 of Lecture Notes in Computer Science, pages 681–697. Springer, 2002.

Documents

Towards a Video Annotation System using Face Recognition · The nal goal would be to try to combine face detection and recognition with object tracking forwards and backwards in time