
Improvements in Pose Invariance and Local Description for Gabor-based 2D Face Recognition

Author: Daniel González-Jiménez
Advisor: José Luis Alba-Castro

A Thesis submitted for the degree of Doctor Europeus

Universidade de Vigo
Vigo, 2008


A mi familia


Abstract

Automatic face recognition has attracted a lot of attention, not only because of the large number of practical applications where human identification is needed, but also due to the technical challenges involved in this problem: large variability in facial appearance, non-linearity of face manifolds and high dimensionality are some of the most critical handicaps. There are two possible strategies for dealing with these challenges. The first is to construct a "good" feature space in which the manifolds become simpler (more linear and more convex). This scheme usually comprises two levels of processing: (1) normalize images geometrically and photometrically, and (2) extract features that are stable with respect to these variations (such as those based on Gabor filters). The second strategy is to use classification structures that are able to deal with non-linearities and to generalize properly. To obtain high performance, an algorithm may need to combine both strategies. In this Thesis we have tackled quite different problems throughout the complex face recognition process, proposing solutions that combine both schemes within the framework of Gabor-based face recognition.

Together with factors such as illumination and expression, differences in viewpoint are largely responsible for the appearance variability in face images. In this Thesis we have tackled the pose problem by proposing two different approaches based on a 2D linear model. These techniques take advantage of facial symmetry to overcome problems due to self-occlusion, and synthesize virtual images at specific viewpoints by means of texture mapping, obtaining results comparable to those of a 3D approach based on Morphable Models for horizontal rotations of up to 67.5°.
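The underlying idea, a linear point-distribution model of facial landmarks in which pose variation is (ideally) isolated in one eigenvector, can be sketched as follows. This is a minimal PCA illustration under stated assumptions (the landmark data and the choice of which eigenvector encodes left-right rotation are hypothetical), not the thesis implementation:

```python
import numpy as np

def fit_linear_shape_model(shapes):
    """PCA model of flattened 2D landmark meshes: shape ~ mean + P @ b.

    shapes: (n_samples, 2 * n_landmarks) array of (x, y) coordinates.
    Returns the mean shape and orthonormal eigenvectors P, with columns
    sorted by decreasing variance.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    cov = centered.T @ centered / len(shapes)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return mean, eigvecs[:, order]

def neutralize_pose(shape, mean, P, pose_axis):
    """Project a mesh onto the model, zero the coefficient of the
    eigenvector assumed (hypothetically) to encode left-right head
    rotation, and reconstruct a pose-normalized mesh."""
    b = P.T @ (shape - mean)
    b[pose_axis] = 0.0        # discard only the pose component
    return mean + P @ b
```

In the thesis, once a pose-corrected mesh is available, texture mapping (thin plate splines) renders the virtual face image onto it; the sketch above covers only the shape part of that pipeline.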

Some of the most successful face recognition approaches proposed to date are those based on the extraction of Gabor features. This choice is motivated both by biological reasons and by their optimal characterization in the space and frequency domains. Using Gabor filters as the recognition engine, we have proposed a method for extracting features from positions or regions that are somehow subject-specific, by exploiting individual face structure. This constitutes a new point of view with respect to classical methods, which extract features from a pre-defined (either rectangular or face-like) graph.
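As background, a Gabor jet at a facial point is the vector of responses of a bank of Gabor filters (several scales and orientations) centered at that point, and faces are compared through jet similarities. A minimal sketch, with illustrative filter parameters rather than the settings used in the thesis:

```python
import numpy as np

def gabor_kernel(freq, theta, sigma=2.0, size=15):
    """Complex 2D Gabor kernel: a Gaussian envelope modulating a
    complex plane wave of spatial frequency `freq` along direction `theta`."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    rotated = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.exp(1j * 2.0 * np.pi * freq * rotated)

def gabor_jet(image, point, freqs=(0.1, 0.2, 0.3), n_orient=4):
    """Magnitudes of the Gabor responses at one pixel, one inner
    product per (frequency, orientation) pair."""
    r, c = point
    jet = []
    for f in freqs:
        for k in range(n_orient):
            ker = gabor_kernel(f, np.pi * k / n_orient)
            half = ker.shape[0] // 2
            patch = image[r - half:r + half + 1, c - half:c + half + 1]
            jet.append(abs(np.sum(patch * ker)))
    return np.array(jet)

def jet_similarity(j1, j2):
    """Normalized dot product (cosine similarity) between two jets."""
    return float(j1 @ j2 / (np.linalg.norm(j1) * np.linalg.norm(j2) + 1e-12))
```

The subject-specific aspect proposed in the thesis lies in *where* the jets are extracted (points driven by individual face structure), not in the jet computation itself, which is standard.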

Continuing with Gabor-based approaches, and in order to obtain better performance,


we have empirically validated different state-of-the-art tools for combining local Gabor similarities, and proposed an evaluation of different distance measures for comparing Gabor features.

Despite the large number of papers dealing with Gabor-based recognition systems, no statistical model had been proposed or used for Gabor feature coefficients. In this Thesis we have studied the marginal statistics of coefficients extracted from face images, proposing the Generalized Gaussian distribution to model the characteristic non-normal behavior these features show. In addition, a multivariate characterization of Gabor coefficients has also been considered: a novel multivariate extension of the Generalized Gaussian has been proposed and tested with success in limited experiments.
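For reference, the univariate Generalized Gaussian density is f(x) = β / (2αΓ(1/β)) · exp(−(|x|/α)^β), with β = 2 giving the Gaussian and β = 1 the Laplacian. A sketch of a moments-based fit (one of the estimation routes described in Appendix D; the grid-search solver below is an illustrative choice, not the thesis code):

```python
import numpy as np
from math import gamma

def gg_pdf(x, alpha, beta):
    """Generalized Gaussian density; beta=2 -> Gaussian, beta=1 -> Laplacian."""
    coef = beta / (2.0 * alpha * gamma(1.0 / beta))
    return coef * np.exp(-(np.abs(x) / alpha) ** beta)

def gg_moment_fit(samples):
    """Moments-based estimate of (alpha, beta).

    Matches E|X| / sqrt(E[X^2]) = Gamma(2/b) / sqrt(Gamma(1/b) * Gamma(3/b))
    by a grid search over b, then recovers alpha from E[X^2].
    """
    x = np.asarray(samples, dtype=float)
    m1 = np.mean(np.abs(x))
    m2 = np.mean(x ** 2)
    target = m1 / np.sqrt(m2)

    def ratio(b):
        return gamma(2.0 / b) / (gamma(1.0 / b) * gamma(3.0 / b)) ** 0.5

    betas = [0.3 + 0.001 * i for i in range(2701)]   # search beta in [0.3, 3.0]
    beta = min(betas, key=lambda b: abs(ratio(b) - target))
    alpha = (m2 * gamma(1.0 / beta) / gamma(3.0 / beta)) ** 0.5
    return alpha, beta
```

Gabor coefficient magnitudes and real/imaginary parts typically yield β well below 2, which is the heavy-tailed, non-normal behavior the thesis models.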

Finally, in this Thesis we have implemented software for tracking faces throughout video sequences, thereby laying the necessary groundwork for developing face recognition systems from video. Recent results using this tracking module have been obtained in the context of pose-robust recognition from video and audio-video asynchrony detection.


Summary (Resumen)

Automatic face recognition has attracted great attention, not only because of the large number of practical applications that require identifying people, but also because of the technological challenge it entails: the large variability in facial appearance, the non-linearity of the subspaces where faces lie, and the high dimensionality of the problem are the main obstacles to building a reliable system.

Two strategies are available to address the problems mentioned above. The first is to use a feature space simpler than the original one (more linear and more convex). This scheme usually involves two processing levels: (1) geometric and photometric normalization, and (2) extraction of robust features (such as those derived from Gabor filters). The second strategy seeks classifiers able to cope with non-linearity and to generalize properly. An algorithm usually has to combine both strategies to achieve good performance. This Thesis addresses problems of very diverse nature within face recognition, proposing solutions that combine both schemes and always use Gabor filters as the recognition engine.

Together with factors such as illumination and expression, pose differences are largely responsible for the large changes in facial appearance. This Thesis addresses the pose problem with two algorithms based on a two-dimensional linear model. These techniques exploit facial symmetry to cope with occlusions caused by the rotation of the head itself, and synthesize virtual images at specific poses using texture mapping. The proposed systems obtain results comparable to those achieved by a three-dimensional algorithm based on Morphable Models for rotations of up to 67.5°.

Some of the best-known face recognition systems are those based on the extraction of Gabor features. This choice is motivated by biological reasons as well as by their optimal resolution in the space and frequency domains. Using Gabor filters as the recognition engine, we have proposed a method that extracts features at positions or regions that are in some way specific to each individual. This constitutes a new approach with respect to classical methods, which extract features at predefined points of the face. Continuing with Gabor-based algorithms, we have proposed a comparison of different techniques for combining local similarities (intramodal fusion). Likewise, an evaluation of different distance measures for comparing local Gabor features has been proposed.

Despite the large number of papers that use Gabor filters for face recognition, none of them has proposed (or used) a statistical model. This Thesis studies the marginal distribution of Gabor coefficients extracted from face images, proposing the Generalized Gaussian to model the non-normal behavior these coefficients exhibit. In addition, the multivariate statistical characterization of Gabor coefficients has also been studied, and a new formulation has been proposed and tested successfully in limited experiments.

Finally, this Thesis has produced face tracking software, thereby laying the groundwork for implementing video-based face recognition systems. Recent results using this tracking module have been obtained in the context of pose-robust recognition from video, as well as detection of asynchrony between the audio and video signals.


Acknowledgments

I am not quite sure where to begin, or rather with whom to begin these acknowledgments. I suppose the lines that follow should cover two distinct sides, a scientific one and an affective one, but the trouble is that over the years of this Thesis the two have become intertwined and hard to separate. For that reason, I will simply let whatever comes to mind flow freely, people and memories, trying to leave no one out.

The first people I believe deserve to be here are my parents, Feliciano and Montse, not only for the upbringing and education they have given me (which I sometimes forget :-) ), but also for their unconditional support in everything I have set out to do. A huge thank you to you both. And I do not think that is enough.

With the green gown, the stethoscope and the surfboard* we find Víctor González Jiménez. Sorry, Gríctor, but I think I am going to be the first Doctor in the family, haha. I have had wonderful times with him: just remembering how he recites aloud all the muscles and bones of the human body fills me with satisfaction. Jokes aside, I must say I have had great fun with his stories about the Nicolás Peña, the emergency room and gynecology, besides enjoying his medical knowledge and his culinary skills. And of course, I cannot forget to tell him that without him the trip to Hawaii would not have been the same: what would I have done on my own in Waikiki, on the North Shore of Oahu, on The Big Island or on the Kalalau Trail of Kauai? I sincerely hope to keep enjoying your company for a long time (we have no choice but to keep putting up with each other, come on!).

Another person who has been at my side all this time (except when I slip away to the occasional conference) is my girlfriend, Bea. She too is partly to blame for this Thesis having come through. Perhaps I have never told her, so here it goes: I have no words to thank you for all the support you give me, the confidence you have in me and the calm I feel from having you near. So many good moments crowd my head that I would need time and many pages to recount them all. Apart from that, I want you to know that you will not be able to take advantage of my sentimental moment: forget about having one more cat at home! ;-).

∗Better stick to auscultation, pardi-surfer


At this point, if you are reading this section at all, you will be saying: this wretch has forgotten me. Well, no, this paragraph is dedicated to you, José Luis Alba Castro, alias Bubi (my Thesis advisor, for those who do not know). I wanted to tell you that it has been a pleasure working at your side (some days more than others haha). Truly, I can do nothing but thank you. First, for your confidence in my/our work, for the ideas you contributed, and for the research "freedom" you have given me throughout these four years. Second, because I do not think there are many supervisors who let their doctoral students travel the world as you have allowed me to (New York, Hong Kong, Hawaii, Madeira, Texas and a very long etcetera). Third, because I still have not been able to forget the nights at the guest house in Manhattan or the one at the hotel in Freiburg hahaha. Even though to this day we are still looking for the odd .mat file lost God knows where (a clear symptom of how tidy we are), and despite the occasional scientific setback, our work has finally received the reward I believe it deserved, and this Thesis reflects it: thank you, Bubi! Likewise, I would like to thank Carmen for all her effort, patience and help with a thousand matters. Thank goodness you were there... haha. And of course, thanks to the people of tower A for their help and advice, especially Ignacio Alonso, Fernando Pérez, Paco and Julio Martín (ah!, and Carlos Mosquera for having "pushed" me into the world of face recognition quite a few years ago now).

As for workmates, I would almost rather not speak. What unpleasant people you can come across in life! :-)

Starting in A-001, formerly Inteligencia, I remember that when I arrived back in 2004 there were Enrique Alexandre, José Ángel, Leandro, Quique and Patrisia. Today only Quique remains, whom I must thank for his help over these years with countless Linux/Unix commands and tcsh scripts (among other things). I will not forget the jazz concert we went to that August in Paris, the soaking in Bruges and a few other trips: do you remember that room in Switzerland where three of us tried to sleep, or the beers in the Alien chairs? hahaha. Also worth noting are his integrity as a person (he will not switch to Windows despite the pressure of the whole office) and his capacity for absent-mindedness, haha. Marcos, 30 years old, born in Ourense: what a great discovery. He solves any kind of question, is fun, sporty, speaks languages, saves money, and is single and unattached. Girls, what are you waiting for? So, thank you very much for all the help you have given (even though you still have not managed to explain to me how to say "Por su parte" in English) and for introducing me to characters of the stature of Manolo, el Batu, El Payo Juan Manuel, etc. Without the shower of plots and the cultural touches you have accustomed us to, the office would not be the same :-). Now it is the turn of the only girl capable of putting up with us day after day: Marta. Thanks to her we know what a facet is and what a fast multipole method is, and we know scholars like Wilton and Burkholder. If you are going to dine at any restaurant/bar/tavern in Vigo or its surroundings, ask her for advice. She is

bound to know it! The latest addition is Norber, who has recently become a dad. A good lad, you grow fond of him quickly, though not as quickly as he would like :-). Finally, I would like to remember the rest of the people who have passed through here: Patricia, Carolina and Deborah. What good laughs we have had!

Going up to the top floor of the tower, we find the door of TSC-5. Here too there are several people who deserve to be in these pages. I will go in analphabetical order:

A special mention for Luis, the Prince of Watermarking. We have been sharing the day-to-day for many years now and, like it or not, it shows. I must thank you for your help with the countless questions and doubts you resolved, but above all for your friendship, your presence on the occasions that call for it, and for being the official reporter of TSC-5 haha.

Fran Campillo (aka Campanillo), guru of subjective tests. I am utterly convinced that one of those two voices sounded much better than the other: are you sure the configuration file I used was not wrong?? :-). What else can I say about him... that I appreciate him a lot, and he already knows there is one party with him and another without him. Next to this big lad, snugly fitted into his chair, we find el Abu, famous for the Diogo dance and his aversion to skirted tables. What a good lad you are!

Further to the right we find Gabriel Domínguez Conde (I too respect names), with his reintegrationist stories, his self-directed jokes and his hilarious monologues. Even if he says otherwise, he loves to play devil's advocate, and that turns lunchtime into a friendly discussion forum well worth witnessing hahaha. Apart from all this, I must say that I consider (freely and democratically) that you are a very good person and that I appreciate you. Ah!, when you found GabiTel (R), remember me :-).

Thanks to Fernando Piñeiro for answering my questions and solving the computer problems that have arisen over these years. And of course, for the incredibly funny stories with which you liven up our dinners, and for spreading the word about the famous Arce diet hahaha.

Two lads from the Data Hiding and Information Theory school who must be singled out are Pedro and Juan. To the first, two things: a) courage and good luck in everything he undertakes (which he is sure to achieve), and b) be careful with that canoe of his, lest he turn up one day run aground on the Cíes. To Juan, clearly influenced by the song "Enjoy the Silence" by Depeche Mode :-), I want to say: keep it up, one day you will reach the heights of MAMMUT haha. On the research side, I think it is me who is in no position to give him advice: I can only wish him the best of luck!

Eli, what can I say about you. That you are as good as gold! How great it is to remember the moments when you talked about Java Beans in the meetings and, above all, your cheerfulness, your jokes and the explanations in pseudo-English (Spiums haha). Greetings also to Gonzalo, who right now is on a research stay somewhere around Europe (or at least, that is what he told us :-)).

The latest additions (David and Bea) have fitted in well and are aiming high: David for his theories and cheerfulness, and Bea for being able to put up with us :-). Other lads who are no longer in that laboratory but who are well worth remembering for a multitude of reasons are Congui, Brandan, Armando and Yogu.

A thousand thanks to Manuele for his enormous help while I was in Alghero. Moreover, I must say that working with you was very productive and fun. The only problem is that now I miss the good Italian pasta, the island, Porto Ferro, the purpuzza of Mamoiada, the dinners at Renato's, the wild boars and the scent of the chair :-). I hope to work together again some day (you will have to excuse my bad Italian, I have only learned the swear words :-) ). Thanks also to Gio for her patience with the two of us and for her help. I should not finish without thanking all the people of the Porto Conte laboratory (how beautiful it is!!!): Enrico, Andrea, Massimo(s), Gavin, ...

Thanks are also due to the people at the Centrum voor Wiskunde en Informatica (Amsterdam), especially Hans, Ben, Eric and Onkar.

Thanks also to the people I have met over the course of this Thesis at the various workshops, summer schools, conferences, research stays, etc., and with whom I have struck up a good friendship: Manuel Jesús, Juanjo, Fede, the people of the CVC, those of the Biosecure Residential Workshop in Paris, Giulia, Massimo, Romain, etc.

Likewise, I would like to thank the friends who have allowed me to disconnect from the Thesis and enjoy good times, especially the usual gang: Berto, Yosi and Ruba (Trojans), Diego (long, good chats), my spearfishing companions Fran and Moi (tremble, sea bass, summer is coming haha), and also the PDI-PAS football crowd (Miguel, Obi, Ardao, Jorge, Guille, Alberto, Álvaro, etc.).

Since this Thesis is about face recognition, it seemed a good idea to remember you all with photos of your faces, shown in Figure 1. As can be seen in this figure, an automatic face recognition system must cope with many problems: pose, illumination, expression, occlusions, hahaha. Here we go.

Alghero, Sardegna, 23 November 2007 / Vigo, 14 March 2008


Figure 1: People


Contents

1 Introduction
  1.1 Introduction to Automatic Face Recognition
  1.2 Major Challenges in Face Recognition
  1.3 Overview of Face Recognition Methods
  1.4 Face Recognition Evaluations, Databases, Benchmarks and Competitions
    1.4.1 Definitions
  1.5 Motivations and Objectives
    1.5.1 Contributions
  1.6 Outline

2 ViewPoint-Robust 2-D Face Recognition
  2.1 Introduction
  2.2 A Point Distribution Model For Faces
  2.3 Pose Eigenvectors and Pose Parameters
    2.3.1 Theoretical evidence on the fact that symmetric meshes help to decouple left-right rotations and non-rigid factors
    2.3.2 Experiment on a video-sequence: Decoupling of pose and expression
    2.3.3 Experiment on the CMU PIE database
  2.4 Virtual Face Synthesis
    2.4.1 Thin Plate Splines Warping
    2.4.2 Synthesizing virtual face images across pose using Thin Plate Splines
  2.5 Pose Correction
    2.5.1 Warping to Mean Shape (WMS)
    2.5.2 Normalizing to Frontal Pose and Warping (NFPW)
    2.5.3 Pose Transfer and Warping (PTW): Warping one image to adopt the other one's pose
    2.5.4 Taking advantage of facial symmetry
  2.6 Feature extraction
  2.7 Face authentication on the XM2VTS database
    2.7.1 Statistical Analysis of the Results
  2.8 Results with automatic fitting via Invariant Optimal Features Active Shape Models (IOF-ASM)
  2.9 Face Identification on the CMU PIE database
    2.9.1 Experimental setup
    2.9.2 Our results
    2.9.3 Other researchers' results
    2.9.4 Testing the system with large pose changes
  2.10 Conclusions

3 Shape-driven Gabor Jets
  3.1 Introduction
  3.2 Ridges and Valleys Detector
  3.3 Shape sampling
  3.4 Extracting textural information
  3.5 Shape Contexts
    3.5.1 Invariance to scaling, translation and rotation
  3.6 Texture dissimilarity
  3.7 Measuring dissimilarity between sets of points
  3.8 Combining Shape and Texture
  3.9 Testing the system against lighting and expression variations
    3.9.1 Database
    3.9.2 Facial expression changes
    3.9.3 Illumination variation
  3.10 Face Authentication on the XM2VTS database
    3.10.1 Comparison with EBGM
    3.10.2 Measuring GD1 and GD2 performance
    3.10.3 Shape and texture combination results
    3.10.4 Results from other researchers
    3.10.5 Accuracy-based Feature Selection (AFS)
  3.11 Face Authentication on the BANCA database
  3.12 Distance Measures for Gabor Jets Comparison
    3.12.1 Distance between faces
    3.12.2 Results on BANCA's MC protocol
  3.13 Conclusions and further research

4 Gabor Jets Similarity Fusion
  4.1 Introduction
  4.2 Mesh Configurations
  4.3 Jet Similarity Fusion
    4.3.1 Accuracy-based Feature Selection (AFS) and Best Individual Features (BIF)
    4.3.2 Sequential Floating Forward Search (SFFS)
    4.3.3 LDA-based fusion
    4.3.4 Support Vector Machines
    4.3.5 Adaboosted MLP ensemble
  4.4 Database and Experimental results
    4.4.1 Database and Experimental Setup
    4.4.2 Evaluating AFS and BIF approaches
    4.4.3 Evaluating SFFS
    4.4.4 Adaboosted MLP ensemble, SVMs and LDA-based
  4.5 Conclusions

5 Modeling Marginal Distributions of Gabor Coefficients
  5.1 Introduction
  5.2 The Face Recognition system
  5.3 Modeling Marginal Distributions of Gabor coefficients
    5.3.1 Univariate Generalized Gaussians
    5.3.2 Modeling Gabor coefficients with univariate GG's
    5.3.3 Bessel K Form Densities
    5.3.4 Analyzing Estimated GG Parameters
  5.4 Coefficient quantization by means of Lloyd-Max algorithm
  5.5 Face Verification on the XM2VTS database
  5.6 Conclusions and further research

6 Recent Results
  6.1 Automatic Face Alignment (Still Images and Video Sequences)
    6.1.1 Face Alignment on Still Images
    6.1.2 Face Tracking on Video Sequences
  6.2 Multivariate Generalized Gaussians
    6.2.1 Multivariate Generalized Gaussian Formulation
    6.2.2 Poly-β Multivariate Generalized Gaussian
  6.3 Generalized Gaussians for Hidden Markov Models
    6.3.1 Fundamentals of HMMs
    6.3.2 Experimental evaluation

7 Conclusions
  7.1 Future Research

A Face Databases
  A.1 AR Face Database
  A.2 BANCA Database
    A.2.1 BANCA Protocols
  A.3 CMU PIE Database
  A.4 XM2VTS Database
    A.4.1 Lausanne Protocol for the XM2VTS Database

B Statistical Significance of TER Measures

C Active Shape Models with Invariant Optimal Features (IOF-ASM)

D Estimation of Univariate Generalized Gaussian Parameters
  D.1 Maximum Likelihood Parameter Estimation
  D.2 Moments-based Parameter Estimation

E (Mono-β) Multivariate Generalized Gaussian Parameter Estimation
  E.1 Estimation of β
  E.2 Estimation of Σ

F Poly-β Multivariate Generalized Gaussian Parameter Estimation
  F.1 Partial derivatives ∂LL/∂β_d
  F.2 Partial derivative ∂LL/∂Σ

G GG-HMM Training Algorithm

H Resumen en Castellano
  H.1 Reconocimiento de Caras 2-D Robusto a Cambios de Pose
    H.1.1 Point Distribution Model: Autovectores de pose
    H.1.2 Generación de imágenes sintéticas
    H.1.3 Corrección de pose y Reconocimiento Robusto de Caras
  H.2 Extracción y comparación de respuestas de Gabor. Fusión Intramodal
  H.3 Modelado estadístico de los coeficientes Gabor
  H.4 Conclusiones y Líneas Futuras

Bibliography


List of Tables

2.1 Relationship between b2 and the angle of rotation θ . . . 33
2.2 False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total Error Rate (TER) over the test set for different methods . . . 52
2.3 Confidence interval around ∆HTER = HTERA − HTERB for Zα/2 = 1.645 . . . 52
2.4 False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total Error Rate (TER) over the test set for our method and automatic approaches from [Messer et al., 2003] . . . 54
2.5 Identification rates (%) on the CMU PIE database: No pose correction . . . 55
2.6 Identification rates (%) on the CMU PIE database: NFPW without facial symmetry . . . 56
2.7 Identification rates (%) on the CMU PIE database: NFPW plus facial symmetry . . . 56
2.8 Identification rates (%) on the CMU PIE database: PTW plus facial symmetry . . . 56
2.9 Identification rates (%) on the CMU PIE database: 3D Morphable Model with LiST fitting algorithm [Romdhani et al., 2002] . . . 58
2.10 Identification rates (%) on the CMU PIE database: Other results . . . 58
2.11 Identification rates (%) on the CMU PIE database: Visionics' FaceIt results [Gross et al., 2001] . . . 58
2.12 Identification rates (%) on the CMU PIE database: Testing the system with extreme pose changes . . . 61
2.13 Identification rates (%) on the CMU PIE database: 3D Morphable Model with LiST fitting algorithm [Romdhani et al., 2002] . . . 62

3.1 Sketch Distortion (SKD) between the face images from Figures 3.4 to 3.5 . . . 80
3.2 Face Authentication on the XM2VTS database. False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total Error Rate (TER) over the test set for our Shape-driven approach (without sketch distortion) and the EBGM algorithm . . . 89


3.3 Face Authentication on the XM2VTS database. False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total Error Rate (TER) over the test set for GD1 and GD2 computed from EBGM-SC and SDGJ-SC . . . 90

3.4 Results from other researchers on XM2VTS database. . . . . . . . . . . . 93

3.5 Results reported on the BANCA database from other researchers. . . 95

3.6 Our results on the BANCA database on configurations MC and P with λ0 = 1 and λ1 = λ2 = 0 . . . 95

3.7 Average WER (%) using several distance measures D(J⃗p⃗i, J⃗q⃗ξ(i)) to compare jets and different resolutions of input images (jets are not normalized) . . . 99
3.8 Average WER (%) using several distance measures D(J⃗p⃗i, J⃗q⃗ξ(i)) to compare jets and different resolutions of input images (jets are normalized to have unit L1 norm) . . . 99
3.9 Average WER (%) using several distance measures D(J⃗p⃗i, J⃗q⃗ξ(i)) to compare jets and different resolutions of input images (jets are normalized to have unit L2 norm) . . . 100

4.1 Baseline results obtained when fn ≡ median (already shown in Section 3.10.1) for the Rectangular, face-like, and shape-driven approaches . . . 111
4.2 Total Error Rate (TER) using different fusion techniques: Median, LDA-based, Adaboosted MLP ensemble (MLP-AB), MLP ensemble built with Adaboost using the similarities selected by LDA (LDA-AB), and Support Vector Machines (SVM) . . . 118
4.3 Confidence interval (%) around ∆HTER = HTERA − HTERB for Zα/2 = 1.645 for the different fusion methods according to configuration I . . . 118
4.4 Confidence interval (%) around ∆HTER = HTERA − HTERB for Zα/2 = 1.645 for the different fusion methods according to configuration II . . . 119

5.1 Face Verification on the XM2VTS database. False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total Error Rate (TER) over the test set using both raw and compressed data. Moreover, approximate storage saving is provided for each quantization level . . . 136

6.1 Accuracies of G-HMM and GG-HMM for the two EEG experiments. . 154

6.2 Accuracies of G-HMM and GG-HMM for the three face recognition experiments . . . 156


H.1 Verificación en la base de datos XM2VTS. Tasa de falsa aceptación (FAR), falso rechazo (FRR) y tasa total de error (TER) para nuestro método y el algoritmo EBGM . . . 205
H.2 Porcentajes de error usando distintas medidas de distancia y normalización L1 para los jets . . . 206
H.3 Tasas de error usando diversas técnicas de fusión . . . 207
H.4 Verificación en XM2VTS. Tasas de error obtenidas para los coeficientes originales y los comprimidos con distinto número de niveles de cuantificación (NL) . . . 208


List of Figures

1 Gente . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

1.1 Scheme of a general face recognition system, highlighting the contributions we made . . . 14

2.1 Position of the 62 landmarks used in this work on an image from the XM2VTS database . . . 22
2.2 Effect of changing the value b1 on the reconstructed shapes. φ1 controls the up-down rotation of the face . . . 23
2.3 Effect of changing the value b2 on the reconstructed shapes. Upper row: Coupling of both rigid (left-right rotation) and non-rigid (eyebrow movement and lip width) facial motion within the second eigenvector φ2. Lower row: When using virtual symmetric meshes to augment the training set, expression changes are not noticeable in φ2 . . . 24
2.4 Coefficients from eigenvector φ2 (obtained with the original training set S) grouped by the specific facial feature they affect . . . 26

2.5 Upper row: Reconstructed shapes using φ2. Bottom row: Reconstructed shapes using φ′2. Clearly, X(α) = X + αφ2 has the same pose as X′(α) = X′ + αφ′2, and the same non-rigid information as X′(−α) = X′ − αφ′2 . . . 29
2.6 Covariance matrices plots: a) C, b) C′, c) C + C′, d) C − C′. From these plots (C ≈ C′) and the fact that the non-rigid contribution is smaller than the rigid one, we can assume that (C − C′)vNRV is not significant compared to (C + C′)vLRV . . . 30
2.7 Coefficients from eigenvector φ2,aug (obtained with the augmented training set Saug) grouped by the specific facial feature they affect . . . 31
2.8 Experiment on the video sequence. Each row shows, for a given frame f, the original shape X(f) and the reconstructed shapes (Xothers(f) and Xpose(f)) using bothers(f) and bpose(f) respectively. Clearly, Xothers(f) controls expression and identity while Xpose(f) is mostly responsible for rigid changes . . . 33


2.9 Images taken from all cameras of the CMU PIE database for subject 04006. The 9 cameras in the horizontal sweep are each separated by about 22.5◦ [Sim et al., 2003] . . . 34
2.10 Intra- and inter-pose variances for each of the shape parameters . . . 35
2.11 Inter-pose variance divided by intra-pose variance for each of the shape parameters. Clearly, the b2 parameter we identified as responsible for left-right rotations presents the highest ratio . . . 36
2.12 Examples of synthesized face images across azimuth. In each row, the frontal face is warped onto virtual meshes obtained by sweeping b2 within a range of values . . . 39

2.13 Original face images across azimuth (±22.5◦ and ±45◦). By comparing the two leftmost and the two rightmost columns of Figure 2.12 with the faces shown here, we can see that the virtual images are very similar to the original ones . . . 40

2.14 Examples of synthesized face images across elevation. In each column, the frontal face is warped onto virtual meshes obtained by sweeping b1 within a range of values . . . 41
2.15 Slight coupling between pose changes and facial expression when the training set has not been properly chosen . . . 42
2.16 Upper row: Identity is not modified when changing the value of the up-down parameter (a good training set has been chosen). Lower row: Identity is clearly distorted when changing the value of the up-down parameter (not enough up-down tilting examples in the training set) . . . 43
2.17 Upper row: Although pose is forced to change, synthetic images maintain the original worried expression. Lower row: The same occurs when the subject smiles . . . 43

2.18 Images from subject 013 of the XM2VTS. Left: Original image. Right: Image warped onto the average shape. Observe that subject-specific information has been reduced (especially in the lips region) . . . 45

2.19 Block diagram for pose correction using NFPW. After face alignment, the obtained meshes are corrected to frontal pose (Pose Normalization block), and virtual faces are obtained through Thin Plate Splines (TPS) warping. Finally, both synthesized images are compared. It is important to note that the processing of the training image could (and should) be done offline, thus saving time during recognition . . . 46
2.20 Block diagram for pose normalization using PTW. After face alignment, mesh A adopts the pose of mesh B (Pose Transfer block), and virtual face A is obtained through Thin Plate Splines (TPS) warping. Finally, faces A and B are compared . . . 48
2.21 Block diagram for pose normalization using NFPW and facial symmetry . . . 49
2.22 Taking advantage of facial symmetry in PTW . . . 50


2.23 Near profile image from subject 04004 of the PIE database. . . . . . . 60

2.24 Effect of large b2 values on the reconstructed shapes. The “occluded” contour, marked with a blue dashed line, seems to disappear behind the visible features . . . 60
2.25 Upper row: Examples of virtual images using the whole set of 62 landmarks. As we can see, serious distortions are induced in the presence of large rotations. Lower row: Synthesized faces using the set of visible landmarks. Clearly, images under large pose changes are much more realistic and seem to preserve identity information correctly . . . 62
2.26 First and third columns show original images at poses 14 and 02 respectively, while second and fourth columns present the corresponding synthesized images using the variant of the PTW method introduced in Section 2.9.4 . . . 64

3.1 Applying the ridges and valleys detector to the same face image using two different smoothing filters. Left: Original Image. Center-left: Valleys and ridges image. Center-right: Thresholded ridges image. Right: Thresholded valleys image . . . 72
3.2 Left: Original rectangular dense grid. Center: Sketch. Right: Grid adjusted to the sketch . . . 73
3.3 Log-polar histogram located over a point of the face: shape context . . . 75
3.4 Top: Left: First image from subject A. Center: Sketch. Right: Grid adjusted to the sketch. Bottom: Left: Second image from subject A. Center: Sketch. Right: Grid adjusted to the sketch . . . 78
3.5 Top: Left: First image from subject B. Center: Sketch. Right: Grid adjusted to the sketch. Bottom: Left: Second image from subject B. Center: Sketch. Right: Grid adjusted to the sketch . . . 79

3.6 TER (Evaluation and Test sets) against λ . . . . . . . . . . . . . . . 81

3.7 Face images from the AR face database. Top row shows images from the first session: a) Neutral, b) Smile, c) Anger, d) Scream, e) Left light on, f) Right light on, and g) Both lights on, while bottom row presents the shots recorded during the second session: h)-n) . . . 82
3.8 System performance with expression variations. Gallery: shot a) (neutral face from first session). Probe: shots b), c) and d). Clearly, the system only fails to recognize screaming faces . . . 84
3.9 Top row: ridges and valleys for the neutral expression. Bottom row: ridges and valleys for the screaming expression. Although the position and shape of the sketch lines obviously vary with expression, these lines keep representing the main facial features in a consistent manner . . . 85


3.10 Top row: ridges and valleys for the neutral expression with diffuse light. Bottom row: ridges and valleys for the neutral expression when both lights are switched on. Although the obtained sketch is not completely invariant to lighting changes (for instance, some valleys from the nose region -top row, purple- disappear in the presence of strong lighting, valleys associated with “wrinkles” appear -bottom row, blue- and some ridges change -top and bottom rows, red-), the reported results (see text) demonstrate that the system achieves a robust behaviour under the tested conditions . . . 86

3.11 System performance under lighting variations. Gallery: shot a). Probe: shots e), f) and g) . . . 87
3.12 Set of points used for jet extraction in the EBGM approach. Blue triangles represent manually annotated vertices, whilst red dots represent the middle point connecting manual vertices . . . 88
3.13 Rectangular rigid grid . . . 88
3.14 Left: Original set of shape-driven points for client 003 of the XM2VTS database. Right: Set of preserved shape-driven points after accuracy-based selection (Section 3.10.5) . . . 92
3.15 Examples of images from the controlled, degraded and adverse conditions of the BANCA database . . . 94

4.1 MLP architecture chosen for the experiments . . . 110
4.2 Accuracy-based feature selection (AFS): effect of sweeping τ(%) on system performance (TER(%) measures are reported) for the shape-driven, manual face-like mesh and rectangular grid methods in configurations I (left) and II (right) . . . 112
4.3 Best Individual Features (BIF) with criterion A: effect of sweeping the number K of best selected features on system performance (TER(%) measures are reported) for the shape-driven, manual face-like mesh and rectangular grid methods in configurations I (left) and II (right) . . . 113
4.4 BIF with two different criteria for feature selection. Criterion A: Classification accuracy, and Criterion B: Separation between client and impostor similarities. Effect of sweeping K on system performance (TER(%) measures are reported) for the shape-driven mesh in configurations I (left) and II (right) . . . 114
4.5 SFFS with two different criteria for feature selection. Criterion A: Classification accuracy, and Criterion B: Separation between client and impostor similarities. Effect of sweeping K on system performance (TER(%) measures are reported) for the shape-driven mesh in configurations I (left) and II (right) . . . 115


4.6 SFFS with criterion B: effect of sweeping the number K of selected features on system performance (TER(%) measures are reported) for the shape-driven, manual face-like mesh and rectangular grid methods in configurations I (left) and II (right) . . . 116
4.7 Global SFFS vs. Global BIF (Criterion B is used in both selection schemes): effect of sweeping the number K of best selected features on system performance (TER(%) measures are reported) for the shape-driven mesh in configuration I . . . 117

5.1 Real part of the set of 40 (8 orientations × 5 scales) Gabor filters used in this Thesis . . . 123

5.2 Rectangular grid over the preprocessed (geometrically and photometrically normalized) face image. At each node, a Gabor jet with 40 coefficients is computed and stored . . . 124
5.3 Effect of β on the univariate GG distribution . . . 125
5.4 Histogram for coefficient g34 along with the fitted GG . . . 126
5.5 Kullback-Leibler and χ2 distances between the fitted GG and the data for both real and imaginary parts of each Gabor coefficient . . . 127
5.6 Examples of observed histograms (on a log scale) along with the BKF and GG fitted densities . . . 128
5.7 Left: Mean KL distance between observed histograms and the two estimated densities (GG and BKF). Right: Associated standard deviation . . . 129
5.8 Left: Mean KL distance between observed histograms and the two estimated densities (GG fitted via a moments-based method and BKF). Right: Associated standard deviation . . . 130
5.9 Mean KL distance between observed histograms and the two estimated densities (GG fitted via a moments-based method and GG fitted via ML) . . . 130
5.10 Obtained β and σ GG parameters for both real and imaginary parts of each Gabor coefficient . . . 132
5.11 Shaping factors β and standard deviations σ for the real part of Gabor coefficients grouped by scale subbands . . . 132
5.12 Shaping factors β and standard deviations σ for the real part of Gabor coefficients grouped by orientation subbands . . . 133
5.13 Face reconstruction [Potzsch et al., 1996] using original and quantized coefficients . . . 134

6.1 Preliminary results on automatic face fitting on an image from the CMU PIE database. Left: initialization. Center: Fitting after 10 iterations. Right: Final fitting . . . 138


6.2 Example of a tracked face through a sequence from the BANCA database. Extraction of lip coordinates for audio-video asynchrony detection [Argones Rua et al., 2008] . . . 140

6.3 Similarity scores with and without pose correction in a video sequence . . . 141
6.4 Comparison of ΣML and Σ0 for modeling joint statistics of Gabor coefficients . . . 143
6.5 Comparison of ΣML and Σ0 when data are sampled from a 2-D Laplacian distribution [Eltoft et al., 2006] . . . 144
6.6 Comparison of ΣML and Σ0 when data are sampled from a 2-D Gaussian distribution . . . 144
6.7 Modeling joint statistics of Gabor coefficients (1,12). Contour probability lines for the 2-D histogram are shown on top. The following statistical models are displayed: Mono-β MGG (middle-left), Poly-β MGG (middle-right), GMM (2 Gaussians, bottom-left), and GMM (10 Gaussians, bottom-right) . . . 147
6.8 Modeling joint statistics of Gabor coefficients (13,40). Contour probability lines for the 2-D histogram are shown on top. The following statistical models are displayed: Mono-β MGG (middle-left), Poly-β MGG (middle-right), GMM (2 Gaussians, bottom-left), and GMM (10 Gaussians, bottom-right) . . . 148
6.9 Generating HMMs for the synthetic problems: (a) first experiment, (b) second experiment. Note that N(µ,σ) represents a Gaussian distribution with mean µ and variance σ; U[a,b] represents a uniform distribution in the interval [a, b] . . . 151
6.10 Synthetic experiment 1. Generating emission functions are all Gaussian . . . 152
6.11 Synthetic experiment 2. Underlying data is not Gaussian . . . 153
6.12 Some face images used for testing . . . 155
6.13 Fitted mesh on one frame . . . 155

A.1 Face images from the AR face database. Top row shows images from the first session: a) Neutral, b) Smile, c) Anger, d) Scream, e) Left light on, f) Right light on, and g) Both lights on, while bottom row presents the shots recorded during the second session: h)-n) . . . 164
A.2 Examples of images from the controlled, degraded and adverse conditions of the BANCA database . . . 165
A.3 Left: Setup of the CMU 3D Room [Kanade et al., 1998]. Right: Diagram of the locations of the cameras, flashes and the head of the subject . . . 167
A.4 Images taken from all cameras of the CMU PIE database for subject 04006. The 9 cameras in the horizontal sweep are each separated by about 22.5◦ [Sim et al., 2003] . . . 168


A.5 Frontal face images from the XM2VTS database . . . . . . . . . . . . 169

H.1 Posición de los 62 nodos usados en esta Tesis . . . 193
H.2 Efecto de cambiar el valor de b1 en las mallas reconstruidas. φ1 controla la rotación arriba-abajo . . . 193
H.3 Efecto de cambiar el valor de b2 en las mallas reconstruidas. φ2 controla la rotación izquierda-derecha . . . 194
H.4 Ejemplos de imágenes sintéticas en rotación de azimuth . . . 195
H.5 Ejemplos de imágenes sintéticas en rotación arriba-abajo . . . 197
H.6 Arriba: La identidad no se modifica cuando se escoge un buen conjunto de entrenamiento. Abajo: La identidad se distorsiona cuando no se escoge un buen conjunto de entrenamiento . . . 198
H.7 Diagrama de bloques para corrección de pose con NFPW. Las dos mallas son corregidas a pose frontal (bloque Pose Normalization), y caras virtuales frontales son generadas utilizando Thin Plate Splines (TPS) . . . 198
H.8 Diagrama de bloques para corrección de pose con PTW. La malla A adopta la pose de la malla B (bloque Pose Transfer), y una cara virtual A se genera a través de Thin Plate Splines (TPS). Finalmente, se comparan las caras A y B . . . 199
H.9 Utilizando simetría facial en NFPW: La imagen original y la reflejada se mapean a la malla frontal sintética, y posteriormente ambas versiones son mezcladas usando máscaras . . . 200
H.10 Utilizando simetría facial en PTW. Previa a la transferencia de pose, se observan los valores de rotación horizontal de ambas caras (parámetros b2). Si son de signo contrario, se refleja una de las caras y posteriormente se hace la transferencia de pose . . . 201
H.11 La primera y la tercera columnas muestran imágenes originales en ±67.5◦ respectivamente, mientras que la segunda y la cuarta presentan las caras sintéticas correspondientes . . . 202
H.12 Sistema de extracción de respuestas de Gabor utilizando crestas y valles . . . 203

H.13 Imágenes de la AR face database usadas en los experimentos . . . 204
H.14 Prestaciones con variaciones de expresión. Claramente, el sistema solo falla reconociendo caras con la boca abierta . . . 205
H.15 Prestaciones con variaciones de iluminación . . . 205
H.16 Histograma del coeficiente 34 junto con la Gaussiana generalizada ajustada . . . 208
H.17 Modelado estadístico del histograma bidimensional de los coeficientes Gabor (1,12). Contornos de probabilidad del histograma (arriba) y modelo ajustado (abajo) . . . 209


H.18 Modelado estadístico del histograma bidimensional de los coeficientes Gabor (13,40). Contornos de probabilidad del histograma (arriba) y modelo ajustado (abajo) . . . 210
H.19 Resultados preliminares en ajuste automático de la malla. Izquierda: Inicialización. Centro: Ajuste tras 10 iteraciones. Derecha: Ajuste final . . . 212
H.20 Ejemplo de seguimiento de caras en la base de datos BANCA. Extracción de coordenadas de labios para detectar asincronía entre audio y vídeo [Argones Rua et al., 2008] . . . 213


Chapter 1

Introduction

Contents

1.1 Introduction to Automatic Face Recognition . . . 1

1.2 Major Challenges in Face Recognition . . . . . . . . . . . 3

1.3 Overview of Face Recognition Methods . . . . . . . . . . 4

1.4 Face Recognition Evaluations, Databases, Benchmarks and Competitions . . . 8

1.4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.5 Motivations and Objectives . . . . . . . . . . . . . . . . . 11

1.5.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.1 Introduction to Automatic Face Recognition

Automatic recognition of people has received increasing attention from the Computer Vision, Machine Learning, Pattern Recognition and Computer Graphics communities during the last decade [Gong et al., 2000, Zhao et al., 2003, Li and Jain, 2005, Ross et al., 2006]. Among other reasons, the need for secure access controls, automated crowd surveillance, tools for subject retrieval in large databases, internet communication, and computer entertainment has powered the establishment of a new research and technology area known as biometric recognition, or simply biometrics. The main goal of biometrics is to automatically discriminate between individuals based on one or more signals derived from their physical or behavioral traits, such as face, iris, ear, fingerprint, voice, palm, written signature, gait, etc. Although the earliest attempts in biometrics date back to the 70's ([Goldstein et al., 1971], [Kanade, 1973], [Atal, 1976]), it was not until the last decade that biometrics became established as a specific research area. This fact is clearly evidenced by the publication of reference texts such as [Jain et al., 1999, Ross et al., 2006, Maltoni et al., 2003, Li and Jain, 2005], the existence of specific conferences (such as Audio- and Video-based Biometric Person Authentication, the International Conference on Biometrics, and Automatic Face and Gesture Recognition), journals partially devoted to related topics (IEEE Transactions on Information Forensics and Security [IEEE-TIFS, 2006]), common benchmarks, protocols and evaluations (FERET [Phillips et al., 2000b], Face Recognition Vendor Test (FRVT) [FRVT, 2000], Face Recognition Grand Challenge (FRGC) [FRGC, 2004], XM2VTS [Messer et al., 1999, Luttin and Maître, 1998], BANCA [Bailly-Bailliere et al., 2003]), international projects (COST-275 [COST-275, 2001], SecurePhone [SecurePhone, 2004], BioSecure [BioSecure, 2004]), international consortia (European Biometrics Forum (EBF) [EBF, 2003]), standardization efforts (BioAPI [BioAPI, 1998], ISO SC 37 [SC37, 2002]), and increasing attention both from government (DoD Biometrics [DoD, 2000]) and industry (Viisage [Viisage, 1993] and Identix [Identix, 1982], both merged into L-1 Identity Solutions [L-1, 2005], Cognitec [Cognitec, 2002], Omniperception [Omniperception, 2001], Neven Vision (formerly Eyematic and now part of Google), Toshiba, Samsung, Sagem, etc.). According to the International Biometric Group (IBG) [IBG, 1996], global biometric revenues were 719 million dollars in 2003 and are expected to reach 4.6 billion dollars this year (2008). This drastic increase is driven, among other reasons, by the support of governments and the creation of specific standards for biometric data.

Among the different biometric attributes mentioned above, the face-based modality seems to be a natural way to perform recognition. In fact, human beings continuously recognize people in everyday life by their facial attributes. Although extremely reliable biometric recognition techniques exist, such as iris-, retina- or fingerprint-based systems, these methods rely on the active cooperation of the participants, whereas a personal identification system based on the analysis of frontal or profile facial images is often effective without the subject's cooperation or knowledge, i.e., face recognition provides low intrusiveness with relatively good accuracy, as pointed out by the IBG [IBG, 1996]. Moreover, among the six biometric modalities tested in [Hietmeyer, 2000], facial features scored the highest compatibility in a Machine Readable Travel Documents (MRTD) system based on a number of evaluation factors [Hietmeyer, 2000] such as enrollment, renewal, machine requirements and public perception. Finally, several technological advances are increasingly supporting the use of face recognition nowadays: i) digital cameras capturing either still images or video sequences are becoming less expensive, ii) mobile devices (cellular phones, PDAs) with integrated cameras are more and more common, and iii) the number of Closed Circuit TV (CCTV) networks for automated surveillance has increased considerably. The constantly growing number of vendors involved in the development of face recognition technology is a proof of the success of facial biometrics. Currently, the Face Recognition Homepage (http://www.face-rec.org) lists more than 20 vendors within this field, some of which participated in the last Face Recognition Vendor Test (FRVT) in 2006.

A general statement of the face recognition problem can be formulated as follows: given a still image or a video sequence of a scene, recognize one or more individuals in the imaged scene using a stored database of faces. To achieve this goal, several steps must be performed: i) face detection/tracking and face alignment, ii) face pre-processing to cope with variations in illumination, pose, expression, etc., iii) facial feature extraction, and finally iv) comparison (matching). An excellent survey of face recognition methods (with extensions to cope with lighting, view-point variations, etc.) can be found in [Zhao et al., 2003].
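
As an illustration only, the four stages above can be sketched as a processing chain. The function bodies below are hypothetical placeholders (an identity detector, zero-mean/unit-variance normalization, raw pixels as features, normalized correlation as matcher), not the methods developed in this Thesis:

```python
import numpy as np

def detect_and_align(image: np.ndarray) -> np.ndarray:
    """Stage i: locate the face and warp it to a canonical geometry.
    Placeholder: assume the input is already a cropped, aligned face."""
    return image

def preprocess(face: np.ndarray) -> np.ndarray:
    """Stage ii: photometric normalization (here, zero mean / unit variance)."""
    face = face.astype(float)
    return (face - face.mean()) / (face.std() + 1e-8)

def extract_features(face: np.ndarray) -> np.ndarray:
    """Stage iii: map the face to a feature vector (here, simply raw pixels)."""
    return face.ravel()

def match(probe: np.ndarray, template: np.ndarray) -> float:
    """Stage iv: similarity score via normalized correlation."""
    denom = np.linalg.norm(probe) * np.linalg.norm(template) + 1e-8
    return float(probe @ template / denom)

def recognize(image: np.ndarray, template: np.ndarray) -> float:
    """Full chain: detection/alignment -> pre-processing -> features -> matching."""
    face = preprocess(detect_and_align(image))
    return match(extract_features(face), template)
```

A real system would replace each placeholder with one of the techniques surveyed in this chapter (e.g. Adaboost-based detection, Gabor features, a trained classifier as matcher).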

1.2 Major Challenges in Face Recognition

Although significant progress has been made in the field during the last decade, the general face recognition problem still remains unsolved, i.e. there exist factors that cause face-based biometric systems to fail, such as lighting conditions, view-point variations, expressions, occlusions, facial hair, aging, etc. In addition to these, various imaging parameters such as resolution, compression, aperture, exposure time, lens aberrations and sensor spectral response also increase the variability within images belonging to the same subject (intra-subject variability). All the mentioned factors are mixed in the image data and, as pointed out in [Moses et al., 1994, Adini et al., 1997], “the variations between images of the same face due to illumination and viewing direction are almost always larger than the image variation due to change in face identity”. This appearance variability makes it difficult to extract the intrinsic information (i.e. identity) of the face objects from their respective images, and has encouraged researchers working in different fields to focus on this difficult task. In addition to those research communities already mentioned at the beginning of the section, there has also been an increasing interest from the cognitive sciences (e.g. Cognitive Psychology) in face recognition mechanisms in humans [Burton et al., 2005, Sinha et al., 2006, Jenkins and Burton, 2008]. Advances in this field provide a deeper understanding of such a complex process, supporting the development of novel biologically-inspired algorithms.

Apart from the large appearance variability in face images, two other major problems are present in face recognition [Li and Jain, 2005]:

• The highly non-linear (and non-convex) manifolds in which the space of faces lies. Linear methods such as Principal Components Analysis (PCA) [Turk and Pentland, 1991], Independent Components Analysis (ICA) [Bartlett et al., 2002] and Linear Discriminant Analysis (LDA) [Belhumeur et al., 1997, Lu et al., 2003] project the data linearly from the original image space to a low-dimensional subspace. Therefore, they are not able to preserve the non-convex variations of the face manifold necessary to accurately distinguish between different subjects.

• The high dimensionality of the data and the small sample size. Another challenge is the difficulty to generalize: even a face of only 64×64 pixels resides in a 64²-dimensional (i.e. 4096-dimensional) feature space. However, the number of images per person available for learning the manifold is usually much smaller (typically 10 or fewer, sometimes only one) than the dimensionality of the image space and hence, it is very difficult to generalize accurately to unseen instances of a given face.

In order to deal with the above-mentioned challenges, there are two possible strategies: the first is to construct a “good” feature space in which the manifolds become simpler (more linear and more convex). This scheme comprises two levels of processing: (1) normalize images geometrically and photometrically and (2) extract features that are stable with respect to these variations (such as those based on Gabor filters). The second strategy is to use classification structures that are able to deal with non-linearities and to generalize properly. To obtain high performance, an algorithm may need to combine both strategies.

1.3 Overview of Face Recognition Methods

The earliest methods [Goldstein et al., 1971, Kanade, 1973, Brunelli and Poggio, 1993] aimed to detect a set of facial landmarks (eyes, nose, chin, etc.) and use geometric features such as areas, distances and angles between these landmarks as face descriptors for recognition. However, the facial feature detection and measurement techniques developed so far are not accurate enough for reliable geometric-based face recognition [Cox et al., 1996]. Since these methods do not take the appearance of the face into account, a lot of information regarding identity is discarded, and therefore their effectiveness is very limited.

Appearance-based approaches, such as PCA [Turk and Pentland, 1991] and LDA [Belhumeur et al., 1997], significantly advanced face recognition technology. These methods operate by linearly projecting an image-based representation of the face (an array of pixel intensities) onto a low-dimensional subspace derived from training images, where recognition is performed. The main difference between PCA and LDA is that the former aims to “optimally” represent the face object, whilst LDA constructs a discriminant subspace to “optimally” distinguish between faces of different people; accordingly, LDA has usually proven better for recognition [Belhumeur et al., 1997]. However, in scenarios where very few data are available, PCA has been reported to outperform LDA [Martinez and Kak, 2001].
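
As a minimal sketch of this linear-projection idea (not the exact eigenfaces procedure of [Turk and Pentland, 1991]; the training faces here are synthetic random vectors), the PCA subspace can be obtained from an SVD of the centered training matrix:

```python
import numpy as np

def pca_subspace(faces: np.ndarray, n_components: int):
    """faces: (n_samples, n_pixels) matrix of vectorized training images.
    Returns the mean face and the top principal directions ("eigenfaces")."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data; the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(face: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Linear projection of a vectorized face onto the low-dimensional subspace."""
    return basis @ (face - mean)

# A 64x64 face lives in a 4096-dimensional space; with only 10 training
# faces, the useful subspace has at most 9 dimensions.
rng = np.random.default_rng(0)
train = rng.random((10, 64 * 64))
mean, basis = pca_subspace(train, n_components=5)
coeffs = project(train[0], mean, basis)   # 4096-D -> 5-D
```

Recognition then amounts to comparing these low-dimensional coefficient vectors, e.g. with a nearest-neighbor rule.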

Based on PCA, several extensions have been devised, such as the work of [Pentland et al., 1994], in which eigenspaces were constructed for several facial features and to cope with different poses, and that of [Moghaddam and Pentland, 1997], where, instead of the classical Euclidean distance, a probabilistic similarity measure was used to compare faces. Other linear methods such as Independent Components Analysis (ICA) have also been proposed for recognition, with better results than PCA [Bartlett et al., 2002]. Another approach that uses linear subspaces for recognition is the one proposed in [Moghaddam et al., 2000] by means of a Bayesian formulation.

As already introduced, one of the main challenges in face recognition is the characteristic non-linearity of the face space. To soften this problem, the linear methods introduced above have been extended using kernel techniques, leading to Kernel PCA (KPCA), Kernel Discriminant Analysis (KDA), and Kernel ICA [Yang et al., 2005a, Zheng, 2006, Zhao et al., 2007, Yang et al., 2005b].

Another way to handle non-linearity is to construct a local appearance space using appropriate filters, so that faces are less affected by (typical) factors such as pose and expression. The Dynamic Link Architecture (DLA) [Lades et al., 1993] pioneered the use of Elastic Graph Matching for locating a set of nodes in the face. In the training phase, a rectangular model graph attached with Gabor features was built for every user in the gallery while, in the test phase, the graph matching procedure was required for each comparison between images. Based on DLA, [Wiskott et al., 1997] proposed the Elastic Bunch Graph Matching (EBGM) approach, in which a more appropriate graph structure to represent faces was employed. Compared to DLA, EBGM used an object-adapted graph (i.e. a face-like graph) whose nodes refer to specific facial landmarks (eyes, tip of the nose, mouth...). Moreover, in order to reduce the computational burden of matching with each individual model graph (as is done in DLA), they proposed the use of the so-called Face Bunch Graph (FBG). For recognition, both methods computed multi-scale and multi-orientation Gabor responses (jets) at each node of the face graph, and used normalized dot products between corresponding jets to output a measure of similarity between two faces. Since then, a large number of modifications, improvements and extensions have been proposed [Duc et al., 1999, Kotropoulos et al., 2000, Liao and Li, 2000, Jiao et al., 2002, Mu et al., 2003, Kela et al., 2006, Shin et al., 2007]. Other local feature approaches that have been successfully applied to face recognition are those based on Local Feature Analysis (LFA) [Penev and Attick, 1996] and Local Binary Patterns (LBP) [Zhang et al., 2004, Heusch et al., 2005, Ahonen et al., 2006].

In Section 1.2, we saw that a possible strategy for softening the non-linearity and the generalization problem was to use adequate classification engines. To this aim, several pattern recognition tools have been successfully employed for face recognition: Neural Networks in [Sato et al., 1998, Er et al., 2002b], Support Vector Machines in [Heisele et al., 2001, Tefas et al., 2001, Czyk et al., 2004], and Adaboost in [Yang et al., 2004, Zhang et al., 2004, Shen et al., 2005, Shan et al., 2005]. The works based on boosting lead to a framework for learning both effective features and classifiers. Other tools such as Hidden Markov Models have also been used in the context of face recognition [Eickeler et al., 1999, Nefian and Hayes, 1998], but with limited success up to now.

A lot of work has been focused on obtaining systems that are invariant (or at least robust) to the three major appearance variability factors: pose, illumination and expression. The first one will be the subject of Chapter 2, where the related work will be covered in greater depth; some brief notes are given in the following. Pose-invariant face recognition methods have been proposed using both 2D [Beymer, 1994, Beymer and Poggio, 1995, Pentland et al., 1994, Maurer and Malsburg, 1996, Gross et al., 2004, Chai et al., 2006, Kanade and Yamada, 2003] and 3D [Chai et al., 2005, Romdhani et al., 2002, Blanz et al., 2005, Zhang and Samaras, 2006, Lee and Surendra, 2003] information. Some of them aim to use robust features/models/classifiers operating directly on the original faces [Maurer and Malsburg, 1996, Kanade and Yamada, 2003], while others pre-process the images to obtain synthetic faces in pre-defined poses prior to performing recognition [Beymer and Poggio, 1995, Blanz et al., 2005, Chai et al., 2006]. 3D systems usually achieve better performance than 2D approaches, but they have the drawback of requiring more computation, both in the training and recognition stages.

In addition to pose, illumination is the next most significant factor affecting the appearance of faces. It has been shown experimentally [Adini et al., 1997] and theoretically for systems based on PCA [Zhao and Chellappa, 1999] that differences in appearance due to illumination are larger than differences between subjects. Early work in illumination-invariant face recognition aimed to obtain image representations that are mostly insensitive to lighting changes [Adini et al., 1997, Jacobs et al., 1998, Shashua and Riklin-Raviv, 2001]. A different approach to the illumination problem is based on the observation that the images of a Lambertian surface (with fixed pose but varying lighting) lie in a 3D linear subspace of the image space [Shashua, 1992]. [Belhumeur and Kriegman, 1998] showed that the set of images of an object in fixed pose and varying illumination forms a convex cone in the space of faces, and [Georghiades et al., 1998] approximated these cones for human faces using low-dimensional linear subspaces. [Basri and Jacobs, 2003] showed that the illumination cone of a convex Lambertian surface can be approximated by a 9-dimensional linear subspace, with good recognition results across illumination in limited experiments. As seen, pose and illumination have been tackled separately in a large number of papers; in addition, several methods have studied face recognition across pose and illumination simultaneously [Georghiades et al., 2001, Blanz and Vetter, 2003].

In contrast to pose and illumination, expression-invariant face recognition has received less attention. Some interesting approaches are briefly described in the following. In order to give less weight to those regions that are more affected by expression changes, [Martínez, 2003] used the optical flow between the two images to be compared. [Liu et al., 2003] proposed to use facial asymmetry as a biometric, concluding that it is quite robust to expression variations. [Bronstein et al., 2005] computed expression-invariant signatures based on an isometry-invariant representation of the facial surface.


[Ramachandran et al., 2005] proposed to use pre-processing steps to convert a smiling image into a neutral one.

Face detection is the first step towards an automatic face recognition system, and its accuracy greatly influences the performance and usability of the whole chain. Given a still image or a video sequence, an ideal face detector should be able to detect all faces present, independently of position, scale, pose, expression, orientation, illumination, etc. Face detection has been performed based on several cues such as skin color, facial shape, facial appearance or a combination of them. The most successful algorithms developed so far are appearance-based, without using any other cues: [Rowley et al., 1998] via neural networks, [Schneiderman and Kanade, 2004] through wavelet representations and non-linear classifiers, [Osuna et al., 1997] using Support Vector Machines, [Yang et al., 1999] by means of the Sparse Network of Winnows (SNoW) architecture, and [Viola and Jones, 2001, Viola and Jones, 2002] using Haar-like filters and Adaboost. Without a doubt, Adaboost learning-based face detection, pioneered by the work of [Viola and Jones, 2001], has been the most effective, leading to a plethora of variations and extensions [Lienhart and Maydt, 2002, Li and Zhang, 2004, Sochman and Matas, 2004, Ichikawa et al., 2006].

In addition to face detection, which provides a coarse estimate of the position and scale of each detected face, face alignment aims to achieve a more accurate localization, thus allowing faces to be normalized geometrically. Different approaches such as Active Shape Models [Cootes et al., 1995] (with extensions [Zhang et al., 2005, Sukno et al., 2007, Cristinacce and Cootes, 2007]), Active Appearance Models [Cootes et al., 2001] (with extensions [Cootes et al., 2000, Cootes and Taylor, 2001, Cristinacce and Cootes, 2006]) and elastic graph matching methods [Wiskott et al., 1997] have been proposed in the literature. In these algorithms, a set of facial features such as the nose, eyes, mouth and face outline are located, and these positions are used for geometrical normalization in order to get rid of in-plane rotation, scale, etc., and even of out-of-plane rotations. 3D models have also been used for face alignment, e.g. 3D Morphable Models [Blanz and Vetter, 1999], with great success in face recognition [Romdhani et al., 2002, Blanz and Vetter, 2003, Blanz et al., 2005].

Finally, face recognition from video [Zhou et al., 2003, Zhou et al., 2004, Liu and Chen, 2003] has attracted increasing attention because of the huge amount of data (i.e. frames) available for processing, and the expectation that the use of video, instead of still images, will decrease error rates significantly (and will help to prevent fake attempts). In video, the face must be tracked throughout the sequence, and several approaches have been successfully designed for this task. For instance, the approaches employed for face alignment can also be used for face tracking (when the face must be aligned in a given frame, the parameters (positions of the features, etc.) resulting from the previous frame are used as initial estimates). [Baker and Matthews, 2001] proposed an extension to Active Appearance Models, namely the Inverse Compositional Image Alignment (ICIA), which is able to obtain more accurate results at a drastically higher speed.

1.4 Face Recognition Evaluations, Databases, Benchmarks and Competitions

Since the beginning of the 1990s, several projects have sponsored and promoted research on automatic face recognition. The Face Recognition Technology (FERET) program ran from 1993 to 1998, supported by the DoD Counterdrug Technology Development Program Office, and had three main objectives: i) support the development of facial biometrics algorithms, ii) collect and distribute the FERET database, containing 14126 images from 1199 subjects, and iii) organize competitions (in 1994, 1995 and 1996) to compare the abilities of face recognition systems using the FERET database and a specific biometric evaluation methodology [Phillips et al., 2000b, Phillips et al., 2000a].

At the conclusion of the FERET program, facial biometrics were typically found in prototype systems of universities and research laboratories. In 2000, when several commercial systems were available on the market, the National Institute of Standards and Technology (NIST) launched the first Face Recognition Vendor Test (FRVT). The objectives of FRVT 2000 were to compare the performance of the competing systems and examine their usability in an access control scenario.

The second FRVT evaluation took place in 2002. FRVT 2002 was designed to measure technical progress since 2000, to evaluate performance on real-life large-scale databases, and to introduce new experiments to help understand face recognition performance better. FRVT 2002 consisted of two tests requiring the systems to be fully automatic: the High Computational Intensity (HCInt) test and the Medium Computational Intensity (MCInt) test. The first one (HCInt) was designed to test state-of-the-art systems on extremely challenging real-world (full-face still frontal) images. This test required performing 15 thousand million matches in 242 hours, measuring the performance of face recognition algorithms on large databases, as well as examining the effect of database size on overall performance. The Medium Computational Intensity (MCInt) test consisted of two separate parts: still images and video sequences. The still part of the MCInt was similar to the FERET and FRVT 2000 evaluations, measuring the effect of different factors (such as time between images, changes in illumination and variations in pose) on the performance of face recognition systems. The video portion was designed to provide an initial evaluation of whether or not video helps to increase face recognition performance.

The Face Recognition Grand Challenge (FRGC) [FRGC, 2004, Phillips et al., 2005] aimed to support the development of face recognition algorithms from high-resolution still and 3D imagery, as well as pre-processing techniques to cope with lighting and pose variations. The primary goal of the FRGC was to decrease the error rates of facial biometrics by an order of magnitude over what was observed in FRVT 2002. The FRGC consisted of six experiments, which measured performance on still images taken with controlled lighting and background, uncontrolled lighting and background, 3D imagery, multi-still imagery, and between 3D and still images.

The third and last FRVT evaluation was launched in 2006, aiming to improve the performance achieved in 2002 by an order of magnitude. Different experiments were devised for this evaluation, including still face recognition with varying image quality, recognition under unconstrained illumination, 3D face recognition, and a comparison between automatic systems and humans. The FRVT 2006 achieved its primary goal, since both still and 3D face recognition algorithms obtained a decrease in error rates of at least an order of magnitude over what was observed in the FRVT 2002. The FRVT 2006 documented significant progress since January 2005 in face recognition when faces are matched across different lighting conditions. In fact, in the FRVT 2006 five submissions performed better than the best results of the January 2005 FRGC [Phillips et al., 2006]. The observed increase occurred despite the FRGC being an open challenge problem, with the identities of faces known to the FRGC participants, and the FRVT 2006 being a sequestered evaluation.

For the first time in a biometric evaluation, the FRVT 2006 directly compared human and machine face recognition performance. The results show that, at the low false alarm rates achieved by humans, seven automatic face recognition algorithms were comparable to or better than humans at recognizing faces taken under different lighting conditions. Furthermore, three of the seven algorithms were comparable to or better than humans over the full range of false alarm rates measured.

Collected in the framework of European projects, the XM2VTS [Messer et al., 1999] and BANCA [Bailly-Bailliere et al., 2003] databases have been extensively used for assessing the performance of face recognition systems. Specific protocols for authentication scenarios have been designed, and several public competitions have been carried out. More precisely, face contests on XM2VTS took place in 2000 [Matas et al., 2000], 2003 [Messer et al., 2003] and 2006 [Messer et al., 2006], while two different competitions on the BANCA database were held in 2004 [Messer et al., 2004a, Messer et al., 2004b]. XM2VTS is a database containing 2360 (mainly) frontal face images belonging to 295 different subjects, captured under controlled conditions. Two different authentication protocols for this database (namely Configurations I and II) were designed in [Luttin and Maître, 1998] and used in the aforementioned face contests. The BANCA database contains images of 208 subjects (half men, half women) captured under three different conditions (controlled, degraded and adverse). For this database, seven protocols with different levels of difficulty have been developed. To the best of our knowledge, only the English part of the BANCA database has been distributed and used for testing. Both the XM2VTS and BANCA databases have been used for testing systems developed during this PhD Thesis and hence, a more detailed description of both databases is presented in Appendix A.

In addition to those already mentioned, there exist more than 30 face databases that have been commonly used by researchers all over the world (see [Gross, 2005] for a survey): among others, the Yale and Yale B databases [Yale, 1997, Georghiades et al., 2001], the AT&T (formerly ORL) database [AT&T, 1992], the Chinese CAS-PEAL database [CAS-PEAL, 2002, Gao et al., 2008], the BIOID database [BioID, 1998], the AR face database [Martínez and Benavente, 1998] and the CMU PIE database [Sim et al., 2003]. The last two databases have also been used during this PhD Thesis for testing the robustness of the developed systems against illumination, expression and pose variations.

The AR face database contains images from 126 subjects captured under different illumination conditions (ambient, left, right, over-exposed), expressions (neutral, smiling, angry, screaming) and occlusions (scarves, sunglasses). On the other hand, the CMU-PIE database contains face images of 68 subjects recorded under 13 different viewpoints, 43 different illumination conditions, and with 4 different expressions. A deeper description of both databases is provided in Appendix A.

1.4.1 Definitions

Regarding the specific scenario of an automatic face recognition application, we can distinguish two different modalities:

1. Authentication (also verification) scenarios, where the user provides a biometric sample X (a still image of the face, for instance), along with a claimed identity ID. The system must then output a decision regarding whether X belongs to ID (whose biometric sample(s) –template or model– must have been previously stored in the database through an enrollment process) or not. Hence, a one-to-one matching is performed, i.e. only one comparison is needed. The result of this matching, usually given by a similarity score s, is compared with a preset threshold τ, so that the claim is accepted if s > τ. Otherwise, the claim is rejected. In authentication scenarios, the most common measures used to assess the performance of a given algorithm are the False Acceptance Rate (FAR) and the False Rejection Rate (FRR). FAR is defined as the number of impostor claims that are incorrectly accepted by the system, divided by the total number of impostor attempts, while FRR is defined as the number of true client claims that are incorrectly rejected by the system, divided by the total number of true attempts. The so-called Total Error Rate (TER) is defined as TER=FAR+FRR.

2. Identification applications, where the user only provides a biometric sample X, which is compared to every stored template in the database, hence performing a one-to-N matching. Two different scenarios exist in identification applications: if every subject entering the system is also present in the stored database, we refer to this as a closed-universe model. On the other hand, when the subject accessing the system can be either a true client or an impostor, we refer to it as an open-universe model (called watch-list in the last Face Recognition Vendor Test 2006). When this model is considered, decision thresholds are also needed in order to discard false claims. In closed-universe identification scenarios, the so-called recognition rate (also identification rate) is commonly used to evaluate the performance of a given technique. This measure represents the percentage of people that are correctly identified by the system∗. When open-universe models are considered, a combination of identification and verification measures should be used to evaluate the performance of a given technique.
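
The verification and identification measures defined above can be made concrete with a short sketch; the similarity scores below are synthetic and chosen only for illustration:

```python
import numpy as np

def error_rates(genuine: np.ndarray, impostor: np.ndarray, tau: float):
    """FAR: fraction of impostor claims accepted (score s > tau).
    FRR: fraction of genuine claims rejected (s <= tau). TER = FAR + FRR."""
    far = float(np.mean(impostor > tau))
    frr = float(np.mean(genuine <= tau))
    return far, frr, far + frr

def cumulative_match(scores: np.ndarray, true_ids: np.ndarray, r: int) -> float:
    """Closed-universe identification: scores is (n_probes, n_gallery).
    The rank-r cumulative match score is the fraction of probes whose true
    identity is among the r best-scoring gallery entries (r = 1 gives the
    recognition rate)."""
    order = np.argsort(-scores, axis=1)  # best-scoring gallery entries first
    hits = [true_ids[i] in order[i, :r] for i in range(len(true_ids))]
    return float(np.mean(hits))

# Synthetic similarity scores: genuine claims tend to score higher.
genuine = np.array([0.9, 0.8, 0.75, 0.6, 0.95])
impostor = np.array([0.2, 0.4, 0.55, 0.3, 0.65])
far, frr, ter = error_rates(genuine, impostor, tau=0.5)  # FAR=0.4, FRR=0.0
```

Sweeping the threshold τ trades FAR against FRR, which is how the operating point of an authentication system is chosen.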

Both identification (closed-universe model) and authentication experiments have been carried out during this Ph.D. Thesis on different databases such as XM2VTS [Messer et al., 1999], BANCA [Bailly-Bailliere et al., 2003], CMU PIE [Sim et al., 2003] and the AR face database [Martínez and Benavente, 1998].

Finally, we would like to note that throughout this Thesis, we will refer on numerous occasions to local Gabor features, or simply Gabor features. In the literature, these features are commonly referred to as jets, and therefore we will use the terms Gabor features, Gabor jets and jets interchangeably. In our specific case, a jet comprises 40 complex coefficients, which are obtained after convolution of the local image patch with the set of 40 Gabor filters defined in [Wiskott et al., 1997].
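
As a sketch of how two such jets can be compared with the magnitude-based normalized dot product of [Wiskott et al., 1997] (with synthetic coefficients standing in for actual responses of the 40-filter bank):

```python
import numpy as np

def jet_similarity(jet_a: np.ndarray, jet_b: np.ndarray) -> float:
    """Normalized dot product between the magnitudes of two Gabor jets.
    Each jet holds 40 complex coefficients (5 scales x 8 orientations);
    using magnitudes discards the phase, which varies rapidly with position."""
    mag_a, mag_b = np.abs(jet_a), np.abs(jet_b)
    return float(mag_a @ mag_b / (np.linalg.norm(mag_a) * np.linalg.norm(mag_b)))

# Synthetic jets standing in for filter responses at two facial landmarks.
rng = np.random.default_rng(0)
jet1 = rng.standard_normal(40) + 1j * rng.standard_normal(40)
jet2 = jet1 * np.exp(1j * 0.3)   # same magnitudes, globally shifted phase

similarity = jet_similarity(jet1, jet2)   # magnitude-based: exactly 1.0 here
```

Since jet magnitudes are non-negative, this similarity lies in [0, 1]; a face-level score is typically the average of the node-wise similarities over the graph.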

1.5 Motivations and Objectives

For some years now, the Signal Theory and Communications Department at the University of Vigo, as a partner of the Biosecure Network of Excellence [BioSecure, 2004], has been interested in the development of tools for distributed secure authentication in a web environment (both the platform in which the biometric algorithms should be embedded [Otero-Muras et al., 2007], and the algorithms themselves). Among the different biometric modalities, face and voice-based techniques seem an adequate choice for the specific scenario (web environment) in which we are involved. In fact, the only extra hardware requirements are a webcam and a microphone, which nowadays are inexpensive and very common. Starting from a baseline algorithm based on Gabor filters, we have tackled completely different problems throughout the complex recognition process, suggesting novel ideas to achieve a robust biometric system. The final goal would be to integrate the designed system within the Java-based platform described in [Otero-Muras et al., 2007].

∗In such scenarios, the rank-r cumulative match score is also used as a performance measure, determining the percentage of correct identifications within the first r ranked candidates.

The main motivations and objectives that guided this Ph.D. Thesis are described in more detail in the following.

As stated above, automatic face recognition has improved significantly since the seminal works of the 70's [Goldstein et al., 1971, Kanade, 1973], but the general problem still remains unsolved. There are several reasons that cause face recognition systems to fail, such as rigid (pose) head changes, expression variations, lighting conditions, occlusions, etc. Regarding the problem of pose variations, several approaches have been proposed using 2D [Beymer, 1994, Beymer and Poggio, 1995, Pentland et al., 1994, Maurer and Malsburg, 1996, Chai et al., 2006, Kanade and Yamada, 2003] and 3D [Chai et al., 2005, Romdhani et al., 2002, Blanz et al., 2005, Zhang and Samaras, 2006, Lee and Surendra, 2003] information. 3D systems usually achieve better performance than 2D approaches, but they have the drawback of requiring more computation, both in the training and recognition stages. Hence, the first objective of this Ph.D. Thesis is to develop algorithms that are able to soften the problems derived from pose variations in a 2D face recognition framework, aiming to achieve performance comparable to 3D techniques, but with a lower computational burden.

On the other hand, even in the absence of such adverse conditions (pose, ex-pression, lighting, etc.), there exists the need of selecting features that are able todiscriminate between subjects in a reliable way. Regarding the kind of facial featuresthat are extracted for further processing, face recognition systems can be divided intwo categories: Holistic (or Global-feature) methods [Turk and Pentland, 1991, Bel-humeur et al., 1997, Lu et al., 2003] and Local-feature approaches [Wiskott et al.,1997, Penev and Attick, 1996]. It has been shown that the use of local features usuallyleads to better performance, due to the fact that they are more robust to occlusions,pose changes, expression variations etc. Among the huge set of particular featuresthat have been used (DCT [Kohir and Desai, 1998], Local Binary Patterns [Zhanget al., 2004, Heusch et al., 2005, Ahonen et al., 2006], SIFT [Lowe, 2004, Bicego et al.,2006], etc.), Gabor filters have received great attention both for biological reasons andbecause of the optimal resolution in both frequency and spatial domains [Daugman,1980, Daugman, 1985, Daugman, 1988]. Recently, it has been published a surveyon the use of Gabor filters for face recognition [Shen and Bai, 2006a], revealing thehuge number of papers that had adopted such features for face processing. However,these features have been traditionally extracted either from the nodes of a (possiblydeformed) rectangular grid [Duc et al., 1999] or at “fiducial” points forming face-likegraphs [Wiskott et al., 1997], i.e. from pre-defined locations in the face image. But,why should these features be extracted at pre-defined, universal positions? Given thatrecognition aims to discriminate between subjects, we suggest to extract features frompositions or regions that are somehow subject-specific. Hence, the second objective of



this Ph.D. Thesis is to propose a novel face recognition approach based on extracting local Gabor responses from regions that are inherently subject-dependent, by exploiting individual face shape. Continuing with Gabor-based recognition systems, most approaches have used normalized dot products to compare corresponding features, but this choice is supported neither by a theoretical basis nor by an experimental evaluation. To the best of our knowledge, the only evaluation of distances for Gabor feature comparison was performed in [Jiao et al., 2002], where the authors concluded that the Manhattan (or city block) distance outperformed both the cosine and Euclidean measures. However, it is not explicitly described, neither in [Jiao et al., 2002] nor in other research papers dealing with Gabor-based face recognition systems, whether the features have been previously normalized or not. This motivated us to propose a more extensive evaluation, comparing different distances for measuring similarities between Gabor responses, and assessing the impact of the concrete normalization method that is applied to features before comparison.
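The three distances under study can be illustrated with a minimal sketch; here a jet is assumed to be a plain vector of Gabor magnitude responses, and the normalization options (none, L1, L2) are our naming for illustration, not the exact variants evaluated in the Thesis:

```python
import numpy as np

def jet_distances(j1, j2, normalize=None):
    """Compare two Gabor jets with cosine, Euclidean and Manhattan distances.

    normalize: None, "l1" or "l2" pre-normalization of each jet, mimicking
    the question of whether features are normalized before comparison.
    """
    j1, j2 = np.asarray(j1, float), np.asarray(j2, float)
    if normalize == "l2":
        j1, j2 = j1 / np.linalg.norm(j1), j2 / np.linalg.norm(j2)
    elif normalize == "l1":
        j1, j2 = j1 / np.abs(j1).sum(), j2 / np.abs(j2).sum()
    return {
        # cosine distance = 1 - normalized dot product
        "cosine": 1.0 - j1 @ j2 / (np.linalg.norm(j1) * np.linalg.norm(j2)),
        "euclidean": np.linalg.norm(j1 - j2),
        "manhattan": np.abs(j1 - j2).sum(),
    }
```

Note that for jets differing only by a global gain (e.g. a contrast change), the cosine distance is already zero, while the Euclidean and Manhattan distances only agree with it after normalization, which is precisely why the normalization step matters in such an evaluation.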

Another interesting topic within the face recognition community is the selection and fusion of features for achieving robustness and improving system performance. To this aim, different tools such as Support Vector Machines [Heisele et al., 2001, Tefas et al., 2001, Czyk et al., 2004], MultiLayer Perceptrons [Sato et al., 1998, Er et al., 2002b], Adaboost [Yang et al., 2004, Zhang et al., 2004, Shen et al., 2005, Shan et al., 2005], etc. have been extensively used in the literature. The third objective of this Thesis is to empirically validate different fusion methods for combining local similarities in the context of Gabor-based face recognition.

Despite the large number of papers dealing with Gabor-based recognition systems, no statistical model has been proposed or used for Gabor feature coefficients. On the other hand, wavelet coefficients have been successfully modeled in other applications, such as texture characterization and retrieval [Van de Wouver et al., 1999, Do and Vetterli, 2002] or noise modeling [Hernandez et al., 2000]. The fourth objective of this Ph.D. Thesis is to propose an accurate statistical model of Gabor coefficients for face recognition. In [Shen and Bai, 2006a], it was stated that one of the drawbacks of Gabor-based recognition systems is the huge amount of data that must be stored to represent a face. Among other benefits, the underlying statistics would allow us to reduce the amount of storage required for face representation (i.e. data compression). Moreover, this finding would open new possibilities in terms of providing theoretical evidence for the construction of optimal distance functions to compare Gabor features.
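As a sketch of what fitting such a model involves, the following moment-matching estimator for a Generalized Gaussian density p(x) ∝ exp(−|x/α|^β) (one of the two priors later compared in Chapter 5) uses the classical ratio E[x²]/E[|x|]², which decreases monotonically with the shape parameter β. This is a textbook estimator shown for illustration, not necessarily the one used in the Thesis:

```python
import math

def fit_ggd(samples, lo=0.3, hi=10.0, iters=60):
    """Moment-matching fit of a Generalized Gaussian p(x) ~ exp(-|x/alpha|^beta).

    Solves ratio(beta) = E[x^2] / E[|x|]^2 by bisection, where
    ratio(b) = Gamma(1/b) Gamma(3/b) / Gamma(2/b)^2 is monotonically
    decreasing in b (ratio(1) = 2 for a Laplacian, ratio(2) = pi/2 for a Gaussian).
    """
    m1 = sum(abs(x) for x in samples) / len(samples)
    m2 = sum(x * x for x in samples) / len(samples)
    target = m2 / (m1 * m1)
    ratio = lambda b: math.gamma(1 / b) * math.gamma(3 / b) / math.gamma(2 / b) ** 2
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ratio(mid) > target:
            lo = mid            # ratio too large -> beta estimate too small
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    # Scale from the second moment: E[x^2] = alpha^2 Gamma(3/beta) / Gamma(1/beta)
    alpha = math.sqrt(m2 * math.gamma(1 / beta) / math.gamma(3 / beta))
    return beta, alpha
```

For Gaussian data the estimated shape should be close to β = 2, and for Laplacian-like (heavy-tailed) coefficients closer to β = 1.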

Finally, in recent years, face recognition from video [Zhou et al., 2003, Zhou et al., 2004, Liu and Chen, 2003] has attracted increasing attention because of the huge amount of data (i.e. frames) available for processing, and the confidence that the use of video, instead of still images, will decrease error rates significantly (and will help to protect against spoofing attempts). Before developing any video-based face processing system, an algorithm is needed to track the face throughout the sequence. Hence, the last objective of this Ph.D. Thesis is to develop



a face tracker using a variant of the well-known Lucas-Kanade algorithm [Lucas and Kanade, 1981].

1.5.1 Contributions

The contributions made to the field of face processing and recognition during the last four years of research leading to this Ph.D. Thesis can be classified as follows (see Figure 1.1 for a scheme of a general face recognition system, highlighting the stages at which we made contributions):

Figure 1.1: Scheme of a general face recognition system, highlighting the contributions we made.

1. Feature Extraction: Development of a novel method for local feature extraction [Gonzalez-Jimenez and Alba-Castro, 2005] (finalist for the IEEE ICIP 2005 best student paper award), [Gonzalez-Jimenez and Alba-Castro, 2007a], based on Gabor filters. This face verification module has been implemented as a Biometric Service Provider (BSP) following the BioAPI standard [BioAPI, 1998]. In this way, it has been successfully integrated within an open-source Java framework intended to provide single sign-on web authentication based on any BioAPI-compliant biometric software or device† [Otero-Muras et al., 2007].

This face verification system ranked first at the Biosecure Residential Workshop held in Paris in August 2005‡. Moreover, face similarity scores using

† Source code available at https://sourceforge.net/projects/biowebauth
‡ Obtained results in http://www.cilab.upf.edu/biosecure1/public docs/06 Highlights-Achievements RW MT.ppt



this system have been shared both with colleagues from the University of Vigo and with the University of Hertfordshire for fusion with speaker recognition scores.

2. Feature Selection: Development of a simple yet quite powerful feature selection technique [Gonzalez-Jimenez et al., 2007a], which has been empirically compared against well-known state-of-the-art tools.

3. Pre-processing (Robustness to Pose Changes): Development of two novel methods for pose-robust face recognition using 2D information and facial symmetry [Gonzalez-Jimenez and Alba-Castro, 2007b], and combination with automatic face segmentation [Gonzalez-Jimenez et al., 2006].

4. Video-based Face Processing: Development of face tracking techniques throughout video sequences with application to speech-and-lip asynchrony detection [Argones Rua et al., 2008], and pose-robust face recognition from video [Alba-Castro et al., 2008]. In collaboration with Enrique Argones-Rua, two face recognition approaches were prepared for the Biosecure Multimodal Evaluation Campaign (BMEC) competition [Mayoue et al., 2007], obtaining the first two positions in the category of face recognition from video.

5. Distance Comparison: Empirical evaluation of distance measures for Gabor feature-based face recognition systems [Gonzalez-Jimenez et al., 2007b].

6. Statistical Modeling: Development of a novel statistical model of Gabor feature coefficients for face recognition [Gonzalez-Jimenez et al., 2007c], with application to data compression.

7. Ground truth data: We manually annotated 62 facial landmarks (see Figure 2.1 for the landmark configuration) on 340 images from the CMU PIE database [Sim et al., 2003]. These annotations are useful both for assessing the performance of an automatic face alignment method and for the construction of a Point Distribution Model (PDM; see Appendix A) rich in pose variations. Moreover, the first frame of 52 videos from the BANCA database [Bailly-Bailliere et al., 2003] has also been annotated for face tracking initialization.

1.6 Outline

The rest of this Thesis is organized as follows:

• Chapter 2 introduces the problem of pose in face recognition, proposing two different approaches to tackle this kind of variation. The devised methods



are compared to other successful techniques found in the literature using the CMU PIE database [Sim et al., 2003]. Experiments with fully automatic face alignment on the XM2VTS database [Messer et al., 1999] are also reported. The material presented in this chapter can be found in [Gonzalez-Jimenez and Alba-Castro, 2006, Gonzalez-Jimenez et al., 2006, Gonzalez-Jimenez and Alba-Castro, 2007b]. I would like to acknowledge Federico Sukno for his contribution to this chapter.

• Chapter 3 presents an approach for subject-specific feature extraction based on individual face structure and Gabor filters. Comparisons against well-known methods are reported on different databases. Moreover, an empirical evaluation of different distance measures for Gabor jet comparison is shown. The materials of this chapter have been published in [Gonzalez-Jimenez and Alba-Castro, 2005, Gonzalez-Jimenez and Alba-Castro, 2007a, Gonzalez-Jimenez et al., 2007b].

• Chapter 4 presents an empirical evaluation of different tools for local similarity fusion in the context of Gabor-based face recognition. The materials of this chapter can be found in [Argones Rua et al., 2006, Gonzalez-Jimenez et al., 2007a].

• Chapter 5 deals with modeling marginal distributions of Gabor coefficients extracted from face images. Two statistical priors (Generalized Gaussians and Bessel K Forms) are compared in this specific scenario. Based on the underlying statistics, an application to data compression is presented. Part of the material introduced here has been published in [Gonzalez-Jimenez et al., 2007c].

• Chapter 6 presents recent results on different topics: semi-automatic face tracking with application to pose-robust face recognition and audio-video synchrony detection, modeling of Gabor coefficients’ joint statistics, and a novel HMM based on the Generalized Gaussian distribution. Part of the material presented in this chapter can be found in [Alba-Castro et al., 2008, Argones Rua et al., 2008, Bicego et al., 2008].

• Finally, Chapter 7 presents the main conclusions and future research lines of this Thesis.


Chapter 2

ViewPoint-Robust 2-D Face Recognition

Contents

2.1 Introduction
2.2 A Point Distribution Model For Faces
2.3 Pose Eigenvectors and Pose Parameters
    2.3.1 Theoretical evidence on the fact that symmetric meshes help to decouple left-right rotations and non-rigid factors
    2.3.2 Experiment on a video-sequence: Decoupling of pose and expression
    2.3.3 Experiment on the CMU PIE database
2.4 Virtual Face Synthesis
    2.4.1 Thin Plate Splines Warping
    2.4.2 Synthesizing virtual face images across pose using Thin Plate Splines
2.5 Pose Correction
    2.5.1 Warping to Mean Shape (WMS)
    2.5.2 Normalizing to Frontal Pose and Warping (NFPW)
    2.5.3 Pose Transfer and Warping (PTW): Warping one image to adopt the other one’s pose
    2.5.4 Taking advantage of facial symmetry
2.6 Feature extraction
2.7 Face authentication on the XM2VTS database
    2.7.1 Statistical Analysis of the Results
2.8 Results with automatic fitting via Invariant Optimal Features Active Shape Models (IOF-ASM)
2.9 Face Identification on the CMU PIE database
    2.9.1 Experimental setup
    2.9.2 Our results
    2.9.3 Other researchers’ results
    2.9.4 Testing the system with large pose changes
2.10 Conclusions

2.1 Introduction

This chapter addresses one of the major issues within the general face recognition problem: dealing with pose changes. It is well known that the performance of face recognition systems drops drastically when pose differences are present within the input images, and it has become a major goal to design algorithms that are able to cope with this kind of variation. Up to now, the most successful algorithms are those which make use of prior knowledge of the class of faces. [Pentland et al., 1994] extends the eigenface approach [Turk and Pentland, 1991] to a view-based eigenface method, where an individual eigenspace is constructed for each pose. In [Beymer and Poggio, 1995], the authors extend the earlier attempt presented in [Beymer, 1994] (whose main drawback was that images from different viewpoints were needed for every client): from a single image of a subject, and making use of face class information, virtual views facing different poses are synthesized and used in a view-based recognizer. For the generation of the virtual views, two different techniques were used: linear classes and parallel deformation. In [Maurer and Malsburg, 1996] the authors propose a pose-invariant face recognition approach based on Elastic Bunch Graph Matching [Wiskott et al., 1997]. The transformations of Gabor features (“jets”) are learnt from training faces that are rotated in depth. In [Blanz and Vetter, 1999] the authors propose a 3D Morphable Model, where each face can be represented as a linear combination of 3D face exemplars. Given an input image, the 3D Morphable Model is fitted, recovering shape and texture parameters following an analysis-by-synthesis scheme. Several approaches make use of the 3D Morphable Model to perform recognition. The main drawback of these methods is the high computational complexity needed to recover the image parameters.
[Romdhani et al., 2002] reports high recognition rates on the CMU PIE database [Sim et al., 2003], by means of the 3D Morphable Model and a fitting algorithm that makes use of linear relations to update the shape and texture parameters, which are then employed for recognition purposes. Blanz et al. also use



the 3D Morphable Model in [Blanz et al., 2005] to synthesize frontal faces from non-frontal views, which are then fed into the recognition system. In this same direction, other researchers have tried to generate frontal faces from non-frontal views, such as the works proposed in [Chai et al., 2006], via linear regression in each of the regions into which the face is divided, and in [Chai et al., 2005], where a 3D model is used. In [Zhang and Samaras, 2006] the authors combine the strengths of Morphable Models to capture the variability of 3D face shape and a spherical harmonic representation for the illumination. A 3D model is also used in [Lee and Surendra, 2003] to synthesize faces at different poses. [Kanade and Yamada, 2003] proposes a completely different method, where the problem of pose variations is addressed via a probabilistic approach that takes into account the pose difference between probe and gallery images, learning how facial features change as the pose changes. [Liu and Chen, 2005] approximates a human head with a 3D ellipsoid model, and both training and test images are back-projected onto the surface of the ellipsoid, forming texture maps which are used for comparison. Moreover, this texture map is represented as an array of local patches, and a probabilistic model is trained to compare corresponding patches. In [Gross et al., 2004] it is proposed to estimate the eigen light-fields of the subject’s head, using them for recognition across pose and illumination changes, with tests on the CMU PIE database.

Using a dataset containing sparse face meshes (62 points per image), we built a Point Distribution Model and, from the main modes of variation, identified the parameters responsible for controlling the apparent changes in shape due to turning and nodding the head (the so-called pose parameters), similar to the research in [Lanitis et al., 1997], where the pose of the face was estimated using those parameters. Based on them, we propose two novel approaches for pose correction:

1. A method in which pose parameters from both images are set to typical valuesof frontal faces.

2. A method in which one image adopts the pose parameters of the other one.
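In terms of the shape parameter vector b of Eq. (2.3), both corrections amount to overwriting the pose components of b before reconstructing the mesh. A minimal sketch, where the pose parameter indices and the frontal reference values are illustrative placeholders rather than the Thesis's actual settings:

```python
import numpy as np

# Placeholders: in the Thesis the pose parameters were identified by visual
# inspection (b1: up-down rotation, b2: left-right rotation).
POSE_IDX = [0, 1]          # positions of the pose parameters inside b (assumed)
B_FRONTAL = [0.0, 0.0]     # typical values for a frontal face (assumed)

def normalize_to_frontal(b):
    """Method 1: set the pose parameters of both images to frontal values."""
    b = b.copy()
    b[POSE_IDX] = B_FRONTAL
    return b

def transfer_pose(b_src, b_dst):
    """Method 2: one image adopts the pose parameters of the other one."""
    b = b_src.copy()
    b[POSE_IDX] = b_dst[POSE_IDX]
    return b
```

Either corrected parameter vector is then mapped back to a mesh via X = mean shape + P b, and a virtual image is synthesized by warping the texture onto that mesh.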

Both methods require the synthesis of virtual images, which is accomplished through Thin Plate Splines-based warping [Bookstein, 1989]. The use of texture mapping for the generation of virtual views is close in spirit to the parallel deformation of [Beymer and Poggio, 1995], and it has the advantage over linear classes that it preserves subject-specific texture peculiarities, since texture is sampled from the subject’s real face image. However, the goal of parallel deformation is to map a facial transformation observed between two images of a prototype subject onto a novel subject’s face. The problem with this approach arises when the shape of the prototype subject differs significantly from the novel subject’s shape, as the virtual view will appear geometrically distorted. We minimize this effect by modifying only pose parameters rather than the whole shape. Also, there exist similarities between our first method and the works of [Blanz



et al., 2005], [Chai et al., 2006], and [Chai et al., 2005], as all of them try to generate frontal images. Unlike their approaches (among other differences), we will take facial symmetry into account in order to overcome problems due to self-occlusion, leading to important improvements in system performance.

Holistic feature-based face recognition methods such as eigenfaces [Turk and Pentland, 1991] need all images to be embedded into a constant reference frame (an average shape, for instance) in order to represent a face as a vector of ordered pixels. [Lanitis et al., 1997] also deformed each face image to the mean shape using 14 landmarks, extracted shape and appearance parameters, and classified using the Mahalanobis distance. However, the virtual images we obtain do not comply with the constant reference frame requirement and hence, local features must be employed for recognition. To this aim, we compute local Gabor responses on the synthesized face. [Maurer and Malsburg, 1996] also used Gabor features in a pose-invariant framework but, in their case, the correction was applied to the Gabor features extracted from the original non-frontal image.

[Lanitis et al., 1997] showed that a linear model is enough to simulate large changes in viewpoint, as long as all the landmarks remain visible. [Cootes et al., 2000] stated that a model trained on near fronto-parallel images can cope with pose variations of up to ±45°. However, for larger angle displacements, some facial features (landmarks) become occluded and the assumptions of the model break down. In order to deal with such large rotations, [Cootes et al., 2000] uses a set of models to represent shape and appearance from different viewpoints. Other approaches tackling this problem have either used a full 3D model [Blanz and Vetter, 1999] or included non-linearities in the 2D model [Romdhani et al., 1999]. Based on the previous statement (“a linear model is enough to simulate large changes in viewpoint, as long as all the landmarks remain visible”), and under large rotation angles, we decided to use the restricted subset of visible landmarks for virtual face synthesis, empirically demonstrating the validity of our approach with realistic face images and identification experiments.

The chapter is organized as follows. The next section briefly reviews Point Distribution Models, and Section 2.3 introduces the concept of Pose Eigenvectors and Pose Parameters. Section 2.4 describes the technique used to synthesize pose-corrected images, Thin Plate Splines-based warping, with examples of virtual images across pose. In Section 2.5 we explain different ways to cope with pose variations, whilst Section 2.6 describes feature extraction on corrected images through Gabor filtering. Sections 2.7, 2.8 and 2.9 show experimental results on two face databases:

• Authentication results on the XM2VTS database [Messer et al., 1999] confirm the advantages of normalizing only pose parameters rather than warping onto a mean shape (Section 2.7).

• In addition to these experiments, and aiming to assess whether a degradation in performance occurs when considering a fully automated system, pose correction



with automatic fitting (via IOF-ASM [Sukno et al., 2007]) has also been tested on the XM2VTS (Section 2.8).

• Identification experiments on the CMU PIE database [Sim et al., 2003] allow us to assess the performance of the methods in the presence of large pose differences, and the benefits of taking facial symmetry into account (Section 2.9).

Finally, conclusions and future research lines are drawn in Section 2.10.

2.2 A Point Distribution Model For Faces

A point distribution model (PDM) of a face is generated from a set of training examples. For each training image Ii, N landmarks are located and their normalized coordinates (by removing translation, rotation and scale) are stored, forming a vector

X_i = (x_{1i}, x_{2i}, \ldots, x_{Ni}, y_{1i}, y_{2i}, \ldots, y_{Ni})^T = (x_i \; y_i)^T \qquad (2.1)

The pair (x_{ji}, y_{ji}) represents the normalized coordinates of the j-th landmark in the i-th training image. Principal Components Analysis (PCA) is applied to find the most important modes of shape variation. As a consequence, any training shape X_i can be approximately reconstructed:

X_i = \bar{X} + P b, \qquad (2.2)

where \bar{X} stands for the mean shape, P = [\phi_1 | \phi_2 | \ldots | \phi_t] is a matrix whose columns are unit eigenvectors of the first t modes of variation found in the training set, and b is the vector of parameters that defines the actual shape of X_i. So, the k-th component of b (b_k, k = 1, 2, \ldots, t) weighs the k-th eigenvector \phi_k. Also, since the columns of P are orthonormal, we have that P^T P = I, and thus:

b = P^T (X_i - \bar{X}), \qquad (2.3)

i.e. given any shape, it is possible to obtain its vector of parameters b. We built a 62-point PDM using manually annotated landmarks (some of them were provided by the FGnet project∗, while others were manually annotated by ourselves). Figure 2.1 shows the position of the landmarks on an image from the XM2VTS database [Messer et al., 1999].
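Equations (2.2) and (2.3) can be sketched directly; below is a minimal PCA shape model assuming the landmark vectors are already aligned (function names are ours):

```python
import numpy as np

def build_pdm(shapes, t):
    """shapes: (M, 2N) matrix of aligned landmark vectors, one row per image.

    Returns the mean shape and the matrix P whose columns are the unit
    eigenvectors of the first t modes of variation.
    """
    mean = shapes.mean(axis=0)
    cov = np.cov(shapes, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(vals)[::-1][:t]        # keep the t largest modes
    return mean, vecs[:, order]

def shape_params(x, mean, P):
    return P.T @ (x - mean)                   # b = P^T (X - mean), Eq. (2.3)

def reconstruct(b, mean, P):
    return mean + P @ b                       # X ~ mean + P b, Eq. (2.2)
```

With t equal to the full dimension the reconstruction is exact; truncating t trades fidelity for a compact, low-dimensional shape description.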

When a new image containing a face is presented to the system, the vector of shape parameters that fits the data, b, should be computed automatically. There are several techniques, like ASM [Cootes et al., 1995], IOF-ASM [Sukno et al., 2007],

∗ http://www-prima.inrialpes.fr/FGnet/data/07-XM2VTS/xm2vts-markup.html



Figure 2.1: Position of the 62 landmarks used in this work on an image from the XM2VTS database.



AAM [Cootes et al., 2001] to deal with this problem. In this work we have used manual annotations for most experiments, which allows us to test the classification performance alone, without the effect of landmark detection errors. In addition to these experiments, empirical validation with automatic fitting via IOF-ASM [Sukno et al., 2007] has also been performed on the XM2VTS database.

2.3 Pose Eigenvectors and Pose Parameters

Among the obtained modes of shape variation, we are interested in isolating the eigenvectors that are responsible for controlling the apparent changes in shape due to rigid facial motion (pose). For each eigenvector φk, the value of its corresponding parameter bk is swept within suitable limits (while the remaining ones are set to 0) and the reconstructed shapes are observed. This way, we can assess by visual inspection which eigenvectors contain pose information.

Clearly, the eigenvectors (and their relative positions) obtained after PCA strongly depend on the training data and hence, if all meshes used to build the PDM were strictly frontal, no eigenvector explaining rotations in depth would appear. However, if we are sure that pose changes are present in the training set, the eigenvectors explaining those variations will appear among the first ones, due to the fact that the energy associated with rigid facial motion should be higher than that of most expression/identity changes (once again, depending on the specific dataset used to train the PDM). With our settings, it turned out that φ1 controlled up-down rotations (see Figure 2.2) while φ2 was responsible for left-right rotations.

Figure 2.2: Effect of changing the value b1 on the reconstructed shapes. φ1 controls the up-down rotation of the face.

A major problem, inherent to the underlying PCA analysis, lies in the fact that a given pose eigenvector may not only contain rigid facial motion (pose) but also non-rigid (expression/identity) information, mostly depending on the training data used to build the PDM. Regarding φ1, it has been shown [Lyons et al., 2000] that there exists a dependence between the vertical variation in viewpoint (nodding) and the perception of facial expression, in that faces tilted forwards (leftmost shape in Figure 2.2) are judged as happier, while faces tilted backwards (rightmost shape in Figure 2.2) are judged as sadder.

Regarding φ2, the upper row of Figure 2.3 shows the reconstructed shapes obtained by varying b2. Apart from the left-right rotation, it is clear that φ2 also contains facial



Figure 2.3: Effect of changing the value b2 on the reconstructed shapes. Upper row: Coupling of both rigid (left-right rotation) and non-rigid (eyebrow movement and lip width) facial motion within the second eigenvector φ2. Lower row: When using virtual symmetric meshes to augment the training set, expression changes are not noticeable in φ2.

expression/identity information: faces rotated to the right seem to show surprise (raised eyebrows), while faces rotated to the left look more serious. Ideally, rigid facial motion should be orthogonal to the other factors of shape variation, but we can see that this does not hold exactly for φ2 (although variations due to in-depth rotation are much more important than those induced by expression/identity changes). In order to soften this coupling for the left-right eigenvector, the training set was augmented with virtual symmetric meshes. The reconstructed shapes in the lower row of Figure 2.3 show that expression changes are now not noticeable in the obtained eigenvector φ2. By enlarging the original training set with artificial symmetric samples:

• The variance of the training set is increased precisely in the left-right direction.

• No examples with new expression/identity information are introduced.

• The non-rigid variation is cancelled along φ2, because training samples looking to the right and those looking to the left will have exactly the same amount of expression/identity information: any exemplar in the original training set (with a given pose and expression/identity) has its corresponding mirror version (with the same expression/identity but opposite pose), and thus the augmented training data we get are “more balanced” in the left-right direction.
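The virtual symmetric meshes can be generated as sketched below; the left/right landmark index pairs are an assumption about how the annotation scheme is organized (e.g. landmark 1 ↔ 15 in the 62-point configuration) and must match the actual markup:

```python
import numpy as np

def mirror_shape(x, y, pairs):
    """Mirror a face mesh about the vertical axis.

    x, y: landmark coordinate vectors; `pairs` lists (i, j) index pairs of
    symmetric landmarks; midline landmarks not listed in `pairs` keep their
    own index. The x-coordinates change sign, and left/right labels swap so
    that landmark semantics (e.g. "left eye corner") are preserved.
    """
    perm = np.arange(len(x))
    for i, j in pairs:
        perm[i], perm[j] = j, i
    return -x[perm], y[perm]
```

Applying the mirroring twice recovers the original mesh, which is a quick sanity check that the permutation and the sign flip are consistent.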

2.3.1 Theoretical evidence on the fact that symmetric meshes help to decouple left-right rotations and non-rigid factors

We now provide a theoretical explanation behind the intuitive use of the augmented training set with virtual symmetric meshes. Let S be the original training set comprising M examples of face meshes:



S = (X_1 \; X_2 \; \ldots \; X_M) = \begin{pmatrix} x_1 & x_2 & \ldots & x_M \\ y_1 & y_2 & \ldots & y_M \end{pmatrix} = \begin{pmatrix} X \\ Y \end{pmatrix} \qquad (2.4)

Its covariance matrix C (assuming the data are zero-mean) is given by:

C = \frac{1}{M} S S^T \qquad (2.5)

As stated above, φ2 is the eigenvector that controls left-right rotations. By definition,

C \phi_2 = \lambda_2 \phi_2 \qquad (2.6)

where λ2 is its associated eigenvalue. From the top row of Figure 2.3 we already observed that there also exists some coupled non-rigid information within φ2 (which is mostly encoded in the vertical displacements of the eyebrows). Figure 2.4 shows the φ2 coefficients grouped by the facial feature (face contour, eyes, mouth, etc.) they affect. Concentrate, for instance, on the coefficients weighing the eyebrows’ y-coordinates: since they are far from 0, changing the specific value of b2 induces expression changes (eyebrow raising and bending). On the other hand, the coefficients controlling horizontal displacements (i.e. those weighing the x-coordinates) are mostly responsible for left-right rotations. Hence, it makes sense to assume that φ2 can be expressed as the sum of two components: one controlling left-right variations (LRV) and the other accounting for non-rigid variations (NRV):

\phi_2 = v_{LRV} + v_{NRV} \qquad (2.7)

From the upper row of Figure 2.3, it is straightforward to conclude that the non-rigid variation encoded in φ2 is clearly smaller than the left-right contribution. This conclusion can also be extracted from Figure 2.4, since the coefficients associated with v_{NRV} (mainly those corresponding to the mouth’s and eyebrows’ y-coordinates) have smaller values (in modulus) than those associated with v_{LRV} (see, for instance, the φ2 coefficients weighing the face contour’s x-coordinates). In general terms, the moduli of the x-coefficients are greater than those of their corresponding y-coefficients (with the exception of the eyebrows).

Let us now consider the symmetrized training set S′. For the i-th face shape X_i = (x_i \; y_i)^T, the y-coordinates of its mirror version do not change, whilst the x-coordinates have opposite sign. However, we have to take the following consideration into account: let us concentrate on the first landmark (x_1, y_1)^T from Figure 2.1. When obtaining the mirror version of this face shape, the symmetrized landmark (−x_1, y_1)^T


26 Chapter 2. ViewPoint-Robust 2-D Face Recognition

[Figure 2.4 plot: coefficients φ2[k] vs. index k, with the coefficient groups 1) contour (x), 2) eyebrows (x), 3) eyes (x), 4) nose (x), 5) mouth (x), 6) contour (y), 7) eyebrows (y), 8) eyes (y), 9) nose (y), 10) mouth (y).]

Figure 2.4: Coefficients from eigenvector φ2 (obtained with the original training set S) grouped by the specific facial feature they affect.

is no longer the first one, but actually becomes landmark #15, and vice versa. A similar reasoning can be applied to the remaining landmarks. Mathematically, this can be expressed by using a permutation matrix Π. Hence, S′ can be expressed as follows:

S′ = ( −ΠX ; ΠY ) = ( −Π 0 ; 0 Π ) ( X ; Y ) = AπS     (2.8)

Given that Π is a permutation matrix, it holds that Π^T = Π^{−1}, and the same occurs with Aπ, i.e. Aπ^T = Aπ^{−1}. The covariance matrix C′ of the symmetric training set is given by:

C′ = (1/M) S′S′^T = Aπ C Aπ^T     (2.9)

Given that Aπ^T = Aπ^{−1}, C and C′ have the same eigenvalues, and hence

C′φ′2 = λ2φ′2     (2.10)


where φ′2 is the eigenvector controlling left-right rotations (plus some coupled non-rigid variations) in the symmetric training set. Let us now observe the information contained in both φ2 and φ′2.

The upper row of Figure 2.5 shows the reconstructed shapes using φ2 (i.e. X(α) = X̄ + αφ2), while the bottom row plots the reconstructed shapes using φ′2, i.e. X′(α) = X̄′ + αφ′2, for a given value of α. The first thing we should note is that X̄ ≈ X̄′. Moreover, X(α) presents the same pose as X′(α) and the same non-rigid information as X′(−α). Taking these facts into account, φ′2 can be decomposed (in a similar way as φ2) into:

φ′2 = vLRV − vNRV     (2.11)

Adding Equations (2.6) and (2.10), and taking Equations (2.7) and (2.11) into account, we obtain:

Cφ2 + C′φ′2 = λ2 (φ2 + φ′2)

(C + C′) vLRV + (C − C′) vNRV = 2λ2 vLRV     (2.12)

Now, we will assume that (C − C′)vNRV is not significant compared to (C + C′)vLRV. In fact, we have already seen that the non-rigid contribution from vNRV is small compared to that of vLRV. Moreover, C ≈ C′ (see Figure 2.6 for visual evidence), and hence:

((C + C′)/2) vLRV ≈ λ2 vLRV     (2.13)

Dividing both sides of Equation (2.13) by the norm of vLRV, and denoting φLRV = vLRV/‖vLRV‖, we have:

((C + C′)/2) φLRV ≈ λ2 φLRV     (2.14)

From Equation (2.14), it is clear that φLRV (containing just left-right rotations) is (approximately) an eigenvector of (C + C′)/2 with associated eigenvalue λ2. It is straightforward to see that (C + C′)/2 is precisely the covariance matrix of the augmented training set comprising both the original and the symmetric meshes. In fact,

Saug = ( S S′ )     (2.15)

and its covariance matrix Caug is given by:


Caug = (1/2M) Saug Saug^T = (1/2M) ( S S′ ) ( S^T ; S′^T ) = (1/2)C + (1/2)C′ = (C + C′)/2

Hence, we have demonstrated that by using the augmented training set we are able to get rid of the component containing the small non-rigid variations (as was shown in the bottom row of Figure 2.3). Figure 2.7 plots the coefficients of the left-right eigenvector φ2,aug obtained with the augmented training set. By comparing this plot with Figure 2.4 we can conclude:

• The φ2,aug coefficients weighing the y-coordinates have smaller values in modulus (thus closer to 0) than the corresponding ones from the original eigenvector (hence reducing expression changes when sweeping b2). This is especially significant for the coefficients related to the eyebrows' y-coordinates.

• The coefficients weighing the right and left contour's x-coordinates (indices 1 to 7 and 9 to 15 respectively) show a perfectly symmetric pattern. This means that, for a given value of α, X(α) = X̄ + αφ2,aug and X(−α) = X̄ − αφ2,aug will show exactly opposite left-right angles. This was not the case for the original eigenvector φ2, whose contour x-coefficients did not show such a perfectly symmetric pattern.

• In fact, for every facial feature, the coefficients weighing the x-coordinates from the left side of the feature and the corresponding ones affecting the right side share the same values. This symmetry causes X(α) and X(−α) to be simple reflections of each other (as shown in the bottom row of Figure 2.3).
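The identity Caug = (C + C′)/2 can be checked numerically on synthetic data. The sketch below (a minimal verification; the landmark reversal chosen for Π is an illustrative assumption, as the real ordering depends on the annotation scheme) builds a random zero-mean training set, mirrors it with Aπ, and verifies Equations (2.8), (2.9) and the covariance of the augmented set:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 62, 100                        # landmarks per shape, training shapes

# Zero-mean training set S: each column is a shape [x1..xN, y1..yN]^T
S = rng.standard_normal((2 * N, M))
S -= S.mean(axis=1, keepdims=True)

# Permutation Pi swapping left/right landmarks (a simple reversal here),
# and the block matrix A_pi = diag(-Pi, Pi) producing mirrored shapes
Pi = np.eye(N)[::-1]
A_pi = np.block([[-Pi, np.zeros((N, N))],
                 [np.zeros((N, N)), Pi]])
S_sym = A_pi @ S                      # symmetrized training set S'

C = S @ S.T / M                       # Eq. (2.5)
C_sym = S_sym @ S_sym.T / M           # Eq. (2.9): equals A_pi C A_pi^T

# A_pi is orthogonal: A_pi^T = A_pi^{-1}
assert np.allclose(A_pi.T @ A_pi, np.eye(2 * N))
assert np.allclose(C_sym, A_pi @ C @ A_pi.T)

# Covariance of the augmented set (S S') is exactly (C + C')/2
S_aug = np.hstack([S, S_sym])
C_aug = S_aug @ S_aug.T / (2 * M)
assert np.allclose(C_aug, (C + C_sym) / 2)
```

The identity holds exactly for any training set, since it follows from block matrix algebra alone; the derivation's approximation only enters when interpreting φ2,aug as a pure left-right eigenvector.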


Figure 2.5: Upper row: Reconstructed shapes using φ2 (X̄, X(−α) = X̄ − αφ2, X(α) = X̄ + αφ2). Bottom row: Reconstructed shapes using φ′2 (X̄′, X′(−α) = X̄′ − αφ′2, X′(α) = X̄′ + αφ′2). Clearly, X(α) = X̄ + αφ2 has the same pose as X′(α) = X̄′ + αφ′2, and the same non-rigid information as X′(−α) = X̄′ − αφ′2.


Figure 2.6: Covariance matrix plots: a) C, b) C′, c) C + C′, d) C − C′. From these plots (C ≈ C′) and the fact that the non-rigid contribution is smaller than the rigid one, we can assume that (C − C′)vNRV is not significant compared to (C + C′)vLRV.


[Figure 2.7 plot: coefficients φ2,aug[k] vs. index k, with the same feature grouping as Figure 2.4.]

Figure 2.7: Coefficients from eigenvector φ2,aug (obtained with the augmented training set Saug) grouped by the specific facial feature they affect.


2.3.2 Experiment on a video-sequence: Decoupling of pose and expression

In order to demonstrate on real data that the presence of non-rigid factors within the identified pose-eigenvectors is minimal, we used a manually annotated video-sequence of a man during conversation† (hence, rich in expression changes). For each frame f in the video, the vector of shape parameters

b(f) = [b1(f), b2(f), . . . , bt(f)]^T

of the corresponding mesh X(f) was calculated and split into the rigid (pose) part

bpose(f) = [b1(f), b2(f), 0, . . . , 0]^T

and the non-rigid (expression/identity) part

bothers(f) = [0, 0, b3(f), . . . , bt(f)]^T

Finally, we calculated the reconstructed meshes Xpose(f) and Xothers(f) using Equation (2.2) with bpose(f) and bothers(f) respectively. Ideally, Xpose(f) should only contain rigid mesh information, while Xothers(f) should reflect changes in expression and contain identity information. As shown in Figure 2.8, although there exists some coupling (especially in the seventh row, with a small eyebrow bending in Xpose(f)), Xothers(f) is responsible for expression changes and identity information (face shape is clearly encoded in Xothers(f)), while Xpose(f) mainly contains rigid motion information. For instance, the original shapes from the first and second rows share approximately the same pose, but differ substantially in their expression. Accordingly, the Xpose's are approximately the same while the Xothers's are clearly different.
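The per-frame decoupling above can be sketched as follows, assuming a PDM given by its mean shape X̄ and an orthonormal eigenvector matrix Φ (Equations (2.2) and (2.3)); here the PDM is learned from random synthetic shapes purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, M, t = 124, 200, 10              # shape vector size (2*62), samples, modes

# Build a toy PDM: mean shape and the t leading eigenvectors of the covariance
data = rng.standard_normal((dim, M))
X_mean = data.mean(axis=1)
eigvals, eigvecs = np.linalg.eigh(np.cov(data))
Phi = eigvecs[:, ::-1][:, :t]         # columns = phi_1 .. phi_t

def decouple(X):
    """Split a shape into rigid (pose) and non-rigid reconstructions."""
    b = Phi.T @ (X - X_mean)                    # Eq. (2.3): shape parameters
    b_pose = np.r_[b[:2], np.zeros(t - 2)]      # keep b1, b2 only
    b_others = np.r_[np.zeros(2), b[2:]]        # keep b3 .. bt only
    X_pose = X_mean + Phi @ b_pose              # Eq. (2.2)
    X_others = X_mean + Phi @ b_others
    return X_pose, X_others

X = data[:, 0]
X_pose, X_others = decouple(X)
# Since b_pose + b_others = b, the two parts add back up to the
# full rank-t reconstruction of X
X_full = X_mean + Phi @ (Phi.T @ (X - X_mean))
assert np.allclose(X_pose + X_others - X_mean, X_full)
```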

2.3.3 Experiment on the CMU PIE database

The CMU PIE (Pose, Illumination, and Expression) database [Sim et al., 2003] consists of face images from 68 subjects recorded under different combinations of poses and illuminations (see Appendix A for a brief description of the database). Figure 2.9 shows the images taken for subject 04006 from all cameras with neutral illumination. As we can see, this database is especially suitable for testing the robustness of systems to left-right face rotations. In this work we use a subset of the database, namely the images taken from cameras 11, 29, 27, 05 and 37 (with corresponding nominal rotation angles of approximately −45◦, −22.5◦, 0◦, 22.5◦ and 45◦) under neutral illumination. All of them (a total of 68 × 5 images) were manually annotated with the same set of 62 landmarks shown in Figure 2.1.

†http://www-prima.inrialpes.fr/FGnet/data/01-TalkingFace/talking face.html


Figure 2.8: Experiment on the video sequence. Each row shows, for a given frame f, the original shape X(f) and the reconstructed shapes (Xothers(f) and Xpose(f)) using bothers(f) and bpose(f) respectively. Clearly, Xothers(f) controls expression and identity while Xpose(f) is mostly responsible for rigid changes.

Table 2.1: Relationship between b̄2 and the angle of rotation θ.

pose:  c11     c29     c27    c05    c37
θ:     −45◦    −22.5◦  0◦     22.5◦  45◦
b̄2:    −7.30   −4.21   0.38   4.69   7.25


Figure 2.9: Images taken from all cameras of the CMU PIE database for subject 04006. The 9 cameras in the horizontal sweep are each separated by about 22.5◦ [Sim et al., 2003].

For each annotated mesh, its vector of shape parameters b was calculated. So, for every subject, we have 5 b vectors, each one corresponding to a certain pose (11, 29, 27, 05 and 37), and, for a given pose, we have 68 vectors, each one corresponding to a subject from the database. Table 2.1 shows, for each pose, the average value of the parameter b2 (denoted b̄2), along with the nominal rotation angle θ. Clearly, there exists an approximately linear relationship between b̄2 and θ, in agreement with the result obtained in [Lanitis et al., 1997]. Hence it seems that, within the set of considered poses (ranging from −45◦ to 45◦), the PCA analysis is able to deal with rotations properly.
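The near-linear relationship can be quantified directly from the values in Table 2.1; the least-squares fit below is a quick sanity check, not part of the original pipeline:

```python
import numpy as np

# Average b2 per pose and nominal rotation angles (Table 2.1)
b2 = np.array([-7.30, -4.21, 0.38, 4.69, 7.25])
theta = np.array([-45.0, -22.5, 0.0, 22.5, 45.0])

slope, intercept = np.polyfit(b2, theta, 1)     # theta ≈ slope*b2 + intercept
r = np.corrcoef(b2, theta)[0, 1]

print(f"theta ≈ {slope:.2f} * b2 + {intercept:.2f},  r = {r:.4f}")
assert r > 0.99                                 # strong linear relationship
```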

Moreover, if φ2 were only responsible for left-right rotations, the variance of b2 with pose changes (inter-pose variance) should be high whilst, when pose is fixed, the variance of b2 (intra-pose variance) should be small, i.e. φ2 should not be seriously affected by other factors such as identity variations. Figure 2.10 presents both inter- and intra-pose variances for every parameter. It is clear that:

• b2 has the highest inter-pose variance among the whole set of shape parameters.

• The intra-pose variance associated to b2 is much lower than its inter-pose variance.

• Given that all tested poses have approximately the same elevation (≈ 0◦), the inter-pose variance is small for b1.

From Figure 2.10 it is clear that, apart from b2, other parameters present high inter-pose variances. However, their corresponding intra-pose variances are also high. The ratio between both quantities,

r = Inter-Pose Variance / Intra-Pose Variance,


Figure 2.10: Intra- and inter-pose variances for each of the shape parameters.


Figure 2.11: Inter-pose variance divided by intra-pose variance for each of the shape parameters. Clearly, the b2 parameter we identified as responsible for left-right rotations presents the highest ratio.


is an adequate way of measuring how the two variances are related for a given parameter. As shown in Figure 2.11, b2 presents the highest value among the set of shape parameters. Section 2.4.2, with examples of virtual images, will give another token of the fact that changing the particular values of b1 and b2 provokes variations in pose but does not seriously affect other facial properties such as identity or expression factors.
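One plausible way to compute these quantities (an illustrative operationalization on synthetic data; the thesis does not spell out the exact estimator) is to take, for each parameter, the variance of the per-pose means as the inter-pose variance, and the average within-pose variance across subjects as the intra-pose variance:

```python
import numpy as np

def pose_variance_ratio(B):
    """B: array of shape (n_poses, n_subjects, t) holding shape parameters.
    Returns r = inter-pose variance / intra-pose variance per parameter."""
    inter = B.mean(axis=1).var(axis=0)    # variance of the per-pose means
    intra = B.var(axis=1).mean(axis=0)    # mean within-pose variance
    return inter / intra

# Toy data: parameter 0 tracks pose, the rest are subject-dependent noise
rng = np.random.default_rng(2)
n_poses, n_subj, t = 5, 68, 10
B = rng.standard_normal((n_poses, n_subj, t))
B[:, :, 0] += np.linspace(-7, 7, n_poses)[:, None]   # pose-driven parameter

r = pose_variance_ratio(B)
assert np.argmax(r) == 0    # the pose-driven parameter has the highest ratio
```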

2.4 Virtual Face Synthesis

2.4.1 Thin Plate Splines Warping

All the methods that will be introduced in Section 2.5 share one common feature: given one face image I1, the coordinates of its respective fitted mesh, X1, and a new set of coordinates, X2, a virtual face image must be synthesized by warping the original face onto the new shape. For this purpose, we used the method developed by Bookstein based on thin plate splines [Bookstein, 1989]. Provided the set of correspondences between X1 and X2, the original face I1 is deformed so that the original landmarks are moved to fit the new shape.

Thin plate splines are a class of non-rigid spline mapping functions f(x, y) with several desirable properties for our application. They are globally smooth, easily computable and separable into affine and non-affine components. The thin plate spline is the two-dimensional analog of the cubic spline in one dimension, and it contains the least possible non-affine warping component needed to achieve the mapping. By the last statement, we mean that the sum of squares of all second-order partial derivatives,

∫∫_{R²} [ (∂²f/∂x²)² + (∂²f/∂y²)² + 2 (∂²f/∂x∂y)² ] dx dy     (2.16)

i.e. the bending energy, is minimized. By using two separate thin plate spline functions fx and fy, which model the displacement of the landmarks in the x and y directions, we arrive at a vector-valued function F = (fx, fy) which maps each point of the image into a new point in the image plane: (x, y) → (fx(x, y), fy(x, y)). A thin plate spline interpolation function can be written as:

f(x, y) = a0 + a1x + a2y + Σ_{i=1}^{N} wi U(‖(x, y) − X1^i‖)     (2.17)

where U(r) = r² log(r²) is the solution to the biharmonic equation Δ²U = 0 that satisfies the condition of bending-energy minimization, N stands for the number of landmarks, and the ai and wi represent the affine and non-affine components respectively. These parameters are chosen so that each landmark from the first shape is mapped


to its corresponding position in the second one, i.e. F(X1^i) = X2^i, ∀i = 1, . . . , N. This spline defines a global warping of space, and is therefore used to warp the entire source image onto the target shape. Since the warp is smoothly extrapolated from the overlapping area to all of space, no discontinuities are introduced. Furthermore, the minimal-warping property of the thin plate spline guarantees that the extrapolation will be reasonable.

2.4.2 Synthesizing virtual face images across pose using Thin Plate Splines

Figure 2.12 shows some examples of virtual face images, corresponding to subjects from the CMU PIE database, synthesized using Thin Plate Splines warping. For each subject, only the frontal image (i.e. pose 27) and its associated mesh are used as inputs. The vector of shape parameters b = [b1, b2, . . . , bt]^T is recovered from the frontal mesh using Equation (2.3), and b2 is forced to sweep a range of values, synthesizing virtual meshes using Equation (2.2). Finally, the frontal face image is warped onto these virtual meshes, generating face images under different viewpoints. Figure 2.13 shows the original face images taken at ±22.5◦ and ±45◦. Their corresponding virtual images are shown in the two leftmost and the two rightmost columns of Figure 2.12. It is clear that the synthetic faces are very similar to the real ones.

Figure 2.14 shows the virtual images synthesized by varying b1 in an analogous way. We would like to remark that, although some distortions due to the mesh manipulation and warping process may exist, identity and expression‡ properties are preserved in most of the synthesized images from both Figures 2.12 and 2.14.

‡In agreement with the conclusions presented in [Lyons et al., 2000], virtual faces tilted forwards look happier than those tilted backwards.


Figure 2.12: Examples of synthesized face images across azimuth. In each row, the frontal face is warped onto virtual meshes obtained by sweeping b2 within a range of values.


Figure 2.13: Original face images across azimuth (±22.5◦ and ±45◦). By comparing the two leftmost and the two rightmost columns of Figure 2.12 with the faces shown here, we can see that the virtual images are very similar to the original ones.


Figure 2.14: Examples of synthesized face images across elevation. In each column, the frontal face is warped onto virtual meshes obtained by sweeping b1 within a range of values.


Coming back to the discussion addressed in Section 2.3 regarding the decoupling between rigid and non-rigid information within an identified pose-eigenvector, we would like to re-emphasize that choosing a good training set is required in order to obtain such decoupling. Otherwise, rigid and non-rigid information may get mixed. It was shown that the use of an augmented training set with symmetric meshes helped to decouple left-right rotation and facial expression, and this can be noted once again when comparing the first row in Figure 2.12 with Figure 2.15. Clearly, in the latter, there exists some coupling between facial expression and pose changes (which was removed by using the augmented training set). Moreover, Figure 2.16 provides another clear example of the need for choosing an adequate training set, regarding the up-down eigenvector:

• The upper row shows the warped images obtained by modifying the identified up-down parameter when the training set contains enough up-down tilting examples. Clearly, modifying the specific value of the parameter does not affect the identity of the synthetic images.

• The lower row shows the warped images obtained by modifying the identified up-down parameter when the training set does not contain enough up-down tilting examples. Clearly, face shape and up-down rotation are mixed in the associated eigenvector, distorting identity as the value of the parameter changes.

Figure 2.15: Slight coupling between pose changes and facial expression when the training set has not been properly chosen.

Up to now, all synthesized images showed a neutral expression, but what happens in case a subject is expressing happiness, anger, etc.? Will the synthetic faces maintain the expression as pose changes? We already discussed that there exists a dependence between nodding and the perception of facial expression, but left-right rotations should not affect expression at all. Figure 2.17 clearly suggests that there is no change in the facial expression as pose changes for the two cases shown.

2.5 Pose Correction

Given a test image Itest with unknown identity and a training image Itrain of a given client, the system must output a measure of similarity (or dissimilarity) between


Figure 2.16: Upper row: Identity is not modified when changing the value of the up-down parameter (a good training set has been chosen). Lower row: Identity is clearly distorted when changing the value of the up-down parameter (not enough up-down tilting examples in the training set).

Figure 2.17: Upper row: Although pose is forced to change, synthetic images maintain the original worried expression. Lower row: The same occurs when the subject smiles.


them. Straightforward texture comparison between Itest and Itrain may not produce desirable results, as differences in pose could be quite important. So, in order to deal with these differences, we apply and compare three different algorithms that make use of the PDM parameters.

2.5.1 Warping to Mean Shape (WMS)

Once the meshes have been fitted to Itrain and Itest, both faces are warped onto the average shape of the training set, X̄, which corresponds to setting all shape parameters to 0, i.e. b = 0⃗. Thus, the images are deformed so that a set of landmarks is moved to coincide with the corresponding set of landmarks on the average shape, obtaining the warped versions of Itrain and Itest. The number of landmarks used as “anchor” points is another variable to be fixed. For the experiments, we used two different sets:

• The whole set of 62 points.

• The set of 14 landmarks used in [Lanitis et al., 1997].

As the number of “anchor” points grows, the synthesized image is more likely to present artifacts, because more points are forced to move to landmarks of a mean shape (which may differ significantly from the subject's shape). On the other hand, with few “anchor” points, only a small pose correction can be made.

2.5.2 Normalizing to Frontal Pose and Warping (NFPW)

We argue that normalizing only the pose parameters should produce better results than warping images onto a mean shape, because in the latter approach (WMS) discriminative information may be removed during the warping process, since all shape parameters are fixed to zero and the set of “anchor” points is forced to move to landmarks of a mean shape. Holistic approaches such as eigenfaces [Turk and Pentland, 1991] need all images to be embedded into a given reference frame (an average shape, for instance), in order to represent these images as vectors of ordered pixels. The problem arises when the subject's shape differs enough from the average shape, as the warped image may appear geometrically distorted, and subject-specific information may be removed (see Figure 2.18 for an example). Given that our recognition method is not holistic but uses local features instead, the reference-frame constraint is avoided and the distortion is minimized by modifying only pose parameters rather than the whole shape. In Figure 2.19, we can see a block diagram of this method. Given btrain and btest, only the subset of parameters that account for pose are fixed to the typical values of frontal faces from the training set (as the average shape corresponds to a frontal face, we fixed the pose parameters to zero, i.e. b^pose_train = b^pose_test = 0⃗). New coordinates are computed using Equation (2.2), and virtual images are synthesized for Itrain and Itest.


Figure 2.18: Images from subject 013 of the XM2VTS. Left: Original image. Right: Image warped onto the average shape. Observe that subject-specific information has been reduced (especially in the lips region).

2.5.3 Pose Transfer and Warping (PTW): Warping one image to adopt the other one's pose

Based on the particular values of b^pose_train and b^pose_test, we can also think of synthesizing a virtual face adopting the pose of the other one. A block diagram of this approach can be seen in Figure 2.20. Compared to NFPW, this approach has one computational disadvantage: if several training images {I1, I2, . . . , IT} are available for a given client, it is approximately T times slower than NFPW, as each comparison between Itest and {Ii}, i = 1, 2, . . . , T, needs the synthesis of a virtual image, whilst with NFPW only Itest must be warped once to adopt a frontal pose, provided that the images {Ii} have been warped and stored during the training stage.

Experiments on the video sequence used in Section 2.3.2 revealed slightly better performance when warping the near-profile face to adopt the pose of the near-frontal one. This particular choice is the one that will be tested in the first set of experiments over the XM2VTS database (Section 2.7).


Figure 2.19: Block diagram for pose correction using NFPW. After face alignment, the obtained meshes are corrected to frontal pose (Pose Normalization block), and virtual faces are obtained through Thin Plate Splines (TPS) warping. Finally, both synthesized images are compared. It is important to note that the processing of the training image could (and should) be done offline, thus saving time during recognition.


2.5.4 Taking advantage of facial symmetry

As we already introduced, the synthesis of a virtual image is accomplished by sampling texture from the original one. The problem arises when, due to self-occlusion, some face regions are not visible, i.e. texture is not available, and hence the corresponding regions in the pose-normalized image do not represent the subject's appearance correctly. In order to overcome this drawback, we take advantage of the vertical symmetry of the face. For a horizontal rotation in depth of the head, and once the mesh has been fitted, the parameter controlling the azimuth angle indicates whether the face is showing mostly its right or its left side. Whenever a frontal face is synthesized from a non-frontal view, we warp the original image and its mirror version onto the pose-corrected frontal mesh and then blend the two virtual images, using simple masks that weigh the two sides of the face appropriately (according to the current rotation, left or right, of the head), as can be seen in Figure 2.21. On the other hand, when using PTW and the images to be compared show opposite sides, i.e. face A is rotated to the left and face B is rotated to the right, direct warping from one pose to the other provides poor results. A better solution can be obtained with the use of mirror images (see Figure 2.22 for an example):

1. Mirror versions of both faces, Amirror and Bmirror, and their respective meshes, M^A_mirror and M^B_mirror, are obtained.

2. Pose parameters are transferred from mesh M^B to M^A_mirror and, similarly, from mesh M^A to M^B_mirror, obtaining the pose-transferred meshes M̂^A_mirror and M̂^B_mirror respectively.

3. Warping is performed from Amirror onto M̂^A_mirror, obtaining Âmirror, and from Bmirror onto M̂^B_mirror, obtaining B̂mirror.

4. Comparison is performed between A and B̂mirror, and between B and Âmirror.

5. The two obtained scores are averaged.
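The blending step used in the symmetric NFPW variant can be sketched as below; the horizontal linear ramp is an illustrative choice of mask, since the thesis only states that simple masks weigh the two sides of the face appropriately:

```python
import numpy as np

def blend_with_mirror(warped, warped_mirror, facing_left):
    """Blend a frontalized grayscale face with its frontalized mirror version.
    A horizontal ramp gives more weight to the side that was visible in the
    original view; `facing_left` flags which side that is (an assumed API)."""
    h, w = warped.shape
    ramp = np.linspace(1.0, 0.0, w)            # weight 1 on the left column
    if not facing_left:
        ramp = ramp[::-1]                      # weight 1 on the right column
    mask = np.broadcast_to(ramp, (h, w))
    return mask * warped + (1.0 - mask) * warped_mirror

# Toy example: two constant "images" blend into a horizontal gradient
A = np.full((4, 5), 1.0)
B = np.full((4, 5), 0.0)
out = blend_with_mirror(A, B, facing_left=True)
assert out[0, 0] == 1.0 and out[0, -1] == 0.0
```

In a real system the mirror input would be the flipped original warped onto the same frontal mesh, and smoother masks (e.g. a sigmoid around the symmetry axis) could reduce visible seams.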

2.6 Feature extraction

The recognition engine is based on Gabor filtering. Gabor filters are biologically motivated convolution kernels in the shape of plane waves restricted by a Gaussian envelope, as shown next:

ψm(x⃗) = (‖k⃗m‖²/σ²) exp(−‖k⃗m‖²‖x⃗‖²/(2σ²)) [ exp(i k⃗m · x⃗) − exp(−σ²/2) ]     (2.18)


Figure 2.20: Block diagram for pose normalization using PTW. After face alignment, mesh A adopts the pose of mesh B (Pose Transfer block), and virtual face Â is obtained through Thin Plate Splines (TPS) warping. Finally, faces Â and B are compared.


where k⃗m contains information about the frequency and orientation of the filters, x⃗ = (x, y)^T and σ = 2π. Our system uses a set of 40 Gabor filters with the same configuration as in [Wiskott et al., 1997]. The region surrounding a pixel in the image is encoded by the convolution of the image patch with these filters, and the set of responses is called a jet, J. So, a jet is a vector with 40 coefficients, and it provides information about a specific region of the image.

At each of the nodes of the pose-normalized mesh, a Gabor jet is extracted and stored for comparison. Given two images to be compared, say I1 and I2, with node coordinates P = {p⃗1, p⃗2, . . . , p⃗N} and Q = {q⃗1, q⃗2, . . . , q⃗N}, their respective sets of jets are computed: {Jp⃗i}_{i=1,...,N} and {Jq⃗i}_{i=1,...,N}. Finally, the score between the two images is given by:

SJ = fN { ⟨Jp⃗i, Jq⃗i⟩ }_{i=1,...,N}     (2.19)

where ⟨Jp⃗i, Jq⃗i⟩ represents the normalized dot product between corresponding jets, taking into account that only the moduli of the jet coefficients are used. In Equation (2.19), fN stands for a generic combination rule of the N dot products.
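With jets stored as complex vectors, the similarity of Equation (2.19), combined with the median rule used later in Section 2.7, can be sketched as follows (the random jets are placeholders for real filter responses):

```python
import numpy as np

def jet_similarity(J1, J2):
    """Normalized dot product of the moduli of two complex jets."""
    a, b = np.abs(J1), np.abs(J2)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def face_score(jets1, jets2):
    """Combine the N per-node similarities; f_N = median here."""
    return float(np.median([jet_similarity(j1, j2)
                            for j1, j2 in zip(jets1, jets2)]))

rng = np.random.default_rng(5)
jets = rng.standard_normal((62, 40)) + 1j * rng.standard_normal((62, 40))
assert abs(face_score(jets, jets) - 1.0) < 1e-12   # identical faces -> score 1
```

Since the moduli are non-negative, each local similarity lies in [0, 1], which makes the local scores directly comparable before fusion.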

Figure 2.21: Block diagram for pose normalization using NFPW and facial symmetry


Figure 2.22: Taking advantage of facial symmetry in PTW.


2.7 Face authentication on the XM2VTS database

Using the XM2VTS database [Messer et al., 1999], authentication experiments were performed on configuration I of the Lausanne protocol [Luttin and Maître, 1998] (see Appendix A for a description of both the database and the protocol) in order to confirm the advantages of modifying only pose parameters over warping onto a mean shape.

As explained in Section 2.6, N = 62 jets were computed for every image, thus obtaining 62 local scores when comparing two faces. The median rule was used to fuse these scores, i.e. f_N ≡ median. According to configuration I, 3 training images are available per client. Hence, when a test image claims a given identity, we obtain 3 scores, which may be fused in order to improve the results. Again, the median rule was used to combine these values, obtaining a final score ready for verification.

We compared the performance of the different methods presented in Section 2.5. More precisely:

• WMS :

1. WMS 14: Warping images onto a mean shape using the same set of 14 “anchor” points employed in [Lanitis et al., 1997].

2. WMS 62: Warping images onto a mean shape using 62 “anchor” points.

• NFPW : Normalizing only the subset of pose parameters to adopt a frontal mesh.

• PTW : Warping one image to adopt the pose of the other one.

Table 2.2 shows the False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total Error Rate (TER = FAR + FRR) over the test set for the above mentioned methods. Moreover, the last row of this table presents the baseline results when no pose correction is applied (Baseline).

We should remark that facial symmetry was not taken into account for these experiments. Pose variation is not a major characteristic of the XM2VTS database, and the impact of facial symmetry is not very noticeable on it. However, we will show with tests on the CMU PIE database that, in the presence of large pose changes, performance is significantly improved if symmetry is used.

2.7.1 Statistical Analysis of the Results

In [Bengio and Mariethoz, 2004], the authors adapt statistical tests to compute confidence intervals around Half Total Error Rate (HTER = TER/2) measures, and to assess whether or not there exist statistically significant differences between two approaches (see Appendix B). Following the methodology of [Bengio and Mariethoz, 2004],


Table 2.2: False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total ErrorRate (TER) over the test set for different methods.

METHOD     FAR(%)   FRR(%)   TER(%)
WMS 14     2.31     5.00     7.31
WMS 62     2.64     4.50     7.14
NFPW       2.17     2.75     4.92
PTW        1.76     3.00     4.76
Baseline   2.93     4.25     7.18

Table 2.3: Confidence interval around ΔHTER = HTER_A − HTER_B for Z_{α/2} = 1.645

METHOD    WMS 62             NFPW               PTW                Baseline
WMS 14    [−1.15%, 1.32%]    [0.07%, 2.32%]     [0.14%, 2.41%]     [−1.16%, 1.29%]
WMS 62                       [0.02%, 2.20%]     [0.08%, 2.30%]     [−1.21%, 1.17%]
NFPW                                            [−0.89%, 1.05%]    [−2.20%, −0.06%]
PTW                                                                [−2.30%, −0.12%]

for each comparison between the methods presented in Table 2.2, we calculated confidence intervals, which are shown in Table 2.3. From both Tables 2.2 and 2.3 we can conclude:

1. Although pose variation is not a major characteristic of the XM2VTS database, it is clear that the use of both NFPW and PTW significantly improved system performance compared to the baseline method.

2. Warping both images to a mean shape suffers from the greatest degradation in performance. Synthesizing face images with WMS seriously distorts the “identity” of the warped image, given that the performances of the baseline algorithm and the two WMS methods are very similar (robustness to pose is achieved at the cost of suppressing subject-specific information, leading to no improvement at all). Furthermore, we find that significant differences are present when comparing WMS with NFPW and PTW, as the confidence intervals do not include 0 in their range of values.

3. There are no statistically significant differences between WMS 14 and WMS 62, as the confidence interval contains 0.


4. For the same reason, no significant differences are present between NFPW andPTW.
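Under the normal approximation of [Bengio and Mariethoz, 2004], the significance check used in these comparisons can be sketched as follows; the exact variance expression employed for ΔHTER in the thesis may differ, so this is an illustrative reading, not the thesis' implementation:

```python
import math

def hter_std(far, frr, n_impostor, n_client):
    """Approximate std of an HTER estimate, assuming independent
    binomial FAR (over n_impostor trials) and FRR (over n_client trials)."""
    return math.sqrt((far * (1 - far) / n_impostor +
                      frr * (1 - frr) / n_client) / 4.0)

def significant(delta_hter, sigma, z=1.645):
    """The difference is significant iff the interval
    delta_hter ± z*sigma excludes 0 (here z = Z_{alpha/2} = 1.645)."""
    return abs(delta_hter) > z * sigma
```

For example, an interval such as [−2.30%, −0.12%] in Table 2.3 excludes 0, so the corresponding pair of methods (PTW vs. Baseline) differs significantly, whereas [−0.89%, 1.05%] (NFPW vs. PTW) does not.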

In Section 2.5 we stated that warping the near profile image reported slightly better results over a video sequence, and this fact was confirmed on the XM2VTS database, as warping the near frontal face gave a Total Error Rate of 5.45% (compared to 4.76%). However, we cannot conclude that these methods are statistically significantly different, as the confidence interval around ΔHTER was [−0.6681%, 1.3581%]. In Section 2.9, where results on the CMU PIE database are presented, we will follow the scheme of Figure 2.22, performing the two warps and averaging the scores obtained from both comparisons. As stated before, the use of symmetry on the XM2VTS database leads to small improvements that are not statistically significant (≈ 5% with respect to the non-symmetry versions).
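The FLIP block of Figure 2.22 amounts to mirroring both the image and its landmark mesh about the vertical axis. A sketch, assuming landmarks stored as an (N, 2) array of (x, y) coordinates; the `mirror_map` index permutation is an assumption, since it depends on the 62-point annotation scheme (each landmark is re-labelled as its left/right counterpart):

```python
import numpy as np

def flip_face(image, landmarks, mirror_map):
    """Mirror an image and its landmark mesh about the vertical axis.

    mirror_map[i] gives the index of landmark i's symmetric counterpart
    (e.g. left eye corner <-> right eye corner)."""
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()          # horizontal flip
    pts = landmarks.astype(float).copy()
    pts[:, 0] = (w - 1) - pts[:, 0]          # mirror x-coordinates
    return flipped, pts[mirror_map]          # re-label symmetric points
```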

In previous sections it was highlighted that the images synthesized using both NFPW and PTW were not suitable for holistic feature extraction, but this is not the case for WMS. In order to assess the performance of a (baseline) holistic method on this database, we applied eigenfaces [Turk and Pentland, 1991] to the images generated through WMS and obtained a TER of 16.27%, which is significantly worse than that of the local feature extraction on WMS images, with a confidence interval around ΔHTER of [2.95%, 6.01%].

2.8 Results with automatic fitting via Invariant Optimal Features Active Shape Models (IOF-ASM)

In order to assess the degradation in performance when the face is automatically segmented within the image, we carried out experiments in collaboration with the Pompeu Fabra University [Gonzalez-Jimenez et al., 2006]. Automatic fitting of facial features was performed via Invariant Optimal Features Active Shape Models (IOF-ASM) [Sukno et al., 2007], and NFPW was the technique adopted for pose correction. IOF-ASM is briefly described in Appendix C.

Table 2.4 shows a comparison between the manual version -NFPW(Manual)- and the automatic approach with IOF-ASM fitting -NFPW(Auto)-. At this point, the reader may notice that there exist differences between the results of NFPW in Table 2.2 and those of NFPW(Manual) in Table 2.4 (published in [Gonzalez-Jimenez et al., 2006]), which in fact should be equal. The explanation is that [Gonzalez-Jimenez et al., 2006] applied a different feature extraction scheme from the one used throughout this chapter§. Moreover, results from a set of algorithms that entered the

§This scheme included an accuracy-based feature selection which will be presented in Chapters 3 and 4.


Table 2.4: False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total Error Rate (TER) over the test set for our method and automatic approaches from [Messer et al., 2003].

                     Conf. I                           Conf. II
METHOD               FAR(%)   FRR(%)   TER(%)          FAR(%)   FRR(%)   TER(%)
NFPW (Auto)          0.83     2.75     3.58 ± 1.35     0.85     2.00     2.85 ± 1.15
NFPW (Manual)        0.46     2.75     3.21 ± 1.35     0.72     1.50     2.22 ± 1.00
UPV                  1.23     2.75     3.98 ± 1.35     1.55     0.75     2.30 ± 0.71
UNIS-NC              1.36     2.50     3.86 ± 1.29     1.36     2.00     3.36 ± 1.15
IDIAP                1.95     2.75     4.70 ± 1.35     1.35     0.75     2.10 ± 0.71

competition held in conjunction with the Audio- and Video-based Biometric Person Authentication (AVBPA) conference in 2003 [Messer et al., 2003] are also included in rows 3–5 of Table 2.4. All three of these algorithms are automatic. In this table, and derived from the work in [Bengio and Mariethoz, 2004] (see Appendix B), 90% confidence intervals for the TER measures are also given. As we can see, the use of IOF-ASM offers accurate results for our task, as the degradation between the error rates with manual and automatic segmentation is small. Moreover, from the comparison against rows 3–5, it is clear that the automatic approach offers state-of-the-art error rates in both configurations (with no statistically significant differences between methods).

2.9 Face Identification on the CMU PIE database

Up to now, the obtained results have shown the benefits of normalizing only pose parameters, but the performance of the methods under large pose variations has not yet been assessed. Moreover, we want to test whether there exist improvements when facial symmetry is taken into account. For these purposes, we used the pose subset of the CMU PIE database that was introduced in Section 2.3.3.

2.9.1 Experimental setup

Following [Phillips et al., 2000b], we distinguish between gallery and probe images. The gallery contains images of known individuals, which are used to build templates, and the probe set contains images of subjects with unknown identity, which must be compared against the gallery. A closed universe model is used to assess system performance, meaning that every subject in the probe set is also present in the gallery. We did not restrict ourselves to working with frontal faces as gallery. Instead, the performance of the system was computed for all possible (gallery, probe) combinations.

Table 2.5: Identification rates (%) on the CMU PIE database: No pose correction

Gallery Pose \ Probe Pose    c11     c29     c27     c05     c37
c11                          –       94.12   63.24   48.53   25.00
c29                          97.06   –       92.65   66.18   39.71
c27                          79.41   91.18   –       92.65   51.47
c05                          67.65   80.88   98.53   –       88.23
c37                          23.53   38.24   51.47   77.94   –
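The closed-universe evaluation just described amounts to rank-1 nearest-template identification, which can be sketched generically as follows; the `score_fn` argument is a placeholder for the face-matching score of Section 2.6:

```python
import numpy as np

def identification_rate(gallery, probes, probe_labels, gallery_labels, score_fn):
    """Rank-1 identification under a closed-universe model: each probe
    is assigned the identity of the best-scoring gallery template."""
    correct = 0
    for probe, true_id in zip(probes, probe_labels):
        scores = [score_fn(probe, g) for g in gallery]
        if gallery_labels[int(np.argmax(scores))] == true_id:
            correct += 1
    return 100.0 * correct / len(probes)
```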

2.9.2 Our results

Table 2.5 shows the baseline results when no pose correction is applied. The average recognition rate is 68.38%. When the NFPW method is used, the correct identification rate increases to 78.46% (Table 2.6). However, results are poor for completely different viewpoints. As can be seen from Table 2.7, performance is improved if facial symmetry is taken into account, leading to an average recognition rate of 87.50%. It seems a rather safe hypothesis that better results could be obtained if pose is transferred and facial symmetry is used (PTW plus symmetry), especially when viewpoints are quite different. The results shown in Table 2.8 confirm this supposition. The average recognition rate over all (gallery, probe) pairs is 91.47%, and the highest improvements are achieved when the gallery and probe sets are facing opposite directions –pairs (11, 05), (05, 11), (11, 37), (37, 11), (29, 05), (05, 29), (29, 37) and (37, 29), i.e. the bottom-left four cells and the top-right four cells–. The average recognition rate in these cells increases from 74.63% using NFPW and symmetry to 84.56% using PTW and symmetry. The average recognition rate in the other cells is the same (96.08%) for the two methods.

2.9.3 Other researchers’ results

Table 2.9 presents the recognition rates achieved with the 3D morphable model-based face recognition system described in [Romdhani et al., 2002]. The average recognition rate is 88.45%. As we can see, PTW plus symmetry achieves better performance over the set of considered poses. Although their approach is semi-automatic, some parameters, such as the face pose, focal length, etc., used for algorithm


Table 2.6: Identification rates (%) on the CMU PIE database: NFPW without facial symmetry

Gallery Pose \ Probe Pose    c11     c29     c27     c05     c37
c11                          –       97.06   77.94   55.88   19.12
c29                          98.53   –       98.53   73.53   44.12
c27                          89.71   98.53   –       100     85.29
c05                          67.65   85.29   100     –       97.06
c37                          36.76   54.41   91.18   98.53   –

Table 2.7: Identification rates (%) on the CMU PIE database: NFPW plus facial symmetry

Gallery Pose \ Probe Pose    c11     c29     c27     c05     c37
c11                          –       97.06   88.23   80.88   66.18
c29                          98.53   –       95.59   85.29   67.65
c27                          94.12   98.53   –       100     89.71
c05                          80.88   80.88   98.53   –       100
c37                          70.59   64.71   94.12   98.53   –

Table 2.8: Identification rates (%) on the CMU PIE database: PTW plus facial symmetry

Gallery Pose \ Probe Pose    c11     c29     c27     c05     c37
c11                          –       100     77.94   85.29   82.35
c29                          95.59   –       97.06   89.71   76.47
c27                          94.12   98.53   –       100     97.06
c05                          89.71   94.12   98.53   –       98.53
c37                          80.88   77.94   97.06   98.53   –


initialization, are computed using data provided by the maker of the database. In [Chai et al., 2006] and [Chai et al., 2005], only frontal images were used as gallery. The recognition rates for these two methods are shown in the first two rows of Table 2.10, with averages of 85.5% and 94.87% respectively. For the same gallery, NFPW plus symmetry obtains 95.59% and PTW plus symmetry achieves 97.43% correct recognition rate.

In [Gross et al., 2004], the authors propose an appearance-based algorithm that uses a special kind of holistic feature (the so-called eigen light-field: ELF) for face recognition. From the third to the sixth row, Table 2.10 presents the results achieved with two different versions of the ELF approach: the 3-point ELF (3 points - eyes and mouth - are used to warp the face image) and the Complex ELF (where a set of manually annotated points is used for the normalization). Due to the use of manual landmarks, the latter is especially suitable for comparison with our method. We can see that PTW plus symmetry outperforms the Complex ELF in the range of considered poses¶ (93% compared to 82.5% correct recognition rate). [Gross et al., 2004] also presented the performance of a baseline method using holistic feature extraction (eigenfaces). The results achieved with this method are shown in the last two rows of Table 2.10 and, as can be seen, they are even worse than those of our baseline algorithm.

In [Gross et al., 2001], results obtained with Visionics’ face recognition module FaceIt were presented. Table 2.11 shows the performance on the set of considered poses. If only one image is used as gallery (first five rows), the average recognition rate is 66.10%. This performance is similar to the one obtained with our baseline method, where no pose correction was applied. Clearly, correcting pose with any of the approaches we have proposed provides better results than FaceIt. If not only frontal images but also faces rotated to the right and to the left are used as gallery (last two rows of Table 2.11), the general performance is improved and the recognition rate, for a given probe pose, is approximately limited by the performance of the best (single) gallery pose.

2.9.4 Testing the system with large pose changes

Up to now, we used images from the PIE database ranging from −45◦ to 45◦ of horizontal rotation. It has been demonstrated, both with realistic examples (Section 2.4.2) and with good identification rates, that the method is suitable for face recognition across pose within the mentioned range of angles. These results are in agreement with [Lanitis et al., 1997] and [Cootes et al., 2000], where the authors state that a linear model is enough to simulate large changes in viewpoint, as long as all the landmarks

¶At the moment of writing this PhD Thesis, only numerical results with poses c27 and c37 as gallery could be obtained from the authors of [Gross et al., 2004].


Table 2.9: Identification rates (%) on the CMU PIE database: 3D Morphable Model with LiST fitting algorithm [Romdhani et al., 2002]

Gallery Pose \ Probe Pose    c11    c29    c27    c05    c37
c11                          –      94     94     74     65
c29                          96     –      96     78     68
c27                          93     97     –      99     94
c05                          88     90     99     –      93
c37                          82     82     93     94     –

Table 2.10: Identification rates (%) on the CMU PIE database: Other results

                                     Probe Pose
Method               Gallery Pose    c11    c29    c27    c05    c37
[Chai et al., 2006]  c27             76.5   95.6   –      91.2   77.9
[Chai et al., 2005]  c27             95     97     –      98     89
ELF 3-point          c27             76     85     –      89     75
ELF 3-point          c37             73     66     80     80     –
ELF Complex          c27             76     90     –      94     90
ELF Complex          c37             70     69     83     88     –
Eigenfaces           c27             40     64     –      48     54
Eigenfaces           c37             10     20     53     35     –

Table 2.11: Identification rates (%) on the CMU PIE database: Visionics’ FaceIt results [Gross et al., 2001].

Gallery Pose \ Probe Pose    c11    c29    c27    c05    c37
c11                          –      78     73     57     40
c29                          87     –      91     68     44
c27                          75     93     –      93     62
c05                          54     65     91     –      66
c37                          37     35     53     60     –
c11–c27–c37                  –      91     –      99     –
c05–c27–c29                  88     –      –      –      66


remain visible. But what if we exceed this range? Is the system able to deal with such extreme pose variations? As discussed in [Cootes et al., 2000], when some of the landmarks become occluded, the assumptions of the linear model break down and therefore it is no longer valid.

Clearly, in a near profile image (see Figure 2.23) we cannot use the original set of landmarks from Figure 2.1, because some of them are occluded. Based on the previous statement (“a linear model is enough to simulate large changes in viewpoint, as long as all the landmarks remain visible”) and for large rotation angles, we decided to use the restricted set of visible landmarks: after training the PDM, we assess by visual inspection the specific values of b2 that start producing severe occlusions in both directions, and determine the subset of landmarks that is visible in the presence of such extreme rotations. Figure 2.24 shows the effect of using high b2 values on the reconstructed shape. Apparently, the “occluded” contour seems to be really disappearing behind the visible features, while the non-occluded landmarks still define plausible (rotated) face shapes. Hence, in order to avoid distortions during the warping process, we discard the “occluded” landmarks and their corresponding regions. The upper row of Figure 2.25 shows the result of using the whole set of 62 landmarks in the warping process. Clearly, facial images are seriously distorted when large pose changes are induced. More realistic faces can be synthesized with the restricted set of visible landmarks (lower row of Figure 2.25).

In order to assess the performance of the system in the presence of large rotations, we used the faces from poses 11, 29, 27, 05 and 37 as gallery and the images from poses 02 and 14 (≈ ±67.5◦ of rotation) as probe. For these two poses, the whole set of 62 landmarks is not visible and hence we are not able to obtain the vector of shape parameters nor the pose parameters as explained in Sections 2.2 and 2.3. For the same reason, we cannot apply NFPW in a straightforward manner. However, we will demonstrate that a variant of the PTW method is useful in this case. In Section 2.3.3 it was shown that there exists an approximate linear relationship between b2 and the angle of horizontal rotation θ. Although we know that in the presence of large rotations the linear dependence may not hold, estimates of the b2 values for θ = ±67.5◦ (namely b2,−67.5◦ and b2,+67.5◦) were computed assuming linearity. For each of the gallery images, the b2 value of its corresponding mesh was set to b2,−67.5◦ and b2,+67.5◦, and virtual meshes were obtained. Finally, taking facial symmetry and the subset of visible landmarks into account, we synthesized virtual faces. In Figure 2.26, we can see several examples of warped images obtained with this procedure along with the original faces at poses 02 and 14. It is clear that:

• Corresponding virtual and original images show similar pose; hence the procedure of estimating b2 values for θ = ±67.5◦ assuming linearity turned out to work quite well. However, the reconstructed face does not always adopt the correct pose (see, for instance, the last two columns in the second row of Figure 2.26).

Figure 2.23: Near profile image from subject 04004 of the PIE database.

Figure 2.24: Effect of large b2 values on the reconstructed shapes. The “occluded” contour, marked with a blue dashed line, seems to disappear behind the visible features.

• Identity is preserved in the synthesized images. Although large changes in pose are induced, neither the PDM nor the warping process introduces serious distortions when only visible landmarks are used. However, some features, such as the shape of the nose, cannot always be reconstructed accurately due to the lack of 3D information in the 2D model.
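The linear extrapolation of b2 to θ = ±67.5◦ described above can be sketched with a least-squares line fit; the calibration pairs below are illustrative values only, not the thesis' PDM data (in the thesis they come from meshes fitted at the known PIE rotation angles, Section 2.3.3):

```python
import numpy as np

# Hypothetical (theta in degrees, fitted b2) calibration pairs
theta = np.array([-45.0, -22.5, 0.0, 22.5, 45.0])
b2 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Degree-1 least-squares fit: b2 ~ slope * theta + intercept
slope, intercept = np.polyfit(theta, b2, 1)

# Extrapolate beyond the training range to the near-profile angles
b2_minus_67, b2_plus_67 = np.polyval([slope, intercept], [-67.5, 67.5])
```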

Table 2.12 shows the results for every (gallery, probe) pose combination. The average recognition rate is 77.5%. Table 2.13 presents the results for the same set using the 3D Morphable Model with the LiST fitting algorithm [Romdhani et al., 2002]. The average recognition rate for this technique is 79.7%. As we can see, the variant of the PTW method achieves similar performance to that of the 3D Morphable Model when near profile faces are tested. However, using only 2D information, we cannot expect good performance for full profile images. In fact, the average recognition rate drops to 20% with poses 22 and 34 (±90◦). In this case, the 3D model clearly outperforms our system, with an average recognition rate of 55%.

Table 2.12: Identification rates (%) on the CMU PIE database: Testing the system with extreme pose changes

Gallery Pose \ Probe Pose    c14     c02
c11                          95      70
c29                          80      60
c27                          62.5    72.5
c05                          67.5    87.5
c37                          85      100


Table 2.13: Identification rates (%) on the CMU PIE database: 3D Morphable Model with LiST fitting algorithm [Romdhani et al., 2002]

Gallery Pose \ Probe Pose    c14    c02
c11                          99     53
c29                          90     54
c27                          87     76
c05                          85     85
c37                          78     90

Figure 2.25: Upper row: Examples of virtual images using the whole set of 62 landmarks. As we can see, serious distortions are induced in the presence of large rotations. Lower row: Synthesized faces using the set of visible landmarks. Clearly, images under large pose changes are much more realistic and seem to preserve identity information correctly.


2.10 Conclusions

Based on a subset of the modes of a Point Distribution Model, namely the pose parameters, we have proposed methods which try to minimize differences in pose while preserving discriminative subject information. We have demonstrated that the identified pose parameters are mostly responsible for rigid mesh changes, and do not contain important non-rigid (expression/identity) information that could severely distort the synthesized images.

Qualitatively, we justified the benefits of normalizing only pose parameters instead of warping onto an average face shape. This fact was quantitatively confirmed by authentication tests on the XM2VTS database, not only with a relative improvement of 31-35%, but also with the certainty that there exist statistically significant differences between both approaches. Moreover, the identification experiments on the CMU PIE database show:

• Taking advantage of facial symmetry does clearly improve system performance.

• Transferring pose performs better than normalizing to frontal pose, at the cost of more computation.

• The proposed methods achieve state-of-the-art results, outperforming the 3D morphable model and other approaches in a set of rotation angles ranging from −45◦ to 45◦.

• The variant of PTW achieves similar performance to that of the 3D morphable model when near profile images are used, but degrades with full profile views.

Hence, we have demonstrated the suitability of the methods (especially PTW ) for face recognition across pose in a set of angles ranging from −45◦ to 45◦, and shown that a 2D model can deal with rotations up to 67.5◦, obtaining performance similar to that achieved using a more complex 3D system. However, the latter outperforms the 2D model significantly when full profile views are tested.

Although some results with automatic fitting were presented in Section 2.8, showing little degradation in comparison with manually annotated landmarks, we have not yet validated our methods on the CMU PIE database with automatic face alignment. Hence, the next step will be to test the pose correction stage on this database with automatic fitting using some well known techniques such as AAM [Cootes et al., 2001]. With the help of a face tracking algorithm such as [Baker and Matthews, 2001], we plan to perform experiments on video sequences in order to test the suitability of the methods for pose-robust face recognition from video (see Chapter 6 for preliminary results on this topic). Although the variant of the PTW method turned out to work quite well, we should still refine this algorithm on near-profile views. Another possible improvement could be to learn a view-based weight function, so that, depending on


Figure 2.26: First and third columns show original images at poses 14 and 02 respectively, while second and fourth columns present the corresponding synthesized images using the variant of the PTW method introduced in Section 2.9.4.


the current pose of the face, some regions get more importance than others in the computation of the final similarity score.


Chapter 3

Shape-driven Gabor Jets

Contents

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.2 Ridges and Valleys Detector . . . . . . . . . . . . . . . . . 70

3.3 Shape sampling . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.4 Extracting textural information . . . . . . . . . . . . . . . 73

3.5 Shape Contexts . . . . . . . . . . . . . . . . . . . . . . . . . 74

3.5.1 Invariance to scaling, translation and rotation . . . . . . . . 76

3.6 Texture dissimilarity . . . . . . . . . . . . . . . . . . . . . 77

3.7 Measuring dissimilarity between sets of points . . . . . . 77

3.8 Combining Shape and Texture . . . . . . . . . . . . . . . 79

3.9 Testing the system against lighting and expression variations . . . . . . . . 80

3.9.1 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

3.9.2 Facial expression changes . . . . . . . . . . . . . . . . . . . 82

3.9.3 Illumination variation . . . . . . . . . . . . . . . . . . . . . 83

3.10 Face Authentication on the XM2VTS database . . . . . 84

3.10.1 Comparison with EBGM . . . . . . . . . . . . . . . . . . . 84

3.10.2 Measuring GD1 and GD2 performance . . . . . . . . . . . . 87

3.10.3 Shape and texture combination results . . . . . . . . . . . . 90

3.10.4 Results from other researchers . . . . . . . . . . . . . . . . 90

3.10.5 Accuracy-based Feature Selection (AFS) . . . . . . . . . . . 91

3.11 Face Authentication on the BANCA database . . . . . . 93


3.12 Distance Measures for Gabor Jets Comparison . . . . . 95

3.12.1 Distance between faces . . . . . . . . . . . . . . . . . . . . . 96

3.12.2 Results on BANCA’s MC protocol . . . . . . . . . . . . . . 98

3.13 Conclusions and further research . . . . . . . . . . . . . . 100

3.1 Introduction

Gabor filters [Gabor, 1946] have received great attention in the context of automatic face recognition for biological reasons, since it has been shown that these filters have a shape similar to the receptive fields of simple cells found in the visual cortex of mammals, and because of their optimal resolution in both the frequency and spatial domains [Daugman, 1980, Daugman, 1985, Daugman, 1988, MacLennan, 1991].

Recently, a survey on the use of Gabor filters for face recognition was published [Shen and Bai, 2006a], revealing the huge number of papers that have adopted such features for face processing. Some of the most successful Gabor-based approaches are those using Elastic Graph Matching to locate a set of pre-defined positions in the image, pioneered by the Dynamic Link Architecture (DLA) [Lades et al., 1993] and the Elastic Bunch Graph Matching (EBGM) [Wiskott et al., 1997] approaches. In DLA, a rectangular model graph attached with jets is built for every user in the gallery, and the graph matching stage is required for each image pair. Based on DLA, [Wiskott et al., 1997] proposed the EBGM approach, in which a more appropriate graph structure to represent faces is employed. Compared to DLA, EBGM uses an object-adapted graph (i.e. a face-like graph) whose nodes refer to specific facial landmarks (eyes, tip of the nose, mouth. . . , i.e. “universal” landmarks). Moreover, in order to reduce the computational burden of matching with each individual model graph (as is done in DLA), they proposed the so-called Face Bunch Graph (FBG). For recognition, multi-scale and multi-orientation Gabor responses (jets) are computed from each node of the face graph. Without a doubt, EBGM is one of the most famous approaches in face recognition, and a large number of modifications, improvements and extensions based on graph matching have been proposed [Duc et al., 1999, Kotropoulos et al., 2000, Liao and Li, 2000, Jiao et al., 2002, Mu et al., 2003, Kela et al., 2006, Shin et al., 2007].

[Smeraldi and Bigun, 2002] applied a retinal vision-based algorithm to detect pre-defined facial landmarks, from which Gabor features were extracted and used for authentication. Instead of a face-like configuration, other researchers have decided to use a rectangular grid to compute features, such as the approaches in [Duc et al., 1999] using Gabor filters, and [Kotropoulos et al., 2000], [Zafeiriou et al., 2005] applying multi-scale morphological features. At this point, the next question arises: given that recognition aims to discriminate between subjects, why should these features


be extracted at pre-defined, universal positions? In contrast to the above referenced methods, we suggest extracting features from positions or regions that are somehow subject-specific.

Since Elastic Bunch Graph Matching is certainly one of the most referenced algorithms using a pre-defined face-like configuration, we will use it as a baseline, comparing its performance against that obtained by our approach. Finding every fiducial point in EBGM relies on a matching process between the candidate jet and a bunch of jets extracted from the corresponding fiducial points of training faces. This matching problem is solved by maximizing a function that takes texture and mesh distortion into account. In this way, several variables can affect the accuracy of the final positions, such as differences in pose, illumination conditions and insufficient representativeness of the stored bunch of jets. Once the fiducial points are adjusted, only textural information (Gabor jets) is used in the classifier.

The main novelty of our approach, namely Shape-driven Gabor Jets (SDGJ), is somewhat conceptual, since our ultimate goal is to exploit individual face structure so that the system focuses on subject-specific discriminative points/regions, rather than on universal landmarks as EBGM does. In this sense, the specific processing applied to faces in order to achieve that goal is (although critical) just an implementation issue. In practical terms, the main differences between EBGM and our current approach lie in the way we locate and match points, and in the final dissimilarity function, which uses not only texture but also geometrical information.

Our method locates salient points in face images by means of a ridges and valleys detector [Lopez et al., 1999]. Low-level descriptors, such as edges or ridges and valleys, have already been used for face recognition, motivated by cognitive psychological studies [Biederman and Gu, 1988], [Bruce et al., 1992] which indicated that human beings can recognize line drawings as quickly and almost as accurately as gray-level pictures. For instance, [Takacs, 1998] proposes to use edge map coding and a modified pixel-wise Hausdorff distance to compare faces. [Gao and Leung, 2002] introduces a compact face feature, the so-called Line Edge Map (LEM), to code and recognize faces. [Alba-Castro et al., 2003] proposes a supervised discriminant Hausdorff distance to compare sketches obtained by means of a ridges and valleys detector. In this work, however, we will also use texture information to improve classification accuracy. In fact, at each of the located points, Gabor jets will be calculated and stored for further comparison. One of the main advantages of localizing points by means of a ridges and valleys detector is that, as only some basic image operations are needed, the computational load is reduced with respect to the original EBGM algorithm and, at the same time, possible discriminative locations are found at an early stage of the recognition process. In this sense we say that this method is inherently discriminative, in contrast to trainable parametric models. Some of the located points may belong to "universal" landmarks, but many others are person-dependent. The correspondence


between points of two faces uses only geometrical information and is based on shape contexts [Belongie et al., 2002]. This way, comparison between shape-driven jets is feasible. As a byproduct of the correspondence algorithm, we extract measures of local geometrical distortion, and the final dissimilarity function combines geometrical and textural information. To the best of our knowledge, the combination of tools we apply (low-level face description + shape matching + feature extraction) is novel in the field of face recognition.

A topic that seems to have attracted practically no attention in Gabor-based face recognition systems is the selection of the similarity (or distance) function used to compare corresponding features. In fact, although most approaches have used the cosine distance (normalized dot product), this choice is motivated neither by a theoretical nor by an experimental evaluation. To the best of our knowledge, the only evaluation of distances for Gabor jet comparison was performed in [Jiao et al., 2002], where the authors concluded that the Manhattan (or city block) distance outperformed both the cosine and Euclidean distances. In this chapter we also propose a more extensive evaluation, comparing seven different distances for measuring similarities between shape-driven Gabor jets, as well as assessing the impact of the specific normalization method applied to jets before comparison.

This chapter is organized as follows: Section 3.2 introduces the ridges and valleys detector. Grid adjustment and selection of points are described in Section 3.3, while Section 3.4 explains texture extraction through Gabor filters. Section 3.5 presents the algorithm used to match points between two faces. The Sketch Distortion term is introduced in Section 3.7. Section 3.8 proposes a linear combination to fuse shape and texture scores. In Section 3.9, we conduct experiments on the AR face database [Martínez and Benavente, 1998] to test the performance of the system against lighting and expression changes. Experimental results are given in Sections 3.10 and 3.11 for the XM2VTS [Messer et al., 1999] and BANCA [Bailly-Bailliere et al., 2003] databases respectively. Section 3.12 presents the empirical evaluation of distances for Gabor jet comparison. Finally, conclusions and future research lines are drawn in Section 3.13.

3.2 Ridges and Valleys Detector

First of all, shape information must be extracted from the face images. Although other descriptors, such as edges, could be used, face shape has been obtained through a ridges and valleys detector. Unlike edges, for which there is an agreed mathematical characterization, the case of ridges and valleys is more complex, and several mathematical characterizations exist that try to formalize the intuitive notion of ridge/valley. In this work, we have used the ridges and valleys obtained by thresholding the so-called Multi-local Level Set Extrinsic Curvature (MLSEC) [Lopez et al., 1999], [Lopez et al., 2000]. The main reasons that support the choice of the MLSEC


are its invariance to both rigid image motions and monotonic grayscale changes and, mainly, its high continuity and meaningful dynamic range [Lopez et al., 2000]. Basically, the MLSEC operator works as follows:

• Computing the normalized gradient vector field of the smoothed image,

• Calculating the divergence of this vector field, which is bounded and gives an intuitive measure of valleyness (positive values running from 0 to 2) and ridgeness (negative values from -2 to 0), and

• Thresholding the response, so that image pixels where the MLSEC response is smaller than −τ1 are considered ridges, and those pixels where it is larger than τ2 are considered valleys.

Several parameters must be adjusted, such as τ1 and τ2. In this work, we fixed τ1 = τ2 = 1. The smoothing filter applied to the faces can also be modified, leading to a more or less detailed shape image. Figure 3.1 shows the result of applying the ridges and valleys operator to the same face image using two different smoothing filters.
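The three steps above can be sketched in a few lines of numpy. This is an illustrative approximation, not the thesis implementation: the Gaussian pre-filter is replaced by a crude neighbour-averaging smoother, and a lower default threshold than the τ1 = τ2 = 1 used in the thesis is chosen so that the toy example below responds; all function and parameter names are ours.

```python
import numpy as np

def box_smooth(img, passes=3):
    """Crude neighbour-averaging stand-in for the Gaussian smoothing filter."""
    out = img.astype(float).copy()
    for _ in range(passes):
        out = 0.25 * (np.roll(out, 1, 0) + np.roll(out, -1, 0) +
                      np.roll(out, 1, 1) + np.roll(out, -1, 1))
    return out

def mlsec(img, tau1=0.5, tau2=0.5, passes=3):
    """Illustrative sketch of the MLSEC ridge/valley operator."""
    s = box_smooth(img, passes)
    gy, gx = np.gradient(s)
    mag = np.hypot(gx, gy) + 1e-12               # avoid division by zero in flat regions
    nxf, nyf = gx / mag, gy / mag                # normalized gradient vector field
    div = np.gradient(nxf, axis=1) + np.gradient(nyf, axis=0)  # divergence, bounded in [-2, 2]
    return div < -tau1, div > tau2               # ridges (negative), valleys (positive)
```

On a toy image containing a single dark horizontal line, the normalized gradients diverge away from the line, so it is detected as a valley.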

One of the interesting properties of the MLSEC operator is that it behaves well in the presence of illumination changes [Pujol et al., 2001], due to the fact that the response of the operator depends on the orientations of the normalized gradient fields rather than on their magnitudes. Besides its desirable behaviour under illumination, the relevance of valleys in face shape description has been pointed out by several works in cognitive science. Among others, [Pearson et al., 1990] hypothesize that this kind of filter could be used as an early step by the human visual system (HVS), because there exist several similarities between valley responses and the way human beings analyze faces:

1. Valley positions provide the observer with 3D information about the shape of the object being observed. Valleys of a 3D surface with uniform albedo are placed at those points where the surface normal is perpendicular to the point-observer axis and, in a similar way, ridges are placed at those points where the surface normal is collinear with the point-observer axis.

2. The response of a valley detector depicts the face in a similar way to how a human would draw it, showing the position and extent of the main facial features, as can be seen in the rightmost column of Figure 3.1.

3. The ability of the HVS to recognize faces decreases dramatically if negative images are used instead of positive ones. Valleys, as well as ridges, do not remain at the same positions when the image is negated (valleys become ridges and vice versa), and it seems clear from Figure 3.1 that it is more difficult for humans to infer identity from the ridges image than from the valleys one.


Although the last statement would seem to favour the use of valleys over ridges, the results reported in [Pujol et al., 2001] do not clearly support this hypothesis. Ridges, valleys and edges were evaluated and compared (using both Euclidean and Hausdorff-based distances) in a face recognition framework under illumination and expression changes. Both ridges and valleys clearly outperformed edges while, on average, ridges turned out to work slightly better than valleys. Following these results, we decided to focus on the use of ridges in our experiments, although some results with valleys will also be presented.

Figure 3.1: Applying the ridges and valleys detector to the same face image using two different smoothing filters. Left: original image. Center-left: valleys and ridges image. Center-right: thresholded ridges image. Right: thresholded valleys image.

3.3 Shape sampling

Once the ridges and valleys have been extracted from a new image, we must sample the obtained lines in order to keep a set of points that depicts the face. For generality and ease of notation, hereinafter we will refer to the binary image (ridges or valleys) obtained as a result of the previous step as the sketch S (i.e. the methodology introduced in this and the following sections is valid for both ridges and valleys; however, in agreement with the discussion at the end of Section 3.2 and unless otherwise stated, the presented numerical results were obtained using ridges).

In order to select a set of points from the original sketch, a dense rectangular grid (nx × ny nodes) is applied onto the face image and each grid node ~gi is moved


towards its nearest point ~pi of the sketch, i.e. ‖~gi − ~pi‖ ≤ ‖~gi − ~p‖ ∀~p ∈ S (where ~p ∈ S if S(~p) = 1). In order to avoid the case in which two or more grid nodes coincide on the same sketch point ~pi, a flag is set to 1 the first time ~pi is used, so that subsequent grid nodes must find their corresponding sketch points among the remaining ones. Finally, we get a vector of points P = {~p1, ~p2, ..., ~pn}, where n = nx × ny and ~pi ∈ R². Typical sizes for n are 100 or more nodes. These points sample the original sketch, as can be seen in Figure 3.2. Obviously, uniform sampling could also be used, but the main reason for using the "deformable" rectangular grid is that it provides a naive mapping between points coming from the same node of their respective rectangular grids, which will be used as a baseline algorithm for point matching. This will become clearer in Section 3.10, with some experimental results.
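The grid-adjustment procedure (move each node to its nearest unclaimed sketch point) can be sketched as follows. `adjust_grid` and its parameters are illustrative names, and a brute-force nearest-neighbour search is used for clarity:

```python
import numpy as np

def adjust_grid(sketch, nx=10, ny=10):
    """Move each node of a dense nx-by-ny grid to its nearest unused sketch point.
    `sketch` is a binary image (1 = ridge/valley pixel)."""
    h, w = sketch.shape
    points = np.argwhere(sketch > 0).astype(float)   # (row, col) sketch coordinates
    used = np.zeros(len(points), dtype=bool)
    rows = np.linspace(0, h - 1, ny)
    cols = np.linspace(0, w - 1, nx)
    adjusted = []
    for r in rows:
        for c in cols:
            d = np.linalg.norm(points - np.array([r, c]), axis=1)
            d[used] = np.inf                         # skip already-claimed points
            k = int(np.argmin(d))
            used[k] = True                           # flag so no two nodes share a point
            adjusted.append(points[k])
    return np.array(adjusted)                        # n = nx*ny points sampling the sketch
```

The used-flag reproduces the constraint that no two grid nodes may collapse onto the same sketch point, so the n returned points are all distinct.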

Figure 3.2: Left: original rectangular dense grid. Center: sketch. Right: grid adjusted to the sketch.

3.4 Extracting textural information

A set of 40 Gabor filters {ψm}, m = 1, 2, ..., 40, with the same configuration as in [Wiskott et al., 1997], is used to extract textural information. These filters are biologically motivated convolution kernels in the shape of plane waves restricted by a Gaussian envelope [Daugman, 1988], as shown below:

\psi_m(\vec{x}) = \frac{\|\vec{k}_m\|^2}{\sigma^2}\, \exp\!\left(-\frac{\|\vec{k}_m\|^2 \|\vec{x}\|^2}{2\sigma^2}\right) \left[\exp\!\left(i\,\vec{k}_m \cdot \vec{x}\right) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \qquad (3.1)


where ~km contains information about scale and orientation, and the same standard deviation σ = 2π is used in both directions for the Gaussian envelope.

The region surrounding a pixel in the image is encoded by the convolution of the image patch with these filters, and the set of responses is called a jet, J. A jet is thus a vector with 40 complex coefficients, and it provides information about a specific region of the image. At point ~pi = [xi, yi]^T, we get the following feature vector:

\{J_{\vec{p}_i}\}_m = \sum_{x} \sum_{y} I(x, y)\, \psi_m(x_i - x,\; y_i - y) \qquad (3.2)

where \{J_{\vec{p}_i}\}_m stands for the m-th coefficient of the feature vector extracted from ~pi. So, for a given face with a set of points P = {~p1, ~p2, ..., ~pn}, we get n Gabor jets R = {J~p1, J~p2, ..., J~pn}.
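A minimal numpy sketch of Equations (3.1) and (3.2): building the 5-scale, 8-orientation filter bank and evaluating the jet directly on a window around the point. The wave-vector magnitudes k_ν = π·2^{−(ν+2)/2} follow the Wiskott et al. parametrisation, which we assume here; the kernel size and function names are ours:

```python
import numpy as np

def gabor_kernel(scale, orientation, size=31, sigma=2 * np.pi):
    """Eq. (3.1); assumed parametrisation k = pi * 2^{-(scale+2)/2}, phi = orientation*pi/8."""
    k = np.pi * 2.0 ** (-(scale + 2) / 2.0)
    phi = orientation * np.pi / 8
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    sq = (x ** 2 + y ** 2).astype(float)
    return (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * sq / (2 * sigma ** 2)) * \
           (np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2))

def jet(image, xi, yi, size=31):
    """Eq. (3.2) on a size-by-size window around (xi, yi): 40 complex coefficients."""
    half = size // 2
    patch = image[yi - half:yi + half + 1, xi - half:xi + half + 1].astype(float)
    # psi_m(xi - x, yi - y) corresponds to summing the patch against the flipped kernel
    return np.array([np.sum(patch * gabor_kernel(s, o, size)[::-1, ::-1])
                     for s in range(5) for o in range(8)])
```

In practice the whole filter bank would be applied by FFT-based convolution and sampled at the adjusted grid points; the direct windowed sum above is just the definition made executable.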

3.5 Shape Contexts

Suppose that shape information has been extracted from two images, say F1 and F2. Let S1 and S2 be the sketches for these incoming images, and let P = {~p1, ~p2, ..., ~pn} be the set of points for S1, and Q = {~q1, ~q2, ..., ~qn} the set of points for S2. Given that we have no labels indicating which pairs of points should be matched, if we want to compare feature vectors from both images we need to obtain a function ξ that maps each point from P to a point within Q:

ξ (i) : ~pi =⇒ ~qξ(i) (3.3)

where ~pi ∈ P and ~qξ(i) ∈ Q. Hence, the feature vector from F1, J~pi, will be compared to J~qξ(i), extracted from F2.

In order to obtain the mapping ξ between sets of points, we have adopted the idea described in [Belongie et al., 2002]. For each point ~pi in the constellation, we compute a 2-D histogram h~pi (called shape context) of the relative positions of the remaining points, so that a vector of distances D = {d~pi~p1, d~pi~p2, ..., d~pi~pn} and a vector of angles ~θ = {θ~pi~p1, θ~pi~p2, ..., θ~pi~pn} are calculated for each point. Bins are uniform in log-polar space, i.e. the logarithm of distances is computed. Each pair (log(d~pi~pj), θ~pi~pj) increases the count of the corresponding histogram bin, as shown in Figure 3.3. So, finally, each face shape is depicted through a set of n 2-D histograms.

Once the sets of histograms are computed for both faces, we must match each point in the first set P with a point from the second set Q. A point ~p from P is matched to a point ~q from Q if the cost term C~p~q, defined as

C_{\vec{p}\vec{q}} = \sum_{k} \frac{\left[h_{\vec{p}}(k) - h_{\vec{q}}(k)\right]^2}{h_{\vec{p}}(k) + h_{\vec{q}}(k)} \qquad (3.4)


Figure 3.3: Log-polar histogram located over a point of the face: shape context


is minimized∗. As explained in [Belongie et al., 2002], not only distances between histograms could be considered, but also appearance-based differences. In this sense, we could introduce Gabor jet dissimilarities (Section 3.6) in the function to be minimized, but this would require more computation (n² jet comparisons). In order to decrease the burden, we could restrict the search to the neighbourhood of each point, provided that faces are in a fixed position (i.e. without rotation). In future research, we plan to assess the behaviour of the matching process when both geometrical and textural information is used, as well as the impact on performance (and computational time) caused by constraining the search to the region surrounding each point.
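The shape-context computation and the cost minimization can be sketched as below. Note two simplifications with respect to the text: the log-polar binning here is data-driven rather than the fixed bins of [Belongie et al., 2002], and a greedy assignment stands in for the optimal (e.g. Hungarian) minimization of the total cost; names are ours:

```python
import numpy as np

def shape_context(points, i, r_bins=5, theta_bins=12):
    """Log-polar histogram of the positions of all other points relative to points[i]."""
    rel = np.delete(points, i, axis=0) - points[i]
    d = np.linalg.norm(rel, axis=1)
    theta = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    logd = np.log(d / d.mean())                    # scale-normalized log distance
    r_edges = np.linspace(logd.min(), logd.max() + 1e-9, r_bins + 1)
    t_edges = np.linspace(0, 2 * np.pi, theta_bins + 1)
    h, _, _ = np.histogram2d(logd, theta, bins=[r_edges, t_edges])
    return h.ravel()

def chi2_cost(h1, h2):
    """Eq. (3.4): chi-square distance between two shape contexts."""
    denom = h1 + h2
    mask = denom > 0
    return np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask])

def match(P, Q):
    """Greedy stand-in for the cost-minimizing correspondence xi: P -> Q."""
    HP = [shape_context(P, i) for i in range(len(P))]
    HQ = [shape_context(Q, j) for j in range(len(Q))]
    cost = np.array([[chi2_cost(hp, hq) for hq in HQ] for hp in HP])
    xi, free = [], set(range(len(Q)))
    for i in range(len(P)):
        j = min(free, key=lambda j: cost[i, j])    # cheapest still-unmatched point
        xi.append(j)
        free.discard(j)
    return xi
```

Because the histograms depend only on relative positions, a translated copy of the same constellation produces zero matching cost, and the result is a one-to-one correspondence.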

3.5.1 Invariance to scaling, translation and rotation

Invariance to translation is intrinsic to the shape context definition, since all measurements are taken with respect to points over the face lines. To achieve scale invariance, we measure how big the object (face) is. One way to do this is by adding the distances between all pairs of points in the constellation, i.e. to proceed as follows:

D_{\mathrm{face}} = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \|\vec{p}_i - \vec{p}_j\| \qquad (3.5)

This distance D_face gives an idea of the size of the face, so that it can be normalized to a standard scale just by resizing the input image by a factor r = D_std / D_face, where D_std is the value of D_face for a standard face size. Also, if a more accurate estimate of the size of the face is required, an iterative process can be applied until the ratio r ∈ (1 − ε, 1 + ε) for a given threshold ε > 0.
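Equation (3.5) and the rescaling factor r can be computed directly (D_std is assumed given; the function name is ours):

```python
import numpy as np

def face_scale_factor(points, d_std):
    """Eq. (3.5): sum of pairwise distances D_face; returns r = d_std / d_face."""
    diff = points[:, None, :] - points[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    d_face = d[np.triu_indices(len(points), k=1)].sum()  # each pair counted once
    return d_std / d_face
```

Scaling the constellation by a factor s scales D_face by s and therefore divides r by s, which is what makes the normalization work.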

Furthermore, we can provide rotation invariance, as explained below. The vectors of angles ~θ = {θ~pi~p1, θ~pi~p2, ..., θ~pi~pn} are calculated taking the x-axis (the vector (1, 0)^T) as reference. This is enough if we are sure that the faces are in an upright position. But to deal with in-plane rotations, i.e. if we do not know the rotation angle of the heads, we must take a relative reference for the shape matching algorithm to perform correctly. Consider, for the set of points P = {~p1, ~p2, ..., ~pn}, the centroid of the constellation ~cP:

\vec{c}_P = \frac{1}{n} \sum_{i=1}^{n} \vec{p}_i \qquad (3.6)

For each point ~pi, we use the vector →(pi cP) = ~cP − ~pi as the positive x-axis, so that rotation invariance is achieved.

∗k in equation (3.4) runs over the number of bins in the 2D histogram


In [Belongie et al., 2002], the tangent vector at each contour point is treated as the positive x-axis to achieve rotation invariance. Since we do not work on contours, such a tangent vector is not easy to define.

Also, the angle between the two images, ϕ, can be computed as follows:

\varphi = \frac{1}{n} \sum_{i=1}^{n} \angle\left(\overrightarrow{p_i c_P},\; \overrightarrow{q_{\xi(i)} c_Q}\right) \qquad (3.7)

so that the system is able to bring both images into a common position for further comparison. If this angle is not taken into account, texture extraction will not be useful for our purposes.
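A small sketch of the centroid-based reference axes of Equation (3.6) and the angle estimate of Equation (3.7). Angle differences are wrapped to (−π, π], an implementation detail not spelled out in the text; function names are ours:

```python
import numpy as np

def centroid_angles(P):
    """Angle of each vector p_i -> centroid (Eq. 3.6), used as a per-point reference axis."""
    c = P.mean(axis=0)                          # centroid of the constellation
    v = c - P
    return np.arctan2(v[:, 1], v[:, 0])

def rotation_angle(P, Q, xi):
    """Eq. (3.7): average angle between matched centroid vectors."""
    dif = centroid_angles(Q)[xi] - centroid_angles(P)
    dif = (dif + np.pi) % (2 * np.pi) - np.pi   # wrap each difference to (-pi, pi]
    return float(dif.mean())
```

For a constellation rotated rigidly by θ, every centroid vector rotates by θ as well, so the estimate recovers the rotation angle exactly.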

3.6 Texture dissimilarity

Let R1 = {J~p1, J~p2, ..., J~pn} be the set of jets calculated for F1 and R2 = {J~q1, J~q2, ..., J~qn} the set of jets computed for F2. The similarity function between these two faces, S_J(F1, F2), is given by:

S_J = f_n\!\left(\left\{\langle J_{\vec{p}_i}, J_{\vec{q}_{\xi(i)}} \rangle\right\}_{i=1,\dots,n}\right) \qquad (3.8)

where ⟨J~pi, J~qξ(i)⟩ represents the normalized dot product between corresponding jets, taking into account that only the moduli of the jet coefficients are used. In Equation (3.8), f_n stands for a generic rule combining the n dot products. Texture dissimilarity is simply calculated as

DS_J(F_1, F_2) = 1 - S_J(F_1, F_2) \qquad (3.9)
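Equations (3.8)-(3.9) can be sketched as follows, with the normalized dot product computed on jet moduli; the median combination rule later adopted in Section 3.8 is used as the default fn, and the function names are ours:

```python
import numpy as np

def jet_similarity(j1, j2):
    """Normalized dot product of jet moduli (the inner term of Eq. 3.8)."""
    a, b = np.abs(j1), np.abs(j2)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def texture_dissimilarity(R1, R2, xi, fn=np.median):
    """Eqs. (3.8)-(3.9): combine the n per-point similarities and convert to dissimilarity."""
    sims = [jet_similarity(R1[i], R2[xi[i]]) for i in range(len(R1))]
    return 1.0 - float(fn(sims))
```

Because only moduli enter the dot product, a pure phase shift of every coefficient leaves the dissimilarity unchanged.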

3.7 Measuring dissimilarity between sets of points

We have used two different terms to measure the geometrical dissimilarity between two sets of points depicting their respective sketches:

GD_1(P, Q) = \sum_{i=1}^{n} C_{\vec{p}_i \vec{q}_{\xi(i)}} \qquad (3.10)

GD_2(P, Q) = \sum_{i=1}^{n} \left\| \overrightarrow{p_i c_P} - \overrightarrow{q_{\xi(i)} c_Q} \right\| \qquad (3.11)

Equation (3.10) computes dissimilarity by adding the n individual costs between matched points defined in (3.4). On the other hand, equation (3.11) calculates dissimilarity by summing the norms of the difference vectors between matched points.


The linear combination of these two distance measures,

\lambda_1 GD_1 + \lambda_2 GD_2 \qquad (3.12)

is what we call Sketch Distortion (SKD). Figures 3.4 and 3.5 give a visual understanding of this concept. Figure 3.4 shows two instances of face images from subject A, while the faces in Figure 3.5 belong to subject B. The visual geometric difference between the two persons is reflected in the Sketch Distortion term, whose values are shown in Table 3.1 for λ1 = λ2 = 1.

Figure 3.4: Top: Left: first image from subject A. Center: sketch. Right: grid adjusted to the sketch. Bottom: Left: second image from subject A. Center: sketch. Right: grid adjusted to the sketch.


Figure 3.5: Top: Left: first image from subject B. Center: sketch. Right: grid adjusted to the sketch. Bottom: Left: second image from subject B. Center: sketch. Right: grid adjusted to the sketch.

3.8 Combining Shape and Texture

Although shape information is somehow encoded in the jets (they have been calculated at shape-sampled points, and their corresponding jets for comparison have been found by shape matching), we also linearly combine the shape and texture scores, leading to the final dissimilarity measure:

DS (F1,F2) = λ0DSJ (F1,F2) + SKD (F1,F2) (3.13)

We have used the median operator as fn in (3.8) to fuse the n dot products, since it is more robust than the mean operator. However, if we replace the median with the mean, it immediately follows from Equations (3.8) to (3.13) that DS(F1, F2) is equal to:


Table 3.1: Sketch Distortion (SKD) between the face images from Figures 3.4 and 3.5.

                          Subject A          Subject B
                        Im 1    Im 2      Im 1    Im 2
  Subject A   Im 1        0     1851      3335    3326
              Im 2        -        0      3053    2821
  Subject B   Im 1        -        -         0    1889
              Im 2        -        -         -       0

\sum_{i=1}^{n} \left[ \lambda_0\, \frac{1 - \langle J_{\vec{p}_i}, J_{\vec{q}_{\xi(i)}} \rangle}{n} + \lambda_1 C_{\vec{p}_i \vec{q}_{\xi(i)}} + \lambda_2 \left\| \overrightarrow{p_i c_P} - \overrightarrow{q_{\xi(i)} c_Q} \right\| \right] \qquad (3.14)

Equation (3.14) shows that each jet-dissimilarity contribution is modified by a geometrical distortion term (the so-called Local Sketch Distortion, or LSKD). A high LSKD value for the pair (~pi, ~qξ(i)) means that there exist local differences between the matched points, so that jet dissimilarity will also be high. This is more likely to occur when the incoming faces do not represent the same person. Even if LSKD is low but the faces do not belong to the same person, textural information will increase the dissimilarity between them. On the other hand, when faces belong to the same subject, low LSKD values should generally be achieved, so that matched points are located over the same face region, resulting in low jet dissimilarity. Thus, the measurement in (3.14) reinforces discrimination between subjects.
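With fn = mean, the per-point decomposition of Equation (3.14) is a plain weighted sum. The sketch below assumes the per-point similarities, χ² costs and geometric distances have already been computed; the function name is ours:

```python
def combined_dissimilarity(jet_sims, chi2_costs, geo_dists,
                           lam0=1.0, lam1=1.0, lam2=1.0):
    """Eq. (3.14) with fn = mean: each jet term is reinforced by its local sketch distortion."""
    n = len(jet_sims)
    return sum(lam0 * (1.0 - s) / n + lam1 * c + lam2 * g
               for s, c, g in zip(jet_sims, chi2_costs, geo_dists))
```

Setting lam1 = lam2 = 0 recovers the texture-only score (λ = 0 in the experiment below), while identical faces (all similarities 1, zero distortion) score 0.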

As a preliminary result, Figure 3.6 shows the performance of the system over a subset of the XM2VTS database [Messer et al., 1999] using λ0 = 1 and λ1 = λ2 = λ. In this figure, the Total Error Rate (TER)† on the evaluation and test sets is plotted against λ. λ = 0 corresponds to the case in which only textural information is taken into account. As we can see, there is a range of values for which the TER is below that obtained using texture information alone, and the value of λ which minimizes the TER on the evaluation set (λopt) also minimizes the TER on the test set.

3.9 Testing the system against lighting and expression variations

3.9.1 Database

In order to test the behaviour of the system in the presence of illumination and expression changes, we used the AR face database [Martínez and Benavente, 1998]

†The Total Error Rate is defined as the sum of the False Acceptance and False Rejection Rates (FAR and FRR), which are common measures to assess the performance of biometric systems.



Figure 3.6: TER (Evaluation and Test sets) against λ

(see Appendix A, Section A.1, for a more detailed description of the database). Each subject in the database participated in two recording sessions separated by two weeks. For our experiments, we considered the images from 106 subjects (half men and half women) showing different facial expressions and illumination changes. Figure 3.7 presents the shots taken for one subject of the database during the first and second sessions (top and bottom rows respectively). In the top row, from left to right, the first four images present facial expression changes: a) Neutral, b) Smile, c) Anger and d) Scream, while the last three shots were taken under different lighting conditions with neutral expression: e) Left light on, f) Right light on, and g) Both lights on. Analogously, the bottom row presents the same configuration for the images recorded during the second session, h) to n). Following [Phillips et al., 2000b], we distinguish between gallery and probe images. The gallery contains images of known individuals, which are used to build templates, and the probe set contains


images of subjects of unknown identity, which must be compared against the gallery. A closed-universe model is used to assess system performance, meaning that every subject in the probe set is also present in the gallery.


Figure 3.7: Face images from the AR face database. The top row shows images from the first session: a) Neutral, b) Smile, c) Anger, d) Scream, e) Left light on, f) Right light on, and g) Both lights on, while the bottom row presents the shots recorded during the second session: h)-n).

3.9.2 Facial expression changes

In this experiment, we assess the recognition accuracy of the system when only a neutral face is available as gallery and the probe images show expression variations. Figure 3.8 presents the cumulative match score for rank N (the percentage of successful identifications of a subject within the first N), when the neutral shot a) is used as gallery and shots b), c) and d) are presented to the system as probes. Clearly, images with smiling and angry expressions are correctly recognized, but the algorithm fails to identify screaming faces. The same behaviour is observed with the images from the second session, i.e. h) as gallery and shots i), j) and k) as probes. Averaging the results from both tests, the rank-1 recognition rates are 92%, 99% and 37% for smiling, angry and screaming faces respectively. Angry faces are easier than smiling ones, due to the fact that the appearance variation when changing from neutral to anger is smaller than that from neutral to smile (see the second row of Figure 3.7). Regarding screaming faces, it is clear (bottom row of Figure 3.9) that the appearance variation is very large, and hence identification is difficult when only neutral faces are used as gallery (Gabor jets will differ significantly even if they are extracted at exactly corresponding positions). No significant differences are obtained when valleys are used instead of ridges (average rank-1 recognition rates of 93%, 99% and 35%). Ridges and valleys are image-based descriptors that sketch the face shape and, accordingly, the obtained representations depend on the current emotion shown in the


image. We would like to highlight that, although the position and shape of the lines obviously vary with expression, these lines keep representing the main facial features in a consistent manner (compare the two rows of Figure 3.9).

In [Gao and Leung, 2002], results were reported under the same conditions using the Line Edge Map (LEM). That approach achieves 78.57%, 92.86% and 31.25% for smiling, angry and screaming expressions respectively. Clearly, our method outperforms LEM in all three cases, although the degradation suffered with screaming faces is similar for both approaches. In order to give less weight to those regions that are more affected by expression changes, [Martínez, 2003] proposed using the optical flow between the two images to be compared. The best reported results were approximately 96% for smiling, 84% for angry and 70% for screaming faces. The optical flow-based technique outperforms ours significantly when screaming faces are tested. However, our approach is comparable to [Martínez, 2003] when testing the smiling expression, and clearly provides better performance with angry faces. We would like to highlight that obtaining an expression-invariant face recognition system was not the goal of this research. Nevertheless, it has been demonstrated that our method behaves reasonably well (to a certain extent) in the presence of expression changes. As a future research line, we could apply an idea similar to that of [Martínez, 2003] (weighting the different jet contributions according to the deformation provoked by expression variations) in order to improve performance with screaming faces.

3.9.3 Illumination variation

In this experiment, we assess the performance of the system under different illumination conditions. The neutral face with diffuse light is used as gallery, while the probe images are shots taken under lighting changes. Figure 3.11 presents the cumulative match score for rank N when shot a) is used as gallery and shots e), f) and g) are presented to the system as probes. Similar behaviour is observed with images from the second session, i.e. shot h) as gallery and shots l) (100% recognition rate), m) (100% recognition rate) and n) (93% recognition rate) as probes. The results using valleys are even slightly better (100%, 100% and 96% recognition rates). Clearly, performance only drops slightly when both lights are switched on. This shows that the system can be affected by extreme lighting conditions such as overillumination, as this can provoke apparent changes in face shape (see Figure 3.10).

In [Gao and Leung, 2002], results were also reported under the same lightingconditions using LEM. This approach achieves 92.86%, 91.07% and 74.11% with leftlight on, right light on and both lights on respectively. In all cases, SDGJ (both ridgesand valleys) performs better than LEM.


84 Chapter 3. Shape-driven Gabor Jets

[Figure 3.8 plots the cumulative match score against rank for the three expression probes: Smile (b), Anger (c) and Scream (d).]

Figure 3.8: System performance with expression variations. Gallery: shot a) (neutral face from first session). Probe: shots b), c) and d). Clearly, the system only fails to recognize screaming faces.

3.10 Face Authentication on the XM2VTS database

We tested our method using the XM2VTS [Messer et al., 1999] database on both configurations I and II of the Lausanne protocol [Luttin and Maître, 1998] (see Appendix A -Section A.4- for a description of both database and protocol).

3.10.1 Comparison with EBGM

As discussed in the introduction, the Elastic Bunch Graph Matching algorithm looksfor a set of pre-defined points such as the pupils, the corners of the mouth, etc. fromwhere jets are extracted. In order to compare our approach with an implementationof the EBGM, we decided to use a set of manually annotated landmarks‡, so that

‡Available at http://www-prima.inrialpes.fr/FGnet/data/07-XM2VTS/xm2vts-markup.html


Figure 3.9: Top row: ridges and valleys for the neutral expression. Bottom row:ridges and valleys for the screaming expression. Although the position and shape ofthe sketch lines obviously vary with expression, these lines keep representing the mainfacial features in a consistent manner.

we can assess the performance of an “ideal” EBGM (without the effect of fiducial point search errors). Strictly speaking, since no graph matching stage is used we cannot talk of EBGM, but of a manually annotated face-like mesh. However, a perfect graph matching step would output those manual positions, and hence we refer to this approach as an “ideal” EBGM. Although only 68 points were marked on each face, after Delaunay tessellation the middle points of some of the edges connecting the original vertices were also included in the final set, as shown in Figure 3.12. In EBGM, the correspondences between points are known, so there is no need to match vertices from the faces to be compared. However, in order to show that the shape context algorithm works properly for our purposes, a comparison between shape-matched jets (extracted at manual positions) was also included in the tests, namely EBGM-SC (Elastic Bunch Graph Matching with Shape Contexts). The first two rows of Table 3.2 show the results obtained with EBGM and EBGM-SC. As we can see, both approaches perform almost identically over configurations I and II, which confirms that shape context matching is an adequate choice. In [Wiskott et al., 1997], no grid distortion was taken into account and, in order to perform a fair comparison, we tested our approach without the Sketch Distortion term, i.e. λ0 = 1, λ1 = λ2 = 0 (third row of Table 3.2). Although there are no statistically significant differences between SDGJ-SC and EBGM, it is clear that our approach


Figure 3.10: Top row: ridges and valleys for the neutral expression with diffuse light. Bottom row: ridges and valleys for the neutral expression when both lights are switched on. Although the obtained sketch is not completely invariant to lighting changes (for instance, some valleys from the nose region -top row, purple- disappear in the presence of strong lighting, valleys associated with “wrinkles” appear -bottom row, blue- and some ridges change -top and bottom rows, red-), the reported results (see text) demonstrate that the system achieves robust behaviour under the tested conditions.

achieves comparable (or even slightly better) performance without the need for manual localization of “fiducial points”.

As explained in Section 3.3, the original rectangular grid is deformed towards the sketch S1 and thus node (a, b) is displaced to position ~pi; the same occurs with S2, in which node (a, b) moves towards ~qi. We decided to use a shape-matching algorithm to map points but, in fact, there exists a naive mapping between ~pi and ~qi based on spatial coherence, since both of them come from node (a, b) of their respective rectangular grids. This inherent mapping was used in one experiment (SDGJ-NM, SDGJ with Naive Matching), whose results are presented in the last row of Table 3.2. As shown, the performance with the naive mapping is much worse, reflecting again the importance of the shape-matching algorithm.

Moreover, as a baseline algorithm, we assessed the performance of a rigid rectangular grid comprising 130 nodes (10×13, see Figure 3.13). Manual annotations of the eyes and mouth were used to place the grid onto the face image, whilst the positions of the remaining points were automatically computed using constant distances between nodes. The Total Error Rates obtained using this rectangular grid are TER=8.53%


[Figure 3.11 plots the cumulative match score against rank for the three lighting probes: Right (e), Left (f) and Both (g).]

Figure 3.11: System performance under lighting variations. Gallery: shot a). Probe: shots e), f) and g).

and TER=5.01% for configurations I and II respectively, demonstrating that shape-driven positions seem to extract better features for subject discrimination.

Finally, we ran experiments using valleys instead of ridges. In both configurations,the performance with valleys was worse than that of ridges: Total Error Rates of8.26% and 5.14% in configurations I and II respectively.

3.10.2 Measuring GD1 and GD2 performance

Up to now, shape distortion has not been taken into account in the experiments. First of all, we consider the different components of the Sketch Distortion term on their own, i.e. as individual classifiers. As explained above, shape contexts have been used to match points from the manually annotated set in the EBGM-SC algorithm, showing adequate behaviour according to the achieved results. As a byproduct of this matching, the GD1 and GD2 measures are obtained for the set of pre-defined fiducial


Figure 3.12: Set of points used for jet extraction in the EBGM approach. Bluetriangles represent manually annotated vertices, whilst red dots represent the middlepoint connecting manual vertices.

Figure 3.13: Rectangular rigid grid


Table 3.2: Face Authentication on the XM2VTS database. False Acceptance Rate(FAR), False Rejection Rate (FRR) and Total Error Rate (TER) over the test set forour Shape-driven approach (without sketch distortion) and the EBGM algorithm.

              Conf. I                       Conf. II
              FAR(%)  FRR(%)  TER(%)        FAR(%)  FRR(%)  TER(%)
EBGM          2.93    4.25    7.18          1.42    3.25    4.67
EBGM-SC       3.16    4.00    7.16          1.50    3.50    5.00
SDGJ-SC       3.13    3.50    6.63          1.32    3.00    4.32
SDGJ-NM       4.01    6.25    10.26         1.55    5.00    6.55

points. These shape distortions are also computed for the SDGJ-SC approach. FromTable 3.3, which presents the classification performance for each of these measures,we can highlight:

1. None of them is a strong classifier, since the error rates are high.

2. GD1 outperforms GD2 in both configurations for both the EBGM-SC and SDGJ-SC algorithms.

3. According to these measures, the distribution of points obtained from the SDGJapproach is much more discriminative than the set of manually localized fiducialpoints.

Although the grid distortion defined in [Wiskott et al., 1997] is slightly differentfrom the GD2 measure introduced here, statement 3) is in agreement with the fact that[Wiskott et al., 1997] does not consider these grid distortions when comparing twofaces. In fact, as a final experiment, the classification performance of grid distortionsgiven by:

GD_{EBGM}(F_1, F_2) = \sum_{i=1}^{N_E} \frac{\left\| \vec{e}_i^{F_1} - \vec{e}_i^{F_2} \right\|^2}{\left\| \vec{e}_i^{F_1} \right\|^2}    (3.15)

i.e. as defined in [Wiskott et al., 1997], was measured. N_E stands for the number of edges connecting manual vertices (see Figure 3.12), \vec{e}_i^{F_1} is the i-th vector edge in face F_1, and likewise for \vec{e}_i^{F_2}. The TER obtained was above 70%, confirming once again that the distribution of “universal” fiducial points is not discriminative at all (at least, according to the measures we have tested).
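For concreteness, Equation (3.15) can be sketched in code as below (a minimal illustration; the edge vectors of each face are assumed to be given as (N_E, 2) arrays, and the function name is ours):

```python
import numpy as np

def gd_ebgm(edges_f1, edges_f2):
    """Grid distortion as in Eq. (3.15): sum over edges of
    ||e_i^F1 - e_i^F2||^2 / ||e_i^F1||^2."""
    e1 = np.asarray(edges_f1, dtype=float)  # shape (N_E, 2): edge vectors of face F1
    e2 = np.asarray(edges_f2, dtype=float)  # shape (N_E, 2): edge vectors of face F2
    num = np.sum((e1 - e2) ** 2, axis=1)    # squared length of each edge difference
    den = np.sum(e1 ** 2, axis=1)           # squared length of each F1 edge
    return float(np.sum(num / den))
```

Identical meshes give a distortion of exactly zero; the measure grows with the relative displacement of each edge.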

The original shape context matching algorithm proposed in [Belongie et al., 2002]used an iterative procedure to obtain the final set of correspondences: at each iter-ation, one of the meshes was deformed by means of Thin Plate Splines [Bookstein,


Table 3.3: Face Authentication on the XM2VTS database. False Acceptance Rate(FAR), False Rejection Rate (FRR) and Total Error Rate (TER) over the test set forGD1 and GD2 computed from EBGM-SC and SDGJ-SC.

                  Conf. I                       Conf. II
                  FAR(%)  FRR(%)  TER(%)        FAR(%)  FRR(%)  TER(%)
GD1 (EBGM-SC)     25.18   25.00   50.18         17.18   26.25   43.43
GD2 (EBGM-SC)     27.52   27.75   55.27         25.75   29.25   55.00
GD1 (SDGJ-SC)     9.43    17.50   26.93         8.24    17.00   25.24
GD2 (SDGJ-SC)     15.09   21.75   36.84         11.49   24.50   35.99

1989] to match the other as closely as possible. In order to reduce the computational burden, our implementation used just one iteration of the algorithm without mesh deformation. However, authentication results using only shape improve drastically when applying the original formulation of [Belongie et al., 2002]. Using the GD1 measure with 3 iterations of the algorithm, the Total Error Rate is significantly reduced from 26.93% to 10.26% in configuration I, a performance which is comparable to that of some texture-based algorithms tested on the XM2VTS database (see [Matas et al., 2000, Messer et al., 2003, Messer et al., 2006] and the first three rows of Table 3.4). This provides further evidence that the combination of shape information via ridges/valleys and shape context matching is a good choice for capturing discriminative information.

3.10.3 Shape and texture combination results

As discussed in a previous section, a linear combination of shape and texture scores was used in the final dissimilarity function. In order to select an adequate value for \vec{\lambda} = [\lambda_0, \lambda_1, \lambda_2]^T, we fixed \lambda_0 = 1 and performed a grid-search on (\lambda_1, \lambda_2), preserving the values that minimize the TER on the evaluation set. These optimal values were then used in the test phase, achieving a Total Error Rate of 5.99% and 4.06% in configurations I and II respectively.
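The weight selection described above can be sketched as follows; `ter_on_evaluation` is a hypothetical callback that evaluates the system on the evaluation set and returns its TER for a given weight vector:

```python
def select_lambdas(ter_on_evaluation, grid):
    """Fix lambda0 = 1 and pick the (lambda1, lambda2) pair from the grid
    that minimises the evaluation-set TER."""
    best = None
    best_ter = float("inf")
    for lam1 in grid:
        for lam2 in grid:
            ter = ter_on_evaluation(1.0, lam1, lam2)
            if ter < best_ter:
                best_ter, best = ter, (1.0, lam1, lam2)
    return best, best_ter
```

The selected weights are then frozen and applied unchanged on the test set, so the test-set TER remains an unbiased estimate.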

3.10.4 Results from other researchers

Three public face competitions have been organized on the XM2VTS database, in years 2000 [Matas et al., 2000], 2003 [Messer et al., 2003] and 2006 [Messer et al., 2006]. The first three rows of Table 3.4 show results achieved with approaches that share some algorithmic similarity with ours and with the methods discussed in the introduction:


• The Aristotle University of Thessaloniki -AUT(2000)- tested the morphological-based EGM algorithm [Kotropoulos et al., 2000], competing in year 2000.

• The IDIAP institute -IDIAP(2000)- implemented the system described in [Ducet al., 1999] (rectangular grid attributed with Gabor features), also taking partin the contest held in year 2000.

• The Tubitak Bilten University -TB(2003)- entered the competition of year 2003testing an implementation of EBGM.

In [Zafeiriou et al., 2005], the authors exploit discriminant information in a modified morphological EGM, achieving clear improvements over the raw approach (TER=12.9% in configuration I). Several steps of discriminant analysis were tested (only with configuration I):

1. Node weighting: TER=10.7%.

2. Similarity measure (textural and geometrical information) weighting: TER=9.2%.

3. Weighting morphological feature coefficients: TER=5.7%.

4. All discriminative steps: TER=2.8%.

Table 3.4 also shows some of the best results on the XM2VTS: an LDA-based approach (UniS-NC (2003)) and a complex ensemble of learning classifiers based on the manipulation of Gabor features (CAS (2006)). The former took part in the competition held in 2003 [Messer et al., 2003], whilst the latter participated in the most recent contest [Messer et al., 2006] (year 2006).

3.10.5 Accuracy-based Feature Selection (AFS)

It is clear that if no discriminative steps are applied to the approach of [Zafeiriou et al.,2005], our method outperforms it significantly (TER=6.63% against TER=10.7%).However, the inclusion of all discriminative stages in [Zafeiriou et al., 2005] leads tomuch better performance (TER=2.8%) than SDGJ.

Although shape-driven points have been proven to be more discriminative than universal landmarks, and quite robust to illumination and expression changes, we must bear in mind that the location of these positions relies on an image-based operator (ridges and valleys), which could be affected by inexact face localization or image noise. In such cases, some of the selected positions are likely to fall outside the face region (neck, hair, etc.), while others could lie on “noisy” ridges and/or valleys. Moreover, it is well known that not all face regions have equal discriminatory power [Duc et al., 1999, Tefas et al., 2001, Zafeiriou et al., 2005], and


Figure 3.14: Left: Original set of shape-driven points for client 003 of the XM2VTSdatabase. Right: Set of preserved shape-driven points after accuracy-based selection(Section 3.10.5)

we should take these facts into account to improve our results. In order to discard noisy/non-discriminative locations, keeping only the best positions, we propose a simple client-specific technique for the selection of such locations (a deeper presentation and empirical comparison will be given in Chapter 4): by measuring the accuracy of each Gabor jet (considered as an individual classifier), we only preserve those with a Total Error Rate below a threshold in the evaluation phase (see Figure 3.14 for an example). Hence, it is a hard weighting function (selected or not) based on the individual classification accuracy of each jet. In this case, the similarity between a test image (with jets \mathcal{J}_{\vec{q}_i}) claiming to be identity C (whose jets are \mathcal{J}_{\vec{p}_i}) is given by:

S_J = f_{n_C} \left\{ w_{\vec{p}_i} \left\langle \mathcal{J}_{\vec{p}_i}, \mathcal{J}_{\vec{q}_{\xi(i)}} \right\rangle \right\}_{i=1,\ldots,n}    (3.16)

where n_C represents the number of selected locations for client C. The weight w_{\vec{p}_i} is equal to 1 if the corresponding jet from client C (\mathcal{J}_{\vec{p}_i}) was selected, and 0 otherwise. A TER=2.52%§ was achieved in configuration I of the XM2VTS, thus outperforming (although not significantly) the results of [Zafeiriou et al., 2005].
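The hard selection behind Equation (3.16) can be sketched as below (an illustration under the assumption that each client jet's evaluation-phase TER is available; dropping a jet is equivalent to giving it weight 0 in the fused similarity, and all names are ours):

```python
import numpy as np

def select_jets(jet_ters, threshold):
    """Binary weights w_i: 1 if the jet's individual TER (measured on the
    evaluation set) is below the threshold, 0 otherwise."""
    return (np.asarray(jet_ters) < threshold).astype(int)

def similarity(weights, local_sims, fuse=np.median):
    """Eq. (3.16) with hard weights: fuse only the local similarities of
    the n_C selected jets of client C."""
    w = np.asarray(weights, dtype=bool)
    sims = np.asarray(local_sims, dtype=float)
    return float(fuse(sims[w]))
```

Since the selection is client-specific, each client model carries its own weight vector and its own n_C.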

However, if we observe the last two rows of Table 3.4, it is clear that our method does not provide the best performance on the database. As highlighted throughout the chapter, the main novelty of this approach is conceptual in the sense that it proposes

§Faces were previously pose-corrected using NFPW (see Chapter 2)


a different way (exploiting individual face structure) to look for discriminative points in face images, and it represents, to the best of our knowledge, a first attempt in this direction. We are confident that there is still room for improvement, as demonstrated by the fact that a simple discriminant analysis clearly reduced the gap with the CAS algorithm (2.52% against 0.96%), and further research is needed in order to decrease error rates (e.g. by choosing better metrics to compare features, or through selection of the most discriminative jet coefficients).

Table 3.4: Results from other researchers on the XM2VTS database.

                 Configuration I   Configuration II
                 TER(%)            TER(%)
IDIAP (2000)     16.6              15.0
AUT (2000)       14.2              9.7
TB (2003)        11.36             7.72
UniS-NC (2003)   1.48              0.75
CAS (2006)       0.96              0.51

3.11 Face Authentication on the BANCA database

We have also used the English part of the BANCA database [Bailly-Bailliere et al., 2003] (see Section A.2 for a detailed description of this database) on protocols Matched Controlled (MC) and Pooled (P) to test our method. The subjects in this database were captured in three different scenarios: controlled, degraded and adverse, over 12 different sessions spanning three months. Examples of images from these three conditions are shown in Figure 3.15. Given that the images were extracted from video sequences in which the subjects were asked to utter a few sentences, expression changes (especially mouth motion) can appear. Moreover, in the degraded and adverse scenarios there are no constraints on lighting, distance to the camera, etc., and the resolution of the degraded images is clearly worse than that of the two remaining scenarios.

In the experiments carried out, three specific operating conditions, corresponding to three different values of the Cost Ratio R = FAR/FRR, namely R = 0.1, R = 1 and R = 10, have been considered. The so-called Weighted Error Rate (WER), given by:

WER(R) = \frac{FRR + R \cdot FAR}{1 + R}    (3.17)

was calculated for the test data of groups G1 and G2 at the three different values of R.
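Equation (3.17) translates directly into code (function name ours; FRR and FAR are expressed in the same units, e.g. percent):

```python
def wer(frr, far, r):
    """Weighted Error Rate at cost ratio R, as in Eq. (3.17)."""
    return (frr + r * far) / (1.0 + r)
```

Note that R < 1 emphasises the FRR, R > 1 emphasises the FAR, and R = 1 reduces to the plain average of the two rates.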


Both protocols (MC and P) use the same data to build client models (controlled images from session 1), but differ significantly in the set of images used for testing, since MC only uses controlled images, whilst P also tests adverse and degraded faces. These facts make protocol P much more challenging than protocol MC. The experiments were carried out employing the pre-registered images of 55 × 51 pixels used for the competition contests at ICBA 2004 [Messer et al., 2004c] and ICPR 2004 [Messer et al., 2004b]. Results from five methods which entered these two competitions are given in Table 3.5 (protocol MC was used at ICBA 2004 and configuration P was employed for the contest at ICPR 2004). Table 3.6 shows the performance of the algorithm when λ1 = λ2 = 0. Taking the SKD term into account yielded a small improvement in average WER: 4.42% and 10.43% for protocols MC and P respectively.

From the comparison between these results and Table 3.5, we can highlight:

• Our system does not provide the best authentication rates over this database,but

• It shows the smallest degradation in performance when changing from protocolMC to protocol P, as the average WER is only 2.36 times worse. For instance,both IDIAP approaches worked better than ours on configuration MC, but ouralgorithm outperformed them working on protocol P.

To provide baseline results for the two contests mentioned above (ICPR and ICBA), a set of algorithms developed by Colorado State University (CSU) [CSU, 2003] was tested. The best results obtained with the implementation of the EBGM approach belonging to this set are much worse (average WERs of 8.79% and 14.21% for protocols MC and P respectively) than those obtained with our approach, even if only texture is taken into account, which indicates that SDGJ selects more discriminative locations for face authentication.

Figure 3.15: Examples of images from the controlled, degraded and adverse conditionsof the BANCA database.


Table 3.5: Results reported on the BANCA database from other researchers.

Method        Protocol  Av. WER
IDIAP-HMM     MC        3.53
IDIAP-HMM     P         12.93
IDIAP-Fusion  MC        2.70
IDIAP-Fusion  P         11.22
UCL-LDA       MC        3.50
UCL-LDA       P         10.08
UCL-Fusion    MC        1.95
UCL-Fusion    P         7.89
UniSurrey     MC        2.99
UniSurrey     P         7.99

Table 3.6: Our results on the BANCA database on configurations MC and P with λ0 = 1 and λ1 = λ2 = 0.

           R=0.1           R=1             R=10            Av.
Protocol   G1     G2       G1     G2       G1     G2       WER
MC         4.23   3.22     11.03  4.68     4.28   1.89     4.89
P          7.73   8.60     18.95  16.47    7.39   6.24     10.90

The same set of system parameters (Gabor filter frequencies, for instance) was used for the experiments on the XM2VTS and BANCA databases, despite the difference in the resolution of the tested images (≈ 150 × 115 pixels in XM2VTS and 55 × 51 pixels in BANCA). The performance on BANCA is expected to improve when using higher resolution images. In fact, with 150 × 115 pixel images, the average WER (Av. WER) obtained through the combination of shape and texture was 9.47% for protocol P.

3.12 Distance Measures for Gabor Jets Comparison

As stated in the introduction, the selection of the specific distance (or similarity) function used to compare Gabor jets has received very little attention in the literature. Although most Gabor-based approaches have used the cosine distance to compare corresponding features, this choice has not been motivated by either a theoretical or an experimental evaluation. To the best of our knowledge, the only evaluation of distances for Gabor jet comparison was performed in [Jiao et al., 2002], where the authors concluded that the Manhattan (or city block) distance outperformed both the cosine


and Euclidean distances. However, it is not explicitly described, either in [Jiao et al., 2002] or in other research papers dealing with Gabor jet-based face recognition systems, whether jets have been previously normalized or not. We propose a more extensive evaluation, comparing seven different distances for measuring similarities between Gabor jets, as well as assessing the impact of the specific normalization method applied to jets before comparison. Moreover, three different resolutions of the input images are tested in order to provide a more complete set of results.

3.12.1 Distance between faces

Let R_1 = \{\mathcal{J}_{\vec{p}_1}, \mathcal{J}_{\vec{p}_2}, \ldots, \mathcal{J}_{\vec{p}_n}\} be the set of jets in F_1 and R_2 = \{\mathcal{J}_{\vec{q}_1}, \mathcal{J}_{\vec{q}_2}, \ldots, \mathcal{J}_{\vec{q}_n}\} the set of jets extracted from F_2. Before computing distances, each jet \mathcal{J} is processed as follows:

1. Each complex coefficient is replaced by its modulus, obtaining J'.

2. The obtained vector can then be normalized (to have unit L1 or L2 norm, for instance) or left unnormalized. Although some of the distances introduced next, such as the cosine distance, are invariant to these normalizations, others are not, and the specific type of normalization applied to jets could be a critical point. Here, three possibilities (no normalization, L1 normalization and L2 normalization) will be evaluated. Hence, given a vector J' comprising the moduli of jet coefficients, we divide it by a normalization factor α given by:

   • No normalization: \alpha = 1.

   • L1 normalization: \alpha = \sum_i |J'_i|.

   • L2 normalization: \alpha = \sqrt{\sum_i (J'_i)^2}.
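The pre-processing just described (modulus followed by an optional L1 or L2 normalization) can be sketched as (function name ours):

```python
import numpy as np

def preprocess_jet(jet, norm=None):
    """Replace complex coefficients by their moduli, then divide by the
    chosen normalization factor alpha (None, 'l1' or 'l2')."""
    j = np.abs(np.asarray(jet))          # moduli of the complex coefficients
    if norm is None:
        alpha = 1.0                      # no normalization
    elif norm == "l1":
        alpha = np.sum(j)                # alpha = sum_i |J'_i|
    elif norm == "l2":
        alpha = np.sqrt(np.sum(j ** 2))  # alpha = sqrt(sum_i (J'_i)^2)
    else:
        raise ValueError("unknown normalization: %r" % norm)
    return j / alpha
```

For a jet of 40 complex coefficients (5 frequencies × 8 orientations, as used throughout the chapter), the output is a real 40-dimensional vector.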

We will denote the resulting vector by \vec{J} (\vec{J} = J'/\alpha) and, for the sake of simplicity, we will keep calling it a jet. The distance function between the two faces, D_{S_J}(F_1, F_2), is given by:

D_{S_J}(F_1, F_2) = f_n \left\{ D\left( \vec{J}_{\vec{p}_i}, \vec{J}_{\vec{q}_{\xi(i)}} \right) \right\}    (3.18)

where D(\vec{J}_{\vec{p}_i}, \vec{J}_{\vec{q}_{\xi(i)}}) represents the distance used to compare corresponding jets, and f_n\{\ldots\} stands for a generic combination rule of the n local distances. In the EBGM approach [Wiskott et al., 1997] and other Gabor-based face recognition systems, a normalized dot product is used to compare jets. In this work, we assess the performance of the system varying D(\ldots), i.e. we compare the following distances:


1. Cosine distance (negated normalized dot product, as used in [Wiskott et al., 1997]):

   D(X, Y) = -\cos(X, Y) = \frac{-\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}    (3.19)

2. Manhattan distance (L1 metric or city block distance):

   D(X, Y) = L_1(X, Y) = \sum_{i=1}^{n} |x_i - y_i|    (3.20)

3. Squared Euclidean distance (sum of squared errors, SSE):

   D(X, Y) = SSE(X, Y) = \sum_{i=1}^{n} (x_i - y_i)^2    (3.21)

4. Chi-square distance:

   D(X, Y) = \chi^2(X, Y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}    (3.22)

5. Modified Manhattan distance:

   D(X, Y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} |x_i| \sum_{i=1}^{n} |y_i|}    (3.23)

6. Correlation-based distance:

   D(X, Y) = -\frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left( n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 \right) \left( n \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2 \right)}}    (3.24)

7. Canberra distance:

   D(X, Y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}    (3.25)
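Expressed as code, the seven distances read as follows (a minimal sketch over pre-processed jet vectors; function names are ours):

```python
import numpy as np

def cosine_d(x, y):       # Eq. (3.19): negated normalized dot product
    return -np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))

def manhattan(x, y):      # Eq. (3.20): L1 / city block
    return np.sum(np.abs(x - y))

def sse(x, y):            # Eq. (3.21): squared Euclidean
    return np.sum((x - y) ** 2)

def chi_square(x, y):     # Eq. (3.22)
    return np.sum((x - y) ** 2 / (x + y))

def modified_manhattan(x, y):  # Eq. (3.23)
    return np.sum(np.abs(x - y)) / (np.sum(np.abs(x)) * np.sum(np.abs(y)))

def correlation_d(x, y):  # Eq. (3.24): negated Pearson correlation
    n = len(x)
    num = n * np.dot(x, y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.dot(x, x) - np.sum(x) ** 2)
                  * (n * np.dot(y, y) - np.sum(y) ** 2))
    return -num / den

def canberra(x, y):       # Eq. (3.25)
    return np.sum(np.abs(x - y) / (np.abs(x) + np.abs(y)))
```

Note that `cosine_d` and `correlation_d` are unchanged when either argument is rescaled, which is exactly the invariance to α discussed next.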


In the definitions of all the presented distances (Equations (3.19) to (3.25)), n stands for the length of the vector, i.e. n = 40. It is easy to see that both the cosine and correlation-based distances are invariant to α, i.e. the type of normalization (L1, L2 or no normalization at all) applied to jets does not change the result. It is also straightforward to see that the Modified Manhattan distance is equivalent to the Manhattan distance when jets are normalized to have unit L1 norm.

3.12.2 Results on BANCA’s MC protocol

In order to provide a more complete set of results, we performed experiments using images at three different resolutions (55 × 51, 150 × 115 and 220 × 200 pixels). We used the median rule to fuse the n local distances, i.e. f_n\{\ldots\} ≡ median. Moreover, in protocol MC there are 5 training images to build the client model. Whenever a test image claims a given identity, the test face is compared to each of the 5 training images. Hence, we get 5 scores which are combined (again using the median rule) to obtain the final score used for authentication. Tables 3.7, 3.8 and 3.9 show the results obtained when changing the normalization factor (α) applied to jets (no normalization, L1 normalization and L2 normalization respectively). If no normalization is applied to jets, the best performing distance is cosine. The remaining choices achieve significantly worse results at all resolutions (except the Canberra distance, with similar performance). In the EBGM approach [Wiskott et al., 1997], the authors did not apply any normalization to jets (at least, they did not state it explicitly), and these results may support their choice of the cosine distance for jet comparison. However, the use of L1 and L2 normalization factors (Tables 3.8 and 3.9) leads to completely different conclusions. Cosine is outperformed by other distances such as SSE or the Modified Manhattan distance (MMD). If we compare the results obtained, for instance, with the MMD varying the normalization factor, we see that impressive improvements are obtained with the use of L1 and L2 normalization factors (WER decreases from 12.72% to 3.34–2.90% using 220 × 200 pixel images). Hence, we conclude that the concrete type of normalization applied to jets is, in fact, a critical point. In [Jiao et al., 2002], the authors observed in identification experiments that the Manhattan distance outperformed cosine. According to our results, Manhattan outperforms cosine when L1 normalization is used. Although the authors of [Jiao et al., 2002] do not describe whether they normalized their jets or not, we have obtained results supporting their finding. As stated previously, both the cosine and correlation-based distances are invariant to the tested normalization factors, and this is reflected in the obtained results. It is also interesting to note that, in general terms, error rates decrease (or stay approximately equal) as the resolution of the input images grows. A clear exception occurs when testing the Manhattan, SSE, χ² and Modified Manhattan distances without normalization. Further research is needed in order to better understand this behavior.
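The two-level median fusion just described (median over the n local distances per image pair, then median over the scores against the 5 training images) can be sketched as (function name ours):

```python
import numpy as np

def claim_score(test_jets, training_sets, xi_per_gallery, d):
    """MC-protocol scoring sketch: compare the test face against each
    training image of the claimed client and fuse with the median rule."""
    scores = []
    for jets, xi in zip(training_sets, xi_per_gallery):
        # Median of the n local jet distances for this gallery image.
        local = [d(test_jets[i], jets[xi[i]]) for i in range(len(test_jets))]
        scores.append(np.median(local))
    # Median over the per-gallery-image scores (5 in protocol MC).
    return float(np.median(scores))
```

The final score is then thresholded to accept or reject the identity claim.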


Table 3.7: Average WER (%) using several distance measures D(\vec{J}_{\vec{p}_i}, \vec{J}_{\vec{q}_{\xi(i)}}) to compare jets and different resolutions of input images (jets are not normalized).

                     Input Image Resolution
Distance             55×51    150×115   220×200
Cosine               4.89     4.64      3.40
Manhattan            8.54     10.56     13.16
SSE                  8.45     9.45      12.60
χ²                   6.53     7.53      10.28
Modified Manhattan   8.18     9.71      12.72
Correlation          6.70     7.01      5.45
Canberra             6.10     5.05      5.11

Table 3.8: Average WER (%) using several distance measures D(\vec{J}_{\vec{p}_i}, \vec{J}_{\vec{q}_{\xi(i)}}) to compare jets and different resolutions of input images (jets are normalized to have unit L1 norm).

                     Input Image Resolution
Distance             55×51    150×115   220×200
Cosine               4.89     4.64      3.40
Manhattan            5.28     3.73      2.90
SSE                  4.49     3.53      3.17
χ²                   5.33     4.03      3.01
Modified Manhattan   5.28     3.73      2.90
Correlation          6.70     7.01      5.45
Canberra             5.26     3.84      3.63


Table 3.9: Average WER (%) using several distance measures D(\vec{J}_{\vec{p}_i}, \vec{J}_{\vec{q}_{\xi(i)}}) to compare jets and different resolutions of input images (jets are normalized to have unit L2 norm).

                     Input Image Resolution
Distance             55×51    150×115   220×200
Cosine               4.89     4.64      3.40
Manhattan            5.93     6.20      4.68
SSE                  5.14     4.61      3.42
χ²                   5.60     5.29      3.80
Modified Manhattan   4.75     3.10      3.34
Correlation          6.70     7.01      5.45
Canberra             5.07     3.69      3.88

3.13 Conclusions and further research

The main novelty of this approach is somewhat conceptual, since it proposes an alternative way to select key points in face images. Our ultimate goal should be to exploit individual face structure so that the system focuses on subject-specific discriminative points/regions rather than on universal landmarks. In this sense, the choices of the particular face shape descriptor, point matching algorithm and feature extraction method are (although critical) just implementation issues. Biological reasons [Pearson et al., 1990], as well as their better behaviour than edges [Pujol et al., 2001], motivated the use of ridges and valleys for face shape description. Analogously, the selection of Gabor filters for feature extraction was inspired both by biological reasons [Daugman, 1980, Daugman, 1988] and by their wide use in face recognition [Wiskott et al., 1997, Duc et al., 1999, Smeraldi and Bigun, 2002, Liu, 2004]. Finally, we chose shape context matching because it has proven to be a robust descriptor, performing accurately in shape-based object recognition/retrieval tasks [Belongie et al., 2002]. The combination of these techniques is also novel in the field of face recognition. Briefly, the algorithm can be summarized as follows:

• Facial structure is exploited through the use of a ridges and valleys detector, so that n points are automatically sampled from lines depicting the subject's face.

• At each of these shape-driven positions, a set of Gabor filters is applied, obtaining n Gabor jets which provide the textural information of the face.

• Given two images and their respective sets of points, shape context matching is used to determine which pair of jets should be compared, obtaining, at the same time, two geometrical measures between faces, whose linear combination forms the Sketch Distortion term.

Further experiments should be conducted in order to assess the performance of the matching process when Gabor jet dissimilarities are taken into account along with histogram distances, as well as the impact (both on performance and computational time) of constraining the search to the region surrounding each point.

Experimental results on the AR face database demonstrate that, although our system has not been particularly designed to cope with expression changes, it behaves reasonably well in the presence of such variations. In order to improve the results with large expression changes (i.e. screaming faces), we plan to apply a function that weighs the different facial regions according to the deformation caused by a given expression. Moreover, tests under different lighting conditions confirm the good performance of the system with illumination changes.

We have demonstrated empirically that the distribution of shape-driven points is much more discriminative than the distribution of fiducial points as used in [Wiskott et al., 1997]. Experimental results on the XM2VTS database show that our approach performs marginally better than an ideal EBGM without the need to localize "universal" fiducial points. It has also been demonstrated that a simple linear combination of texture and shape scores improves the performance of the system (compared to the texture-only method), although this improvement is not always significant.

The comparison with other raw (i.e. without discriminant analysis) EGM methods reveals that our system achieves lower error rates. The application of a simple (hard) feature selection stage produced clear improvements in performance (≈ 61%), achieving better results than the morphological EGM with several steps of discriminative analysis. However, as a future research line, we plan to study which jet coefficients are the most discriminative, as well as to select appropriate soft local weights for the shape-driven features. The results achieved on the BANCA database also confirm that changing from an "easy" protocol to a more challenging configuration causes less degradation in performance than with other methods, and that our approach clearly outperforms an implementation of the EBGM algorithm on this database.

Regarding the empirical evaluation of distance measures, the SDGJ algorithm was tested on the BANCA database with 3 input image resolutions, 3 distinct normalization factors and 7 distance measures to compare jets. It has been shown that:

• The performance of a given distance strongly depends on the concrete preprocessing applied to the jets.

• When no normalization (α = 1) is used, the cosine distance outperforms the remaining ones. Although the authors of [Wiskott et al., 1997] did not explicitly state whether they normalized their jets or not, this result would support their choice.


• The use of L1 and L2 normalization factors leads to completely different results, revealing that other distances, such as SSE or MMD, achieve better performance than the cosine measure.

Although we have shown that there exist better choices than the cosine distance for Gabor jet comparison, no theoretical reasons supporting this fact have been provided. As we will see in Chapter 5, Gabor coefficients can be accurately modeled using Generalized Gaussian Distributions (GGDs), and this finding opens new possibilities in terms of selecting optimal ways to compare jets from a theoretical point of view. In addition, a more extensive evaluation of the tested distances on different databases and with diverse Gabor-based systems (EBGM, SDGJ, etc.) is needed.


Chapter 4

Gabor Jets Similarity Fusion

Contents

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 103

4.2 Mesh Configurations . . . . . . . . . . . . . . . . . . . . . 104

4.3 Jet Similarity Fusion . . . . . . . . . . . . . . . . . . . . . 105

4.3.1 Accuracy-based Feature Selection (AFS) and Best Individual Features (BIF) . . . . . . . . . . . . . . . . . . . . . . 106

4.3.2 Sequential Floating Forward Search (SFFS) . . . . . . . . . 107

4.3.3 LDA-based fusion . . . . . . . . . . . . . . . . . . . . . . . 108

4.3.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . 109

4.3.5 Adaboosted MLP ensemble . . . . . . . . . . . . . . . . . . 110

4.4 Database and Experimental results . . . . . . . . . . . . . 111

4.4.1 Database and Experimental Setup . . . . . . . . . . . . . . 111

4.4.2 Evaluating AFS and BIF approaches . . . . . . . . . . . . . 111

4.4.3 Evaluating SFFS . . . . . . . . . . . . . . . . . . . . . . . . 114

4.4.4 Adaboosted MLP ensemble, SVMs and LDA-based . . . . . 116

4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

4.1 Introduction

In the context of Gabor jet-based face recognition, as well as in any other local feature-based approach, not all the jets extracted from a set of facial points (using any mesh configuration, such as rectangular [Duc et al., 1999], face-like [Wiskott et al., 1997] or shape-driven (Chapter 3)) contribute equally in terms of discrimination ability, and this fact has been validated by several researchers. The works in [Duc et al., 1999] and [Gokberk et al., 2003] focus on the selection of the best locations for feature extraction according to different criteria: linear discriminant analysis at each of the grid nodes was used in [Duc et al., 1999], while several subset selection schemes such as Best Individual Features (BIF), Sequential Forward Selection (SFS) or Sequential Forward Floating Selection (SFFS) were evaluated in [Gokberk et al., 2003]. Apart from the referenced methods, different state-of-the-art pattern recognition tools have been widely applied in the field of face recognition to improve classification accuracy: Support Vector Machines (SVMs) were used in [Tefas et al., 2001, Smeraldi and Bigun, 2002, Czyk et al., 2004, Heisele et al., 2001], MLPs in [Er et al., 2002a, Argones Rua et al., 2006], Linear Discriminant Analysis (LDA) in [Zafeiriou et al., 2005] and Adaboosting in [Shen and Bai, 2006b, Yang, 2004, Shan et al., 2005].

The goal in this chapter is to compare the accuracy of several state-of-the-art techniques for intramodal fusion of jet similarities, namely an Adaboosted ensemble of MLPs, a variant of the Linear Discriminant Analysis proposed in [Argones Rua et al., 2006], SVMs, SFFS, BIF and an approach closely related to the latter, proposed in [Gonzalez-Jimenez and Alba-Castro, 2006] (Accuracy-based Feature Selection, AFS).

The chapter is organized as follows: Section 4.2 briefly reviews the different configurations adopted for point distribution: the shape-driven mesh, the face-like representation, and the rectangular grid. The pattern recognition tools used to select the most discriminative locations are introduced in Section 4.3. Section 4.4 describes the experimental setup (Section 4.4.1) and presents the obtained results. Finally, conclusions are drawn in Section 4.5.

4.2 Mesh Configurations

Chapter 3 already introduced three different techniques for choosing the set of points on the face image from which features are extracted, namely the proposed shape-driven method, a face-like mesh (similar to the one used in the Elastic Bunch Graph Matching algorithm [Wiskott et al., 1997]), and a rigid rectangular mesh based on the position of eyes and mouth. The comparison of these algorithms was reported in Section 3.10∗, demonstrating the superior performance of shape-driven positions. In this chapter, as already stated, we explore the fusion of Gabor similarities, not restricting ourselves to a specific mesh but evaluating the performance of the different combination tools (whenever applicable) on the three mesh configurations.

∗Table 4.1 re-displays the obtained results of Section 3.10


Independently of the specific mesh adopted (shape-driven, face-like or rectangular), the jet J~pi extracted at point ~pi from the training image Itrain will be compared to J~qξ(i), computed on Itest (obviously, the mapping function ξ depends on the specific mesh configuration: it is straightforward for face-like and rectangular meshes, since ξ(i) = i; for the shape-driven case, see Chapter 3). The similarity xi between corresponding jets J~pi and J~qξ(i) is given by their normalized dot product, taking into account that the magnitudes of the jet coefficients are used, i.e. phase information is discarded. Hence xi = <J~pi, J~qξ(i)>, where <·,·> stands for the normalized dot product. The final similarity score S between two images is given by a combination function fn of the n local similarities

S = fn {x1, x2, . . . , xn}    (4.1)

In Section 3.12 several distance measures were compared for Gabor jet matching, showing that the cosine distance (i.e. normalized dot product) may not be the most accurate. However, given that this approach has been widely used in the literature, and that this chapter focuses on the methods for combining local similarities, we preferred to keep the normalized dot product. Further research may include combining the fusion methods introduced in this chapter with the distances of Section 3.12.
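As a minimal sketch of the matching score used throughout this chapter, assuming jets are stored as plain lists of coefficient magnitudes (the function names and data layout are our own; the median is the baseline choice of fn used later in this chapter):

```python
import math

def jet_similarity(jet_a, jet_b):
    # x_i = <J_pi, J_q_xi(i)>: normalized dot product of jet magnitudes
    # (phase information already discarded)
    num = sum(a * b for a, b in zip(jet_a, jet_b))
    den = math.sqrt(sum(a * a for a in jet_a)) * math.sqrt(sum(b * b for b in jet_b))
    return num / den

def face_similarity(jets_train, jets_test):
    # S = f_n{x_1, ..., x_n} with f_n = median (eq. 4.1)
    xs = sorted(jet_similarity(ja, jb) for ja, jb in zip(jets_train, jets_test))
    n = len(xs)
    return xs[n // 2] if n % 2 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])
```

The median makes the fused score robust to a minority of badly matched points, which is precisely the motivation for the more elaborate fusion rules studied next.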

4.3 Jet Similarity Fusion

Once Gabor features have been extracted from the points located with any of the mesh configurations mentioned in Section 4.2, only the most discriminative ones should be preserved and combined, so that noisy and useless jets are discarded. The goal is to select a fusion rule fn that combines the n local similarities xi in order to obtain a more robust similarity measure between faces. This section introduces the different fusion methods that were tested:

• The so-called Accuracy-based Feature Selection approach (AFS, already intro-duced in Section 3.10.5), closely related to Best Individual Feature Selection(BIF) used for instance in [Gokberk et al., 2003].

• Sequential Floating Forward Search (SFFS) [Pudil et al., 1994, Gokberk et al.,2003].

• The Linear Discriminant Analysis proposed in [Argones Rua et al., 2006].

• Support Vector Machines (SVM) [Tefas et al., 2001, Heisele et al., 2001].

• Adaboosting of MLPs, using as inputs for the neural networks both the wholeset of similarities as well as the subset selected by LDA.


4.3.1 Accuracy-based Feature Selection (AFS) and Best Individual Features (BIF)

The accuracy-based feature selection (AFS) technique has already been briefly introduced in Chapter 3. This simple technique was devised to preserve the most discriminative shape-driven jets for a given client. The main characteristic of the shape-driven method is that it searches for locations on the image that depend on the individual face structure and hence, there may not exist an exact correspondence across different subjects. For this reason, pattern recognition techniques such as Linear Discriminant Analysis (LDA) or Support Vector Machines (SVMs) cannot be applied to find the most important user-independent locations, since these tools need a given point to be located in the same facial region for every subject†. The main idea behind AFS is to select, in a client-specific fashion, the most discriminative features according to their classification accuracy. The problem can be formulated as follows: given a training image for client C, say Itrain, a set of images belonging to the same client {Icj} and a set of impostor images {Iimj}, we want to find which subset P ⊂ Ptrain is the most discriminative. Since each point ~pi from Ptrain has a corresponding position ~qξ(i) in every other image (client or impostor, say Itest), we measure the individual classification accuracy of its associated jet J~pi, and select those locations with a Total Error Rate (TER, defined in Section 4.4.1) on the evaluation set below a threshold τ. Although this simple idea was conceived for the shape-driven mesh approach (see Section 3.10.5), it can also be applied to any local feature-based algorithm. Section 4.4.2 will show the results of applying the AFS method to the shape-driven, face-like and rectangular meshes, which have already been tested in Chapter 3.

The AFS algorithm is closely related to the Best Individual Feature (BIF) selection approach, which has been used by other researchers [Gokberk et al., 2003] in Gabor-based face recognition. The idea behind BIF (as its name reads) is to select the best individual features according to some criterion (e.g. individual classification accuracy). Results using BIF will also be presented in Section 4.4.2 for the shape-driven, face-like and rectangular meshes. Both AFS and BIF have the drawback that each feature is considered in isolation, and hence it may turn out that two of the selected features are highly correlated in terms of their ability to discriminate between faces. In such a case, the inclusion of these two highly correlated features may not give much better performance than either of them alone. In order to overcome this problem, more complex selection techniques have to be considered, such as the SFFS algorithm, introduced in the next section.

†They could still be applied to select the most important locations in a user-dependent fashion, but this would require much more training data for each client.
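The selection logic of AFS and BIF is simple enough to sketch; the function names and list-based data layout below are our own illustration:

```python
def afs_select(point_ters, tau):
    # AFS: keep the indices of the jets whose individual TER (%) on the
    # evaluation set falls below the threshold tau
    return [i for i, ter in enumerate(point_ters) if ter < tau]

def bif_select(point_scores, k):
    # BIF: keep the K features with the best individual criterion score
    # (e.g. 100 - TER, so that higher is better), each judged in isolation
    order = sorted(range(len(point_scores)), key=lambda i: point_scores[i], reverse=True)
    return sorted(order[:k])
```

Note that neither function looks at pairs of features, which is exactly the correlation blind spot discussed above.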


4.3.2 Sequential Floating Forward Search (SFFS)

The Sequential Floating Forward Search approach [Pudil et al., 1994] is a non-exhaustive deterministic sequential search method. This algorithm has proved its efficiency in Gabor kernel location selection [Gokberk et al., 2003] and image classification [Jain and Zongker, 1997]. In [Gokberk et al., 2003], several feature selection algorithms were evaluated on two Gabor mesh-based verification schemes, and SFFS performed best among all the tested suboptimal deterministic search methods.

Two different experiments have been designed using SFFS. The first is a user-template-specific feature selection, where a different discriminant feature set is provided for each user template. The second uses user-independent (or global) feature selection. In this experiment, the feature set is selected according to its ability to discriminate between true and false identity claims, taking into account all the identities in the database. Even though user-template-specific feature selection is probably the best solution (discriminant points can be distinguished in all the templates), the scarcity of user images in the evaluation set can lead to an overfitted solution.

Let X be the selected set of similarities when comparing two images. The image similarity is given by Y = median {X}. The SFFS criterion function J provides a measure of the classification accuracy for the selected similarities between the template and target image. Even though an immediate measure could be directly derived from a traditional performance measure such as the TER, i.e. J(X) = 1 − TER, this is not a good criterion function, since perfect classification is possible in the evaluation dataset for feature sets far from the optimal general solution.

The criterion function adopted for the global and user-template-specific cases cannot be the same, since in the template-specific approach only a few true claims are available in the evaluation set for every user template. The separation between true and false identity claims is, in the user-template-specific case, evaluated by means of the difference between the smallest true-claim similarity and the highest false-claim similarity:

J(X) = m1 − M0    (4.2)

where

m1 = min_{X ∈ Ev. Set, C=1} Y    (4.3)

M0 = max_{X ∈ Ev. Set, C=0} Y    (4.4)

In the global case, many true identity claims and false claims are available, and more robust statistics can be used as criterion functions for the SFFS. The adopted solution is:

J(X) = y^1_2% − y^0_98%,    (4.5)

where P(Yglobal < y^1_2% | C = 1) = 0.02 and P(Yglobal < y^0_98% | C = 0) = 0.98 in the evaluation set. This measure evaluates the separation between the distributions of the similarities for true and false identity claims.

Both global and user-specific criterion functions are suitable for SFFS, since they can keep growing even after perfect classification accuracy is reached in the evaluation set.

In our experiments, both the template-specific and global approaches run with user-specific thresholding, using the median as the similarity fusion function fn.
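The two criterion functions can be sketched as follows; the nearest-rank percentile used here is one of several conventions and is chosen only for illustration, as are the function names:

```python
def user_specific_criterion(true_sims, false_sims):
    # J(X) = m1 - M0 (eq. 4.2): smallest true-claim similarity minus
    # largest false-claim similarity in the evaluation set
    return min(true_sims) - max(false_sims)

def _percentile(values, p):
    # nearest-rank percentile on a sorted copy (one of several conventions)
    s = sorted(values)
    return s[min(len(s) - 1, max(0, round(p * (len(s) - 1))))]

def global_criterion(true_sims, false_sims):
    # J(X) = y^1_2% - y^0_98% (eq. 4.5): 2nd percentile of the true-claim
    # scores minus the 98th percentile of the impostor scores
    return _percentile(true_sims, 0.02) - _percentile(false_sims, 0.98)
```

Both functions keep increasing as the two score distributions separate further, even after the evaluation set is already perfectly classified, which is what makes them usable as SFFS criteria.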

4.3.3 LDA-based fusion

An LDA-based fusion technique was presented in [Argones Rua et al., 2006]. This technique shows that, in a two-class problem, LDA provides a way to judge which features are least useful from the point of view of class separation. To explain this point, let us look at the LDA solution in more detail. Let X = [x1, . . . , x130] denote our vector of Gabor jet similarities. Clearly, the xi are not independent since, ideally, all similarity values should be high for a true identity claim and low in the case of an impostor claim. However, it is not unreasonable to assume that xi is class-conditionally independent of xj for all i ≠ j, with i, j ∈ {1, . . . , 130}. This relatively strong assumption clearly simplifies the structure of the within-class scatter matrix, as we will see next.

Let µi,0 = E{xi | C = 0} be the class-conditional mean of the i-th component when X comes from a false identity claim, and let µi,1 = E{xi | C = 1} be the class-conditional mean of the i-th component when X comes from a true identity claim. Let µC be the vector of mean similarities for class C, i.e. µC = [µ1,C, µ2,C, . . . , µ130,C]^T, C ∈ {0, 1}. Furthermore, let σ²i,0 = E{(xi − µi,0)² | C = 0} and σ²i,1 = E{(xi − µi,1)² | C = 1} denote the variances of the similarity scores, and let ci = (σ²i,0 + σ²i,1)/2. As the xi represent similarities, and the greater the similarity the higher the value of xi, we can assume that µi,1 > µi,0 for all i ∈ {1, . . . , 130}.

LDA finds a one-dimensional subspace in which the separability of true clients and impostors is maximized. The solution is defined in terms of the within-class and between-class scatter matrices (Sw and Sb respectively), which are given by:

Sw = diag(c1, c2, . . . , c130)    (4.6)

Sb = (µ1 − µ0)(µ1 − µ0)^T    (4.7)

It should be noted that Sw is not usually diagonal, but the class-conditional independence assumption adopted here leads to such a solution. Now the LDA subspace is defined by the solution to the eigenvalue problem

Sw^{-1} Sb v − λv = 0    (4.8)


In our face verification case, equation (4.8) has only one non-zero eigenvalue λ, and the corresponding eigenvector defines the LDA subspace. It is easy to show that the eigenvector v is given by

v = Sw^{-1} (µ1 − µ0)    (4.9)

Recall that we have assumed that all the components of the difference of the two mean vectors are non-negative. Then, from equations (4.9) and (4.6), it follows that the components of the LDA vector v are non-negative when class-conditional independence holds. In general, if a component is non-positive, it means that the actual training data is such that

• the observations do not satisfy the axiomatic properties of similarities, or

• the component has a strong negative correlation with some other components in the feature vector, so it is most likely encoding random redundant information emerging from sampling problems, rather than genuine discriminative information. Reflecting this information in the learnt solution does help to get better performance on the evaluation set, where it is used as a dissimilarity; however, this does not extend to the test set.

LDA is not an obvious choice for feature selection, but in the two-class case of combining similarity evidence it appears that the method offers an instrument for identifying dimensions which have an undesirable effect on fusion. By eliminating every feature with a negative projection coefficient, we obtain a lower-dimensional LDA projection vector with all projection coefficients positive. This projection vector does not use many of the original similarity features, and therefore performs the role of an LDA-based feature selection algorithm.
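Under the diagonal-Sw assumption, equation (4.9) and the pruning of negative coefficients reduce to a few lines. This is an illustrative sketch (function names are ours) in which the per-feature class-conditional means and variances are assumed to have been estimated beforehand from evaluation-set similarities:

```python
def lda_fusion_weights(mu1, mu0, var1, var0, eps=1e-12):
    # v = Sw^{-1}(mu1 - mu0) with Sw = diag(c_i), c_i = (var_i0 + var_i1)/2
    # (eqs. 4.6 and 4.9); features with non-positive coefficients are
    # dropped by forcing their weight to 0
    v = []
    for m1, m0, s1, s0 in zip(mu1, mu0, var1, var0):
        w = (m1 - m0) / (0.5 * (s0 + s1) + eps)
        v.append(w if w > 0 else 0.0)
    return v

def fused_score(x, v):
    # projection of a similarity vector onto the pruned LDA direction
    return sum(xi * wi for xi, wi in zip(x, v))
```

A feature whose true-claim mean does not exceed its false-claim mean gets weight zero, which is exactly the LDA-based feature selection described above.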

4.3.4 Support Vector Machines

SVMs are learning machines that non-linearly map their n-dimensional input space into a high-dimensional feature space by means of a non-linear kernel function [Vapnik, 2000], and they have previously been used for face recognition tasks [Tefas et al., 2001, Smeraldi and Bigun, 2002, Heisele et al., 2001, Czyk et al., 2004]. In this high-dimensional feature space, a linear classifier is constructed by minimizing the structural risk over a training set. Two results make this approach successful:

• The generalization ability of this learning machine depends on the Vapnik-Chervonenkis (VC) dimension of the set of functions that the machine implements, rather than on the dimensionality of the space. A function that describes the data well and belongs to a set with low VC dimension will generalize correctly regardless of the dimensionality of the space.


Figure 4.1: MLP architecture chosen for the experiments.

• The construction of the classifier only needs to evaluate an inner product between two vectors of the training data. An explicit mapping into the high-dimensional feature space is not necessary. In Hilbert spaces, inner products have simple kernel representations and can therefore be easily evaluated.

In our experiments we have used the SVMLight implementation described in [Joachims, 2002]. We tried several kinds of kernel functions, and finally chose a Gaussian kernel with parameter σ = 0.5 to perform the experiments. Due to the unbalanced number of positive and negative examples, we chose to split the slack-variable cost factor into C+ and C−, taking C+/C− = 100 as defined in [Morik et al., 1999].

4.3.5 Adaboosted MLP ensemble

Boosting is a classifier-building technique that produces a very accurate prediction rule by combining rough and moderately inaccurate classifiers. It has been applied to face recognition in previous research such as [Lu et al., 2006]. The adaptive boosting algorithm (Adaboost) [Freund and Schapire, 1995] has also been used for face recognition in [Shen et al., 2005, Yang, 2004, Shan et al., 2005]. Adaboost creates weak classifiers with adapted training sets that are finally combined into a strong classifier by weighing the successive weak classifiers.

We use a multilayer perceptron with 130 inputs and 3 neurons in its only hidden layer (as shown in Figure 4.1) as the weak classifier. The error back-propagation training algorithm is used to train every MLP, with a slight modification: every error from a positive sample is overweighted in order to compensate for the unbalanced number of client and impostor attempts in the dataset. The criterion to select the final MLP ensemble relies on cross-validation: choosing the smallest ensemble with the best total error rate on the XM2VTS evaluation set (see Section 4.4.1 for the database description).
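For illustration only, the Adaboost reweighting loop can be sketched with one-feature decision stumps standing in for the 130-input MLP weak learners used here; the positive-sample overweighting and the cross-validated ensemble-size selection are omitted, and all names are our own:

```python
import math

def train_stump(X, y, w):
    # best single-feature threshold classifier under sample weights w; y in {-1, +1}
    best = (0, 0.0, 1, float("inf"))  # (feature, threshold, polarity, weighted error)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
        for t in thresholds:
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[f] > t else -pol) != yi)
                if err < best[3]:
                    best = (f, t, pol, err)
    return best

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        f, t, pol, err = train_stump(X, y, w)
        err = max(err, 1e-12)
        if err >= 0.5:          # weak learner no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((f, t, pol, alpha))
        # re-weight: boost the misclassified samples for the next round
        w = [wi * math.exp(-alpha * yi * (pol if row[f] > t else -pol))
             for row, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    # strong classifier: alpha-weighted vote of the weak classifiers
    score = sum(alpha * (pol if x[f] > t else -pol) for f, t, pol, alpha in ensemble)
    return 1 if score >= 0 else -1
```

The alpha weights are exactly the "weighing of the successive weak classifiers" mentioned above; swapping the stump for a small MLP changes only the weak-learner training step.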


4.4 Database and Experimental results

4.4.1 Database and Experimental Setup

All experiments have been carried out using the XM2VTS database on both configurations I and II of the Lausanne protocol [Luttin and Maître, 1998] (see Section A.4 for details on both the database and the protocols). The performance of the different methods will be given in terms of Total Error Rate (TER) on the test set, defined as TER = FAR + FRR. However, as stated in previous chapters, TER measures are not enough to determine whether two methods are statistically significantly different or not. For this purpose, once again, we will use the method of [Bengio and Mariethoz, 2004] to compute confidence intervals around Half Total Error Rate (HTER = TER/2) measures, assessing whether there exist statistically significant differences between two approaches (see Appendix B).
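As a minimal reminder of the metric (a hypothetical helper, not code from the thesis):

```python
def total_error_rate(false_accepts, impostor_claims, false_rejects, client_claims):
    # TER = FAR + FRR, both expressed in %; HTER = TER / 2
    far = 100.0 * false_accepts / impostor_claims
    frr = 100.0 * false_rejects / client_claims
    return far + frr
```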

Table 4.1: Baseline results obtained when fn ≡ median (already shown in Section 3.10.1) for the rectangular, face-like and shape-driven approaches.

                 Configuration I   Configuration II
                 TER (%)           TER (%)
Shape-driven     6.63              4.32
Face-like        7.18              4.67
Rectangular      8.53              5.01

4.4.2 Evaluating AFS and BIF approaches

In this second experiment, we investigate the performance of the simple feature selection techniques AFS and BIF. Both methods were applied to select locations from the shape-driven, face-like and rectangular meshes, using the individual classification accuracy of each point as the selection criterion (we will refer to it as criterion A). We would like to emphasize that the best locations are selected in a template-based fashion, i.e. for each template of a given client, its best positions are chosen, and it may occur that different templates from the same client have different selected locations. The median rule [Kittler et al., 1998] (i.e. fn ≡ median) was used to combine the preserved similarities in the three approaches. Also, the final 3 scores (configuration I) and 4 scores (configuration II) were fused once again using the median rule, leading to the final score ready for verification. Clearly, the threshold τ that is used to preserve the most important features in AFS‡ (and analogously, the number K of selected features in BIF) can be varied, and the result of sweeping τ and K is shown in Figures 4.2 and 4.3, respectively.

‡A threshold τ = 50 means that only those jets that achieve a Total Error Rate on the evaluation set below 50% are preserved.

Figure 4.2: Accuracy-based feature selection (AFS): effect of sweeping τ(%) on system performance (TER(%) measures are reported) for the shape-driven, manual face-like mesh and rectangular grid methods in configurations I (left) and II (right).

From these figures we can highlight:

1. Although both BIF and AFS are very simple feature selection techniques, there is a general improvement in performance (over the original results from Table 4.1) for all methods and in both configurations, which means that noisy and/or non-discriminative features are discarded through the use of such approaches.

2. AFS (Figure 4.2). In configuration I, the highest improvements are achieved for values of τ ≤ 60%. In configuration II, the value of τ does not affect system performance very much. It must be noted, however, that the best results are obtained for low values of τ using the manual face-like mesh. Although there exist differences in performance between approaches, these are hardly significant for most values of τ, especially in configuration II. In general terms, the method that performs best is the manual face-like mesh.

3. BIF (Figure 4.3). In configuration I, the best results are obtained with a number of features between 20 and 50. We should pay special attention to K = 25 for the face-like mesh, since it achieves very competitive authentication rates (in fact, there is a range of values of K around 25 for which the total error rate is below 3%). In configuration II, and analogously to AFS, for K > 20 features, performance does not depend very much on the specific value of K.

4. Although when working with the whole set of similarities the shape-driven mesh achieves slightly better performance than the manual face-like approach, it is clear that when applying AFS and BIF, results are better using the face-like configuration: the accurate localization of points seems to favour accurate fusion results.

Figure 4.3: Best Individual Features (BIF) with criterion A: effect of sweeping the number K of best selected features on system performance (TER(%) measures are reported) for the shape-driven, manual face-like mesh and rectangular grid methods in configurations I (left) and II (right).

The criterion we used for selection is not unique and can obviously be modified. A priori, criterion A has one drawback: imagine there exist L locations with identical classification accuracy. In such a case, how do we select the best K points (K < L)? To this end, we must constrain the selection criterion, considering not only the classification accuracy of a given location, but also (for instance) the separation between client and impostor similarities that the location achieves (the larger the distance between similarities of clients and impostors, the better the location). Let us call this criterion B. This criterion was also applied to select the best K locations using BIF. Overall, the behaviour is very similar to the one obtained with criterion A (see Figure 4.4 for an example), and the main difference between the results achieved with both criteria is that for K = 1, criterion B provides better authentication rates than criterion A, since it actually selects the location that best separates client and impostor attempts. Apart from this case (K = 1), we conclude that both criteria are not significantly different when applied to BIF.

With AFS and BIF, hard weights are given to the original locations (i.e. a selected location is given a weight w_i = 1, whilst a discarded location is given a weight w_i = 0). It seems more reasonable to weigh the selected locations according to their discriminative power (measured by the 1-TER obtained in evaluation). However, the obtained results do not validate this hypothesis, probably because accurate weights cannot be estimated with so few training client samples, and it is better to work conservatively and weigh all selected locations equally.

114 Chapter 4. Gabor Jets Similarity Fusion


Figure 4.4: BIF with two different criteria for feature selection. Criterion A: Classification accuracy, and Criterion B: Separation between client and impostor similarities. Effect of sweeping K on system performance (TER(%) measures are reported) for the shape-driven mesh in configurations I (left) and II (right).

In order to estimate good values for τ (or K) in a real system (and in a subject/template-dependent fashion), a set of training images from each enrolled subject, as well as a set of images corresponding to other people, are needed. The final value of τ (or K) is chosen to be the one that provides the best authentication rates on this development set. In case only one image per person is available, this selection should be done following a global (not person-dependent) scheme.
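The selection procedure just described reduces to an argmin over candidate values on the development set. A minimal sketch (the candidate values and per-K development TERs are invented for illustration):

```python
def select_k(candidates, ter_on_dev):
    # Pick the K (or tau) with the lowest development-set TER.
    # `ter_on_dev` maps a candidate value to its TER(%) -- here a stub.
    return min(candidates, key=ter_on_dev)

# Hypothetical development-set TER(%) per candidate K for one client:
dev_ter = {10: 6.2, 25: 2.4, 50: 3.1, 80: 4.0}
best_k = select_k(dev_ter.keys(), lambda k: dev_ter[k])
print(best_k)  # 25
```

In a client-dependent scheme, this loop would run once per enrolled subject; with a single image per person, one global `dev_ter` table would be shared by all clients.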

4.4.3 Evaluating SFFS

As discussed in Section 4.3.2, a criterion based on classification accuracy such as criterion A is not adequate for SFFS. This fact is demonstrated in Figure 4.5, which plots the performance of SFFS using both criteria A and B. As we can see, when K is small, performance is much worse with criterion A than with criterion B, and it only becomes similar for high values of K (this figure was obtained using the face-like mesh in configurations I and II, and similar curves are obtained for the remaining point configuration approaches).

The performance of SFFS with criterion B is plotted in Figure 4.6 for the shape-driven, face-like and rectangular meshes. By comparing these performances with those of Figure 4.3, we realize that SFFS is outperformed by BIF. One would expect better performance from a method such as SFFS that actually takes interaction between features into account. Moreover, SFFS has empirically shown better performance than BIF in [Gokberk et al., 2003]. So, how do we explain the obtained results? To answer this question, we should recall that up to now, locations have been selected in a template-based fashion (i.e. for each template of a given client, we select the best



Figure 4.5: SFFS with two different criteria for feature selection. Criterion A: Classification accuracy, and Criterion B: Separation between client and impostor similarities. Effect of sweeping K on system performance (TER(%) measures are reported) for the shape-driven mesh in configurations I (left) and II (right).

subset of locations), and the problem lies in the scarcity of data (few training images are available for a given client). SFFS is not able to correctly estimate the existing interactions between features, and therefore does not achieve good generalization on the test set, whilst the less complex BIF performs quite well in such conditions. This is just another indication that complex techniques do not necessarily provide the best performance, but depend strongly on the specific scenario we are considering.

In order to demonstrate that the scarcity of data is the reason why SFFS performs worse than BIF, we considered the problem of selecting the best locations from a global point of view (i.e. independently of the claimed client). In such conditions, much more data are available for SFFS to estimate interactions and, as demonstrated in Figure 4.7, its performance is superior to that of the "global" BIF, in agreement with the conclusions presented in [Gokberk et al., 2003]. The experiment was carried out using the face-like mesh in configuration I, but similar curves are obtained with the other meshes in both configurations I and II. From this result, we can conclude that, if enough data were available for each client, SFFS would be preferable to BIF in a client-dependent selection scheme. Finally, we compare the performances of the client-dependent and client-independent (global) schemes:

1. Global SFFS (Figure 4.7, dots) outperforms client-dependent SFFS (Figure 4.6, face-like mesh) for all values of K. Unless enough training data are available for each of the clients, SFFS is only suitable for global feature selection.

2. Global BIF (Figure 4.7, dashed line and circles) is outperformed by client-dependent BIF (black line with circles in Figure 4.4) for all values of K. Hence, BIF is suitable for choosing the best locations in a client-dependent fashion,



Figure 4.6: SFFS with criterion B: effect of sweeping the number K of selected features on system performance (TER(%) measures are reported) for the shape-driven, manual face-like mesh and rectangular grid methods in configurations I (left) and II (right).

even if a large number of training examples is not available for each client.

3. Client-dependent BIF outperforms the global SFFS. BIF compensates for its simplicity with the ability to select client-dependent locations, which is probably more desirable for discrimination between individuals.

4.4.4 Adaboosted MLP ensemble, SVMs and LDA-based fusion

We have already seen that the scarcity of data leads to poor performance when SFFS selects the most discriminative user-dependent (template-dependent, indeed) locations, and similar conclusions can be drawn for the methods that will be evaluated in this section (Adaboosted MLP ensemble, SVMs, and LDA-based). However, we can still try to use these tools in order to select the most important locations in a user-independent scheme. For such a problem, the shape-driven configuration is not recommended, since the locations depend on individual face structure and hence may not be positioned over the same face area for all users, as discussed in Section 4.4.2. However, the above mentioned techniques can be applied to both rectangular and face-like meshes. Since the face-like mesh has reported better results than the rectangular grid (Table 4.1 and Figures 4.2 and 4.6), we will focus on the face-like representation to test the above mentioned tools. Table 4.2 shows the obtained results. The baseline performance using fn ≡ median (already displayed in Table 4.1) is shown once again for better analysis of the results. For each pairwise comparison between the evaluated methods, we computed confidence intervals around their ΔHTER. Tables 4.3 and 4.4 show these confidence intervals for configurations I and II respectively, and from the information provided in Tables 4.2, 4.3 and 4.4, we can highlight:



Figure 4.7: Global SFFS vs. Global BIF (criterion B is used in both selection schemes): effect of sweeping the number K of best selected features on system performance (TER(%) measures are reported) for the shape-driven mesh in configuration I.

• The simplest method, the non-trainable median, is the worst fusion approach for this problem.

• The most complex fusion methods (SVM and Adaboosted ensemble of MLPs (MLP-AB)) are the ones that obtain the best authentication results. For instance, the Adaboosted ensemble of MLPs without feature selection obtains an improvement of 51.25% in configuration I and 53.75% in configuration II in comparison with the median approach (statistically significant differences do exist between both approaches). Adaboost selected 50 MLPs in configuration I to build the ensemble and 41 MLPs in configuration II. Hence, a great computational effort is needed for this method in order to output a decision. The SVM evaluation is much simpler, while its performance is equivalent, as the obtained confidence intervals for ΔHTER ([−0.97%, 0.89%] in configuration I and [−0.84%, 0.69%] in configuration II) do not show statistically significant differences between both approaches.

• The LDA-based fusion method shows better performance than the median fusion, but its results are clearly worse than those achieved by SVM and the Adaboost ensemble (with significant differences in both configurations). An interesting point is that the Adaboost ensemble built using the locations selected by LDA (LDA-AB) performs nearly as well as the methods that use the whole set of similarities. From Tables 4.3 and 4.4, we can conclude that LDA-AB is not statistically significantly different from SVM and MLP-AB. These facts demonstrate that the selected features preserve useful verification information. The number of features left after selection is 64 for configuration I


and 62 for configuration II. During verification, we need to compute neither Gabor jets nor similarities at the discarded points, and therefore the verification time is reduced by nearly 50%.

• The performance obtained using the "global" SFFS (Figure 4.7, continuous line and dots) is clearly worse than that of the SVM or MLP-AB.

• Finally, we want to remark that client-specific BIF applied to the face-like mesh obtains comparable (and even better) results than more complex techniques, such as MLP-AB or SVM. Indeed, using K = 25, the obtained TER measures (Figure 4.3) are 2.43% (the lowest error rate among all tested techniques in configuration I) and 3% for configuration II. As discussed previously, BIF compensates for its simplicity with the ability to select client-specific locations, therefore achieving low error rates. Analogously, AFS (Figure 4.2) obtains competitive results, especially for small values of τ (TER of 3.13% and 2.62% in configurations I and II respectively for τ = 10%).

Method       Configuration I TER(%)   Configuration II TER(%)
Median       7.18                     4.67
LDA-based    5.94                     4.45
MLP-AB       3.50                     2.16
LDA-AB       4.15                     2.54
SVM          3.58                     2.30

Table 4.2: Total Error Rate (TER) using different fusion techniques: Median, LDA-based, Adaboosted MLP ensemble (MLP-AB), MLP ensemble built with Adaboost using the similarities selected by LDA (LDA-AB), and Support Vector Machines (SVM).

Table 4.3: Confidence interval (%) around ΔHTER = HTER_A − HTER_B for Z_{α/2} = 1.645 for the different fusion methods according to configuration I.

METHOD    LDA             MLP-AB          LDA-AB           SVM
Median    [−0.35, 1.59]   [0.77, 2.91]    [0.46, 2.59]     [0.75, 2.85]
LDA       –               [0.38, 2.06]    [0.06, 1.74]     [0.36, 1.99]
MLP-AB    –               –               [−1.27, 0.64]    [−0.97, 0.89]
LDA-AB    –               –               –                [−0.66, 1.20]



Table 4.4: Confidence interval (%) around ΔHTER = HTER_A − HTER_B for Z_{α/2} = 1.645 for the different fusion methods according to configuration II.

METHOD    LDA             MLP-AB          LDA-AB           SVM
Median    [−0.80, 1.02]   [0.35, 2.16]    [0.16, 1.97]     [0.28, 2.09]
LDA       –               [0.38, 1.91]    [0.19, 1.72]     [0.31, 1.84]
MLP-AB    –               –               [−0.96, 0.57]    [−0.84, 0.69]
LDA-AB    –               –               –                [−0.64, 0.88]

4.5 Conclusions

This chapter has explored the selection and combination of local Gabor similarities, testing and comparing several techniques (both client-dependent and client-independent). The main conclusions drawn from the experiments carried out on the XM2VTS database are the following:

1. From the results shown in Section 3.10.1, we concluded that the shape-driven configuration works slightly better than the face-like mesh without the need of localizing a set of (perfectly positioned) universal landmarks, although the differences are not significant. However, face-like is the best when intramodal fusion is applied: perfect localization of points and the dense mesh favour accurate fusion results.

2. Applying any selection technique results in improved performance and computational savings during the verification stage.

3. Simple selection tools such as BIF (and the closely related AFS) provide good results in the authentication scenario considered (few training data available for each user) when selecting client-dependent locations.

4. Complex pattern recognition tools (SVMs, MLPs, etc.) and the SFFS method perform better than BIF for the selection of client-independent locations due to the fact that, in this case, enough data are available to train the classifiers.

The client-dependent selection scheme offers competitive results when a suitable (yet simple, such as BIF) selection tool is applied. It is expected that, if enough training data for each client are available, the use of complex techniques will improve the performance compared to that of the global selection scheme. In order to validate this hypothesis, we plan to use databases containing video sequences, where enough training samples for each client can be collected, therefore providing an adequate framework for comparing global and client-dependent selection schemes.


Chapter 5

Modeling Marginal Distributionsof Gabor Coefficients

Contents

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 121

5.2 The Face Recognition system . . . . . . . . . . . . . . . . 123

5.3 Modeling Marginal Distributions of Gabor coefficients . 124

5.3.1 Univariate Generalized Gaussians . . . . . . . . . . . . . . . 124

5.3.2 Modeling Gabor coefficients with univariate GG’s . . . . . . 125

5.3.3 Bessel K Form Densities . . . . . . . . . . . . . . . . . . . . 126

5.3.4 Analyzing Estimated GG Parameters . . . . . . . . . . . . 129

5.4 Coefficient quantization by means of Lloyd-Max algorithm . . 133

5.5 Face Verification on the XM2VTS database . . . . . . . 134

5.6 Conclusions and further research . . . . . . . . . . . . . . 135

5.1 Introduction

Following [Shen and Bai, 2006a], Gabor-based approaches can be roughly classified into one of the following categories: a) extraction of Gabor responses from a set of key points in face images (which has been explored in this PhD Thesis), and b) convolution of the whole image with a set of Gabor filters. As highlighted in [Shen and Bai, 2006a], one of the main drawbacks of these approaches (especially the ones included in category b) is the huge amount of memory that is needed to store a Gabor-based representation of the image. Even in case a), considering single floating point representation (4 bytes), 100 points and 40 Gabor filters, the template size



reaches 32 Kbytes, which is considerably bigger than those employed by commercial systems. For instance, Cognitec's [Cognitec, 2002] templates occupy 1800 bytes each, and L-1 Identity Solutions' [L-1, 2005] template size ranges from 648 bytes to 7 Kbytes. One way to reduce storage is to perform coefficient quantization using an accurate statistical model for Gabor coefficients.
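The 32-Kbyte figure works out if each complex Gabor coefficient is stored as two 4-byte floats (real and imaginary parts, which is consistent with the separate real/imaginary modeling in Section 5.3, though the exact storage layout is our assumption). A back-of-the-envelope check:

```python
# Back-of-the-envelope template size (values taken from the text; storing
# each complex coefficient as two 4-byte floats is an assumption).
points, filters, bytes_per_float = 100, 40, 4
template_bytes = points * filters * 2 * bytes_per_float
print(template_bytes)          # 32000 bytes, i.e. ~32 Kbytes
print(template_bytes // 1800)  # roughly 17x a 1800-byte commercial template
```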

Statistical analysis of images has revealed, among other characteristics, one interesting property: the non-Gaussianity of image statistics when observed in a transformed domain, e.g. a wavelet decomposition. This means that the coefficients obtained through such transformations are quite non-Gaussian, being characterized by high kurtosis, sharp central cusps and heavy tails. Among others, the works in [Mallat, 1989, Simoncelli and Adelson, 1996, Moulin and Liu, 1999, Van de Wouver et al., 1999, Hernandez et al., 2000, Do and Vetterli, 2002, Srivastava et al., 2002] have observed this behavior, taking advantage of such a property for different applications. One statistical model that has been widely used to approximate the marginal distributions of coefficients is the Generalized Gaussian (GG) distribution [Van de Wouver et al., 1999, Hernandez et al., 2000, Do and Vetterli, 2002]. Other statistical priors that have alternatively been applied are the Bessel K Forms (BKF) [Srivastava et al., 2002, Fadili and Boubchir, 2005] and the alpha-stable densities [Achim et al., 2001].

To the best of our knowledge, despite the large number of papers using Gabor filters for face recognition, no statistical model has been proposed (or used) for Gabor coefficients in this scenario. We suggest that GG's, whose parameters are estimated using the Maximum Likelihood (ML) approach [Do and Vetterli, 2002], could provide a suitable modeling, and we empirically validate this hypothesis using the Kullback-Leibler (KL) distance. The KL divergence was also used to compare the fitting provided by Generalized Gaussians to that of other (state-of-the-art) statistical priors (Bessel K Forms).

The underlying statistics allow us to perform data compression via Lloyd-Max quantization [Lloyd, 1957, Lloyd, 1982, Max, 1960], and open new possibilities in terms of selecting an optimal measure between Gabor responses from a theoretical point of view.
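The Lloyd-Max design alternates between assigning samples to their nearest reconstruction level and moving each level to the centroid (mean) of its cell. A minimal sketch on empirical data (the initialisation, iteration count and toy samples are illustrative choices, not the procedure of Section 5.4, which works from the fitted densities):

```python
def lloyd_max(samples, levels, iters=50):
    # Lloyd-Max scalar quantizer design on empirical data: alternate between
    # nearest-centroid partitioning and centroid (mean) updates.
    lo, hi = min(samples), max(samples)
    centroids = [lo + (i + 0.5) * (hi - lo) / levels for i in range(levels)]
    for _ in range(iters):
        cells = [[] for _ in range(levels)]
        for x in samples:
            idx = min(range(levels), key=lambda i: abs(x - centroids[i]))
            cells[idx].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(cells)]
    return sorted(centroids)

# Toy bimodal data: a 2-level quantizer should settle on the two modes.
data = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]
print(lloyd_max(data, 2))  # approximately [-1.0, 1.0]
```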

The chapter is organized as follows: Section 5.2 briefly describes the baseline face recognition system used. Section 5.3 introduces the formulation of univariate Generalized Gaussian (GG) distributions and Bessel K Form (BKF) densities. Modeling of Gabor coefficients using both GGs and BKFs is also compared in this section. Coefficient quantization by means of the Lloyd-Max algorithm is explained in Section 5.4. The impact of coefficient quantization on verification performance is reported in Section 5.5, with experimental results on the XM2VTS database [Messer et al., 1999]. Finally, conclusions and future research lines are drawn in Section 5.6.


5.2 The Face Recognition system


Figure 5.1: Real part of the set of 40 (8 orientations × 5 scales) Gabor filters used in this chapter.

The baseline face recognition system used in this chapter relies upon extraction of Gabor responses at each of the nodes of an nx × ny ≡ 10 × 13 rectangular grid (Figure 5.2). All faces were geometrically normalized -so that eyes and mouth are in fixed positions-, cropped to a standard size of 150×116 pixels and photometrically corrected by means of histogram equalization and local mean removal. The region surrounding each grid node in the image is encoded by the convolution of the image patch with these filters (whose real part is shown in Fig. 5.1), forming a jet, J. For a given face with n = nx × ny grid nodes {p_1, p_2, . . . , p_n}, we get n Gabor jets {J(p_1), J(p_2), . . . , J(p_n)}.
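The jet extraction can be sketched in plain Python. This is a toy illustration, not the system's implementation: the kernel parameterisation (complex exponential under an isotropic Gaussian window), the filter sizes and the sample image are all invented for the example.

```python
import math

def gabor_kernel(size, freq, theta, sigma):
    # Complex Gabor kernel (hypothetical parameterisation): a complex
    # exponential at spatial frequency `freq` and orientation `theta`,
    # windowed by an isotropic Gaussian of width `sigma`.
    half = size // 2
    k = [[0j] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            dx, dy = x - half, y - half
            phase = freq * (dx * math.cos(theta) + dy * math.sin(theta))
            env = math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
            k[y][x] = env * complex(math.cos(phase), math.sin(phase))
    return k

def jet_at(image, px, py, kernels):
    # Gabor jet at node (px, py): one complex coefficient per filter,
    # obtained by correlating the local patch with each kernel.
    jet = []
    for k in kernels:
        half = len(k) // 2
        acc = 0j
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                acc += image[py + dy][px + dx] * k[dy + half][dx + half]
        jet.append(acc)
    return jet

# Toy example: 5 scales x 8 orientations = 40 filters, one grid node.
kernels = [gabor_kernel(9, 0.5 * 2 ** -s, o * math.pi / 8, 2.0)
           for s in range(5) for o in range(8)]
image = [[(x + y) % 7 for x in range(30)] for y in range(30)]
jet = jet_at(image, 15, 15, kernels)
print(len(jet))  # 40 coefficients per node
```

Repeating `jet_at` over the 10 × 13 grid nodes would yield the 130 jets that make up one face template.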


5.3 Modeling Marginal Distributions of Gabor coefficients

5.3.1 Univariate Generalized Gaussians

Pioneered by the work of [Mallat, 1989], Generalized Gaussians have been successfully used to model marginal distributions of coefficients produced by various types of transforms [Van de Wouver et al., 1999, Hernandez et al., 2000, Do and Vetterli, 2002, Simoncelli and Adelson, 1996, Sharifi and Leon-Garcia, 1995, Moulin and Liu, 1999, Joshi and Fischer, 1995]. The pdf of a GG is given by the following expression:

P_{\mu,\beta,\sigma}(x) = \frac{1}{Z(\beta)\,\sigma A(\beta)} \exp\left( -\left| \frac{x - \mu}{\sigma A(\beta)} \right|^{\beta} \right)    (5.1)

where β is the so-called shape parameter, μ represents the mean of the distribution, and σ is the scale parameter. In the following we will consider zero-mean data, i.e.

Figure 5.2: Rectangular grid over the preprocessed (geometrically and photometrically normalized) face image. At each node, a Gabor jet with 40 coefficients is computed and stored.



Figure 5.3: Effect of β on the univariate GG distribution.

μ = 0. Z(β) and A(β) in Eq. (5.1) are given by:

Z(\beta) = \frac{2}{\beta} \Gamma\left( \frac{1}{\beta} \right)    (5.2)

A(\beta) = \sqrt{ \frac{\Gamma(1/\beta)}{\Gamma(3/\beta)} }    (5.3)

where Γ(·) represents the Gamma function. It should be noted that the Laplacian, Gaussian and Uniform distributions are just special cases of this generalized pdf, given by β = 1, β = 2 and β → ∞ respectively (see Figure 5.3 for an example showing the effect of β on the shape of the distribution).
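As a quick numerical check of these special cases, Eq. (5.1) with Z(β) and A(β) as in Eqs. (5.2)-(5.3) can be evaluated directly; for β = 2 it reduces to the ordinary Gaussian N(0, σ²). A sketch (the function name is ours):

```python
import math

def gg_pdf(x, beta, sigma, mu=0.0):
    # Generalized Gaussian pdf of Eq. (5.1), with Z(beta) and A(beta)
    # as defined in Eqs. (5.2)-(5.3).
    A = math.sqrt(math.gamma(1.0 / beta) / math.gamma(3.0 / beta))
    Z = (2.0 / beta) * math.gamma(1.0 / beta)
    return math.exp(-abs((x - mu) / (sigma * A)) ** beta) / (Z * sigma * A)

# beta = 2 recovers the ordinary Gaussian N(0, sigma^2):
x, sigma = 0.7, 1.3
gauss = math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))
print(abs(gg_pdf(x, 2.0, sigma) - gauss) < 1e-12)  # True
```

With this parameterisation, A(β) normalises the density so that σ is its standard deviation for every shape β.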

5.3.2 Modeling Gabor coefficients with univariate GG’s

In this chapter, we attempt to model both real and imaginary parts of each Gabor coefficient using GG's whose parameters (β and σ) have been obtained using the Maximum-Likelihood (ML) estimator (see Appendix D for the derivation of the formulas). From a set of face images {F_1, F_2, . . . , F_T}, we extract Gabor jets using the rectangular grid. Regardless of the node from which they have been computed, the coefficients corresponding to a given Gabor filter ψ_m (real and imaginary parts separately) are stored together, forming two sets of coefficients S_m^real and S_m^imag. Now, our goal is to assess whether GG's are able to accurately model these distributions.

Figure 5.4 shows the histogram for the real part of coefficient g_34 along with the fitted GG. Although it seems clear from this figure that the GG accurately models the coefficient distribution (similar plots were obtained for the remaining coefficients), we used the Kullback-Leibler (KL) distance [Cover and Thomas, 1991] to assess the

Page 156: Improvements in Pose Invariance and Local … · as´ı como la gran dimensionalidad del problema constituyen los principales obst´aculos ... en voz alta todos los mu´sculos y huesos

126 Chapter 5. Modeling Marginal Distributions of Gabor Coefficients

goodness of the fits. The Kullback-Leibler distance between two discrete distributions with probability functions P and Q is given by

KL(P, Q) = \sum_{i=1}^{K} P(i) \log \frac{P(i)}{Q(i)} \geq 0,

where K stands for the number of intervals into which the sample space is divided. Figure 5.5 (left) plots, for both real and imaginary parts of each Gabor coefficient, the KL distance between the relative frequency of the coefficient and the fitted GG. Since the obtained distances are small, it seems reasonable to conclude that Generalized Gaussians are able to model Gabor coefficients accurately. Other tests, such as the χ² test, have been previously used to assess the quality of the fit (e.g. [Muller, 1993]). Applying the χ² test to our data leads to the same conclusion (Figure 5.5, right).

5.3.3 Bessel K Form Densities

Apart from the Generalized Gaussian distribution, other statistical priors, such as Bessel K Form (BKF) densities [Srivastava et al., 2002], have recently emerged as a valid alternative for coefficient modeling. As with the GG, the BKF distribution is characterized by two parameters (p and c) with meanings analogous to those of β and σ respectively. The BKF density is given by:

BKF(x; p, c) = \frac{2}{Z(p, c)} |x|^{p - 0.5} K_{p - 0.5}\left( \sqrt{\frac{2}{c}}\,|x| \right)    (5.4)

where K_{\nu} is the modified Bessel function of order \nu defined in [Abramowitz and Stegun, 1970], and Z is the normalizing constant given by:

Z(p, c) = \sqrt{\pi}\,\Gamma(p)\,(2c)^{0.5 p + 0.25}    (5.5)

The BKF density is based on a physical model for image formation (the so-called


Figure 5.4: Histogram for coefficient g34 along with the fitted GG.



Figure 5.5: Kullback-Leibler and χ² distances between the fitted GG and the data for both real and imaginary parts of each Gabor coefficient.

transported generator model). In [Fadili and Boubchir, 2005] the authors compared BKF against GG and the α-stable distribution, concluding that the BKF density fits the data at least as well as the Generalized Gaussian, and outperforms GGs in capturing the heavy tails of the observed histogram. However, no description of the method used to estimate the Generalized Gaussian parameters was included (moments, Maximum Likelihood, etc.). In the case of BKF densities, parameters have usually been estimated using moments [Srivastava et al., 2002] and k-statistics unbiased cumulant estimators [Fadili and Boubchir, 2005].

In order to compare BKF and GG modeling in the specific case of Gabor coefficients extracted from face images, we performed the following experiment:

• For each pair of orientation and scale, i.e. for each coefficient g_m, both BKF and GG parameters were estimated on 10 different sets of randomly chosen coefficients.

• For each coefficient and set, the KL distance was measured between the observed histogram and the two estimated densities.

• The average KL for the m-th coefficient, as well as the associated standard deviation, were stored.

The k-statistics unbiased cumulant estimators [Fadili and Boubchir, 2005] were used to determine the parameters of the BKF distributions, while Maximum Likelihood (ML) [Do and Vetterli, 2002] was employed to estimate the GG parameters. Examples of observed histograms on a log scale, along with the fitted densities, are shown in Figure 5.6 for coefficients 1, 9, 17, 25 and 33 (i.e. the coefficients with vertical orientation from each frequency subband).

From these plots, it seems that both densities are equivalent in the last 3 (lowest) frequency subbands. However, Generalized Gaussians are considerably more accurate than



Figure 5.6: Examples of observed histograms (on a log scale) along with the BKF and GG fitted densities.


BKF in the first two (highest) frequency subbands (especially when fitting the central cusp). In agreement with [Fadili and Boubchir, 2005], Bessel K Forms seem slightly better in capturing the heavy tails of the observed histogram for the 1st frequency subband. Figure 5.7 shows, for each Gabor coefficient, the mean KL distance (left) as well as the associated standard deviation (right) between the observed histograms and the two estimated densities. It is clear that Generalized Gaussians provide a much better modeling than BKFs in the first two scales (highest frequency scales, i.e. coefficients 1 to 16), a slightly better behavior in the third scale (coefficients 17 to 24), and equal performance in the remaining two scales.


Figure 5.7: Left: Mean KL distance between observed histograms and the two estimated densities (GG and BKF). Right: Associated standard deviation.

As stated above, BKF parameters were estimated using a robust extension of the moments method, while GG parameters were determined using ML. In [Sharifi and Leon-Garcia, 1995, Do and Vetterli, 2002], a way to estimate Generalized Gaussian parameters using moments is also described (see Appendix D). In order to compare BKFs and GGs with similar parameter estimation procedures, the experiment described above was repeated using GGs fitted via the moments-based method. Results are shown in Figure 5.8, demonstrating that even with comparable estimation procedures, GGs do outperform BKFs.
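One common form of the moments method fixes the shape by matching the ratio E|x| / sqrt(E x²), which for a zero-mean GG equals Γ(2/β) / sqrt(Γ(1/β) Γ(3/β)), a monotone function of β that can be inverted numerically. The sketch below (inversion by bisection, and a Laplacian sanity check) is one possible implementation, not necessarily the one in Appendix D:

```python
import math
import random

def m_ratio(beta):
    # Generalized-Gaussian moment ratio E|x| / sqrt(E x^2) as a
    # function of the shape parameter beta.
    return math.gamma(2.0 / beta) / math.sqrt(
        math.gamma(1.0 / beta) * math.gamma(3.0 / beta))

def estimate_beta(samples, lo=0.1, hi=10.0, iters=80):
    # Moments-based shape estimate: compute the sample moment ratio and
    # invert m_ratio by bisection (m_ratio is increasing in beta).
    m1 = sum(abs(x) for x in samples) / len(samples)
    m2 = sum(x * x for x in samples) / len(samples)
    target = m1 / math.sqrt(m2)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if m_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Sanity check on synthetic Laplacian-like data (true beta = 1):
random.seed(0)
lap = [random.expovariate(1.0) * random.choice([-1, 1]) for _ in range(20000)]
print(estimate_beta(lap))  # close to 1.0
```

For a Laplacian, E|x| / sqrt(E x²) = 1/√2, so the bisection should land near β = 1; the residual error comes only from sampling noise.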

Finally, the KL distance was also used to compare the GG densities estimated via ML (GG_ML) with those estimated via the moments method (GG_moments). As shown in Figure 5.9, ML provides slightly better results in the first two frequency subbands, while both methods perform the same in the remaining ones.

5.3.4 Analyzing Estimated GG Parameters

Having confirmed the good performance of Generalized Gaussians in modeling marginal distributions of Gabor coefficients, this section provides an analysis of the obtained



Figure 5.8: Left: Mean KL distance between observed histograms and the two estimated densities (GG fitted via a moments-based method and BKF). Right: Associated standard deviation


Figure 5.9: Mean KL distance between observed histograms and the two estimated densities (GG fitted via a moments-based method and GG fitted via ML)


ML parameters∗ for each pair of orientation and frequency. Figure 5.10 presents the β and σ parameters for the 80 GGs modeling both real and imaginary parts of the coefficients. From this figure, we can conclude:

• All the GGs have a shaping factor β that is well below 2, and hence we can conclude that the distribution of each Gabor coefficient is not well modeled by a Gaussian. This fact was also confirmed with normal probability plots (a statistical tool employed to assess whether or not a data set is approximately normally distributed).

• The real and imaginary parts of a given coefficient have similar β parameters. The same conclusion can be drawn for σ.

• There exists a pseudo-periodic behavior in the GG parameters. If we examine Figure 5.11, which replots the shaping factors and the standard deviations for the real part of each coefficient grouped by scale subbands, it seems clear that a similar pattern emerges in each of these subbands. Further research is needed in order to provide theoretical reasons explaining this “V” pattern at each scale subband.

• In an analogous way, Figure 5.12 replots β and σ for the real part of the coefficients grouped by orientation subbands. It can be seen that the GG parameters increase with scale, i.e. as spatial frequency decreases. Taking into account the variation of Gabor filters with scale for a fixed orientation (any column of Figure 5.1), it is clear that a filter from the first row (1st scale, highest frequency) captures texture information from a smaller neighborhood than a filter with a lower frequency does. Hence, we can assume that the information encoded in a high frequency coefficient is more correlated than that captured by a low frequency filter and therefore, it is reasonable to conclude that the variance (and σ) should be smaller for high frequency coefficients. Moreover, the increase of β with scale means that the coefficient distributions become “more Gaussian”, a fact that could be explained by the same hypothesis and the central limit theorem: as frequency decreases, the pixels in the image patch that are taken into account for the convolution are less correlated and, applying the central limit theorem, the result of this convolution should approach a normal distribution. However, more experiments are needed to assess the validity of this hypothesis.

∗Similar conclusions can be derived for the parameters estimated using moments



Figure 5.10: Obtained β and σ GG parameters for both real and imaginary parts of each Gabor coefficient.


Figure 5.11: Shaping factors β and standard deviations σ for the real part of Gabor coefficients grouped by scale subbands.



Figure 5.12: Shaping factors β and standard deviations σ for the real part of Gabor coefficients grouped by orientation subbands.

5.4 Coefficient quantization by means of the Lloyd-Max algorithm

Now that we have a way to model marginal distributions of Gabor coefficients through univariate Generalized Gaussians, a wide range of applications arises. As highlighted in [Shen and Bai, 2006a], one of the drawbacks of Gabor-based approaches is the large amount of data that must be stored. Hence, we can think of reducing storage via coefficient quantization. To achieve this goal, we used the Lloyd-Max quantizer (the one with minimum mean squared error (MSE) for a given number NL of representative levels) [Lloyd, 1957, Lloyd, 1982], [Max, 1960].
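The Lloyd iteration itself is simple: alternate between assigning each sample to its nearest level and moving each level to the conditional mean of its cell. The following is a minimal empirical sketch (equivalent to 1-D k-means on the observed coefficients), with our own helper name, not the production quantizer design:

```python
import numpy as np

def lloyd_max(samples, n_levels, iters=50):
    """Empirical Lloyd-Max quantizer: minimizes MSE by alternating
    nearest-level assignment and centroid (conditional mean) updates."""
    # Initialize the levels at evenly spaced sample quantiles.
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    idx = np.zeros(len(samples), dtype=int)
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                levels[k] = samples[idx == k].mean()
    return levels, idx

# Heavy-tailed data standing in for one band of Gabor coefficients.
rng = np.random.default_rng(1)
x = rng.laplace(scale=0.05, size=50_000)
levels, idx = lloyd_max(x, n_levels=8)
mse = np.mean((x - levels[idx]) ** 2)   # quantization error, well below var(x)
```

Each coefficient is then stored as the index `idx` of its nearest level, and only the `levels` array (the centroids) needs to be kept per band.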

In our case, a face is represented by n jets, each one comprising 40 complex coefficients. Assuming that each coefficient is represented by 8 bytes (single-precision floating point: 4 bytes for the real part + 4 bytes for the imaginary part), a total of 8 × 40 × n = 320n bytes are needed per face. If the required database comprises Nf faces, then 320n × Nf bytes are needed.

After GG modeling and data quantization, instead of storing the original coefficient, we only need to keep two indices (one for the real part and another for the imaginary part) per coefficient (2 × 40 × n indices per face). Using NL quantization levels, we can represent a coefficient with 2 × ⌈log2(NL)⌉ bits. In our case, a face is therefore represented by

\[
\frac{40 \times n \times 2 \times \lceil \log_2 (N_L) \rceil}{8} = 10n \times \lceil \log_2 (N_L) \rceil \text{ bytes}
\]

If Nf faces are to be stored, then

\[
10n \times N_f \times \lceil \log_2 (N_L) \rceil + 40 \times 4 \,(\text{bytes}) \times N_L \,(\text{centroids}) \text{ bytes}
\]


are needed. The second term in the previous expression represents the storage required for the NL centroids in each coefficient band (given that both real and imaginary parts have very similar GG parameters (Section 5.3.4), only NL centroids have been used to quantize each band). If we let, for instance, NL = 8, a storage reduction of approximately 91% is achieved, and for NL = 512, ≈72% of space is saved with respect to the raw coefficients.
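The arithmetic above can be checked directly. This sketch (with a hypothetical helper name) reproduces the quoted saving for NL = 8; the 40 × 4 × NL centroid term is the small fixed overhead shared by all Nf faces:

```python
from math import ceil, log2

def storage_bytes(n, n_faces, n_levels=None):
    """Bytes to store n_faces faces of n jets x 40 complex coefficients:
    raw (8 bytes per coefficient) or quantized to n_levels per band."""
    if n_levels is None:
        return 320 * n * n_faces                     # 8 x 40 x n bytes per face
    per_face = 10 * n * ceil(log2(n_levels))         # 2 indices/coeff, packed in bits
    centroids = 40 * 4 * n_levels                    # 4-byte centroid per band/level
    return per_face * n_faces + centroids

raw = storage_bytes(n=20, n_faces=1000)
q8 = storage_bytes(n=20, n_faces=1000, n_levels=8)
print(round(1 - q8 / raw, 3))   # ≈ 0.906, the ~91% saving quoted for NL = 8
```

For NL = 512 the same computation yields a saving of just over 70%, matching the ≈72% figure once the centroid overhead becomes negligible for large Nf.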

In [Potzsch et al., 1996], it was shown that a given image can be reconstructed using the Gabor responses extracted from a sparse graph (like the rectangular grid shown in Figure 5.2). Figure 5.13 presents the reconstruction of the face in Figure 5.2 using different quantization levels, along with the reconstruction using the original coefficients. As can be seen, the reconstruction with just 4 quantization levels is already quite accurate, and the differences in quality between NL = 8, . . . , 512 levels and the original coefficients are not easily noticeable from a perceptual point of view.


Figure 5.13: Face reconstruction [Potzsch et al., 1996] using original and quantized coefficients.

5.5 Face Verification on the XM2VTS database

In order to assess the impact of data compression on system performance, we conducted verification experiments using the XM2VTS database on configuration I of the Lausanne protocol [Luttin and Maître, 1998] (see Appendix A, Section A.4, for a description of both database and protocol). Table 5.1 presents FAR, FRR and Total Error Rate (TER = FAR + FRR) over the test set varying the number of quantization levels, along with the performance using the original coefficients. Using the statistical analysis of [Bengio and Mariethoz, 2004] (Appendix B), we confirmed that performance was significantly worse only for NL = 2 and NL = 4 quantization levels. For


the remaining ones, the performance was even better than that with the original coefficients, although we cannot conclude that significant improvements were achieved. In any case, these results suggest that noise reduction may be achieved via coefficient quantization.

5.6 Conclusions and further research

This chapter has shown that Gabor coefficients extracted from face images can be accurately modeled using generalized Gaussian distributions. Empirical evaluations against the Bessel K Forms density [Srivastava et al., 2002, Fadili and Boubchir, 2005] have demonstrated the benefits of the Generalized Gaussian in this specific scenario.

This finding opens a wide range of possibilities. As a first attempt, we took advantage of the underlying statistics to reduce data storage via Lloyd-Max quantization. No degradation was observed even with severe compression using 8 quantization levels. Further research is needed to investigate the “V” behavior of GG parameters observed in each frequency band (Section 5.3.4). Moreover, we have demonstrated that the distributions of Gabor coefficients are far from Gaussian, given that the obtained shaping factors β are well below 2 for all coefficients.

Gabor-based face recognition systems have used distances for jet comparison that are not supported by theoretical evidence (cosine distance, as in [Wiskott et al., 1997], is one of the most accepted; see Section 3.12 for a comparison of different distance measures). We think that, based on the GG modeling of Gabor coefficients, theoretically optimal ways to compare jets could be obtained, and research lines focused on this topic will be addressed.

In this chapter, we have limited our study to marginal distributions. In addition to the non-normal behavior of marginals, studies on joint statistics of filter responses have shown that there exist non-Gaussian dependencies across scales, orientations and positions [Shapiro, 1993, Buccigrossi and Simoncelli, 1999]. For instance, contour probability plots of 2-D and 3-D histograms display surprising polyhedra-like shapes [Srivastava et al., 2002, Boubchir and Fadili, 2005]. Section 6.2 presents some preliminary experiments on the modeling of joint statistics of Gabor coefficients using multivariate Generalized Gaussians.


Table 5.1: Face Verification on the XM2VTS database. False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total Error Rate (TER) over the test set using both raw and compressed data. Moreover, the approximate storage saving is provided for each quantization level.

Test Set
            Storage Saving   FAR(%)   FRR(%)   TER(%)
NL = 2      ≈97%             12.15    18.25    30.40
NL = 4      ≈94%              4.19     8.00    12.19
NL = 8      ≈91%              3.49     5.50     8.99
NL = 16     ≈87%              3.85     5.50     9.35
NL = 32     ≈84%              3.71     5.00     8.71
NL = 64     ≈81%              3.53     5.50     9.03
NL = 128    ≈78%              3.57     5.00     8.57
NL = 256    ≈75%              3.63     4.75     8.38
NL = 512    ≈72%              3.66     4.75     8.41
Raw data     0%               3.79     5.25     9.04


Chapter 6

Recent Results

Contents
6.1 Automatic Face Alignment (Still Images and Video Sequences)
    6.1.1 Face Alignment on Still Images
    6.1.2 Face Tracking on Video Sequences
6.2 Multivariate Generalized Gaussians
    6.2.1 Multivariate Generalized Gaussian Formulation
    6.2.2 Poly-β Multivariate Generalized Gaussian
6.3 Generalized Gaussians for Hidden Markov Models
    6.3.1 Fundamentals of HMMs
    6.3.2 Experimental evaluation

Throughout this Thesis we have outlined, at the end of the corresponding chapter, future research lines regarding each addressed topic. In this chapter, we report recent advances and results for some of them. Moreover, we describe other research directions that we have recently opened concerning Generalized Gaussians: the use of this density in Hidden Markov Models (HMMs), and the definition (and parameter estimation) of a novel multivariate distribution. The chapter is organized as follows: Section 6.1 reports preliminary results on automatic face alignment in still images and face tracking in video sequences, along with two applications: pose-robust face recognition from video and lip-audio synchrony. Section 6.2 presents the multivariate extension of Generalized Gaussians introduced in [Boubchir and Fadili, 2005]. Moreover, a novel multivariate distribution, the Poly-β Multivariate Generalized Gaussian, is defined. Section 6.3 introduces the use of the Generalized Gaussian density (both univariate and multidimensional) in the HMM framework, with some preliminary results.



Figure 6.1: Preliminary results on automatic face fitting on an image from the CMU PIE database. Left: initialization. Center: fitting after 10 iterations. Right: final fitting.

6.1 Automatic Face Alignment (Still Images and Video Sequences)

6.1.1 Face Alignment on Still Images

In addition to face detection, which provides a coarse estimate of the position and scale of each detected face, face alignment aims to achieve a more accurate localization, thus allowing faces to be normalized geometrically. Different approaches such as Active Shape Models [Cootes et al., 1995] (with extensions [Zhang et al., 2005, Sukno et al., 2007, Cristinacce and Cootes, 2007]), Active Appearance Models [Cootes et al., 2001] (with extensions [Cootes et al., 2000, Cootes and Taylor, 2001, Cristinacce and Cootes, 2006]) and elastic graph matching methods [Wiskott et al., 1997] have been proposed in the literature. In these algorithms, a set of facial features such as nose, eyes, mouth and face outline are located, and these positions are used for geometrical normalization in order to get rid of in-plane rotation, scale, etc., and even of out-of-plane rotations, as demonstrated in Chapter 2. In [Baker and Matthews, 2001] a modification of the Active Appearance Model paradigm was introduced, namely the Inverse Compositional Image Alignment (ICIA). This modification achieved better results and higher speed. Following [Matthews and Baker, 2003], we implemented an ICIA-based system. Preliminary experiments on the CMU PIE database are shown in Figure 6.1.

6.1.2 Face Tracking on Video Sequences

In addition to face alignment in still images, we also performed some experiments on face tracking through video sequences using the Inverse Compositional Image Alignment (ICIA) algorithm [Baker and Matthews, 2001]. Manual initialization is needed for the first frame; the remaining ones are processed automatically. In the framework of pose-robust face recognition from video, experiments on a 5000-frame


video sequence∗ compared the performance (in terms of similarity score degradation) of the baseline method, where no pose correction was applied, against the system in which virtual images are synthesized using NFPW (see Chapter 2 for a description of both baseline and NFPW), confirming the advantages of the latter and its suitability for pose-robust face recognition on video sequences. For this experiment, the vectors of shape parameters b were computed for all images. One of the frames of the video was used as a template, while the remaining ones were used for testing. Let bα(i) be the value, in frame i, of the pose parameter that accounts for rotations, and let b0α be the value of this parameter for the template frame. The difference between b0α and bα(i), namely ∆bα, is a measure of the difference between the rotation angles of the template and the probe image. Figure 6.3 presents the similarity scores obtained with and without the pose correction stage against ∆bα. The tested video shows a man during conversation and, apart from pose changes, there are other factors such as expression variations that affect the value of the similarity between the template and the probe image. However, it is clear that when ∆bα grows, the use of pose-corrected images outperforms the original system.

Face Tracking on the BANCA database

The English part of the BANCA database [Bailly-Bailliere et al., 2003] contains video sequences from 52 subjects recorded at 12 different sessions, totaling 1248 videos. A two-layer hierarchical system was designed for tracking. The bottom layer consists of an ICIA tracker in which the face region is coarsely tracked (only affine deformations are allowed). The top layer uses an ICIA tracker in which a set of facial features are tracked throughout the video sequence (the allowed shape deformations are learnt from a set of training face meshes, as explained in Section 2.2). The system was used to track all videos from the first four sessions (controlled scenario). Results were manually supervised and, overall, face tracking turned out to be quite accurate (see Figure 6.2).

In combination with features extracted from the audio signal, the tracked lip coordinates (see Figure 6.2) were used to measure the synchrony between audio and video, with the final goal of performing aliveness detection and hence discarding fake attempts [Argones Rua et al., 2008].

Pose-robust Face Recognition from Video

Making use of the NFPW method presented in Chapter 2 and the semi-automatic tracker described above, we performed some experiments using a database collected in the framework of the Biosecure Network of Excellence [BioSecure, 2004]. One of the main goals of these experiments [Alba-Castro et al., 2008] was to evaluate

∗http://www-prima.inrialpes.fr/FGnet/data/01-TalkingFace/talking face.html



Figure 6.2: Example of a tracked face through a sequence from the BANCA database. Extraction of lip coordinates for audio-video asynchrony detection [Argones Rua et al., 2008].

the benefits of pose correction in video, which were confirmed with improvements in performance over the original system where no correction was applied. In agreement with the results obtained in Chapter 2, we found that generating virtual images via NFPW provided better results than warping to a mean shape (i.e. WMS).

6.2 Modeling Joint Statistics of Gabor Coefficients: Multivariate Generalized Gaussians

As introduced in Chapter 5, the marginal distributions of wavelet coefficients present highly non-Gaussian behavior (high kurtosis, sharp central cusps and heavy tails). In addition to the non-normal behavior of marginals, studies on joint statistics of filter responses have shown that there exist non-Gaussian dependencies across scales, orientations and positions [Shapiro, 1993, Buccigrossi and Simoncelli, 1999]. For instance, contour probability plots of 2-D and 3-D histograms display surprising polyhedra-like shapes [Srivastava et al., 2002, Boubchir and Fadili, 2005]. In [Boubchir and Fadili, 2005] a multivariate statistical model that adequately fits this behavior was introduced, namely the Anisotropic Multivariate Generalized Gaussian (AMGG), whose parameters (shape factor β and covariance matrix Σ) were estimated from data. When dealing with univariate signals, the AMGG reduces to the Generalized Gaussian introduced in Chapter 5. [Do and Vetterli, 2002] derives the formulas for the



Figure 6.3: Similarity scores with and without pose correction in a video sequence

estimation of the univariate GG parameters in a Maximum Likelihood (ML) approach, obtaining that the estimate of σ is a non-linear function of β (see also Appendix D). Despite this fact, the authors of [Boubchir and Fadili, 2005] estimated the covariance matrix directly as the covariance of the data, independently of β, as is done when working with multivariate Gaussians. Our first goal is to derive the formulas for the estimation of Σ in an ML approach, demonstrating that the values of its elements do depend on the shape parameter β.

In Chapter 5 we showed that Gabor coefficients extracted from face images were well modeled using univariate Generalized Gaussians. Moreover, we found that for each coefficient (with a given orientation and scale) a different value of β was obtained. Following this finding, and in order to model joint statistics of Gabor coefficients, we present a novel and flexible multivariate statistical model that accounts for a possibly different β in each dimension, namely the Poly-β Multivariate Generalized Gaussian (Pβ-MGG).

We provide some preliminary results confirming the benefits of (i) ML estimation of Σ over the estimation adopted in [Boubchir and Fadili, 2005], and (ii) Poly-β over Mono-β MGG.

Notation. Matrices are represented by bold capital letters, e.g. M, and the (i, j) element of M is written Mi,j. Lowercase bold letters represent vectors, e.g. x; the i-th element of x is indicated by xi.

6.2.1 Multivariate Generalized Gaussian Formulation

[Boubchir and Fadili, 2005] recently introduced the so-called Anisotropic Multivariate Generalized Gaussian Distribution (AMGG), whose D-dimensional pdf is given by:


\[
P_{\mu,\beta,\Sigma}(\mathbf{x}) = \frac{[\det(\Sigma)]^{-1/2}}{[Z(\beta)A(\beta)]^{D}} \times \exp\left(-\left\| \frac{\Sigma^{-1/2}(\mathbf{x}-\mu)}{A(\beta)} \right\|_{\beta}\right) \tag{6.1}
\]

where β is the so-called shape parameter, µ represents the mean of the distribution, and Σ is a symmetric positive definite matrix. In the following we will consider zero-mean data, i.e. µ = 0. Z(β) and A(β) in Eq. (6.1) are given by:

\[
Z(\beta) = \frac{2}{\beta}\,\Gamma\!\left(\frac{1}{\beta}\right) \tag{6.2}
\]

\[
A(\beta) = \sqrt{\frac{\Gamma(1/\beta)}{\Gamma(3/\beta)}} \tag{6.3}
\]

where Γ(.) represents the Gamma function. Moreover

\[
\|\mathbf{x}\|_{\beta} = \sum_{d=1}^{D} |x_d|^{\beta} \tag{6.4}
\]

stands for the lβ norm of vector x. The formulation of Eq. (6.1) includes the univariate case (D = 1) introduced in Chapter 5. Although the univariate GG model has been extensively used to model the distribution of several types of coefficients, this certainly does not apply to the multivariate extension, whose use is quite recent and limited [Cho and Bui, 2005, Boubchir and Fadili, 2005]. Since the formulation of the AMGG uses one unique shape parameter for all directions, we will refer to the AMGG as Mono-β MGG (Mβ-MGG) or simply MGG.
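Eq. (6.1) can be evaluated numerically as a quick sanity check. The sketch below (our own helper names, using the symmetric matrix square root for Σ^{−1/2}; not the thesis implementation) reduces exactly to the multivariate Gaussian when β = 2:

```python
import numpy as np
from scipy.special import gamma

def Z(beta):
    return (2.0 / beta) * gamma(1.0 / beta)                 # Eq. (6.2)

def A(beta):
    return np.sqrt(gamma(1.0 / beta) / gamma(3.0 / beta))   # Eq. (6.3)

def mgg_pdf(x, beta, Sigma):
    """Zero-mean Mono-beta Multivariate Generalized Gaussian, Eq. (6.1)."""
    x = np.asarray(x, float)
    vals, vecs = np.linalg.eigh(np.asarray(Sigma, float))
    S_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T      # symmetric Sigma^{-1/2}
    w = S_inv_half @ x
    norm = np.sum(np.abs(w / A(beta)) ** beta)              # l_beta "norm", Eq. (6.4)
    const = np.prod(vals) ** -0.5 / (Z(beta) * A(beta)) ** len(x)
    return const * np.exp(-norm)
```

For β = 2 this coincides with the usual multivariate normal density with covariance Σ, which is a convenient unit test for the constants Z(β) and A(β).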

In [Boubchir and Fadili, 2005] the covariance matrix Σ was estimated as the covariance of the data, independently of β, as is done when working with multivariate Gaussians. Appendix E presents the Maximum Likelihood estimates of Σ and β. We demonstrate that the elements of the matrix ΣML do depend on β.

Comparing Models

We now empirically demonstrate that ΣML produces a more accurate model than that of [Boubchir and Fadili, 2005] (i.e. the one with Σ = Σ0 = cov(X)), especially when input data are well modeled by low values of the shape parameter (β < 1). To this end, we computed the Kullback-Leibler (KL) divergence between several 2-D data histograms and the two models (let us denote these distances KLML and KL0 respectively). In particular, three different cases were tested:


A) 2-D Histogram. B) Mβ-MGG (ΣML). C) Mβ-MGG (Σ0).

Figure 6.4: Comparison of ΣML and Σ0 for modeling joint statistics of Gabor coefficients.

1. Each dimension is taken to be a set of coefficients corresponding to a given Gabor filter with a certain scale and orientation. In Chapter 5 it was shown that Gabor coefficients (real and imaginary parts separately) can be accurately modeled using a univariate Generalized Gaussian with shape parameter β depending on scale and orientation. Figure 6.4 shows the contour probability lines of the 2-D histogram (left), the fitted MGG with ΣML (center) and the fitted MGG with Σ0 (right). From these plots, it seems clear that the ML estimate of Σ provides a more accurate fitting. In agreement, KLML (0.82) < KL0 (1.25).

2. Data are sampled from a 2-D Laplacian distribution with covariance matrix Σ [Eltoft et al., 2006]. Figure 6.5 shows the 2-D histogram of the data, as well as the fitting provided by both models. In this case, ΣML ≈ Σ0 and hence both KL distances are very similar (≈ 0.56).

3. Data are sampled from a 2-D Gaussian distribution with covariance matrix Σ. It is well known that for multivariate Gaussians ΣML = cov(X) = Σ0. Apart from numerical errors, the obtained results confirm that both estimates coincide, and hence KLML and KL0 are equal (see Figure 6.6 for contour lines and fitted models).


A) 2-D Histogram. B) Mono-β MGG (ΣML). C) Mono-β MGG (Σ0).

Figure 6.5: Comparison of ΣML and Σ0 when data are sampled from a 2-D Laplacian distribution [Eltoft et al., 2006].

A) 2-D Histogram. B) Mono-β MGG (ΣML). C) Mono-β MGG (Σ0).

Figure 6.6: Comparison of ΣML and Σ0 when data are sampled from a 2-D Gaussian distribution.


6.2.2 Poly-β Multivariate Generalized Gaussian

The multivariate Generalized Gaussian in Eq. (6.1) uses one unique β for all dimensions. Following the finding in Chapter 5 (the shape parameter varies with the orientation and scale of the Gabor filters) and in order to model joint statistics of Gabor coefficients, we propose a novel multivariate formulation which allows a possibly different β for each dimension, namely the Poly-β Multivariate Generalized Gaussian (Pβ-MGG). The pdf of this distribution is given by:

\[
P_{\beta,\Sigma}(\mathbf{x}) = \frac{[\det(\Sigma)]^{-1/2}}{\prod_{d=1}^{D} [Z(\beta_d)A(\beta_d)]} \times \exp\left(-\left\| \Sigma^{-1/2} \left(\mathbf{x} \div A(\beta)\right) \right\|_{\beta}\right) \tag{6.5}
\]

where β = [β1, . . . , βD]T and ÷ stands for elementwise division, i.e.

\[
\mathbf{y} = \mathbf{x} \div A(\beta) = \left[ \frac{x_1}{A(\beta_1)}, \ldots, \frac{x_D}{A(\beta_D)} \right]^T \tag{6.6}
\]

Moreover, \(\|\mathbf{x}\|_{\beta} = \sum_{d=1}^{D} |x_d|^{\beta_d}\). It is easy to see that if βd = β for all dimensions, then Eq. (6.5) is equivalent to Eq. (6.1). Appendix F derives the formulas for obtaining the Maximum Likelihood estimates of Σ and β.
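A numerical sketch of Eq. (6.5), again with our own helper names rather than the thesis implementation: the only changes with respect to the Mono-β density are the elementwise division of Eq. (6.6) and the per-dimension exponent β_d inside the norm. With all β_d equal, the code reduces to Eq. (6.1):

```python
import numpy as np
from scipy.special import gamma

def Z(b):
    return (2.0 / b) * gamma(1.0 / b)

def A(b):
    return np.sqrt(gamma(1.0 / b) / gamma(3.0 / b))

def pbmgg_pdf(x, beta, Sigma):
    """Zero-mean Poly-beta Multivariate Generalized Gaussian, Eq. (6.5)."""
    x = np.asarray(x, float)
    beta = np.asarray(beta, float)
    vals, vecs = np.linalg.eigh(np.asarray(Sigma, float))
    S_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric Sigma^{-1/2}
    w = S_inv_half @ (x / A(beta))                       # elementwise division, Eq. (6.6)
    norm = np.sum(np.abs(w) ** beta)                     # per-dimension exponent beta_d
    const = np.prod(vals) ** -0.5 / np.prod(Z(beta) * A(beta))
    return const * np.exp(-norm)
```

Setting beta = [2, 2] recovers the bivariate normal density, the same consistency check as for the Mono-β case.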

Comparing models: Mono-β versus Poly-β Multivariate Generalized Gaussians

In this section, we empirically demonstrate that the Poly-β MGG outperforms the Mono-β distribution, especially when the marginal statistics of each dimension are characterized by significantly different shape parameters. For visualization purposes, we only consider the 2-D case, taking as input data the responses of 2 different Gabor filters extracted from face images (same configuration as in Chapter 5).

The first point we would like to highlight is that if the marginal statistics of each dimension are respectively characterized by β1 and β2 (with β1 ≪ β2) and we choose the Mono-β MGG, then the ML estimate of β in the bivariate model lies in the range (β1, β2), thus trying to accommodate both behaviors. Obviously, if β1 ≈ β2, then the Mono-β model is a good choice. Consider, for instance, the case in which each dimension corresponds to the response of the 1st and 12th Gabor filters respectively. The marginal statistics yield β1 = 0.76 and β2 = 0.71, hence being quite similar. Figure 6.7 shows probability contour lines of the 2-D histogram (top), along with the Mono-β (middle-left) and the Poly-β (middle-right) densities. It is clear that both models accurately fit the polyhedral shape of the histogram. In addition,


two Gaussian Mixture Models (GMM) with 2 and 10 components, whose parameters were estimated using the Expectation-Maximization (E-M) algorithm, have been fitted to the input 2-D data. Figure 6.7 also shows the estimated densities for 2 and 10 components (bottom-left and bottom-right respectively). Clearly, the GMM with 2 components is not able to fit the shape of the original data. On the other hand, the GMM with 10 Gaussians models the joint statistics quite accurately, but at the cost of needing 59 parameters (the Mβ-MGG uses 4 parameters and the Pβ-MGG needs 5).

Consider now the next case: one of the dimensions is taken to be the response of a given Gabor filter with β1 = 0.59, while the second dimension is sampled from a Gabor response with β2 = 1.39. Figure 6.8 (top) shows the contour lines of the bivariate histogram. When trying to fit the Mono-β distribution, the final value of the ML estimate of β is found to be ≈ 0.82: contour lines of the fitted model are shown in Figure 6.8 (middle-left), visually demonstrating that it is not a good model for the input data. On the other hand, the Poly-β distribution provides a more accurate modeling of the data, as can be seen in Figure 6.8 (middle-right). This example corresponds to modeling the joint statistics of Gabor coefficients (13, 40), i.e. coefficients with different orientations and different (not adjacent) frequency bands. Once again, Gaussian Mixture Models with 2 and 10 components have been fitted (Figure 6.8, bottom). As before, only the GMM with 10 Gaussians is able to model the underlying data accurately.

In Appendices E and F we have seen that non-linear systems must be solved for estimating the parameters of both Mono- and Poly-β densities. This usually implies that initial estimates must be provided to the search algorithms for obtaining a solution which, in fact, may depend on the specific initial values. Hence, we need to study which initialization should be applied in order to obtain better results. In addition, more extensive experiments must be performed to test the accuracy of the multivariate densities in modeling joint statistics of Gabor coefficients (and other multidimensional representations like intrascale and interscale wavelet coefficient distributions). Finally, by taking advantage of the high order dependencies, some applications, such as compression and denoising, could be devised.
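As a concrete illustration of such an initial estimate, the shape parameter of a univariate Generalized Gaussian is commonly initialized by moment matching: solve r(β) = E|x| / √(E x²), where r(β) = Γ(2/β)/√(Γ(1/β)Γ(3/β)). The sketch below uses this moment-matching estimator with a bracketing root finder; it is a common initializer for the nonlinear ML search, not necessarily the exact procedure of Appendices E and F:

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

def estimate_beta(x, lo=0.05, hi=20.0):
    """Moment-matching estimate of the Generalized Gaussian shape parameter:
    solve r(b) = E|x| / sqrt(E x^2), with
    r(b) = Gamma(2/b) / sqrt(Gamma(1/b) * Gamma(3/b)),
    which is monotonically increasing in b."""
    x = np.asarray(x, float)
    x = x - x.mean()
    rho = np.abs(x).mean() / np.sqrt((x ** 2).mean())
    f = lambda b: gamma(2.0 / b) / np.sqrt(gamma(1.0 / b) * gamma(3.0 / b)) - rho
    return brentq(f, lo, hi)

rng = np.random.default_rng(0)
beta_gauss = estimate_beta(rng.normal(size=50_000))   # close to 2 (Gaussian)
beta_lap = estimate_beta(rng.laplace(size=50_000))    # close to 1 (Laplacian)
```

For reference, r(2) = √(2/π) ≈ 0.798 (the Gaussian case) and r(1) = 1/√2 (the Laplacian case), so the root finder recovers the expected shapes on sampled data.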

6.3 Generalized Gaussians for Hidden Markov Models

Hidden Markov Models (HMMs) [Rabiner, 1989] have been widely applied for sequential data analysis. They can be seen as Markovian Models in which the states are not directly observable, but whose symbols are drawn from state-specific probability density functions. HMMs have been successfully applied to many contexts (see for example [Bunke and Caelli, 2001, Cappe, 2001]), mainly due to their (computation-


Figure 6.7: Modeling joint statistics of Gabor coefficients (1,12). Contour probability lines for the 2-D histogram are shown on top. The following statistical models are displayed: Mono-β MGG (middle-left), Poly-β MGG (middle-right), GMM (2 Gaussians, bottom-left), and GMM (10 Gaussians, bottom-right).


Figure 6.8: Modeling joint statistics of Gabor coefficients (13,40). Contour probability lines for the 2-D histogram are shown on top. The following statistical models are displayed: Mono-β MGG (middle-left), Poly-β MGG (middle-right), GMM (2 Gaussians, bottom-left), and GMM (10 Gaussians, bottom-right).


ally) efficient training and evaluation algorithms, as well as their effectiveness when dealing with sequential data.

One of the core elements to be properly defined in a HMM is the emission probability function, i.e. the pdf governing the emission of symbols from each state. Because of its simplicity and the presence of analytical and efficient estimation procedures [Rabiner, 1989], the most widely adopted probability distribution for continuous-valued observations is the Mixture of Gaussians, which produces an adequate performance in many circumstances.

As seen in Chapter 5, the Generalized Gaussian density represents a suitable tool for modeling different classes of features, such as Discrete Cosine Transform (DCT) coefficients [Joshi and Fischer, 1995, Hernandez et al., 2000], wavelet transform coefficients [Do and Vetterli, 2002, Moulin and Liu, 1999, Van de Wouver et al., 1999] and steerable pyramid transform coefficients [Simoncelli and Adelson, 1996]. In this section we explore the use of Generalized Gaussians in the emission functions of a HMM, proposing a training algorithm for a novel HMM model, namely the Generalized Gaussian HMM (GG-HMM).

Since it has been shown [Bicego et al., 2003] that, given one HMM using a Mixture of Gaussians inside each state, there exists an equivalent (in a likelihood sense) HMM with more states but just one Gaussian per state (this proof could be easily extended to any kind of mixture), we will give the formulation for state-mixtures with just one component. Furthermore, this eliminates the problem of choosing the number of components in each Mixture, which is still an open problem, incorporating it into the already present problem of selecting the number of states in the HMM.

Based on the well-known Expectation Maximization (E-M) algorithm [Dempster et al., 1977, Wu, 1983], we will define a proper training algorithm for GG-HMMs, not only considering the one-dimensional case but also dealing with the Mono-β MGG introduced in Section 6.2.1. The usefulness of the proposed approach will be assessed with different synthetic and real-world examples, like EEG signal classification and face recognition, pointing out in which cases and under which circumstances the GG-HMM does outperform the standard G-HMM.

6.3.1 Fundamentals of HMMs

A discrete-time first order hidden Markov model [Rabiner, 1989] is a stochastic finite state machine defined over a set of K states S = {S1, S2, · · · , SK}. The states are hidden, i.e. not directly observable. Each state has an associated probability density function encoding the probability of observing a certain symbol being output from that state. Let Q = (Q1, Q2, . . . , QT ) be a fixed state sequence of length T with the corresponding observations o = (o1, o2, . . . , oT ). A HMM is described by a model λ, determined by a triple {A, B, π} such that

Page 180: Improvements in Pose Invariance and Local … · as´ı como la gran dimensionalidad del problema constituyen los principales obst´aculos ... en voz alta todos los mu´sculos y huesos

150 Chapter 6. Recent Results

• A = (aij) is a matrix of transition probabilities, in which aij = P(Qt = Sj | Qt−1 = Si) denotes the probability of state Sj following state Si.

• B = (bj(o)) consists of emission probabilities, in which bj(o) = P(ot = o | Qt = Sj) is the probability of emitting the symbol o when being in state Sj.

• π = (πi) is the initial state probability distribution, i.e. πi = P(Q1 = Si).

The standard training phase is carried out by training one model for each class. In the classification step, the unknown sequence is assigned to the class whose model shows the highest likelihood (Maximum Likelihood classification scheme). We did not limit ourselves to working with univariate Generalized Gaussians, but also applied the multidimensional extension proposed in [Boubchir and Fadili, 2005] and introduced in Section 6.2.
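The likelihood used in this Maximum Likelihood scheme can be evaluated with the standard scaled forward algorithm. Below is a minimal univariate sketch with Gaussian emissions (for a GG-HMM, the Gaussian pdf would simply be replaced by a Generalized Gaussian); function and variable names are illustrative, not from the thesis:

```python
import numpy as np
from scipy.stats import norm

def forward_loglik(obs, A, pi, means, sigmas):
    """Log-likelihood log P(o | lambda) of a 1-D observation sequence under
    a HMM lambda = {A, B, pi} with Gaussian emissions, computed with the
    scaled forward algorithm to avoid numerical underflow."""
    A, pi = np.asarray(A, float), np.asarray(pi, float)
    obs = np.asarray(obs, float)
    K = len(pi)
    # b[j, t] = b_j(o_t): emission likelihood of symbol o_t in state S_j
    b = np.array([norm.pdf(obs, means[j], sigmas[j]) for j in range(K)])
    alpha = pi * b[:, 0]
    loglik = 0.0
    for t in range(1, len(obs) + 1):
        c = alpha.sum()          # scaling factor c_t
        loglik += np.log(c)
        alpha = alpha / c
        if t < len(obs):
            alpha = (A.T @ alpha) * b[:, t]
    return loglik

# Two slightly shifted models: a sequence drawn near the first one
# should score higher under it than under the shifted competitor.
A = np.full((2, 2), 0.5)
pi = np.array([0.5, 0.5])
obs = np.array([1.0, 1.0, 3.0])
ll_1 = forward_loglik(obs, A, pi, means=[1.0, 3.0], sigmas=[1.0, 1.0])
ll_2 = forward_loglik(obs, A, pi, means=[1.2, 3.2], sigmas=[1.0, 1.0])
```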

The training algorithm for standard Gaussian HMMs is known as the Baum-Welch re-estimation procedure [Rabiner, 1989], and it is based on the well-known Expectation Maximization (E-M) algorithm. Appendix G describes the extension to deal with Generalized Gaussian HMMs, maintaining as much as possible the notation used in [Rabiner, 1989].

It is well known that the E-M algorithm is very sensitive to the problem of initialization. In all our experiments, we initialized A and π randomly, whereas B was initialized by clustering. In particular, the set of points derived from unrolling the training sequences was clustered into K clusters (with K the number of states). Afterwards, the data belonging to each cluster were modeled using a Gaussian, whose estimated parameters were used to initialize each GG mean (µk) and covariance (Σk). Consequently, each βk was initialized to 2.
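A sketch of this clustering-based initialization, assuming training sequences stored as (T, d) arrays; k-means stands in for whatever clustering the original implementation used, and all names are illustrative:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def init_gg_states(train_seqs, K, seed=0):
    """Clustering-based initialization of the GG emission parameters:
    unroll the training sequences, cluster the points into K groups
    (one per state), fit a Gaussian (mu_k, Sigma_k) to each cluster,
    and start every shape parameter beta_k at 2 (the Gaussian case)."""
    pts = np.concatenate(train_seqs, axis=0)   # unroll: (N, d)
    np.random.seed(seed)                       # kmeans2 draws from np.random
    _, labels = kmeans2(pts, K, minit='points')
    states = []
    for k in range(K):
        cluster = pts[labels == k]
        states.append({"mu": cluster.mean(axis=0),
                       "Sigma": np.atleast_2d(np.cov(cluster, rowvar=False)),
                       "beta": 2.0})
    return states
```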

6.3.2 Experimental evaluation

Classification problems involving both synthetic and real data have been devised in order to compare the performance of Generalized Gaussian emission HMMs (GG-HMMs) with standard Gaussian emission HMMs (G-HMMs). This section describes the details of such experiments, discussing the obtained results.

Experiments on Synthetic Data

For the first two experiments, synthetic data drawn from known univariate models have been used.

Experiment 1

The first experiment is a two-class problem, where each sequence is generated from one of the two synthetic 2-state HMMs displayed in Fig. 6.9(a). From the observation of the HMMs' parameters we notice that the emission probabilities of


(a)

Class   A                    π          B
1       [0.5 0.5; 0.5 0.5]   [0.5 0.5]  [N(1,1), N(3,1)]
2       [0.5 0.5; 0.5 0.5]   [0.5 0.5]  [N(1.2,1), N(3.2,1)]

(b)

Class   A                    π          B
1       [0.5 0.5; 0.5 0.5]   [0.5 0.5]  [N(1,1), U[2,4]]
2       [0.5 0.5; 0.5 0.5]   [0.5 0.5]  [N(1.1,1), U[2.1,4.1]]

Figure 6.9: Generating HMMs for the synthetic problems: (a) first experiment, (b) second experiment. Note that N(µ,σ) represents a Gaussian distribution with mean µ and variance σ; U[a,b] represents a uniform distribution in the interval [a, b].

both states are Gaussian. Moreover, since the two classes are generated from very similar HMMs (only differing in the means of the Gaussians, which are slightly shifted), the problem is quite difficult. The experimental framework is completed with the following numbers:

1. All the sequences have length 100.

2. The number of sequences that are used to train each class has been varied, increasing from 5 to 50 training sequences per class.

3. 100 testing sequences per class have been generated. Moreover, in order to have statistically significant results, all the experiments were repeated 50 times, finally averaging the obtained performances.
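Generating such sequences is straightforward; below is an illustrative sampler for the class-1 model of Fig. 6.9(a) (uniform transitions, emissions N(1,1) and N(3,1)). Names are hypothetical, not from the thesis code:

```python
import numpy as np

def sample_hmm(A, pi, emit, T, rng):
    """Draw one length-T observation sequence from a HMM with Gaussian
    emissions; `emit` lists one (mean, std) pair per state."""
    A, pi = np.asarray(A), np.asarray(pi)
    obs = np.empty(T)
    state = rng.choice(len(pi), p=pi)       # draw initial state from pi
    for t in range(T):
        mu, sd = emit[state]
        obs[t] = rng.normal(mu, sd)         # emit a symbol from the state pdf
        state = rng.choice(len(A), p=A[state])  # transition via row of A
    return obs

rng = np.random.default_rng(0)
# Class 1 of Experiment 1: both emission variances are 1, so std = 1
seq = sample_hmm([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5],
                 [(1.0, 1.0), (3.0, 1.0)], 100, rng)
```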

Fig. 6.10 shows the averaged accuracies (as well as the corresponding standard errors of the mean) for both G-HMM and GG-HMM when varying the number of training sequences (learning curves). From this figure, we can notice that GG-HMMs perform as well as standard G-HMMs whenever enough training sequences are available (20 sequences seem to be enough). This behavior seems reasonable since GG-HMMs have an additional parameter, and hence more data are needed to obtain accurate estimates of the model.

Experiment 2

In this case, the emission probability of one of the states is Gaussian whereas the other is uniform (see Fig. 6.9(b) for the HMMs' parameters). Therefore, this problem represents a clear case where the Gaussian distribution is not adequate to model the underlying data. The experimental setup is the same as in Experiment 1 (training/testing sequences, etc.). As in the former problem, Experiment 2 is quite difficult since the parameters that define both classes are very similar.


Figure 6.10: Synthetic experiment 1 (accuracy vs. number of training sequences). Generating emission functions are all Gaussian.


Figure 6.11: Synthetic experiment 2 (accuracy vs. number of training sequences). Underlying data is not Gaussian.

Fig. 6.11 shows the averaged accuracies (as well as the corresponding standard errors of the mean) for both G-HMM and GG-HMM. It is clear that GG-HMMs outperform standard G-HMMs when the underlying data cannot be assumed to be Gaussian.

Experiments on Real Data

In order to assess the behavior of the proposed methodology in real-world scenarios, two different classification problems are tested: a) EEG signal classification, and b) face recognition. For all experiments, the best number of states of the HMMs was selected using the well-known Bayesian Information Criterion (BIC) [Schwarz, 1978], among values in the range [2-10]. Initialization of GG-HMM training was performed by clustering (as explained in Section 6.3.1). The same scheme was applied for G-HMM, in order to focus on the comparison between the two approaches independently of the initial parameters. When dealing with multidimensional data, full covariance matrices were employed. Finally, training algorithms were stopped at likelihood convergence. We would like to remark that in all experiments a couple of iterations was enough for the proposed GG-HMM training algorithm to converge.

Experiment 1: EEG signal classification


The first experiment represents a standard KDD-UCI problem [Bay et al., 2000], aiming at distinguishing between alcoholic and non-alcoholic (control) subjects based on their recorded EEG signals†. There are three different versions of the data. In our case, we use the Large DataSet, in which the training and test sets are already pre-defined. The training set contains data for 10 alcoholic and 10 control subjects, with 10 runs per subject for three different experimental paradigms (600 training sequences in total). The test data use the same alcoholic and control subjects, but with 10 out-of-sample runs per subject per paradigm (600 test sequences). Each sequence contains 256 symbols, each of 64 dimensions (the 64 electrodes of the EEG).

In order to compare the univariate GG-HMM against the G-HMM, we conducted 64 experiments using one channel at a time, finally averaging the 64 obtained accuracies. The results are shown in the first row of Table 6.1 (together with the standard error of the mean). It can be noticed that GG-HMM performs slightly better. The second row of Table 6.1 shows the comparison between multivariate GG-HMMs and multivariate G-HMMs. For this experiment the whole set of 64 channels was used and, as can be seen, the use of GG yields a remarkable improvement.

Problem        G-HMM            GG-HMM
1 channel      59.60% (0.47%)   61.07% (0.46%)
64 channels    93.67%           97.50%

Table 6.1: Accuracies of G-HMM and GG-HMM for the two EEG experiments.

Experiment 2: Face Recognition

The second tested application regards face recognition. Face images were obtained by acquiring two videos corresponding to two different sessions for each subject in the database (24 subjects in total). An average of 380 images was acquired for each subject, with 9106 images in total. Fig. 6.12 shows some faces from the two recorded sessions. The variability of acquisition scenarios can be assessed by comparing odd rows (session 1) against even rows (session 2). For each video, a face-like mesh comprising 50 facial features has been automatically aligned in every frame of the sequence (see Fig. 6.13 for an example), and 8 Gabor filters (2 scales and 4 orientations) were used to extract feature vectors from each of the nodes in the mesh.
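A bank of 2 scales × 4 orientations = 8 complex Gabor kernels can be constructed as sketched below. The parameterization follows the common Lades/Wiskott convention (k_v = 2^{-(v+2)/2}π); the exact frequencies, window size and σ used in the thesis may differ, and the `jets_at` helper is hypothetical:

```python
import numpy as np

def gabor_bank(size=32, scales=(2, 3), orientations=4, sigma=2 * np.pi):
    """Bank of complex, DC-free Gabor kernels: len(scales) x orientations
    filters (here 2 x 4 = 8, as in the experiment above)."""
    ys, xs = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
    bank = []
    for v in scales:
        k = (2.0 ** (-(v + 2) / 2.0)) * np.pi        # radial frequency
        for mu in range(orientations):
            phi = mu * np.pi / orientations          # orientation angle
            kx, ky = k * np.cos(phi), k * np.sin(phi)
            gauss = (k * k / sigma ** 2) * np.exp(
                -k * k * (xs ** 2 + ys ** 2) / (2 * sigma ** 2))
            wave = np.exp(1j * (kx * xs + ky * ys)) - np.exp(-sigma ** 2 / 2)
            bank.append(gauss * wave)
    return bank

def jets_at(img, x, y, bank, s=16):
    """Gabor jet (magnitudes of the 8 filter responses) at pixel (x, y)."""
    patch = img[y - s:y + s, x - s:x + s]
    return np.array([np.abs(np.sum(patch * kern)) for kern in bank])
```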

For each image, the sequence to be used by the HMM was obtained by scanning the mesh nodes in a predefined order, similarly to what is done in [Kohir and Desai, 1998], therefore resulting in a sequence for each face image. Due to the large dimensionality of the data set, accuracies were computed using holdout cross validation: half of the set (randomly chosen) was used for training the HMMs, one for each subject, whereas the remaining part was used for testing. Classification accuracies using the

†See http://kdd.ics.uci.edu/databases/eeg/eeg.html for all the details on the data


Figure 6.12: Some face images used for testing.

Figure 6.13: Fitted mesh on one frame (frame 11).


8-coefficient feature vectors are displayed in the first row of Table 6.2, showing that both methods perform almost perfectly.

Problem          G-HMM    GG-HMM
8 coefficients   98.54%   98.05%
2 coefficients   92.45%   95.62%
1 coefficient    75.49%   83.99%

Table 6.2: Accuracies of G-HMM and GG-HMM for the three face recognition experiments.

In order to increase the difficulty of the classification problem, for the following experiments we only considered 2 and 1 coefficients from each feature vector. Results are shown in the second and third rows of Table 6.2, clearly demonstrating that GG-HMM outperforms the standard G-HMM (especially with the most difficult task, i.e. 1 coefficient).

There are two reasons that can explain why the standard Gaussian HMM performs slightly better than the proposed GG-HMM in the 8-coefficient experiment (and also why G-HMM's performance gets closer to that of GG-HMM when using 2 coefficients):

• Wrong estimation of the covariance matrix. As stated previously, when dealing with multidimensional signals, the Mono-β Multivariate Generalized Gaussian (with Σ = cov(X)) model was used. We have seen that the Maximum Likelihood estimate of Σ can offer better results and hence, we need to perform experiments using ΣML.

• Wrong model used. It has been shown in Chapter 5 that each Gabor coefficient can be accurately modeled by a univariate Generalized Gaussian, and that the values of the shape parameter do vary depending on the specific coefficient. Moreover, we have seen in Section 6.2.2 that the Poly-β MGG offers a more accurate modeling than the Mono-β version. Since the latter distribution was the one used, the multidimensional modeling may not be as accurate as desired. Hence, we must repeat the experiments with the Poly-β density.


Chapter 7

Conclusions

During this PhD Thesis we have tackled completely different problems throughout the complex face recognition process, suggesting novel ideas and methods to achieve a robust biometric system. The main conclusions that can be drawn from this period of research are the following:

• Based on a subset of eigenvectors (pose eigenvectors) from a 2D linear PDM and using texture mapping, we have proposed methods which try to minimize differences in pose while preserving discriminative subject information. We have demonstrated that the identified pose eigenvectors are mostly responsible for rigid mesh changes, and do not contain important non-rigid (expression/identity) information that could severely distort the synthesized images. Moreover, we have seen that the use of facial symmetry does improve the performance of the recognition system. We have shown that the proposed methods achieve state-of-the-art results, outperforming the 3D morphable model [Romdhani et al., 2002] and other previous approaches in the set of rotation angles ranging from −45◦ to 45◦. In addition, we demonstrated that a 2D model can deal with rotations up to 67.5◦, obtaining similar performance to the one achieved using the more complex 3D Morphable Model. After experiments with automatic fitting via IOF-ASM [Sukno et al., 2007] on the XM2VTS database, we concluded that practically no degradation occurred.

• We have proposed a novel face recognition approach based on extracting local Gabor responses from regions that could be inherently subject-dependent by exploiting individual shape structure (using a ridges and valleys representation). The use of shape-context matching [Belongie et al., 2002] to map the sets of points from the two faces to be compared has been proven to be an appropriate choice for our purposes. We have empirically demonstrated that the distribution of shape-driven points is much more discriminative than the distribution of fiducial points as used in [Wiskott et al., 1997]. Experimental results on the


XM2VTS database show that our approach performs marginally better than an ideal EBGM without the need of localizing “universal” fiducial points. Tests on the AR face database show that our approach is robust to moderate expression and illumination variations.

• Following with Gabor-based recognition systems, we have proposed an empirical evaluation of different distances for measuring similarities between Gabor responses, assessing the impact of the specific normalization method that is applied to features before comparison. The main conclusions drawn from these empirical tests are the following: i) the type of normalization that is applied to the features is critical, therefore influencing the performance of a given distance, and ii) the classical cosine distance does not always achieve the best performance, being outperformed by distances such as the Modified Manhattan (when L2 normalization is applied).

• An extensive evaluation of different fusion techniques for combining local Gabor similarities has been proposed. In particular, Support Vector Machines (SVMs), Adaboost, neural networks, a variant of Linear Discriminant Analysis, Sequential Forward Feature Selection (SFFS), Best Individual Features (BIF) and the proposed Accuracy-based Feature Selection (AFS) were compared. Even though SFFS has proved to provide better results than BIF, in the specific authentication scenario considered (with few training data available for each user), SFFS is not able to correctly estimate the existing interactions between features, and therefore does not achieve good generalization on the test set, whilst the less complex BIF performs quite well in such conditions. Along the same line, we would like to emphasize that simple tools such as BIF (and the closely related AFS) provide comparable results to more complex schemes such as SVMs and Adaboost.

• It has been shown that marginal distributions of Gabor coefficients (extracted from face images) can be accurately modeled using univariate Generalized Gaussians. Empirical evaluations against the Bessel K Forms density [Srivastava et al., 2002, Fadili and Boubchir, 2005] have demonstrated the benefits of the Generalized Gaussian in this specific scenario. Lloyd-Max quantization has been applied for data compression without noticeable performance degradation. In addition, multivariate characterization of Gabor coefficients has also been considered for study. A novel multivariate extension of the Generalized Gaussian has been proposed and tested with success in limited experiments. The Generalized Gaussian (both uni- and multivariate) has also been employed for HMMs in recent experiments.

• Following [Matthews and Baker, 2003], we implemented a semi-automatic face


tracker in MATLAB. Tracked lip coordinates output from this system have been used in experiments on the BANCA database for detecting audio and video asynchrony. In addition, face tracking and pose correction were tested in limited experiments on the BIOSECURE DS1 database.

7.1 Future Research

Although most of them have already been outlined at the end of the previous chapters, this section compiles the main lines of future research that have been opened:

• Regarding pose correction, the immediate step is to fully automate the process using a face alignment technique and test it on appropriate databases. Pose, occlusion, illumination and expression are some of the major challenges that should be dealt with in the fitting process. A view-based weighting function should be designed, so that depending on the current pose of the face, some regions get more importance than others in the computation of the final similarity score. One of the drawbacks of the proposed methods is that the generation of virtual images is time consuming, since it involves the use of texture mapping. One possible way to reduce this computational burden is to learn a degradation function that estimates the dependency of the local similarity score degradation on the pose of the face, i.e. the pose parameters. Therefore, instead of computing virtual images, the original ones are compared and the similarity scores are modified based on the difference between the pose parameters of both faces.

• Subject-specific face recognition has been addressed by exploiting face structure in the Shape Driven Gabor Jets method. In this regard, the combination of shape and texture scores should be improved for more reliable authentication. Another way to enhance the performance of the recognition system is to perform fusion not at the node but at the coefficient level. The reasoning behind this is that a Gabor filter with a given orientation and frequency may be descriptive in some face regions, whilst useless in others. Therefore, similarity fusion at the coefficient level should help to improve the classification ability of the system. In addition, other ways to obtain inherent client-specific representations of the face should be studied. Further research must also focus on obtaining representations that are more robust to both lighting and expression variations.

• Provide a more extensive evaluation of distance measures for Gabor jet comparison (more distances and normalizations, different databases and Gabor-based methods). In addition, research in order to find an optimal way (from a theoretical point of view) to compare jets should be conducted.


• More experiments are needed to assess the ability of the newly proposed multivariate formulation in modeling joint statistics of Gabor coefficients. If good results are obtained, applications that take advantage of the high order dependencies, such as image compression or denoising, should be designed. Moreover, further experiments are needed to assess the usefulness of the Generalized Gaussian in the Hidden Markov Model framework.

• Face processing in videos is another important research line. Automating the tracking module is the first step, as well as obtaining robustness against appearance variability of the tracked object, i.e. the face. Designing a system that fully exploits the temporal evolution of the face in the video is another milestone that should be achieved. In addition, further experiments should be conducted to evaluate the applicability of pose correction in video sequences.


Appendices



Appendix A

Face Databases

Contents

A.1 AR Face Database . . . . . . . . . . . . . . . . . . . . . . . 163

A.2 BANCA Database . . . . . . . . . . . . . . . . . . . . . . . 164

A.2.1 BANCA Protocols . . . . . . . . . . . . . . . . . . . . . . . 165

A.3 CMU PIE Database . . . . . . . . . . . . . . . . . . . . . . 166

A.4 XM2VTS Database . . . . . . . . . . . . . . . . . . . . . . 167

A.4.1 Lausanne Protocol for the XM2VTS Database . . . . . . . 169

This appendix describes the four databases that have been used for testing during this PhD Thesis: the AR face database [Martínez and Benavente, 1998], the BANCA database [Bailly-Bailliere et al., 2003], the CMU PIE database [Sim et al., 2003] and the XM2VTS database [Messer et al., 1999]. Moreover, the authentication protocols devised for both the XM2VTS and BANCA databases are also provided.

A.1 AR Face Database

This face database was created by Aleix Martínez and Robert Benavente at the Computer Vision Center (CVC) at the Universidad Autonoma de Barcelona. It contains over 4000 color images corresponding to 126 people's faces (70 men and 56 women). Images feature frontal-view faces with different facial expressions (neutral, angry, smiling and screaming), illumination conditions (ambient, right light on, left light on, both lights on), and occlusions (sun glasses and scarf). The pictures were taken at the CVC under strictly controlled conditions. No restrictions on wear (clothes, glasses, etc.), make-up, hair style, etc. were imposed on participants. Each person participated in two sessions, separated by two weeks (14 days). The pictures taken in both sessions were acquired under the same conditions (see Figure A.1 for

163


an example). This face database is publicly available and can be obtained from http://cobweb.ecn.purdue.edu/~aleix/aleix_face_DB.html.


Figure A.1: Face images from the AR face database. Top row shows images from the first session: a) Neutral, b) Smile, c) Anger, d) Scream, e) Left light on, f) Right light on, and g) Both lights on, while bottom row presents the shots recorded during the second session: h)-n).

A.2 BANCA Database

The BANCA database is a large, realistic and challenging multi-modal database intended for training and testing both mono- and multi-modal verification systems. The BANCA database was captured in four European languages (English, French, Italian and Spanish) in two modalities (face and voice). The subjects were recorded in three different scenarios: controlled, degraded and adverse, over 12 different sessions spanning three months. In total 208 people were captured, half men and half women.

To record the database, two different cameras were used: a cheap analogue webcam and a high quality digital camera. For the duration of the recordings the cameras were left in automatic mode. In parallel, two microphones, a poor quality one and a good quality one, were used. The database was recorded onto a PAL DV system. PAL DV is a proprietary format which captures video at a colour sampling resolution of 4:2:0. The audio was captured in both 16 bit and 12 bit, with a sampling frequency of 32 kHz. The video data are lossy compressed at the fixed ratio of 5:1. The audio data remain uncompressed. This format also defines a frame-accurate timecode which is stored on the cassette along with the audio-visual data. This video hardware can easily be interfaced to a computer, allowing frame-accurate retrieval of the data in the database onto the computer disk.

Each language- and gender-specific population was itself subdivided into 2 groups of 13 subjects, denoted in the following G1 and G2. Each subject recorded 12 sessions,


each of these sessions containing 2 recordings: 1 true client access and 1 informed impostor attack (the actual subject knew the text that the claimed identity subject was supposed to utter). For different sessions, the impostor attack information changed to another person in their group. The 12 sessions were separated into 3 different scenarios:

• controlled, sessions 1-4,

• degraded, sessions 5-8,

• adverse, sessions 9-12.

The webcam was used in the degraded scenario, while the expensive camera was used in the controlled and adverse scenarios (see Figure A.2 for image examples). The two microphones were used simultaneously in each of the three scenarios, with each output being recorded onto a separate track of the DV tape. During each recording, the subject was prompted to say a random 12-digit number, his/her name, their address and date of birth. Each recording took an average of twenty seconds.

Figure A.2: Examples of images from the controlled, degraded and adverse conditions of the BANCA database.

A.2.1 BANCA Protocols

In [Bailly-Bailliere et al., 2003], the authors propose seven different protocols using the English part of the BANCA database. Each experimental configuration defines which data are used for training and which are used for testing. The seven experimental configurations considered are:

• Matched Controlled (MC),

• Matched Adverse (MA)

• Matched Degraded (MD)


• Unmatched Controlled (UC)

• Unmatched Adverse (UA)

• Pooled test (P)

• Grand test (G)

In order to avoid any methodological flaw, it is necessary to use two disjoint sets: one for choosing system parameters such as thresholds, and another for assessing system performance. In the BANCA nomenclature, [Bailly-Bailliere et al., 2003] refers to development and evaluation sets. The development set comprises the data on which the system can be adjusted by setting thresholds, etc. On the other hand, the evaluation set contains the data used to assess system performance∗. For this reason, the two disjoint subsets referenced above, G1 and G2, were created: when G1 is used as the development set, G2 is used for evaluation and vice versa.

In [Bailly-Bailliere et al., 2003], three specific operating conditions corresponding to three different values of the Cost Ratio R = FAR/FRR, namely R = 0.1, R = 1 and R = 10, have been considered. Assuming equal a priori probabilities of genuine clients and impostors, these situations correspond to three quite distinct cases:

• R = 0.1, FAR is an order of magnitude less harmful than FRR,

• R = 1, FAR and FRR are equally harmful,

• R = 10, FAR is an order of magnitude more harmful than FRR.

The so-called Weighted Error Rate (WER), given by:

WER(R) = (FRR + R · FAR) / (1 + R)   (A.1)

is measured for the test data of groups G1 and G2 at the three different values of R.
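Equation (A.1) can be transcribed directly; the snippet below is an illustrative sketch (the function name and the FAR/FRR values are ours, not part of the BANCA protocol):

```python
def wer(far, frr, r):
    """Weighted Error Rate of Eq. (A.1): WER(R) = (FRR + R*FAR) / (1 + R)."""
    return (frr + r * far) / (1.0 + r)

# the three BANCA operating points: R = 0.1, 1 and 10
operating_points = [wer(far=0.04, frr=0.10, r=r) for r in (0.1, 1.0, 10.0)]
```

Note that for R = 1 the WER reduces to the Half Total Error Rate (FAR + FRR)/2.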

A.3 CMU PIE Database

Between October and December 2000, a database of 41,368 images from 68 people was collected at the Carnegie Mellon University (CMU). By extending the CMU 3D Room [Kanade et al., 1998], each person was captured under 13 different poses, 43 different illumination conditions, and with 4 different expressions. Consequently, this database was called the CMU Pose, Illumination, and Expression (PIE) database. Each

∗Note that the concept of evaluation set is different in the BANCA and XM2VTS protocols: BANCA's evaluation set is analogous to XM2VTS's test set.


subject in the database was asked to sit in a chair with his head against a pole to fix the head position. 13 Sony DXC 9000 (3 CCD, progressive scan) cameras with all gain and gamma correction turned off were used to capture different viewpoints of the subject. More precisely, 9 of the 13 cameras were located at roughly head height in an arc from approximately full left profile to full right profile. Each neighboring pair of these 9 cameras is approximately 22.5° apart. Of the remaining 4 cameras, 2 were placed above and below the central camera (c27), and 2 were placed in the corners of the room, where surveillance cameras are typically located. The 3D Room was augmented with 21 Minolta 220X flashes controlled by an Advantech PCL-734 digital output board, duplicating the Yale "flash dome" [Georghiades et al., 2001]. The xyz-locations of the head position, the 13 cameras, and the 21 flashes were measured with a Leica theodolite and included in the meta-data. Figure A.3 shows the CMU 3D Room as well as a diagram of the locations of the cameras, flashes and the head of the subject.

In this PhD Thesis, we used the pose subset of the CMU PIE database for testing the robustness of systems against viewpoint variations. Figure A.4 shows the images captured from the 13 cameras under ambient lighting for subject 04006.

Figure A.3: Left: Setup of the CMU 3D Room [Kanade et al., 1998]. Right: Diagram of the locations of the cameras, flashes and the head of the subject.

A.4 XM2VTS Database

Collected at the Centre for Vision, Speech and Signal Processing of the University of Surrey, the XM2VTS database contains synchronized image and speech data recorded from 295 subjects (randomly divided into 200 clients, 25 evaluation impostors, and 70


Figure A.4: Images taken from all cameras of the CMU PIE database for subject 04006. The 9 cameras in the horizontal sweep are each separated by about 22.5° [Sim et al., 2003].

test impostors) during four sessions taken at one-month intervals. On each visit (session) two recordings (shots) were captured. The first shot consisted of speech whilst the second consisted of rotating head movements. The entire database was acquired using a Sony VX1000E digital camcorder and a DHR1000UX digital VCR.

In the speech shot, the subject, with a clip-on microphone attached, was asked to read three sentences which were written on a board positioned just below the camera. The subjects were asked to read at their normal pace, to pause briefly at the end of each sentence and to read through the three sentences twice. The three sentences remained the same throughout all four recording sessions and were:

• “zero one two three four five six seven eight nine”

• “five zero six nine two eight one three seven four”

• “Joe took father’s green shoe bench out”

All the sentences from the database have been grabbed into separate audio files, a total of 7080 files. The audio was stored in mono, 16-bit, 32 kHz PCM wave files.

The second shot consisted of a sequence of rotating head movements. The subject was asked to rotate his/her head from the centre to the left, to the right, then up, then down, finally returning it to the centre. They were told that a full side-profile was required and asked to run through the entire sequence twice. The images were stored in colour PPM format at a resolution of 720x576. Two frontal face images were extracted from each rotating sequence. The 4 × 2 frontal face images that were recorded for each subject were used to devise the Lausanne protocol [Luttin and Maître, 1998] (see Figure A.5 for examples of frontal face images).


Figure A.5: Frontal face images from the XM2VTS database

A.4.1 Lausanne Protocol for the XM2VTS Database

The Lausanne protocol was designed to measure system performance in an authentication scenario. According to this protocol, the database was randomly divided into three sets: a training set, an evaluation set, and a test set. The training set was used to build client models, while the evaluation set was used to estimate thresholds. Finally, the test set was employed to assess system performance. Configurations I and II of the Lausanne protocol differ in the distribution of client training and client evaluation data, with configuration II representing the most realistic case. In configuration I there are:

• 3 training images per client (the first image of sessions 1, 2 and 3).

• 3 evaluation images per client (the second image of sessions 1, 2 and 3).

On the other hand, configuration II is characterized by:

• 4 training images per client (the two images from sessions 1 and 2).

• 2 evaluation images per client (the two images of session 3).

Both configurations share the same data for client testing (the two images corresponding to the fourth session) and impostor evaluation and testing (the eight images of the corresponding partitions).


As stated in Section 1.4.1, two commonly used error measures of a verification system are the False Acceptance Rate (FAR, defined as the number of impostors that are incorrectly accepted by the system divided by the number of impostor attempts) and the False Rejection Rate (FRR, given by the number of clients that are incorrectly rejected by the system divided by the number of client trials). Clearly, there is a trade-off between both error rates: we can reduce either FAR or FRR by modifying a decision threshold, at the risk of increasing the other one. Hence, the performance of a system should be given by the FAR and FRR (measured on the separate test set) at a specific pre-defined threshold (chosen using the evaluation data). In practice, it is very common to choose the threshold (the Equal Error Rate (EER) threshold) so that FAR equals FRR on the evaluation set. Afterwards, this a priori EER threshold is used to measure FAR and FRR on the separate test set.
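The a priori EER threshold selection just described can be sketched as follows; this is a minimal illustration with names and score values of our own choosing, not the thesis implementation:

```python
def eer_threshold(client_scores, impostor_scores):
    """Pick the threshold on the evaluation set where FAR is closest to FRR."""
    def far(thr):  # fraction of impostors wrongly accepted
        return sum(s >= thr for s in impostor_scores) / len(impostor_scores)

    def frr(thr):  # fraction of clients wrongly rejected
        return sum(s < thr for s in client_scores) / len(client_scores)

    candidates = sorted(set(client_scores) | set(impostor_scores))
    return min(candidates, key=lambda thr: abs(far(thr) - frr(thr)))

# threshold chosen on (toy) evaluation data; it is then applied to the test set
thr = eer_threshold([0.8, 0.9, 0.7], [0.1, 0.2, 0.3])
```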


Appendix B

Statistical Significance of TER Measures

[Bengio and Mariethoz, 2004] adapt statistical tests to compute confidence intervals around Half Total Error Rate (HTER = TER/2) measures, and to assess whether there exist statistically significant differences between two approaches or not. Given methods A and B with respective performances HTER_A and HTER_B, we compute a confidence interval (CI) around ∆HTER = HTER_A − HTER_B. Clearly, if the range of obtained values is (approximately) symmetric around 0, we cannot say the two methods are different. The confidence interval is given by ∆HTER ± σ · Z_{α/2}, where

σ = √( [FAR_A(1 − FAR_A) + FAR_B(1 − FAR_B)] / (4·NI) + [FRR_A(1 − FRR_A) + FRR_B(1 − FRR_B)] / (4·NC) )   (B.1)

and

Z_{α/2} = 1.645 for a 90% CI, 1.960 for a 95% CI, 2.576 for a 99% CI   (B.2)

In Equation (B.1), NC stands for the number of client accesses, while NI stands for the number of impostor trials.
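Equations (B.1) and (B.2) can be transcribed directly; the following sketch (function and variable names are ours) returns the interval around ∆HTER:

```python
import math

Z_ALPHA = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}   # Eq. (B.2)

def hter_ci(far_a, frr_a, far_b, frr_b, n_i, n_c, level=0.95):
    """Confidence interval around Delta-HTER = HTER_A - HTER_B, Eqs. (B.1)-(B.2)."""
    delta = (far_a + frr_a) / 2.0 - (far_b + frr_b) / 2.0
    sigma = math.sqrt(
        (far_a * (1 - far_a) + far_b * (1 - far_b)) / (4.0 * n_i)
        + (frr_a * (1 - frr_a) + frr_b * (1 - frr_b)) / (4.0 * n_c))
    half = sigma * Z_ALPHA[level]
    return delta - half, delta + half
```

If the returned interval contains 0, the difference between the two methods is not statistically significant at the chosen level.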


Appendix C

Active Shape Models with Invariant Optimal Features (IOF-ASM)

Active Shape Models with Invariant Optimal Features (IOF-ASM) [Sukno et al., 2007] is a statistical modelling method specifically designed and tested to handle the complexities of facial images. The algorithm learns the shape statistics as in the original ASMs [Cootes et al., 1995] but improves the local texture description by using a set of differential invariants combined with non-linear classifiers. As a result, IOF-ASM produces a more accurate segmentation of the facial features [Sukno et al., 2007].

The matching procedure is summarized in Algorithm 1. In line 1, the image is preprocessed to obtain a set of differential invariants. These invariants are the core of the method and consist of combinations of partial derivatives that are invariant to rigid transformations [Walker et al., 1997, Schmid and Mohr, 1997]. Moreover, IOF-ASM uses a minimal set of order K so that any other algebraic invariant up to order K can be reduced to a linear combination of elements of this minimal set [Florack, 1993].

The other key point of the algorithm lies between lines 6 and 14. For each landmark, an image-driven search is performed to determine the best position for it to be placed. The process starts by sampling the invariants in a neighborhood of the landmark (line 7). In IOF-ASM this neighborhood is represented by a rectangular grid, whose dimensions are parameters of the model. A non-linear texture classifier analyzes the sampled data to determine if the local structure of the image is compatible with the one learnt during training for this landmark. A predefined number of displacements is allowed for the position of the landmark (perpendicular to the boundary, as in [Cootes et al., 1995]), so that the texture classifier analyzes several candidate positions. Once the best candidate is found, say (xB, yB), the matching between its


Algorithm 1 IOF-ASM matching to a new image

1: Compute invariants for the whole image
2: T = initial transformation guess for face position and size
3: X = X̄ (modelShape = meanShape)
4: for i = 1 to number of iterations do
5:    Project shape to image coordinates: Y = TX
6:    for l = 1 to number of landmarks do
7:       Sample invariants around the l-th landmark
8:       Determine the best candidate point to place the landmark
9:       if the best candidate is good enough then
10:         Move the landmark to the best candidate point
11:      else
12:         Keep the previous landmark position (do not move)
13:      end if
14:   end for
15:   Let the shape with the new positions be Y
16:   Update T and PDM parameters: b = P^T (T^{-1} Y − X̄)
17:   Apply PDM constraints: b = PdmConstrain(b, β)
18:   Get new model shape: X = X̄ + P b
19: end for

local image structure and the one learnt during training is verified (line 9) by means of a robust metric [Huber, 1981]. The applied metric consists of the evaluation of the sampled data grouped according to its distance perpendicular to the shape boundary. Grouped this way, the samples can be organized in a one-dimensional profile of length lP. Based on the output from the texture classifier, each position on this profile will result in a supporting point or an outlier (the supporting points are those profile points suggesting that (xB, yB) is the best position for the landmark to be placed, while outliers indicate a different position and, therefore, suggest that (xB, yB) is incorrect). If the supporting points are (at least) two thirds of lP, then the matching is considered accurate and the landmark is moved to the new position. Otherwise the matching is not trustworthy (i.e. the image structure does not clearly suggest a landmark) and the landmark position is kept unchanged (see [Sukno et al., 2007] for further details).

The constraints of line 17 ensure that the obtained shape is plausible according to the learnt statistics (i.e. it looks like a face). For this purpose, each component of b is limited so that |b_k| ≤ β√λ_k (1 ≤ k ≤ t), where t is the number of modes of variation of the PDM, λ_k is the eigenvalue associated with the k-th mode and β is a constant, usually set between 1 and 3, that controls the degree of flexibility of the PDM (see [Cootes et al., 1995]).
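The constraint of line 17 amounts to clamping each shape coefficient; a minimal sketch, assuming b and the eigenvalues λ_k are plain arrays (the function name is ours):

```python
import numpy as np

def pdm_constrain(b, eigvals, beta=3.0):
    """Clamp each shape coefficient so that |b_k| <= beta * sqrt(lambda_k)."""
    limits = beta * np.sqrt(np.asarray(eigvals))
    return np.clip(b, -limits, limits)
```

Coefficients already inside the valid range are left untouched, so plausible shapes pass through the constraint unchanged.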


Appendix D

Estimation of Univariate Generalized Gaussian Parameters

This appendix derives the formulas for the estimation of the univariate Generalized Gaussian parameters β and σ using two different methods: a Maximum Likelihood (ML) approach and a moments-based procedure.

The zero-mean pdf of a univariate Generalized Gaussian is given by (Eq. 5.1):

P_{β,σ}(x) = 1/(Z(β) σ A(β)) · exp( −| x/(σ A(β)) |^β )   (D.1)

with

Z(β) = (2/β) Γ(1/β)   (D.2)

A(β) = √( Γ(1/β) / Γ(3/β) )   (D.3)

D.1 Maximum Likelihood Parameter Estimation

The log-likelihood function of a given set of n points {x^k}, k = 1, 2, ..., n, is given by:

LL = Σ_{k=1}^{n} log P_{β,σ}(x^k) = −n log(σ) − n [log Z(β) + log A(β)] − Σ_{k=1}^{n} | x^k/(σ A(β)) |^β   (D.4)


In order to obtain ML estimates of β and σ, the derivatives of (D.4) w.r.t. both parameters must be set to zero. First, ∂LL/∂β is given by:

∂LL/∂β = −n [ Z′(β)/Z(β) + A′(β)/A(β) ] − Σ_{k=1}^{n} | x^k/(σA(β)) |^β × [ log| x^k/(σA(β)) | − β A′(β)/A(β) ] = 0   (D.5)

where

Z′(β)/Z(β) = −(1/β) [ 1 + (1/β) Ψ(1/β) ]   (D.6)

A′(β)/A(β) = (1/(2β²)) [ 3Ψ(3/β) − Ψ(1/β) ]   (D.7)

Moreover, ∂LL/∂σ is as follows:

∂LL/∂σ = −n/σ + (β σ^{−β} / (σ A(β)^β)) Σ_{k=1}^{n} |x^k|^β = 0  ⇒  σ = (1/A(β)) ( (β/n) Σ_{k=1}^{n} |x^k|^β )^{1/β}   (D.8)

To obtain the ML estimate of β, Eq. (D.8) is substituted into (D.5), and β is solved for numerically. This derivation is analogous to the one obtained in [Do and Vetterli, 2002].
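Equivalently, one may substitute the σ of (D.8) into the log-likelihood (D.4) and maximize the resulting one-dimensional profile in β with a scalar optimizer; a sketch under that assumption (all names are ours, and the Laplacian test data are for illustration only):

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

def profile_ll(beta, x):
    """Log-likelihood (D.4) with the ML sigma of (D.8) substituted."""
    n = len(x)
    logA = 0.5 * (gammaln(1.0 / beta) - gammaln(3.0 / beta))   # log A(beta), Eq. (D.3)
    logZ = np.log(2.0) - np.log(beta) + gammaln(1.0 / beta)    # log Z(beta), Eq. (D.2)
    sigma = np.exp(-logA) * (beta / n * np.sum(np.abs(x) ** beta)) ** (1.0 / beta)
    return (-n * np.log(sigma) - n * (logZ + logA)
            - np.sum(np.abs(x / (sigma * np.exp(logA))) ** beta))

rng = np.random.default_rng(0)
x = rng.laplace(size=4000)                       # Laplacian data: true beta = 1
res = minimize_scalar(lambda b: -profile_ll(b, x), bounds=(0.3, 5.0), method="bounded")
beta_hat = res.x
```

For Laplacian input the recovered shape parameter should be close to 1, and for Gaussian input close to 2.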

D.2 Moments-based Parameter Estimation

For a GG, it can be shown that the ratio of the mean absolute value to the standard deviation is a steadily increasing function F_M of the shape parameter β [Do and Vetterli, 2002, Sharifi and Leon-Garcia, 1995]:

F_M(β) = Γ(2/β) / √( Γ(1/β) Γ(3/β) )   (D.9)

Hence, if we let m₁ = (1/n) Σ_{k=1}^{n} |x^k| and m₂ = (1/n) Σ_{k=1}^{n} (x^k)² be the estimates of the mean absolute value and the variance respectively, then β can be estimated via:

β = F_M^{−1}( m₁ / √m₂ )   (D.10)


In practice, β in Equation (D.10) can be easily obtained using a look-up table whose entries are the corresponding values of m₁/√m₂ and β.
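Since F_M is strictly increasing, the inversion in (D.10) can also be carried out by simple bisection instead of a look-up table; a sketch (function names are ours):

```python
from math import lgamma, exp, sqrt, pi

def f_m(beta):
    """Eq. (D.9): ratio of mean absolute value to standard deviation."""
    return exp(lgamma(2.0 / beta) - 0.5 * (lgamma(1.0 / beta) + lgamma(3.0 / beta)))

def f_m_inverse(ratio, lo=0.05, hi=20.0, iters=80):
    """Invert the increasing function F_M by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f_m(mid) < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# sanity check: for a Gaussian, m1/sqrt(m2) = sqrt(2/pi) and beta should be 2
beta_hat = f_m_inverse(sqrt(2.0 / pi))
```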


Appendix E

Maximum Likelihood Parameter Estimation for (Mono-β) Multivariate Generalized Gaussians

The D-dimensional pdf of a Multivariate Generalized Gaussian is given by [Boubchir and Fadili, 2005]:

P_{µ,β,Σ}(x) = [det(Σ)]^{−1/2} / [Z(β) A(β)]^D × exp( −‖ Σ^{−1/2}(x − µ) / A(β) ‖_β )   (E.1)

From (E.1), it follows that the log-likelihood function of a given set of n D-dimensional points {x^k}, k = 1, 2, ..., n (assuming zero mean, as in Appendix D) is given by

LL = Σ_{k=1}^{n} log P_{β,Σ}(x^k) = n log[det(Σ)]^{−1/2} − nD [log Z(β) + log A(β)] − Σ_{k=1}^{n} ‖ Σ^{−1/2} x^k / A(β) ‖_β   (E.2)

Estimates of β and Σ can be computed by setting the partial derivatives of the log-likelihood function (E.2) to zero and solving the resulting set of equations for the parameters.


E.1 Estimation of β

It can be shown that the derivative of LL w.r.t. β is given by:

∂LL/∂β = −nD [ Z′(β)/Z(β) + A′(β)/A(β) ] − Σ_{k=1}^{n} Σ_{d=1}^{D} | z^k_d/A(β) |^β × [ log| z^k_d/A(β) | − β A′(β)/A(β) ]   (E.3)

where z^k_d is the d-th component of the vector z^k ∈ R^D, defined by

z^k = Σ^{−1/2} x^k   (E.4)

Moreover,

Z′(β)/Z(β) = −(1/β) [ 1 + (1/β) Ψ(1/β) ]   (E.5)

A′(β)/A(β) = (1/(2β²)) [ 3Ψ(3/β) − Ψ(1/β) ]   (E.6)

where Ψ(·) stands for the Digamma function [Abramowitz and Stegun, 1970], i.e. Ψ(x) = Γ′(x)/Γ(x). It can be shown that the derivative in (E.3) is equivalent to that obtained in [Boubchir and Fadili, 2005]. However, this is not the case for the covariance matrix, as we will see next.

E.2 Estimation of Σ

In [Boubchir and Fadili, 2005] the authors estimated Σ as the covariance of the data∗, independently of the value of β (let us denote it by Σ₀). In this section we derive the estimate of Σ in an ML approach (Σ_ML), showing that the values of the elements of the covariance matrix do depend on β. In Section 6.2.1 we empirically assess that for values of β around 1 and greater, Σ_ML ≈ Σ₀. However, both estimates differ when input data are well modeled by low values of the shape parameter, which is usually the case for coefficients produced by several types of transforms (see [Hernandez et al., 2000, Srivastava et al., 2002]).

Let S = Σ^{−1/2}. Taking advantage of the chain rule, we can write

∂LL/∂Σ = (∂LL/∂S) (∂S/∂Σ)   (E.7)

∗This holds for multivariate Gaussians only


Thus, we can make ∂LL/∂Σ = 0 by setting ∂LL/∂S to 0 and solving for S. Afterwards, the estimate of Σ is simply given by Σ = (SS)^{−1}. Since [det(Σ)]^{−1/2} = det(Σ^{−1/2}) = det(S), it follows from Eq. (E.2) that

∂LL/∂S = n ∂ log det(S)/∂S − (1/A(β)^β) Σ_{k=1}^{n} ∂‖S x^k‖_β/∂S = n [ 2S^{−1} − S^{−1} • I ] − (1/A(β)^β) Σ_{k=1}^{n} ∂‖S x^k‖_β/∂S   (E.8)

where • stands for the Hadamard or elementwise product and I is the D × D identity matrix. What remains is to compute

Φ^k = ∂‖S x^k‖_β / ∂S ∈ M_{D×D}   (E.9)

From Eq. (E.4), z^k = S x^k. Hence z^k_i = Σ_{j=1}^{D} S_ij x^k_j and

‖S x^k‖_β = Σ_{i=1}^{D} |z^k_i|^β = Σ_{i=1}^{D} | Σ_{j=1}^{D} S_ij x^k_j |^β   (E.10)

The (m, n) element of Φ^k is given by the derivative of (E.10) w.r.t. S_mn. Taking into account that S is a symmetric matrix, we should distinguish two cases: a) the derivative w.r.t. an element of the diagonal of S, and b) the derivative w.r.t. the (m, n) element of S, with m ≠ n:

1. Φ^k_mm: The derivative w.r.t. the m-th element in the diagonal of S yields

Φ^k_mm = ∂‖S x^k‖_β/∂S_mm = β x^k_m | Σ_{i=1}^{D} S_mi x^k_i |^β / ( Σ_{i=1}^{D} S_mi x^k_i ) =_{(E.4)} β x^k_m |z^k_m|^β / z^k_m   (E.11)

2. Φ^k_mn: To compute the derivative w.r.t. the (m, n), m ≠ n element of S, we must take into account that S_mn = S_nm, and thus from Eq. (E.10):

Φ^k_mn = ∂‖S x^k‖_β/∂S_mn = β x^k_n | Σ_{i=1}^{D} S_mi x^k_i |^β / ( Σ_{i=1}^{D} S_mi x^k_i ) + β x^k_m | Σ_{i=1}^{D} S_ni x^k_i |^β / ( Σ_{i=1}^{D} S_ni x^k_i ) =_{(E.4)} β [ x^k_n |z^k_m|^β / z^k_m + x^k_m |z^k_n|^β / z^k_n ]   (E.12)

Finally, we obtain (D² + D)/2 different non-linear equations containing β and the coefficients of S ((D² + D)/2 different elements). Adding Eq. (E.3), a set of (D² + D)/2 + 1 non-linear equations with the same number of unknowns is available, which can be solved numerically (we used MATLAB's fsolve function). For D = 1, it can be shown that the formulas for the estimation of the GG parameters are equivalent to those obtained in Appendix D.


Appendix F

Maximum Likelihood Parameter Estimation for Poly-β Multivariate Generalized Gaussians

The pdf of the Poly-β Multivariate Generalized Gaussian (Pβ-MGG) introduced in Section 6.2.2 is given by:

P_{β,Σ}(x) = [det(Σ)]^{−1/2} / ∏_{d=1}^{D} [Z(β_d) A(β_d)] × exp( −‖ Σ^{−1/2} (x ÷ A(β)) ‖_β )   (F.1)

where β = [β₁, ..., β_D]^T and ÷ stands for elementwise division, i.e.

y = x ÷ A(β) = [ x₁/A(β₁), ..., x_D/A(β_D) ]^T   (F.2)

Moreover, ‖x‖_β = Σ_{d=1}^{D} |x_d|^{β_d}. It is easy to see that if β_d = β for all dimensions, then Eq. (F.1) is equivalent to the Mono-β distribution (Equations (6.1), (E.1)). From Eq. (F.1) it follows that the log-likelihood of a set of n points {x^k} ∈ R^D is given by

LL = Σ_{k=1}^{n} log P_{β,Σ}(x^k) = n log[det(Σ)]^{−1/2} − n Σ_{d=1}^{D} [log Z(β_d) + log A(β_d)] − Σ_{k=1}^{n} ‖ Σ^{−1/2} (x^k ÷ A(β)) ‖_β   (F.3)


Estimates of β and Σ can be computed by setting the partial derivatives of the log-likelihood function to zero and solving the resulting set of equations for the parameters.

F.1 Partial derivatives ∂LL/∂β_d

Bearing in mind that S = Σ^{−1/2}, and calling

w^k = S (x^k ÷ A(β)) =_{(F.2)} S y^k   (F.4)

it can be shown that ∂LL/∂β_d is given by

∂LL/∂β_d = −n [ Z′(β_d)/Z(β_d) + A′(β_d)/A(β_d) ] − Σ_{k=1}^{n} |w^k_d|^{β_d} ( −β_d S_dd y^k_d / w^k_d · A′(β_d)/A(β_d) + log|w^k_d| ) + Σ_{k=1}^{n} Σ_{i=1, i≠d}^{D} [ β_i S_id y^k_d |w^k_i|^{β_i} / w^k_i · A′(β_d)/A(β_d) ]   (F.5)

F.2 Partial derivative ∂LL/∂Σ

As done in Section E.2, instead of setting ∂LL/∂Σ to zero, we will solve ∂LL/∂S = 0. Let y ∈ R^D be defined as in Eq. (F.2). Hence

∂LL/∂S = n ∂ log det(S)/∂S − Σ_{k=1}^{n} ∂‖S y^k‖_β/∂S = n [ 2S^{−1} − S^{−1} • I ] − Σ_{k=1}^{n} ∂‖S y^k‖_β/∂S   (F.6)

Following a similar reasoning to that adopted in Section E.2, we have that the elements of Φ^k = ∂‖S y^k‖_β/∂S ∈ M_{D×D} are given by:

1. Φ^k_mm:

Φ^k_mm = ∂‖S y^k‖_β/∂S_mm = β_m y^k_m | Σ_{i=1}^{D} S_mi y^k_i |^{β_m} / ( Σ_{i=1}^{D} S_mi y^k_i ) = β_m y^k_m |w^k_m|^{β_m} / w^k_m   (F.7)

2. Φ^k_mn, m ≠ n:

Φ^k_mn = ∂‖S y^k‖_β/∂S_mn = β_m y^k_n | Σ_{i=1}^{D} S_mi y^k_i |^{β_m} / ( Σ_{i=1}^{D} S_mi y^k_i ) + β_n y^k_m | Σ_{i=1}^{D} S_ni y^k_i |^{β_n} / ( Σ_{i=1}^{D} S_ni y^k_i ) = β_m y^k_n |w^k_m|^{β_m} / w^k_m + β_n y^k_m |w^k_n|^{β_n} / w^k_n   (F.8)

Finally, we obtain (D² + D)/2 different non-linear equations (Eqs. (F.7) and (F.8)) containing β and the coefficients of S ((D² + D)/2 different elements). Adding Equation (F.5) for d = 1, ..., D, a set of (D² + D)/2 + D non-linear equations with the same number of unknowns is available, which can be solved numerically.
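For completeness, the density (F.1) itself is simple to evaluate; with Σ = I it factorizes into univariate GG marginals (Eq. (D.1) with σ = 1), which gives a convenient sanity check. The sketch below uses our own function names:

```python
import numpy as np
from math import gamma, exp

def A(b):
    return (gamma(1.0 / b) / gamma(3.0 / b)) ** 0.5        # Eq. (D.3)

def Z(b):
    return 2.0 / b * gamma(1.0 / b)                        # Eq. (D.2)

def pb_mgg_pdf(x, beta, S):
    """Poly-beta MGG density of Eq. (F.1); S = Sigma^{-1/2}."""
    y = x / np.array([A(b) for b in beta])                 # elementwise division, Eq. (F.2)
    w = S @ y
    norm_beta = sum(abs(w[d]) ** beta[d] for d in range(len(beta)))
    const = np.prod([Z(b) * A(b) for b in beta])
    return float(np.linalg.det(S) / const * np.exp(-norm_beta))

# with Sigma = I the density is the product of a Gaussian (beta = 2)
# and a Laplacian-like (beta = 1) univariate GG marginal
p = pb_mgg_pdf(np.array([0.5, -1.2]), [2.0, 1.0], np.eye(2))
```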


Appendix G

Training Algorithm for Generalized Gaussian Hidden Markov Models

The training algorithm for standard Gaussian HMMs is known as the Baum-Welch re-estimation procedure [Rabiner, 1989], and it is based on the well-known Expectation-Maximization (E-M) algorithm. Here we extend the procedure in order to deal with Generalized Gaussian HMMs, maintaining as much as possible the notation used in [Rabiner, 1989].

Given a set of sequences {O₁, O₂, ..., O_N}, with O_n = o₁, o₂, ..., o_{T_n} and o_t ∈ R^D, the goal is to estimate the best model λ. In order to simplify the notation, we will provide the re-estimation formulas for N = 1 (just one sequence); the generalization to N > 1 is straightforward.

Starting from an initial model λ^{(0)}, the E-M algorithm iteratively repeats two steps: in the first (E-step), the so-called Q function (the expected value of the complete log-likelihood given the current parameter estimates) is evaluated; afterwards, in the M-step, this expectation is maximized in order to find the new values of the parameters. For the HMM, the Q function can be split into three independent terms, one containing π, another related to A and a third one containing B [Cappe et al., 2005]. The maximization can then be carried out by optimizing each term individually. The re-estimation formulas for A and π do not change with respect to standard HMMs [Rabiner, 1989], and therefore we will just provide the re-estimation of B.

In particular, at each iteration ℓ, the following operations are performed:

E-step: among all the values computed in this step (see [Rabiner, 1989]), for the calculation of B we are interested in γ_t(i), which is defined as the probability of being in state S_i at time t, given the sequence O and the model estimated in the previous iteration λ^{(ℓ−1)}.
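The quantity γ_t(i) is obtained from the standard scaled forward-backward recursions. Below is a minimal sketch for a discrete-emission HMM (the thesis models emissions with Generalized Gaussians, but the recursion is identical; all names and the toy parameters are ours):

```python
import numpy as np

def state_posteriors(pi, trans, emis, obs):
    """Scaled forward-backward pass returning gamma_t(i) = P(q_t = S_i | O, lambda)."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi * emis[:, obs[0]]
    alpha[0] /= alpha[0].sum()                    # scaling avoids numerical underflow
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emis[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emis[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)       # renormalize per time step

gam = state_posteriors(np.array([0.6, 0.4]),
                       np.array([[0.7, 0.3], [0.4, 0.6]]),
                       np.array([[0.9, 0.1], [0.2, 0.8]]),
                       [0, 1, 0, 0, 1])
```

At every time step the posteriors sum to one over the states, which is the property used by the weighted sums in the M-step below.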

M-step: in this step the new parameters should be estimated by maximizing the


following function f:

f = Σ_{i=1}^{K} Σ_{t=1}^{T} γ_t(i) log(b_i(o_t))   (G.1)

In our case b_i(o_t) is the Mono-β Generalized Gaussian defined in Equation (6.1), and the parameters to be estimated are β_i, µ_i, Σ_i for each state S_i. Since it has been demonstrated in Section 6.2.2 that the Poly-β model outperforms the Mono-β version, the reader may ask himself (herself) the following question: why is the latter model being applied in the HMM framework? The answer to this question is chronological: we needed a multivariate model to extend the unidimensional Generalized Gaussian and, after reviewing the literature, chose the AMGG proposed in [Boubchir and Fadili, 2005]. Afterwards, we realized that the estimation of Σ could probably be improved and hence derived the ML estimate of the covariance matrix (Appendix E). Finally, and based on the results obtained in Chapter 5, we proposed the Poly-β MGG (Section 6.2.2). However, up to now, the only multivariate statistical model that has been used for Hidden Markov Models is the Mono-β with Σ = cov(X). In order to solve for the above parameters, we compute the derivative of f w.r.t. each of them, and set the obtained formulas to zero. In the following, we introduce the expressions that lead to the estimation of the three parameters:

• β_i – It can be shown that the derivative of f w.r.t. β_i is given by:

∂f/∂β_i = 0 = Σ_{t=1}^{T} γ_t(i) [ Σ_{j=1}^{D} ( |y_tj/A(β_i)|^{β_i} log|y_tj/A(β_i)| ) + (1/(2β_i)) Σ_{j=1}^{D} ( |y_tj/A(β_i)|^{β_i} ( Ψ(1/β_i) − 3Ψ(3/β_i) ) ) ] − Σ_{t=1}^{T} [ D γ_t(i)/β_i + (3γ_t(i)/(2β_i²)) ( Ψ(1/β_i) − Ψ(3/β_i) ) ]   (G.2)

where y_tj is the j-th component of the vector y_t, defined by y_t = Σ_i^{(−1/2)}(o_t − µ_i). Ψ(·) is the Digamma function [Abramowitz and Stegun, 1970], i.e. Ψ(x) = Γ′(x)/Γ(x). The new value of β_i^{(ℓ)} is obtained by solving the non-linear equation (G.2), which can be done by means of numerical routines∗.

• µi – The mean µi is a D-dimensional vector, and the expression for the calcu-lation of the h-th component µih is given by:

∗In particular, through the whole experimental session, we used the fzero function of MATLAB.


\[
\frac{\partial f}{\partial \mu_{ih}} = 0 = \sum_{t=1}^{T} \gamma_t(i) \left( \sum_{j=1}^{D} \eta(o_{tj}, \mu_{ih}) \times \left| \frac{\left(\Sigma_i^{-1/2}\right)_{jh} (o_{th} - \mu_{ih})}{A(\beta_i)} \right|^{\beta_i - 1} \right) \tag{G.3}
\]

where (M)_{jh} is the (j,h) entry of the matrix M, and η(·,·) is defined as follows:

\[
\eta(a,b) = \begin{cases} 1 & \text{if } a - b \ge 0 \\ -1 & \text{otherwise} \end{cases} \tag{G.4}
\]

Also in this case, the new parameter µ_i^{(ℓ)} is obtained by means of numerical routines.

• Σ_i — For the estimation of Σ_i, the following expression (derived from considering Σ as the covariance of the data [Boubchir and Fadili, 2005]; see the explanation above) was used:

\[
\Sigma_i^{(\ell)} = \sum_{t=1}^{T} \gamma_t(i) (o_t - \mu_i)(o_t - \mu_i)' \tag{G.5}
\]
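The γ-weighted sum of outer products in Equation (G.5) can be sketched in a few lines of plain Python. This is a pedagogical sketch, not the thesis implementation; note also that many references additionally normalize this sum by Σ_t γ_t(i), whereas the expression as printed above does not.

```python
def update_sigma(observations, gammas, mu):
    """M-step re-estimate of Sigma_i in the spirit of Equation (G.5):
    a gamma-weighted sum of outer products (o_t - mu_i)(o_t - mu_i)'.
    `observations` is a list of D-dimensional lists, `gammas` holds the
    per-frame state occupancies gamma_t(i), and `mu` is the state mean."""
    D = len(mu)
    sigma = [[0.0] * D for _ in range(D)]
    for o, g in zip(observations, gammas):
        d = [oj - mj for oj, mj in zip(o, mu)]  # o_t - mu_i
        for r in range(D):
            for c in range(D):
                sigma[r][c] += g * d[r] * d[c]
    return sigma

# Tiny example: two unit observations, full occupancy, zero mean.
S = update_sigma([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], [0.0, 0.0])
```

With these toy inputs the result is the 2×2 identity matrix, as expected from summing the two outer products.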


Appendix H

Resumen en Castellano (Spanish Summary)

This appendix gives a brief description of the contributions made in the field of face recognition. More specifically, Section H.1 presents the algorithms designed to address the problem of pose changes, while the extraction of discriminative features using Gabor filters is described in Section H.2. That section also covers the experiments on intramodal fusion of local similarities and an evaluation of different distance measures for comparing Gabor features. The statistical modeling of Gabor coefficients is presented in Section H.3. The main conclusions and future research lines are described in Section H.4.

H.1 Pose-Robust 2-D Face Recognition

This section addresses one of the biggest problems in the face recognition framework: coping with pose changes. It is well known that face recognition algorithms are sensitive to appearance changes caused by modifications of the subject's pose, and it has therefore become a primary goal to design algorithms able to deal with this kind of variation.

So far, the most successful algorithms are those that have exploited the characteristics of the face class. [Pentland et al., 1994] extended the eigenfaces method proposed by [Turk and Pentland, 1991] into a scheme in which an individual eigenspace is built for each pose. In [Beymer and Poggio, 1995], the authors extend the earlier attempt presented in [Beymer, 1994]: from a single image of a subject and using prior knowledge, virtual views at several poses are generated and then fed to the recognizer. In [Maurer and Malsburg, 1996], the authors propose a pose-invariant algorithm based on Elastic Bunch Graph Matching [Wiskott et al., 1997], in which the transformation is applied to the extracted feature vectors. In [Blanz and Vetter, 1999] the authors propose the use of 3D Morphable Models. Given a test image, the three-dimensional model is fitted, recovering appearance and shape parameters that are then used for recognition. These models have been applied to pose-invariant face recognition [Romdhani et al., 2002, Blanz and Vetter, 2003, Blanz et al., 2005] with great success.

In this Thesis we propose two recognition systems that are robust to pose changes. Both are based on a two-dimensional linear model (a Point Distribution Model [Cootes et al., 1995]). From face images on which a set of points has been manually marked (eyes, mouth, nose, etc.; see Figure H.1), and using principal component analysis, we obtain the eigenvectors that are responsible for pose changes (Figures H.2 and H.3). Then, given a face with its set of points fitted, the coordinate vector is projected onto the space learned by PCA, and the coefficients that weight the pose eigenvectors are modified so that the reconstructed mesh adopts a pose suitable for our purposes. Next, using texture mapping from the original image (through Thin Plate Splines [Bookstein, 1989]), we synthesize a virtual face that is fed to the recognition module. The two methods we propose differ in that the first synthesizes frontal faces regardless of the initial pose, while the second uses knowledge of the poses of the two faces to be compared to generate a synthetic face that emulates the pose of the other. Both methods exploit facial symmetry to cope with occlusions caused by rotations of the head itself.

This section shows both examples of virtual images generated with our methods and results on a database rich in horizontal rotations: the CMU-PIE database [Sim et al., 2003]. The necessary notions regarding Point Distribution Models are briefly described next, and the concept of pose eigenvectors is introduced.

H.1.1 Point Distribution Model: Pose eigenvectors

A Point Distribution Model (PDM) is obtained from a set of training images. In each of these images, I_i, N points are manually annotated and their normalized coordinates are stored to form a vector

\[
X_i = (x_{1i}, x_{2i}, \ldots, x_{Ni}, y_{1i}, y_{2i}, \ldots, y_{Ni})^T = \left( \mathbf{x}_i \; \mathbf{y}_i \right)^T \tag{H.1}
\]

Once the N points have been annotated in all the training images, Principal Component Analysis is used to find the main modes of shape variation. In this way, a matrix P = [φ1|φ2|…|φt|…] is obtained whose columns are the unit eigenvectors that define the change of basis. Any training mesh X_i can be reconstructed approximately using the first t eigenvectors:

\[
X_i \approx \bar{X} + Pb, \tag{H.2}
\]

where X̄ is the mean mesh and b is the specific parameter vector that defines the mesh X_i.

Figure H.1: Position of the 62 nodes used in this Thesis

Once we have found the principal eigenvectors, we are interested in those that contain information about the rotation of the mesh (the pose of the mesh). In order to identify them, we use the following procedure: every entry of the parameter vector is set to 0 except one coefficient, whose value is varied within a range, and the mesh is reconstructed using Equation (H.2). In this way, we evaluate visually whether a given eigenvector contains pose information. With our training set, we determined that the first eigenvector φ1 contains up-down rotation information (Figure H.2), while the second eigenvector φ2 is responsible for left-right rotation (Figure H.3).

Figure H.2: Effect of changing the value of b1 on the reconstructed meshes. φ1 controls the up-down rotation


Figure H.3: Effect of changing the value of b2 on the reconstructed meshes. φ2 controls the left-right rotation

It was shown that the identified pose eigenvectors are responsible only for rigid variations of the face and contain no information about the subject's expression and/or identity. Thus, varying the values of b1 and b2 only affects the pose of the mesh, without altering the expression encoded in it.

H.1.2 Synthetic image generation

Given an image I with its corresponding mesh X, we can compute the parameter vector that defines that mesh and change the values of the coefficients that weight the pose eigenvectors (i.e. change the values of b1 and b2), thereby obtaining a virtual mesh X′ with a pose defined by the new values we have assigned. From I, X and X′ we can generate a virtual image I′ with the pose of X′. To do so, we use texture mapping (Thin Plate Splines [Bookstein, 1989]) from I, exploiting the correspondences between X and X′.
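The projection and reconstruction steps described above (Equation (H.2)) can be sketched as follows. This is an illustrative toy, not the thesis code: the mean mesh, eigenvectors and coordinates are made-up values, and the Thin Plate Splines texture warping that produces the final virtual image is omitted entirely.

```python
def project(X, Xbar, P):
    """b = P^T (X - Xbar): mesh parameters in the PDM basis.
    X and Xbar are flat coordinate vectors; P is a list of unit eigenvectors."""
    d = [xi - mi for xi, mi in zip(X, Xbar)]
    return [sum(pj * dj for pj, dj in zip(phi, d)) for phi in P]

def reconstruct(Xbar, P, b):
    """X' = Xbar + P b, i.e. Equation (H.2)."""
    X = list(Xbar)
    for coef, phi in zip(b, P):
        for k in range(len(X)):
            X[k] += coef * phi[k]
    return X

# Toy example with two orthonormal "eigenvectors" (hypothetical values;
# in the thesis, phi_2 would be the left-right pose mode of Section H.1.1).
Xbar = [0.0, 0.0, 0.0, 0.0]
P = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
X = [0.5, -0.3, 0.0, 0.0]

b = project(X, Xbar, P)   # parameter vector of the observed mesh
b[1] = 0.0                # neutralize the pose coefficient (the role of b2)
X_virtual = reconstruct(Xbar, P, b)
```

Changing a single pose coefficient and reconstructing is exactly the manipulation that yields the virtual mesh X′ to which the texture is then warped.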

Figure H.4 shows examples of synthetic faces. For each individual, only the frontal face (and its associated mesh) is used as input. The parameter vector defining this mesh is computed and the value of b2 is varied within a range. For each new value of b2, the associated mesh is generated and the original face is warped onto the virtual mesh, producing the synthetic faces shown in the figure. The same process is repeated with parameter b1 (example images in Figure H.5).

Throughout our procedure, it is of vital importance to choose an adequate training set in order to obtain "good" pose eigenvectors. Otherwise, varying one of the parameters that we have associated with rigid changes may have an undesired effect on the generated face (Figure H.6).


Figure H.4: Examples of synthetic images under azimuth rotation.


H.1.3 Pose correction and Robust Face Recognition

Since we have a way of generating virtual faces with poses different from that of the original image, we can use this capability to obtain recognition methods robust to pose changes. The two methods we propose differ in that the first synthesizes frontal faces regardless of the initial pose (NFPW, Figure H.7), while the second uses knowledge of the poses of the faces to be compared to generate a synthetic face emulating the pose of the other (PTW, Figure H.8). In order to cope with occlusions caused by the head itself, we exploited facial symmetry to make each of these methods more robust. Figures H.9 and H.10 show diagrams explaining how symmetry is applied to NFPW and PTW.

In general, PTW achieves better recognition results than NFPW, at the cost of a higher computational load. On the CMU PIE database [Sim et al., 2003], with rotations of up to 45°, the mean recognition rates achieved were 87.5% for NFPW and 91.5% for PTW. These results are comparable to those obtained by a three-dimensional method based on Morphable Models [Romdhani et al., 2002] (88.5%). Other methods proposed in the literature, such as Eigen Light-Fields [Gross et al., 2004], also obtain worse results than the systems developed in this Thesis within the rotation range indicated above.

To cope with larger rotations (up to 67.5°), a variant of PTW was developed: since at such angles some mesh nodes become hidden, we decided to use the restricted set of visible nodes for the generation of virtual images. Some examples of synthetic faces are shown in Figure H.11. In order to measure the performance of the system under extreme rotation conditions (±67.5°), experiments were carried out on the CMU PIE database, obtaining a mean recognition rate of 77.5%. [Romdhani et al., 2002] achieves only slightly better results (79.5%), which indicates that a 2-D system is able to cope with wide rotations with performance similar to that offered by a three-dimensional model.


Figure H.5: Examples of synthetic images under up-down rotation


Figure H.6: Top: identity is preserved when a good training set is chosen. Bottom: identity is distorted when a good training set is not chosen.

Figure H.7: Block diagram for pose correction with NFPW. Both meshes are corrected to frontal pose (Pose Normalization block), and frontal virtual faces are generated using Thin Plate Splines (TPS).


Figure H.8: Block diagram for pose correction with PTW. Mesh A adopts the pose of mesh B (Pose Transfer block), and a virtual face Â is generated through Thin Plate Splines (TPS). Finally, faces Â and B are compared.


Figure H.9: Using facial symmetry in NFPW: the original and mirrored images are both warped onto the synthetic frontal mesh, and the two versions are then blended using masks.


Figure H.10: Using facial symmetry in PTW. Before the pose transfer, the horizontal rotation values (b2 parameters) of both faces are inspected. If they have opposite signs, one of the faces is mirrored and the pose transfer is then performed.


Figure H.11: The first and third columns show original images at ±67.5° respectively, while the second and fourth show the corresponding synthetic faces


H.2 Extraction and comparison of Gabor responses. Intramodal Fusion

Some of the best-known face recognition systems are those based on Gabor feature extraction. This choice is motivated by biological reasons as well as by the optimal resolution of Gabor filters in the space and frequency domains. Using Gabor filters as the recognition engine, we have proposed a method that tries to extract features at positions or regions that are in some way specific to each individual. This constitutes a new approach with respect to the classical methods, which extract features at predefined points of the face.

At a practical level, our method extracts a representation of the individual facial structure from a ridge and valley detector [Lopez et al., 1999]. The set of lines obtained must then be sampled to keep a series of points, at which local Gabor responses are computed. A scheme of the process is shown in Figure H.12.

Figure H.12: Gabor response extraction system using ridges and valleys (stages: ridge and valley detection, thresholding, sampling, Gabor jet extraction).

When comparing two faces, the problem with our method lies in the fact that there are no a priori correspondences between the points extracted from the two faces. This does not occur in methods such as Elastic Bunch Graph Matching [Wiskott et al., 1997], where the points refer to universal facial features (eyes, nose, etc.) and the correspondences are therefore predefined. In order to obtain correspondences between points, we make use of a shape matching method based on shape contexts [Belongie et al., 2002]. The output of this algorithm is a function that links each point of the first face to a point of the second. In this way, the Gabor responses computed on both faces can be compared correctly.
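As a toy stand-in for the shape-context matcher (which is considerably more elaborate, using log-polar histograms and an optimal assignment), the following greedy nearest-neighbour sketch only illustrates the kind of output the matching stage produces: a one-to-one mapping ξ from the points of face A to the points of face B. It is not the algorithm used in the thesis.

```python
def greedy_correspondences(points_a, points_b):
    """Greedy one-to-one assignment by squared Euclidean distance.
    Like the real shape-context matcher, it returns a mapping xi(i)
    from each point index of face A to a point index of face B."""
    def d2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    # All candidate pairs, cheapest first.
    pairs = sorted(
        (d2(p, q), i, j)
        for i, p in enumerate(points_a)
        for j, q in enumerate(points_b)
    )
    xi, used_a, used_b = {}, set(), set()
    for _, i, j in pairs:
        if i not in used_a and j not in used_b:
            xi[i] = j
            used_a.add(i)
            used_b.add(j)
    return xi

# Hypothetical point sets sampled from two faces.
A = [(0, 0), (10, 0), (5, 8)]
B = [(5, 9), (1, 0), (11, 1)]
mapping = greedy_correspondences(A, B)
```

Once such a mapping ξ is available, the Gabor jet at point i of face A is compared with the jet at point ξ(i) of face B.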

The main results derived from the tests carried out are the following:

1. The distribution of points, extracted as indicated above, is much more discriminative than the configuration of universal points (eyes, nose, etc.).

2. The method obtains results comparable to those achieved by an Elastic Bunch Graph Matching in which the points have been located manually (Table H.1).

3. Tests on the AR face database [Martınez and Benavente, 1998] (Figure H.13 shows an example of the images used) demonstrate that the method is fairly robust to illumination and expression changes (Figures H.14 and H.15).

Figure H.13: Images from the AR face database used in the experiments.


Table H.1: Verification on the XM2VTS database. False acceptance rate (FAR), false rejection rate (FRR) and total error rate (TER) for our method and the EBGM algorithm.

                 Conf. I                      Conf. II
         FAR(%)  FRR(%)  TER(%)       FAR(%)  FRR(%)  TER(%)
EBGM      2.93    4.25    7.18         1.42    3.25    4.67
SDGJ      3.13    3.50    6.63         1.32    3.00    4.32

Figure H.14: Performance under expression variations (cumulative match score vs. rank; curves: Smile (b), Anger (c), Scream (d)). Clearly, the system only fails when recognizing faces with an open mouth.

Figure H.15: Performance under illumination variations (cumulative match score vs. rank; curves: Right (e), Left (f), Both (g)).


Table H.2: Error rates using different distance measures and L1 normalization of the jets

D(J_pi, J_qξ(i))           Input Image Resolution
                      55 × 51    150 × 115    220 × 200
Cosine                 4.89        4.64         3.40
Manhattan              5.28        3.73         2.90
SSE                    4.49        3.53         3.17
χ²                     5.33        4.03         3.01
Modified Manhattan     5.28        3.73         2.90
Correlation            6.70        7.01         5.45
Canberra               5.26        3.84         3.63

One of the issues that has received practically no attention in the literature on face recognition with Gabor filters is the choice of the distance (or similarity) measure used to compare local Gabor responses. Usually, the normalized dot product (or, equivalently, the cosine distance) has been used even though there is no theoretical or empirical reason supporting its use. In order to obtain some insight, we proposed a comparison of 7 different distance measures with different normalization factors. The main conclusion drawn from these experiments is that the normalization applied to the jets before comparing them is critical and decisively influences the performance obtained with a given distance. That said, there are normalization + distance combinations that offer better results than the cosine distance (Table H.2).
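Three of the compared distances, applied to L1-normalized jets as in Table H.2, could be sketched as follows. The jet values are hypothetical, and the functions are generic textbook definitions rather than the thesis implementation.

```python
import math

def l1_normalize(jet):
    """Divide each coefficient by the L1 norm of the jet."""
    s = sum(abs(x) for x in jet)
    return [x / s for x in jet]

def cosine_distance(a, b):
    """1 minus the normalized dot product (the usual Gabor-jet similarity
    turned into a distance)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - num / den

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)

# Hypothetical jets at corresponding points p_i and q_xi(i).
jet_p = l1_normalize([0.9, 0.4, 0.1, 0.6])
jet_q = l1_normalize([0.8, 0.5, 0.2, 0.5])
scores = (cosine_distance(jet_p, jet_q),
          manhattan(jet_p, jet_q),
          canberra(jet_p, jet_q))
```

Swapping the normalization (L2, none, etc.) while keeping the distance fixed is exactly the kind of combination the comparison in Table H.2 explores.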

Continuing with Gabor-based algorithms, a comparison of different techniques for combining local similarities (intramodal fusion) has been proposed. It is well known that feature selection and combination are very important for obtaining high-performing methods. In this comparison, techniques such as Support Vector Machines, neural networks, AdaBoost, Sequential Floating Forward Selection (SFFS), Best Individual Features (BIF) and the closely related Accuracy-based Feature Selection (AFS, proposed in this Thesis) were evaluated.

The main results derived from the experiments carried out are the following:

• All the proposed techniques improve the performance of the baseline algorithm (Table H.3 shows the results obtained by several of the fusion schemes employed).

• Simple techniques such as BIF can obtain results comparable to more complex methods such as SVMs, given the scarcity of available data. In fact, the best results using BIF are 2.43% in Configuration I (compare with Table H.3).

Table H.3: Error rates using different fusion techniques

             Configuration I    Configuration II
             TER(%)             TER(%)
Baseline      7.18               4.67
LDA-based     5.94               4.45
MLP-AB        3.50               2.16
LDA-AB        4.15               2.54
SVM           3.58               2.30

H.3 Statistical modeling of Gabor coefficients

Despite the large number of papers that have used Gabor filters as a recognition engine, none of them has proposed (or used) a statistical model for the coefficients. In this Thesis two different models have been compared: generalized Gaussians and Bessel K Forms. Although the latter have gained importance lately, with some authors defending their superiority over generalized Gaussians, the truth is that generalized Gaussians offer more accurate modeling than Bessel K Forms in the specific scenario we have considered: modeling the coefficients of Gabor filters extracted from face images. The parameters of the generalized Gaussians are computed using a Maximum Likelihood method, and the theoretical densities are compared to the real distributions, obtaining good results (Figure H.16). As could be expected, the values of these parameters depend on the frequency and orientation of the filter considered and behave similarly within each spatial frequency band.
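Although the thesis fits the generalized Gaussian by Maximum Likelihood, a simpler moment-matching estimator of the shape parameter β conveys the idea: for a GGD, the ratio E|x| / √(E x²) equals Γ(2/β) / √(Γ(1/β)Γ(3/β)), so β can be recovered by matching the empirical ratio, here via bisection. This is an illustrative stand-in, not the thesis estimator.

```python
import math
import random

def ggd_shape_from_moments(samples):
    """Moment-matching estimate of the GGD shape parameter beta:
    match  E|x| / sqrt(E x^2)  to  Gamma(2/b) / sqrt(Gamma(1/b) Gamma(3/b)),
    solving for b by bisection (the theoretical ratio grows with b)."""
    m1 = sum(abs(x) for x in samples) / len(samples)
    m2 = sum(x * x for x in samples) / len(samples)
    target = m1 / math.sqrt(m2)

    def ratio(b):
        # Computed through log-Gamma for numerical stability.
        return math.exp(math.lgamma(2.0 / b)
                        - 0.5 * (math.lgamma(1.0 / b) + math.lgamma(3.0 / b)))

    lo, hi = 0.1, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Sanity check: beta = 2 recovers the Gaussian (beta = 1 would be Laplacian).
random.seed(0)
gauss = [random.gauss(0.0, 1.0) for _ in range(100000)]
beta_hat = ggd_shape_from_moments(gauss)
```

On heavy-tailed Gabor coefficient histograms this estimator typically returns β well below 2, which is exactly the non-Gaussian behavior discussed above.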

We take advantage of the underlying statistics to quantize the coefficients using the Lloyd-Max algorithm. In this way, we were able to reduce the space needed to store a Gabor-based facial representation without performance degradation (storage reductions of up to 90% with no loss in performance, Table H.4).
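The Lloyd-Max quantizer alternates two steps — assign each sample to its nearest level, then move each level to the centroid of its assigned samples — until the mean squared quantization error stabilizes. The following is a minimal empirical sketch operating directly on samples; the thesis instead exploits the fitted GGD density when designing the levels.

```python
def lloyd_max(samples, n_levels, n_iter=50):
    """Lloyd-Max scalar quantizer (empirical variant): alternate
    nearest-level assignment and centroid updates to reduce the
    mean squared quantization error."""
    lo, hi = min(samples), max(samples)
    # Start from levels spread uniformly over the data range.
    levels = [lo + (hi - lo) * (k + 0.5) / n_levels for k in range(n_levels)]
    for _ in range(n_iter):
        buckets = [[] for _ in range(n_levels)]
        for x in samples:
            k = min(range(n_levels), key=lambda i: (x - levels[i]) ** 2)
            buckets[k].append(x)
        # Move each level to the centroid of its bucket (keep it if empty).
        levels = [sum(b) / len(b) if b else levels[k]
                  for k, b in enumerate(buckets)]
    return sorted(levels)

# Two well-separated clusters: with 2 levels the quantizer should land
# one level on each cluster mean.
levels = lloyd_max([-1.1, -1.0, -0.9, 0.9, 1.0, 1.1], 2)
```

The number of levels NL is the knob traded off against storage in Table H.4: fewer levels mean fewer bits per coefficient.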

Table H.4: Verification on XM2VTS. Error rates obtained for the original coefficients and for coefficients compressed with different numbers of quantization levels (NL).

                           Test Set
           Storage Saving   FAR(%)   FRR(%)   TER(%)
NL = 2        ≈97%          12.15    18.25    30.40
NL = 4        ≈94%           4.19     8.00    12.19
NL = 8        ≈91%           3.49     5.50     8.99
NL = 16       ≈87%           3.85     5.50     9.35
NL = 32       ≈84%           3.71     5.00     8.71
NL = 64       ≈81%           3.53     5.50     9.03
NL = 128      ≈78%           3.57     5.00     8.57
NL = 256      ≈75%           3.63     4.75     8.38
NL = 512      ≈72%           3.66     4.75     8.41
Raw data       0%            3.79     5.25     9.04

In addition to modeling the marginal distributions of the coefficients, the study of the joint distributions was also initiated. To this end, a new formulation based on multidimensional generalized Gaussians was proposed, obtaining good results, albeit in limited experiments (several examples of bidimensional fits are shown in Figures H.17 and H.18). The generalized Gaussians have recently been used as the symbol-emission density within each state of a Hidden Markov Model, with success in limited experiments.

Figure H.16: Histogram of coefficient 34 together with the fitted generalized Gaussian.


Figure H.17: Statistical modeling of the bidimensional histogram of Gabor coefficients (1,12). Probability contours of the histogram (top) and fitted model (bottom)


Figure H.18: Statistical modeling of the bidimensional histogram of Gabor coefficients (13,40). Probability contours of the histogram (top) and fitted model (bottom)


H.4 Conclusions and Future Lines

The main conclusions that can be drawn from this Thesis are detailed below:

• It is possible to design face recognition systems robust to pose changes using 2D linear models to model the rotation and texture mapping to synthesize the virtual faces. It was shown that the use of facial symmetry improves the performance of the recognition system, achieving performance similar to that of a system based on three-dimensional models over a range of azimuth rotations of ±67.5°.

• A system has been proposed for extracting Gabor features in regions that may have an inherent discriminative component, using a ridge and valley operator. The comparison with the well-known Elastic Bunch Graph Matching shows that our method obtains better results. Likewise, it was observed that the distribution of points obtained from the ridges and valleys is much more discriminative than that associated with universal points (mouth, nose, etc.)

• A comparison of different distance measures for comparing Gabor features was carried out, concluding that the normalization applied to these features prior to comparison is critical and influences the performance of a given distance. Moreover, it was observed that there are normalization + distance combinations that offer better results than the classical cosine distance.

• We proposed a comparison of different tools for fusing local Gabor similarities. The main conclusion that can be drawn is that a priori simpler techniques can obtain better results than more complex ones, given the scarcity of data in the verification scenario we are dealing with.

• It was shown that the marginal distributions of Gabor coefficients exhibit non-Gaussian behavior and are modeled with great accuracy by generalized Gaussians. As an application example, data compression using the Lloyd-Max algorithm was presented. In addition, the study of the bidimensional distributions of Gabor coefficients was started, proposing a new formulation suited to the problem.

• Finally, face tracking algorithms were implemented, laying the groundwork for the development of face recognition systems on video sequences. Preliminary results were obtained in the following contexts: pose-robust face recognition in videos and detection of asynchrony between the audio and video signals.


Figure H.19: Preliminary results on automatic mesh fitting. Left: initialization. Center: fit after 10 iterations. Right: final fit.

As for the main future lines opened up by this Thesis, we can highlight the following:

• Automating the mesh fitting so as to proceed afterwards with pose correction. Some preliminary experiments have been carried out on the CMU PIE database using Inverse Compositional Image Alignment [Baker and Matthews, 2001] (see Figure H.19). Still regarding pose correction, another possible improvement is the definition of a local weighting function that gives more importance to some regions than to others according to the rotation of the face.

• Improvements to the local Gabor feature extraction algorithm: more tests in uncontrolled illumination environments, and better behavior under strong expression changes. Fusion of local similarities at the coefficient level (not all Gabor filters are discriminative in all regions of the face). Likewise, more tests should be carried out with the different distance measures on more databases and with various algorithms based on Gabor feature extraction.

• More tests should be carried out with the multivariate generalized Gaussian that has been proposed. Likewise, the use of the generalized Gaussian in Hidden Markov Models should be studied in greater depth.

• Face processing in video. During this Thesis, software for tracking faces in videos was developed, with application to pose-robust face recognition and to the detection of asynchrony between the audio and video signals (Figure H.20). The first step to be taken is to automate the face tracking, since the fitting in the first frame is currently done manually. Likewise, making the tracking system robust to appearance changes (illumination, pose, expression, occlusion, etc.) is an important goal to be achieved.


H.4. Conclusions and Future Lines

[Figure H.20 panels: mesh fitting in frames 1, 101 and 201; lip AREA/WIDTH/HEIGHT feature extraction.]

Figure H.20: Example of face tracking on the BANCA database. Extraction of lip coordinates to detect asynchrony between audio and video [Argones Rua et al., 2008].


Bibliography

[Abramowitz and Stegun, 1970] Abramowitz, M. and Stegun, I. (1970). Handbook of Mathematical Functions. Dover Publications, Inc., New York.

[Achim et al., 2001] Achim, A., Bezerianos, A., and Tsakalides, P. (2001). Novel Bayesian Multiscale Method for Speckle Removal in Medical Ultrasound Images. IEEE Transactions on Medical Imaging, 20(8):772–783.

[Adini et al., 1997] Adini, Y., Moses, Y., and Ullman, S. (1997). Face recognition: the problem of compensating for changes in illumination direction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):721–732.

[Ahonen et al., 2006] Ahonen, T., Hadid, A., and Pietikainen, M. (2006). Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041.

[Alba-Castro et al., 2008] Alba-Castro, J., Gonzalez-Jimenez, D., Argones-Rua, E., Gonzalez-Agulla, E., Otero-Muras, E., and García-Mateo, C. (2008). Pose-Corrected Face Processing on Video Sequences for Webcam-based Remote Biometric Authentication. SPIE Journal of Electronic Imaging, 17(1).

[Alba-Castro et al., 2003] Alba-Castro, J., Pujol, A., Lopez, A. M., and Villanueva, J. (2003). Improving Shape-based Face Recognition by Means of a Supervised Discriminant Hausdorff Distance. In Proceedings IEEE International Conference on Image Processing, pages 901–904.

[Argones Rua et al., 2008] Argones Rua, E., Bredin, H., García Mateo, C., Chollet, G., and Gonzalez Jimenez, D. (2008). Audio-Visual Speech Asynchrony Detection using Co-Inertia Analysis and Coupled Hidden Markov Models. Accepted for publication in Pattern Analysis and Applications, Springer.

[Argones Rua et al., 2006] Argones Rua, E., Kittler, J., Gonzalez Jimenez, D., and Alba Castro, J. L. (2006). Information Fusion for Local Gabor Features based Frontal Face Verification. In Zhang, D. and Jain, A. K., editors, Advances in Biometrics, Proceedings of the International Conference on Biometrics (ICB 2006), LNCS 3832, pages 173–181, Hong Kong, China.

[Atal, 1976] Atal, B. (1976). Automatic recognition of speakers from their voices. Proceedings IEEE, 64(4):460–475.

[AT&T, 1992] AT&T (1992). AT&T (formerly ORL) face database, http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.

[Bailly-Bailliere et al., 2003] Bailly-Bailliere, E., Bengio, S., Bimbot, F., Hamouz, M., Kittler, J., Mariethoz, J., Matas, J., Messer, K., Popovici, V., Poree, F., Ruiz, B., and Thiran, J.-P. (2003). The BANCA Database and Evaluation Protocol. In Lecture Notes in Computer Science, volume 2688, pages 625–638.

[Baker and Matthews, 2001] Baker, S. and Matthews, I. (2001). Equivalence and Efficiency of Image Alignment Algorithms. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 1090–1097.

[Bartlett et al., 2002] Bartlett, M. S., Movellan, J. R., and Sejnowski, T. J. (2002). Face Recognition by Independent Component Analysis. IEEE Transactions on Neural Networks, 13(6):1450–1464.

[Basri and Jacobs, 2003] Basri, R. and Jacobs, D. W. (2003). Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218–233.

[Bay et al., 2000] Bay, S. D., Kibler, D., Pazzani, M. J., and Smyth, P. (2000). The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD Explorations Newsletter, 2(2):81–85.

[Belhumeur et al., 1997] Belhumeur, P., Hespanha, J., and Kriegman, D. (1997). Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720.

[Belhumeur and Kriegman, 1998] Belhumeur, P. and Kriegman, D. (1998). What Is the Set of Images of an Object Under All Possible Illumination Conditions. International Journal of Computer Vision, 28(3):245–260.

[Belongie et al., 2002] Belongie, S., Malik, J., and Puzicha, J. (2002). Shape Matching and Object Recognition Using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522.

[Bengio and Mariethoz, 2004] Bengio, S. and Mariethoz, J. (2004). A Statistical Significance Test for Person Authentication. In ODYSSEY 2004 - The Speaker and Language Recognition Workshop, pages 237–244.


[Beymer, 1994] Beymer, D. (1994). Face Recognition under Varying Pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 756–761.

[Beymer and Poggio, 1995] Beymer, D. and Poggio, T. (1995). Face Recognition from One Example View. In Proceedings of the IEEE International Conference on Computer Vision, pages 500–507.

[Bicego et al., 2008] Bicego, M., Gonzalez-Jimenez, D., Grosso, E., and Alba-Castro, J. (2008). Generalized Gaussians for Sequential Data Analysis. Submitted to the International Conference on Pattern Recognition.

[Bicego et al., 2006] Bicego, M., Lagorio, A., Grosso, E., and Tistarelli, M. (2006). On the Use of SIFT Features for Face Authentication. In Proceedings IEEE CVPR Workshop on Biometrics, page 35.

[Bicego et al., 2003] Bicego, M., Murino, V., and Figueiredo, M. A. T. (2003). A Sequential Pruning Strategy for the Selection of the Number of States in Hidden Markov Models. Pattern Recognition Letters, 24(9-10):1395–1407.

[Biederman and Gu, 1988] Biederman, I. and Gu, J. (1988). Surface Versus Edge-based Determinants of Visual Recognition. Cognitive Psychology, 20:38–64.

[BioAPI, 1998] BioAPI (1998). Biometric API standard, http://www.bioapi.org/.

[BioID, 1998] BioID (1998). The BioID Face Database, http://www.bioid.com/downloads/facedb/index.php.

[BioSecure, 2004] BioSecure (2004). Biosecure Network of Excellence, http://www.biosecure.info/.

[Blanz et al., 2005] Blanz, V., Grother, P., Phillips, P., and Vetter, T. (2005). Face Recognition Based on Frontal Views Generated from Non-Frontal Images. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 454–461.

[Blanz and Vetter, 1999] Blanz, V. and Vetter, T. (1999). A Morphable Model for the Synthesis of 3D Faces. In Proceedings SIGGRAPH, pages 187–194.

[Blanz and Vetter, 2003] Blanz, V. and Vetter, T. (2003). Face Recognition Based on Fitting a 3D Morphable Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1063–1074.


[Bookstein, 1989] Bookstein, F. (1989). Principal Warps: Thin-Plate Splines and the Decomposition of Deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6):567–585.

[Boubchir and Fadili, 2005] Boubchir, L. and Fadili, J. (2005). Multivariate Statistical Modeling of Images with the Curvelet Transform. Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, 2:747–750.

[Bronstein et al., 2005] Bronstein, A., Bronstein, M., and Kimmel, R. (2005). Expression-invariant face recognition via spherical embedding.

[Bruce et al., 1992] Bruce, V., Hanna, E., Dench, N., Healey, P., and Burton, M. (1992). The Importance of Mass in Line Drawings of Faces. Applied Cognitive Psychology, 6:619–628.

[Brunelli and Poggio, 1993] Brunelli, R. and Poggio, T. (1993). Face recognition: features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042–1052.

[Buccigrossi and Simoncelli, 1999] Buccigrossi, R. W. and Simoncelli, E. P. (1999). Image Compression via Joint Statistical Characterization in the Wavelet Domain. IEEE Transactions on Image Processing, 8(12):1688–1701.

[Bunke and Caelli, 2001] Bunke, H. and Caelli, T. (2001). Hidden Markov Models: Applications in Computer Vision. World Scientific Publishing Co., Inc., River Edge, NJ, USA.

[Burton et al., 2005] Burton, M., Jenkins, R., Hancock, P., and White, D. (2005). Robust representations for face recognition: The power of averages. Cognitive Psychology, 51(3):256–284.

[Cappe, 2001] Cappe, O. (2001). Ten years of HMMs.

[Cappe et al., 2005] Cappe, O., Moulines, E., and Ryden, T. (2005). Inference in Hidden Markov Models (Springer Series in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[CAS-PEAL, 2002] CAS-PEAL (2002). CAS-PEAL face database, http://www.jdl.ac.cn/peal/index.html.

[Chai et al., 2005] Chai, X., Qing, L., Shan, S., Chen, X., and Gao, W. (2005). Pose Invariant Face Recognition under Arbitrary Illumination based on 3D Face Reconstruction. In Proceedings Audio- and Video-based Biometric Person Authentication (AVBPA), pages 956–965.


[Chai et al., 2006] Chai, X., Shan, S., Chen, X., and Gao, W. (2006). Local linear regression (LLR) for pose invariant face recognition. In Seventh IEEE International Conference on Automatic Face and Gesture Recognition (FG 2006), 10-12 April 2006, Southampton, UK, pages 631–636.

[Cho and Bui, 2005] Cho, D. and Bui, T. (2005). Multivariate Statistical Approach for Image Denoising. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages iv/589–iv/592.

[Cognitec, 2002] Cognitec (2002). Cognitec Systems GmbH, http://www.cognitec-systems.de/.

[Cootes et al., 2001] Cootes, T., Edwards, G., and Taylor, C. (2001). Active Appearance Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685.

[Cootes and Taylor, 2001] Cootes, T. and Taylor, C. (2001). Constrained active appearance models. In International Conference on Computer Vision, pages 748–754.

[Cootes et al., 1995] Cootes, T., Taylor, C., Cooper, D., and Graham, J. (1995). Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38–59.

[Cootes et al., 2000] Cootes, T., Walker, K., and Taylor, C. (2000). View-based Active Appearance Models. In International Conference on Face and Gesture Recognition, pages 227–232.

[COST-275, 2001] COST-275 (2001). COST-275 Action, http://www.fub.it/cost275/.

[Cover and Thomas, 1991] Cover, T. and Thomas, J. (1991). Elements of Information Theory. John Wiley & Sons, New York.

[Cox et al., 1996] Cox, I. J., Ghosn, J., and Yianilos, P. N. (1996). Feature-based face recognition using mixture-distance. In International Conference on Computer Vision and Pattern Recognition, pages 209–216. IEEE Press.

[Cristinacce and Cootes, 2006] Cristinacce, D. and Cootes, T. (2006). Feature detection and tracking with constrained local models. In 17th British Machine Vision Conference, Edinburgh, UK, pages 929–938.

[Cristinacce and Cootes, 2007] Cristinacce, D. and Cootes, T. (2007). Boosted regression active shape models. In 18th British Machine Vision Conference, Warwick, UK, pages 880–889.


[CSU, 2003] CSU (2003). Colorado State University Face Recognition Algorithms Evaluation, http://www.cs.colostate.edu/evalfacerec/index.html.

[Czyk et al., 2004] Czyk, J., Sadeghi, M., Kittler, J., and Vandendorpe, L. (2004). Decision fusion for face authentication. In Zhang, D. and Jain, A. K., editors, International Conference on Biometric Authentication, volume 3072/2004, pages 686–693. Springer-Verlag GmbH.

[Daugman, 1980] Daugman, J. G. (1980). Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, 20:847–856.

[Daugman, 1985] Daugman, J. G. (1985). Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by Two-dimensional Visual Cortical Filters. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2(7):1160–1169.

[Daugman, 1988] Daugman, J. G. (1988). Complete Discrete 2D Gabor Transforms by Neural Networks for Image Analysis and Compression. IEEE Trans. on Acoustics, Speech and Signal Processing, 36(7):1169–1179.

[Dempster et al., 1977] Dempster, A., Laird, N., and Rubin, D. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

[Do and Vetterli, 2002] Do, M. N. and Vetterli, M. (2002). Wavelet-Based Texture Retrieval Using Generalized Gaussian Density and Kullback-Leibler Distance. IEEE Transactions on Image Processing, 11(2):146–158.

[DoD, 2000] DoD (2000). DoD Biometrics, http://www.biometrics.dod.mil/.

[Duc et al., 1999] Duc, B., Fischer, S., and Bigun, J. (1999). Face Authentication with Gabor Information on Deformable Graphs. IEEE Transactions on Image Processing, 8(4):504–516.

[EBF, 2003] EBF (2003). European Biometrics Forum, http://www.eubiometricforum.com.

[Eickeler et al., 1999] Eickeler, S., Muller, S., and Rigoll, G. (1999). High Performance Face Recognition Using Pseudo 2D-Hidden Markov Models. In European Control Conference (ECC), Karlsruhe, Germany.

[Eltoft et al., 2006] Eltoft, T., Kim, T., and Lee, T.-W. (2006). On the Multivariate Laplace Distribution. IEEE Signal Processing Letters, 13(5):300–303.


[Er et al., 2002a] Er, M. J., Wu, S., Lu, J., and Toh, H. L. (2002a). Face recognition with radial basis function (RBF) neural networks. IEEE Transactions on Neural Networks, 13(3):697–710.

[Er et al., 2002b] Er, M. J., Wu, S., Lu, J., and Toh, H. L. (2002b). Face recognition with radial basis function (RBF) neural networks. IEEE Transactions on Neural Networks, 13(3):697–710.

[Fadili and Boubchir, 2005] Fadili, M.-J. and Boubchir, L. (2005). Analytical form for a Bayesian wavelet estimator of images using the Bessel K form densities. IEEE Transactions on Image Processing, 14(2):231–240.

[Florack, 1993] Florack, L. (1993). The Syntactical Structure of Scalar Images. PhD thesis, Utrecht University, Utrecht, The Netherlands.

[Freund and Schapire, 1995] Freund, Y. and Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37.

[FRGC, 2004] FRGC (2004). Face Recognition Grand Challenge, http://www.frvt.org/frgc/.

[FRVT, 2000] FRVT (2000). Face Recognition Vendor Test, http://www.frvt.org/.

[Gabor, 1946] Gabor, D. (1946). Theory of communications. Journal of the Institution of Electrical Engineers, 93(26):429–457.

[Gao et al., 2008] Gao, W., Cao, B., Shan, S., Chen, X., Zhou, D., Zhang, X., and Zhao, D. (2008). The CAS-PEAL Large-Scale Chinese Face Database and Baseline Evaluations. IEEE Transactions on Systems, Man and Cybernetics, Part A, 38(1):149–161.

[Gao and Leung, 2002] Gao, Y. and Leung, M. (2002). Face Recognition Using Line Edge Map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(6):764–779.

[Georghiades et al., 2001] Georghiades, A., Belhumeur, P., and Kriegman, D. (2001). From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660.

[Georghiades et al., 1998] Georghiades, A. S., Kriegman, D. J., and Belhumeur, P. N. (1998). Illumination Cones for Recognition under Variable Lighting: Faces. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 52–58.

[Gokberk et al., 2003] Gokberk, B., Irfanoglu, M. O., Akarun, L., and Alpaydin, E. (2003). Optimal Gabor Kernel Selection for Face Recognition. In Proceedings of the IEEE International Conference on Image Processing (ICIP 2003), volume 1, pages 677–680, Barcelona, Spain.

[Goldstein et al., 1971] Goldstein, A., Harmon, L., and Lesk, A. (1971). Identification of Human Faces. Proceedings IEEE, 59(5):748–760.

[Gong et al., 2000] Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: from Images to Face Recognition. Imperial College Press and World Scientific Publishing.

[Gonzalez-Jimenez and Alba-Castro, 2005] Gonzalez-Jimenez, D. and Alba-Castro, J. (2005). Shape Contexts and Gabor Features for Face Description and Authentication. In IEEE International Conference on Image Processing, pages 962–965.

[Gonzalez-Jimenez and Alba-Castro, 2007a] Gonzalez-Jimenez, D. and Alba-Castro, J. (2007a). Shape-Driven Gabor Jets for Face Description and Authentication. IEEE Transactions on Information Forensics and Security, 2(4):769–780.

[Gonzalez-Jimenez and Alba-Castro, 2007b] Gonzalez-Jimenez, D. and Alba-Castro, J. (2007b). Toward Pose-Invariant 2-D Face Recognition Through Point Distribution Models and Facial Symmetry. IEEE Transactions on Information Forensics and Security, 2(3):413–429.

[Gonzalez-Jimenez and Alba-Castro, 2006] Gonzalez-Jimenez, D. and Alba-Castro, J. L. (2006). Pose Correction and Subject-Specific Features for Face Authentication. In International Conference on Pattern Recognition, volume 4, pages 602–605.

[Gonzalez-Jimenez et al., 2007a] Gonzalez-Jimenez, D., Argones-Rua, E., Alba-Castro, J., and Kittler, J. (2007a). Evaluation of Point Selection and Similarity Fusion Methods for Gabor Jets-Based Face Verification. IET Computer Vision, 1(3–4):101–112.

[Gonzalez-Jimenez et al., 2007b] Gonzalez-Jimenez, D., Bicego, M., Tangelder, J., Schouten, B., Ambekar, O., Alba-Castro, J., Grosso, E., and Tistarelli, M. (2007b). Distance Measures for Gabor Jets-based Face Authentication: A Comparative Evaluation. In International Conference on Biometrics, pages 474–483.

[Gonzalez-Jimenez et al., 2007c] Gonzalez-Jimenez, D., Perez-Gonzalez, F., Comesana-Alfaro, P., Perez-Freire, L., and Alba-Castro, J. (2007c). Modeling Gabor Coefficients via Generalized Gaussian Distributions for Face Recognition. In International Conference on Image Processing, pages 485–488.

[Gonzalez-Jimenez et al., 2006] Gonzalez-Jimenez, D., Sukno, F., Alba-Castro, J., and Frangi, A. (2006). Automatic Pose Correction for Local Feature-Based Face Authentication. In IAPR Conference on Articulated Motion and Deformable Objects, pages 356–365.

[Gross, 2005] Gross, R. (2005). Face databases. In Li, S. and Jain, A., editors, Handbook of Face Recognition. Springer, New York.

[Gross et al., 2004] Gross, R., Matthews, I., and Baker, S. (2004). Appearance-Based Face Recognition and Light-Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4):449–465.

[Gross et al., 2001] Gross, R., Shi, J., and Cohn, J. (2001). Quo Vadis Face Recognition? In Third Workshop on Empirical Evaluation Methods in Computer Vision.

[Heisele et al., 2001] Heisele, B., Ho, P., and Poggio, T. (2001). Face Recognition with Support Vector Machines: Global versus Component-based Approach. In The Eighth IEEE International Conference on Computer Vision (ICCV'01), pages 688–694, Vancouver, Canada. IEEE Computer Society.

[Hernandez et al., 2000] Hernandez, J., Amado, M., and Perez-Gonzalez, F. (2000). DCT-domain watermarking techniques for still images: Detector performance analysis and a new structure. IEEE Transactions on Image Processing, 9(1):55–68.

[Heusch et al., 2005] Heusch, G., Rodriguez, Y., and Marcel, S. (2005). Local Binary Patterns as an Image Preprocessing for Face Authentication. Technical report, IDIAP.

[Hietmeyer, 2000] Hietmeyer, R. (2000). Biometric Identification Promises Fast and Secure Processing of Airline Passengers. The International Civil Aviation Organization Journal, 55(9):10–11.

[Huber, 1981] Huber, P. (1981). Robust Statistics. Wiley, New York.

[IBG, 1996] IBG (1996). International Biometric Group, http://www.biometricgroup.com.

[Ichikawa et al., 2006] Ichikawa, K., Mita, T., and Hori, O. (2006). Component-based robust face detection using AdaBoost and decision tree. In Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition, pages 413–420.


[Identix, 1982] Identix (1982). http://www.identix.com/.

[IEEE-TIFS, 2006] IEEE-TIFS (2006). IEEE Transactions on Information Forensics and Security, http://www.ieee.org/organizations/society/sp/tifs.html.

[Jacobs et al., 1998] Jacobs, D. W., Belhumeur, P. N., and Basri, R. (1998). Comparing images under variable illumination. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 610–617.

[Jain et al., 1999] Jain, A., Bolle, R., and Pankanti, S. (1999). BIOMETRICS: Personal Identification in a Networked Society. Kluwer Academic Publishers.

[Jain and Zongker, 1997] Jain, A. and Zongker, D. (1997). Feature Selection: Evaluation, Application and Small Sample Performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2):153–158.

[Jenkins and Burton, 2008] Jenkins, R. and Burton, M. (2008). Identification of Human Faces. Science, 319(5862):435.

[Jiao et al., 2002] Jiao, F., Gao, W., and Shan, S. (2002). A Face Recognition Method Based on Local Feature Analysis. In Asian Conference on Computer Vision, pages 188–192.

[Joachims, 2002] Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines, volume 668 of International Series in Engineering and Computer Science. Springer Verlag.

[Joshi and Fischer, 1995] Joshi, R. and Fischer, T. (1995). Comparison of generalized Gaussian and Laplacian modeling in DCT image coding. IEEE Signal Processing Letters, 2(5):81–82.

[Kanade, 1973] Kanade, T. (1973). Picture Processing System by Computer Complex and Recognition of Human Faces. PhD thesis, Kyoto University.

[Kanade et al., 1998] Kanade, T., Saito, H., and Vedula, S. (1998). The 3D room: Digitizing time-varying 3D events by synchronized multiple video streams. Technical Report CMU-RI-TR-98-34, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

[Kanade and Yamada, 2003] Kanade, T. and Yamada, A. (2003). Multi-Subregion Based Probabilistic Approach Toward Pose-Invariant Face Recognition. In Proceedings of 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), pages 954–959.


[Kela et al., 2006] Kela, N., Rattani, A., and Gupta, P. (2006). Illumination Invariant Elastic Bunch Graph Matching for Efficient Face Recognition. In Conference on Computer Vision and Pattern Recognition Workshop, pages 42–47.

[Kittler et al., 1998] Kittler, J., Hatef, M., Duin, R. P. W., and Matas, J. (1998). On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3).

[Kohir and Desai, 1998] Kohir, V. and Desai, U. (1998). Face Recognition Using a DCT-HMM Approach. In Proceedings of the 4th IEEE Workshop on Applications of Computer Vision (WACV'98), pages 226–231.

[Kotropoulos et al., 2000] Kotropoulos, C., Tefas, A., and Pitas, I. (2000). Frontal Face Authentication Using Morphological Elastic Graph Matching. IEEE Transactions on Image Processing, 9(4):555–560.

[L-1, 2005] L-1 (2005). L-1 Identity Solutions, http://www.l1id.com/.

[Lades et al., 1993] Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., v. d. Malsburg, C., and Konen, W. (1993). Distortion Invariant Object Recognition in the Dynamic Link Architecture. IEEE Transactions on Computers, 42(3):300–310.

[Lanitis et al., 1997] Lanitis, A., Taylor, C., and Cootes, T. (1997). Automatic Interpretation and Coding of Face Images Using Flexible Models. IEEE Transactions on Pattern Analysis and Machine Intelligence (Special Issue on Face and Gesture Recognition), 19(7):743–756.

[Lee and Surendra, 2003] Lee, M. and Surendra, R. (2003). Pose-Invariant Face Recognition Using a 3D Deformable Model. Pattern Recognition, 36(8):1835–1846.

[Li and Jain, 2005] Li, S. and Jain, A. (2005). Handbook of Face Recognition. Springer Verlag.

[Li and Zhang, 2004] Li, S. Z. and Zhang, Z. (2004). FloatBoost Learning and Statistical Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1112–1123.

[Liao and Li, 2000] Liao, R. and Li, S. (2000). Face recognition based on multiple facial features. In Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pages 239–244.

[Lienhart and Maydt, 2002] Lienhart, R. and Maydt, J. (2002). An Extended Set of Haar-like Features for Rapid Object Detection. In Proceedings of the 2002 International Conference on Image Processing, volume I, pages 900–903.


[Liu, 2004] Liu, C. (2004). Gabor-Based Kernel PCA with Fractional Power Polynomial Models for Face Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):572–581.

[Liu and Chen, 2003] Liu, X. and Chen, T. (2003). Video-Based Face Recognition Using Adaptive Hidden Markov Models. In Computer Vision and Pattern Recognition, volume 1, pages 340–345.

[Liu and Chen, 2005] Liu, X. and Chen, T. (2005). Pose-Robust Face Recognition Using Geometry Assisted Probabilistic Modeling. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 502–509.

[Liu et al., 2003] Liu, Y., Schmidt, K., Cohn, J., and Mitra, S. (2003). Facial asymmetry quantification for expression invariant human identification. Computer Vision and Image Understanding, 91(1-2):138–159.

[Lloyd, 1957] Lloyd, S. (1957). Least Squares Quantization in PCM. Technical report, Bell Laboratories.

[Lloyd, 1982] Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, IT-28(2):129–137.

[Lopez et al., 2000] Lopez, A. M., Lloret, D., Serrat, J., and Villanueva, J. (2000). Multilocal Creaseness Based on the Level-Set Extrinsic Curvature. Computer Vision and Image Understanding, 77:111–144.

[Lopez et al., 1999] Lopez, A. M., Lumbreras, F., Serrat, J., and Villanueva, J. J. (1999). Evaluation of Methods for Ridge and Valley Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4):327–335.

[Lowe, 2004] Lowe, D. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110.

[Lu et al., 2006] Lu, J., Plataniotis, K., Venetsanopoulos, A., and Li, S. (2006). Ensemble-based discriminant learning with boosting for face recognition. IEEE Transactions on Neural Networks, 17(1):166–178.

[Lu et al., 2003] Lu, J., Plataniotis, K. N., and Venetsanopoulos, A. N. (2003). Face Recognition Using LDA-based Algorithms. IEEE Transactions on Neural Networks, 14(1):195–200.

[Lucas and Kanade, 1981] Lucas, B. D. and Kanade, T. (1981). An Iterative Image Registration Technique with an Application to Stereo Vision. In DARPA Image Understanding Workshop, pages 121–130.


[Luttin and Maître, 1998] Luttin, J. and Maître, G. (1998). Technical Report RR-21: Evaluation Protocol for the Extended M2VTS Database (XM2VTSDB). Technical report, IDIAP.

[Lyons et al., 2000] Lyons, M., Campbell, R., Plante, A., Coleman, M., Kamachi, M., and Akamatsu, S. (2000). The Noh Mask Effect: Vertical Viewpoint Dependence of Facial Expression Perception. Proceedings of the Royal Society of London B, 267:2239–2245.

[MacLennan, 1991] MacLennan, B. (1991). Gabor representations of spatiotemporal visual images. Technical report, Knoxville, TN, USA.

[Mallat, 1989] Mallat, S. (1989). A Theory for Multiresolution Signal Decomposition: the Wavelet Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693.

[Maltoni et al., 2003] Maltoni, D., Maio, D., Jain, A., and Prabhakar, S. (2003). Handbook of Fingerprint Recognition. Springer Verlag.

[Martínez and Benavente, 1998] Martínez, A. and Benavente, R. (1998). The AR Face Database. Technical report, Computer Vision Center (CVC).

[Martinez and Kak, 2001] Martinez, A. and Kak, A. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233.

[Martínez, 2003] Martínez, A. M. (2003). Recognizing Expression Variant Faces from a Single Sample per Class. In IEEE International Conference on Computer Vision and Pattern Recognition, 2003, volume 1, pages 353–358.

[Matas et al., 2000] Matas, J., Hamouz, M., Jonsson, K., Kittler, J., Li, Y., Kotropoulos, C., Tefas, A., Pitas, I., Tan, T., Yan, H., Smeraldi, F., Bigun, J., Capdevielle, N., Gerstner, W., Ben-Yacoub, S., Abeljaoued, Y., and Mayoraz, E. (2000). Comparison of face verification results on the XM2VTS database. In International Conference on Pattern Recognition, volume 4, pages 858–863.

[Matthews and Baker, 2003] Matthews, I. and Baker, S. (2003). Active appearance models revisited. Technical Report CMU-RI-TR-03-02, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

[Maurer and Malsburg, 1996] Maurer, T. and Malsburg, C. (1996). Single View-based Recognition of Faces Rotated in Depth. In Proceedings of the International Workshop on Automatic Face and Gesture Recognition, pages 176–181.

[Max, 1960] Max, J. (1960). Quantizing for Minimum Distortion. IRE Transactions on Information Theory, IT-6:7–12.


[Mayoue et al., 2007] Mayoue, A., Allano, L., Dorizzi, B., Verdet, F., and Hennebert,J. (2007). Biosecure Multimodal Evaluation Campaign 2007. Mobile Scenario -Experimental Results. Technical report, Biosecure NoE.

[Messer et al., 2004a] Messer, K., Kittler, J., Sadeghi, M., Hamouz, M., Kostin, A., Cardinaux, F., Marcel, S., Bengio, S., Sanderson, C., Poh, N., Rodriguez, Y., Czyz, J., Vandendorpe, L., McCool, C., Lowther, S., Sridharan, S., Chandran, V., Palacios, R. P., Vidal, E., Bai, L., Shen, L., Wang, Y., Yueh-Hsuan, C., Hsien-Chang, L., Yi-Ping, H., Heinrichs, A., Mueller, M., Tewes, A., von der Malsburg, C., Wurtz, R., Wang, Z., Xue, F., Ma, Y., Yang, Q., Fang, C., Ding, X., Lucey, S., Goss, R., and Schneiderman, H. (2004a). Face Authentication Test on the BANCA Database. In IEEE 17th International Conference on Pattern Recognition.

[Messer et al., 2004b] Messer, K., Kittler, J., Sadeghi, M., Hamouz, M., Kostin, A., Cardinaux, F., Marcel, S., Bengio, S., Sanderson, C., Poh, N., Rodriguez, Y., Czyz, J., Vandendorpe, L., McCool, C., Lowther, S., Sridharan, S., Chandran, V., Palacios, R., Vidal, E., Bai, L., Shen, L., Wang, Y., Yueh-Hsuan, C., Hsien-Chang, L., Yi-Ping, H., Heinrichs, A., Muller, M., Tewes, A., von der Malsburg, C., Wurtz, R., Wang, Z., Xue, F., Ma, Y., Yang, Q., Fang, C., Ding, X., Lucey, S., Goss, R., and Schneiderman, H. (2004b). Face Authentication Test on the BANCA Database. In International Conference on Pattern Recognition, volume 4, pages 523–532.

[Messer et al., 2004c] Messer, K., Kittler, J., Sadeghi, M., Hamouz, M., Kostin, A., Marcel, S., Bengio, S., Cardinaux, F., Sanderson, C., Poh, N., Rodriguez, Y., Kryszczuk, K., Czyz, J., Vandendorpe, L., Ng, J., Cheung, H., and Tang, B. (2004c). Face Authentication Competition on the BANCA Database. In International Conference on Biometric Authentication, pages 8–15.

[Messer et al., 2003] Messer, K., Kittler, J., Sadeghi, M., Marcel, S., Marcel, C., Bengio, S., Cardinaux, F., Sanderson, C., Czyz, J., Vandendorpe, L., Srisuk, S., Petrou, M., Kurutach, W., Kadyrov, A., Paredes, R., Kepenekci, B., Tek, F., Akar, G., Deravi, F., and Mavity, N. (2003). Face Verification Competition on the XM2VTS Database. In Proceedings Audio- and Video-based Biometric Person Authentication.

[Messer et al., 2006] Messer, K., Kittler, J., Short, J., Heusch, G., Cardinaux, F., Marcel, S., Rodriguez, Y., Shan, S., Su, Y., Gao, W., and Chen, X. (2006). Performance Characterisation of Face Recognition Algorithms. In International Conference on Biometric Authentication, pages 1–11, Hong Kong. Springer Verlag, LNCS.

[Messer et al., 1999] Messer, K., Matas, J., Kittler, J., Luettin, J., and Maitre, G. (1999). XM2VTSDB: The Extended M2VTS Database. Audio- and Video-Based Biometric Person Authentication, pages 72–77.

[Moghaddam et al., 2000] Moghaddam, B., Jebara, T., and Pentland, A. (2000). Bayesian face recognition. Pattern Recognition, 33(11):1771–1782.

[Moghaddam and Pentland, 1997] Moghaddam, B. and Pentland, A. (1997). Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696–710.

[Morik et al., 1999] Morik, K., Brockhausen, P., and Joachims, T. (1999). Combining statistical learning with a knowledge-based approach - A case study in intensive care monitoring. In International Conference on Machine Learning, pages 268–277.

[Moses et al., 1994] Moses, Y., Adini, Y., and Ullman, S. (1994). Face Recognition: The Problem of Compensating for Changes in Illumination Direction. In Proceedings of the European Conference on Computer Vision, volume A, pages 286–296.

[Moulin and Liu, 1999] Moulin, P. and Liu, J. (1999). Analysis of Multiresolution Image Denoising Schemes Using Generalized Gaussian and Complexity Priors. IEEE Transactions on Information Theory, 45(3):909–919.

[Mu et al., 2003] Mu, X., Hassoun, M., and Watta, P. (2003). Combining Gabor features: summing vs. voting in human face recognition. In IEEE International Conference on Systems, Man and Cybernetics, volume 1, pages 737–743.

[Muller, 1993] Muller, F. (1993). Distribution Shape of Two-Dimensional DCT Coefficients of Natural Images. Electronics Letters, 29(22):1935–1936.

[Nefian and Hayes, 1998] Nefian, A. and Hayes, M. (1998). Hidden Markov models for face recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 2721–2724.

[Omniperception, 2001] Omniperception (2001). http://www.omniperception.com/.

[Osuna et al., 1997] Osuna, E., Freund, R., and Girosi, F. (1997). Training Support Vector Machines: an Application to Face Detection. In Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition, pages 130–136.

[Otero-Muras et al., 2007] Otero-Muras, E., Gonzalez-Agulla, E., Alba-Castro, J. L., García-Mateo, C., and Marquez-Florez, O. W. (2007). An Open Framework For Distributed Biometric Authentication In A Web Environment. Accepted in Annals of Telecommunications: Special Issue on Multimodal Biometrics.

[Pearson et al., 1990] Pearson, D., Hanna, E., and Martinez, K. (1990). Computer-Generated Cartoons. In Barlow, H., Blakemore, C., and Weston-Smith, M., editors, Images and Understanding, pages 46–60. Cambridge University Press, Cambridge.

[Penev and Atick, 1996] Penev, P. S. and Atick, J. J. (1996). Local Feature Analysis: A General Statistical Theory for Object Representation. Network: Computation in Neural Systems, 7(3):447–500.

[Pentland et al., 1994] Pentland, A., Moghaddam, B., and Starner, T. (1994). View-based and Modular Eigenspaces for Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 84–91.

[Phillips et al., 2005] Phillips, P., Flynn, P., Scruggs, T., Bowyer, K., Chang, J., Hoffman, K., Marques, J., Min, J., and Worek, W. (2005). Overview of the Face Recognition Grand Challenge. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 947–954.

[Phillips et al., 2000a] Phillips, P., Martin, A., Wilson, C., and Przybocki, M. (2000a). An Introduction to Evaluating Biometric Systems. Computer, 33(2):56–63.

[Phillips et al., 2000b] Phillips, P., Moon, H., Rizvi, S., and Rauss, P. (2000b). The FERET Evaluation Methodology for Face Recognition Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1090–1104.

[Phillips et al., 2006] Phillips, P. J., Flynn, P. J., Scruggs, W. T., Bowyer, K. W., and Worek, W. J. (2006). Preliminary face recognition grand challenge results. In FG, pages 15–24. IEEE Computer Society.

[Potzsch et al., 1996] Potzsch, M., Maurer, T., Wiskott, L., and von der Malsburg, C. (1996). Reconstruction from Graphs Labeled with Responses of Gabor Filters. In International Conference on Artificial Neural Networks (ICANN), pages 845–850.

[Pudil et al., 1994] Pudil, P., Ferri, F. J., Novovicova, J., and Kittler, J. (1994). Floating Search Methods for Feature Selection with Nonmonotonic Criterion Functions. In Proceedings ICPR, volume 2, pages 279–283.

[Pujol et al., 2001] Pujol, A., Lopez, A. M., Alba-Castro, J. L., and Villanueva, J. J. (2001). Ridges, Valleys and Hausdorff Based Similarity Measures for Face Description and Matching. In Proceedings International Workshop on Pattern Recognition and Information Systems, pages 80–90.

[Rabiner, 1989] Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286.

[Ramachandran et al., 2005] Ramachandran, M., Zhou, S., Jhalani, D., and Chellappa, R. (2005). A method for converting a smiling face to a neutral face with applications to face recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages ii/977–ii/980.

[Romdhani et al., 2002] Romdhani, S., Blanz, V., and Vetter, T. (2002). Face Identification by Fitting a 3D Morphable Model Using Linear Shape and Texture Error Functions. In Proceedings European Conference on Computer Vision, pages 3–19.

[Romdhani et al., 1999] Romdhani, S., Gong, S., and Psarrou, A. (1999). A Multi-View Nonlinear Active Shape Model Using Kernel PCA. In British Machine Vision Conference, pages 483–492.

[Ross et al., 2006] Ross, A. A., Nandakumar, K., and Jain, A. K. (2006). Handbook of Multibiometrics. Springer.

[Rowley et al., 1998] Rowley, H. A., Baluja, S., and Kanade, T. (1998). Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38.

[Sato et al., 1998] Sato, K., Shah, S., and Aggarwal, J. (1998). Partial face recognition using radial basis function networks. In Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 288–293.

[SC37, 2002] SC37 (2002). International Organization for Standardization, ISO JTC 1/SC 37, http://www.iso.org/.

[Schmid and Mohr, 1997] Schmid, C. and Mohr, R. (1997). Local greyvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):530–535.

[Schneiderman and Kanade, 2004] Schneiderman, H. and Kanade, T. (2004). Object detection using the statistics of parts. International Journal of Computer Vision, 56(3):151–177.

[Schwarz, 1978] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.

[SecurePhone, 2004] SecurePhone (2004). http://www.secure-phone.info/.

[Shan et al., 2005] Shan, S., Yang, P., Chen, X., and Gao, W. (2005). AdaBoost Gabor Fisher Classifier for Face Recognition. In Proceedings of IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pages 278–291.

[Shapiro, 1993] Shapiro, J. (1993). Embedded Image Coding Using Zerotrees of Wavelet Coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462.

[Sharifi and Leon-Garcia, 1995] Sharifi, K. and Leon-Garcia, A. (1995). Estimation of Shape Parameter for Generalized Gaussian Distributions in Subband Decompositions of Video. IEEE Transactions on Circuits and Systems for Video Technology, 5(1):52–56.

[Shashua, 1992] Shashua, A. (1992). Geometry and Photometry in 3D Visual Recognition. PhD thesis, M.I.T. Artificial Intelligence Laboratory.

[Shashua and Riklin-Raviv, 2001] Shashua, A. and Riklin-Raviv, T. (2001). The quotient image: class-based re-rendering and recognition with varying illuminations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):129–139.

[Shen and Bai, 2006a] Shen, L. and Bai, L. (2006a). A Review on Gabor Wavelets for Face Recognition. Pattern Analysis and Applications, 9(2):273–292.

[Shen and Bai, 2006b] Shen, L. and Bai, L. (2006b). Mutualboost Learning for Selecting Gabor Features for Face Recognition. Pattern Recognition Letters, 27(15):1758–1767.

[Shen et al., 2005] Shen, L., Bai, L., Bardsley, D., and Wang, Y. (2005). Gabor feature selection for face recognition using improved AdaBoost learning. In IWBRS, pages 39–49.

[Shin et al., 2007] Shin, H.-C., Park, J. H., and Kim, S.-D. (2007). Combination of warping robust elastic graph matching and kernel-based projection discriminant analysis for face recognition. IEEE Transactions on Multimedia, 9(6):1125–1136.

[Sim et al., 2003] Sim, T., Baker, S., and Bsat, M. (2003). The CMU Pose, Illumination, and Expression Database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12):1615–1618.

[Simoncelli and Adelson, 1996] Simoncelli, E. and Adelson, E. (1996). Noise Removal via Bayesian Wavelet Coring. In Third International Conference on Image Processing, volume I, pages 379–382.

[Sinha et al., 2006] Sinha, P., Balas, B., Ostrovsky, Y., and Russell, R. (2006). Face Recognition by Humans: Nineteen Results All Computer Vision Researchers Should Know About. Proceedings of the IEEE, 94(11):1948–1962.

[Smeraldi and Bigun, 2002] Smeraldi, F. and Bigun, J. (2002). Retinal Vision Applied to Facial Features Detection and Face Authentication. Pattern Recognition Letters, 23(4):463–475.

[Sochman and Matas, 2004] Sochman, J. and Matas, J. (2004). AdaBoost with totally corrective updates for fast face detection. In Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pages 445–450.

[Srivastava et al., 2002] Srivastava, A., Liu, X., and Grenander, U. (2002). Universal Analytical Forms for Modeling Image Probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1200–1214.

[Sukno et al., 2007] Sukno, F., Ordas, S., Butakoff, C., Cruz, S., and Frangi, A. (2007). Active Shape Models with Invariant Optimal Features: Application to Facial Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1105–1117.

[Takacs, 1998] Takacs, B. (1998). Comparing Face Images Using the Modified Hausdorff Distance. Pattern Recognition, 31(12):1873–1881.

[Tefas et al., 2001] Tefas, A., Kotropoulos, C., and Pitas, I. (2001). Using Support Vector Machines to Enhance the Performance of Elastic Graph Matching for Frontal Face Authentication. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(7):735–746.

[Turk and Pentland, 1991] Turk, M. and Pentland, A. (1991). Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1):71–86.

[Van de Wouver et al., 1999] Van de Wouver, G., Scheunders, P., and Dyck, D. V. (1999). Statistical Texture Characterization from Discrete Wavelet Representations. IEEE Transactions on Image Processing, 8(4):592–598.

[Vapnik, 2000] Vapnik, V. (2000). The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer Verlag, Berlin.

[Viisage, 1993] Viisage (1993). http://www.viisage.com/ww/en/pub/home.cfm.

[Viola and Jones, 2001] Viola, P. and Jones, M. (2001). Rapid Object Detection using a Boosted Cascade of Simple Features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 511–518. IEEE Computer Society.

[Viola and Jones, 2002] Viola, P. and Jones, M. (2002). Robust Real-Time Object Detection. Int. Journal of Computer Vision.

[Walker et al., 1997] Walker, K., Cootes, T., and Taylor, C. (1997). Correspondence using distinct points based on image invariants. In British Machine Vision Conference, volume 1, pages 540–549.

[Wiskott et al., 1997] Wiskott, L., Fellous, J.-M., Krüger, N., and von der Malsburg, C. (1997). Face recognition by Elastic Bunch Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):775–779.

[Wu, 1983] Wu, C. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103.

[Yale, 1997] Yale (1997). Yale Face Database, http://cvc.yale.edu/projects/yalefaces/yalefaces.html.

[Yang et al., 2005a] Yang, J., Frangi, A., Yang, J.-Y., Zhang, D., and Jin, Z. (2005a). KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(2):230–244.

[Yang et al., 2005b] Yang, J., Gao, X., Zhang, D., and Yang, J.-Y. (2005b). Kernel ICA: An alternative formulation and its application to face recognition. Pattern Recognition, 38(10):1784–1787.

[Yang, 2004] Yang, M.-H. (2004). ICPR 2004 tutorial on Recent Advances in Face Detection. http://vision.ai.uiuc.edu/mhyang/papers/icpr04 tutorial.pdf.

[Yang et al., 1999] Yang, M.-H., Roth, D., and Ahuja, N. (1999). A SNoW-Based Face Detector. In NIPS, pages 862–868.

[Yang et al., 2004] Yang, P., Shan, S., Gao, W., Li, S., and Zhang, D. (2004). Face recognition using Ada-Boosted Gabor features. In Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pages 356–361.

[Zafeiriou et al., 2005] Zafeiriou, S., Tefas, A., and Pitas, I. (2005). Exploiting Discriminant Information in Elastic Graph Matching. In Proceedings IEEE International Conference on Image Processing, volume 3, pages 768–771.

[Zhang et al., 2004] Zhang, G., Huang, X., Li, S., Wang, Y., and Wu, X. (2004). Boosting Local Binary Pattern (LBP)-based Face Recognition. In SINOBIOMETRICS, pages 179–186.

[Zhang et al., 2005] Zhang, L., Ai, H., Xin, S., Huang, C., Tsukiji, S., and Lao, S. (2005). Robust face alignment based on local texture classifiers. In IEEE International Conference on Image Processing, volume II, pages 354–357.

[Zhang and Samaras, 2006] Zhang, L. and Samaras, D. (2006). Face Recognition From a Single Training Image under Arbitrary Unknown Lighting Using Spherical Harmonics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):351–363.

[Zhao et al., 2007] Zhao, L.-H., Zhang, X.-L., and Xu, X.-H. (2007). Face recognition based on KPCA with polynomial kernels. In International Conference on Wavelet Analysis and Pattern Recognition, volume 3, pages 1213–1216.

[Zhao and Chellappa, 1999] Zhao, W. and Chellappa, R. (1999). Robust Face Recognition using Symmetric Shape-from-Shading. Technical Report CAR-TR-919, Center for Automation Research, University of Maryland, College Park.

[Zhao et al., 2003] Zhao, W., Chellappa, R., Phillips, P. J., and Rosenfeld, A. (2003). Face Recognition: A Literature Survey. ACM Computing Surveys, 35(4):399–458.

[Zheng, 2006] Zheng, W. (2006). KDA Plus KPCA for Face Recognition. In Third International Symposium on Neural Networks: Advances in Neural Networks - ISNN, pages 85–92.

[Zhou et al., 2004] Zhou, S., Chellappa, R., and Moghaddam, B. (2004). Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Transactions on Image Processing, 13(11):1491–1506.

[Zhou et al., 2003] Zhou, S., Krueger, V., and Chellappa, R. (2003). Probabilistic Recognition of Human Faces from Video. Computer Vision and Image Understanding, 91(1–2):214–245. Special Issue on Face Recognition.