SecurePhone Workshop - 24/25 June 2004
Speaking Faces Verification
Kevin McTait, Raphaël Blouet, Gérard Chollet
Silvia Colón, Guido Aversano
Outline
- Speaking faces verification problem
- State of the art in speaking faces verification
- Choice of system architecture
- Fusion of audio and visual modalities
- Initial results using BANCA database (Becars: voice only system)
Problem definition
- Detection and tracking of lips in the video sequence:
  - Locate head/face in image frame
  - Locate mouth/lips area (Region of Interest)
  - Determine/calculate lip contour coordinates and intensity parameters (visual feature extraction)
  - Other parameters: visible teeth, tongue, jaw movement, eyebrows, cheeks etc.
- Modelling parameters
  - Model deformation of lip (or other) parameters over time: HMMs, GMMs…
  - Fusion of visual and acoustic parameters/models
- Calculate likelihood of model relative to client/world model in order to accept/reject
- Augment in-house speaker verification system (Becars) with visual parameters
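The accept/reject step against a client and a world model can be sketched as a log-likelihood ratio test. This is a minimal illustration, not the Becars implementation: the diagonal-covariance GMM representation, the per-frame averaging, the function names and the threshold of 0.0 are all assumptions made for the example.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, gmm):
    """Log density of a GMM given as a list of (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def llr_score(frames, client_gmm, world_gmm):
    """Average per-frame log-likelihood ratio, client model vs. world model."""
    total = sum(log_gmm(f, client_gmm) - log_gmm(f, world_gmm) for f in frames)
    return total / len(frames)

def verify(frames, client_gmm, world_gmm, threshold=0.0):
    """Accept the claimed identity if the score reaches the (assumed) threshold."""
    return llr_score(frames, client_gmm, world_gmm) >= threshold

# Toy models (hypothetical values): one-component client, two-component world.
client = [(1.0, [1.0, 1.0], [1.0, 1.0])]
world = [(0.5, [-1.0, -1.0], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
client_frames = [[0.9, 1.1], [1.2, 0.8]]
impostor_frames = [[-1.0, -1.0]]

accepted = verify(client_frames, client, world)
rejected = not verify(impostor_frames, client, world)
```

In practice the threshold would be tuned on development data rather than fixed at zero.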
Limitations
- Limited device (storage and CPU processing power)
- Subject variability (aging, beard, glasses…), pose, illumination
- Low complexity algorithms
  - Subspace transforms, learning methods
  - Image based approaches, hue colouration/chromaticity clues
  - Model based approaches
Active Shape Models
- Identification: based on spatio-temporal analysis of the video sequence
- Person represented by a deformable parametric model of the visible speech articulators (usually lips) with their temporal characteristics
- Active Shape Model consists of shape parameters (lip contours) and greyscale/colour intensity (for illumination)
- Model trained on a training set using PCA to recover the principal modes of deformation of the model
- Model used to track lips over time; model parameters recovered from lip tracking results
- Shape and intensity modelled by GMMs, temporal dependencies (state transition probabilities) by HMMs
- Verification: using the Viterbi algorithm, estimate the likelihood that the client model generated the observed sequence of features; accept if it is above a threshold, else reject
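The PCA training step can be illustrated with a small sketch that recovers the first mode of deformation from flattened landmark vectors. Several things here are assumptions for the example: power iteration is swapped in for a full eigen-decomposition, only one mode is kept, the toy landmarks are invented, and the clamp to three standard deviations is a common convention rather than something stated on the slides.

```python
import math

def mean_shape(shapes):
    """Component-wise mean of flattened landmark vectors."""
    n = len(shapes)
    return [sum(s[i] for s in shapes) / n for i in range(len(shapes[0]))]

def covariance(shapes, mean):
    """Sample covariance matrix of the shape vectors."""
    d, n = len(mean), len(shapes)
    cov = [[0.0] * d for _ in range(d)]
    for s in shapes:
        dev = [s[i] - mean[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += dev[i] * dev[j] / (n - 1)
    return cov

def first_mode(cov, iterations=100):
    """Dominant eigenvector and eigenvalue of cov by power iteration.

    Assumes cov is non-zero and the start vector is not orthogonal
    to the dominant eigenvector.
    """
    d = len(cov)
    v = [float(i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    return v, sum(a * b for a, b in zip(v, cv))

def constrain(shape, mean, mode, eigval, limit=3.0):
    """Project a shape onto the mode and clamp the parameter (assumed limit)."""
    b = sum(m * (s - mu) for m, s, mu in zip(mode, shape, mean))
    bound = limit * math.sqrt(eigval)
    b = max(-bound, min(bound, b))
    return [mu + b * m for mu, m in zip(mean, mode)]

# Toy training set: (x, y) of an upper and a lower lip point, mouth opening varies.
training = [[0.0, 1.0, 0.0, -1.0], [0.0, 2.0, 0.0, -2.0], [0.0, 3.0, 0.0, -3.0]]
mean = mean_shape(training)
mode, eigval = first_mode(covariance(training, mean))
# An implausibly wide-open mouth is pulled back towards the training range.
plausible = constrain([0.0, 10.0, 0.0, -10.0], mean, mode, eigval)
```

The clamped reconstruction shows how a model of this kind can only deform to shapes resembling those in the training set.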
Active Shape Models (2)
- Robust detection, tracking & parameterisation of visual features
- Statistical, avoids use of constraints, thresholds, penalties
- Model only allowed to deform to shapes similar to those seen in training set (trained using PCA)
- Represent object by set of labelled points representing contours, height, width, area etc.
- Model consists of 5 Bézier curves (B-spline functions), each defined by two end points P0 and P2 and one control point P1:
P(t) = θ0(t)P0 + θ1(t)P1 + θ2(t)P2
[Figures: point distribution model; shape approximation]
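A single curve segment of this form can be evaluated as below. The slides do not define the blending functions θ0, θ1, θ2; this sketch assumes they are the quadratic Bernstein polynomials, with P1 acting as the control point between the two end points. The function names and the toy coordinates are hypothetical.

```python
def bezier_point(p0, p1, p2, t):
    """Point on a quadratic Bézier curve at parameter t in [0, 1].

    Assumed blending functions (quadratic Bernstein polynomials):
    theta0 = (1 - t)^2, theta1 = 2 t (1 - t), theta2 = t^2.
    """
    theta0 = (1.0 - t) ** 2
    theta1 = 2.0 * t * (1.0 - t)
    theta2 = t ** 2
    x = theta0 * p0[0] + theta1 * p1[0] + theta2 * p2[0]
    y = theta0 * p0[1] + theta1 * p1[1] + theta2 * p2[1]
    return (x, y)

def bezier_curve(p0, p1, p2, n=5):
    """Sample n evenly spaced points (n >= 2) along the curve."""
    return [bezier_point(p0, p1, p2, i / (n - 1)) for i in range(n)]

# Toy upper-lip segment: mouth corners as end points, one control point above.
left_corner, control, right_corner = (0.0, 0.0), (1.0, 2.0), (2.0, 0.0)
contour = bezier_curve(left_corner, control, right_corner)
```

The curve passes through the two end points but only bends towards the control point, which is why three points per segment are enough to describe a smooth lip contour piece.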
Spatio-temporal model
- Visual observation of speaker: O = o1, o2, …, oT
- Assumption: feature vectors follow a normal distribution, as in the acoustic domain, and are modelled by GMMs
- Assumption: temporal changes are piecewise stationary and follow a first-order Markov process
- Each state in the HMM represents several consecutive feature vectors
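Under these assumptions, the best state sequence and its log-likelihood for an observation sequence O can be found with the Viterbi algorithm. The sketch below simplifies each state's emission to a single diagonal Gaussian instead of a GMM; the two-state left-to-right topology, the transition probabilities and the toy observations are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def viterbi(obs, log_init, log_trans, states):
    """Best path and its log-likelihood; states is a list of (mean, var)."""
    n = len(states)
    delta = [log_init[s] + log_gauss_diag(obs[0], *states[s]) for s in range(n)]
    back = []
    for o in obs[1:]:
        prev, delta, step = delta, [], []
        for s in range(n):
            # Best predecessor of state s under the first-order Markov assumption.
            best = max(range(n), key=lambda r: prev[r] + log_trans[r][s])
            step.append(best)
            delta.append(prev[best] + log_trans[best][s]
                         + log_gauss_diag(o, *states[s]))
        back.append(step)
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for step in reversed(back):
        path.append(step[path[-1]])
    path.reverse()
    return delta[last], path

# Toy left-to-right HMM: state 0 (mean 0) then state 1 (mean 3).
states = [([0.0], [1.0]), ([3.0], [1.0])]
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.8), math.log(0.2)], [NEG_INF, 0.0]]
observations = [[0.1], [-0.2], [0.0], [2.9], [3.1]]

score, path = viterbi(observations, log_init, log_trans, states)
```

Each state absorbs a run of consecutive feature vectors, which matches the piecewise-stationary assumption; the resulting score is what a verification system would compare with a threshold.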
Image Based Approach
- Hue and saturation levels to find lip region (ROI)
- Eliminate outliers (red blobs) by constraints (geometric, gradient, saturation)
- Motion constraints: difference image (1d), the pixelwise absolute difference between two adjacent frames
- a) greyscale image
- b) hue image
- c) binary hue/saturation thresholding
- d) accumulated difference image
- e) binary image after thresholding
- f) combined binary image, c AND e
- Find largest connected region
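The chain from colour thresholding to the largest connected region can be sketched on small nested-list images. The threshold values, the convention that red hues lie near 0, the 4-connectivity and all function names are assumptions for the example; the geometric and gradient constraints mentioned above are left out.

```python
from collections import deque

def colour_mask(hue, sat, hue_max, sat_min):
    """1 where hue is close to red (assumed near 0) and saturation is high."""
    return [[1 if h <= hue_max and s >= sat_min else 0
             for h, s in zip(hrow, srow)] for hrow, srow in zip(hue, sat)]

def accumulated_difference(frames):
    """Sum of pixelwise absolute differences between adjacent frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    acc = [[0.0] * cols for _ in range(rows)]
    for a, b in zip(frames, frames[1:]):
        for y in range(rows):
            for x in range(cols):
                acc[y][x] += abs(b[y][x] - a[y][x])
    return acc

def binarise(img, thresh):
    return [[1 if v >= thresh else 0 for v in row] for row in img]

def mask_and(a, b):
    return [[p & q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def largest_region(mask):
    """Pixels (y, x) of the largest 4-connected region of ones."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for y0 in range(rows):
        for x0 in range(cols):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            region, queue = [], deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best

# Toy 4x5 scene: moving red lips, a static red blob (0, 4), a moving grey pixel (3, 0).
hue = [[0.5, 0.5, 0.5, 0.5, 0.02],
       [0.5, 0.03, 0.02, 0.03, 0.5],
       [0.5, 0.02, 0.03, 0.02, 0.5],
       [0.5, 0.5, 0.5, 0.5, 0.5]]
sat = [[0.8] * 5 for _ in range(4)]
frame0 = [[100.0] * 5 for _ in range(4)]
frame1 = [row[:] for row in frame0]
for y in (1, 2):
    for x in (1, 2, 3):
        frame1[y][x] = 140.0
frame1[3][0] = 150.0

c = colour_mask(hue, sat, hue_max=0.05, sat_min=0.5)
e = binarise(accumulated_difference([frame0, frame1]), thresh=20.0)
roi = largest_region(mask_and(c, e))
```

The static red blob fails the motion test and the moving grey pixel fails the colour test, so only the lip pixels survive the AND.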
Image Based Approach (2)
- Derive lip dimensions using colour and edge information
- Markov random field framework to combine the two sources of information and segment lips from background
- Implementation close to completion
Other Approaches
- Deformable template/model/contour based:
  - Geometric shapes, shape models, eigenvectors, appearance models
  - Deform in order to minimise an energy/distance function relating template parameters and image
  - Template matching (correlation), best fit template, active shape models, active appearance models, model fitting problem
- Learning based approach:
  - MLP, SVMs…
- Knowledge based approach:
  - Subject rules or information to find and extract features, eye/nose detection, symmetry
- Visual motion analysis:
  - Motion analysis techniques, motion cues, difference images after thresholding and filtering
  - Optical flow, filter tracking (computationally expensive)
- Hue and saturation thresholding:
  - Intensity of ruddy areas, problem of removing outliers
- Image subspace transforms:
  - DCT, PCA, Discrete Wavelet, KLT (DWT + PCA analysis of ROI), FFT
Fusion of audio-visual information
- Instance of general classifier problem (bimodal classifier)
- 2 observation streams: audio + video providing info about hidden class labels
- Typically each observation stream used to train a single modality classifier
- Aim: combine both streams to produce bimodal classifier to recognise pertinent classes with higher level of accuracy
- 2 general types/levels of fusion:
  - Feature fusion
  - Decision fusion
Feature Fusion
- Feature fusion: HMM classifier on a concatenated feature vector of audio and visual parameters (time-synchronous features, possibly including upsampling)
- Generation process of feature vector
- Using a single-stream HMM with emission (class-conditional observation) probabilities given by a Gaussian distribution
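Feature fusion can be sketched as follows: the slower visual stream is upsampled to the audio frame count, the two vectors are concatenated frame by frame, and the joint vector is scored with a Gaussian emission density. Linear interpolation, the diagonal covariance, the function names and the toy feature values are assumptions for the example, not details taken from the slides.

```python
import math

def upsample(features, n_out):
    """Linearly interpolate a list of feature vectors to n_out frames."""
    n_in = len(features)
    if n_in == 1 or n_out == 1:
        return [list(features[0]) for _ in range(n_out)]
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # position on the input time axis
        i = min(int(pos), n_in - 2)
        frac = pos - i
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(features[i], features[i + 1])])
    return out

def fuse(audio, video):
    """Concatenate time-synchronous audio and (upsampled) visual vectors."""
    video_up = upsample(video, len(audio))
    return [a + v for a, v in zip(audio, video_up)]

def log_gauss_diag(x, mean, var):
    """Log emission density of a diagonal Gaussian over the joint vector."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Toy streams: 5 audio frames (2 parameters), 3 video frames (1 parameter).
audio = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
video = [[0.0], [10.0], [20.0]]

fused = fuse(audio, video)
emission = log_gauss_diag(fused[0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

Because both modalities share one state sequence, this scheme cannot weight the streams separately.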
Decision Fusion
- State synchronous decision fusion
  - Captures reliability of each stream
  - HMM state level: combine single modality HMM classifier outputs
  - Class conditional log-likelihoods from the 2 classifiers linearly combined with appropriate weights
  - Various levels: state (phone, syllable, word…)
  - Multi-stream HMM classifier with weighted state emission probabilities
- Product HMMs, factorial HMMs…
- Other classifiers (SVMs, Bayesian classifiers, MLP…)
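The linear combination of class-conditional log-likelihoods can be sketched as below. The convention that the two stream weights sum to one, the class names, the weight values and the toy scores are assumptions for the example; the slides do not give the weights or the exact emission formula.

```python
def fuse_loglik(ll_audio, ll_video, w_audio):
    """Weighted sum of per-class log-likelihoods from the two classifiers.

    Assumes the stream weights are w_audio and 1 - w_audio.
    """
    w_video = 1.0 - w_audio
    return {c: w_audio * ll_audio[c] + w_video * ll_video[c] for c in ll_audio}

def decide(ll_audio, ll_video, w_audio):
    """Class with the highest fused log-likelihood."""
    fused = fuse_loglik(ll_audio, ll_video, w_audio)
    return max(fused, key=fused.get)

# Toy scores (hypothetical): audio favours the client, video favours the world model.
ll_audio = {"client": -10.0, "world": -12.0}
ll_video = {"client": -8.0, "world": -5.0}

trust_audio = decide(ll_audio, ll_video, w_audio=0.7)
trust_video = decide(ll_audio, ll_video, w_audio=0.2)
```

The same weighted sum can be applied per HMM state, per phone or per utterance; the weight expresses how reliable each stream is assumed to be, for example a lower audio weight in noisy conditions.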
BANCA: results