System Description
Input: speech signal
Output: facial animation of a generic 3D face in the MPEG-4 standard
Agenda
MPEG-4 Standard
Speech Processing
Different Approaches
Learning Phase
Face Feature Extraction
Training Neural Networks
Experimental Results
Conclusion
MPEG-4 Standard
Multimedia communication standard (1999), by the Moving Picture Experts Group
High quality at low bit rates
User interaction with media
Object oriented: media objects with properties
Scalable quality
SNHC (Synthetic/Natural Hybrid Coding): synthetic faces and bodies
Facial Animation in MPEG-4
FDP (Face Definition Parameters): shape and texture
84 feature points
FAP (Face Animation Parameters): for animating the feature points
68 parameters
High-level and low-level parameters
Global and local parameters
Expressed in FAP units
Speech Processing
Phases:
Noise reduction (simple noise)
Framing
Feature extraction
Speech features: LPC, MFCC, Delta MFCC, Delta-Delta MFCC
[Diagram: each speech frame (Frame 1, Frame 2, …) is mapped to its own feature vector (X1, X2, …)]
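The framing step can be sketched as follows. The frame and hop lengths here are assumptions (a 20 ms hop at 16 kHz matches the 50 fps speech rate mentioned later), and log energy stands in for the LPC/MFCC features actually used:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=40, hop_ms=20):
    """Split a 1-D speech signal into overlapping frames (50 frames/s at hop_ms=20)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def log_energy(frames):
    """One scalar feature per frame; a real system would compute LPC/MFCC here."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

# One second of noise -> 49 frames of 640 samples each
sig = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(sig)
feats = log_energy(frames)
```

Each frame then yields one feature vector, exactly as in the diagram above.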
Two Approaches
Phoneme-Viseme Mapping Approaches
Transitions among visemes
Discrete phonetic units
Extremely stylized
Language dependent
Acoustic-Visual Mapping Approaches
Relation between speech features and facial expressions
Functional approximation
Language independent
Neural networks and HMMs as learning machines for the mapping
Learning Phase
[Diagram: FAPs are extracted from the speaker video, features are extracted from the speech stream, the paired features and FAPs train the neural network, and the resulting FAPs drive a FAP player]
Face Feature Extraction
Deformable-template-based approach
Semi-automatic
Candide model
A wireframe model for model-based coding
Parameterized
113 vertices, 168 faces
Candide Model
Parameters of the wireframe model:
Global: 3D rotation, 2D translation, scale
Shape units: lip width, eye distance, …
Action units: lip shape, eyebrows, …
Each parameter value is a real number
Texture
Transformation
[Diagram: a transformation maps points P in the source image to P* in the target, given three point correspondences:
(a1, b1) → (x1, y1), (a2, b2) → (x2, y2), (a3, b3) → (x3, y3)]
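The transformation determined by three point correspondences is affine; it can be recovered by solving a 6×6 linear system, as in this numpy sketch (the function name is illustrative):

```python
import numpy as np

def affine_from_correspondences(src, dst):
    """Solve for the 2-D affine map (x, y) = A @ (a, b) + t from three
    source->target point pairs (six equations, six unknowns)."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    # Rows for x-coordinates use [a, b, 1, 0, 0, 0]; y-coordinates the mirror.
    M = np.zeros((6, 6))
    rhs = np.zeros(6)
    for i, ((a, b), (x, y)) in enumerate(zip(src, dst)):
        M[2 * i]     = [a, b, 1, 0, 0, 0]
        M[2 * i + 1] = [0, 0, 0, a, b, 1]
        rhs[2 * i], rhs[2 * i + 1] = x, y
    p = np.linalg.solve(M, rhs)
    A = np.array([[p[0], p[1]], [p[3], p[4]]])
    t = np.array([p[2], p[5]])
    return A, t

# Example: a pure translation by (5, -2)
A, t = affine_from_correspondences([(0, 0), (1, 0), (0, 1)],
                                   [(5, -2), (6, -2), (5, -1)])
```

With non-degenerate correspondences the system has a unique solution, which is why exactly three pairs suffice.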
Model Adaptation
Selecting optimal parameters
Global parameters: 3D rotation, 2D translation, scale
Lip parameters: upper lip, jaw open, lip width, vertical movement of the lip corners
Full search is expensive; use the previous frame's parameters instead
Lip Reading
Use color data to locate the lip area
Use the extracted lip area to estimate the lip model parameters: upper lip, jaw open, mouth width, lip corners
Use the corresponding vertices of the Candide model
Two regions taken from the first frame: lip pixels and non-lip pixels
Lip Area Classification
Fisher Linear Discriminant: simple and fast
Two point sets X, Y in n dimensions
m1: mean of the projections of X onto unit vector α
m2: mean of the projections of Y onto unit vector α
Find the α that maximizes
J(α) = (m1 − m2)² / (s1² + s2²)

where s1², s2² are the scatters of the projected sets X and Y
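A compact numpy sketch of the classical closed-form maximizer, α ∝ (S1 + S2)⁻¹(m1 − m2); the synthetic data below merely stands in for lip / non-lip pixel colors:

```python
import numpy as np

def fisher_direction(X, Y):
    """Direction maximizing J(alpha) = (m1 - m2)^2 / (s1^2 + s2^2):
    alpha ~ inv(S1 + S2) @ (mean(X) - mean(Y)), with S1, S2 the class scatters."""
    m1, m2 = X.mean(axis=0), Y.mean(axis=0)
    S1 = (X - m1).T @ (X - m1)
    S2 = (Y - m2).T @ (Y - m2)
    alpha = np.linalg.solve(S1 + S2, m1 - m2)
    return alpha / np.linalg.norm(alpha)

# Two synthetic 3-D "pixel color" classes separated along the first axis
rng = np.random.default_rng(1)
X = rng.normal([5, 0, 0], 1.0, size=(200, 3))   # e.g. lip pixels
Y = rng.normal([0, 0, 0], 1.0, size=(200, 3))   # e.g. non-lip pixels
alpha = fisher_direction(X, Y)
```

Projecting new pixels onto alpha and thresholding then classifies them as lip or non-lip.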
Estimating Lip Parameters
The FLD is trained on the first frame's pixels (per-pixel color data)
HSV works better than RGB: more robust under different brightness conditions
Lip Area Classification
A simple approach to estimating lip parameters: column scanning and row scanning of the classified lip area
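A toy illustration of the row/column-scanning idea on a binary lip mask (the mask and helper name are made up for illustration):

```python
import numpy as np

def lip_extents(mask):
    """Estimate mouth width/height from a binary lip mask by scanning
    for the first and last columns/rows that contain lip pixels."""
    cols = np.flatnonzero(mask.any(axis=0))   # column scan
    rows = np.flatnonzero(mask.any(axis=1))   # row scan
    if cols.size == 0:
        return 0, 0
    width = cols[-1] - cols[0] + 1
    height = rows[-1] - rows[0] + 1
    return width, height

# Toy 6x8 mask with a lip blob spanning columns 2..6 and rows 2..4
mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 2:7] = True
w, h = lip_extents(mask)   # w == 5, h == 3
```

The extents map directly onto mouth-width and jaw-open style parameters of the lip model.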
Generating FAPs from the Model
Generating a FAP file from the model
The FAP file format was worked out by trial and error
Open-source FAP players take a FAP file and a wave file as input
Training Neural Networks
60 videos as the data set: 45 sentences for training, 15 sentences for testing
Multilayer perceptrons: one input layer, one hidden layer, one output layer
Back-propagation algorithm
Nine neurons in the output layer: five global parameters and four lip parameters
Training Neural Networks
Four speech features: LPC, MFCC, Delta MFCC, Delta-Delta MFCC
Six networks for each speech feature:
With one feature vector as input: 30, 60, or 90 neurons in the hidden layer
With three feature vectors as input: 90, 120, or 150 neurons in the hidden layer
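A minimal numpy sketch of one such regression network, assuming a 13-dimensional input feature vector (the slides do not give the feature dimension) and the smallest hidden layer of 30 neurons; the training data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# 13-dim input (assumption), 30 hidden neurons, 9 outputs (5 global + 4 lip)
n_in, n_hid, n_out = 13, 30, 9
W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_hid, n_out)); b2 = np.zeros(n_out)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2            # linear output layer for regression

def train_step(X, T, lr=0.05):
    """One batch of plain back-propagation on the squared error."""
    global W1, b1, W2, b2
    H, Y = forward(X)
    dY = (Y - T) / len(X)                    # output-layer error
    dH = (dY @ W2.T) * (1 - H ** 2)          # back-propagate through tanh
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)
    return float(np.mean((Y - T) ** 2))

# Smoke test: fit a random linear speech-to-FAP mapping
X = rng.normal(size=(200, n_in))
T = X @ rng.normal(size=(n_in, n_out)) * 0.1
losses = [train_step(X, T) for _ in range(500)]
```

In the real system, X would hold one (or three stacked) speech feature vectors per video frame and T the corresponding extracted face parameters.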
Frame rates: video 25 fps, speech 50 fps
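Since speech runs at twice the video rate, each video frame can be paired with two speech feature vectors; a small sketch of this alignment (stacking the two vectors is an assumption, averaging them would also work):

```python
import numpy as np

def align_speech_to_video(speech_feats, ratio=2):
    """Pair each 25-fps video frame with its `ratio` 50-fps speech frames
    by stacking them into one network input vector."""
    n_video = len(speech_feats) // ratio
    return speech_feats[: n_video * ratio].reshape(n_video, -1)

feats = np.arange(12.0).reshape(6, 2)       # 6 speech frames, 2-dim features
aligned = align_speech_to_video(feats)      # 3 video frames, 4-dim inputs
```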
Assessment Criterion
A performance metric for the accuracy of the predicted audio-visual mapping
Correlation coefficient G: equals one when the two vectors are equal
k : frame number
N : number of frames in the test set
G = (1/N) Σ_k (p(k) − p̄)(b(k) − b̄) / (σ_p σ_b)

p(k), b(k): predicted and measured parameter values at frame k
p̄, b̄: their means; σ_p, σ_b: their standard deviations
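The criterion can be computed directly with numpy; the helper name `correlation_g` is illustrative:

```python
import numpy as np

def correlation_g(p, b):
    """G = (1/N) * sum_k (p(k) - p_mean)(b(k) - b_mean) / (sigma_p * sigma_b);
    equals 1 when predicted and measured trajectories match exactly."""
    p, b = np.asarray(p, float), np.asarray(b, float)
    return float(np.mean((p - p.mean()) * (b - b.mean())) / (p.std() * b.std()))

# A measured parameter trajectory compared against itself gives G = 1
b = np.sin(np.linspace(0, 6, 100))
g = correlation_g(b, b)
```

G is averaged over all frames of the test set, one value per predicted parameter.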
Results For LPC Networks
[Bar chart: correlation coefficient G (0 to 0.7) for the six LPC networks; series: Upper Lip, Jaw Open, Mouth Width, Mouth Corners, Mean]
Results For MFCC Networks
[Bar chart: correlation coefficient G (0 to 0.8) for the six MFCC networks; series: Upper Lip, Jaw Open, Mouth Width, Mouth Corners, Mean]
Results For Delta MFCC Networks
[Bar chart: correlation coefficient G (0 to 0.8) for the six Delta MFCC networks; series: Upper Lip, Jaw Open, Mouth Width, Mouth Corners, Mean]
Results For Delta Delta MFCC Networks
[Bar chart: correlation coefficient G (0 to 0.8) for the six Delta-Delta MFCC networks; series: Upper Lip, Jaw Open, Mouth Width, Mouth Corners, Mean]
Conclusion
Speech-driven facial animation is possible!
Delta-Delta MFCC has the best performance
Using the previous and next speech frames improves performance
Combining different speech features is worth exploring
Future Work
More training data
Speaker-independent training data
Multiple languages
Other speech features
Combinations of speech features
Facial emotions
HMMs for storing the mappings