Speaker-Dependent Audio-Visual Emotion Recognition
Sanaul Haq and Philip J.B. Jackson
Centre for Vision, Speech and Signal Processing, University of Surrey, UK
Outline
Introduction
Method
Audio-visual experiments
Summary
Overview
Introduction
In human-to-human interaction, emotions play an important role by allowing people to express themselves beyond the verbal domain.

A message carries two kinds of information [Busso & Narayanan, 2007]:
• Linguistic information: the textual content of the message.
• Paralinguistic information: body language, facial expressions, pitch and tone of voice, health and identity.

Prior work:
• Audio-based emotion recognition: Borchert & Düsterhöft, 2005; Lin & Wei, 2005; Luengo et al., 2005.
• Visual-based emotion recognition: Ashraf et al., 2007; Bartlett et al., 2005; Valstar et al., 2007.
• Audio-visual emotion recognition: Busso et al., 2004; Schuller et al., 2007; Song et al., 2004.
• Recording of audio-visual database
• Feature extraction
• Feature selection
• Feature reduction
• Classification
Method
• Four English male actors recorded in 7 emotions: anger, disgust, fear, happiness, neutral, sadness and surprise.
• 15 phonetically-balanced TIMIT sentences per emotion: 3 common, 2 emotion-specific and 10 generic sentences.
• Size of database: 480 sentences, 120 utterances per subject.
• For visual features, the face was painted with 60 markers.
• 3dMD dynamic face capture system [3dMD 4D Capture System, 2009].
Audio-visual database
Figure: Subjects with expressions (from left): Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).
Audio features:
• Pitch features: min., max., mean and standard deviation of Mel frequency and its first-order difference.
• Energy features: min., max., mean, standard deviation and range of energies in the speech signal and in the frequency bands 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, 2-4 kHz and 4-8 kHz.
• Duration features: voiced speech, unvoiced speech and sentence durations, voiced-to-unvoiced speech duration ratio, speech rate, voiced-speech-to-sentence duration ratio.
• Spectral features: mean and standard deviation of 12 MFCCs.

Visual features:
• Mean and standard deviation of 2D facial marker coordinates (60 markers).
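The authors extracted these features with the Speech Filing System; as a rough sketch of the idea only (not their pipeline), utterance-level contour statistics and band energies can be computed with NumPy. The function names and the whole-utterance FFT are illustrative assumptions:

```python
import numpy as np

def contour_stats(x):
    """Utterance-level statistics of a frame-wise contour (e.g. pitch or energy)."""
    return {"min": float(np.min(x)), "max": float(np.max(x)),
            "mean": float(np.mean(x)), "std": float(np.std(x))}

def band_energies(signal, sr,
                  bands=((0, 500), (500, 1000), (1000, 2000), (2000, 4000), (4000, 8000))):
    """Energy in each frequency band, from a whole-utterance power spectrum."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return [float(power[(freqs >= lo) & (freqs < hi)].sum()) for lo, hi in bands]

# Example: one second of a 300 Hz tone sampled at 16 kHz, so essentially
# all energy should land in the 0-0.5 kHz band
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 300 * t)
stats = contour_stats(sig)
energies = band_energies(sig, sr)
```

In the actual system such statistics would be computed per utterance from tracked pitch and energy contours rather than from the raw waveform as here.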
Feature extraction
Figure: Audio feature extraction with Speech Filing System software (left), and video data (right) with overlaid tracked marker locations.
Feature extraction
Plus l-Take Away r algorithm [Chen, 1978]
• Combines both Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) algorithms.
• Criterion: Bhattacharyya distance
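A minimal two-class sketch of the forward-selection half of this procedure under the Bhattacharyya criterion (the paper applies the full Plus l-Take Away r procedure over seven classes; the two-class restriction, function names and closed-form Gaussian distance here are simplifying assumptions):

```python
import numpy as np

def bhattacharyya(X1, X2):
    """Bhattacharyya distance between two classes, each modelled as a Gaussian."""
    n = X1.shape[1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False).reshape(n, n)
    S2 = np.cov(X2, rowvar=False).reshape(n, n)
    S = (S1 + S2) / 2.0
    d = m1 - m2
    mean_term = 0.125 * d @ np.linalg.solve(S, d)
    cov_term = 0.5 * np.log(np.linalg.det(S) /
                            np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return mean_term + cov_term

def sfs(X1, X2, k):
    """Sequential Forward Selection: greedily add the feature whose inclusion
    most increases the Bhattacharyya distance between the classes."""
    selected, remaining = [], list(range(X1.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining,
                   key=lambda f: bhattacharyya(X1[:, selected + [f]],
                                               X2[:, selected + [f]]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: only feature 2 separates the two synthetic classes,
# so SFS should pick it first
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, (200, 4))
X2 = rng.normal(0.0, 1.0, (200, 4))
X2[:, 2] += 5.0
chosen = sfs(X1, X2, 2)
```

Plus l-Take Away r alternates l such forward steps with r backward (SBS) steps, which lets the search undo earlier greedy choices.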
Feature selection
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
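A sketch of both reductions in NumPy (illustrative, not the authors' implementation; LDA is shown in its two-class Fisher form, whereas the paper's task is multi-class):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (directions of maximum
    overall variance), via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]   # largest variance first
    return Xc @ eigvecs[:, order]

def lda_direction(X1, X2):
    """Two-class Fisher LDA direction w = Sw^-1 (m1 - m2): maximizes
    between-class distance relative to within-class scatter."""
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    return np.linalg.solve(Sw, X1.mean(axis=0) - X2.mean(axis=0))

# Example: PCA recovers the dominant variance direction first,
# and LDA points along the axis that separates the class means
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
X[:, 1] *= 10.0
Z = pca(X, 2)
X1 = rng.normal(0.0, 1.0, (200, 2))
X2 = rng.normal(0.0, 1.0, (200, 2)) + np.array([3.0, 0.0])
w = lda_direction(X1, X2)
```

The key contrast: PCA is unsupervised and keeps overall variance, while LDA uses class labels to keep discriminative directions.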
Feature reduction
Figure: Illustration of feature reduction: PCA maximizes overall variance, whereas LDA maximizes between-class distance relative to within-class distance.
Gaussian classifier
A Gaussian classifier applies Bayes decision theory, with the class-conditional probability density p(x|ω_i) assumed to be Gaussian for each class ω_i. The Bayes decision rule assigns an observation x to the class with the highest posterior probability: decide ω_i if p(x|ω_i) P(ω_i) > p(x|ω_j) P(ω_j) for all j ≠ i.
Classification
where P(ω_i|x) ∝ p(x|ω_i) P(ω_i) is the posterior probability of class ω_i and P(ω_i) is the prior class probability.
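This decision rule can be sketched as a maximum a posteriori classifier with one full-covariance Gaussian per class (an illustrative implementation, not the authors' code):

```python
import numpy as np

class GaussianClassifier:
    """Bayes decision rule with one full-covariance Gaussian per class:
    predict the class maximizing p(x|class) * P(class)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = {}
        for c in self.classes:
            Xc = X[y == c]
            self.params[c] = (Xc.mean(axis=0),
                              np.cov(Xc, rowvar=False),
                              len(Xc) / len(X))   # mean, covariance, prior
        return self

    def _score(self, x, c):
        # log p(x|c) + log P(c), dropping class-independent constants
        mean, cov, prior = self.params[c]
        d = x - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (d @ np.linalg.solve(cov, d) + logdet) + np.log(prior)

    def predict(self, X):
        return np.array([max(self.classes, key=lambda c: self._score(x, c))
                         for x in X])

# Example: two well-separated synthetic classes are recovered almost perfectly
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(5.0, 1.0, (150, 2))])
y = np.array([0] * 150 + [1] * 150)
preds = GaussianClassifier().fit(X, y).predict(X)
```

Working in the log domain avoids underflow and makes the arg max identical to comparing p(x|ω_i) P(ω_i) directly.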
• Human recognition results provide a useful benchmark for the audio, visual and audio-visual scenarios.
• 10 evaluators: 5 native speakers, 5 non-native; half of the evaluators were female.
• 10 different sets of the data were created using the Balanced Latin Square method [Edwards, 1962].
• Training material per emotion: three facial expression pictures, two audio files, and a short movie clip.
• Emotion groups: Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).
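One standard recipe for a balanced Latin square with an even number of conditions is the Williams design (the study's exact construction is not specified here; the function below is an illustrative sketch):

```python
def balanced_latin_square(n):
    """Balanced Latin square for an even number of conditions (Williams design):
    each condition appears once per row and per column, and every ordered pair
    of conditions appears adjacently equally often across rows."""
    if n % 2 != 0:
        raise ValueError("this construction requires an even number of conditions")
    first, left, right = [0], 1, n - 1
    while len(first) < n:              # build the seed row 0, 1, n-1, 2, n-2, ...
        first.append(left)
        left += 1
        if len(first) < n:
            first.append(right)
            right -= 1
    return [[(v + i) % n for v in first] for i in range(n)]

# Example: 10 presentation orders for 10 stimulus sets
orders = balanced_latin_square(10)
```

Such a design counterbalances order effects: no stimulus set systematically precedes another across evaluators.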
Human evaluation experiments
• Audio
• Visual
• Audio-visual

Decision-level fusion [Haq, Jackson & Edge, 2008; Busso et al., 2004; Wang & Guan, 2005; Zeng et al., 2007; Pal et al., 2006]:
• Overcomes the problem of the different time scales and metric levels of the audio and visual data.
• Avoids the high dimensionality of the concatenated vector that arises with feature-level fusion.
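Decision-level fusion can be illustrated with a weighted product rule over the two classifiers' per-class posteriors (the specific rule and weight are assumptions for this sketch; see Haq, Jackson & Edge, 2008 for the method actually used):

```python
import numpy as np

def fuse_posteriors(p_audio, p_visual, alpha=0.5):
    """Decision-level fusion by a weighted product rule: combine the per-class
    posteriors of the audio and visual classifiers in the log domain, then
    return the index of the winning class."""
    log_p = alpha * np.log(p_audio) + (1.0 - alpha) * np.log(p_visual)
    return int(np.argmax(log_p))

# Example: the visual classifier's confident vote for class 1 outweighs
# the audio classifier's weaker preference for class 0
winner = fuse_posteriors(np.array([0.6, 0.3, 0.1]), np.array([0.2, 0.7, 0.1]))
```

Because each modality is classified separately before fusion, the audio and visual streams never have to share a frame rate or feature scale.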
Experiments
Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 7 emotion classes.
Results
Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 4 emotion classes.
Results
7 Emotion Classes
Human KL JE JK DC Mean (±CI)
Audio 53.2 67.7 71.2 73.7 66.5 ± 2.5
Visual 89.0 89.8 88.6 84.7 88.0 ± 0.6
Audio-visual 92.1 92.1 91.3 91.7 91.8 ± 0.1
LDA 6 KL JE JK DC Mean (±CI)
Audio 35.0 62.5 56.7 70.8 56.3 ± 6.7
Visual 96.7 92.5 92.5 100 95.4 ± 1.6
Audio-visual 97.5 96.7 95.8 100 97.5 ± 0.8
PCA 10 KL JE JK DC Mean (±CI)
Audio 30.8 51.7 55.0 62.5 50.0 ± 6.0
Visual 91.7 87.5 89.2 98.3 91.7 ± 2.1
Audio-visual 92.5 86.7 94.2 98.3 92.9 ± 2.1
Results
Human KL JE JK DC Mean (±CI)
Audio 63.2 80.9 79.2 82.0 76.3 ± 2.4
Visual 90.6 97.2 90.0 87.4 91.3 ± 1.1
Audio-visual 94.4 98.3 93.5 94.5 95.2 ± 0.6
PCA 7 KL JE JK DC Mean (±CI)
Audio 50.0 58.3 65.8 72.5 61.7 ± 4.3
Visual 80.8 98.3 93.3 88.3 90.2 ± 3.3
Audio-visual 83.3 97.5 95.0 93.3 92.3 ± 2.7
LDA 3 KL JE JK DC Mean (±CI)
Audio 60.0 75.8 58.3 80.0 68.5 ± 4.8
Visual 97.5 100 94.2 100 97.9 ± 1.2
Audio-visual 97.5 100 94.2 100 97.9 ± 1.2
Results
4 Emotion Classes
Conclusions
• For the British English emotional database, a recognition rate comparable to human performance was achieved (speaker-dependent).
• LDA outperformed PCA with the top 40 features.
• Both audio and visual information were useful for emotion recognition.
• Important audio features: energy and MFCCs.
• Important visual features: mean of the marker y-coordinates (vertical movement of the face).
• A likely reason for the system's higher performance: the human evaluators were not given an equal opportunity to exploit the data.

Future Work
• Perform speaker-independent experiments, using more data and other classifiers, such as GMM and SVM.
Summary
Ashraf, A.B., Lucey, S., Cohn, J.F., Chen, T., Ambadar, Z., Prkachin, K., Solomon, P. & Theobald, B.J. (2007). The Painful Face: Pain Expression Recognition Using Active Appearance Models, Proc. Ninth ACM Int'l Conf. Multimodal Interfaces, pp. 9-14.
Bartlett, M.S., Littlewort, G., Frank, M. et al. (2005). Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior, Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 568-573.
Borchert, M. & Düsterhöft, A. (2005). Emotions in Speech – Experiments with Prosody and Quality Features in Speech for Use in Categorical and Dimensional Emotion Recognition Environments, Proc. NLPKE, pp. 147–151.
Busso, C. & Narayanan, S.S. (2007). Interrelation between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study. IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331-2347.
Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U. & Narayanan, S. (2004). Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 205-211.
Chen, C. (1978). Pattern Recognition and Signal Processing. Sijthoff & Noordoff, The Netherlands.
Edwards, A. L. (1962). Experimental Design in Psychological Research. New York: Holt, Rinehart and Winston.
3dMD 4D Capture System. Online: http://www.3dmd.com, accessed on 3 May, 2009.
References
Haq, S., Jackson, P.J.B. & Edge, J. (2008). Audio-visual feature selection and reduction for emotion classification, Proc. Auditory-Visual Speech Processing, pp. 185-190.
Lin, Y. & Wei, G. (2005). Speech Emotion Recognition Based on HMM and SVM, Proc. 4th Int’l Conf. on Mach. Learn. and Cybernetics, pp. 4898-4901.
Luengo, I., Navas, E. et al. (2005). Automatic Emotion Recognition using Prosodic Parameters, Proc. Interspeech, pp. 493-496.
Pal, P., Iyer, A.N. & Yantorno, R.E. (2006). Emotion Detection from Infant Facial Expressions and Cries, Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, vol. 2, pp.721-724.
Schuller, B., Muller, R., Hornler, B. et al. (2007). Audiovisual Recognition of Spontaneous Interest within Conversations, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 30-37.
Song, M., Bu, J., Chen, C. & Li, N. (2004). Audio-Visual-Based Emotion Recognition: A New Approach, Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 1020-1025.
Valstar, M.F., Gunes, H. & Pantic, M. (2007). How to Distinguish Posed from Spontaneous Smiles Using Geometric Features, Proc. ACM Int’l Conf. Multimodal Interfaces, pp. 38-45.
Wang, Y. & Guan, L. (2005). Recognizing Human Emotion from Audiovisual Information, Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 1125-1128.
Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D. & Levinson, S. (2007). Audio-Visual Affect Recognition. IEEE Trans. Multimedia, vol. 9, no. 2, pp. 424-428.
References
Thank you for your attention!