Speaker-Dependent Audio-Visual Emotion Recognition
Sanaul Haq and Philip J.B. Jackson
Centre for Vision, Speech and Signal Processing, University of Surrey, UK
Outline
Introduction
Method
Audio-visual experiments
Summary
Overview
Introduction
In human-to-human interaction, emotions play an important role by allowing people to express themselves beyond the verbal domain.

A message carries two kinds of information [Busso & Narayanan, 2007]:
• Linguistic information: the textual content of the message.
• Paralinguistic information: body language, facial expressions, pitch and tone of voice, health and identity.

Prior work:
• Audio-based emotion recognition: Borchert & Düsterhöft, 2005; Lin & Wei, 2005; Luengo et al., 2005.
• Visual-based emotion recognition: Ashraf et al., 2007; Bartlett et al., 2005; Valstar et al., 2007.
• Audio-visual emotion recognition: Busso et al., 2004; Schuller et al., 2007; Song et al., 2004.
• Recording of audio-visual database
• Feature extraction
• Feature selection
• Feature reduction
• Classification
Method
• Four English male actors recorded in 7 emotions: anger, disgust, fear, happiness, neutral, sadness and surprise.
• 15 phonetically-balanced TIMIT sentences per emotion: 3 common, 2 emotion-specific and 10 generic sentences.
• Size of database: 480 sentences, 120 utterances per subject.
• For visual features, the face was painted with 60 markers.
• 3dMD dynamic face capture system [3dMD 4D Capture System, 2009].
Audio-visual database
Figure: Subjects with expressions (from left): Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).
Audio features:
• Pitch features: min., max., mean and standard deviation of Mel frequency and its first-order difference.
• Energy features: min., max., mean, standard deviation and range of energies in the speech signal and in the frequency bands 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, 2-4 kHz and 4-8 kHz.
• Duration features: voiced speech, unvoiced speech and sentence durations, voiced-to-unvoiced speech duration ratio, speech rate, voiced-speech-to-sentence duration ratio.
• Spectral features: mean and standard deviation of 12 MFCCs.

Visual features:
• Mean and standard deviation of 2D facial marker coordinates (60 markers).
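The authors extracted these features with the Speech Filing System; as a rough sketch of the idea only (not their pipeline), utterance-level contour statistics and band energies can be computed with NumPy. The function names and the whole-utterance FFT are illustrative assumptions:

```python
import numpy as np

def contour_stats(x):
    """Utterance-level statistics of a frame-wise contour (e.g. pitch or energy)."""
    return {"min": float(np.min(x)), "max": float(np.max(x)),
            "mean": float(np.mean(x)), "std": float(np.std(x))}

def band_energies(signal, sr,
                  bands=((0, 500), (500, 1000), (1000, 2000), (2000, 4000), (4000, 8000))):
    """Energy in each frequency band, from a whole-utterance power spectrum."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return [float(power[(freqs >= lo) & (freqs < hi)].sum()) for lo, hi in bands]

# Example: one second of a 300 Hz tone sampled at 16 kHz, so essentially
# all energy should land in the 0-0.5 kHz band
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 300 * t)
stats = contour_stats(sig)
energies = band_energies(sig, sr)
```

In the actual system such statistics would be computed per utterance from tracked pitch and energy contours rather than from the raw waveform as here.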
Feature extraction
Figure: Audio feature extraction with Speech Filing System software (left), and video data (right) with overlaid tracked marker locations.
Feature extraction
Plus l-Take Away r algorithm [Chen, 1978]
• Combines both Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) algorithms.
• Criterion: Bhattacharyya distance
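A minimal two-class sketch of the forward-selection half of this procedure under the Bhattacharyya criterion (the paper applies the full Plus l-Take Away r procedure over seven classes; the two-class restriction, function names and closed-form Gaussian distance here are simplifying assumptions):

```python
import numpy as np

def bhattacharyya(X1, X2):
    """Bhattacharyya distance between two classes, each modelled as a Gaussian."""
    n = X1.shape[1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False).reshape(n, n)
    S2 = np.cov(X2, rowvar=False).reshape(n, n)
    S = (S1 + S2) / 2.0
    d = m1 - m2
    mean_term = 0.125 * d @ np.linalg.solve(S, d)
    cov_term = 0.5 * np.log(np.linalg.det(S) /
                            np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return mean_term + cov_term

def sfs(X1, X2, k):
    """Sequential Forward Selection: greedily add the feature whose inclusion
    most increases the Bhattacharyya distance between the classes."""
    selected, remaining = [], list(range(X1.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining,
                   key=lambda f: bhattacharyya(X1[:, selected + [f]],
                                               X2[:, selected + [f]]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: only feature 2 separates the two synthetic classes,
# so SFS should pick it first
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, (200, 4))
X2 = rng.normal(0.0, 1.0, (200, 4))
X2[:, 2] += 5.0
chosen = sfs(X1, X2, 2)
```

Plus l-Take Away r alternates l such forward steps with r backward (SBS) steps, which lets the search undo earlier greedy choices.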
Feature selection
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
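A sketch of both reductions in NumPy (illustrative, not the authors' implementation; LDA is shown in its two-class Fisher form, whereas the paper's task is multi-class):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (directions of maximum
    overall variance), via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]   # largest variance first
    return Xc @ eigvecs[:, order]

def lda_direction(X1, X2):
    """Two-class Fisher LDA direction w = Sw^-1 (m1 - m2): maximizes
    between-class distance relative to within-class scatter."""
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    return np.linalg.solve(Sw, X1.mean(axis=0) - X2.mean(axis=0))

# Example: PCA recovers the dominant variance direction first,
# and LDA points along the axis that separates the class means
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
X[:, 1] *= 10.0
Z = pca(X, 2)
X1 = rng.normal(0.0, 1.0, (200, 2))
X2 = rng.normal(0.0, 1.0, (200, 2)) + np.array([3.0, 0.0])
w = lda_direction(X1, X2)
```

The key contrast: PCA is unsupervised and keeps overall variance, while LDA uses class labels to keep discriminative directions.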
Feature reduction
Figure: Illustration of feature reduction: PCA maximizes overall variance, whereas LDA maximizes between-class distance relative to within-class distance.
Gaussian classifier
A Gaussian classifier applies Bayes decision theory, with the class-conditional probability density p(x|ω_i) assumed to be Gaussian for each class ω_i. The Bayes decision rule assigns an observation x to the class with the highest posterior probability: decide ω_i if p(x|ω_i) P(ω_i) > p(x|ω_j) P(ω_j) for all j ≠ i.
Classification
where P(ω_i|x) ∝ p(x|ω_i) P(ω_i) is the posterior probability of class ω_i and P(ω_i) is the prior class probability.
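This decision rule can be sketched as a maximum a posteriori classifier with one full-covariance Gaussian per class (an illustrative implementation, not the authors' code):

```python
import numpy as np

class GaussianClassifier:
    """Bayes decision rule with one full-covariance Gaussian per class:
    predict the class maximizing p(x|class) * P(class)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = {}
        for c in self.classes:
            Xc = X[y == c]
            self.params[c] = (Xc.mean(axis=0),
                              np.cov(Xc, rowvar=False),
                              len(Xc) / len(X))   # mean, covariance, prior
        return self

    def _score(self, x, c):
        # log p(x|c) + log P(c), dropping class-independent constants
        mean, cov, prior = self.params[c]
        d = x - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (d @ np.linalg.solve(cov, d) + logdet) + np.log(prior)

    def predict(self, X):
        return np.array([max(self.classes, key=lambda c: self._score(x, c))
                         for x in X])

# Example: two well-separated synthetic classes are recovered almost perfectly
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(5.0, 1.0, (150, 2))])
y = np.array([0] * 150 + [1] * 150)
preds = GaussianClassifier().fit(X, y).predict(X)
```

Working in the log domain avoids underflow and makes the arg max identical to comparing p(x|ω_i) P(ω_i) directly.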
• Human recognition results provide a useful benchmark for the audio, visual and audio-visual scenarios.
• 10 evaluators: 5 native speakers, 5 non-native; half of the evaluators were female.
• 10 different sets of the data were created using the Balanced Latin Square method [Edwards, 1962].
• Training material per emotion: three facial expression pictures, two audio files, and a short movie clip.
• Emotion groups: Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).
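One standard recipe for a balanced Latin square with an even number of conditions is the Williams design (the study's exact construction is not specified here; the function below is an illustrative sketch):

```python
def balanced_latin_square(n):
    """Balanced Latin square for an even number of conditions (Williams design):
    each condition appears once per row and per column, and every ordered pair
    of conditions appears adjacently equally often across rows."""
    if n % 2 != 0:
        raise ValueError("this construction requires an even number of conditions")
    first, left, right = [0], 1, n - 1
    while len(first) < n:              # build the seed row 0, 1, n-1, 2, n-2, ...
        first.append(left)
        left += 1
        if len(first) < n:
            first.append(right)
            right -= 1
    return [[(v + i) % n for v in first] for i in range(n)]

# Example: 10 presentation orders for 10 stimulus sets
orders = balanced_latin_square(10)
```

Such a design counterbalances order effects: no stimulus set systematically precedes another across evaluators.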
Human evaluation experiments
• Audio
• Visual
• Audio-visual

Decision-level fusion [Haq, Jackson & Edge, 2008; Busso et al., 2004; Wang & Guan, 2005; Zeng et al., 2007; Pal et al., 2006]:
• Overcomes the problem of the different time scales and metric levels of the audio and visual data.
• Avoids the high dimensionality of the concatenated vector that arises with feature-level fusion.
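Decision-level fusion can be illustrated with a weighted product rule over the two classifiers' per-class posteriors (the specific rule and weight are assumptions for this sketch; see Haq, Jackson & Edge, 2008 for the method actually used):

```python
import numpy as np

def fuse_posteriors(p_audio, p_visual, alpha=0.5):
    """Decision-level fusion by a weighted product rule: combine the per-class
    posteriors of the audio and visual classifiers in the log domain, then
    return the index of the winning class."""
    log_p = alpha * np.log(p_audio) + (1.0 - alpha) * np.log(p_visual)
    return int(np.argmax(log_p))

# Example: the visual classifier's confident vote for class 1 outweighs
# the audio classifier's weaker preference for class 0
winner = fuse_posteriors(np.array([0.6, 0.3, 0.1]), np.array([0.2, 0.7, 0.1]))
```

Because each modality is classified separately before fusion, the audio and visual streams never have to share a frame rate or feature scale.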
Experiments
Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 7 emotion classes.
Results
Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 4 emotion classes.
Results
7 Emotion Classes
Human KL JE JK DC Mean (±CI)
Audio 53.2 67.7 71.2 73.7 66.5 ± 2.5
Visual 89.0 89.8 88.6 84.7 88.0 ± 0.6
Audio-visual 92.1 92.1 91.3 91.7 91.8 ± 0.1
LDA 6 KL JE JK DC Mean (±CI)
Audio 35.0 62.5 56.7 70.8 56.3 ± 6.7
Visual 96.7 92.5 92.5 100 95.4 ± 1.6
Audio-visual 97.5 96.7 95.8 100 97.5 ± 0.8
PCA 10 KL JE JK DC Mean (±CI)
Audio 30.8 51.7 55.0 62.5 50.0 ± 6.0
Visual 91.7 87.5 89.2 98.3 91.7 ± 2.1
Audio-visual 92.5 86.7 94.2 98.3 92.9 ± 2.1
Results
Human KL JE JK DC Mean (±CI)
Audio 63.2 80.9 79.2 82.0 76.3 ± 2.4
Visual 90.6 97.2 90.0 87.4 91.3 ± 1.1
Audio-visual 94.4 98.3 93.5 94.5 95.2 ± 0.6
PCA 7 KL JE JK DC Mean (±CI)
Audio 50.0 58.3 65.8 72.5 61.7 ± 4.3
Visual 80.8 98.3 93.3 88.3 90.2 ± 3.3
Audio-visual 83.3 97.5 95.0 93.3 92.3 ± 2.7
LDA 3 KL JE JK DC Mean (±CI)
Audio 60.0 75.8 58.3 80.0 68.5 ± 4.8
Visual 97.5 100 94.2 100 97.9 ± 1.2
Audio-visual 97.5 100 94.2 100 97.9 ± 1.2
Results
4 Emotion Classes
Conclusions
• For the British English emotional database, a recognition rate comparable to human performance was achieved (speaker-dependent).
• LDA outperformed PCA with the top 40 features.
• Both audio and visual information were useful for emotion recognition.
• Important audio features: energy and MFCCs.
• Important visual features: mean of the marker y-coordinates (vertical movement of the face).
• A likely reason for the system's higher performance: the human evaluators were not given an equal opportunity to exploit the data.

Future Work
• Perform speaker-independent experiments, using more data and other classifiers, such as GMM and SVM.
Summary
Ashraf, A.B., Lucey, S., Cohn, J.F., Chen, T., Ambadar, Z., Prkachin, K., Solomon, P. & Theobald, B.J. (2007). The Painful Face: Pain Expression Recognition Using Active Appearance Models, Proc. Ninth ACM Int'l Conf. Multimodal Interfaces, pp. 9-14.
Bartlett, M.S., Littlewort, G., Frank, M. et al. (2005). Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior, Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 568-573.
Borchert, M. & Düsterhöft, A. (2005). Emotions in Speech – Experiments with Prosody and Quality Features in Speech for Use in Categorical and Dimensional Emotion Recognition Environments, Proc. NLPKE, pp. 147–151.
Busso, C. & Narayanan, S.S. (2007). Interrelation between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study. IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331-2347.
Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U. & Narayanan, S. (2004). Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 205-211.
Chen, C. (1978). Pattern Recognition and Signal Processing. Sijthoff & Noordoff, The Netherlands.
Edwards, A. L. (1962). Experimental Design in Psychological Research. New York: Holt, Rinehart and Winston.
3dMD 4D Capture System. Online: http://www.3dmd.com, accessed on 3 May, 2009.
References
Haq, S., Jackson, P.J.B. & Edge, J. (2008). Audio-visual feature selection and reduction for emotion classification, Proc. Auditory-Visual Speech Processing, pp. 185-190.
Lin, Y. & Wei, G. (2005). Speech Emotion Recognition Based on HMM and SVM, Proc. 4th Int’l Conf. on Mach. Learn. and Cybernetics, pp. 4898-4901.
Luengo, I., Navas, E. et al. (2005). Automatic Emotion Recognition using Prosodic Parameters, Proc. Interspeech, pp. 493-496.
Pal, P., Iyer, A.N. & Yantorno, R.E. (2006). Emotion Detection from Infant Facial Expressions and Cries, Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, vol. 2, pp.721-724.
Schuller, B., Muller, R., Hornler, B. et al. (2007). Audiovisual Recognition of Spontaneous Interest within Conversations, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 30-37.
Song, M., Bu, J., Chen, C. & Li, N. (2004). Audio-Visual-Based Emotion Recognition: A New Approach, Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 1020-1025.
Valstar, M.F., Gunes, H. & Pantic, M. (2007). How to Distinguish Posed from Spontaneous Smiles Using Geometric Features, Proc. ACM Int’l Conf. Multimodal Interfaces, pp. 38-45.
Wang, Y. & Guan, L. (2005). Recognizing Human Emotion from Audiovisual Information, Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 1125-1128.
Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D. & Levinson, S. (2007). Audio-Visual Affect Recognition. IEEE Trans. Multimedia, vol. 9, no. 2, pp. 424-428.
References
Thank you for your attention!