
Page 1

Speaker-Dependent Audio-Visual Emotion Recognition

Sanaul Haq and Philip J.B. Jackson

Centre for Vision, Speech and Signal Processing, University of Surrey, UK


Outline

Introduction

Method

Audio-visual experiments

Summary

Page 2

Overview

Introduction

Method

Audio-visual experiments

Summary


Page 3

Introduction


Human-to-human interaction: emotions play an important role by allowing people to express themselves beyond the verbal domain.

Message: linguistic information (the textual content of a message) and paralinguistic information (body language, facial expressions, pitch and tone of voice, health and identity) [Busso et al., 2007].

Audio-based emotion recognition: Borchert et al., 2005; Lin and Wei, 2005; Luengo et al., 2005.

Visual-based emotion recognition: Ashraf et al., 2007; Bartlett et al., 2005; Valstar et al., 2007.

Audio-visual emotion recognition: Busso et al., 2004; Schuller et al., 2007; Song et al., 2004.

Page 4

Method

Recording of audio-visual database

Feature extraction

Feature selection

Feature reduction

Classification


Page 5

Audio-visual database

• Four English male actors recorded in 7 emotions: anger, disgust, fear, happiness, neutral, sadness and surprise.

• 15 phonetically-balanced TIMIT sentences per emotion: 3 common, 2 emotion-specific and 10 generic sentences.

• Size of database: 480 sentences, 120 utterances per subject.

• For visual features, the face was painted with 60 markers.

• 3dMD dynamic face capture system [3dMD capture system, 2009].


Figure: Subjects with expressions (from left): Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).

Page 6

Feature extraction

Audio features:
• Pitch features: min., max., mean and standard deviation of Mel frequency and its first-order difference.
• Energy features: min., max., mean, standard deviation and range of energy in the speech signal and in the frequency bands 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, 2-4 kHz and 4-8 kHz.
• Duration features: voiced speech, unvoiced speech, sentence duration, voiced-to-unvoiced speech duration ratio, speech rate, voiced-speech-to-sentence duration ratio.
• Spectral features: mean and standard deviation of 12 MFCCs (a brief code sketch of these statistics follows below).

Visual features:
• Mean and standard deviation of 2D facial marker coordinates (60 markers).
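As a rough illustration of how such utterance-level statistics can be derived from frame-level measurements, the sketch below uses the librosa library for the pitch, energy and MFCC tracks; the function name, frame settings and pitch range are illustrative assumptions, and the duration and band-limited energy features are omitted for brevity.

```python
# Sketch of utterance-level audio feature statistics (illustrative, not the
# authors' exact pipeline). Assumes a recent librosa for pyin/rms/mfcc.
import numpy as np
import librosa

def utterance_stats(wav_path, n_mfcc=12):
    y, sr = librosa.load(wav_path, sr=None)

    # Pitch track via probabilistic YIN; unvoiced frames are NaN.
    f0, voiced_flag, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)
    f0_mel = librosa.hz_to_mel(f0[voiced_flag])   # pitch on the mel scale
    d_f0_mel = np.diff(f0_mel)                    # first-order difference

    # Frame energies (short-time RMS) over the whole band.
    rms = librosa.feature.rms(y=y)[0]

    # 12 MFCCs per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    def stats(x):
        return [np.min(x), np.max(x), np.mean(x), np.std(x)]

    feats = []
    feats += stats(f0_mel) + stats(d_f0_mel)      # pitch features
    feats += stats(rms) + [np.ptp(rms)]           # energy features (+ range)
    for coeff in mfcc:                            # spectral features
        feats += [np.mean(coeff), np.std(coeff)]
    return np.array(feats)
```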


Page 7

Feature extraction

Figure: Audio feature extraction with Speech Filing System software (left), and video data (right) with overlaid tracked marker locations.


Page 8

Feature selection

Plus l-take away r algorithm [Chen, 1978]

• Combines the Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) algorithms.

• Criterion: Bhattacharyya distance (a minimal sketch of the selection loop follows below).
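A minimal sketch of the selection loop under stated assumptions: the helper names, the averaging of the Bhattacharyya distance over class pairs, and the default l = 2, r = 1 are illustrative choices, not details taken from the slides.

```python
# Sketch of plus l-take away r selection with a Bhattacharyya-distance criterion.
import numpy as np
from itertools import combinations

def bhattacharyya(x1, x2):
    """Bhattacharyya distance between Gaussians fitted to x1 and x2 (n x d)."""
    m1, m2 = x1.mean(0), x2.mean(0)
    c1 = np.atleast_2d(np.cov(x1, rowvar=False))
    c2 = np.atleast_2d(np.cov(x2, rowvar=False))
    c = (c1 + c2) / 2
    diff = m1 - m2
    term1 = diff @ np.linalg.solve(c, diff) / 8
    term2 = 0.5 * np.log(np.linalg.det(c) /
                         np.sqrt(np.linalg.det(c1) * np.linalg.det(c2)))
    return term1 + term2

def criterion(X, y, subset):
    """Average pairwise Bhattacharyya distance over all class pairs (assumed)."""
    pairs = combinations(np.unique(y), 2)
    return np.mean([bhattacharyya(X[y == a][:, subset], X[y == b][:, subset])
                    for a, b in pairs])

def plus_l_take_away_r(X, y, n_select, l=2, r=1):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        for _ in range(l):   # SFS step: add the feature that helps most
            best = max(remaining, key=lambda f: criterion(X, y, selected + [f]))
            selected.append(best); remaining.remove(best)
        for _ in range(r):   # SBS step: drop the least significant feature
            worst = max(selected,
                        key=lambda f: criterion(X, y, [s for s in selected if s != f]))
            selected.remove(worst); remaining.append(worst)
    return selected[:n_select]
```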


Page 9

Feature reduction

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)


Figure: Illustration of PCA and LDA projections in terms of within-class and between-class distance (a brief code sketch of both reductions follows below).
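A brief sketch of both reductions using scikit-learn; reading the "PCA 10" and "LDA 6" labels in the results tables as the number of retained components is an assumption here.

```python
# Sketch of the two feature-reduction options with scikit-learn.
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_features(X_train, y_train, X_test, method="lda", n_components=6):
    if method == "pca":
        # PCA: unsupervised, keeps directions of maximum overall variance.
        reducer = PCA(n_components=n_components).fit(X_train)
    else:
        # LDA: supervised, maximizes between-class relative to within-class
        # scatter; at most (number of classes - 1) dimensions, e.g. 6 for 7 emotions.
        reducer = LinearDiscriminantAnalysis(n_components=n_components).fit(X_train, y_train)
    return reducer.transform(X_train), reducer.transform(X_test)
```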

Page 10

Classification

Gaussian classifier: a Gaussian classifier uses Bayes decision theory, where the class-conditional probability density p(x | ω_i) is assumed to have a Gaussian distribution for each class ω_i. The Bayes decision rule is to assign an observation x to the class with the largest posterior probability:

decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i, with P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x),


where P(ω_i | x) is the posterior probability and P(ω_i) is the prior class probability.
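A minimal sketch of such a classifier is given below, assuming one full-covariance Gaussian fitted per class and priors estimated from class frequencies; it illustrates the decision rule above rather than reproducing the authors' implementation.

```python
# Sketch of a Gaussian (Bayes) classifier: one Gaussian per class, decision by
# the largest posterior. With few training samples the covariance may need
# regularization; omitted here for brevity.
import numpy as np
from scipy.stats import multivariate_normal

class GaussianClassifier:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.models, self.log_priors = [], []
        for c in self.classes:
            Xc = X[y == c]
            self.models.append(multivariate_normal(mean=Xc.mean(0),
                                                    cov=np.cov(Xc, rowvar=False)))
            self.log_priors.append(np.log(len(Xc) / len(X)))   # prior P(class)
        return self

    def predict(self, X):
        # log posterior ∝ log p(x | class) + log P(class); p(x) is common to all classes
        scores = np.column_stack([m.logpdf(X) + lp
                                  for m, lp in zip(self.models, self.log_priors)])
        return self.classes[np.argmax(scores, axis=1)]
```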

Page 11

Human evaluation experiments

• Human recognition results provide a useful benchmark for the audio, visual and audio-visual scenarios.

• 10 subjects: 5 native speakers, 5 non-native; half of the evaluators were female.

• 10 different sets of the data were created using the Balanced Latin Square method [Edwards, 1962] (one standard construction is sketched below).

• Training: three facial expression pictures, two audio files and a short movie clip for each emotion.

• Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).
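One standard construction of a balanced Latin square for an even number of conditions (the Williams design) is sketched below; whether Edwards' procedure was applied in exactly this form to the 10 stimulus sets is an assumption.

```python
# Sketch of a balanced Latin square (Williams design) for an even number of
# conditions, e.g. 10 presentation orders of 10 stimulus sets.
def balanced_latin_square(n):
    # First row interleaves from both ends: 0, 1, n-1, 2, n-2, ...
    first = [0]
    lo, hi = 1, n - 1
    while len(first) < n:
        first.append(lo); lo += 1
        if len(first) < n:
            first.append(hi); hi -= 1
    # Each later row shifts every entry by the row index (mod n), so every
    # condition appears once per position and follows every other condition once.
    return [[(x + r) % n for x in first] for r in range(n)]

orders = balanced_latin_square(10)   # 10 orderings of conditions 0..9
```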


Page 12

Experiments

• Audio
• Visual
• Audio-visual

Decision-level fusion (a minimal sketch follows below):

• Overcomes the problem of the different time scales and metric levels of the audio and visual data.

• Avoids the high dimensionality of the concatenated vector that results in the case of feature-level fusion.

• Haq, Jackson & Edge, 2008; Busso et al., 2004; Wang et al., 2005; Zeng et al., 2007; Pal et al., 2006.
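A minimal sketch of decision-level fusion, assuming each single-modality classifier outputs class posteriors and that the two streams are combined by a weighted sum; the weight and the helper name are illustrative, not taken from the slides.

```python
# Sketch of decision-level fusion of audio and visual classifier outputs.
# Weighted-sum combination and the 0.5 default weight are assumptions.
import numpy as np

def fuse_decisions(audio_post, visual_post, w_audio=0.5):
    """audio_post, visual_post: (n_utterances, n_classes) posterior matrices
    from separately trained audio and visual classifiers."""
    fused = w_audio * audio_post + (1.0 - w_audio) * visual_post
    return np.argmax(fused, axis=1)   # fused class decision per utterance
```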


Page 13

Results

Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 7 emotion classes.


Page 14

Results

Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 4 emotion classes.


Page 15

Results

Table: Recognition rate (%) per actor (KL, JE, JK, DC) and mean (±CI) for 7 emotion classes.

                KL     JE     JK     DC     Mean (±CI)
Human
  Audio         53.2   67.7   71.2   73.7   66.5 ± 2.5
  Visual        89.0   89.8   88.6   84.7   88.0 ± 0.6
  Audio-visual  92.1   92.1   91.3   91.7   91.8 ± 0.1
LDA 6
  Audio         35.0   62.5   56.7   70.8   56.3 ± 6.7
  Visual        96.7   92.5   92.5   100    95.4 ± 1.6
  Audio-visual  97.5   96.7   95.8   100    97.5 ± 0.8
PCA 10
  Audio         30.8   51.7   55.0   62.5   50.0 ± 6.0
  Visual        91.7   87.5   89.2   98.3   91.7 ± 2.1
  Audio-visual  92.5   86.7   94.2   98.3   92.9 ± 2.1


Page 16

Results

Table: Recognition rate (%) per actor (KL, JE, JK, DC) and mean (±CI) for 4 emotion classes.

                KL     JE     JK     DC     Mean (±CI)
Human
  Audio         63.2   80.9   79.2   82.0   76.3 ± 2.4
  Visual        90.6   97.2   90.0   87.4   91.3 ± 1.1
  Audio-visual  94.4   98.3   93.5   94.5   95.2 ± 0.6
PCA 7
  Audio         50.0   58.3   65.8   72.5   61.7 ± 4.3
  Visual        80.8   98.3   93.3   88.3   90.2 ± 3.3
  Audio-visual  83.3   97.5   95.0   93.3   92.3 ± 2.7
LDA 3
  Audio         60.0   75.8   58.3   80.0   68.5 ± 4.8
  Visual        97.5   100    94.2   100    97.9 ± 1.2
  Audio-visual  97.5   100    94.2   100    97.9 ± 1.2



Page 17

Summary

Conclusions
• For the British English emotional database, a recognition rate comparable to human performance was achieved (speaker-dependent).
• LDA outperformed PCA with the top 40 features.
• Both audio and visual information were useful for emotion recognition.
• Important audio features: energy and MFCCs.
• Important visual features: mean of the marker y-coordinates (vertical movement of the face).
• Reason for the higher performance of the system: the evaluators were not given equal opportunity to exploit the data.

Future work
• To perform speaker-independent experiments, using more data and other classifiers, such as GMM and SVM.


Page 18

References

Ashraf, A.B., Lucey, S., Cohn, J.F., Chen, T., Ambadar, Z., Prkachin, K., Solomon, P. & Theobald, B.J. (2007). The Painful Face: Pain Expression Recognition Using Active Appearance Models, Proc. Ninth ACM Int'l Conf. Multimodal Interfaces, pp. 9-14.

Bartlett, M.S., Littlewort, G., Frank, M. et al. (2005). Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior, Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 568-573.

Borchert, M. & Düsterhöft, A. (2005). Emotions in Speech – Experiments with Prosody and Quality Features in Speech for Use in Categorical and Dimensional Emotion Recognition Environments, Proc. NLPKE, pp. 147–151.

Busso, C. & Narayanan, S.S. (2007). Interrelation between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study. IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331-2347.

Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U. & Narayanan, S. (2004). Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 205-211.

Chen, C. (1978). Pattern Recognition and Signal Processing. Sijthoff & Noordoff, The Netherlands.

Edwards, A. L. (1962). Experimental Design in Psychological Research. New York: Holt, Rinehart and Winston.

3dMD 4D Capture System. Online: http://www.3dmd.com, accessed on 3 May, 2009.



Page 19

References (continued)

Haq, S., Jackson, P.J.B. & Edge, J. (2008). Audio-visual feature selection and reduction for emotion classification, Proc. Auditory-Visual Speech Processing, pp. 185-190.

Lin, Y. & Wei, G. (2005). Speech Emotion Recognition Based on HMM and SVM, Proc. 4th Int’l Conf. on Mach. Learn. and Cybernetics, pp. 4898-4901.

Luengo, I., Navas, E. et al. (2005). Automatic Emotion Recognition using Prosodic Parameters, Proc. Interspeech, pp. 493-496.

Pal, P., Iyer, A.N. & Yantorno, R.E. (2006). Emotion Detection from Infant Facial Expressions and Cries, Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, vol. 2, pp.721-724.

Schuller, B., Muller, R., Hornler, B. et al. (2007). Audiovisual Recognition of Spontaneous Interest within Conversations, Proc. ACM Int'l Conf. Multimodal Interfaces, pp. 30-37.

Song, M., Bu, J., Chen, C. & Li, N. (2004). Audio-Visual-Based Emotion Recognition: A New Approach, Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 1020-1025.

Valstar, M.F., Gunes, H. & Pantic, M. (2007). How to Distinguish Posed from Spontaneous Smiles Using Geometric Features, Proc. ACM Int’l Conf. Multimodal Interfaces, pp. 38-45.

Wang, Y. & Guan, L. (2005). Recognizing Human Emotion from Audiovisual Information, Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 1125-1128.

Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D. & Levinson, S. (2007). Audio-Visual Affect Recognition. IEEE Trans. Multimedia, vol. 9, no. 2, pp. 424-428.



Page 20

Thank you for your attention!
