
Multimodal Information Analysis for Emotion Recognition
(Tele Health Care Application)

Malika Meghjani, Gregory Dudek and Frank P. Ferrie

Contents

1. Our Goal

2. Motivation

3. Proposed Approach

4. Results

5. Conclusion

Our Goal

• Automatic emotion recognition using audio-visual information analysis.

• Create video summaries by automatically labeling the emotions in a video sequence.

Motivation

• Map the emotional states of the patient to nursing interventions.
• Evaluate the role of nursing interventions in improving the patient's health.


Proposed Approach

(System diagram) Two parallel pipelines: Visual Feature Extraction → Visual-based Emotion Classification, and Audio Feature Extraction → Audio-based Emotion Classification. The outputs of the two classifiers are combined by Data Fusion (Decision Level Fusion) to produce the Recognized Emotional State.

Visual Analysis

1. Face Detection (Viola-Jones face detector using Haar wavelets and a boosting algorithm)

2. Feature Extraction (Gabor filters: 5 spatial frequencies and 4 orientations, giving 20 filter responses per frame; a sketch follows below)

3. Feature Selection (select the most discriminative features across all emotional classes)

4. SVM Classification (classification with probability estimates)
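
A minimal sketch of steps 1 and 2, assuming OpenCV; the cascade file, face crop size, and Gabor parameters (kernel size, sigma, wavelengths, gamma) are illustrative assumptions rather than the authors' settings.

```python
import cv2
import numpy as np

# Viola-Jones detector shipped with OpenCV (Haar cascade).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_gabor_features(frame_gray):
    # Step 1: detect the face and crop it to a fixed size.
    faces = face_cascade.detectMultiScale(frame_gray, 1.3, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    face = cv2.resize(frame_gray[y:y + h, x:x + w], (48, 48))
    # Step 2: 5 spatial frequencies x 4 orientations = 20 Gabor responses.
    responses = []
    for wavelength in (4, 6, 8, 12, 16):              # 5 spatial frequencies
        for theta in np.arange(0, np.pi, np.pi / 4):  # 4 orientations
            kernel = cv2.getGaborKernel((9, 9), 2.0, theta, wavelength, 0.5)
            responses.append(cv2.filter2D(face, cv2.CV_32F, kernel))
    return np.stack(responses).ravel()
```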

Visual Feature Extraction

(Diagram) The detected face image is convolved with a bank of frequency-domain Gabor filters (5 frequencies × 4 orientations); the filter responses pass through feature selection and on to automatic emotion classification (a classifier sketch follows below).
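
Both pipelines end with an SVM that yields probability estimates, which the fusion stage uses later. A hedged sketch assuming scikit-learn, where probability=True fits Platt scaling on the SVM margins; the kernel and C value are illustrative assumptions.

```python
from sklearn.svm import SVC

def train_emotion_svm(features, labels):
    # probability=True calibrates the SVM margins (Platt scaling) so
    # predict_proba returns per-class probability estimates.
    clf = SVC(kernel="rbf", C=10.0, probability=True)
    return clf.fit(features, labels)

# Usage: probs = train_emotion_svm(X_train, y_train).predict_proba(X_test)
```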

Audio Analysis

1. Audio Pre-Processing (remove leading and trailing edges)

2. Feature Extraction (statistics of pitch and intensity contours, plus Mel Frequency Cepstral Coefficients; a sketch follows below)

3. Feature Normalization (remove inter-speaker variability)

4. SVM Classification (classification with probability estimates)
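
A hedged sketch of steps 1 and 2, assuming librosa; the sample rate, pitch range, and summary statistics are illustrative choices, not taken from the slides.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path):
    # Step 1: load and trim leading/trailing silence.
    y, sr = librosa.load(wav_path, sr=16000)
    y, _ = librosa.effects.trim(y)
    # Step 2: pitch contour (YIN), intensity (RMS energy) contour, and MFCCs.
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Summarize each contour with simple statistics.
    stats = lambda v: [np.mean(v), np.std(v), np.min(v), np.max(v)]
    feats = stats(f0) + stats(rms)
    for coeff in mfcc:                  # one row per cepstral coefficient
        feats += stats(coeff)
    return np.array(feats)
```

Per-speaker z-score normalization of these vectors would then implement step 3.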

Audio Feature Extraction

(Diagram) From the audio signal, speech rate, pitch, intensity, and spectrum-analysis features, i.e. Mel Frequency Cepstral Coefficients (the short-term power spectrum of sound), are extracted and passed through feature selection and SVM classification for automatic emotion classification.

Feature Selection

(Plots: average error vs. count of features selected; distances of data points to the bounding planes)

• The feature selection method is similar to SVM classification (a wrapper method).

• It generates a separating plane by minimizing the weighted sum of distances of misclassified data points to two parallel bounding planes.

• It suppresses as many components of the normal to the separating plane as possible while preserving consistent classification results (a sketch follows below).
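
A hedged approximation of this idea, assuming scikit-learn: an L1-penalized linear SVM drives many components of the separating plane's normal vector to zero, and the surviving components index the selected features. The penalty strength C is an illustrative assumption.

```python
from sklearn.svm import LinearSVC
import numpy as np

def select_features(X, y, C=0.1):
    # The L1 penalty suppresses components of the normal vector w,
    # keeping only features that contribute to the separating plane.
    svm = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    mask = np.any(np.abs(svm.coef_) > 1e-6, axis=0)  # features kept by any class
    return X[:, mask], mask
```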

Data Fusion

1. Decision Level:

• Obtain a probability estimate for each emotional class using the SVM margins.

• The probability estimates from the two modalities are multiplied and re-normalized to give the final decision-level emotion classification.

2. Feature Level:

Concatenate the audio and visual features and repeat the feature selection and SVM classification process (a sketch of both schemes follows below).
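
A minimal sketch of the two fusion schemes, assuming NumPy and per-class probability vectors from the two SVMs (e.g. from predict_proba above); the class names in the usage example are illustrative.

```python
import numpy as np

def decision_level_fusion(p_audio, p_visual):
    # Multiply per-class probabilities from the two modalities,
    # then re-normalize to a valid probability distribution.
    fused = np.asarray(p_audio) * np.asarray(p_visual)
    return fused / fused.sum()

def feature_level_fusion(f_audio, f_visual):
    # Concatenate modality features; feature selection and SVM
    # classification are then rerun on the joint vector.
    return np.concatenate([f_audio, f_visual])

# Illustrative usage: audio favors "angry", vision favors "happy".
p_a = [0.5, 0.2, 0.3]   # [angry, happy, sad] from the audio SVM
p_v = [0.2, 0.6, 0.2]   # same class order from the visual SVM
print(decision_level_fusion(p_a, p_v))  # -> approx. [0.357, 0.429, 0.214]
```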

Database and Training

Database:
1. Visual-only posed database
2. Audio-visual posed database

Training:
• Audio segmentation based on the minimum window required for feature extraction.
• Corresponding visual key-frame extraction within the segmented window (a sketch follows below).
• Image-based training and audio-segment-based training.
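
A hedged sketch of this segmentation and alignment, assuming NumPy; the window length, frame rate, and the choice of the window's middle frame as the key frame are illustrative assumptions.

```python
import numpy as np

def segment_and_align(audio, sr, frames, fps, win_sec=0.5):
    """Split audio into fixed minimum-length windows and pick one
    visual key frame (the window's middle frame) per segment."""
    win = int(win_sec * sr)
    segments, key_frames = [], []
    for start in range(0, len(audio) - win + 1, win):
        segments.append(audio[start:start + win])
        t_mid = (start + win / 2) / sr          # window midpoint (seconds)
        idx = min(int(t_mid * fps), len(frames) - 1)
        key_frames.append(frames[idx])
    return segments, key_frames
```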

Experimental Results

Posed Visual Data Only (CKDB):
• Training examples: 120
• Subjects: 20
• Emotional states: 5 + Neutral
• Recognition rate: 75%
• Validation method: leave-one-subject-out cross validation

Posed Audio-Visual Data (EDB):
• Training examples: 270
• Subjects: 9
• Emotional states: 6
• Recognition rate: 82% (Decision Level Fusion), 76% (Feature Level Fusion)

Time Series Plot

(* Cohn-Kanade Database, posed visual-only database)
(Plot: 75% leave-one-subject-out cross validation results; emotion classes: Surprise, Sad, Angry, Disgust, Happy)

Feature Level Fusion

(* eNTERFACE 2005, posed audio-visual database)

Decision Level Fusion

(* eNTERFACE 2005, posed audio-visual database)

Confusion Matrix

Demo

Conclusion

• Combining the two modalities (audio and visual) improves overall recognition rates by 11% with Decision Level Fusion and by 6% with Feature Level Fusion.
• Emotions where vision wins: Disgust, Happy and Surprise.
• Emotions where audio wins: Anger and Sadness.
• Fear was equally well recognized by the two modalities.
• Automated multimodal emotion recognition is clearly effective.

Things to do…

• Inference based on temporal relations between instantaneous classifications.
• Tests on a natural audio-visual database (ongoing).