
MediaEval 2015 - UMons at Affective Impact of Movies Task - Poster



Université de Mons

UMons at Affective Impact of Movies Task

Gueorgui Pironkov | TCTS Lab, UMons | [email protected]

Omar Seddati, Emre Kulah*, Gueorgui Pironkov, Stéphane Dupont, Saïd Mahmoudi, Thierry Dutoit
University of Mons, Mons, Belgium / *Middle East Technical University, Ankara, Turkey

Introduction

In this work, we propose a solution for both violence detection and affective impact evaluation. We investigate different visual and auditory feature extraction methods: an i-vector approach is applied to the audio, while dense optical flow maps processed through a deep convolutional neural network are tested for the video. Classifiers based on probabilistic linear discriminant analysis and fully connected feed-forward neural networks are then used.

Approach

Our system is based on:

• i-vector: a low-dimensional feature vector extracted from high-dimensional data without losing too much of the relevant acoustic information
• ConvNets: a state-of-the-art technique in the field of object recognition within images

Instead of applying ConvNets to 2D images (frames), we extract dense optical flow maps that represent the displacement of each pixel between two successive frames, and use a sequence of those maps as input for our ConvNets (a minimal sketch follows below). In this way, local temporal information is projected onto a space similar to the pixel space, and ConvNets can be effectively used for dynamic information.
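As an illustration of this idea, the minimal sketch below (not the exact pipeline of our runs) computes dense optical flow maps between successive frames with OpenCV's Farnebäck method and stacks them into a multi-channel ConvNet input. The stack length and the Farnebäck parameters are illustrative assumptions.

# Hedged sketch: stack dense optical flow maps as ConvNet input channels.
# Assumes OpenCV (cv2) and NumPy; "frames" is a list of BGR video frames.
import cv2
import numpy as np

def optical_flow_stack(frames, stack_len=10):
    """Return a (2*stack_len, H, W) array of x/y pixel displacements
    between successive frames, usable as ConvNet input channels."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames[:stack_len + 1]]
    channels = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        # Dense optical flow: one (dx, dy) displacement per pixel.
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        channels.append(flow[..., 0])  # horizontal displacement map
        channels.append(flow[..., 1])  # vertical displacement map
    return np.stack(channels, axis=0).astype(np.float32)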

Results

Violence detection subtask:

Run                          MAP (%)
i-vector – pLDA              9.56
OFM – ConvNets               9.67
OFM – ConvNets – HMDB-51     6.56

Induced affect detection subtask:

Run                          Valence (%)   Arousal (%)
i-vector – pLDA              37.03         31.71
OFM – ConvNets               35.28         44.39
OFM – ConvNets – HMDB-51     37.28         52.44

We have performed three runs for both subtasks:

1. Run 1: we use ConvNets trained from scratch on the dense optical flow maps extracted from the MediaEval dataset.
2. Run 2: we train a ConvNet on the HMDB-51 benchmark. We then use the convolutional layers of this network as a feature extractor and train a multi-layer perceptron on those features.
3. Run 3: for each video, we extract 20 Mel-frequency cepstral coefficients (MFCC) and the associated first and second derivatives (see the audio sketch after this list). For each shot, a 100-dimensional i-vector is extracted. All the i-vectors are then processed through three independent classifiers (one per subtask).
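As a rough illustration of the Run 3 audio front end, the sketch below extracts 20 MFCCs plus their first and second derivatives with librosa (an assumed toolkit choice, not necessarily the one used in this work). The subsequent 100-dimensional i-vector extraction and the pLDA classifiers are omitted, as they typically rely on a dedicated speaker-recognition toolkit.

# Hedged sketch of the Run 3 audio features: 20 MFCCs plus first and second
# derivatives, giving 60-dimensional frames per shot; the i-vector step is omitted.
import librosa
import numpy as np

def mfcc_with_deltas(audio_path, n_mfcc=20):
    signal, sr = librosa.load(audio_path, sr=None)               # keep native sample rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # shape: (20, T)
    delta1 = librosa.feature.delta(mfcc, order=1)                # first derivative
    delta2 = librosa.feature.delta(mfcc, order=2)                # second derivative
    return np.vstack([mfcc, delta1, delta2])                     # shape: (60, T)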

Conclusion

In this work, the visual and audio features are processed separately. Both feature types give similar results for violence detection and for valence. For arousal, the video features are clearly more effective, especially when the ConvNets feature extractor is trained on external data. We investigated merging the audio and visual features using a neural network; however, the results were poorer than when using the features separately. Our future work will focus on merging the audio and video features.
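For concreteness, an early-fusion experiment of the kind mentioned above could look like the sketch below: the per-shot audio i-vector and the ConvNet visual features are concatenated, and a single MLP is trained on the joint vector. The dimensions, library choice (scikit-learn), and hyper-parameters are illustrative assumptions, not the settings used in our experiments.

# Hypothetical early-fusion baseline: concatenate audio i-vectors and
# ConvNet visual features, then train one MLP on the fused vectors.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_fusion_mlp(audio_ivectors, video_features, labels):
    # audio_ivectors: (N, 100), video_features: (N, D), labels: (N,)
    fused = np.hstack([audio_ivectors, video_features])
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)
    clf.fit(fused, labels)
    return clf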

[System overview diagram: Video → Optical flow → ConvNets → MLP classifier; Audio → Shot → MFCC → i-vector → pLDA]