Université de Mons
UMons at Affective Impact of Movies Task
Gueorgui Pironkov | TCTS Lab, UMons | [email protected]
Omar Seddati, Emre Kulah*, Gueorgui Pironkov, Stéphane Dupont, Saïd Mahmoudi, Thierry Dutoit University of Mons, Mons, Belgium / *Middle East Technical University, Ankara, Turkey
Introduction
In this work, we propose a solution for both violence detection and affective impact evaluation, investigating different visual and auditory feature extraction methods. An i-vector approach is applied to the audio, while dense optical flow maps processed through a deep convolutional neural network are tested for the video. Classifiers based on probabilistic linear discriminant analysis (pLDA) and fully connected feed-forward neural networks are then used.
Approach

• Our system is based on:
  – i-vector: a low-dimensional feature vector extracted from high-dimensional data without losing too much of the relevant acoustic information;
  – ConvNets: a state-of-the-art technique in the field of object recognition within images.
• Instead of applying ConvNets to 2D images (frames), we extract dense optical flow maps that represent the displacement of each pixel between two successive frames, and use a sequence of those maps as input for our ConvNets. In this way, local temporal information is projected onto a space similar to the pixel space, and ConvNets can be effectively used for dynamic information.
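The poster does not say which dense optical flow estimator was used, so the sketch below illustrates only the underlying idea of per-pixel (here, per-block) displacement between two successive frames, using a deliberately simple brute-force block-matching search. The function name and all parameters (`block`, `search`) are illustrative, not from the original system; a real pipeline would use a proper dense flow algorithm.

```python
import numpy as np

def block_matching_flow(prev, curr, block=8, search=4):
    """Toy displacement estimate: for each block of `prev`, find the shift
    (dy, dx) within +/-`search` pixels that best matches `curr` (minimum SSD).
    Returns an array of shape (H//block, W//block, 2)."""
    H, W = prev.shape
    flow = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(H // block):
        for bx in range(W // block):
            y, x = by * block, bx * block
            ref = prev[y:y + block, x:x + block]
            best_ssd, best_d = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > H or xx + block > W:
                        continue  # candidate block falls outside the frame
                    cand = curr[yy:yy + block, xx:xx + block]
                    ssd = float(np.sum((ref - cand) ** 2))
                    if ssd < best_ssd:
                        best_ssd, best_d = ssd, (dy, dx)
            flow[by, bx] = best_d
    return flow
```

Stacking such displacement maps for several successive frame pairs along a channel axis yields the sequence of flow maps that serves as ConvNet input in the approach above.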
Results
Violence detection subtask:

  Run                        MAP (%)
  i-vector – pLDA            9.56
  OFM – ConvNets             9.67
  OFM – ConvNets – HMDB-51   6.56

Induced affect subtask:

  Run                        Valence (%)   Arousal (%)
  i-vector – pLDA            37.03         31.71
  OFM – ConvNets             35.28         44.39
  OFM – ConvNets – HMDB-51   37.28         52.44
We have performed three runs for both subtasks:
1. Run 1: we use ConvNets trained from scratch on the dense optical flow maps extracted from the MediaEval dataset.
2. Run 2: we train a ConvNet on the HMDB-51 benchmark. Then, we use the convolutional layers of this network as a feature extractor, and we train a multilayer perceptron on those features.
3. Run 3: for each video, we extract 20 Mel-frequency cepstral coefficients and their associated first and second derivatives. For each shot, a 100-dimensional i-vector is extracted. All the i-vectors are then processed through three independent classifiers (one per subtask).
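The first and second derivatives (deltas and delta-deltas) in Run 3 triple the per-frame feature dimension (20 MFCCs → 60 values). A minimal sketch of that step, assuming the MFCC matrix is already computed and using a simple finite-difference approximation (the exact delta window used in the original system is not stated):

```python
import numpy as np

def add_deltas(mfcc):
    """Append first and second temporal derivatives to an MFCC matrix
    of shape (n_frames, n_coeffs), giving (n_frames, 3 * n_coeffs)."""
    d1 = np.gradient(mfcc, axis=0)   # first derivative (delta)
    d2 = np.gradient(d1, axis=0)     # second derivative (delta-delta)
    return np.concatenate([mfcc, d1, d2], axis=1)
```

The resulting 60-dimensional frame vectors are then pooled per shot into a single 100-dimensional i-vector by the i-vector extractor.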
Conclusion

In this work, the visual and audio features are processed separately. Both features give similar results for violence detection and valence. For arousal, the video features are far more effective, especially when the ConvNets feature extractor is trained on external data. We investigated merging the audio and visual features together using a neural network; however, the results were poorer than using the features separately. Our future work will focus on merging the audio and video features.
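The feature-level fusion experiment mentioned above can be sketched as concatenating the audio and video feature vectors and feeding them to a small MLP. All dimensions and weights below are hypothetical placeholders (the poster gives only the 100-d i-vector size; the 512-d video feature and the network layout are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-shot features: 100-d i-vector (audio) + 512-d ConvNet feature (video).
audio_feat = rng.standard_normal(100)
video_feat = rng.standard_normal(512)

# Feature-level fusion: concatenate both modalities into one input vector.
fused = np.concatenate([audio_feat, video_feat])   # shape (612,)

# One-hidden-layer MLP forward pass (random weights, for illustration only).
W1 = rng.standard_normal((612, 64)) * 0.05
b1 = np.zeros(64)
W2 = rng.standard_normal((64, 2)) * 0.05
b2 = np.zeros(2)

h = np.maximum(0.0, fused @ W1 + b1)               # ReLU hidden layer
logits = h @ W2 + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # softmax over 2 classes
```

In contrast, the runs reported above keep the modalities separate, with one classifier per feature stream.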
[System diagram: each shot is split into video and audio streams. Video: optical flow → ConvNets → MLP classifier. Audio: MFCC → i-vector → pLDA.]