Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos

1. Title Slide
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
Esra Acar¹, Frank Hopfgartner² and Sahin Albayrak¹
¹ DAI Laboratory, Competence Center Information Retrieval & Machine Learning, TU Berlin, Germany
² Humanities Advanced Technology and Information Institute, University of Glasgow, UK
13th International Workshop on Content-Based Multimedia Indexing (CBMI), Prague, 10 June 2015
Presented by Esra Acar
2. Outline
- Introduction
- The Video Affective Analysis Method: overview; audio and static visual representation learning; mid-level dynamic visual representations; model generation
- Performance Evaluation: dataset & ground truth; experimental setup; results; sample video clips
- Conclusions & Future Work
3. Introduction (1)
Delivering personalized video content drawn from colossal amounts of multimedia is still a challenge. Video affective analysis can address this challenge from an original perspective by analyzing video content at the affective level and by providing access to videos based on the emotions expected to arise in the audience. Affective analysis can be either categorical or dimensional.
4. Introduction (2)
In categorical affective analysis, one direction followed by many researchers is to use machine learning methods. Machine learning approaches rely on a specific data representation, so a key issue is finding an effective representation of video content. Features can be classified by the level of semantic information they carry:
- low-level (e.g., pixel values),
- mid-level (e.g., bag of visual words), and
- high-level (e.g., a guitarist performing a song).
5. Introduction (3)
Another common feature distinction in video analysis is between static and dynamic (or temporal) features. Commonplace approaches to video affective content analysis use:
- low-level audio-visual features,
- mid-level representations built on low-level ones (e.g., horror sound, laughter), and
- high-level semantic attributes (e.g., SentiBank, ObjectBank).
Affective video analysis methods in the literature mainly use handcrafted low- and mid-level features, and exploit the temporal aspect of videos only in a limited manner.
6. The Video Affective Analysis Method (1)
We address two main issues in this work: learning mid-level audio and static visual features, and deriving effective mid-level motion representations. Our approach is a categorical affective analysis solution that maps each video into one of the four quadrants of the Valence-Arousal space (VA space). The choice between categorical and dimensional is not critical: in practice, categories can always be mapped onto dimensions and vice versa.
7. The Video Affective Analysis Method (2)
[Figure: the VA space, where arousal denotes the intensity of emotion and valence the type of emotion. Excerpt from Yazdani, A., Skodras, E., Fakotakis, N., & Ebrahimi, T. (2013). Multimedia content analysis for emotional characterization of music video clips. EURASIP Journal on Image and Video Processing, 2013(1), 26.]
8. An Overview of the Steps in the System
(1) One-minute highlight extracts of music video clips are first segmented into pieces of 5-second length; (2) audio and visual features are extracted; (3) mid-level audio and static visual representations are learned (training); (4) mid-level audio-visual representations are generated; (5) an affective analysis model is generated (training); (6) each 5-second video segment is classified into one of the four quadrants of the VA space (test); and (7) the extract is classified using the results of its 5-second segments (test).
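Step (7) is not detailed on the slides; one plausible aggregation rule, averaging the per-segment quadrant probabilities and taking the arg-max (the averaging rule itself is an assumption), could look like this minimal Python sketch:

```python
import numpy as np

def classify_extract(segment_probs):
    """Aggregate per-segment class probabilities into one extract-level label.

    segment_probs: array of shape (n_segments, 4), one row of quadrant
    probabilities (ha-hv, la-hv, la-lv, ha-lv) per 5-second segment.
    Averaging is an assumption; the slides only say the extract is
    classified "using the results of the 5-second segments".
    """
    mean_probs = np.asarray(segment_probs).mean(axis=0)
    return int(np.argmax(mean_probs))

# A one-minute extract yields 12 segments of 5 seconds each.
probs = np.random.dirichlet(np.ones(4), size=12)  # placeholder predictions
print(classify_extract(probs))
```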
9. Audio and Static Visual Representation Learning (1)
Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space are used as raw data. Convolutional neural networks (CNNs) are used for mid-level feature extraction:
- three convolution and two subsampling layers,
- trained using the backpropagation algorithm,
- the output of the last convolution layer serves as the mid-level audio or visual representation.
10. Audio and Static Visual Representation Learning (2)
[Figure: (a) a high-level overview of our representation learning method; (b) the detailed CNN architectures for audio and visual representation learning. The architecture contains three convolution and two subsampling layers, and one output layer fully connected to the last convolution layer (C6). (CNN: Convolutional Neural Network, MFCC: Mel-Frequency Cepstral Coefficients, A: Audio, V: Visual)]
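The slides fix only the layer structure (three convolution layers, two subsampling layers, and an output layer fully connected to the last convolution layer C6), so the filter counts, kernel sizes, activations and input shape below are illustrative assumptions. A minimal Keras sketch of such a network, exposing the C6 output as the mid-level representation:

```python
# Minimal sketch of the CNN structure described above, assuming Keras.
# Filter counts, kernel sizes, activations and the input shape are guesses;
# only the layer layout (C1, S2, C3, S4, C6, fully connected output) follows the slides.
from tensorflow.keras import Input, Model, layers

def build_cnn(input_shape, n_classes=4):
    inp = Input(shape=input_shape)                                     # MFCC patch or HSV frame
    x = layers.Conv2D(16, (5, 5), activation="tanh")(inp)              # C1: convolution
    x = layers.MaxPooling2D((2, 2))(x)                                 # S2: subsampling
    x = layers.Conv2D(32, (5, 5), activation="tanh")(x)                # C3: convolution
    x = layers.MaxPooling2D((2, 2))(x)                                 # S4: subsampling
    mid = layers.Conv2D(64, (3, 3), activation="tanh", name="c6")(x)   # C6: last convolution
    out = layers.Dense(n_classes, activation="softmax")(layers.Flatten()(mid))
    return Model(inp, out), Model(inp, mid)                            # classifier + feature extractor

model, midlevel_extractor = build_cnn((64, 64, 1))
model.compile(optimizer="sgd", loss="categorical_crossentropy")
# model.fit(...) trains the whole network with backpropagation; afterwards
# midlevel_extractor(x) yields the mid-level audio or static visual representation.
```

One such network would be instantiated for the MFCC input and another for the HSV color input, matching the separate audio and visual branches in the figure.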
11. Mid-Level Dynamic Visual Representations (1)
Motion in edited videos (e.g., music video clips) has been shown to be an important cue for affective video analysis. We adopt the work of Wang et al. on dense trajectories. Dense trajectories:
- are dynamic visual features derived from tracking densely sampled feature points over multiple spatial scales,
- were initially used for unconstrained video action recognition, and
- constitute a powerful tool for motion description.
12. Mid-Level Dynamic Visual Representations (2)
Steps to construct the mid-level motion representations (see the sketch after this list):
- Dense trajectories of length 15 frames are extracted from each video segment and represented by HoG, HoF and motion boundary histograms in the x and y directions (MBHx and MBHy, respectively).
- A separate dictionary is learned for each dense trajectory descriptor: a sparse dictionary learning technique generates a dictionary of size k (k = 512), with 400 x k feature vectors sampled from the training data.
- Sparse representations are generated using the LARS algorithm and max pooling (i.e., sparse-coded Bag-of-Words).
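The dictionary learning and LARS-based sparse coding can be sketched with scikit-learn. The descriptor dimensionality, the random data, and max pooling over absolute code values are placeholder choices; only the dictionary size k = 512 and the LARS coding / max-pooling scheme follow the slides, and the same procedure would be repeated per descriptor type (HoG, HoF, MBHx, MBHy).

```python
# Minimal sketch of the sparse-coded Bag-of-Words step for one descriptor type.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

k = 512                                             # dictionary size from the slides
train_descriptors = np.random.randn(4000, 96)       # the slides sample 400 x k vectors;
                                                    # fewer (and random) here to keep the sketch light

# Learn one dictionary per dense trajectory descriptor.
dico = MiniBatchDictionaryLearning(n_components=k, transform_algorithm="lars",
                                   random_state=0)
dictionary = dico.fit(train_descriptors).components_

def segment_representation(descriptors):
    """Sparse-code one segment's trajectory descriptors (LARS) and max-pool them."""
    codes = sparse_encode(descriptors, dictionary, algorithm="lars")
    return np.abs(codes).max(axis=0)                # max pooling over trajectories -> k-dim vector

segment_descriptors = np.random.randn(200, 96)      # descriptors of one 5-second segment
print(segment_representation(segment_descriptors).shape)   # (512,)
```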
13. Model Generation (1)
Mid-level audio and static visual representations are created using the CNN models, and mid-level motion representations are derived using the sparse-coded BoW. The mid-level audio, dynamic and static visual representations are fed into separate multi-class SVMs (RBF kernel), and the probability estimates of these models are merged using linear or SVM-based fusion.
14. Model Generation (2)
We investigated two distinct fusion techniques to combine the outputs of the SVM models:
- Linear fusion: probability estimates are fused at the decision level using a different weight for each modality; the weights are optimized on the training data.
- SVM-based fusion: the probability estimates of the SVMs are concatenated into vectors, which serve as higher-level representations used to train another SVM that predicts the label of a video segment.
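Both fusion schemes can be sketched with scikit-learn SVMs. The random data and the equal modality weights are placeholders, and fitting the stacking SVM directly on training-set probabilities (rather than held-out estimates) is a simplification for illustration.

```python
# Minimal sketch of linear and SVM-based decision-level fusion.
import numpy as np
from sklearn.svm import SVC

def train_modality_svm(X, y):
    """One multi-class RBF-kernel SVM per modality, with probability estimates."""
    return SVC(kernel="rbf", probability=True).fit(X, y)

rng = np.random.default_rng(0)
y = rng.integers(0, 4, 200)                          # four VA quadrants
modalities = {name: rng.standard_normal((200, 64)) for name in ("audio", "static", "motion")}
svms = {name: train_modality_svm(X, y) for name, X in modalities.items()}
probs = {name: svms[name].predict_proba(modalities[name]) for name in svms}

# Linear fusion: weighted sum of the per-modality probability estimates
# (weights would be optimized on training data; equal weights here).
weights = {"audio": 1 / 3, "static": 1 / 3, "motion": 1 / 3}
linear_pred = np.argmax(sum(w * probs[m] for m, w in weights.items()), axis=1)

# SVM-based fusion: concatenate the probability estimates and train a second SVM.
stacked = np.hstack([probs[m] for m in ("audio", "static", "motion")])
fusion_svm = SVC(kernel="rbf").fit(stacked, y)
svm_pred = fusion_svm.predict(stacked)
```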
15. Performance Evaluation
The experiments aim at comparing the discriminative power of our method against a method that uses low-level audio-visual features (i.e., the baseline method), and against the works presented in [1] and [2].
[1] A. Yazdani, K. Kappeler, and T. Ebrahimi, "Affective content analysis of music video clips," in MIRUM, ACM, 2011.
[2] E. Acar, F. Hopfgartner, and S. Albayrak, "Understanding affective content of music videos through learned representations," in MMM, 2014.
16. Dataset & Ground-truth (1)
We use the DEAP dataset (www.eecs.qmul.ac.uk/mmv/datasets/deap). The DEAP dataset is intended for the analysis of human affective states using electroencephalogram, physiological and video signals. It contains the ratings from an online self-assessment in which 120 one-minute extracts of music videos were each rated by 14-16 volunteers in terms of arousal, valence and dominance. Only the one-minute highlight extracts of the 74 videos available on YouTube were used in the experiments (i.e., 888 video segments).
17. Dataset & Ground-truth (2)
Four affective labels, each representing one quadrant of the VA space, are used for classification:
- high arousal-high valence (ha-hv): 19 songs,
- low arousal-high valence (la-hv): 19 songs,
- low arousal-low valence (la-lv): 14 songs, and
- high arousal-low valence (ha-lv): 22 songs.
The labels are provided in the dataset and are determined by the average ratings of the participants in the online self-assessment (a sketch of the quadrant mapping follows this slide).
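As an illustration of the quadrant labeling, a minimal sketch assuming the DEAP 1-9 self-assessment scale with 5 as the neutral midpoint (the exact thresholding behind the dataset labels is not stated on the slides):

```python
# Map average valence/arousal ratings to the four quadrant labels.
# The 1-9 scale and the midpoint of 5 are assumptions made for illustration.
def quadrant(valence, arousal, midpoint=5.0):
    if arousal >= midpoint:
        return "ha-hv" if valence >= midpoint else "ha-lv"
    return "la-hv" if valence >= midpoint else "la-lv"

print(quadrant(valence=7.2, arousal=6.1))   # -> ha-hv
print(quadrant(valence=3.4, arousal=2.8))   # -> la-lv
```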
18. Experimental Setup (1)
- MFCC extraction: frame size of 25 ms with 10 ms overlap, 13-dimensional coefficients. The mean and standard deviation of the MFCCs form the low-level audio representation (LLR audio).
- Normalized HSV histograms (16, 4, 4 bins) in the HSV color space form the low-level visual representation (LLR visual).
A sketch of these low-level features follows this list.
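The low-level baseline features can be sketched with librosa and OpenCV. The 25 ms frame size, 13 MFCCs and the (16, 4, 4)-bin HSV histogram follow the slides; reading "10 ms overlap" as a 15 ms hop, as well as the file handling, are assumptions.

```python
# Minimal sketch of the low-level audio and visual baseline features.
import cv2
import librosa
import numpy as np

def llr_audio(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    n_fft = int(0.025 * sr)                  # 25 ms frames
    hop = int(0.015 * sr)                    # 15 ms hop, i.e. 10 ms overlap between frames (assumed reading)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])    # 26-dim LLR audio

def llr_visual(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [16, 4, 4],
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, None).flatten()                       # 256-dim LLR visual
```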
19. Experimental Setup (2)
The most computationally expensive phase is training the CNN models: on average 150 seconds per epoch for MFCC and 350 seconds per epoch for color. Generating the feature representations per video segment takes about 0.5 seconds for MFCC using CNNs, 1.2 seconds for color using CNNs, and 16 seconds for the dense-trajectory-based sparse-coded BoW. All timing evaluations were performed on a machine with a 2.40 GHz CPU and 8 GB RAM. A leave-one-song-out cross-validation scheme is used.
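Leave-one-song-out cross-validation amounts to grouping the 5-second segments by their source song and holding out one song per fold; a minimal sketch with scikit-learn's LeaveOneGroupOut (the data shapes and the classifier are placeholders):

```python
# Minimal sketch of leave-one-song-out cross-validation.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_songs, segments_per_song = 74, 12          # 74 extracts x 12 segments = 888
X = rng.standard_normal((n_songs * segments_per_song, 64))
y = rng.integers(0, 4, n_songs * segments_per_song)
groups = np.repeat(np.arange(n_songs), segments_per_song)   # song ID of each segment

scores = cross_val_score(SVC(kernel="rbf"), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())
```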
20. Results: Unimodal Representations
Motion and audio representations are more discriminative than static visual features. The motion representation is superior: the affect present in video clips is often characterized by motion (e.g., camera motion). Color values in the HSV space lead to more discriminative mid-level representations than color values in the RGB space (when compared to our previous work).
[Table: classification accuracies on the DEAP dataset (MLR: mid-level representation)]
21. Results: Multi-modal Representations
The performance gain over prior works is remarkable for SVM-based fusion: a more advanced fusion mechanism performs better. Differences with the setup of work [3]: only 40 video clips from the DEAP dataset are used in [3], namely only the clips which induce strong emotions.
[Table: classification accuracies on the DEAP dataset (MLR: mid-level representation)]
22. Results: Confusion Matrices
[Figure: confusion matrices on the DEAP dataset (mean accuracy: 50% for (a) and 58.11% for (b)). Lighter areas along the main diagonal correspond to better discrimination. (a) MLR audio and static visual; (b) MLR audio, motion and static visual, linear fusion.]
23. Correctly Classified (HA-HV)
Emiliana Torrini - "Jungle Drum"
24. Wrongly Classified (HA-HV): predicted HA-LV
The Go! Team - "Huddle Formation"
25. Correctly Classified (LA-HV)
Grand Archives - "Miniature Birds"
26. Wrongly Classified (LA-HV): predicted HA-LV
The Cardigans - "Carnival"
27. Correctly Classified (LA-LV)
James Blunt - "Goodbye My Lover"
28. Wrongly Classified (LA-LV): predicted LA-HV
Porcupine Tree - "Normal"
29. Correctly Classified (HA-LV)
Arch Enemy - "My Apocalypse"
30. Wrongly Classified (HA-LV): predicted HA-HV
The Cranberries - "Zombie"
31. Conclusions & Future Work (1)
We presented an approach in which higher-level representations are learned from raw data using CNNs and fused with dense-trajectory-based motion features at the decision level. Experimental results on the DEAP dataset support our assumptions (1) that higher-level audio-visual representations learned using CNNs are more discriminative than low-level ones, and (2) that including dense trajectories contributes to increasing the classification performance.
32. Conclusions & Future Work (2)
Future work will:
- concentrate on the modeling aspect of the problem and explore machine learning techniques such as ensemble learning,
- extend our approach to user-generated videos (which are usually not professionally edited), and
- incorporate high-level representations such as sentiment-level semantics.
33. Thanks! Contact
Esra Acar, M.Sc., Researcher
DAI-Labor, Competence Center Information Retrieval & Machine Learning
Technische Universität Berlin, Fakultät IV Elektrotechnik & Informatik
Sekretariat TEL 14, Ernst-Reuter-Platz 7, 10587 Berlin, Germany
www.dai-labor.de
Fon: +49 (0) 30 / 314 74 013, Fax: +49 (0) 30 / 314 74 003
Email: [email protected]