Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos

1. Title Slide
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
Esra Acar¹, Frank Hopfgartner² and Sahin Albayrak¹
¹ DAI Laboratory, Competence Center Information Retrieval & Machine Learning, TU Berlin, Germany
² Humanities Advanced Technology and Information Institute, University of Glasgow, UK
13th International Workshop on Content-Based Multimedia Indexing (CBMI), Prague, 10 June 2015
Presented by Esra Acar
2. Outline
- Introduction
- The Video Affective Analysis Method: overview; audio and static visual representation learning; mid-level dynamic visual representations; model generation
- Performance Evaluation: dataset & ground truth; experimental setup; results; sample video clips
- Conclusions & Future Work
3. Introduction (1)
Delivering personalized video content drawn from colossal amounts of multimedia is still a challenge. Video affective analysis can address this challenge from an original perspective by analyzing video content at the affective level and by providing access to videos based on the emotions expected to arise in the audience. Affective analysis can be either categorical or dimensional.
4. Introduction (2)
In categorical affective analysis, one direction followed by many researchers is to use machine learning methods. Machine learning approaches rely on a specific data representation, so a key issue is finding an effective representation of video content. Features can be classified by the level of semantic information they carry:
- low-level (e.g., pixel values),
- mid-level (e.g., bag of visual words), and
- high-level (e.g., a guitarist performing a song).
5. Introduction (3)
Another common feature distinction in video analysis is between static and dynamic (or temporal) features. Commonplace approaches to video affective content analysis use:
- low-level audio-visual features,
- mid-level representations built on low-level ones (e.g., horror sound, laughter), and
- high-level semantic attributes (e.g., SentiBank, ObjectBank).
Affective video analysis methods in the literature mainly use handcrafted low- and mid-level features, and exploit the temporal aspect of videos only in a limited manner.
6. The Video Affective Analysis Method (1)
We address two main issues in this work: learning mid-level audio and static visual features, and deriving effective mid-level motion representations. Our approach is a categorical affective analysis solution that maps each video into one of the four quadrants of the Valence-Arousal space (VA space). The choice between categorical and dimensional is not critical: in practice, categories can always be mapped onto dimensions and vice versa.
7. The Video Affective Analysis Method (2)
[Figure: the VA space, where arousal denotes the intensity of emotion and valence the type of emotion. Excerpt from Yazdani, A., Skodras, E., Fakotakis, N., & Ebrahimi, T. (2013). Multimedia content analysis for emotional characterization of music video clips. EURASIP Journal on Image and Video Processing, 2013(1), 26.]
8. An Overview of the Steps in the System
(1) One-minute highlight extracts of music video clips are first segmented into pieces of 5-second length; (2) audio and visual features are extracted; (3) mid-level audio and static visual representations are learned (training); (4) mid-level audio-visual representations are generated; (5) an affective analysis model is generated (training); (6) each 5-second video segment is classified into one of the four quadrants of the VA space (test); and (7) the extract is classified using the results of its 5-second segments (test).
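Step (7) is not detailed on the slides; one plausible aggregation rule, averaging the per-segment quadrant probabilities and taking the arg-max (the averaging rule itself is an assumption), could look like this minimal Python sketch:

```python
import numpy as np

def classify_extract(segment_probs):
    """Aggregate per-segment class probabilities into one extract-level label.

    segment_probs: array of shape (n_segments, 4), one row of quadrant
    probabilities (ha-hv, la-hv, la-lv, ha-lv) per 5-second segment.
    Averaging is an assumption; the slides only say the extract is
    classified "using the results of the 5-second segments".
    """
    mean_probs = np.asarray(segment_probs).mean(axis=0)
    return int(np.argmax(mean_probs))

# A one-minute extract yields 12 segments of 5 seconds each.
probs = np.random.dirichlet(np.ones(4), size=12)  # placeholder predictions
print(classify_extract(probs))
```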
9. Audio and Static Visual Representation Learning (1)
Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space are used as raw data. Convolutional neural networks (CNNs) are used for mid-level feature extraction:
- three convolution and two subsampling layers,
- trained using the backpropagation algorithm,
- the output of the last convolution layer serves as the mid-level audio or visual representation.
10. Audio and Static Visual Representation Learning (2)
[Figure: (a) a high-level overview of our representation learning method; (b) the detailed CNN architectures for audio and visual representation learning. The architecture contains three convolution and two subsampling layers, and one output layer fully connected to the last convolution layer (C6). (CNN: Convolutional Neural Network, MFCC: Mel-Frequency Cepstral Coefficients, A: Audio, V: Visual)]
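The slides fix only the layer structure (three convolution layers, two subsampling layers, and an output layer fully connected to the last convolution layer C6), so the filter counts, kernel sizes, activations and input shape below are illustrative assumptions. A minimal Keras sketch of such a network, exposing the C6 output as the mid-level representation:

```python
# Minimal sketch of the CNN structure described above, assuming Keras.
# Filter counts, kernel sizes, activations and the input shape are guesses;
# only the layer layout (C1, S2, C3, S4, C6, fully connected output) follows the slides.
from tensorflow.keras import Input, Model, layers

def build_cnn(input_shape, n_classes=4):
    inp = Input(shape=input_shape)                                     # MFCC patch or HSV frame
    x = layers.Conv2D(16, (5, 5), activation="tanh")(inp)              # C1: convolution
    x = layers.MaxPooling2D((2, 2))(x)                                 # S2: subsampling
    x = layers.Conv2D(32, (5, 5), activation="tanh")(x)                # C3: convolution
    x = layers.MaxPooling2D((2, 2))(x)                                 # S4: subsampling
    mid = layers.Conv2D(64, (3, 3), activation="tanh", name="c6")(x)   # C6: last convolution
    out = layers.Dense(n_classes, activation="softmax")(layers.Flatten()(mid))
    return Model(inp, out), Model(inp, mid)                            # classifier + feature extractor

model, midlevel_extractor = build_cnn((64, 64, 1))
model.compile(optimizer="sgd", loss="categorical_crossentropy")
# model.fit(...) trains the whole network with backpropagation; afterwards
# midlevel_extractor(x) yields the mid-level audio or static visual representation.
```

One such network would be instantiated for the MFCC input and another for the HSV color input, matching the separate audio and visual branches in the figure.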
11. Mid-Level Dynamic Visual Representations (1)
Motion in edited videos (e.g., music video clips) has been shown to be an important cue for affective video analysis. We adopt the work of Wang et al. on dense trajectories. Dense trajectories:
- are dynamic visual features derived from tracking densely sampled feature points over multiple spatial scales,
- were initially used for unconstrained video action recognition, and
- constitute a powerful tool for motion description.
12. Mid-Level Dynamic Visual Representations (2)
Steps to construct the mid-level motion representations (see the sketch after this list):
- Dense trajectories of length 15 frames are extracted from each video segment and represented by HoG, HoF and motion boundary histograms in the x and y directions (MBHx and MBHy, respectively).
- A separate dictionary is learned for each dense trajectory descriptor: a sparse dictionary learning technique generates a dictionary of size k (k = 512), with 400 x k feature vectors sampled from the training data.
- Sparse representations are generated using the LARS algorithm and max pooling (i.e., sparse-coded Bag-of-Words).
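The dictionary learning and LARS-based sparse coding can be sketched with scikit-learn. The descriptor dimensionality, the random data, and max pooling over absolute code values are placeholder choices; only the dictionary size k = 512 and the LARS coding / max-pooling scheme follow the slides, and the same procedure would be repeated per descriptor type (HoG, HoF, MBHx, MBHy).

```python
# Minimal sketch of the sparse-coded Bag-of-Words step for one descriptor type.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

k = 512                                             # dictionary size from the slides
train_descriptors = np.random.randn(4000, 96)       # the slides sample 400 x k vectors;
                                                    # fewer (and random) here to keep the sketch light

# Learn one dictionary per dense trajectory descriptor.
dico = MiniBatchDictionaryLearning(n_components=k, transform_algorithm="lars",
                                   random_state=0)
dictionary = dico.fit(train_descriptors).components_

def segment_representation(descriptors):
    """Sparse-code one segment's trajectory descriptors (LARS) and max-pool them."""
    codes = sparse_encode(descriptors, dictionary, algorithm="lars")
    return np.abs(codes).max(axis=0)                # max pooling over trajectories -> k-dim vector

segment_descriptors = np.random.randn(200, 96)      # descriptors of one 5-second segment
print(segment_representation(segment_descriptors).shape)   # (512,)
```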
13. Model Generation (1)
Mid-level audio and static visual representations are created using the CNN models, and mid-level motion representations are derived using the sparse-coded BoW. The mid-level audio, dynamic and static visual representations are fed into separate multi-class SVMs (RBF kernel), and the probability estimates of these models are merged using linear or SVM-based fusion.
14. Model Generation (2)
We investigated two distinct fusion techniques to combine the outputs of the SVM models:
- Linear fusion: probability estimates are fused at the decision level using a different weight for each modality; the weights are optimized on the training data.
- SVM-based fusion: the probability estimates of the SVMs are concatenated into vectors, which serve as higher-level representations used to train another SVM that predicts the label of a video segment.
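Both fusion schemes can be sketched with scikit-learn SVMs. The random data and the equal modality weights are placeholders, and fitting the stacking SVM directly on training-set probabilities (rather than held-out estimates) is a simplification for illustration.

```python
# Minimal sketch of linear and SVM-based decision-level fusion.
import numpy as np
from sklearn.svm import SVC

def train_modality_svm(X, y):
    """One multi-class RBF-kernel SVM per modality, with probability estimates."""
    return SVC(kernel="rbf", probability=True).fit(X, y)

rng = np.random.default_rng(0)
y = rng.integers(0, 4, 200)                          # four VA quadrants
modalities = {name: rng.standard_normal((200, 64)) for name in ("audio", "static", "motion")}
svms = {name: train_modality_svm(X, y) for name, X in modalities.items()}
probs = {name: svms[name].predict_proba(modalities[name]) for name in svms}

# Linear fusion: weighted sum of the per-modality probability estimates
# (weights would be optimized on training data; equal weights here).
weights = {"audio": 1 / 3, "static": 1 / 3, "motion": 1 / 3}
linear_pred = np.argmax(sum(w * probs[m] for m, w in weights.items()), axis=1)

# SVM-based fusion: concatenate the probability estimates and train a second SVM.
stacked = np.hstack([probs[m] for m in ("audio", "static", "motion")])
fusion_svm = SVC(kernel="rbf").fit(stacked, y)
svm_pred = fusion_svm.predict(stacked)
```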
15. Performance Evaluation
The experiments aim at comparing the discriminative power of our method against a method that uses low-level audio-visual features (i.e., the baseline method), and against the works presented in [1] and [2].
[1] A. Yazdani, K. Kappeler, and T. Ebrahimi, "Affective content analysis of music video clips," in MIRUM, ACM, 2011.
[2] E. Acar, F. Hopfgartner, and S. Albayrak, "Understanding affective content of music videos through learned representations," in MMM, 2014.
16. Dataset & Ground-truth (1)
We use the DEAP dataset (www.eecs.qmul.ac.uk/mmv/datasets/deap). The DEAP dataset is intended for the analysis of human affective states using electroencephalogram, physiological and video signals. It contains the ratings from an online self-assessment in which 120 one-minute extracts of music videos were each rated by 14-16 volunteers in terms of arousal, valence and dominance. Only the one-minute highlight extracts of the 74 videos available on YouTube were used in the experiments (i.e., 888 video segments).
17. Dataset & Ground-truth (2)
Four affective labels, each representing one quadrant of the VA space, are used for classification:
- high arousal-high valence (ha-hv): 19 songs,
- low arousal-high valence (la-hv): 19 songs,
- low arousal-low valence (la-lv): 14 songs, and
- high arousal-low valence (ha-lv): 22 songs.
The labels are provided in the dataset and are determined by the average ratings of the participants in the online self-assessment (a sketch of the quadrant mapping follows this slide).
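As an illustration of the quadrant labeling, a minimal sketch assuming the DEAP 1-9 self-assessment scale with 5 as the neutral midpoint (the exact thresholding behind the dataset labels is not stated on the slides):

```python
# Map average valence/arousal ratings to the four quadrant labels.
# The 1-9 scale and the midpoint of 5 are assumptions made for illustration.
def quadrant(valence, arousal, midpoint=5.0):
    if arousal >= midpoint:
        return "ha-hv" if valence >= midpoint else "ha-lv"
    return "la-hv" if valence >= midpoint else "la-lv"

print(quadrant(valence=7.2, arousal=6.1))   # -> ha-hv
print(quadrant(valence=3.4, arousal=2.8))   # -> la-lv
```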
18. Experimental Setup (1)
- MFCC extraction: frame size of 25 ms with 10 ms overlap, 13-dimensional coefficients. The mean and standard deviation of the MFCCs form the low-level audio representation (LLR audio).
- Normalized HSV histograms (16, 4, 4 bins) in the HSV color space form the low-level visual representation (LLR visual).
A sketch of these low-level features follows this list.
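The low-level baseline features can be sketched with librosa and OpenCV. The 25 ms frame size, 13 MFCCs and the (16, 4, 4)-bin HSV histogram follow the slides; reading "10 ms overlap" as a 15 ms hop, as well as the file handling, are assumptions.

```python
# Minimal sketch of the low-level audio and visual baseline features.
import cv2
import librosa
import numpy as np

def llr_audio(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    n_fft = int(0.025 * sr)                  # 25 ms frames
    hop = int(0.015 * sr)                    # 15 ms hop, i.e. 10 ms overlap between frames (assumed reading)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])    # 26-dim LLR audio

def llr_visual(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [16, 4, 4],
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, None).flatten()                       # 256-dim LLR visual
```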
19. Experimental Setup (2)
The most computationally expensive phase is training the CNN models: on average 150 seconds per epoch for MFCC and 350 seconds per epoch for color. Generating the feature representations per video segment takes about 0.5 seconds for MFCC using CNNs, 1.2 seconds for color using CNNs, and 16 seconds for the dense-trajectory-based sparse-coded BoW. All timing evaluations were performed on a machine with a 2.40 GHz CPU and 8 GB RAM. A leave-one-song-out cross-validation scheme is used.
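Leave-one-song-out cross-validation amounts to grouping the 5-second segments by their source song and holding out one song per fold; a minimal sketch with scikit-learn's LeaveOneGroupOut (the data shapes and the classifier are placeholders):

```python
# Minimal sketch of leave-one-song-out cross-validation.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_songs, segments_per_song = 74, 12          # 74 extracts x 12 segments = 888
X = rng.standard_normal((n_songs * segments_per_song, 64))
y = rng.integers(0, 4, n_songs * segments_per_song)
groups = np.repeat(np.arange(n_songs), segments_per_song)   # song ID of each segment

scores = cross_val_score(SVC(kernel="rbf"), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())
```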
20. Results: Unimodal Representations
Motion and audio representations are more discriminative than static visual features. The motion representation is superior: the affect present in video clips is often characterized by motion (e.g., camera motion). Color values in the HSV space lead to more discriminative mid-level representations than color values in the RGB space (when compared to our previous work).
[Table: classification accuracies on the DEAP dataset (MLR: mid-level representation)]
21. Results: Multi-modal Representations
The performance gain over prior works is remarkable for SVM-based fusion: a more advanced fusion mechanism performs better. Differences with the setup of work [3]: only 40 video clips from the DEAP dataset are used in [3], namely only the clips which induce strong emotions.
[Table: classification accuracies on the DEAP dataset (MLR: mid-level representation)]
22. Results: Confusion Matrices
[Figure: confusion matrices on the DEAP dataset (mean accuracy: 50% for (a) and 58.11% for (b)). Lighter areas along the main diagonal correspond to better discrimination. (a) MLR audio and static visual; (b) MLR audio, motion and static visual, linear fusion.]
23. Correctly Classified (HA-HV)
Emiliana Torrini - "Jungle Drum"
24. Wrongly Classified (HA-HV): predicted HA-LV
The Go! Team - "Huddle Formation"
25. Correctly Classified (LA-HV)
Grand Archives - "Miniature Birds"
26. Wrongly Classified (LA-HV): predicted HA-LV
The Cardigans - "Carnival"
27. Correctly Classified (LA-LV)
James Blunt - "Goodbye My Lover"
28. Wrongly Classified (LA-LV): predicted LA-HV
Porcupine Tree - "Normal"
29. Correctly Classified (HA-LV)
Arch Enemy - "My Apocalypse"
30. Wrongly Classified (HA-LV): predicted HA-HV
The Cranberries - "Zombie"
31. Conclusions & Future Work (1)
We presented an approach in which higher-level representations are learned from raw data using CNNs and fused with dense-trajectory-based motion features at the decision level. Experimental results on the DEAP dataset support our assumptions (1) that higher-level audio-visual representations learned using CNNs are more discriminative than low-level ones, and (2) that including dense trajectories contributes to increasing the classification performance.
32. Conclusions & Future Work (2)
Future work will:
- concentrate on the modeling aspect of the problem and explore machine learning techniques such as ensemble learning,
- extend our approach to user-generated videos (which are usually not professionally edited), and
- incorporate high-level representations such as sentiment-level semantics.
33. Thanks! Contact
Esra Acar, M.Sc., Researcher
DAI-Labor, Competence Center Information Retrieval & Machine Learning
Technische Universität Berlin, Fakultät IV Elektrotechnik & Informatik
Sekretariat TEL 14, Ernst-Reuter-Platz 7, 10587 Berlin, Germany
www.dai-labor.de
Fon: +49 (0) 30 / 314 74 013, Fax: +49 (0) 30 / 314 74 003
Email: [email protected]