
Multimodal Processing and Interaction: Audio, Video, Text

Edited by

Petros Maragos, Alexandros Potamianos, and Patrick Gros

November 2006

Contents

I Review of the State-of-the-Art

1 Cross-Modal Integration for Performance Improving in Multimedia: State-of-the-Art Review

2 Human-Computer Interfaces for Multimedia Retrieval: State-of-the-Art Review

II New Research Directions: INTEGRATED MULTIMEDIA ANALYSIS AND RECOGNITION

3 Stochastic Models for Multimodal Video Analysis

4 Adaptive Multimodal Fusion by Uncertainty Compensation with Application to Audiovisual Speech Recognition

5 Movie Analysis with Emphasis to Dialogue Detections

6 Using HMM for Action Recognition in Audio-Visual Streams

7 Surveillance Using Both Video and Audio

8 Audiovisual Attention Modeling and Salient Event Detection

III New Research Directions: SEARCHING MULTIMEDIA CONTENT

9 Interactive Image Retrieval using a Hybrid Visual and Conceptual Content Representation

10 Multi-Modal Analysis of Text and Audio Features for Music Information Retrieval

11 Toward the Integration of NLP and ASR: POS Tagging and Transcription

IV New Research Directions: INTERFACES TO MULTIMEDIA CONTENT

12 Design Principles for Multimodal Spoken Dialogue Systems

13 Eye Tracking for Image Retrieval

14 Natural/Novel User Interfaces for Mobile Devices


Part I

Review of the State-of-the-Art


Chapter 1

Cross-Modal Integration for Performance Improving in Multimedia: State-of-the-Art Review

WP6 authors


Chapter 2

Human-Computer Interfaces for Multimedia Retrieval: State-of-the-Art Review

WP10 authors


Part II

New Research Directions: INTEGRATED MULTIMEDIA ANALYSIS AND RECOGNITION


Chapter 3

Stochastic Models for Multimodal Video Analysis

Manolis Delakis, Guillaume Gravier, and Patrick Gros

IRISA, France

One of the key issues in video analysis is to be able to consider all the available information, i.e. all the media present in a given document: images, sound, speech and text. Although a lot of work exists in sound or image processing, these techniques usually fail to consider all the media, since the media have different temporal rates (25 images per second, but 100 sound vectors for the same second) and the events are not strongly synchronized: in sport videos, the speaker always describes what happened... before!

Apart from ad-hoc solutions (rules on top of monomodal analyses), several techniques, mostly derived from the ASR field, have been developed that can be adapted to analyze videos. HMMs are a first obvious proposition, but they assume a common temporal rate. Segment models allow this constraint to be relaxed. Dynamic Bayesian networks allow the dependencies between the various streams of the document to be enriched by modeling them explicitly. Other propositions are possible, for example based on Recurrent Neural Networks.
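
The chapter addresses multimodal streams whose feature rates differ (25 video frames versus 100 audio vectors per second). As a minimal illustration of this rate mismatch, the following Python sketch upsamples hypothetical video features to the audio rate and concatenates the two streams into a joint observation sequence, the kind of forced synchronization a plain HMM requires and that segment models and dynamic Bayesian networks are designed to avoid; all array shapes and values are placeholders.

import numpy as np

# Hypothetical feature streams for one second of video:
# 25 video frames x 12-dim visual descriptors, 100 audio frames x 39-dim vectors.
video_feats = np.random.randn(25, 12)   # 25 descriptors per second
audio_feats = np.random.randn(100, 39)  # 100 sound vectors per second

# Simplest alignment for a joint HMM: repeat each video vector 4 times
# (100 / 25 = 4) so both streams share the same rate, then concatenate.
factor = 100 // 25
video_up = np.repeat(video_feats, factor, axis=0)   # shape (100, 12)
joint_obs = np.hstack([audio_feats, video_up])      # shape (100, 51)

# joint_obs can feed a standard HMM; segment models or dynamic Bayesian
# networks instead model the streams at their native rates.
print(joint_obs.shape)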


Chapter 4

Adaptive Multimodal Fusion by Uncertainty Compensation with Application to Audiovisual Speech Recognition

George Papandreou, Athanassios Katsamanis, Vasilis Pitsikalis, and Petros Maragos

National Technical University of Athens, Greece

While the accuracy of feature measurements heavily depends on changing environmental conditions, studying the consequences of this fact in pattern recognition tasks has received relatively little attention to date. In this work we explicitly take into account feature measurement uncertainty and we show how classification rules should be adjusted to compensate for its effects. Our approach is particularly fruitful in multimodal fusion scenarios, such as audio-visual speech recognition, where multiple streams of complementary time-evolving features are integrated. For such applications, provided that the measurement noise uncertainty for each feature stream can be estimated, the proposed framework leads to highly adaptive multimodal fusion rules which are widely applicable and easy to implement. We further show that previous multimodal fusion methods relying on stream weights fall under our scheme under certain assumptions; this provides novel insights into their applicability for various tasks and suggests new practical ways for estimating the stream weights adaptively. The potential of our approach is demonstrated in audio-visual speech recognition using either synchronous or asynchronous models.
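
As a point of reference for the fusion rules discussed above, a standard Gaussian noise model (not necessarily the chapter's exact derivation) replaces the clean feature x by an observation y = x + n with noise covariance \Sigma_n, so the class-conditional likelihood is inflated accordingly:

p(y \mid c) = \mathcal{N}(y;\ \mu_c,\ \Sigma_c + \Sigma_n),

while conventional stream-weighted fusion of audio and visual observations o_A, o_V uses

\log p(o_A, o_V \mid c) = w_A \log p(o_A \mid c) + w_V \log p(o_V \mid c).

Under the first rule, a stream whose measurement noise covariance \Sigma_n grows contributes a flatter likelihood and therefore influences the combined decision less, which is the informal sense in which fixed stream weights can appear as a special case of uncertainty compensation.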


Chapter 5

Movie Analysis with Emphasis to Dialogue Detections

S. Siatras, E. Benetos, C. Kotropoulos, N. Nikolaidis, and I. Pitas

AUTH, Greece

The wide prevalence of personal computers, the decreasing cost of mass storage devices, and the advances in compression techniques have fuelled a vast increase in digital multimedia content, giving rise, among others, to online music and video stores, personal multimedia collections and video on demand. However, the convenience of multimedia libraries and the functionality of the aforementioned applications will be in doubt, unless efficient multimedia data management, necessary for organizing, navigating, browsing, searching, and viewing the multimedia content, is employed. Semantic, content-based video indexing and annotation is a promising solution in this direction.

In this chapter, we focus on the problem of detection of dialogue scenes in a video sequence. Dialogues constitute a significant element of any movie. They can be interpreted as high-level semantic features, appropriate for inclusion in more sophisticated and semantically enabled organization, annotation, browsing and retrieval techniques for movies and television programs. For instance, dialogue detection can be incorporated in a browsing or retrieval system in order to enhance it with the functionality of detecting scenes where a conversation is taking place. A thorough description of the basic principles of dialogue detection along with a detailed review of recent dialogue detection algorithms will be provided in this chapter.

Moreover, since certain techniques deal with dialogue detection in conjunction with scene characterization (i.e., detection of action scenes, suspense scenes, etc.), such techniques will also be reviewed. However, the focus will be on dialogue scene detection, since this problem draws considerably more attention in the research community.


Chapter 6

Using HMM for Action Recognition in Audio-Visual Streams

Rozenn Dahyot, Naomi Harte, Daire Lennon, and Anil Kokaram

TCD, Ireland

This chapter presents a tutorial-style discussion of the use of Hidden Markov Model (HMM) frameworks for automatic event detection in multimedia streams. HMMs have long been used in speech recognition as a powerful method for modeling the temporal evolution of speech spectral features. HMMs have more recently become popular in image processing research, both in fusing features from audio-visual streams and also for the explicit modeling of image-related features over time. The basic framework of HMMs is reviewed in this chapter with reference to classic papers such as Rabiner, but with greater emphasis on how the framework is practically applied in image processing. The different approaches for using HMMs for automatic event detection are introduced and supported by examples from previously published work in the area, specifically event classification in sports video and event detection in observational psychology videos.

Two types of sports videos are considered: tennis and snooker. Sports videos usually show a finite number of different views. Using low-level shape and color descriptors as observations, we propose to classify their sequences according to the type of view. Once this classification is performed, we select the shots in the video corresponding to the view with the most significance for the game taking place (large view of the court in tennis, and large view of the table in snooker). Then, using mainly object trajectories (balls or players), we recognize actions in snooker such as break building or conservative play, and in tennis (aces, faults, serves and volleys, etc.). Both view classification and action recognition use Hidden Markov modelling.

The observational psychology video example is taken from a study where subjects are filmed and the video is subsequently analyzed to detect certain physical reflexes in the candidates' limbs during the directed exercises. Traditionally, such content analysis has been done by hand. The purpose of this research was to automatically extract sections of the video containing exercises of interest. The concept is illustrated by the use of an HMM framework for the detection of head rotation sequences within a specific exercise. The features selected and modeled by the HMMs are related to motion and curl features in successive images in this case. Two HMMs are trained: one representing rotation events, the other non-rotation events. Continuous density observations are used in this case. Using classic Viterbi-based recognition, periods of rotation and non-rotation can automatically be distinguished in unseen videos, once the HMMs are trained with a reasonable amount of hand-labeled videos.

The concepts behind suitable feature selection, and whether the use of discrete or continuous observation densities is necessitated by a particular application, are discussed. The applications demonstrate that two main approaches are possible: one where the HMMs are used after some initial coarse segmentation stage, e.g., shot cut detection, to classify segments as one of a set of pre-defined events, as is the case for the sports videos, and one where the segmentation is an explicit part of the HMM output, as is the case for the observational psychology videos. Analogies are drawn with connected word versus continuous speech recognition applications in audio. Thus a tutorial-style approach is taken where arguments are supported and interspersed with examples from existing applications.
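
As a concrete illustration of the two-model scheme described above (one HMM for rotation events, one for non-rotation events, compared by likelihood), the following Python sketch uses the third-party hmmlearn library; the feature arrays, state counts and training sets are placeholders rather than the chapter's actual data.

import numpy as np
from hmmlearn import hmm  # assumed third-party library

# Placeholder motion/curl feature sequences: each item is a (T, D) array.
rot_train = [np.random.randn(40, 4) for _ in range(20)]
nonrot_train = [np.random.randn(40, 4) for _ in range(20)]

def train_hmm(sequences, n_states=3):
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

rot_hmm = train_hmm(rot_train)
nonrot_hmm = train_hmm(nonrot_train)

# Classify an unseen segment by comparing the log-likelihoods of the two models.
segment = np.random.randn(40, 4)
label = "rotation" if rot_hmm.score(segment) > nonrot_hmm.score(segment) else "non-rotation"
print(label)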

Chapter 7

Surveillance Using Both Video and Audio

Yigithan Dedeoglu, B. Ugur Toreyin, Ugur Gudukbay, and A. Enis Cetin

Bilkent University, Turkey

Current CCTV surveillance systems are mostly based on video. It is now possible to install cameras monitoring sensitive areas, but it may not be possible to assign a security guard to each camera or set of cameras. In addition, security guards may get tired and watch the monitor in a blank manner without noticing important events taking place in front of their eyes. Recently, intelligent video analysis systems capable of detecting humans, cars, etc. were developed. Such systems mostly use HMMs or SVMs to reach decisions. They detect important events, but they also produce false alarms. It is possible to take advantage of other low-cost sensors, including audio, to reduce the number of false alarms. Most video recording systems have the capability of recording audio as well. Analysis of audio for intelligent information extraction is a relatively new area. Automatic detection of broken glass sounds, car crash sounds, screams, or an increasing sound level in the background are indicators of important events. By combining the information coming from the audio channel with the information from the video channels, reliable surveillance systems will be achieved. In this chapter, the current state of the art will be reviewed and an intelligent surveillance system analyzing both audio and video channels will be described.
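
A minimal Python sketch of the kind of audio-video corroboration suggested above: a video alarm is kept only if an audio event (e.g. a detected glass break or scream) occurs within a short time window. Detector outputs, timestamps and the window length are hypothetical.

# Hypothetical detector outputs: event timestamps in seconds.
video_alarms = [12.4, 87.0, 203.5]   # e.g. from an HMM/SVM-based video analyzer
audio_events = [12.9, 150.2]         # e.g. glass break / scream / sound level detector

WINDOW = 2.0  # seconds within which audio must corroborate a video alarm

def confirmed_alarms(video_alarms, audio_events, window=WINDOW):
    """Keep only video alarms backed by a nearby audio event, reducing false alarms."""
    return [t for t in video_alarms
            if any(abs(t - a) <= window for a in audio_events)]

print(confirmed_alarms(video_alarms, audio_events))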


Chapter 8

Audiovisual Attention Modeling and Salient Event Detection

G. Evangelopoulos, K. Rapantzikos, and P. Maragos

National Technical University of Athens, Greece

Although human perception appears to be automatic and unconscious, there exist complex sensory mechanisms that form the preattentive component of human understanding and lead to awareness. Considerable research has been carried out into these preattentive mechanisms, and computational models have been developed and applied to common computer vision or speech analysis problems. The separate audio and visual modules may convey explicit, complementary or mutually exclusive information around structures of audiovisual events. We focus on exploring the aural and visual sources of information for modeling attention and the subsequent detection of salient (important) events. In any video sequence the two streams are processed in parallel. Based on recent studies on perceptual and computer attention modeling, we extract attention curves using features around the spatiotemporal structure of video and sounds. Audio saliency is captured by modulation-domain signal modeling and multifrequency band features extracted through nonlinear operators and energy tracking. Important audio events, e.g. speech, music and sound effects, can then be identified by adaptive threshold-based detection mechanisms. Visual saliency is measured by means of spatiotemporal attention models that combine various feature cues (intensity, color, motion, ...) and generate a single saliency map. Statistics are then extracted in regions of interest obtained through segmentation of this map. Integration of the audio and video attention curves is achieved by means of linear and nonlinear fusion schemes, resulting in a single attention curve where events supported by both audio and video are enhanced while others may be suppressed or vanish. Event detection on this final audiovisual curve is performed at multiple scales, and geometrical features such as local extrema and sharp transition points are extracted to signify the presence of important audiovisual events. The potential of intra-module fusion and audiovisual event detection is demonstrated in applications such as key-frame selection, video skimming and summarization, and audio/visual segmentation.
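
The following Python sketch illustrates the final stages described above: two per-frame saliency curves are normalized, fused linearly (a nonlinear variant could take an element-wise maximum), and salient frames are selected by an adaptive threshold. The weights, threshold rule and curves are placeholders, not the chapter's actual models.

import numpy as np

def normalize(curve):
    curve = np.asarray(curve, dtype=float)
    return (curve - curve.min()) / (curve.ptp() + 1e-9)

def fuse(audio_sal, video_sal, w_a=0.5, w_v=0.5):
    """Linear fusion of normalized audio and visual saliency curves."""
    return w_a * normalize(audio_sal) + w_v * normalize(video_sal)

def salient_frames(curve, k=1.0):
    """Adaptive threshold: frames whose fused saliency exceeds mean + k * std."""
    thr = curve.mean() + k * curve.std()
    return np.where(curve > thr)[0]

audio_sal = np.abs(np.random.randn(250))  # placeholder: 10 s of video at 25 fps
video_sal = np.abs(np.random.randn(250))
fused = fuse(audio_sal, video_sal)
print(salient_frames(fused)[:10])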


Part III

New Research Directions: SEARCHING MULTIMEDIA CONTENT


Chapter 9

Interactive Image Retrieval using a Hybrid Visual and Conceptual Content Representation

Marin Ferecatu, Nozha Boujemaa and Michel Crucianu

INRIA, France

Many image databases available today have keyword annotations associated with the images. In spite of the maturity of low-level visual features that reflect well the “physical” content and thus the visual similarity between images, information retrieval based on visual features alone is subject to the semantic gap. Textual annotations may relate to the image context or to a semantic interpretation of the image content; they are not necessarily related to the visual appearance of the images. Keywords and visual features thus provide complementary information. Using both sources of information is an advantage in many applications, and recent work in this area reflects this interest.

In this chapter we will address the challenge of semantic gap reduction, through an original active SVM-based method, jointly with a hybrid visual and conceptual content representation and retrieval. We introduce two improvements of SVM-based relevance feedback methods. First, to optimize the transfer of information between the user and the system, we focus on the criterion employed by the system for selecting the images presented to the user at every feedback round. We put forward a new active learning selection criterion that minimizes redundancy between the candidate images shown to the user. Second, for image classes having very different scales, we find that a high sensitivity of the SVM to the scale of the data brings about a low retrieval performance. We then argue that insensitivity to scale is desirable in this context and we show how to obtain it by the use of specific kernel functions.

Also, we introduce a new feature vector, based on the keyword annotations available for the images, which makes use of conceptual information extracted from an external lexical database, represented by “key concepts”. We test the joint use of the proposed hybrid feature vector, composed of keyword representations and the low-level visual features, in an SVM-based relevance feedback setting. Our experiments show that the use of the keyword-based feature vectors provides a significant improvement of the quality of the results.
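
To make the relevance-feedback loop above concrete, the following Python sketch runs one feedback round with scikit-learn: an SVM is fitted on the images the user has marked, and the next images to display are chosen near the decision boundary while penalizing similarity to images already selected in the same round. This is a generic sketch of an active selection criterion with redundancy reduction, not the chapter's exact criterion or kernel; the hybrid feature matrix is a placeholder.

import numpy as np
from sklearn.svm import SVC

def feedback_round(features, pos_idx, neg_idx, n_show=8, redundancy_penalty=0.5):
    """One relevance-feedback round: fit an SVM on labeled images, then pick
    unlabeled images close to the boundary while discouraging near-duplicates."""
    labeled = list(pos_idx) + list(neg_idx)
    y = [1] * len(pos_idx) + [0] * len(neg_idx)
    clf = SVC(kernel="rbf", gamma="scale").fit(features[labeled], y)

    candidates = [i for i in range(len(features)) if i not in labeled]
    margin = np.abs(clf.decision_function(features[candidates]))

    chosen = []
    for _ in range(n_show):
        scores = margin.copy()
        for c in chosen:  # penalize similarity to images already chosen this round
            scores += redundancy_penalty * (features[candidates] @ features[candidates[c]])
        pick = int(np.argmin(scores))
        chosen.append(pick)
        margin[pick] = np.inf  # never select the same image twice
    return [candidates[c] for c in chosen]

# Hypothetical hybrid (visual + keyword) feature matrix with L2-normalized rows.
X = np.random.randn(500, 64)
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(feedback_round(X, pos_idx=[0, 3], neg_idx=[10, 42]))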


Chapter 10

Multi-Modal Analysis of Text and Audio Features for Music Information Retrieval

Andreas Rauber and Robert Neumayer

Vienna University of Technology, Austria

Multimedia content can be described in multiple ways, as its essence is not limited to one view. For audio data those multiple views are, for instance, a song's audio features as well as its lyrics. Both of those modalities have their advantages: text may be easier to search and may cover more of the 'semantics' of a song, while it does not say much about 'sonic similarity'. Psychoacoustic feature sets, on the other hand, provide the means to identify tracks that 'sound' similar, while they are not suitable for semantic categorisation of any kind. These differing requirements for different types of feature sets are reflected in users' differing information needs. Particularly large collections invite users to explore them interactively in a loose way of browsing, whereas specific searches are much easier, if not only possible, when supported by textual data.

This chapter describes how audio files can be treated in a multi-modal way, pointing out the specific advantages of two kinds of representations. We will show how audio features, which can be extracted in any case, can be seen as the basis of any automatic organisation of audio archives. We will explain the nature of two different feature sets which describe the same instances, i.e., audio tracks. Moreover, we will propose the use of textual data, which may not always be available, on top of low-level audio features. Further, we will show the impact of different combinations of audio features (Statistical Spectrum Descriptors, Rhythm Histograms, and Rhythm Patterns) and textual features based on content as well as stylistic features. Experiments will cover the classification performance of different combinations of feature sets for the genre classification task.
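
A minimal Python sketch of the kind of feature combination described above: audio descriptors and TF-IDF features computed from lyrics are concatenated and fed to a linear classifier for genre classification. The data, feature dimensions and classifier are placeholders (scikit-learn assumed), not the chapter's actual feature sets or results.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Placeholder data: one audio descriptor vector and one lyrics string per track.
audio_feats = np.random.randn(200, 60)    # stand-in for e.g. Rhythm Histogram vectors
lyrics = ["love dance night baby"] * 100 + ["rage riff distortion storm"] * 100
genres = np.array([0] * 100 + [1] * 100)

tfidf = TfidfVectorizer(max_features=500)
text_feats = tfidf.fit_transform(lyrics).toarray()

# Early fusion: concatenate the two views and classify.
X = np.hstack([audio_feats, text_feats])
scores = cross_val_score(LinearSVC(), X, genres, cv=5)
print("mean genre classification accuracy: %.2f" % scores.mean())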


Chapter 11

Toward the Integration of NLP and ASR: POS Tagging and Transcription

Stephane Huet, Guillaume Gravier, and Pascale Sebillot

IRISA, France

This chapter presents a study of the integration of natural language processing (NLP) techniques with automatic speech recognition (ASR) systems. Most of the time, NLP is applied as is to automatic transcriptions, without any specific attempt to integrate both systems in order to improve the transcription and the subsequent analysis. After reviewing some of the attempts to integrate linguistic knowledge into ASR systems, we investigate the use of part-of-speech (POS) tagging to improve speech recognition. We show that traditional POS taggers are reliable when applied to spoken corpora, including automatic transcriptions. This new result enables us to effectively use POS tag knowledge to improve, in a postprocessing stage, the quality of transcriptions, especially by correcting agreement errors. We finally investigate the use of POS information to improve confidence measures.
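
As a toy illustration of the claim that standard POS taggers can be applied directly to ASR-style output (lower-cased, unpunctuated), the following Python sketch tags a hypothetical transcript with NLTK and flags one crude pattern that a post-processing stage might treat as a possible agreement error. The tagger, the transcript and the pattern are illustrative only; the chapter's work concerns French broadcast transcriptions and uses its own tag set and correction strategy.

import nltk  # assumes the NLTK POS tagger models are installed

# ASR-style hypothesis: lower case, no punctuation, possible recognition errors.
hypothesis = "the minister said that the new measures is expected next week"
tags = nltk.pos_tag(hypothesis.split())
print(tags)

# A post-processing stage can search for unlikely tag patterns, e.g. a plural
# noun immediately followed by a singular verb form, and propose a correction.
suspicious = [(w1, w2) for (w1, t1), (w2, t2) in zip(tags, tags[1:])
              if t1 == "NNS" and t2 == "VBZ"]
print(suspicious)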


Part IV

New Research Directions: INTERFACES TO MULTIMEDIA CONTENT


Chapter 12

Design Principles for Multimodal Spoken Dialogue Systems

A. Potamianos and M. Perakakis

Technical University of Crete, Greece

With the appearance of a vast array of computational devices, such as phones, embedded systems, PDAs, laptops, desktops and wall-size displays, with different sizes, computational power and input/output capabilities, the idea of ubiquitous computing is becoming a reality. The quest and need for new ways to interact with computers have led to the creation of various novel devices and modalities. The future of computing will include novel modalities such as gestures, speech and haptics, and innovative devices and sensors, opening the door to new applications. In such an environment, an important issue that has to be addressed is how interaction techniques should change to take this varying input and output hardware into account. Multimodal (or perceptual) interfaces are interfaces where the communication between the system and the user takes place through various input/output modes. Multimodal interfaces should have the ability to fuse the information of the various modalities, to decide at each point of the interaction which modality is best for communicating with the end user (adaptive interfaces), taking various features into account at each time, and to be able to disambiguate input from one modality with the use of another (intelligent interfaces).

In this chapter, we review efforts in defining design principles and creating tools for building multimodal dialog systems, with emphasis on the speech modality. General design principles for architecting and building such systems are reviewed and challenges are outlined. The focus is on system architecture, application and speech interface design, data collection and evaluation tools. We conclude that modularity, flexibility, customizability, domain-independence and automatic dialog generation are some important features of successful dialog systems and design tools. We also present a multimodal system which combines GUI and speech modalities as a design case study. Two important issues in multimodal system design are the selection of appropriate modalities in a given context and the exploitation of the synergies between the modalities in order to design a consistent and efficient interface. We introduce the concept of mode synergy, which measures the added value from efficiently combining multiple input modes. User behavior and system evaluation results on the prototype system demonstrate how users and multimodal systems can (and should) adapt to maximize mode synergy in order to create efficient, natural and intelligent multimodal interfaces.


Chapter 13

Eye Tracking for Image Retrieval

O. Oyekoya and F. Stentiford

UCL, England

Eye-tracking technology offers a natural and immediate way of communicating human intentions to a computer. Eye movements reflect interests and may be analysed to drive computer functionality in games, image and video search, and other visual tasks. This chapter examines current eye tracking technologies and their applications. Experiments are described that show that target images can be identified more rapidly by eye tracking than by using a mouse interface. Further results show that eye tracking technology provides an efficient interface for locating images in a large database. Finally, the chapter speculates about how the technology may enter the mass market as costs decrease.


Chapter 14

Natural/Novel User Interfaces for Mobile Devices

Sanni Siltanen and Seppo Valli

VTT, Finland

People carry their mobile phones almost all the time, which makes them an interesting platform for mobile applications. However, the conventional tiny keypad of a mobile phone is unsuitable for many situations. For example, typing a simple URL like http://www.google.com/ requires over 70 key presses with a typical phone model. In addition, the so-called navi-button often interprets the intended press wrongly. These are a few reasons why people are not so willing to use their mobile phones for demanding purposes, e.g. for getting multimedia content from the Internet. Accordingly, there is a clear need to improve the user interface.

Part 1: Physical Browsing / Pointing User Interface (in a Ubiquitous Environment)

Pointing is perhaps the most natural user interface. Children use pointing inherently in all cultures. Pointing with a mobile phone means that a user points at information or a tag with his/her mobile device to trigger the desired action. For example, to download a music track, the user may point at an advertisement in a newspaper or on a poster. A pointing user interface can be realized in several ways. One possibility is to use so-called tags as access points. The user can point at a tag attached, for instance, to a poster. A variety of tag types are used: visual tags (e.g. barcodes and matrix codes), RFID (Radio Frequency Identification) tags, Bluetooth tags, infrared tags, etc.

Different tag types and their pros and cons will be introduced. Visual tags will be discussed in more detail. Various use cases and applications are described.

Pattern recognition and optical character recognition algorithms are generally not within the processing capacity of current devices. However, general ideas and some examples will be given of how to apply them for interaction with mobile devices.

Part 2: Motion Detection User Interface

As stated earlier, common user interfaces of mobile devices are not suitable for many purposes. In the PC environment, the use of a mouse has become the standard way of moving a cursor on the screen. However, there is no good counterpart to a mouse on mobile phones. The navi-button is intended to be one, but its usability is poor with such a small tap.

A feasible way of substituting a mouse in a mobile device is to use a camera attached to the device and detect its motion. When the user turns or moves the device in his/her palm, the cursor moves correspondingly in the desired direction. Mobile games especially benefit from this kind of natural user interface.

In this chapter we describe how motion detection can be implemented on a mobile phone and describe some examples of motion-tracking-based user interfaces, as well as point out some references for further reading.
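
As a sketch of the camera-based motion detection idea, the following Python code estimates the dominant motion between two successive grayscale frames with dense optical flow (OpenCV assumed) and maps it to a cursor displacement; the frame sizes, gain and sign convention are placeholders, and a real phone implementation would use a much lighter algorithm.

import cv2
import numpy as np

def cursor_delta(prev_gray, curr_gray, gain=2.0):
    """Estimate the mean image motion between two frames and map it to a
    cursor displacement (sign and gain are up to the interface designer)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx = flow[..., 0].mean()
    dy = flow[..., 1].mean()
    return int(gain * dx), int(gain * dy)

# Placeholder frames; on a phone these would come from the camera preview.
prev_frame = np.random.randint(0, 255, (120, 160), dtype=np.uint8)
curr_frame = np.roll(prev_frame, 3, axis=1)  # simulate a small horizontal shift
print(cursor_delta(prev_frame, curr_frame))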
