
Multimodal acting in mixed reality interactive storytelling

TeesRep - Teesside's Research Repository

Item type Article

Authors Cavazza, M. O. (Marc); Charles, F. (Fred); Mead, S. J. (Steven); Martin, O. (Olivier); Marichal, X. (Xavier); Nandi, A. (Alok)

Citation Cavazza, M. O. et al. (2004) 'Multimodal acting in mixed reality interactive storytelling', IEEE Multimedia, 11(3), pp. 30-39.

DOI 10.1109/MMUL.2004.11

Publisher Institute of Electrical and Electronics Engineers

Journal IEEE Multimedia

Rights Author can archive publisher's version/PDF. For full details see http://www.sherpa.ac.uk/romeo/ [Accessed 07/01/2010]

Downloaded 27-Jun-2018 12:12:41

Link to item http://hdl.handle.net/10149/58295

TeesRep - Teesside University's Research Repository - https://tees.openrepository.com/tees


TeesRep: Teesside University's Research Repository http://tees.openrepository.com/tees/

This full text version, available on TeesRep, is the PDF (final version) of:

Cavazza, M. O. et al. (2004) 'Multimodal acting in mixed reality interactive storytelling', IEEE Multimedia, 11(3), pp. 30-39.

For details regarding the final published version please click on the following DOI link:

http://dx.doi.org/10.1109/MMUL.2004.11

When citing this source, please use the final published version as above.

Copyright © 2004 IEEE. This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of Teesside University's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to [email protected]. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

This document was downloaded from http://tees.openrepository.com/tees/handle/10149/58295

Please do not use this version for citation purposes.

All items in TeesRep are protected by copyright, with all rights reserved, unless otherwise indicated.


Multimodal Acting in Mixed Reality Interactive Storytelling

Marc Cavazza, Fred Charles, and Steven J. Mead, University of Teesside
Olivier Martin, Université Catholique de Louvain
Xavier Marichal and Alok Nandi, Alterface

An experimental mixed reality using a multimodal approach lets users play characters in interactive narratives as though acting on a stage. Users interact with characters through speech, attitude, and gesture, enhancing their immersion in the virtual world.

Interactive storytelling immerses users in fantasy worlds in which they play parts in evolving narratives that respond to their intervention. Implementing the interactive storytelling concept involves many computing technologies: virtual or mixed reality for creating the artificial world, and artificial intelligence techniques and formalisms for generating the narrative and characters in real time.

As a character in the narrative, the user communicates with virtual characters much like an actor communicates with other actors. This requirement introduces a novel context for multimodal communication as well as several technical challenges. Acting involves attitudes and body gestures that are highly significant for both dramatic presentation and communication. At the same time, spoken communication is essential to realistic interactive narratives. This kind of multimodal communication faces several difficulties in terms of real-time performance, coverage, and accuracy.

We’ve developed an experimental system that provides a small-scale but complete integration of multimodal communication in interactive storytelling. It uses a narrative’s semantic context to focus multimodal input processing—that is, the system interprets users’ acting (the multimodal input) in the mixed reality stage in terms of narrative functions representing users’ contributions to the unfolding plot.

System overview: The mixed reality installation

Figure 1 shows the mixed reality system architecture. The system uses a “magic mirror” paradigm, which we derived from the Transfiction approach.1 In our approach, a video camera captures the user’s image in real time, and the Transfiction engine extracts the image from the background and mixes it with a 3D graphic model of a virtual stage, which includes the story’s synthetic characters. The system projects the resulting image on a large screen facing the user, who sees his or her image embedded in the virtual stage with the synthetic actors.

We based the mixed reality world’s graphic component on the Unreal Tournament 2003 game engine (http://www.unrealtournament.com). This engine not only renders graphics and animates characters but, most importantly, contains a sophisticated development environment for defining interaction with objects and character behaviors.2 It also supports integration of external software through socket-based communication.

We use the Transfiction engine to construct the mixed environment through real-time image processing.3 A single (monoscopic) 2D camera analyzes the user’s image in real time by segmenting the user’s contours. The segmentation’s objectives are twofold:

❚ It extracts the user image silhouette and injects it into the virtual setting on the projection screen (without resorting to chroma keying).

❚ At the same time, the Transfiction engine analyzes the extracted body silhouette to recognize and track user behavior (position, attitude, and gestures) and influence the interactive narrative accordingly.

A detection module segments the video image in real time and outputs the resulting image together with other data, such as gesture recognition, that enable further processing. The current detection module uses a 4 × 4 Walsh function Hadamard determinant and calculates the transform on 4 × 4-pixel elements. Sliding the box by two pixels allows taking decisions on 2 × 2-pixel blocks. As a result, it can segment and adequately detect objects’ boundaries and offers some robustness to luminance variations. Figure 2 gives an overview of the change-detection process with the Walsh-Hadamard transform. First, the detection module calculates the background image’s Walsh-Hadamard transform. It then compares the transform’s values for the current and background images. When the rate of change is higher than an established threshold, the module sets the area as foreground. Because shadows (which can be problematic because of variable indoor lighting conditions) can corrupt segmentation results, we remove them using invariant techniques.4
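To make the change-detection step concrete, here is a minimal sketch of the idea in Python. The article specifies only the 4 × 4 Walsh-Hadamard transform, the 2-pixel slide, and thresholding against the background transform; the grayscale input, the relative-change measure, the threshold value, and the choice of the central 2 × 2 block are assumptions made for illustration.

```python
import numpy as np

# 4x4 Walsh (Hadamard) basis: Kronecker square of the 2x2 Hadamard matrix.
H2 = np.array([[1, 1], [1, -1]])
H4 = np.kron(H2, H2)

def block_wht(image, block=4, step=2):
    """Walsh-Hadamard transform of every 4x4 window, sampled on a 2-pixel grid."""
    h, w = image.shape
    coeffs = {}
    for y in range(0, h - block + 1, step):
        for x in range(0, w - block + 1, step):
            patch = image[y:y + block, x:x + block].astype(np.float32)
            coeffs[(y, x)] = H4 @ patch @ H4.T  # separable 2D transform
    return coeffs

def detect_foreground(current, background, threshold=0.25):
    """Mark 2x2 blocks whose transform differs from the background's by more than a threshold."""
    cur, bg = block_wht(current), block_wht(background)
    mask = np.zeros(current.shape, dtype=bool)
    for (y, x), c in cur.items():
        b = bg[(y, x)]
        change = np.abs(c - b).sum() / (np.abs(b).sum() + 1e-6)  # relative rate of change
        if change > threshold:
            mask[y + 1:y + 3, x + 1:x + 3] = True  # decision applies to the central 2x2 block
    return mask
```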

Next, we composite the resulting video image with the virtual environment image by mixing the video channels captured by a separate computer running a DirectX-based application. The first stage involves isolating the user image from its background using basic chroma keying. The remaining stage attempts to solve the occlusion problem by blending the user image with the virtual environment image using empirical depth information.

Figure 1. Mixed reality system architecture. The Transfiction engine extracts the user’s image from the film captured by the video camera and mixes it with a 3D graphic model of a virtual stage. The user views the resulting image on a large screen.

Figure 2. Extracting the user’s image from the background. The detection module uses a Walsh-Hadamard transform to compare the background and current images.

The gesture-recognition module provides this information for the user as the user’s relative distance to the camera; the game engine provides it for the virtual environment. Figure 3 illustrates the overall process whereby the system composites several video image layers in real time to produce the final image, which it projects onto a screen in front of the user.
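The article doesn’t give the exact blending rule. A minimal per-pixel sketch of the layering step, assuming a chroma-keyed silhouette mask, a per-user depth estimate, and a z-buffer exported by the game engine (all array shapes and names are illustrative), could look like this:

```python
import numpy as np

def composite(user_rgb, user_mask, user_depth, env_rgb, env_depth):
    """Layer the segmented user over the rendered scene, resolving occlusion by depth.

    user_mask  -- boolean silhouette from the segmentation/chroma-keying stage
    user_depth -- scalar or per-pixel estimate of the user's distance to the camera
    env_depth  -- per-pixel depth of the rendered virtual environment
    """
    user_in_front = user_mask & (user_depth < env_depth)
    out = env_rgb.copy()
    out[user_in_front] = user_rgb[user_in_front]  # user pixel wins where it is closer
    return out
```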

In the first prototype, the gesture detection and recognition components share a normalized system of coordinates, which we obtain through calibration prior to running the system. This prototype doesn’t deal with occlusion in mixed reality, which is also set at calibration time. We’re currently developing an occlusion management system, which uses depth information provided by the Transfiction engine.

The shared coordinates system lets us not only position the user in the virtual image, but also determine the relations between user and virtual environment. To do this, we map the 2D bounding box produced by the Transfiction engine—which defines the contour of the segmented user character—to a 3D bounding cylinder in the Unreal Tournament 2003 environment, which represents the user’s position in the virtual world (as Figure 4 shows). Relying on its basic mechanisms, the Transfiction engine automatically generates low-level graphical events such as collisions and object interaction.

The two subsystems communicate via TCP sockets: the image-processing module, working on a separate computer, regularly sends two types of message to the graphic engine. The messages update the user’s position and any recognized gestures. The Transfiction engine transmits the recognized gesture as a code for the gesture (for example, a 2D vector indicating the direction of pointing represents a pointing gesture). However, contextual interpretation of the gesture occurs within the storytelling system.
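The wire format isn’t specified in the article; a hypothetical encoding of the two message types (position updates, and recognized gestures with an optional pointing direction) might look like the following sketch, where the field names and gesture codes are invented for the example:

```python
import json
import socket

def send_update(sock: socket.socket, bbox_2d, gesture=None, pointing_dir=None):
    """Send a position update and, optionally, a recognized gesture to the graphic engine.

    bbox_2d      -- (x, y, width, height) of the user's silhouette in normalized coordinates
    gesture      -- symbolic gesture code, e.g. "POINTING" or "ARMS_RAISED" (illustrative names)
    pointing_dir -- optional 2D direction vector accompanying a pointing gesture
    """
    msg = {"type": "position", "bbox": bbox_2d}
    sock.sendall((json.dumps(msg) + "\n").encode())
    if gesture is not None:
        msg = {"type": "gesture", "code": gesture, "direction": pointing_dir}
        sock.sendall((json.dumps(msg) + "\n").encode())
```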

The storytelling scenario in our experiments is a James Bond adventure in which the user plays the villain (the Professor). The narrative properties of James Bond stories make them good candidates for interactive storytelling experiments; Barthes used them as a supporting example in his foundational work in contemporary narratology.5 In addition, their reliance on narrative stereotypes facilitates both narrative control and users’ understanding of the roles they’re to play. The basic storyline represents an early encounter between Bond and the Professor. Bond’s objective is to acquire some essential information, which he can obtain by searching the Professor’s office, asking the Professor’s assistant, or, under certain conditions, deceiving or threatening the Professor himself. The user’s actions as the Professor interfere with Bond’s plan, altering how the plot unfolds.

Figure 3. Constructing the mixed reality environment. The mixed reality system blends the user image and the virtual environment image to produce the final image, which it projects on a large screen.

Figure 4. The 3D bounding cylinder determines physical interactions in the Unreal Tournament 2003 engine.

Interactive storytelling

We adapted the interactive storytelling technology used in these experiments from our previous work, described in detail elsewhere.6 Thus, we only briefly overview the approach here, focusing on the aspects most relevant to the system’s mixed reality implementation, in particular multimodal user interaction.

Interactive storytelling involves the real-time generation of narrative actions such that the consequences of user intervention result in the interactive storytelling system regenerating the story with a modified environment. Narrative control dictates that user intervention should modify the story’s course, but only within the limits of the story’s genre. Narrative control generally relies on a baseline plot that defines possible character actions, but imposes no unnecessary constraints on how the actions can be combined to constitute a plot.

Our approach, character-based interactive storytelling,6 centers on the virtual actors’ roles. We based the artificial intelligence mechanism supporting character behavior on a planning technology using hierarchical task networks (HTNs), as illustrated in Figure 5.7 These representations describe the character’s role as a plan using a hierarchical decomposition of tasks into subtasks. (Formally, HTNs are AND/OR graphs, and we can represent the solution plan as a subgraph of the HTN.)

For instance, we can decompose an information-gathering task into several options for gaining access to that information, such as searching files or getting it from another character. Each of these tasks can be further decomposed—for instance, to get information from another character, the character must approach it, establish a relationship with it, convince it to hand over the information, and so on. HTN task decomposition continues until reaching the terminal-action level—that is, the level at which the synthetic character can visually perform actions in the virtual world. The system thus uses an HTN planner to select each character’s actions in real time. Our module, implemented within the Unreal Tournament 2003 engine, sends an action’s failure back to the planner, which produces an alternative solution. This mechanism is essential in interactive storytelling because user intervention often causes a character’s planned action to fail, leaving it to produce an alternative solution that will lead the story into new directions.
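A minimal sketch of this decomposition-and-replanning behavior, in Python rather than the authors’ planner: tasks decompose into alternative methods (OR) whose subtasks must all succeed in order (AND), and a failed terminal action makes the planner fall back to the next alternative. The task names loosely follow Figure 5, and the execute stub stands in for playing an action in the game engine.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    methods: list = field(default_factory=list)  # list of subtask lists; empty => terminal action

def execute(action_name: str) -> bool:
    """Stand-in for performing a terminal action in the game engine; in the real system,
    user intervention can make this return False."""
    print(f"performing: {action_name}")
    return True

def solve(task: Task) -> bool:
    """Depth-first HTN decomposition with replanning: if one method fails, try the next."""
    if not task.methods:                       # terminal action
        return execute(task.name)
    for method in task.methods:                # OR: alternative decompositions
        if all(solve(sub) for sub in method):  # AND: every subtask must succeed
            return True
    return False                               # no alternative left; failure propagates up

# Toy fragment of Bond's role, loosely following Figure 5.
ask_assistant = Task("ask assistant for info",
                     [[Task("get near her"), Task("ask for info")]])
search_office = Task("search office",
                     [[Task("go to filing cabinet"), Task("search through files")]])
get_info = Task("get info", [[ask_assistant], [search_office]])
solve(get_info)
```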

To accommodate the mixed-reality context, we adapted our previous character-based storytelling framework to the new user-interaction paradigm derived from the magic mirror metaphor,8,9 which assumes greater user involvement than other interactive storytelling approaches. This greater involvement calls for more flexible narrative control. In our supporting example, each situation represents a stage in the encounter between Bond and the villain: introduction, negotiation, and separation. In this version, we’ve defined one HTN for each situation rather than a single HTN encompassing the entire scene. Figure 5 depicts the discussion-phase HTN.

Figure 5. Hierarchical task network for the Bond character. The HTN decomposes tasks into subtasks.

We adapted basic interactive storytelling mechanisms to the greater user involvement. HTNs are still based on the Bond role, but they give a more explicit status to user intervention to allow for the user’s regular, but unpredictable, interaction, as the HTN representing the conversation between Bond and the Professor illustrates. This HTN incorporates several extensions to the HTN used in our previous work. Some extensions involve a novel use of the representation; others required modifying the underlying planning algorithm that uses the HTN to animate the virtual character. For example, the HTN in the current system uses mixed nodes that have both an AND and an OR, letting us incorporate optional actions while still limiting the representation’s complexity (for example, in Figure 5, the HTN makes the threat situation optional). We’ve also incorporated the possibility of some user intervention in the HTN itself. One required extension allows a character to attempt an action only after the planner tests for the compatible user action (in this case, the Professor giving away the information). This doesn’t prevent the user from performing actions other than the one expected, which will impact the character’s plan at another level. In other words, this representation departs from a strict character-based approach to incorporate plot representation to accommodate the higher level of user involvement.
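The article doesn’t detail how the user-action test is encoded. As a rough illustration, under assumed names and a simple event store that are not the authors’ implementation, a character task can be gated on an observed user action so that it fails (and the planner falls back to another branch) until the expected user action has occurred:

```python
# Hypothetical event store filled in by the multimodal interpretation layer.
observed_user_actions: set[str] = set()

def user_action_occurred(name: str) -> bool:
    """Precondition test inserted into the HTN: has the expected user action happened yet?"""
    return name in observed_user_actions

def bond_collect_information() -> bool:
    """Bond only acts on the information once the Professor has given it away."""
    if not user_action_occurred("professor_gives_information"):
        return False          # task fails; the planner tries another branch
    print("Bond notes the information and moves on")
    return True

# The user (as the Professor) may instead threaten or deny; that event simply never
# appears here and affects Bond's plan at another level.
observed_user_actions.add("professor_gives_information")
bond_collect_information()
```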

Multimodal interaction

The user intervention is a multimodal input consisting of a spoken utterance and an optional body gesture. The system interprets this multimodal unit in context (that is, using knowledge about the plot progression) to determine what kind of response it is to the virtual characters’ actions.

Consider the joint recognition of a multimodal speech act comprising an utterance analyzed through speech recognition and a body gesture processed by the Transfiction engine. We categorize the user’s attitude shown in Figure 6 as the user sitting with his arms raised and partly extended (other categories include pointing gestures, which can mean showing, giving, and so on). This attitude is compatible with different interpretations, including greetings (“Welcome, Mr. Bond!”), denial (“You must be joking, Mr. Bond”), or challenge (“Shoot me and you’ll never know, 007!”). We interpret the correct meaning through joint analysis of the user’s utterance and attitude.

Traditional literature on multimodality focuses on the use of deictic gestures in natural language instructions or dialogue10 and of gestures in nonverbal communication.11 The narrative context of interactive storytelling creates new forms of gesture use, which in turn create new multimodal combinations. The system supports deictic gestures, such as when the user indicates an object or a location in a multimodal utterance (“Take a seat, Mr. Bond”).

Another type of gesture—physical gestures—covers on-stage physical interventions, such as grasping an object, slapping a character, or standing in front of an object or character. To implement the effects of physical gestures, we use the main mixed reality mechanism—that is, a single coordinate system that controls interaction through bounding boxes in the virtual environment.

The most important gesture type is semiotic gestures. Semiotic gestures include opening one’s arms to welcome someone, raising a hand to attract attention or call someone, raising both arms in wonder or disbelief, and opening arms to indicate ignorance. What distinguishes these gestures from other nonverbal behaviors is that they constitute isolated units associated with a precise communicative function; in particular, a function that can be mapped to a narrative context (unlike beat and other continuous and dynamic gestures).

The multimodal literature contains much controversy over the status of the various modalities in terms of their semantic content. A distinguishing characteristic of the interactive storytelling context is that speech and semiotic gestures can have comparable semantic content.

Figure 6. Ambiguous gestures. This attitude, characterized as the user sitting with his arms raised and partly extended, can represent greetings, denial, challenge, and so on. Interpreting the correct meaning requires joint analysis of the user’s utterance and attitude.

Speech recognition

Speech is the only practical mode of communication between the user and the virtual actors in an interactive storytelling context, in addition to its being part of the narrative itself. Of course, formidable challenges associated with speech understanding exist. In this context, we aim for a modest level of robustness, although a principled approach would use the specific interactive storytelling context to guide the speech interpretation strategy.

We based the speech recognition component on BabelTech’s Ear software developer’s kit, shown in Figure 7. The SDK can be used in various modes, including multikeyword spotting in an open loop. Multikeyword spotting involves dynamically recognizing keywords from a predefined set in any user utterance, regardless of the utterance’s other contents. It provides robust recognition of the utterance’s most relevant topics in context without imposing constraints on the user (such as the use of a specific phraseology). One necessary step when using multikeyword spotting is to provide a more integrated definition of keywords as meaning units (for example, “be_careful” or “pay_attention”).
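As a rough, text-level illustration of this idea (not the BabelTech Ear SDK, which spots keywords acoustically), meaning units can be defined as compound keywords and spotted anywhere in a transcribed utterance; the keyword sets below are invented for the example:

```python
import re

# Compound meaning units mapped to the surface phrases that trigger them (illustrative).
MEANING_UNITS = {
    "be_careful": ["be careful", "watch out"],
    "pay_attention": ["pay attention", "listen to me"],
    "greeting": ["welcome", "good evening"],
    "refusal": ["you must be joking", "never"],
}

def spot_keywords(utterance: str) -> set[str]:
    """Return every meaning unit whose surface phrase occurs anywhere in the utterance,
    regardless of the utterance's other contents."""
    text = utterance.lower()
    hits = set()
    for unit, phrases in MEANING_UNITS.items():
        if any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases):
            hits.add(unit)
    return hits

print(spot_keywords("Welcome, Mr. Bond!"))            # {'greeting'}
print(spot_keywords("You must be joking, Mr. Bond"))  # {'refusal'}
```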

User utterances occurring in this narrative context are a specific type of speech act—that is, an utterance with a specific impact on the hearer’s behavior. More importantly, a good mapping exists between speech acts and narrative actions (greetings, threats, requests, denials, and so on), such that they constitute direct input into the narrative representation.

Categorizing user utterances in terms of speech acts—that is, recognizing the relevant speech act from the speech recognition output, be it a set of keywords or a more complex structure—is the key problem. No universally agreed-on method to identify speech acts exists. Practical approaches in speech understanding have sought to either detect surface-form cues (such as the occurrence of “welcome” in greetings or wh-markers in questions) or derive the speech act from the utterance’s case structure (that is, identify the action verb and its parameters).

Our implementation uses both approaches in parallel, while extending the shallow approach to cue detection. When the natural-language processing module doesn’t recognize an action verb around which to instantiate an action template, it looks for surface cues, which it maps to a coarse-grained semantic category, such as approval/disapproval, confirmation/denial, or friendly/hostile (surface patterns such as “you won’t,” “you’ll never,” and so on). The underlying principle is to use coarse-grained semantic categories that the module can directly map to a speech act, which can in turn be assimilated into a narrative function according to its impact on the interactive story. Although the use of coarse-grained categories doesn’t let us extract an occurrence’s complete meaning, it does let us identify a global meaning in context, which should trigger an appropriate behavior from the virtual actor.

We based the natural language interpretation on a template-matching procedure, because the use of multikeyword spotting precludes more complete forms of syntax-based parsing. In template-filling procedures, the natural-language processing module looks for certain action verbs or substantives. Recognizing one of these words activates a template, which then searches for keywords corresponding to the action parameters. To find a subject or object, for example, the template might look for pronouns or proper names.

To identify a relevant speech act, the natural-language processing module uses the information in the template, any surface cues encountered, and the narrative context—that is, the current stage of the plot. Implementing speech act identification requires a set of production rules. The narrative actions represented in the plot model determine the number of target speech acts, which are thus limited in number, facilitating the mapping process.
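A sketch of how such production rules might combine the filled template, surface cues, and the current plot stage; the categories, stages, and rules below are assumptions made for the example rather than the system’s actual rule base:

```python
# Shallow route: surface patterns mapped to coarse-grained semantic categories (illustrative).
SURFACE_CUES = {
    "denial": ["you won't", "you'll never", "you must be joking"],
    "greeting": ["welcome", "good evening"],
}

def cue_category(utterance: str):
    """Map surface patterns in the utterance to a coarse-grained semantic category."""
    text = utterance.lower()
    for category, patterns in SURFACE_CUES.items():
        if any(p in text for p in patterns):
            return category
    return None

def identify_speech_act(template_verb, cue, plot_stage):
    """Production rules combining the filled template (if any), surface cues, and narrative context."""
    if template_verb == "threaten":
        return "THREAT"
    if cue == "greeting" and plot_stage == "introduction":
        return "GREETING"
    if cue == "denial" and plot_stage == "negotiation":
        return "REFUSAL"   # mapped to a narrative function: Bond's request fails
    return "UNKNOWN"

print(identify_speech_act(None, cue_category("You must be joking, Mr. Bond"), "negotiation"))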

Figure 7. BabelTech’s Ear software developer’s kit environment.

We’ve so far associated speech acts with spoken utterances; in practice, they correspond to multimodal input, which includes both speech and gesture. Although this doesn’t alter the nature of the target speech acts, it requires processing both modalities simultaneously, which in turn supports more robust recognition.

Gesture recognition and multimodal processing

To a large extent, the gesture recognition software follows a philosophy not unlike that of the multikeyword spotting used for the speech component. That is to say, a fixed set of parameters is extracted from the image, which can be mapped to a previous characterization of user attitudes. A set of semiotic gestures constitutes a gesture lexicon, containing as many as 15 body attitudes, some of which are represented in Figure 8.

Figure 8. Example body gestures. A gesture lexicon contains as many as 15 body attitudes, such as the four illustrated here (“Welcome, Mr. Bond!”; “Please, take a seat.”; “This should help you recover, Mr. Bond.”; “Would you come here, Miss Galore?”).

The gesture collection has a variety of sources: literature, actors in relevant movies, and so on. For each attitude, we collected data from the Transfiction engine and associated it with the gesture in the lexicon. We similarly associate the semiotic interpretation with the gesture. A gesture that is ambiguous out of context receives several possible interpretations (Figure 6).

The representation of each semiotic gesture in the lexicon is a set of descriptive features: distance between arm extremities, hand height, and so on. A set of feature-value pairs describes each gesture. While the Transfiction engine constantly outputs tracking point coordinates, the gesture recognition system derives in real time the values for each gesture feature from the tracking points’ coordinates. The system uses each feature’s set of values to filter candidate gestures from the gesture repository. Whenever it encounters a satisfactory match, it outputs the candidate semiotic gesture or a set of candidate gestures, such as {welcoming, denial}. The gesture recognition system can then unify this semantic category with those categories produced by the speech recognition component. Figure 9, which represents the temporal histograms corresponding to certain gesture features aligned with the spoken utterance, “You must be joking, Mr. Bond!”, illustrates this process.
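A minimal sketch of this filtering-and-unification step, with invented feature ranges and lexicon entries standing in for the actual gesture lexicon:

```python
# Illustrative lexicon: each gesture is a set of feature ranges in normalized image coordinates.
GESTURE_LEXICON = {
    "open_arms": {"hand_distance": (0.5, 1.0), "hand_height": (0.4, 0.8)},
    "pointing":  {"hand_distance": (0.0, 0.3), "hand_height": (0.3, 0.7)},
}
GESTURE_MEANINGS = {"open_arms": {"greeting", "denial"}, "pointing": {"showing", "giving"}}

def classify_gesture(features: dict) -> set[str]:
    """Filter lexicon entries whose feature ranges all contain the observed values,
    then return the union of their candidate semiotic interpretations."""
    candidates = set()
    for gesture, ranges in GESTURE_LEXICON.items():
        if all(lo <= features[name] <= hi for name, (lo, hi) in ranges.items()):
            candidates |= GESTURE_MEANINGS[gesture]
    return candidates

def fuse(gesture_candidates: set[str], speech_category: str):
    """Unify the ambiguous gesture reading with the category from speech processing."""
    return speech_category if speech_category in gesture_candidates else None

features = {"hand_distance": 0.7, "hand_height": 0.6}   # open-arms attitude
print(fuse(classify_gesture(features), "denial"))       # -> 'denial'
```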

Figure 9. Multimodal detection: natural-language processing and gesture recognition for interpretation of the user’s acting (temporal histograms of the horizontal and vertical distances between both hands, thresholded and aligned with the soundtrack of the utterance “You must be joking, Mr. Bond!”).

By processing speech and gestures jointly, the system infers that the open-arms attitude serves a denial narrative function. The system thus interprets the attitude as a negative answer to Bond’s question, which corresponds to a failure of the task in the corresponding HTN, shown in Figure 10, leading to a new course of action.

Figure 10. Multimodal interaction. (a) Bond questions the Professor. (b) The Professor replies, “You must be joking, Mr. Bond!” with a corresponding body gesture. The system interprets the multimodal speech act as a refusal (see Figure 9), causing the corresponding task in Bond’s HTN to fail.

Conclusion

Mixed reality is a significant departure from other paradigms of user involvement, such as pure spectator (with the ability to influence the story)6 or Holodeck,12 in which an actor is immersed in first-person mode. Although we’ve yet to explore the practical implications of such involvement, mixed reality interactive storytelling brings new perspectives for user interaction as well, with an emphasis on multimodal interaction.

Human–computer interface research describes the mode of appropriation used in our system (in which the context leads the user to rediscover modes of expression previously described) as habitability. We can therefore conclude that acting creates the condition for multimodal habitability.

Although the system is fully implemented, it remains a proof-of-concept prototype. Thus it’s still too early to perform detailed user evaluations. However, grounding future evaluation procedures on the multimodal habitability notion is appropriate. This suggests including an early adaptation stage during which the user gets acquainted with the system interface with minimal guidance apart from the prior presentation of original film footage. The subsequent evaluation should be allowed to measure various forms of departure between the user’s expressions and those characteristic of the original role within the story genre.

Acknowledgments

Olivier Martin is funded through First Europe Objectif 3 from the Walloon Region in Belgium.

References

1. A. Nandi and X. Marichal, "Senses of Spaces through Transfiction," Proc. Int'l Workshop Entertainment Computing (IWEC 2002), Kluwer, 2002, pp. 439-446.
2. Comm. ACM, special issue on game engines in scientific research, vol. 45, no. 1, Jan. 2002.
3. X. Marichal and T. Umeda, "Real-Time Segmentation of Video Objects for Mixed-Reality Interactive Applications," Visual Communications and Image Processing 2003, Proc. SPIE, vol. 5150, T. Ebrahimi and T. Sikora, eds., 2003, pp. 41-50.
4. E. Salvador, A. Cavallaro, and T. Ebrahimi, "Shadow Identification and Classification Using Invariant Color Models," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP 2001), IEEE Press, 2001, pp. 1545-1548.
5. R. Barthes, "Introduction à l'Analyse Structurale des Récits" (in French), Communications, vol. 8, 1966, pp. 1-27.
6. M. Cavazza, F. Charles, and S.J. Mead, "Character-Based Interactive Storytelling," IEEE Intelligent Systems, vol. 17, no. 4, 2002, pp. 17-24.
7. D. Nau et al., "SHOP: Simple Hierarchical Ordered Planner," Proc. 16th Int'l Joint Conf. Artificial Intelligence, AAAI Press, 1999, pp. 968-973.
8. T. Darrell et al., "A Novel Environment for Situated Vision and Behavior," Proc. IEEE Workshop Visual Behaviors, IEEE CS Press, 1994, pp. 68-72.
9. T. Psik et al., "The Invisible Person: Advanced Interaction Using an Embedded Interface," Proc. 7th Int'l Workshop Immersive Projection Technology / 9th Eurographics Workshop Virtual Environments, ACM Press, 2003, pp. 29-37.
10. S. Oviatt, "Ten Myths of Multimodal Interaction," Comm. ACM, vol. 42, no. 11, Nov. 1999, pp. 74-81.
11. R.S. Feldman and B. Rime, Fundamentals of Nonverbal Behavior, Cambridge Univ. Press, 1991.
12. W. Swartout et al., "Toward the Holodeck: Integrating Graphics, Sound, Character, and Story," Proc. Autonomous Agents 2001 Conf., ACM Press, 2001, pp. 409-416.

Marc Cavazza is a professor of intelligent virtual environments at the University of Teesside, UK. His main research interest is artificial intelligence for virtual humans, with an emphasis on language technologies. He is currently working on interactive storytelling systems and dialogue formalisms for conversational characters. He received his MD and PhD from the University of Paris 7.

Olivier Martin is a research assistant in artificial intelligence at the Université Catholique de Louvain, Belgium. His main research interest is intelligent agent design for interactive applications, with an emphasis on gesture recognition and behaviour planning. He received his MD cum laude in electrical engineering from the Université Catholique de Louvain.

Fred Charles is senior lecturer in computer games programming at the University of Teesside and is currently completing a PhD in the area of virtual actors in interactive storytelling. He has a BSc in computer science and an MSc in computer graphics from the University of Teesside.


Steven J. Mead is a lecturer in the computer games programming department at the University of Teesside and is currently undertaking an MPhil in intervention in autonomous agents’ behaviors in interactive storytelling using natural language. He has a BSc in computer science and an MSc in computer graphics from the University of Teesside.

Xavier Marichal is the co-founder of Alterface, a spin-off integrating real-time interactive applications. His main interests concern motion estimation/compensation and image analysis in the framework of distributed interactive settings. He has a BSc and a PhD in electrical engineering from the Université Catholique de Louvain.

Alok Nandi is the co-founder of Alterface, a spin-off integrating real-time interactive applications. His main interests are storytelling (interactive and not), art direction, design, and visualization. He received a BSc in electrical engineering from the Université libre de Bruxelles and a Licence en Philosophie et Lettres, with a focus on cinematographic analysis and writing.

Readers may contact Cavazza at the University of Teesside, School of Computing and Mathematics, Borough Rd., Middlesbrough, TS1 3BA, UK; [email protected].

For further information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.
