Multimodal acting in mixed reality interactive storytelling
TeesRep - Teesside's Research Repository
Item type Article
Authors Cavazza, M. O. (Marc); Charles, F. (Fred); Mead, S. J. (Steven); Martin, O. (Olivier); Marichal, X. (Xavier); Nandi, A. (Alok)
Citation Cavazza, M. O. et al. (2004) 'Multimodal acting in mixed reality interactive storytelling', IEEE Multimedia, 11(3), pp. 30-39.
DOI 10.1109/MMUL.2004.11
Publisher Institute of Electrical and Electronics Engineers
Journal IEEE Multimedia
Rights Author can archive publisher's version/PDF. For full details see http://www.sherpa.ac.uk/romeo/ [Accessed 07/01/2010]
Downloaded 27-Jun-2018 12:12:41
Link to item http://hdl.handle.net/10149/58295
TeesRep - Teesside University's Research Repository - https://tees.openrepository.com/tees
This full text version, available on TeesRep, is the PDF (final version) of:
Cavazza, M. O. et al. (2004) 'Multimodal acting in mixed reality interactive storytelling', IEEE Multimedia, 11(3), pp. 30-39.
For details regarding the final published version please click on the following DOI link:
http://dx.doi.org/10.1109/MMUL.2004.11
When citing this source, please use the final published version as above.
Copyright © 2005 IEEE. This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of Teesside University's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to [email protected]. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
This document was downloaded from http://tees.openrepository.com/tees/handle/10149/58295
Please do not use this version for citation purposes.
All items in TeesRep are protected by copyright, with all rights reserved, unless otherwise indicated.
An experimental mixed reality using a multimodal approach lets users play characters in interactive narratives as though acting on a stage. Users interact with characters through speech, attitude, and gesture, enhancing their immersion in the virtual world.
Interactive storytelling immerses users in fantasy worlds in which they play parts in evolving narratives that respond to their intervention. Implementing the interactive storytelling concept involves many computing technologies: virtual or mixed reality for creating the artificial world, and artificial intelligence techniques and formalisms for generating the narrative and characters in real time.
As a character in the narrative, the user communicates with virtual characters much like an actor communicates with other actors. This requirement introduces a novel context for multimodal communication as well as several technical challenges. Acting involves attitudes and body gestures that are highly significant for both dramatic presentation and communication. At the same time, spoken communication is essential to realistic interactive narratives. This kind of multimodal communication faces several difficulties in terms of real-time performance, coverage, and accuracy.
We've developed an experimental system that provides a small-scale but complete integration of multimodal communication in interactive storytelling. It uses a narrative's semantic context to focus multimodal input processing—that is, the system interprets users' acting (the multimodal input) in the mixed reality stage in terms of narrative functions representing users' contributions to the unfolding plot.
System overview: The mixed reality installation
Figure 1 shows the mixed reality system architecture. The system uses a "magic mirror" paradigm, which we derived from the Transfiction approach.1 In our approach, a video camera captures the user's image in real time, and the Transfiction engine extracts the image from the background and mixes it with a 3D graphic model of a virtual stage, which includes the story's synthetic characters. The system projects the resulting image on a large screen facing the user, who sees his or her image embedded in the virtual stage with the synthetic actors.
We based the mixed reality world's graphic component on the Unreal Tournament 2003 game engine (http://www.unrealtournament.com). This engine not only renders graphics and animates characters but, most importantly, contains a sophisticated development environment for defining interaction with objects and character behaviors.2 It also supports integration of external software through socket-based communication.
We use the Transfiction engine to construct the mixed environment through real-time image processing.3 A single (monoscopic) 2D camera analyzes the user's image in real time by segmenting the user's contours. The segmentation's objectives are twofold:

❚ It extracts the user image silhouette and injects it into the virtual setting on the projection screen (without resorting to chroma keying).

❚ At the same time, the Transfiction engine analyzes the extracted body silhouette to recognize and track user behavior (position, attitude, and gestures) and influence the interactive narrative accordingly.
1070-986X/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society.

Multimodal Acting in Mixed Reality Interactive Storytelling
Marc Cavazza, Fred Charles, and Steven J. Mead, University of Teesside
Olivier Martin, Université Catholique de Louvain
Xavier Marichal and Alok Nandi, Alterface

Multisensory Communication and Experience through Multimedia

Authorized licensed use limited to: Teesside University. Downloaded on January 25, 2010 at 05:52 from IEEE Xplore. Restrictions apply.

A detection module segments the video image in real time and outputs the resulting image together with other data, such as gesture-recognition results, that enable further processing. The current detection module uses a 4 × 4 Walsh-function Hadamard matrix and calculates the transform on 4 × 4-pixel elements. Sliding the box by two pixels lets it make decisions on 2 × 2-pixel blocks. As a result, it can segment and adequately detect objects' boundaries and offers some robustness to luminance variations. Figure 2 gives an overview of the change-detection process with the Walsh-Hadamard transform. First, the detection module calculates the background image's Walsh-Hadamard transform. It then compares the transform's values for the current and background images. When the rate of change is higher than an established threshold, the module sets the area as foreground. Because shadows (which can be problematic because of variable indoor lighting conditions) can corrupt segmentation results, we remove them using invariant techniques.4
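To make the change-detection step concrete, here is a minimal sketch in Python. The 4 × 4 Hadamard matrix is standard, but the block values, the relative-change threshold, and the function names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of block-based change detection with a 4x4 Walsh-Hadamard
# transform. Threshold and helper names are illustrative assumptions.

# 4x4 Hadamard matrix (unnormalized), rows are Walsh functions.
H4 = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

def wht4(block):
    """2D Walsh-Hadamard transform of a 4x4 block: H4 @ block @ H4."""
    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
                for i in range(4)]
    return matmul(matmul(H4, block), H4)

def is_foreground(bg_block, cur_block, threshold=0.25):
    """Mark a block as foreground when the relative change in
    transform coefficients exceeds the threshold."""
    tb, tc = wht4(bg_block), wht4(cur_block)
    diff = sum(abs(tb[i][j] - tc[i][j]) for i in range(4) for j in range(4))
    ref = sum(abs(v) for row in tb for v in row) or 1
    return diff / ref > threshold

bg = [[10] * 4 for _ in range(4)]      # flat background block
same = [[11] * 4 for _ in range(4)]    # slight luminance change only
person = [[10, 200, 200, 10]] * 4      # object entering the block

print(is_foreground(bg, same))    # small change: stays background
print(is_foreground(bg, person))  # large change: foreground
```

Comparing transform coefficients rather than raw pixels is what gives the method its tolerance to small, uniform luminance variations.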
Next, we composite the resulting video image with the virtual environment image by mixing the video channels captured by a separate computer running a DirectX-based application. The first stage involves isolating the user image from its background using basic chroma keying. The remaining stage attempts to solve the occlusion problem by blending the user image with the virtual environment image using empirical depth information. The gesture-recognition module
July–September 2004

[Figure 1 diagram: BabelTech speech recognition feeds natural language processing and, over TCP, the storytelling engine; the Unreal engine, with its AI planning layer (hierarchical task networks), produces the rendered scene; the Transfiction engine extracts the silhouette from the captured image over UDP; the mixed video signals are shown on the projection screen facing the user.]
Figure 1. Mixed reality system architecture. The Transfiction engine extracts the user's image from the film captured by the video camera and mixes it with a 3D graphic model of a virtual stage. The user views the resulting image on a large screen.
Figure 2. Extracting the user's image from the background. The detection module uses a Walsh-Hadamard transform to compare the background and current images.
provides this information for the user as the user's relative distance to the camera; the game engine provides it for the virtual environment. Figure 3 illustrates the overall process whereby the system composites several video image layers in real time to produce the final image, which it projects onto a screen in front of the user.
In the first prototype, the gesture detection and recognition components share a normalized system of coordinates, which we obtain through calibration prior to running the system. This prototype doesn't deal with occlusion in mixed reality, which is also set at calibration time. We're currently developing an occlusion management system, which uses depth information provided by the Transfiction engine.
The shared coordinates system lets us not only position the user in the virtual image, but also determine the relations between user and virtual environment. To do this, we map the 2D bounding box produced by the Transfiction engine—which defines the contour of the segmented user character—to a 3D bounding cylinder in the Unreal Tournament 2003 environment, which represents the user's position in the virtual world (as Figure 4 shows). Relying on its basic mechanisms, the Transfiction engine automatically generates low-level graphical events such as collisions and object interaction.
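The box-to-cylinder mapping can be sketched as follows. The calibration constants, class names, and the heuristic of deriving depth from the silhouette's apparent height are all assumptions for illustration; the article only states that the two representations share a calibrated coordinate system.

```python
# Sketch: mapping the segmented user's 2D bounding box (normalized
# image coordinates) to a bounding cylinder in the virtual world.
# Calibration constants and class names are assumptions.

from dataclasses import dataclass

@dataclass
class Box2D:
    x: float   # left edge, normalized [0, 1]
    y: float   # top edge, normalized [0, 1]
    w: float   # width, normalized
    h: float   # height, normalized

@dataclass
class Cylinder3D:
    cx: float      # lateral position, world units
    cy: float      # depth position, world units
    radius: float
    height: float

# Assumed calibration: the screen maps to a 400 x 300 world-unit
# stage; a full-height silhouette (REF_H) stands at the stage front.
STAGE_W, STAGE_D, REF_H, WORLD_H = 400.0, 300.0, 0.8, 180.0

def box_to_cylinder(box: Box2D) -> Cylinder3D:
    cx = (box.x + box.w / 2) * STAGE_W      # lateral position
    cy = (1.0 - box.h / REF_H) * STAGE_D    # smaller silhouette = farther
    radius = (box.w / 2) * STAGE_W
    height = (box.h / REF_H) * WORLD_H
    return Cylinder3D(cx, cy, radius, height)

c = box_to_cylinder(Box2D(x=0.375, y=0.1, w=0.25, h=0.8))
print(c)  # centered user at the front of the stage
```

Once the user occupies a cylinder in world coordinates, the game engine's ordinary collision tests can raise the low-level events mentioned above.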
The two subsystems communicate via TCP sockets: the image-processing module, working on a separate computer, regularly sends two types of message to the graphic engine. The messages update the user's position and any recognized gestures. The Transfiction engine transmits the recognized gesture as a code (for example, a 2D vector indicating the direction of pointing represents a pointing gesture). However, contextual interpretation of the gesture occurs within the storytelling system.
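The two message types could be encoded, for example, as newline-delimited JSON; the article does not specify the wire format, so everything below is a hypothetical sketch.

```python
# Sketch of the two message types exchanged over the TCP link:
# position updates and recognized gestures (with a 2D direction
# vector for pointing). The JSON wire format is an assumption.

import json

def encode_position(x: float, y: float) -> bytes:
    return (json.dumps({"type": "position", "x": x, "y": y}) + "\n").encode()

def encode_gesture(code: str, direction=None) -> bytes:
    msg = {"type": "gesture", "code": code}
    if direction is not None:          # e.g. a pointing gesture
        msg["dir"] = list(direction)
    return (json.dumps(msg) + "\n").encode()

def decode(line: bytes) -> dict:
    return json.loads(line.decode())

# The graphic engine would read newline-delimited messages from the
# socket and dispatch on the "type" field.
m1 = decode(encode_position(0.42, 0.77))
m2 = decode(encode_gesture("pointing", direction=(0.7, -0.7)))
print(m1["type"], m2["code"], m2["dir"])
```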
The storytelling scenario in our experiments is a James Bond adventure in which the user plays the villain (the Professor). The narrative properties of James Bond stories make them good candidates for interactive storytelling experiments; Barthes used them as a supporting example in his foundational work in contemporary narratology.5 In addition, their reliance on narrative stereotypes facilitates both narrative control and users' understanding of the roles they're to play. The basic storyline represents an early encounter between Bond and the Professor. Bond's objective is to acquire some essential information, which he can obtain by searching the Professor's office, asking the Professor's assistant, or, under certain conditions, deceiving or threatening the Professor himself. The user's actions as the Professor interfere with Bond's plan, altering how the plot unfolds.
IEEE MultiMedia
[Figure 3 diagram: video layers are composited via chroma keying and Z-depth blending, then shown on the projection screen.]
Figure 3. Constructing the mixed reality environment. The mixed reality system blends the user image and the virtual environment image to produce the final image, which it projects on a large screen.
Figure 4. The 3D bounding cylinder determines physical interactions in the Unreal Tournament 2003 engine.
Interactive storytelling
We adapted the interactive storytelling technology used in these experiments from our previous work, described in detail elsewhere.6 Thus, we only briefly overview the approach here, focusing on the aspects most relevant to the system's mixed reality implementation, in particular multimodal user interaction.
Interactive storytelling involves the real-time generation of narrative actions such that the consequences of user intervention result in the interactive storytelling system regenerating the story with a modified environment. Narrative control dictates that user intervention should modify the story's course, but only within the limits of the story's genre. Narrative control generally relies on a baseline plot that defines possible character actions, but imposes no unnecessary constraints on how the actions can be combined to constitute a plot.
Our approach, character-based interactive storytelling,6 centers on the virtual actors' roles. We based the artificial intelligence mechanism supporting character behavior on a planning technology using hierarchical task networks (HTNs), as illustrated in Figure 5.7 These representations describe the character's role as a plan using a hierarchical decomposition of tasks into subtasks. (Formally, HTNs are AND/OR graphs, and we can represent the solution plan as a subgraph of the HTN.)
For instance, we can decompose an information-gathering task into several options for gaining access to that information, such as searching files or getting it from another character. Each of these tasks can be further decomposed—for instance, to get information from another character, the user's character must approach it, establish a relationship with it, convince it to hand over the information, and so on. HTN task decomposition continues until reaching the terminal-action level—that is, the level at which the synthetic character can visually perform actions in the virtual world. The system thus uses an HTN planner to select each character's actions in real time. Our module, implemented within Unreal Tournament 2003, sends an action's failure back to the planner, which produces an alternative solution. This mechanism is essential in interactive storytelling because user intervention often causes a character's planned action to fail, leaving the planner to produce an alternative solution that will lead the story in new directions.
To accommodate the mixed-reality context, we adapted our previous character-based storytelling framework to the new user-interaction paradigm derived from the magic mirror metaphor,8,9 which assumes greater user involvement than other interactive storytelling approaches. This greater involvement calls for more flexible narrative control. In our supporting example, each situation represents a stage in the encounter between Bond and the villain: introduction, negotiation, and separation. In this version, we've defined one HTN for each situation rather than a single HTN encompassing the entire scene. Figure 5 depicts the discussion phase HTN.

[Figure 5 diagram: the top-level task "Get info and escape" decomposes into subtasks such as "Get info" (search room, ask Goldfinger for info, threaten), "Make professor angry," and "Escape from professor," each refined down to terminal actions such as "Go to bookshelf," "Read book," and "Search through files."]

Figure 5. Hierarchical task network for Bond character. The HTN decomposes tasks into subtasks.
We adapted basic interactive storytelling mechanisms to the greater user involvement. HTNs are still based on the Bond role, but they give a more explicit status to user intervention to allow for the user's regular, but unpredictable, interaction, as the HTN representing the conversation between Bond and the Professor illustrates. This HTN incorporates several extensions to the HTN used in our previous work. Some extensions involve a novel use of the representation; others required modifying the underlying planning algorithm that uses the HTN to animate the virtual character. For example, the HTN in the current system uses mixed nodes that have both an AND and an OR, letting us incorporate optional actions while still limiting the representation's complexity (for example, in Figure 5, the HTN makes the threat situation optional). We've also incorporated the possibility of some user intervention in the HTN itself. One required extension allows a character to attempt an action only after the planner tests for the compatible user action (in this case, the Professor giving away the information). This doesn't prevent the user from performing actions other than the one expected, which will impact the character's plan at another level. In other words, this representation departs from a strict character-based approach to incorporate plot representation to accommodate the higher level of user involvement.
Multimodal interaction
The user intervention is a multimodal input consisting of a spoken utterance and an optional body gesture. The system interprets this multimodal unit in context (that is, using knowledge about the plot progression) to determine what kind of response it is to the virtual characters' actions.
Consider the joint recognition of a multimodal speech act comprising an utterance analyzed through speech recognition and a body gesture processed by the Transfiction engine. We categorize the user's attitude shown in Figure 6 as the user sitting with his arms raised and partly extended (other categories include pointing gestures, which can mean showing, giving, and so on). This attitude is compatible with different interpretations, including greetings ("Welcome, Mr. Bond!"), denial ("You must be joking, Mr. Bond"), or challenge ("Shoot me and you'll never know, 007!"). We interpret the correct meaning through joint analysis of the user's utterance and attitude.
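At its simplest, this joint analysis amounts to intersecting the candidate meanings of an ambiguous attitude with the speech act derived from the utterance. The category names and lookup tables below are illustrative assumptions, not the system's actual data structures.

```python
# Sketch of joint interpretation of an ambiguous gesture: intersect
# the attitude's candidate meanings with the utterance's speech act.
# Category names and table contents are illustrative.

GESTURE_MEANINGS = {
    "arms_raised_extended": {"greeting", "denial", "challenge"},
    "pointing": {"showing", "giving"},
}

SPEECH_ACTS = {
    "welcome, mr. bond!": "greeting",
    "you must be joking, mr. bond": "denial",
    "shoot me and you'll never know, 007!": "challenge",
}

def interpret(attitude: str, utterance: str):
    """Return the speech act only if the gesture is compatible with it."""
    act = SPEECH_ACTS.get(utterance.lower().strip())
    candidates = GESTURE_MEANINGS.get(attitude, set())
    return act if act in candidates else None

print(interpret("arms_raised_extended", "You must be joking, Mr. Bond"))
```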
Traditional literature on multimodality focuses on the use of deictic gestures in natural language instructions or dialogue10 and of gestures in nonverbal communication.11 The narrative context of interactive storytelling creates new forms of gesture use, which in turn create new multimodal combinations. The system supports deictic gestures, such as when the user indicates an object or a location in a multimodal utterance ("Take a seat, Mr. Bond").
Another gesture type, physical gestures, comprises on-stage physical interventions, such as grasping an object, slapping a character, or standing in front of an object or character. To implement the effects of physical gestures, we use the main mixed reality mechanism—that is, a single coordinate system that controls interaction through bounding boxes in the virtual environment.
The most important gesture type is semiotic gestures. Semiotic gestures include opening one's arms to welcome someone, raising a hand to attract attention or call someone, raising both arms in wonder or disbelief, and opening arms to indicate ignorance. What distinguishes these gestures from other nonverbal behaviors is that they constitute isolated units associated with a precise communicative function; in particular, a function that can be mapped to a narrative context (unlike beat and other continuous and dynamic gestures).
The multimodal literature contains much controversy over the status of the various modalities in terms of their semantic content. A distinguishing characteristic of the interactive storytelling context is that speech and semiotic gestures can have comparable semantic content.
Speech recognition
Speech is the only practical mode of communication between the user and the virtual actors in an interactive storytelling context, in addition to its being part of the narrative itself. Of course,
[Figure 6 panels: "Welcome, Mr. Bond!" and "Surely, you must be joking, 007!"]

Figure 6. Ambiguous gestures. This attitude, characterized as the user sitting with his arms raised and partly extended, can represent greetings, denial, challenge, and so on. Interpreting the correct meaning requires joint analysis of the user's utterance and attitude.
formidable challenges associated with speech understanding exist. In this context, we aim for a modest level of robustness, although a principled approach would use the specific interactive storytelling context to guide the speech interpretation strategy.
We based the speech recognition component on BabelTech's Ear software developer's kit (SDK), shown in Figure 7. The SDK can be used in various modes, including multikeyword spotting in an open loop. Multikeyword spotting involves dynamically recognizing keywords from a predefined set in any user utterance, regardless of the utterance's other contents. It provides robust recognition of the utterance's most relevant topics in context without imposing constraints on the user (such as the use of a specific phraseology). One necessary step when using multikeyword spotting is to provide a more integrated definition of keywords as meaning units (for example, "be_careful" or "pay_attention").
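Multikeyword spotting with compound meaning units can be sketched as follows; the keyword set and the tokenization are illustrative assumptions, independent of the actual SDK.

```python
# Sketch of open-loop multikeyword spotting with compound keywords
# ("meaning units" such as be_careful). The keyword set is illustrative.

KEYWORDS = {"be_careful", "pay_attention", "information", "bond"}

def spot(utterance: str):
    """Return the keywords found anywhere in the utterance,
    matching compound keywords as word sequences."""
    words = utterance.lower().replace(",", " ").replace("!", " ").split()
    text = " ".join(words)
    hits = set()
    for kw in KEYWORDS:
        phrase = kw.replace("_", " ")
        if f" {phrase} " in f" {text} ":
            hits.add(kw)
    return hits

print(spot("You should be careful, Mr. Bond!"))
```

Only the spotted units reach the interpretation stage; the rest of the utterance is deliberately ignored, which is what keeps recognition open to free phrasing.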
User utterances occurring in this narrative context are a specific type of speech act—that is, an utterance with a specific impact on the hearer's behavior. More importantly, a good mapping exists between speech acts and narrative actions (greetings, threats, requests, denials, and so on) such that they constitute direct input into the narrative representation.
Categorizing user utterances in terms of speech acts—that is, recognizing the relevant speech act from the speech recognition output, be it a set of keywords or a more complex structure—is the key problem. No universally agreed-on method to identify speech acts exists. Practical approaches in speech understanding have sought to either detect surface-form cues (such as the occurrence of "welcome" in greetings or wh-markers in questions) or derive the speech act from the utterance's case structure (that is, identify the action verb and its parameters).
Our implementation uses both approaches in parallel, while extending the shallow approach to cue detection. When the natural-language processing module doesn't recognize an action verb around which to instantiate an action template, it looks for surface cues, which it maps to a coarse-grained semantic category, such as approval/disapproval, confirmation/denial, or friendly/hostile (surface patterns such as "you won't," "you'll never," and so on). The underlying principle is to use coarse-grained semantic categories that the module can directly map to a speech act, which it can in turn assimilate into a narrative function according to its impact on the interactive story. Although the use of coarse-grained categories doesn't let us extract an occurrence's complete meaning, it does let us identify a global meaning in context, which should trigger an appropriate behavior from the virtual actor.
We based the natural language interpretation on a template-matching procedure, because the use of multikeyword spotting precludes more complete forms of syntax-based parsing. In template-filling procedures, the natural-language processing module looks for certain action verbs or substantives. Recognizing one of these words activates a template, which then searches for keywords corresponding to the action parameters. To find a subject or object, for example, the template might look for pronouns or proper names.
To identify a relevant speech act, the natural-language processing module uses the information in the template, any surface cues encountered, and the narrative context—that is, the current stage of the plot. Implementing speech act identification requires a set of production rules. The narrative actions represented in the plot model determine the number of target speech acts, which are thus limited in number, facilitating the mapping process.
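The overall pipeline (template matching, surface-cue fallback, and rule-based speech act identification) can be sketched as follows. The verb lists, cues, and production rules are illustrative assumptions, and the template filling is simplified to collecting parameters.

```python
# Sketch of the interpretation pipeline: an action-verb template is
# filled from spotted keywords; if no verb is found, surface cues map
# to a coarse category; production rules then pick the speech act.
# Verb lists, cues, and rules are illustrative assumptions.

TEMPLATES = {"give": ["agent", "object"], "take": ["agent", "object"]}
CUES = {"you won't": "denial", "you'll never": "denial", "welcome": "greeting"}

def analyze(keywords, utterance):
    # 1. Template matching around an action verb.
    for verb in TEMPLATES:
        if verb in keywords:
            params = [k for k in keywords if k != verb]
            return {"verb": verb, "params": params}
    # 2. Fallback: surface cues -> coarse semantic category.
    low = utterance.lower()
    for cue, category in CUES.items():
        if cue in low:
            return {"category": category}
    return {}

def speech_act(analysis, narrative_context):
    # Production rules combining the analysis with the plot stage.
    if analysis.get("verb") == "give" and narrative_context == "negotiation":
        return "offer"
    if analysis.get("category") == "denial":
        return "refusal"
    return "unknown"

a = analyze({"know"}, "You'll never know, Mr. Bond!")
print(speech_act(a, "negotiation"))
```

Because the plot model fixes a small set of target speech acts, rules of this kind stay few and easy to maintain.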
We've so far associated speech acts with spoken utterances; in practice, they correspond to
Figure 7. BabelTech's Ear software developer's kit environment.
multimodal input, which includes both speech and gesture. Although this doesn't alter the nature of the target speech acts, it requires processing both modalities simultaneously, which in turn supports more robust recognition.
Gesture recognition and multimodal processing
To a large extent, the gesture recognition software follows a philosophy not unlike that of the multikeyword spotting used for the speech component. That is to say, a fixed set of parameters is extracted from the image, which can be mapped to previous characterizations of user attitudes. A set of semiotic gestures constitutes a gesture lexicon, containing as many as 15 body attitudes, some of which are represented in Figure 8.
The gesture collection has a variety of sources: literature, actors in relevant movies, and so on. For each attitude, we collected data from the Transfiction engine and associated it with the gesture in the lexicon. We similarly associate the semiotic interpretation with the gesture. A gesture that is ambiguous out of context receives several possible interpretations (Figure 6).
The representation of each semiotic gesture in the lexicon is a set of descriptive features: distance between arm extremities, hand height, and so on. A set of feature-value pairs describes each gesture. While the Transfiction engine constantly outputs tracking point coordinates, the gesture recognition system derives in real time the values for each gesture feature from the tracking points' coordinates. The system uses each feature's set of values to filter candidate gestures from the gesture repository. Whenever it encounters a satisfactory match, it outputs the candidate semiotic gesture or a set of candidate gestures, such as {welcoming, denial}. The gesture recognition system can then unify this semantic category with those categories produced by the speech recognition component. Figure 9, which represents the temporal histograms corresponding to certain gesture features aligned with the spoken utterance, "You must be joking, Mr. Bond!", illustrates this process.
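The feature-based filtering can be sketched as follows; the feature names, value bands, and lexicon entries are assumptions for illustration rather than the system's actual 15-attitude lexicon.

```python
# Sketch of the gesture lexicon as feature-value pairs and the
# filtering of candidate gestures from features derived in real time.
# Feature names, value bands, and lexicon entries are assumptions.

LEXICON = {
    "welcoming": {"hands_apart": "wide",   "hand_height": "mid"},
    "denial":    {"hands_apart": "wide",   "hand_height": "mid"},
    "calling":   {"hands_apart": "narrow", "hand_height": "high"},
    "pointing":  {"hands_apart": "narrow", "hand_height": "mid"},
}

def quantize(dist, height):
    """Derive symbolic feature values from tracking-point measures."""
    return {
        "hands_apart": "wide" if dist > 0.5 else "narrow",
        "hand_height": "high" if height > 0.7 else "mid",
    }

def candidates(features):
    """Keep every lexicon entry whose feature-value pairs all match."""
    return {name for name, spec in LEXICON.items()
            if all(features.get(f) == v for f, v in spec.items())}

feats = quantize(dist=0.8, height=0.5)   # arms open at chest height
print(candidates(feats))                  # ambiguous: welcoming or denial
```

Note that an out-of-context gesture can legitimately survive filtering as a set, exactly the {welcoming, denial} ambiguity that speech then resolves.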
By processing speech and gestures jointly, the system infers that the open-arms attitude serves a denial narrative function. The system thus interprets the attitude as a negative answer to Bond's question, which corresponds to a failure of the task in the corresponding HTN, shown in Figure 10 (on p. 38), leading to a new course of action.
Conclusion
Mixed reality is a significant departure from other paradigms of user involvement, such as pure spectator (with the ability to influence the story)6 or Holodeck,12 in which an actor is immersed in first-person mode. Although we've yet to explore the practical implications of such involvement, mixed reality interactive storytelling brings new perspectives for user interaction as well, with an emphasis on multimodal interaction.
Human–computer interface research describes the mode of appropriation used in our system (in which the context leads the user to rediscover modes of expression previously described) as habitability. We can therefore conclude that acting creates the condition for multimodal habitability.
Although the system is fully implemented, it remains a proof-of-concept prototype. Thus it's still too early to perform detailed user evaluations. However, grounding future evaluation procedures on the multimodal habitability notion is appropriate. This suggests including an early adaptation stage during which the user gets acquainted with the system interface with minimal guidance apart from the prior presentation of original film footage. The subsequent evaluation should be allowed to measure various forms
[Figure 8 panels: "Welcome, Mr. Bond!"; "Please, take a seat."; "This should help you recover, Mr. Bond."; "Would you come here, Miss Galore?"]
Figure 8. Example body gestures. A gesture lexicon contains as many as 15 body attitudes, such as the four illustrated here.
of departure between the user's expressions and those characteristic of the original role within the story genre. MM
Acknowledgments
Olivier Martin is funded through First Europe Objectif 3 from the Walloon Region in Belgium.
References
1. R. Barthes, "Introduction à l'Analyse Structurale des Récits" (in French), Communications, vol. 8, 1966, pp. 1-27.
2. M. Cavazza, F. Charles, and S.J. Mead, "Character-Based Interactive Storytelling," IEEE Intelligent Systems, vol. 17, no. 4, 2002, pp. 17-24.
3. T. Darrell et al., "A Novel Environment for Situated Vision and Behavior," Proc. IEEE Workshop Visual Behaviors, IEEE CS Press, 1994, pp. 68-72.
4. R.S. Feldman and B. Rime, Fundamentals of Nonverbal Behavior, Cambridge Univ. Press, 1991.
5. X. Marichal and T. Umeda, "Real-Time Segmentation of Video Objects for Mixed-Reality Interactive Applications," Visual Communications and Image Processing 2003, Proc. SPIE, vol. 5150, T. Ebrahimi and T. Sikora, eds., 2003, pp. 41-50.
6. A. Nandi and X. Marichal, "Senses of Spaces through Transfiction," Proc. Int'l Workshop Entertainment Computing (IWEC 2002), Kluwer, 2002, pp. 439-446.
7. D. Nau et al., "SHOP: Simple Hierarchical Ordered Planner," Proc. 16th Int'l Joint Conf. Artificial Intelligence, AAAI Press, 1999, pp. 968-973.
Figure 9. Multimodal detection: natural-language processing and gesture recognition for interpretation of user's acting.

[Figure 9 plots: temporal histograms of the horizontal distance and the vertical distance between both hands, each against a threshold, aligned with the soundtrack of the utterance "You must be joking, Mr. Bond."]
8. S. Oviatt, "Ten Myths of Multimodal Interaction," Comm. ACM, vol. 42, no. 11, Nov. 1999, pp. 74-81.
9. T. Psik et al., "The Invisible Person: Advanced Interaction Using an Embedded Interface," Proc. 7th Int'l Workshop Immersive Projection Technology, 9th Eurographics Workshop Virtual Environments, ACM Press, 2003, pp. 29-37.
10. E. Salvador, A. Cavallaro, and T. Ebrahimi, "Shadow Identification and Classification Using Invariant Color Models," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP 2001), IEEE Press, 2001, pp. 1545-1548.
11. Comm. ACM, special issue on game engines in scientific research, vol. 45, no. 1, Jan. 2002.
12. W. Swartout et al., "Toward the Holodeck: Integrating Graphics, Sound, Character, and Story," Proc. Autonomous Agents 2001 Conf., ACM Press, 2001, pp. 409-416.
Marc Cavazza is a professor of intelligent virtual environments at the University of Teesside, UK. His main research interest is artificial intelligence for virtual humans, with an emphasis on language technologies. He is currently working on interactive storytelling systems and dialogue formalisms for conversational characters. He received his MD and PhD from the University of Paris 7.
Olivier Martin is a research assistant in artificial intelligence at the Université Catholique de Louvain, Belgium. His main research interest is intelligent agent design for interactive applications, with an emphasis on gesture recognition and behaviour planning. He received his MD cum laude in electrical engineering from the Université Catholique de Louvain.
Fred Charles is a senior lecturer in computer games programming at the University of Teesside and is currently completing a PhD in the area of virtual actors in interactive storytelling. He has a BSc in computer science and an MSc in computer graphics from the University of Teesside.
[Figure 10 diagram: two panels of Bond’s HTN, with nodes for Information exchange, Questioning professor, Professor providing information, Presentation, Conversation, Intrusion, Threats, and Termination; tasks are marked Failed or Success.]
Figure 10. Multimodal interaction. (a) Bond questions the Professor. (b) The Professor replies, “You must be joking, Mr Bond!” with a corresponding body gesture. The system interprets the multimodal speech act as a refusal (see Figure 9), causing the corresponding task in Bond’s HTN to fail.
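The mechanism the caption describes can be sketched as follows. This is not part of the published article; it is a minimal illustrative Python sketch, with all names (Task, interpret_speech_act, the node labels) hypothetical, showing how a refusal speech act might fail a primitive task and drive the HTN to an alternative decomposition.

```python
# Minimal sketch: a refusal speech act fails a task in a character's HTN,
# and the failure propagates so an alternative decomposition is tried.
# All identifiers are hypothetical, not the system's actual API.

class Task:
    def __init__(self, name, methods=None):
        self.name = name
        # each method is a list of subtasks; no methods means a primitive task
        self.methods = methods or []

def interpret_speech_act(utterance, gesture):
    """Fuse speech and gesture into a single multimodal speech act (toy rules)."""
    if "must be joking" in utterance.lower() or gesture == "dismissive":
        return "refusal"
    return "acceptance"

def execute(task, speech_act):
    """Depth-first HTN execution: a method succeeds only if all its subtasks
    succeed; questioning fails when the interlocutor refuses to answer."""
    if not task.methods:  # primitive task
        return not (task.name == "question_professor" and speech_act == "refusal")
    # try alternative decompositions in order until one succeeds
    for method in task.methods:
        if all(execute(sub, speech_act) for sub in method):
            return True
    return False  # every decomposition failed

# Bond's (simplified) plan: obtain information by questioning, else by intrusion
question = Task("question_professor")
intrude = Task("intrusion")
info = Task("information_exchange", methods=[[question], [intrude]])

act = interpret_speech_act("You must be joking, Mr Bond!", "dismissive")
print(act)                 # refusal
print(execute(info, act))  # True: questioning fails, intrusion succeeds
```

The point of the sketch is the propagation step: the user's refusal does not merely end the dialogue, it fails one branch of the character's plan, and the planner recovers by selecting the next applicable method.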
Steven J. Mead is a lecturer in the computer games programming department at the University of Teesside and is currently undertaking an MPhil in intervention in autonomous agents’ behaviors in interactive storytelling using natural language. He has a BSc in computer science and an MSc in computer graphics from the University of Teesside.
Xavier Marichal is the co-founder of Alterface, a spin-off integrating real-time interactive applications. His main interests concern motion estimation/compensation and image analysis in the framework of distributed interactive settings. He has a BSc and a PhD in electrical engineering from the Université Catholique de Louvain.
Alok Nandi is the co-founder of Alterface, a spin-off integrating real-time interactive applications. His main interests are storytelling (interactive and not), art direction, design, and visualization. He received a BSc in electrical engineering from the Université libre de Bruxelles and a Licence en Philosophie et Lettres, with a focus on cinematographic analysis and writing.
Readers may contact Cavazza at the University of
Teesside, School of Computing and Mathematics,
Borough Rd., Middlesbrough, TS1 3BA, UK;