
A Framework for the Intelligent Multimodal Presentation of Information

Cyril Rousseau1, Yacine Bellik1, Frédéric Vernier1, Didier Bazalgette2

1LIMSI-CNRS, Université Paris XI, BP 133, 91403 Orsay cedex, France 2DGA/DSP/STTC/DT/SH, 8 Bd Victor, 00303, Armées, France

Abstract

Intelligent multimodal presentation of information aims at using several communication modalities to produce the most relevant outputs for the user. This problem involves different concepts related to information structure, interaction components (modes, modalities, devices) and context. In this paper we study the influence of the interaction context on system outputs. More precisely, we propose a conceptual model for the intelligent multimodal presentation of information. This model, called WWHT, is based on four concepts: “What”, “Which”, “How” and “Then”. These concepts describe the life cycle of a multimodal presentation from its "birth" to its "death", including its evolution. On the basis of this model, we present the ELOQUENCE software platform for the specification, the simulation and the execution of output multimodal systems. Finally, we describe two applications of this framework: the first one concerns the simulation of an incoming call on an intelligent mobile phone and the second one is related to a task of marking out a target on the ground in a fighter plane cockpit.

Keywords: Human-Computer Interaction, output multimodality, multimodal presentation, interaction context, outputs specification, outputs simulation.

1. Introduction

Nowadays, computer systems are more and more diversified. The mobility of some platforms (notebooks, mobile phones, Personal Digital Assistants, etc.) creates new usage habits for information processing systems. It has become common to meet, in all kinds of places (pubs, fast-food restaurants, parks, airports, etc.), people using the latest communication devices such as mobile phones, portable media players and, more recently, phone-PDAs. This situation has become ordinary, yet it embodies some of the most recent research subjects of the Human-Computer Interaction community: mobile and pervasive computing.

The diversity of environments, systems and user profiles leads to a contextualisation of the interaction. Initially, the interaction had to be adapted to a given application and to a specific interaction context. In the near future, the interaction will have to be adapted to different situations and to a context in constant evolution [39]. The contextualisation of interaction techniques increases the complexity of multimodal system design. This requires an adaptation of the design process and, more precisely, the implementation of a new generation of user interface tools. These tools should help the designer and the system to choose the best interaction techniques in a given context.

This paper presents a conceptual model for the multimodal presentation of information and its application through a platform supporting the design of output multimodal systems.

1 Corresponding author. Tel.: +33 (0)1 69 85 81 06; fax: +33 (0)1 69 85 80 88. E-mail addresses: [email protected], [email protected], [email protected], [email protected]


This conceptual model, called WWHT (What-Which-How-Then), describes the design process of a multimodal presentation. More precisely, it introduces the different output multimodality components and presents the different steps of the presentation's life cycle. On the basis of this model, the ELOQUENCE software platform, which allows the specification, the simulation and the execution of multimodal system outputs, has been developed. Two examples illustrate the application of the model and of the platform: the first one concerns the mobile telephony field (simulation of an incoming call) and the second one the military avionics field (task of marking out a target on the ground). The first example illustrates the introduced concepts throughout the paper, and the second example sums them up. Finally, the paper concludes with a discussion of the relations between inputs and outputs in an intelligent multimodal system.

2. WWHT Model

In this section we will explain the process of designing an intelligent presentation. We will start by describing the elements required for the multimodal presentation of information. Then we will present the WWHT conceptual model, which identifies the different steps in the life cycle of a multimodal presentation.

2.1 Required Elements

An output multimodal system cannot be reduced to a system which simply exploits several output modalities. For such a system, we prefer to talk about a “multimedia system” [20]. An output multimodal system aims at presenting information in an “intelligent” way by exploiting different communication modalities. Some authors use the term MIIP (Multimodal Intelligent Information Presentation) to refer to this concept. Depending on the desired multimodal system, this notion of “intelligence” may vary. However, all existing systems share the same goal: the information presentation must be as suitable as possible to the interaction context.

This process of intelligent information presentation is based on four elements:

• information to present,
• interaction components,
• interaction context,
• behaviour.

2.1.1 Information

With reference to the ARCH model [1], the output module presents semantic information to the user. More precisely, this information is generally created by the functional core, forwarded by the dialog controller and presented by the output module. For example, the output module of a mobile phone may present the following semantic information: “call of X”, “message of X“, “low battery level”, etc.

2.1.2 Interaction Components

An interaction component is a (physical or logical) communication means between the user and the application. Three types of interaction components can be distinguished: mode, modality and medium. Depending on the authors, these notions may have different definitions [12, 14, 32].

We refer to our user-oriented definitions [11]. Output modes correspond to human sensory systems (visual, auditory, tactile, etc.). An output modality is defined by the information structure as it is perceived by the user (text, image, vibration, etc.) and not as it is represented internally by the machine. For instance, if a text is scanned then it may be represented internally as an image, but the modality perceived by the user is still text and not image. Finally, an output medium is an output device allowing the expression of an output modality (screen, speaker, vibrator, etc.). Output media are kept independent of the interactive system to achieve better modularity.


We can notice that some relations exist between these three notions. A mode can be associated with a set of modalities and each modality can be associated with a set of media. For example, the “vibrator” medium allows the expression of the “vibration” modality, which is perceived through the “tactile” mode. Two types of relations between the interaction components can be distinguished: “primary” and “secondary”. A primary relation refers to a wanted effect whereas a secondary relation is a side effect. For instance, the vibration of a mobile phone is intended to be perceived by the user in a tactile way. This implies a primary relation between the “tactile” mode and the “vibration” modality. But the sound generated by the vibrations is an example of side effect, so a secondary relation between the “auditory” mode and the “vibration” modality can be added. All these relations define a diagram of the interaction components managed by the output system. Figure 1 presents the interaction components diagram managed by a mobile phone.
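As a purely illustrative sketch (the internal representation used by ELOQUENCE is not described here), such a components diagram could be captured by a few plain data structures; the names below come from the mobile phone example and the helper function is hypothetical.

    # Illustrative sketch only: one possible encoding of the interaction
    # components diagram of Figure 1 (modes, modalities, media and their
    # primary/secondary relations). Names follow the mobile phone example.

    MODES = {"visual", "tactile", "auditory"}
    MODALITIES = {"photography", "text", "vibration", "ringing", "synthetic voice"}
    MEDIA = {"screen", "vibrator", "speaker"}

    # (mode, modality) relations; "primary" is a wanted effect, "secondary" a side effect.
    MODE_MODALITY = {
        ("visual", "photography"): "primary",
        ("visual", "text"): "primary",
        ("tactile", "vibration"): "primary",
        ("auditory", "vibration"): "secondary",   # sound generated by the vibrations
        ("auditory", "ringing"): "primary",
        ("auditory", "synthetic voice"): "primary",
    }

    # (modality, medium) relations: which device can express which modality.
    MODALITY_MEDIUM = {
        ("photography", "screen"): "primary",
        ("text", "screen"): "primary",
        ("vibration", "vibrator"): "primary",
        ("ringing", "speaker"): "primary",
        ("synthetic voice", "speaker"): "primary",
    }

    def media_for_mode(mode):
        """Media reachable from a mode through its modalities (any relation type)."""
        modalities = {mod for (mo, mod) in MODE_MODALITY if mo == mode}
        return {med for (mod, med) in MODALITY_MEDIUM if mod in modalities}

    print(media_for_mode("auditory"))   # {'vibrator', 'speaker'}, via the secondary relation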

Fig. 1. Interaction components diagram (modes Visual, Tactile and Auditory; modalities Photography, Text, Vibration, Ringing and Synthetic voice; media Screen, Vibrator and Speaker, linked by primary and secondary relations).

Table 1. Interaction context.

Criteria | Values | Model
Deaf person | Yes, No | User
Visually impaired person | Yes, No | User
Phone mode | Increased, Normal, Silent | System
Screen availability | Available, Unavailable | System
Speaker availability | Available, Unavailable | System
Vibrator availability | Available, Unavailable | System
Audio channel availability | Free, Occupied | System
Battery level | 0-100 | System
Noise level | 0-130 | Environment

2.1.3 Interaction Context

We refer to Anind Dey's definition of the concept of interaction context [24]: “Context is any information that can be used to characterize the situation of an entity. An entity is a person or object that is considered relevant to the interaction between a user and an application, including the user and application themselves”.

We used a model-based approach [6] to specify these entities. A model (user model, system model, etc.) formalizes an entity through a set of dynamic or static criteria (user preferences, media availabilities, etc.). Depending on the studied system, the models used may differ. An element of the context can be relevant for one application field and useless for another. For example, Table 1 describes an interaction context model for a mobile phone application.
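To make this model-based context concrete, the criteria of Table 1 could for instance be grouped by model in plain dictionaries, as in the following illustrative sketch (this is not the MOXML representation actually used by the platform).

    # Illustrative sketch of a model-based interaction context (Table 1).
    # Each model (user, system, environment) groups a set of criteria.
    interaction_context = {
        "user": {"deaf person": "no", "visually impaired person": "no"},
        "system": {
            "phone mode": "normal",            # increased / normal / silent
            "screen availability": "available",
            "speaker availability": "available",
            "vibrator availability": "available",
            "audio channel availability": "free",
            "battery level": 80,               # 0-100
        },
        "environment": {"noise level": 45},    # 0-130 dB
    }

    # A context evolution is simply an update of one criterion:
    interaction_context["environment"]["noise level"] = 95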

2.1.4 Behaviour

The expression of information requires a multimodal presentation suited to the current interaction context. This presentation is composed of a set of output (modality, medium) pairs linked by redundancy or complementarity properties [21]. For example, an incoming call on a mobile phone may be expressed through a multimodal presentation composed of two pairs. A first pair (“ringing modality”, “speaker medium”) indicates a phone call while a second pair (“text modality”, “screen medium”) presents the caller's identity.


The behavioural model is probably the most critical part when designing a multimodal presentation. It identifies the best interaction components (modes, modalities and media) adapted to the current state of the interaction context. It can be formalized in different ways: rules [38], automata [26], Petri nets [8], etc.

2.1.5 How to obtain these elements?

An analysis process must be applied to obtain the required elements. First, it is necessary to collect a data corpus. This corpus must be composed of scenarios / storyboards (referring to nominal or degraded situations) but also of relevant knowledge about the application field, the system, the environment, etc. The corpus must be collected rigorously and should produce a substantial and diversified set of data. It provides the elementary elements needed to build the output system core (the behavioural model). The quality of the system outputs will highly depend on the diversity of the corpus.

The participation and collaboration of three actors is required: ergonomists, designers and end users (experts in the application field). Designers and users are mainly involved in the extraction of the elements while ergonomists are mainly involved in the interpretation of the extracted elements. The participation of all these actors is not an essential condition. However, the absence of one of them will probably lower the quality of the output specification.

Figure 2 presents the different steps to extract the required elements. The first step identifies pertinent data which can influence the output interaction (interaction context modelling). These data are interpreted to constitute context criteria and classified by models. The next step specifies the interaction components diagram. Media are often defined in technical documentation, and from the media it is relatively easy to identify the output modes and modalities. The third step identifies the semantic information which should be presented by the system. For better performance of the final system, it is recommended to decompose this information into elementary semantic parts. Finally, these extracted elements allow the definition of the behavioural model.

Fig. 2. Extraction of the required elements.

2.2 Model Overview

From these four elements, a conceptual model describing the life cycle of an adapted (intelligent) multimodal presentation has been deduced. This model, called WWHT, is based on four concepts: “What”, “Which”, “How” and “Then”:

• What is the information to present?
• Which modality(ies) should we use to present this information?
• How to present the information using this(ese) modality(ies)?
• and Then, how to handle the evolution of the resulting presentation?


The first three concepts concern the initial presentation design [14] while the last one relates to the presentation evolution [18]. Figure 3 and the next sections present the initial design process. The presentation evolution will then be discussed in section 2.6.

Fig. 3. Design process of an adapted multimodal presentation.

2.3 What Information to Present?

First, it is necessary to decompose the semantic information (Figure 3, IUi) into elementary information units (Figure 3, EIUi). Let us continue with the example of an incoming call on a mobile phone. The semantic information is “phone call of X”. This information can be decomposed into two elementary information units: the event (a phone call) and the caller identity.
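Since this decomposition is specified by the designer (as discussed below), it can be viewed, in a purely illustrative sketch, as a static mapping from each information unit to its elementary information units; the entries are taken from the mobile phone example.

    # Illustrative sketch: designer-specified semantic fission of information
    # units (IU) into elementary information units (EIU), mobile phone example.
    SEMANTIC_FISSION = {
        "phone call of X": ["call event", "caller identity"],
        "message of X": ["message event", "sender identity"],
        "low battery level": ["low battery warning"],
    }

    def fission(information_unit):
        """Return the elementary information units of a semantic information unit."""
        return SEMANTIC_FISSION[information_unit]

    print(fission("phone call of X"))  # ['call event', 'caller identity']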

Fig. 4. Multimodal Fusion / Fission.

Some authors use the word “fission”, as the opposite of the word “fusion” (Figure 4), to name the process of output modality selection. We think this term is not relevant for this use. Indeed there is a fission process, but this fission takes place at the semantic level. So we prefer to talk about “semantic fission” for the decomposition of the semantic information into elementary information, and about “allocation” for the output modality selection (Figure 5).

[Figure 3 shows the three initial steps applied to an information unit IUx: WHAT (semantic fission of IUx into EIU1 … EIUn), WHICH (election of (Modi, Medi) pairs linked by complementarity/redundancy, according to the context state Ci), and HOW (instantiation: choice of contents and attribute values), producing the multimodal presentation MPi. Legend: IUi: Information Unit; EIUi: Elementary Information Unit; Ci: state of the interaction Context; Modi: output Modality; Medi: output Medium (device); CR: Complementarity / Redundancy property; MPi: Multimodal Presentation.]

[Figure 4 contrasts the fusion of input modalities with the fission towards output modalities, on either side of the Dialogue Controller (DC) and the Functional Core (FC). Legend: Modi: input / output Modality; DC: Dialogue Controller; FC: Functional Core.]

Fig. 5. Semantic Fission.

In our platform, the semantic fission is made by the designer. When he specifies the semantic information, he must also specify the associated decomposition into elementary information. This decomposition depends on the designer's expertise. The automation of the semantic fission is an interesting research problem but, at the moment, it remains for us a research perspective.

2.4 Which Modality(ies) to Use?

Once the information is decomposed, it is necessary to choose the most suitable presentation form (Figure 3, MPi) to express it: this problem is called the allocation process [28]. The allocation of a multimodal presentation consists in selecting adapted output modalities. This selection process, driven by the interaction context, is based on the behavioural model. First, for each elementary information unit, a multimodal presentation adapted to the current state of the interaction context (Figure 3, Ci) is selected. Then, the selected presentations are merged into a single presentation expressing the initial information.

Our approach uses a behavioural model formalized as a base of election rules. This formalism has the advantage of offering simple reasoning (If … Then … instructions), which limits the learning cost. However, this choice raises problems regarding the scalability (ability to evolve), the coherence and the completeness of a rule-based system. A graphical rule editor (introduced in sub-section 3.1) has been implemented to help the designer in the design and modification of the rule base. Mechanisms for checking the structural coherence (two rules with equivalent premises must have coherent conclusions) are also provided, but the designer remains responsible for the completeness of the rule base.

Three types of rules are distinguished: contextual, composition and property rules. The premises of a contextual rule describe a state of the interaction context. The conclusions define contextual weights underlining the interest of the targeted interaction components (according to the context state described in the rule premises). The composition rules [42] allow the composition of modalities and thus the design of multimodal presentations with several (modality, medium) pairs, based on redundancy and/or complementarity criteria [20]. Lastly, the property rules select a set of modalities using a global modality property (linguistic, analogical [12], confidential, etc.).

Table 2 presents seven rules used to allocate the “phone call of X” information: five of contextual type, one of composition type (R2) and one of property type (R7). In a nominal situation, only rules R6 and R7 are applied to present an incoming call. The call is then presented through a multimodal presentation composed of two pairs: (Ringing, Speaker) to indicate the phone call event (first EIU) and (Photography, Screen) to present the caller (second EIU). In a different interaction context, such as a low battery level, rule R4 changes the form of the last presentation (it stops the use of the photography modality) by choosing the Text modality to present the caller.


Table 2. Seven rules of the behavioural model.

Id | Name | Description in natural language
R1 | Visually impaired person | If user is a visually impaired person Then do not use Visual mode
R2 | Increased mode | If mobile phone is in increased mode Then use Redundancy property
R3 | Speaker unavailability | If speaker is unavailable or audio channel is already in use Then do not use Speaker medium
R4 | Low battery level | If current IU is a call reception and battery level is low Then do not use Photography modality and do not use Vibrator medium
R5 | Too noisy | If noise level is superior to 80 dB or mobile phone is in silent mode Then Auditory mode is unsuitable
R6 | Call event | If current EIU is an incoming call event Then Ringing modality is suitable
R7 | Caller identity | If current EIU is a caller identity Then try to express it with Analogical modalities
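The rules of Table 2 are edited graphically and stored in MOXML (see section 3.1). Purely as an illustration, contextual rules such as R5 and R6 could be encoded as functions returning weight adjustments on interaction components; the weight values and field names below are hypothetical, not those of the platform.

    # Illustrative sketch only: contextual rules represented as functions whose
    # premises test the interaction context and the current EIU, and whose
    # conclusions adjust the interest (weight) of interaction components.

    UNSUITABLE = -100   # hypothetical penalty meaning "do not use / unsuitable"
    SUITABLE = 10       # hypothetical bonus meaning "suitable"

    def r5_too_noisy(context, eiu):
        """R5: noise level above 80 dB or silent phone mode => Auditory mode unsuitable."""
        noisy = context["environment"]["noise level"] > 80
        silent = context["system"]["phone mode"] == "silent"
        return [("mode", "auditory", UNSUITABLE)] if (noisy or silent) else []

    def r6_call_event(context, eiu):
        """R6: the current EIU is an incoming call event => Ringing modality suitable."""
        return [("modality", "ringing", SUITABLE)] if eiu == "call event" else []

    RULES = [r5_too_noisy, r6_call_event]

    def apply_rules(context, eiu):
        """Collect the weight adjustments produced by the whole rule base."""
        conclusions = []
        for rule in RULES:
            conclusions.extend(rule(context, eiu))
        return conclusions

    ctx = {"environment": {"noise level": 95}, "system": {"phone mode": "normal"}}
    print(apply_rules(ctx, "call event"))
    # [('mode', 'auditory', -100), ('modality', 'ringing', 10)]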

By analogy with the political world, we call the allocation process “election”. Our election process uses a rule base (the voters) which adds or removes points (votes) for certain modes, modalities or media (the candidates), according to the current state of the interaction context (political situation, economic situation, etc.).

Fig. 6. Election process.

The application of the contextual and property rules defines the “pure” election while the application of the composition rules defines the “compound” election (Figure 6). The pure election elects the best modality-medium pair while the compound election enriches the presentation by selecting new pairs redundant or complementary to the first one.
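A minimal sketch of this election, assuming the rule conclusions are available as weight adjustments (as in the previous sketch): the pure election scores the candidate (mode, modality, medium) triples and keeps the best one, and the compound step may add a redundant pair. The candidate list, the scoring scheme and the rejection threshold are simplifying assumptions, not the actual ELOQUENCE algorithm.

    # Illustrative sketch of the pure/compound election on weighted candidates.
    CANDIDATES = [  # (mode, modality, medium) triples taken from the components diagram
        ("auditory", "ringing", "speaker"),
        ("visual", "photography", "screen"),
        ("visual", "text", "screen"),
        ("tactile", "vibration", "vibrator"),
    ]

    def score(candidate, adjustments):
        """Sum the weights that apply to the candidate's mode, modality or medium."""
        mode, modality, medium = candidate
        return sum(w for level, name, w in adjustments
                   if (level, name) in (("mode", mode), ("modality", modality), ("medium", medium)))

    def pure_election(adjustments):
        """Keep the best candidate, unless even the best one is unsuitable."""
        best = max(CANDIDATES, key=lambda c: score(c, adjustments))
        return best if score(best, adjustments) > -100 else None

    def compound_election(best, adjustments, redundancy=False):
        """If a composition rule asked for redundancy, add the next best candidate."""
        if best is None or not redundancy:
            return [best] if best else []
        others = [c for c in CANDIDATES if c != best]
        return [best, max(others, key=lambda c: score(c, adjustments))]

    # Example: a visually impaired user (R1) penalizes the visual mode.
    adjustments = [("mode", "visual", -100), ("modality", "ringing", 10)]
    best = pure_election(adjustments)
    print(compound_election(best, adjustments, redundancy=True))
    # [('auditory', 'ringing', 'speaker'), ('tactile', 'vibration', 'vibrator')]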

2.5 How to Present the Information?

It is now necessary to instantiate the allocated multimodal presentation. The instantiation process [3] consists in selecting the lexico-syntactical content (Figure 3, Content) and the morphological attributes (Figure 3, Attributes) of the presentation. First, the concrete contents expressed through the presentation modalities are chosen [30]. From these contents, the presentation attributes (modality attributes [33], spatial and temporal parameters [23, 25], etc.) are fixed.


Let us return to the example of an incoming call on a mobile phone. We suppose that the presentation is composed of two (modality, medium) pairs. A first pair (ringing modality, speaker medium) indicates the phone call event (first EIU) and the second pair (photography modality, screen medium) presents the caller (second EIU). An example of presentation content may consist in using the Pink Panther theme for the ringing modality and a portrait of the caller for the photography modality. For the presentation attributes, we may use an average volume for the ringing modality and a full screen resolution for the photography modality. We can also decide to maintain this presentation for 15 seconds.

The generation of a lexico-syntactical content from a semantic content could be automatic, but this research subject is beyond the scope of our work. Automatic content generation is a separate problem for each modality and represents a whole research field in itself, such as natural language generation [4] or gesture synthesis [17]. However, the automatic generation of the presentation attributes is part of our research problems.

Figure 7 presents the selection process of the modality content used in our platform. For each modality, the set of all possible contents is defined during the output specification. The elementary information units to express, the elected medium as well as the current state of the interaction context reduce and rank the different possibilities, allowing the selection of the most suitable content for the modality.

Fig. 7. Selection of the content modality.

The number of instantiations actually used for a given modality is often small compared with all possible instantiations. So, for each modality attribute, it is better to specify a set of attribute instantiation models instead of a set of all possible instantiations. Then, it remains to select the best attribute instantiation model according to the chosen content, the elementary information units to express, the elected medium and the current state of the interaction context. These mechanisms are based on a classification system. A formalization similar to the one used for the modality election process could also be applied here.
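A sketch of this How step under the same simplifying assumptions: candidate contents are filtered by the EIU and by a condition on the context, and an attribute instantiation model is then picked for the surviving content. All data and predicates below are hypothetical examples for the ringing modality; a None result anticipates the instantiation incoherence discussed in section 2.7.

    # Illustrative sketch of the instantiation step: filter the possible contents
    # of a modality by EIU and context, then pick an attribute instantiation model.

    RINGING_CONTENTS = [
        # (content, set of EIUs it can express, condition on the context)
        ("pink_panther.midi", {"call event"}, lambda ctx: True),
        ("personal_ring_of_caller.midi", {"caller identity"},
         lambda ctx: ctx.get("caller has personal ringing", False)),
    ]

    RINGING_ATTRIBUTE_MODELS = [
        # (name, attribute values, condition on the context)
        ("quiet", {"volume": 2}, lambda ctx: ctx.get("noise level", 0) < 60),
        ("loud", {"volume": 5}, lambda ctx: ctx.get("noise level", 0) >= 60),
    ]

    def instantiate(eiu, context):
        contents = [c for c, eius, ok in RINGING_CONTENTS if eiu in eius and ok(context)]
        if not contents:
            return None        # instantiation incoherence: a re-election is needed
        attrs = next(v for _, v, ok in RINGING_ATTRIBUTE_MODELS if ok(context))
        return contents[0], attrs

    print(instantiate("call event", {"noise level": 75}))
    # ('pink_panther.midi', {'volume': 5})
    print(instantiate("caller identity", {"noise level": 30}))
    # None -> no suitable ringing content for the caller identity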

2.6 And Then? (Presentation Evolution)

We have just seen how to present information and, more precisely, how to adapt a multimodal presentation to a given state of the interaction context. However, the interaction context may evolve while the presentation is running [37, 41]. This raises a problem of presentation validity. The presentation is adapted at the moment of its design, but this may no longer be the case after a context evolution.

This problem mainly affects persistent presentations because a context evolution is rather improbable (but not impossible) in the case of punctual presentations. Furthermore, only certain context evolutions require a check of the presentation validity. Only the context elements which may influence the presentation need to be monitored. These elements can be deduced from the premises of the contextual rules which led to the design of the current presentation. A context evolution on one of the criteria which appears in these premises may change the application of the behavioural model. The presentation may then no longer be valid and consequently needs to be updated.


It is important to notice that the invalidation does not always concern the whole presentation but only some elements of it. In this case a partial re-election can update the presentation according to the new context at a lower cost. This solution limits the number of new elections and thus improves the processing time. However the coherence of the updated presentation must be re-verified. In some cases, this check may lead to a total re-election of the presentation.

As we have seen, a multimodal presentation must be adapted to the interaction context throughout its whole life cycle. This constraint requires the implementation of new mechanisms to manage persistent presentations.

Fig. 8. Evolution module.

In our platform, we used a first approach which we call “centralized”. This approach is based on an evolution module for the presentations (Figure 8) [34]. Any new presentation is registered with this module. This registration triggers the supervision of the context criteria guaranteeing the presentation validity. The evolution of one of these criteria implies an invalidation of the presentation and the application of the update mechanisms. Lastly, the end of the presentation leads to its unregistration from the evolution module.
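A minimal sketch of such a centralized evolution module, assuming that each presentation exposes the criteria guaranteeing its validity and an update callback (both are assumptions made for the example, not the platform's API):

    # Illustrative sketch of the centralized evolution module: presentations are
    # registered with the context criteria that guarantee their validity; when one
    # of these criteria changes, the presentation is invalidated and updated.

    class EvolutionModule:
        def __init__(self):
            self._watch = {}            # criterion -> set of registered presentations

        def register(self, presentation, criteria):
            for criterion in criteria:  # start supervising each relevant criterion
                self._watch.setdefault(criterion, set()).add(presentation)

        def unregister(self, presentation):
            for watchers in self._watch.values():
                watchers.discard(presentation)

        def criterion_changed(self, criterion, new_value):
            # Invalidate and update every presentation depending on this criterion.
            for presentation in list(self._watch.get(criterion, ())):
                presentation.update(criterion, new_value)

    class Presentation:
        def __init__(self, name):
            self.name = name
        def update(self, criterion, value):
            print(f"{self.name}: re-election triggered by {criterion} = {value}")

    module = EvolutionModule()
    call = Presentation("incoming call")
    module.register(call, ["battery level", "noise level"])
    module.criterion_changed("noise level", 95)   # triggers the update of the call
    module.unregister(call)                       # end of the presentation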

We also studied a “distributed” approach. In this approach, the information expression does not produce a multimodal presentation but a presentation model (Figure 9). This presentation model is based on an augmented transition network (ATN) with two levels (global and local). A global state represents a multimodal presentation (Figure 9, MPi) whereas a local state corresponds to an instantiation (Figure 9, Ii) of this presentation.

Fig. 9. Presentation model of an information unit.

[Figure 8 shows the evolution module supervising a set of multimodal presentations (MP1 … MP5): registration starts the supervision of the relevant context criteria, a change of a supervised criterion X invalidates the corresponding presentation and triggers its update, and unregistration stops the supervision.]

[Figure 9 shows a two-level presentation model: global states (MP1, MP2, MP3) are alternative multimodal presentations built from (modality, medium) pairs linked by complementarity/redundancy, and local states (I1 … I4) are their instantiations; transitions are labelled by evolution factors. Legend: Ci: state of the interaction Context; Ai: user Action; T: run Time constraint; MPi: Multimodal Presentation; C, R: Complementarity / Redundancy; Modi: output Modality; Medi: output Medium; Ii: modality Instantiation.]

A global state (for example MP1) is composed of an augmented transition network describing the different possible instantiations (local states: I1, I2, I3, I4) of the presentation. The global state MP1 represents the initial presentation and the local state I1 the initial instantiation. The other global states (Figure 9, MP2 and MP3) represent the presentation alternatives.

The transitions (global and local) are labelled by an evolution factor. An evolution factor is any element which requires an update of the presentation. It can be: a context evolution, a user action on the presentation or a time constraint. Two types of evolution have been defined.

The first evolution type, called “refinement”, changes the presentation instantiation (local state change). For instance, increasing the vibration level or the ringing volume is a refinement of the presentation. The second evolution type, called “mutation”, changes the presentation modalities and/or media (global state change). For example, the evolution from the vibration modality to the ringing modality is a mutation of the presentation. These two examples can then be used to progressively strengthen the presentation of an incoming call event.
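A sketch of this two-level presentation model, with evolution factors reduced to simple string labels: global transitions (mutations) change the (modality, medium) pairs, while local transitions (refinements) only change the instantiation. The states follow the incoming call example; the transition tables themselves are hypothetical.

    # Illustrative sketch of the two-level presentation model (Figure 9):
    # global states are alternative multimodal presentations, local states are
    # instantiations. Transitions are labelled by evolution factors.

    PRESENTATION_MODEL = {
        "MP1": {   # initial presentation: vibration on the vibrator
            "local": {"I1": {"vibration level": 1}, "I2": {"vibration level": 3}},
            "refinements": {("I1", "no reaction after 5 s"): "I2"},
        },
        "MP2": {   # alternative presentation: ringing on the speaker
            "local": {"I1": {"volume": 2}, "I2": {"volume": 5}},
            "refinements": {("I1", "no reaction after 5 s"): "I2"},
        },
    }
    MUTATIONS = {("MP1", "phone mode = normal"): "MP2"}   # vibration -> ringing

    def evolve(global_state, local_state, factor):
        """Apply one evolution factor; a mutation takes priority over a refinement."""
        if (global_state, factor) in MUTATIONS:
            return MUTATIONS[(global_state, factor)], "I1"          # mutation
        target = PRESENTATION_MODEL[global_state]["refinements"].get((local_state, factor))
        return (global_state, target) if target else (global_state, local_state)

    state = ("MP1", "I1")
    state = evolve(*state, "no reaction after 5 s")   # refinement -> ('MP1', 'I2')
    state = evolve(*state, "phone mode = normal")     # mutation   -> ('MP2', 'I1')
    print(state)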

With this approach, a presentation does not know the global state of the interaction, more precisely the running state of the other presentations. This can be the source of incoherence problems between persistent presentations. Mechanisms that take into account the running state of the other presentations should be defined to guarantee the interaction coherence.

The “distributed” approach has the advantage of making the information presentations independent. However, this independence has an important cost. Designing a presentation model expressing the current information requires designing all the possible presentations depending on the interaction context state. The complexity of the context modelling directly affects the size of the presentation model.

2.7 Presentation Coherence

Mechanisms allowing the presentation coherence to be checked are associated with the model. Figure 10 presents the different types of incoherence that can be raised during the application of the model.

Fig. 10. Incoherencies in the design process.

A first incoherence, raised during the election process, underlines a structural problem in the resulting multimodal presentation (Which step). Some modalities can express several elementary information units within the same presentation, but this is not the case for all modalities. A compromise must then be found to guarantee the presentation coherence. In the best case, this compromise triggers a re-election of the responsible elementary information units. At worst, the semantic fission decomposition (What step) has to be corrected.

A second incoherence, raised during the How step, points out a problem in the presentation instantiation: the selection of the content or of the instantiation model for a given modality could not be completed for lack of choices. The chosen content is then changed and, at worst, the structure of the elected presentation is revised.

Let us take the example of our mobile phone in the case of a visually impaired user. Let us suppose that the election process results in the ringing modality for both the “call” and “caller” elementary information units. If there is no personalized ringing associated with the caller, then the presentation cannot be instantiated for lack of a suitable ringing content. This lack raises an instantiation incoherence which requires a re-election of the presentation.
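The overall Which/How control flow can be summarized by the following sketch: elect, try to instantiate, and fall back to a re-election when an instantiation incoherence is raised (here the fallback simply excludes the failing modality, which is only one possible policy). The election and instantiation functions are placeholders.

    # Illustrative sketch of the Which/How fallback loop: if the elected
    # presentation cannot be instantiated, an incoherence is raised and a
    # re-election is attempted without the failing modality.

    def elect(eiu, context, excluded=()):
        # Placeholder election: prefer ringing unless it is excluded.
        for modality, medium in [("ringing", "speaker"), ("synthetic voice", "speaker")]:
            if modality not in excluded:
                return modality, medium
        return None

    def instantiate(modality, eiu, context):
        # Placeholder instantiation: no personalized ringing for a caller identity.
        if modality == "ringing" and eiu == "caller identity":
            raise ValueError("instantiation incoherence: no suitable ringing content")
        return f"{modality} content for {eiu}"

    def present(eiu, context):
        excluded = []
        while True:
            elected = elect(eiu, context, excluded)
            if elected is None:
                raise RuntimeError("information not eligible")   # What step must be revised
            modality, medium = elected
            try:
                return modality, medium, instantiate(modality, eiu, context)
            except ValueError:
                excluded.append(modality)    # re-election without the failing modality

    print(present("caller identity", {}))
    # ('synthetic voice', 'speaker', 'synthetic voice content for caller identity')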


2.8 Related Works

In the previous sections we described the WWHT conceptual model and its associated concepts. We now position these concepts with respect to related work.

The SRM (Standard Reference Model) model [14] is one of the first conceptual models for the intelligent presentation of information. It proposes a set of five layers to design a multimodal presentation:

• control layer,
• content layer,
• design layer,
• realization layer,
• presentation display layer.

The first layer is about the design goals. In our model, the control layer is managed by the actors involved in the analysis process. The content layer allocates the presentation (Which step) and selects the presentation content (How step). The design layer sets the presentation attributes (How step). The last two layers define the rendering process. Since the rendering process is more relevant to multimedia systems than to intelligent multimodal systems, it has not been included in the WWHT model. The semantic fission (What step) and the presentation evolution (Then step) are not managed by the SRM model.

Stephanidis et al. [38] extend the SRM model by introducing mechanisms in the allocation process to adapt a multimodal presentation to the interaction context. These mechanisms exploit a behavioural model formalized with “adaptation rules”. This rule model [28] is based on design goals. The suitability of the interaction components with respect to the design goals represents the interest of each component (rule conclusions). Design goals capture the high-level motivations of design choices, but the translation of these high-level motivations into computable rules is not simple.

The same authors decompose the adaptation process into four questions [27]:

• What to adapt,
• When to adapt,
• Why to adapt,
• How to adapt.

These questions must not be confused with the WWHT model steps (What, Which, How, Then). The When (adaptation moment) and Why (adaptation goals) questions are not apparent in our model but represent preliminary steps of our analysis process. Concerning the What (adaptation subjects) question, there is no specific equivalent in our model. Finally, the How (adaptation reaction) question corresponds to our Which step. The instantiation process of the presentation (How step) is not managed by this decomposition.

The presentation evolution is a recent research subject with many works in progress. The plasticity [7, 40] of the user interface is an example of new concepts including an evolution model of the interaction. Our distributed approach embeds and distributes this evolution model within the presentation itself to guarantee its independence. Concerning the evolution factors, such as user actions [22], context evolutions [18] as well as temporal and spatial constraints [23], we propose to take all these evolution types into account through the same formalization.

Finally, numerous approaches have also been proposed to compose the selected interaction components. In some approaches [28], the composition property of an adaptation rule is implicit (not described in the rule conclusion). In contrast, other approaches [2] propose an explicit specification. This specification consists in defining the composition subjects (interaction components) and the composition type (complementarity or redundancy). Our composition mechanism is a compromise between these two approaches. In a composition rule, the conclusion specifies the wanted composition type and the composition subjects are deduced from the contextual interest of the interaction components.


3. ELOQUENCE Platform

The WWHT model led us to develop a software platform for the design and development of output multimodal systems. This platform, called ELOQUENCE, is composed of two tools allowing respectively the output specification and the output simulation of a multimodal system, and of a runtime kernel allowing the execution of the final system.

3.1 Specification Tool

The specification tool [36] aims at easing the output specification process. The analysis process (introduced in sub-section 2.1.5) extracts and identifies the elements required by the WWHT model. Our specification tool proposes four different editors (component editor, context editor, information editor and behaviour editor), allowing the specification of each required element.

Fig. 11. Edition of the interaction components managed by a mobile phone.

Figure 11 presents the interaction components editor. This editor is composed of two parts. The first part allows the diagram specification while the second part edits the properties (name, type, attributes, criteria and comments) of the selected component. Figure 11 shows more precisely the edition of the interaction components managed by a mobile phone application. For example, the specification of a mobile phone screen results in the definition of four attributes (consumption, horizontal and vertical number of pixels and number of colours) and two criteria (confidentiality and visual isolation).


In our platform, the specification of an election rule is based on a graph describing the rule premises and on a table presenting the rule conclusions. Figure 12 presents a screen capture of the behavioural model editor.

Fig. 12. Edition of an election rule.

Contrary to existing specification tools [5], our specification tool allows the output specification to be reused from one application to another. Indeed, the resulting specification is saved in a proprietary language for future use. This language, called MOXML (Multimodal Output eXtended Markup Language), describes all the specification elements. At the present time, the definition of an output specification is not covered by the W3C's Extensible MultiModal Annotation markup language (EMMA) [19]. Existing user interface description languages such as XISL (eXtensible Interaction Scenario Language) [29], DISL (Dialog and Interface Specification Language) [13], or UsiXML (USer Interface eXtensible Markup Language) [31] do not cover all of our needs (unlimited interaction context, modality attributes, election rules, etc.). So we defined our own XML-based data representation language with a set of tags describing all the elements needed in an output multimodal system.

Figure 13 presents the MOXML description of an election rule. This rule tries to decrease the mobile phone's electric consumption when presenting an incoming call in the case of a low battery level (rule premises). The proposed restrictions (rule conclusions) concern the vibrator medium and the photography modality, which both consume a lot of energy. A graphical specification of this rule is presented in Figure 12.


Fig. 13. MOXML description of an election rule.

3.2 Simulation Tool

Fig. 14. Simulation of an incoming call presentation on a mobile phone.


MOXML source of the election rule shown in Fig. 13:

    <rule name="Energy saving" number="10">
      <premises>
        <premise number="1">
          <elt_left model="system" criterion="battery level" />
          <comparator value="&lt;" />
          <elt_right type="int" value="15" />
        </premise>
        <op_logical name="and" />
        <premise number="2">
          <elt_left model="UI" criterion="name" />
          <comparator value="=" />
          <elt_right type="string" value="call of X" />
        </premise>
      </premises>
      <conclusions>
        <conclusion number="1">
          <target level="modality" name="Photography" />
          <effect value="unsuitable" />
        </conclusion>
        <conclusion number="2">
          <target level="medium" name="Vibrator" />
          <effect value="unsuitable" />
        </conclusion>
      </conclusions>
    </rule>


The simulation tool [36] is composed of four parts (Figure 14). A first interface (Figure 14, A) simulates the dialog controller. More precisely, it allows the simulation of incoming information units from the dialog controller, which launches the presentation process for this information. A second interface (Figure 14, B) simulates a context server allowing the modification of the interaction context state. A third window (Figure 14, C) describes the simulation results in a textual form. Finally, a last interface (Figure 14, D) presents, with graphics and sounds, a simulation of the output results. The simulation tool relies on the runtime kernel, which is presented in the following section.

3.3 Runtime Kernel

The runtime kernel allows the execution of an output specification. From the exported specification files, this tool is able to dynamically produce a multimodal presentation depending on the current interaction context state and a given information unit. The runtime kernel can be used either by the simulation tool or by the real final system.

Figure 15 presents the architecture model [34] of the runtime kernel. This architecture model applies the WWHT conceptual model. The first step (What) is made during the specification process of the information units. The concepts of the three other steps (Which, How and Then) are managed by different modules of the architecture model. This architecture model is composed of four main modules: the election module, the instantiation module, the multimodal presentations management module and the context spy module. The knowledge of the system is defined through five different structures: context (models), behaviour (rule base), contents (contents list), attributes (attribute instantiation models) and MPL (Multimodal Presentations List).

From the specified behavioural model, the election module allocates multimodal presentations adapted to the current state of the interaction context. These presentations are then instantiated by the instantiation module on the basis of the specified contents and attribute models. The resulting presentation is finally sent to the rendering engine (specific to the application) to be presented to the user. Meanwhile, the multimodal presentations management module checks the validity of persistent presentations. It receives information from the context spy module, which analyses the interaction context evolution.
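Tying these modules together, the runtime kernel behaves roughly like the orchestration sketch below; the module interfaces are hypothetical stand-ins injected as plain callables, and the real kernel additionally manages the multimodal presentations list and the context spy asynchronously.

    # Illustrative orchestration sketch of the runtime kernel (Figure 15):
    # election -> instantiation -> rendering, plus registration of the resulting
    # presentation for later validity checks. Interfaces are hypothetical.

    class RuntimeKernel:
        def __init__(self, fission, elect, instantiate, render, evolution_module):
            # The five collaborators are injected; each is a plain callable or list here.
            self.fission, self.elect, self.instantiate = fission, elect, instantiate
            self.render, self.evolution_module = render, evolution_module

        def handle(self, information_unit, context):
            presentation = []
            for eiu in self.fission(information_unit):                 # What
                modality, medium = self.elect(eiu, context)            # Which (election)
                content = self.instantiate(modality, eiu, context)     # How (instantiation)
                presentation.append((modality, medium, content))
            self.render(presentation)                                  # rendering engine
            self.evolution_module.append(presentation)                 # Then (management)
            return presentation

    kernel = RuntimeKernel(
        fission=lambda iu: ["call event", "caller identity"],
        elect=lambda eiu, ctx: ("ringing", "speaker") if eiu == "call event" else ("text", "screen"),
        instantiate=lambda modality, eiu, ctx: f"{modality} content for {eiu}",
        render=print,
        evolution_module=[],
    )
    kernel.handle("phone call of X", context={})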

Fig. 15. Runtime kernel architecture.

[The figure shows the dialog controller sending semantic information to the election module (which exploits the behaviour and context structures) and the instantiation module (which exploits the contents and attributes structures), producing the multimodal presentation sent to the rendering engine and its media; the multimodal presentations management module adds/deletes presentations in the MPL and invalidates them when the context spy module detects a change of a supervised criterion.]

3.4 From Simulations to Real Applications

The simulation is different from the test phase because it does not require the presence of end users. Thanks to this simulation, the designer is able to observe the effects and the quality of the proposed specification. However, this simulation is an “application” of the specification; it does not replace a prototype and a test phase. Prototypes represent a second stage in the simulation process, which can be improved by the results of the first simulation.

The simulation tool can also be used to develop a prototype of the complete system. The links with the other application modules (dialog controller, context module and system media) are managed by RMI (Remote Method Invocation) connections, which supports distributed architectures. Any modification of the specification only requires a re-initialization of the tool.

The output specification can then be made in an incremental way through successive prototypes. The output design of a multimodal system is then based on a cycle model composed of three steps:

• analysis,
• specification,
• simulation.

The results of the analysis step are recommendations for the next specification step. During the first iteration, these recommendations are extracted from the project requirements. For the next iterations, the analysis bears on the simulation results of the last submitted specification.

4. Application

In the previous sections, we presented a conceptual model and its application through a software platform. We now illustrate the introduced concepts through the output specification of a more complex and realistic application.

4.1 Interaction in a fighter plane cockpit

This application is carried out within the INTUITION (multimodal interaction integrating innovative technologies) project. This project is partly funded by the French DGA (General Delegation for the Armament) and involves three laboratories (LIMSI-CNRS, CLIPS-IMAG and LIIHS-IRIT) and an industrial partner (Thales). The project objective is to develop an adaptation platform for new Human-Computer Interaction technologies. The first application of this platform was about interaction in a fighter plane cockpit [9]. This application runs on the Thales-Avionics simulator. This simulator is mainly composed of a graphical environment (based on the flight simulator X-PLANE), scenarios (managed by GESTAC, a Thales-Avionics application) and a multimodal interaction kernel composed of three systems: ICARE, Petshop and MOST. ICARE [15, 16] captures and analyses the pilot's input interactions. Input events are sent to a dialog controller specified with the Petshop tool [8, 10] through Petri nets. Finally, the information presentation is managed by MOST [35].

The main objective of this application was to verify real-time constraints and to check the communication between the different partners' modules. The extensibility of the approach was also studied. A first prototype of the application was implemented to validate the different components of the platform. This prototype concerns a task of marking out a target on the ground.

4.2 Outputs Specification

The output specification of this first task is presented below. Figure 16 presents the interaction components diagram (HMV: Helmet Mounted Visor, LRS: Large Reconfigurable Screen and HAS: Helmet Audio System). The interaction context is composed of three models and seven criteria presented in Table 3.


Fig. 16. Interaction components diagram (modes Visual and Auditory; modalities Geometric shape, Text, 2D Earcon and 3D Earcon; media HMV, LRS and HAS).

Table 3. Interaction context.

Criteria | Values | Model
Pilot's head position | High, Low | User
NAS (Navigation and Armament System) mode | Air-Air, Air-Ground | System
HMV availability | Available, Unavailable | System
LRS availability | Available, Unavailable | System
HAS availability | Available, Unavailable | System
Audio channel availability | Free, Occupied | System
Noise level | 0-100 | Environment

Four information units (add valid mark, add invalid mark, refresh mark and remove mark) are managed by the system. Finally, the application behaviour is specified through a base of eleven election rules described in Table 4.

Table 4. Election rules of the behavioural model.

Id | Name | Description in natural language
R1 | Pilot's head position | If the pilot's head position is low Then do not use HMV medium
R2 | Audio channel used | If the audio channel is already in use Then do not use 3D Earcon modality
R3 | HMV unavailability | If the HMV is not available Then do not use HMV medium
R4 | LRS unavailability | If the LRS is not available Then do not use LRS medium
R5 | HAS unavailability | If the HAS is not available Then do not use HAS medium
R6 | Air-to-Air NAS mode | If the NAS mode is Air-Air Then do not present the information
R7 | Too noisy | If the noise level is superior to 90 dB Then do not use Auditory mode
R8 | Important noise | If the noise level is between 70 and 90 dB Then 3D Earcon is unsuitable
R9 | Mark presentation | If the current EIU is a 3D point Then use Redundancy property
R10 | Command feedback | If the current EIU is a command feedback Then use Text and try to express it with the HMV
R11 | Error feedback | If the current EIU is an error feedback Then use 2D Earcon modality

4.3 Outputs Simulation

Figure 17 presents the application of the WWHT model to the “add a valid mark” information unit. In a nominal situation, only rules R9 and R10 are applied. The command feedback is visually (mode) expressed as a text (modality) through the helmet mounted visor (medium). In the same way, the mark is presented both visually and auditorily (modes) as a combination of a geometric shape (modality) displayed on the helmet and on the main cockpit screen (media) and a 3D Earcon (modality) played in the helmet audio system (medium). However, the state of the interaction context may change the form of the multimodal presentation. For example, if the audio channel is already in use, the mark will not be presented through the 3D Earcon modality (rule R2).

The evolution example proposed in Figure 17 concerns the helmet mounted visor (HMV medium). This evolution must switch off the HMV when the pilot lowers his head to look at something inside the cockpit. The mark presentation then evolves according to the pilot's head position (context criterion) by hiding or showing the geometric shape modality symbolizing the mark.


Fig. 17. Marking out a target in a fighter plane cockpit.

5. Future Works

The WWHT model defines a framework for the design and the evolution of the outputs of a multimodal system according to the interaction context. But what about the user's inputs? We will now analyse the influence of the inputs on the multimodal outputs.

Interface coherence is a major concept in human-computer interaction design. This concept can be extended to the design of all interaction types, including multimodal interaction. The outputs of a multimodal system have to be coherent with the user's inputs to maintain the interaction continuity between the user and the application. This interaction continuity cannot be obtained without including the inputs in the design process of the multimodal presentation.

Indeed, the input modalities can influence the output modality election process. For example, lexical feedback often depends on the input modality used. Typing on a keyboard produces in most cases a visual feedback (text). For vocal commands, an auditory feedback (synthetic voice) is sometimes preferred. Moreover, the inputs can also influence the instantiation process of the output modalities. For example, interacting with fingers on a touch screen requires the display of bigger buttons in comparison with pen interaction.

Two presentation types can be distinguished: passive presentations and interactive presentations. User actions do not influence a passive presentation during its life cycle, but they can be the source of an evolution in the case of interactive presentations.

The distributed approach introduced in sub-section 2.6 handles user actions in the information presentation model. For example, a mouse rollover on a button can lead to a mutation allowing the presentation of the button features through the addition of a text modality (tooltip). It can also be the source of a presentation refinement, such as the modification of the button background (to highlight it). However, the modelling of the interaction context must be extended to take input modalities into account.

On the one hand, a new model of the input modalities and media in use must be added to the interaction context. This model should be updated by the input module (or possibly by the dialog controller, depending on the architecture model). This implies the implementation of a context server accessible to both input and output modules.

[Figure 17 traces the WWHT steps for the “Add mark (X)” information unit. WHAT: fission into a command feedback EIU and a mark (X) EIU. WHICH: (Text, HMV) for the command feedback (R10); for the mark, redundant pairs (Geometric shape, HMV), (Geometric shape, LRS) and (3D Earcon, HAS) (R9), the 3D Earcon pair being dropped when the audio channel is occupied (R2). HOW: contents and attributes such as the texts “Add mark” / “Command Add mark” (Arial, size 12 or 16, bold or not), a rectangle or triangle geometric shape (yellow or green, height 0.95, weight 1.75), a default.midi earcon (volume 3) and a presentation time of 1 minute. THEN: the pilot's head position (low/high) makes the mark presentation evolve between MP1 and MP2 by hiding or showing the pairs displayed on the HMV.]


On the other hand, new mechanisms must be defined to inform the input module about the semantics and the identification of the rendering objects involved in the input interaction. Without this information, the input module will not be able to interpret user actions. These developments should be made on a new application of the INTUITION project. This application concerns an air traffic control system in which we are responsible for the output module.

6. Conclusion

The contextualisation of the interaction and the management of the whole life cycle of a multimodal presentation are the two main subjects of our contribution. The WWHT model decomposes the design process into three steps and contextualises the multimodal presentation through an election rule process. It also handles the evolution of the presentation by introducing two different approaches that ensure its coherence and validity with regard to context changes. The model extends existing work by strengthening the contextualisation mechanisms and by covering the complete life cycle of a multimodal presentation.

The ELOQUENCE platform is a direct application of these concepts. It embodies a design cycle composed of three steps: analysis, specification and simulation. The platform assists the designer at each step of the design process by providing a set of tools for the specification, simulation and execution of the system outputs.

The WWHT model combined with the ELOQUENCE platform constitutes a framework for the intelligent multimodal presentation of information. It has been applied successfully to the output design of simple applications (such as the mobile phone application) as well as of more complex and realistic ones (such as the military avionics application).

7. Acknowledgments

The work presented in this paper is partly funded by the French DGA (General Delegation for Armament) under contract #00.70.624.00.470.75.96.

8. References

[1] The UIMS Tool Developers Workshop, A Metamodel for the Runtime Architecture of an Interactive System, SIGCHI Bulletin 24 (1) (1992) 32-37.

[2] S. Abrilian, J.-C. Martin and S. Buisine, Algorithms for controlling cooperation between output modalities in 2D embodied conversational agents, Proc. ICMI'03, Vancouver, Canada, 5-7 November 2003, pp. 293-296.

[3] E. André, The generation of multimedia presentations, in: R. Dale, H. Moisl and H. Somers (Eds.), Handbook of Natural Language Processing: Techniques and Applications for the Processing of Language as Text, Marcel Dekker, 2000, pp. 305-327.

[4] E. André, Natural Language in Multimedia/Multimodal Systems, in: R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics, Oxford University Press, 2003, pp. 650-669.

[5] M. Antona, A. Savidis and C. Stephanidis, MENTOR: An Interactive Design Environment for Automatic User Interface Adaptation, Technical Report 341, ICS-FORTH, Heraklion, Crete, Greece, August 2004.

[6] Y. Arens and E.H. Hovy, The Design of a Model-Based Multimedia Interaction Manager, Artificial Intelligence Review 9 (3) (1995) 167-188.

[7] L. Balme, A. Demeure, N. Barralon, J. Coutaz and G. Calvary, CAMELEON-RT: a Software Architecture Reference Model for Distributed, Migratable, and Plastic User Interfaces, Lecture Notes in Computer Science, Volume 3295 / 2004, Ambient Intelligence: Second European Symposium, EUSAI 2004, Markopoulos P. et al. (Eds), 8-11 November 2004, pp. 291-302.


[8] R. Bastide, P. Palanque, H. Le Duc and J. Munoz, Integrating Rendering Specifications into a Formalism for the Design of Interactive Systems, Proc. 5th Eurographics Workshop on Design: Specification and Verification of Interactive systems (DSV-IS'98), Abington, UK, June 1998.

[9] R. Bastide, D. Bazalgette, Y. Bellik, L. Nigay and C. Nouvel, The INTUITION Design Process: Structuring Military Multimodal Interactive Cockpits Design According to the MVC Design Pattern. Proc. HCI International 2005, Las Vegas, Nevada, USA, July 2005.

[10] R. Bastide, E. Barboni, D. Bazalgette, X. Lacaze, D. Navarre, P. Palanque and A. Schyn, Supporting Intuition through Formal Specification of the User Interface for Military Aircraft Cockpit. Proc. HCI International 2005, Las Vegas, Nevada, USA, July 2005.

[11] Y. Bellik, Interfaces Multimodales: Concepts, Modèles et Architectures, Ph.D. Thesis, University of Paris XI, Orsay, France, 1995.

[12] N.O. Bernsen, A reference model for output information in intelligent multimedia presentation systems, in: G.P. Faconti and T. Rist (Eds.), Proc. ECAI ‘96 Workshop on Towards a standard Reference Model for Intelligent Multimedia Presentation Systems, Budapest, August 1996.

[13] S. Bleul, R. Schaefer, W. Mueller, Multimodal Dialog Description for Mobile Devices, Proc. Workshop on XML-based User Interface Description Languages at AVI’04, Gallipoli, Italy, 25 May 2004.

[14] M. Bordegoni, G. Faconti, M.T. Maybury, T. Rist, S. Ruggieri, P. Trahanias and M. Wilson, A Standard Reference Model for Intelligent Multimedia Presentation Systems, Computer Standards and Interfaces 18 (6-7) (1997) 477-496.

[15] J. Bouchet and L. Nigay, ICARE: A Component Based Approach for the Design and Development of Multimodal Interfaces, Extended Abstracts of CHI’04, 2004.

[16] J. Bouchet, L. Nigay and T. Ganille, The ICARE Component-Based Approach for Multimodal Interaction: Application to Military Plane Multimodal Cockpit, Proc. HCI International 2005, Las Vegas, Nevada, USA, July 2005.

[17] A. Braffort, A. Choisier, C. Collet, P. Dalle, F. Gianni, B. Lenseigne and J. Segouat, Toward an annotation software for video of Sign Language, including image processing tools and signing space modelling, Proc. 4th Conf. on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, 2004.

[18] G. Calvary, J. Coutaz, D. Thevenin, Q. Limbourg, L. Bouillon and J. Vanderdonckt, A unifying reference framework for multi-target user interfaces, Interacting with Computers 15 (3) (2003) 289-308.

[19] W. Chou, D. A. Dahl, M. Johnston, R. Pieraccini and D. Raggett, W3C: EMMA: Extensible MultiModal Annotation markup language, Retrieved 14-12-2004 from http://www.w3.org/TR/emma/

[20] J. Coutaz, Multimedia and Multimodal User Interfaces: A Software Engineering Perspective, Proc. Workshop on Human Computer Interaction, St Petersburg, Russia, 1992.

[21] J. Coutaz, L. Nigay, D. Salber, A. Blandford, J. May and R.M. Young, Four Easy Pieces for Assessing the Usability of Multimodal Interaction: the CARE Properties, Proc. INTERACT'95, Lillehammer, Norway, 1995.

[22] M. Crease, P. Gray and S. Brewster, A Toolkit of Mechanism and Context Independent Widgets, Proc. 7th Workshop DSV-IS, Limerick, Ireland, 5-6 June 2000, pp. 121-133.

[23] M. Dalal, S. Feiner, K. McKeown, S. Pan, M. Zhou, T. Höllerer, J. Shaw, Y. Feng, and J. Fromer, Negotiation for Automated Generation of Temporal Multimedia Presentations, Proc. ACM Multimedia 96, Boston, USA, 18-22 November 1996, pp. 55-64.

[24] A. K. Dey, D. Salber and G.D. Abowd, A conceptual framework and a toolkit for supporting the rapid prototyping of context-aware applications, in: T.P. Moran and P. Dourish (Eds.), Human-Computer Interaction 16 (2-4) (2001) 97-166.


[25] W. H. Graf, The constraint-based layout framework LayLab and its applications, Proc. of the Workshop on Effective Abstractions in Multimedia Layout, Presentations, and Interaction in conjunction with ACM Multimedia'95, San Francisco, U.S.A., 4 November 1995.

[26] M. Johnston and S. Bangalore, Finite-state Multimodal Integration and Understanding, Natural Language Engineering 11 (2) (2005) 159-187.

[27] C. Karagiannidis, A. Koumpis and C. Stephanidis, Deciding `What', `When', `Why', and `How' to adapt in intelligent multimedia presentation systems, Proc. ECAI Workshop on Towards a standard reference model for intelligent multimedia presentation systems, Budapest, Hungary, August 1996.

[28] C. Karagiannidis, A. Koumpis and C. Stephanidis, Adaptation in Intelligent Multimedia Presentation Systems as a Decision Making Process, Computer Standards and Interfaces 18 (2-3) (1997).

[29] K. Katsurada, Y. Nakamura, H. Yamada and T. Nitta, XISL: A Language for Describing Multimodal Interaction Scenarios, Proc. ICMI'03, Vancouver, Canada, 5-7 November 2003, pp. 281-284.

[30] T. Lemlouma and N. Layaida, Context-aware adaptation for mobile devices, Proc. Mobile Data Management 2004, Berkeley, CA, USA, 19-22 January 2004, pp. 106-111.

[31] Q. Limbourg, J. Vanderdonckt, UsiXML: A User Interface Description Language Supporting Multiple Levels of Independence, in: M. Matera and S. Comai (Eds.), Engineering Advanced Web Applications (2004), Rinton Press.

[32] L. Nigay, J. Coutaz, A Generic Platform for Addressing the Multimodal Challenge, Proc. CHI'95, Denver, Colorado, USA, 7-11 May 1995, pp. 98-105.

[33] T. Rist and P. Brandmeier, Customizing graphics for tiny displays of mobile devices, Personal and Ubiquitous Computing 6 (4) (2002) 260-268.

[34] C. Rousseau, Y. Bellik, F. Vernier and D. Bazalgette, Architecture framework for output multimodal systems design, Proc. OZCHI’04, Wollongong, Australia, 22-24 November 2004.

[35] C. Rousseau, Y. Bellik, F. Vernier and D. Bazalgette, Multimodal Output Simulation Platform for Real-Time Military Systems, Proc. HCI International 2005, Las Vegas, Nevada, USA, 22-27 July 2005.

[36] C. Rousseau, Y. Bellik and F. Vernier, Multimodal Output Specification / Simulation Platform, Proc. ICMI’05, Trento, Italy, 04-06 October 2005.

[37] A. Schmidt, K. Aidoo, A. Takaluoma, U. Tuomela, K. van Laerhoven and W. van de Velde, Advanced interaction in context, Proc. 1st Internat. Symposium on Handheld and Ubiquitous Computing (HUC '99), Karlsruhe, Germany, 1999.

[38] C. Stephanidis, C. Karagiannidis and A. Koumpis, Decision Making in Intelligent User Interfaces, Proc. Intelligent User Interfaces 97, 6-9 January 1997, pp. 195-202.

[39] C. Stephanidis and A. Savidis, Universal Access in the Information Society: Methods, Tools, and Interaction Technologies, Universal Access in the Information Society 1 (1) (2001) 40-55.

[40] D. Thevenin and J. Coutaz, Plasticity of User Interfaces: Framework and Research Agenda, Proc. INTERACT'99, Edinburgh, UK, August 1999, pp. 110-117.

[41] J. Vanderdonckt, D. Grolaux, P. Van Roy, Q. Limbourg, B. Macq and B. Michel, A Design Space for Context Sensitive User Interfaces, Proc. 14th Internat. Conf. on Intelligent and Adaptive Systems and Software Engineering (IASSE'05), Toronto, Canada, 20-22 July 2005.

[42] F. Vernier and L. Nigay, A framework for the Combination and Characterization of Output Modalities, Proc. 7th Internat. Workshop on Design, Specification and Verification of Interactive Systems (DSV-IS), Limerick, Ireland, 5-6 June 2000, pp. 35-50.