IST project COMIC
Vision and Approach
Results of the first 1.5 years
http://www.hcrc.ed.ac.uk/comic/
Vision of COMIC
• Multimodal interaction will only be accepted by non-expert users if the fundamental cognitive interaction capabilities of human beings are properly taken into account
Approach of COMIC
• Obtain fundamental knowledge on multimodal interaction – use of speech, pen, and facial expressions
Approach (2)
• Develop new approaches for component technologies that are guided by human factors experiments
Approach (3)
• Obtain hands-on experience by building an integrated multimodal demonstrator for bathroom design that combines new approaches for:
– Automatic speech recognition
– Automatic pen gesture recognition
– Fusion
– Dialogue and action management
– Fission
– Output generation combining text, speech, and facial expression
– System integration
– Cognitive knowledge
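The module chain above can be pictured as a processing pipeline running from input recognition to multimodal output. The following is a minimal sketch under assumed interfaces; all function names and data shapes are illustrative, not the actual COMIC implementation.

```python
# Illustrative sketch of a COMIC-style multimodal pipeline.
# Each stage is a plain function; the real modules would run asynchronously.

def speech_recognition(audio):
    # Placeholder ASR: pretend the audio has already been transcribed.
    return {"modality": "speech", "content": audio}

def pen_gesture_recognition(strokes):
    # Placeholder AGR: classify pen strokes as a gesture.
    return {"modality": "pen", "content": strokes}

def fusion(speech_hyp, pen_hyp):
    # Merge the two unimodal hypotheses into one interpretation.
    return {"speech": speech_hyp["content"], "pen": pen_hyp["content"]}

def dialogue_and_action(interpretation):
    # Decide on an abstract dialogue act in response to the input.
    return {"act": "confirm", "slots": interpretation}

def fission(dialogue_act):
    # Split the abstract act over the available output channels.
    return {
        "text": f"Confirming: {dialogue_act['slots']['speech']}",
        "speech": True,
        "facial_expression": "agreement",
    }

def run_pipeline(audio, strokes):
    merged = fusion(speech_recognition(audio), pen_gesture_recognition(strokes))
    return fission(dialogue_and_action(merged))

out = run_pipeline("wall of 3 metres", ["line_stroke"])
print(out["text"])                # Confirming: wall of 3 metres
print(out["facial_expression"])   # agreement
```

The point of the sketch is the division of labour: fusion combines modalities on the input side, while fission distributes one dialogue act over text, speech, and the face on the output side.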
The partners of COMIC
• Max Planck Institute for Psycholinguistics – Fundamental Cognitive Research
• Max Planck Institute for Biological Cybernetics – Fundamental Cognitive Research
• University of Nijmegen – ASR and AGR
• University of Sheffield – Dialogue and Action
• University of Edinburgh – Fission and Output
• DFKI – Fusion and System Integration
• ViSoft – Graphical part of the demonstrator
This presentation
• Explanation of the demonstrator
• Results of fundamental cognitive research
– Multimodal interaction
– Facial expressions
• Results of human factors experiments
The COMIC demonstrator
• Bathroom design for non-expert users
• Final goal is to implement 4 phases:
– 1: Input shape and dimensions of own bathroom (pen and speech input)
– 2: Choose position of sanitary ware (based on templates)
– 3: Conversational dialogue about types of sanitary ware and tiles
– 4: 3D view of negotiated bathroom
• The result is taken to an expert salesman, who will proceed from there.
The COMIC demonstrator
• Three versions
– T12: Proof of technical integration of all modules
– T24: Limited functionality – fixed bathroom, only tiles
– T36: Full functionality – own bathroom, sanitary ware, tiles
The SLOT Research Platform
• Recording dyadic, natural interactions
• Route negotiation task with road maps
• Use of electronic pen/ink for drawing routes
• Elaborate and theory-free coding of data
• Systematically manipulating available modalities (drawing, visual contact)
Results: Quantitative analysis of turn-taking behaviour
• 4x4 dyads; 6 hours of annotated interaction
• Normally, there is no delay between people’s turns
• With a one-way mirror, the “blind” person is slower to take up her turn
• This leads to longer silent periods (pauses) …
• … which leads to significantly slower communication
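The measurement behind these findings can be illustrated with a toy computation: given annotated turn start and end times, the inter-turn gap is the next speaker's start time minus the previous speaker's end time. The annotations below are invented for illustration; the actual study used 4x4 dyads and 6 hours of annotated interaction.

```python
# Compute inter-turn gaps (floor-transfer offsets) from annotated turns.
# A turn is a (start_s, end_s) pair; the lists are invented example data.

def turn_gaps(turns):
    # Gap between consecutive turns: next start minus previous end.
    return [nxt[0] - prev[1] for prev, nxt in zip(turns, turns[1:])]

def mean(xs):
    return sum(xs) / len(xs)

# Invented annotations (seconds): normal condition vs one-way mirror.
normal = [(0.0, 2.1), (2.1, 4.0), (4.1, 6.5)]
one_way_mirror = [(0.0, 2.1), (3.0, 4.0), (5.2, 6.5)]

print(round(mean(turn_gaps(normal)), 2))          # 0.05: near-zero gaps
print(round(mean(turn_gaps(one_way_mirror)), 2))  # 1.05: longer pauses
```

Longer mean gaps directly translate into more total silence per exchange, which is the mechanism behind the slower communication reported above.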
Possible relevance for HCI
In conversational HCI with a talking head:
• The user sees the computer’s “face”
• The user might assume that the computer sees his face
• Speech recognition has a hard time reliably detecting end-of-speech acoustically
Therefore we hypothesize that:
• The user will notice (even more) that the computer responds very slowly
Fundamental Research on Facial Expressions
• Faces do a lot in a conversation:
– Lip motion for speaking
– Emotional expression (pleasure, surprise, fear)
– Dialogue flow (back-channeling: confusion, comprehension, agreement)
– Co-expression (emphasis and word/topic stress)
• Most work on Avatars focuses exclusively on lip motion for speech.
• We aim to broaden the capabilities of Avatars, allowing for more sophisticated self-expression and more subtle dialogue control.
• To this end, we use psychophysical knowledge and procedures as a basis for synthesizing human conversational expressions.
First step: Real expressions
• We recorded a variety of conversational expressions from several individuals. We then experimentally determined how identifiable and believable those expressions were.
• In general, we found that:
– The expressions were easily recognized, even in the complete absence of conversational context (and thus can be useful for back-channeling).
– The pattern of confusions between expressions indicates potential trouble areas (e.g., thinking was often mistaken for confusion).
– These (“enacted”) expressions were not always recognized or found to be completely sincere (speech might help here).
Next step: What moves when?
We are now performing a fine-grained analysis of the necessary and sufficient components of conversational facial motion. What must move when to produce an identifiable and believable expression?
Relevance for HCI and eCommerce
• Psychophysical studies of real expressions offer strong insights into how one can produce identifiable, realistic, and believable conversational expressions.
• The expansion of Avatars’ expressive capabilities promises to improve the ease of use of HCI systems.
Human Factors Experiments guiding the technology
• University of Nijmegen investigated input issues (ASR, AGR, fusion)
• University of Edinburgh investigated output issues (text, graphics, face, fission)
Human Factors Experiments
• Exploratory pilot studies
– Can users combine pen and speech for entering data about the layout of a room?
– Do they like it? What do they prefer?
– System-driven vs mixed-initiative dialogues
– Pen+speech data acquisition and analysis
HF Experiments: input
• Task: study a blueprint and specify it using speech and/or pen
• Subjects had to specify the positions and lengths of walls, doors, windows, and sanitary ware
• The experiment is directly related to phase 1 of the demonstrator
HF Experiments: main results
• Subjects prefer gestures and speech, or gestures only; speech only is not preferred
• Subjects show a large variation in behaviour even when restricted to narrowly defined tasks
• Subjects prefer mixed-initiative dialogue
• System-driven dialogue results in fewer errors, but requires more time
HF Experiments: speech
• Subjects use three types of speech comments
– Within task: “here is a wall with width 3 meter 40 …”
– Out-of-task, within dialogue: “now I am going to draw the next wall”
– Out-of-dialogue: “I hope I'm drawing this in the right way …”
Human Factors Experiments: Output
• Fission module: translates abstract dialogue acts into specifications for output channels
• Goal: model the choices made in the COMIC fission module after naturally-occurring interactions.
• Question: What are important natural actions in multimodal dialogue?
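One way to picture the fission step is as a function from an abstract dialogue act to per-channel output specifications for text, speech, and the talking head. The sketch below is an assumed illustration; the act fields, channel names, and expression labels are not the actual COMIC interface.

```python
# Illustrative fission: map an abstract dialogue act to output-channel specs.
# The act and channel formats here are assumptions for the sketch.

def fission(act):
    # act: {"type": ..., "object": ..., "polarity": ...}
    text = f"This is {act['object']}."
    return {
        "text": text,
        "speech": {"utterance": text, "emphasis": act["object"]},
        "face": {
            # Pick a conversational expression matching the act's polarity.
            "expression": "agreement" if act["polarity"] == "positive"
                          else "confusion",
            "gaze": "user",
        },
    }

spec = fission({"type": "describe",
                "object": "a blue tile series",
                "polarity": "positive"})
print(spec["text"])                  # This is a blue tile series.
print(spec["face"]["expression"])    # agreement
```

Modelling these channel choices on naturally occurring interactions, as described above, is what the annotation study is for: the observed correlations decide which expression, emphasis, and gaze values the module should emit.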
Human Factors Experiments: Output
• “Wizard of Oz” recordings
• Set-up of the recordings:
– Subjects (native English speakers, not bathroom design experts) played the role of a bathroom sales consultant presenting a range of options to the client.
– Total recordings: 7 interactions; approximately 2.5 hours of video.
Human Factors Experiments: Output
• Making use of the recordings: annotation
• Focus on scenes where the “consultant” says things similar to the planned system output
– In particular, descriptions and comparisons of options
• Mark up surface features like those under the control of the fission module, and factors predicted to have an effect on those features
Making use of the recordings: using the results
• Examine:
– Range of surface features (deictic gestures, prosody, facial expressions, and gaze): both occurrence and timing
– Correlation between features and factors such as description vs comparison, first vs repeated mention, and positive vs negative context
• Use these results in the development of the fission module
Sample comparison
“So they give you a degree of colour, they’re slight– they’re obviously slightly busier than looking at something like this, but, umm, they’re not quite as intense as having a whole block of colour, such as those two.”