Page 1

W3C Multimodal Interaction Activity

Dave Raggett <[email protected]>

W3C & Openwave Systems

November 2002

Page 2

Multimodal Interaction

● The Web started with purely textual content

● Evolved to include graphics, forms, tables, richer layout, and client and server side scripting

● W3C Voice Browser Activity launched in 1999 to enable voice interaction via any telephone

● New challenge is to combine speech with other modes of interaction

● W3C Multimodal Interaction Activity launched in February 2002

Web pages you can speak to and gesture at

Page 3

Multimodal Devices

● We have the hardware

● We have the networks (GPRS, W-CDMA, 1xRTT, 802.11)

● We now need to develop the standards to combine different modes of interaction!

Bluetooth headset SonyEricsson P800

Page 4

Multimodal Devices

● Emerging market for car-based systems

— 3x4 inch color display plus control buttons

— Speech input and output

— GPS satellite location, maps, and navigation planning

— Wireless access to traffic information

— Entertainment (radio, CD, DVD, television)

— Control and monitoring of many aspects of the car's systems

— Telephone, mobile information and communication services

Page 5

What Modes and Why?

● Display

– Persistent

– Enliven with graphics/animation (branding)

– Limited on small screens

– Color displays hard to read in sunlight

● Speech

– Transient

– Good for small devices

– Enliven with personalities and audio effects

– Enables Speaker verification

– Environmental/cultural issues

Page 6

What Modes and Why?

● Keys

– Full size keyboards vs phone keypads

● Pen/Stylus

– Needs two hands vs thumbing on keypad

– Handwriting, drawings and gestures

– Special notations (mathematics, music, chemistry)

● Others

– Cameras (photos and videos)

– Audio recording and playback

– System sensors (location and movement)

– Haptic devices (BMW iDrive)

– Head and arm movements, facial gestures

Page 7

When to use a given Mode?

● Social context

– Impolite

– Forbidden

– Culture specific

● Environmental Context

– Disabilities

– Too noisy to hear or speak

– Too bright to easily read screen

– Need to keep hands/eyes free (driving)

● Personal Preferences

– Static vs dynamic preferences

Page 8

Device Capabilities

● Thick Clients (desktops)

— High-powered device capable of large-vocabulary speech recognition, with a full-sized keyboard and large high-resolution display

— Application runs locally

● Thin Clients (mass-market mobile phones)

— Long battery life, limited processing power and memory, with small display, keypad and audio.

— Application largely runs in the network.

● Medium Clients (PDAs and high-end phones)

— Moderate processing power and memory; intermediate-sized display, stylus/keypad and audio

— Limited local recognition for speech and ink

— Application distributed between device and network

Page 9

Use Cases

● Airline reservation

— Making a reservation on your way to work: a form filling exercise where you get to choose the best modality according to whether you are on foot, in the car, on a train or have reached the office.

● Driving directions

— Keep your mind on the road while getting directions: combines speech and vision for selections, directions and detours

● Name dialling

— Friends and colleagues call you by name, with customized call handling depending on time of day, what you are doing, and who is calling you. See the caller before accepting the call, leave messages for exes to never call you again, ...

● Finding a hotel

— You've just arrived in town and need a hotel for the night. In this example you speak to a human assistant who guides you through the available choices according to your preferences: combines speech with visual feedback .... Demo ....

Page 10

MMI Framework

Diagram: the User interacts through Input and Output components, which are coordinated by an Interaction Manager that calls on the Application Functions

Page 11

Interaction Management

● Interprets user input

— Taking into account internal state

— Integrating inputs from different modalities

— Anaphora, deixis and ellipsis

— Anne asked Edward to pass her the salt.

— I want that one! (pointing to a toy in a display cabinet)

— Detecting inconsistencies

● Determines how to respond

— Update internal state and take appropriate actions

— May involve an explicit model of tasks and dialog history

● System vs User vs Mixed Initiative

— System driven dialog leads user by the hand ...

— User driven dialog: user commands via speech, key strokes, mouse clicks, ...

— Mixed initiative: both parties can take control from each other

— Current Web applications can be considered as examples of mixed initiative

Page 12

Speech is different!

● Speech is transient

— You hear it and it's gone!

● Speech is uncertain

— Recognizers make mistakes

● Knowing what to say

— Carefully chosen prompts guide you to respond appropriately

— Grammars guide recognition based upon expected responses

— Critical for robust speaker independent recognition

● Spoken dialog patterns

— e.g. navigation commands, tapered prompts, traffic lights model

● Visual feedback simplifies some aspects of dialog

— You see at a glance whether what you said was recognized correctly

Why speech requires a fundamentally different approach from graphical UIs

Page 13

Single Document Model

● Combine XHTML with markup for speech, ink etc.

— Building upon XHTML Modularization

— XHTML + SRGS + SSML + XML Events etc.

— XHTML events trigger prompts and activate speech grammars

— onload, onunload, onmouseover, onclick, onfocus, onblur, ...

— Simple markup for common cases (see the sketch below):

— Spoken commands to set focus, follow link, click button

— Break out to scripting when extra flexibility is needed

— Single document model is likely to be attractive to authors

● But there are problems ...

— How to execute in a distributed environment?

— Lack of declarative support for common dialog behaviors

— Mixes up presentation and logic
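
To make the single document idea concrete, here is a rough sketch of XHTML combined with XML Events and an embedded VoiceXML fragment; the way the voice form is declared and attached to the input field is an illustrative assumption, not the syntax of any published specification.

  <html xmlns="http://www.w3.org/1999/xhtml"
        xmlns:ev="http://www.w3.org/2001/xml-events"
        xmlns:vxml="http://www.w3.org/2001/vxml">
    <head>
      <!-- Voice dialog fragment reused by the visual form below -->
      <vxml:form id="sayCity">
        <vxml:field name="city">
          <vxml:prompt>Which city are you flying to?</vxml:prompt>
          <vxml:grammar src="cities.grxml" type="application/srgs+xml"/>
        </vxml:field>
      </vxml:form>
    </head>
    <body>
      <form action="/book">
        <!-- Focusing the field activates the corresponding voice dialog -->
        <input type="text" name="city"
               ev:event="focus" ev:handler="#sayCity"/>
      </form>
    </body>
  </html>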

Page 14

Dual Document Model

● XHTML in the device, VoiceXML in the network

— Coupled via events that trigger actions or notify changes (see the message sketch below)

— Application state is duplicated in both systems

— Events are passed as messages from one system to the other

— Users can choose whether to speak, or to use keypad or stylus

— If the user types a value into a form field, the onchanged event is intercepted to pass the new value to the VoiceXML interpreter to update the voice version of the form.

— If the user uses her voice to fill out the field, the VoiceXML interpreter sends a message to the XHTML interpreter to update the visual version of the form.

— Simpler to implement than single document approach

— Builds upon existing implementations of XHTML and VoiceXML

● May be able to exploit a shared data model

● But there are problems ...

— Having XHTML and VoiceXML in separate documents could make this harder for authors

— Mixes up presentation and logic
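
As an illustration of this coupling, the notification sent when the user types a value, and the corresponding update when she speaks it, might be carried as small XML messages like these; the message vocabulary is invented here for illustration, since no format had been standardized at the time.

  <!-- GUI to VoiceXML interpreter: the user typed a value -->
  <event name="onchanged" source="xhtml"
         timestamp="2002-11-12T09:30:02Z">
    <target document="flight.html" element="city"/>
    <value>Boston</value>
  </event>

  <!-- VoiceXML interpreter to GUI: the user spoke a value -->
  <event name="setvalue" source="voicexml"
         timestamp="2002-11-12T09:30:07Z">
    <target document="flight.html" element="city"/>
    <value>Austin</value>
  </event>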

Page 15

Separating Presentation and Logic

● Web server scripts play an important role in dialog

— XHTML and VoiceXML are commonly generated by server-side scripts

— User is led through a sequence of dynamically constructed pages

— Application logic is represented within server-side scripts

● Declarative Dialogs

— Declarative replacement for server-side scripts

— Decompose applications into smaller tasks

— Driven by goals and current application state

— Form filling metaphor (XForms)

— Plan generation and repair

— Presentation generated through template instantiation

— Gracefully copes with wide variety of device capabilities

● But ...

— Perhaps too ambitious, and best left until we have more experience

Page 16

The Key Role of Events

● Events as the means for coupling components of distributed Multimodal Systems

— Functional specification as XML messages (see the sketch below)

— Agnostic wrt transport protocol (e.g. SIP Events)

— Support for Subscribe/Notify model

— Timestamps for synchronization (temporal/logical)

— Addressing (host/user/process/document/element)

— Efficient transmission via event chunking

● Author's perspective

— Scripting vs declarative event handlers

— High level declarations for implicit handlers

— Default handlers for specific events

— High level modality independent vs low level modality dependent

— Events which cause actions vs events that inform of changes
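
A hedged sketch of what such XML event messages might look like, covering the subscribe/notify model, timestamps and hierarchical addressing listed above; all element and attribute names here are illustrative assumptions rather than an agreed format.

  <!-- Dialog manager subscribes to focus and change events in the GUI document -->
  <subscribe id="sub-42" expires="3600">
    <address host="gui.example.net" process="browser"
             document="flight.html" element="*"/>
    <events>onfocus onchanged</events>
  </subscribe>

  <!-- GUI notifies the subscriber; the timestamp supports synchronization -->
  <notify subscription="sub-42" timestamp="2002-11-12T09:31:05.250Z">
    <address host="gui.example.net" process="browser"
             document="flight.html" element="city"/>
    <event name="onfocus"/>
  </notify>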

Page 17

Event Groups

● XHTML Events

— onfocus, onblur, onchanged, onload, onunload, onmouseover, onmouseout, onclick, onsubmit

● Document mutation events

— Load page, set field value, set focus, XUpdate

● Environment, System status, Personal events

— enable/disable modes, vehicle in motion/parked, location updates

● Speech and ink services

— enable/disable grammars, prompts, results, recording, verification

● Multimedia services

— play, pause, stop, rewind, fast forward

● Communication services

— Sessions: start/stop, join/leave

— Services: allocate/deallocate

— Presence updates: devices/people

Page 18

Using Events

● Thin Client and Network based dialog manager

— Client runs GUI (XHTML Basic + CSS + Scripting)

— Bi-directional audio channel to network based speech engine

— XHTML events passed to network based dialog system

— Dialog system sends events to GUI to set focus, set field values, change to new page, or to mutate current page

● Author's perspective

— Scripting vs declarative event handlers

— High level declarations for implicit handlers

— Default handlers for specific events

— High level modality independent vs low level modality dependent

Page 19

Input

● Natural Language Input

— Speech grammars + Acoustic models for speech recognition

— Ink grammars + stroke models for handwriting recognition

— Extracting semantic results via grammar rule annotations (see the grammar sketch below)

— Results expressed as XML in application specific markup

● EMMA (Extensible Multi Modal Annotations)

— Interface between recognizers and dialog managers

— EMMA annotates XML application data with:

• Data model (link to external definition)

• Confidence scores

• Time stamps

• Alternative recognition hypotheses

• Sequential and Partial results

• May combine results from multiple input modes

— EMMA is based upon RDF (an RDF vocabulary)
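
As a rough illustration of the grammar bullets above, an SRGS-style grammar for the flight destination example on the next page might look like this; the comment marks where a semantic annotation would go, since the exact annotation syntax is left out as an assumption.

  <grammar xmlns="http://www.w3.org/2001/06/grammar"
           version="1.0" xml:lang="en-US" mode="voice" root="request">
    <rule id="request" scope="public">
      I want to fly to
      <ruleref uri="#city"/>
    </rule>
    <rule id="city">
      <one-of>
        <item>Boston</item>
        <item>Austin</item>
      </one-of>
      <!-- A rule annotation here would map the recognized city onto
           application markup such as <destination>Boston</destination> -->
    </rule>
  </grammar>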

Page 20

Example

● 'I want to fly to Boston'

— N-Best list of alternatives:

• Destination: Boston, confidence 0.6

• Destination: Austin, confidence 0.4

— <destination>city name</destination>

  <result xmlns:emma="http://www.w3.org/2002/emma"
          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:alt>
      <rdf:li emma:confidence="60">
        <destination>Boston</destination>
      </rdf:li>
      <rdf:li emma:confidence="40">
        <destination>Austin</destination>
      </rdf:li>
    </rdf:alt>
  </result>

Page 21

Pen Input

● Pen input has been with us for a long time

— Apple Newton

— Took off with success of Palm Organizer and Pocket CE

— Cell phones like Motorola Accompli A6188, Alcatel One Touch COM, Kyocera pdQ, SonyEricsson R380, P800, Siemens Multimobile, Mitsubishi Trium, PC-ePhone

Motorola Alcatel Kyocera SonyEricsson Siemens Mitsubishi PC-ePhone

Page 22

Pen Input

● Uses

— Gestures

— Drawings

— Note taking

— Handwriting recognition

— Signature verification

— Specialized notations: Math, Music, Chemistry, ...

● Why develop a standard ink format?

— Use of ink together with speech

— Server-side interpretation brings extra flexibility

— Passing ink to other people (drawings and notes)

● XML based format

— InkXML contribution (IBM, Intel, Motorola, International Unipen Foundation)

Page 23

Driving Directions Use Case

Page 24

Name Dialing Use Case

Page 25

Form Filling Use Case

Page 26

Multimodal Requirements

● Selecting/Constraining available modes

— Allowing the user to select which modes are used by user/system

— Allowing system to constrain modes (road safety)

● Flexible use of output modes

— Using output modes in a complementary fashion

— Using output modes for redundancy

— Using modes in a sequential fashion

● Flexible use of input modes

— Allow input in choice of modes (voice or stylus/keypad)

— Allow for inputs combining more than one mode

— 'I want this one' while drawing circle on picture

— Opportunity for inconsistent inputs in different modes

— Saying yes while typing no

Page 27

Multimodal Requirements

● Support a wide range of device capabilities

● Allow for multiple users/devices/servers

— Using a telephone in combination with a desktop

— Assisting communication between human users

— Local or Network based speech and ink engines

— Local or Network based dialog management

● Allow integrated and distributed architectures

— Event handlers which are agnostic wrt source of event

— Modular markup supporting multiple architectures

● Increasing importance of SIP

— SIP is to communications applications as HTTP is to the Web

— SIP allows scripted multi-device/multi-server sessions

— SIP Events: as basis for coupling multimodal systems

Page 28

Is Multimodal of Genuine Value?

● Won't offering a choice of modes confuse people?

— Convenience, e.g. ease of speaking versus thumbing input

— Changing situational context makes different modes attractive

— Legal requirements e.g. when driving

● Isn't this just technology for geeks?

— No, it will be used by everyone

— Well designed, it will make services more accessible

— Multimodal is especially suited to mobile

— Long term it could radically change how we interact with computers

● Are the markets ready for multimodal?

— The telecoms downturn has hurt the rate of adoption of new technology

— Companies involved in the W3C Multimodal working group: Avaya, Canon, Cisco, Comverse, France Telecom, Hewlett-Packard, IBM, Intel, Kirusa, Loquendo, Microsoft, Mitsubishi Electric, Motorola, NEC, Nokia, Nortel Networks, Nuance Communications, OnMobile Systems, Openwave, Opera Software, Philips Electronics, PipeBeach, ScanSoft, Siemens, SnowShore Networks, SpeechWorks International, Sun Microsystems, T-Online International, Toyohashi University of Technology, V-Enable, VoiceGenie, and Voxeo

Page 29

Research Challenges

● More robust speech recognition

— Picking nuances of speech out of the background

— Understanding the sound field

— Modeling the microphone and acoustic environment

— Richer models for later stages of speech recognition

— Combining stochastic and linguistic knowledge

— Something in between speaker dependent and independent ASR

● Declarative approaches for separating presentation and behavior

— The Web is just starting to separate presentation and data

— Plan based dialogs with tasks as objects? (TalkML2)

● Reducing the burden on the author

— Applications as a constellation of tasks

— Combining task-specific with general knowledge

— Natural language understanding and common sense skills

— Role of self awareness (dialog memory + long term memory)

— Soft human knowledge versus crisp symbolic knowledge (Semantic Web)