
From Voice Browsers to Multimodal Systems


Page 1: From Voice Browsers to Multimodal Systems

1/41 W3C AC/WWW10, Hong Kong, May 2001

From Voice Browsers to Multimodal Systems

Dave Raggett

W3C Lead for Voice/Multimodal

W3C & Openwave

[email protected]

http://www.w3.org/Voice

With thanks to Jim Larson

The W3C Speech Interface Framework

Page 2: From Voice Browsers to Multimodal Systems


Voice – The Natural Interface, available from over a billion phones

• Personal assistant functions
  – Name dialing and search
  – Personal information management
  – Unified messaging (mail, fax & IM)
  – Call screening & call routing

• Voice portals
  – Access to news, information, entertainment, customer service and V-commerce (e.g. find a friend, wine tips, flight info, find a hotel room, buy ringing tones, track a shipment)

• Front-ends for call centers
  – 90% cost savings over human agents
  – Reduced call abandonment rates (IVR)
  – Increased customer satisfaction

(Portal Demo)

Page 3: From Voice Browsers to Multimodal Systems


W3C Voice Browser Working Group
http://www.w3.org/Voice/Group

• Founded May 1999, following a workshop in October 1998

• Mission
  – Prepare and review markup languages to enable Internet-based speech applications

• Has published requirements and specifications for the languages in the W3C Speech Interface Framework

• Is now due to be re-chartered with a clarified IP policy

Page 4: From Voice Browsers to Multimodal Systems


Voice Browser WG Membership

Alcatel, AnyDevice, Ask Jeeves, AT&T, Avaya, BeVocal, Brience, BT, Canon, Cisco, Comverse, Conversay, EDF, France Telecom, General Magic, Hitachi, HP, IBM, Informio, Intel, IsSound, Lernout & Hauspie, Locus Dialogue, Lucent, Microsoft, Milo, Mitre, Motorola, Nokia, Nortel Networks, Nuance, Philips, Openwave, PipeBeach, SpeechHost, SpeechWorks, Sun Microsystems, Telecom Italia, Telera, Tellme, Unisys, Verascape, VoiceGenie, Voxeo, VoxSurf, Yahoo

Page 5: From Voice Browsers to Multimodal Systems


W3C Speech Interface Framework

[Architecture diagram: the user connects through the telephone system to a voice browser whose components — ASR, DTMF tone recognizer, language understanding, context interpretation, dialog manager, media planning, language generation, TTS, prerecorded audio player, lexicon, and call control — draw content from the World Wide Web. The associated markup languages are the Speech Synthesis ML, Speech Recognition Grammar ML, N-gram Grammar ML, Natural Language Semantics ML, VoiceXML 2.0, and Reusable Components.]

Page 6: From Voice Browsers to Multimodal Systems


W3C Speech Interface Framework Published Documents

[Status chart: for each document — Dialog (VoiceXML), Speech Synthesis, Speech Grammar, N-gram, NL Semantics, Reusable Components, Lexicon, and Call Control — progress is tracked through the stages REQ, WD, LC, CR, PR and REC. Requirements drafts appeared between 12-99 and 5-00; working drafts between 1-01 and 4-01, with more due soon.]

Documents available at http://www.w3.org/Voice

Page 7: From Voice Browsers to Multimodal Systems


Voice User Interfaces and VoiceXML

• Why use voice as a user interface?
  – Far more phones than PCs
  – More wireless phones than PCs
  – Hands- and eyes-free operation

• Why do we need a language for specifying voice dialogs?
  – A high-level language simplifies application development
  – Separates the voice interface from the application server
  – Leverages existing Web application development tools

• What does VoiceXML describe?
  – Conversational dialogs: system and user take turns to speak
  – Dialogs based on a form-filling metaphor, plus events and links

• W3C is standardizing VoiceXML based upon the VoiceXML 1.0 submission by AT&T, IBM, Lucent and Motorola

Page 8: From Voice Browsers to Multimodal Systems


VoiceXML Architecture

[Architecture diagram: any phone connects over the PSTN or VoIP to a carrier-hosted VoiceXML gateway, which exchanges speech and DTMF with the caller and fetches VoiceXML, grammars and audio files from a consumer or corporate Web site.]

Brings the power of the Web to Voice

Page 9: From Voice Browsers to Multimodal Systems


Reaching Out to Multiple Channels

[Diagram: an applications database serves XML, images, audio, etc. through a content-adaptation layer — adjusting as needed for each device and user — to XHTML, VoiceXML, and WML/HDML clients.]
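The content-adaptation step described above can be sketched as a server-side dispatch on the requesting device. This is a minimal illustration only; the User-Agent markers and the mapping below are invented for the example and are not part of any W3C specification:

```python
# Hypothetical sketch: choose an output markup family from a few
# illustrative User-Agent substrings. Real content adaptation would
# also consult user preferences and device capability profiles.

DEVICE_MARKUP = {
    "voice-gateway": "VoiceXML",   # telephony gateway fetching dialogs
    "wap": "WML/HDML",             # early wireless handsets
}

def select_markup(user_agent: str) -> str:
    """Return the markup family to render for this client."""
    ua = user_agent.lower()
    for marker, markup in DEVICE_MARKUP.items():
        if marker in ua:
            return markup
    return "XHTML"  # default for desktop browsers

print(select_markup("Mozilla/5.0 (Windows)"))   # XHTML
print(select_markup("Acme-WAP-Browser/1.0"))    # WML/HDML
```

The same application data is thus rendered once per channel, rather than authored separately for each device.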

Page 10: From Voice Browsers to Multimodal Systems


VoiceXML Features

• Menus, forms, sub-dialogs
  – <menu>, <form>, <subdialog>
• Inputs
  – Speech recognition: <grammar>
  – Recording: <record>
  – Keypad: <dtmf>
• Output
  – Audio files: <audio>
  – Text-to-speech
• Variables
  – <var>, <script>
• Events
  – <nomatch>, <noinput>, <help>, <catch>, <throw>
• Transition & submission
  – <goto>, <submit>
• Telephony
  – Call transfer
  – Telephony information
• Platform
  – Objects
• Performance
  – Fetch

Page 11: From Voice Browsers to Multimodal Systems


Example VoiceXML

<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel. Do you want to fly to
      <emphasis> New York </emphasis>
      or
      <emphasis> Washington </emphasis>
    </speak>
  </prompt>

  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>

  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>

Page 12: From Voice Browsers to Multimodal Systems


Example VoiceXML

<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>

Page 13: From Voice Browsers to Multimodal Systems


VoiceXML Implementations

• BeVocal

• General Magic

• HeyAnita

• IBM

• Lucent

• Motorola

• Nuance

• PipeBeach

• SpeechWorks

• Telera

• Tellme

• VoiceGenie

See http://www.w3.org/Voice

These are the companies that asked to be listed on the W3C Voice page.

Page 14: From Voice Browsers to Multimodal Systems


Reusable Components

[Diagram: voice application developers draw on a library of reusable components when writing VoiceXML scripts, which are then executed by the dialog manager.]

Page 15: From Voice Browsers to Multimodal Systems


Reusable Dialog Modules

• Express the application at the task level rather than the interaction level
• Save development time by reusing tried and effective modules
• Increase consistency among applications

Examples include:

Credit card number

Date

Name

Address

Telephone number

Yes/No question

Shopping cart

Order status

Weather

Stock quotes

Sport scores

Word games

Page 16: From Voice Browsers to Multimodal Systems


Speech Grammar ML

• Specifies the words and patterns of words for which a speaker-independent recognizer can listen

• May be specified
  – Inline, as part of a VoiceXML page
  – By reference, stored separately on Web servers

• Three variants: XML, ABNF, N-gram
• Action tags for “semantic processing”

Page 17: From Voice Browsers to Multimodal Systems


Three forms of the Grammar ML

• XML
  – Modeled after the Java Speech Grammar Format
  – Mandatory for Dialog ML interpreters
  – Manually specified by the developer

• Augmented BNF syntax (ABNF)
  – Modeled after the Java Speech Grammar Format
  – Optional for Dialog ML interpreters
  – May be mapped to and from XML grammars
  – Manually specified by the developer

• N-grams
  – Optional for Dialog ML interpreters
  – Used for larger vocabularies
  – Generated statistically

XML form:

<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item> Maine </item>
  </one-of>
</rule>

ABNF form:

public $state = Oregon | Maine

Page 18: From Voice Browsers to Multimodal Systems


Action Tags

• Specify what VoiceXML variables to set when grammar rules are matched to user input

• Based upon subset of ECMAScript

$drink = coke | pepsi | coca cola {"coke"};

// medium is the default if nothing is said
$size = {"medium"} [small | medium | large | regular {"medium"}]

Page 19: From Voice Browsers to Multimodal Systems


N-Gram Language Models

• Give the likelihood of a given word following certain others
• Used as a linguistic model to identify the most likely sequence of words matching the spoken input
• N-grams are computed automatically from a corpus of many inputs
• The N-Gram Markup Language serves as an interchange format, carrying the automatic analysis of words and phrases to a dictation ASR engine
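As a rough illustration of how such models are computed from a corpus — a toy bigram estimator, not the N-Gram Markup Language itself:

```python
from collections import Counter, defaultdict

def bigram_model(corpus):
    """Estimate P(next_word | word) from a list of tokenized sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for w1, w2 in zip(sentence, sentence[1:]):
            counts[w1][w2] += 1
    model = {}
    for w1, followers in counts.items():
        total = sum(followers.values())
        model[w1] = {w2: n / total for w2, n in followers.items()}
    return model

# Tiny illustrative corpus; real models use many thousands of utterances.
corpus = [
    ["show", "me", "flights"],
    ["show", "me", "hotels"],
    ["show", "flights"],
]
model = bigram_model(corpus)
print(model["show"])  # "me" follows "show" in 2 of 3 cases
```

The recognizer combines these probabilities with its acoustic scores to pick the most likely word sequence.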

Page 20: From Voice Browsers to Multimodal Systems


Speech synthesis process

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

IN: Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He plays bass in a blues band. He also likes to fish; last week he caught a 20 lb. bass.

OUT: Doctor Jones lives at one seventy-five Park Drive. He weighs one hundred and seventy-five pounds. He plays “base” in a blues band. He likes to fish; last week he caught a twenty-pound bass.

(Modeled after Sun’s Java Speech Markup Language)

Page 21: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Non-markup behavior: infer structure by automated text analysis.
Markup support: paragraph, sentence.

<paragraph>
  <sentence> This is the first sentence. </sentence>
  <sentence> This is the second sentence. </sentence>
</paragraph>

Page 22: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Non-markup behavior: automatically identify and convert constructs.
Markup support: sayas for dates, times, etc.

Examples

<sayas sub="World Wide Web Consortium"> W3C </sayas>
<sayas type="number:digits"> 175 </sayas>

Page 23: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Non-markup behavior: look up in a pronunciation dictionary.
Markup support: phoneme, sayas.

Example

<phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>

International Phonetic Alphabet (IPA) using character entities

Phonetic alphabets:
• International Phonetic Alphabet
• Worldbet
• X-SAMPA

Page 24: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax.
Markup support: emphasis, break, prosody.

Examples

<emphasis> Hi </emphasis>
<break time="3s"/>
<prosody rate="slow"/>

Prosody element attributes:
• pitch: high, medium, low, default
• contour
• range: high, medium, low, default
• rate: fast, medium, slow, default
• volume: silent, soft, medium, loud, default

Page 25: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Markup support: voice, audio.

Examples

<audio src="laughter.wav"> [laughter] </audio>
<voice age="child"> Mary had a little lamb </voice>

Voice element attributes:
• gender: male, female, neutral
• age: child, teenager, adult, elder, (integer)
• variant: different, (integer)
• name: default, (voice-name)

Page 26: From Voice Browsers to Multimodal Systems


LexiconML – Why?

[Diagram: the voice application developer supplies a pronunciation lexicon —

<lexicon>
  either /iy th r/
  either /ay th r/
</lexicon>

— from which TTS draws a single pronunciation for “either” (/ay th r/), while ASR accepts both variants (/ay th r/ and /iy th r/).]

• Accurate pronunciations are essential in EVERY speech application
• Platform default lexicons do not give 100% coverage of user speech
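The asymmetry in the figure — synthesis picks one pronunciation while recognition should accept all of them — can be sketched as follows. The phoneme strings and function names are illustrative assumptions, not LexiconML itself:

```python
# Toy pronunciation lexicon: multiple pronunciations per orthographic word.
LEXICON = {
    "either": ["iy th r", "ay th r"],  # illustrative phoneme strings
}

def tts_pronunciation(word: str) -> str:
    """Synthesis must commit to one pronunciation (here: first listed)."""
    return LEXICON[word][0]

def asr_matches(word: str, heard_phonemes: str) -> bool:
    """Recognition should accept any listed pronunciation."""
    return heard_phonemes in LEXICON[word]

print(tts_pronunciation("either"))        # iy th r
print(asr_matches("either", "ay th r"))   # True
```

This is why a single lexicon format must serve both engines: one entry, two different consumers.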

Page 27: From Voice Browsers to Multimodal Systems


LexiconML – Key Requirements

• Meets both synthesis and recognition requirements

• Pronunciations for any language (including tonal)
  – Reuse standard alphabets; support for suprasegmentals

• Multiple pronunciations per word

• Alternate orthographies
  – Spelling variations — “colour” and “color”
  – Alternative writing systems — Japanese Kanji and Kana
  – Abbreviations and acronyms — e.g. Dr., BT

• Homophones, e.g. “read” and “reed” (same sound)

• Homographs, e.g. “read” and “read” (same spelling)

Page 28: From Voice Browsers to Multimodal Systems


Interaction Style

• Voice user interfaces needn't be dull
• Choose prompts to reflect an explicit choice of personality
• Introduce variety in prompts rather than always repeating the same thing
• Politeness, helpfulness and a sense of humor
• Target different groups of users, e.g. Gen Y
• Allow users to select a personality (“skin”)

(Personality Demo)

Page 29: From Voice Browsers to Multimodal Systems


Call Control

[Diagram: the voice application developer authors both VoiceXML for the dialog manager and separate call-control instructions; together they mediate the user's telephone session.]

(Call control Demo)

Page 30: From Voice Browsers to Multimodal Systems


Call Control Requirements

• Call management — place an outbound call, conditionally answer an inbound call, send an outbound fax
• Call leg management — create, redirect, interact while on hold
• Conference management — create, join, exit
• Intersession communication — asynchronous events
• Interpreter context — invoke, terminate
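A toy model of the conference-management operations listed above. The class and method names are invented for illustration; the actual call-control markup expresses these operations declaratively:

```python
class Conference:
    """Minimal sketch of conference management: create, join, exit.
    A real implementation would manage media mixing and signaling."""

    def __init__(self):
        self.legs = set()  # identifiers of the call legs currently joined

    def join(self, leg_id: str) -> None:
        self.legs.add(leg_id)

    def exit(self, leg_id: str) -> None:
        self.legs.discard(leg_id)

conf = Conference()          # create
conf.join("caller-1")        # two legs join
conf.join("agent-1")
conf.exit("caller-1")        # one leg exits
print(sorted(conf.legs))     # ['agent-1']
```

The point of the requirement is that these operations be scriptable from the same Web-served application that drives the dialog.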

Page 31: From Voice Browsers to Multimodal Systems


Natural Language Semantics ML

[Diagram: ASR output flows through language understanding and context interpretation; the voice application developer supplies the grammar and semantic tags that map recognized text to NL Semantics.]

Page 32: From Voice Browsers to Multimodal Systems


Natural Language Semantics ML

• Represents semantic interpretations of an utterance
  – Speech
  – Natural language text
  – Other forms (e.g. handwriting, OCR, DTMF)

• Used primarily as an interchange format among voice browser components

• Usually generated automatically and not authored directly by developers

• Goal is to use XForms as a data model

Page 33: From Voice Browsers to Multimodal Systems


Result Interpretation

[Diagram: the NL Semantics ML result structure. The top-level element carries grammar, x-model and xmlns attributes plus a confidence score. Incoming data is classified as nomatch, noinput, or input; input text carries mode, timestamp-start, timestamp-end and confidence attributes. The meaning is expressed via xf:model and xf:instance, with application-specific elements defined by an XForms data model.]

Page 34: From Voice Browsers to Multimodal Systems


What toppings do you have?

<interpretation grammar="http://toppings"
                xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>

Page 35: From Voice Browsers to Multimodal Systems


Richer Natural Language

• Most current voice apps restrict users to keywords or short phrases

• The application does most of the talking

• The alternative is to use open grammars with word spotting and let the user do the talking

• Rules for figuring out what the user said, and why, form the basis for asking the next question

(GM/AskJeeves Demo)
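Word spotting over an open grammar can be sketched as scanning free-form input for the keywords the application cares about, ignoring everything else. This is a deliberately naive illustration; the slot names and keyword lists are assumptions:

```python
def spot(utterance: str, keywords: dict) -> dict:
    """Return slot values for any keywords found anywhere in the input,
    ignoring the surrounding free-form speech."""
    found = {}
    tokens = utterance.lower().split()
    for slot, words in keywords.items():
        for w in words:
            if w in tokens:
                found[slot] = w
    return found

# Hypothetical application vocabulary.
KEYWORDS = {
    "city": ["boston", "seattle"],
    "intent": ["weather", "flights"],
}

print(spot("um I'd like the weather for Seattle please", KEYWORDS))
# {'city': 'seattle', 'intent': 'weather'}
```

The dialog rules then decide, from which slots were filled and which are missing, what question to ask next.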

Page 36: From Voice Browsers to Multimodal Systems


Multimodal = Voice + Displays

• Say which City you want weather for and see the information on your phone

• Say which bands/CDs you want to buy and confirm the choices visually

What is the weather in San Francisco?

I want to place an orderfor “Hotshot” by Shaggy.

Page 37: From Voice Browsers to Multimodal Systems


Multimodal Interaction

• Multimodal applications
  – Voice + display + keypad + stylus, etc.
  – The user is free to switch between voice interaction and use of the display, keypad, clicking and handwriting

• July 2000: published the Multimodal Requirements draft
• Demonstrations of multimodal prototypes at the Paris face-to-face meeting of the Voice Browser WG
• September 2000: joint W3C/WAP Forum workshop on multimodal, Hong Kong
• February 2001: W3C publishes the multimodal Request for Proposals
• Plan to set up a Multimodal Working Group later this year, assuming we get appropriate submission(s)

Page 38: From Voice Browsers to Multimodal Systems


Multimodal Interaction

• Primary market is mobile wireless
  – Cell phones, personal digital assistants and cars

• Timescale is driven by the deployment of 3G networks

• Input modes:
  – Speech, keypads, pointing devices, and electronic ink

• Output modes:
  – Speech, audio, and bitmapped or character-cell displays

• Architecture should allow for both local and remote speech processing

Page 39: From Voice Browsers to Multimodal Systems


Some Ideas …

• Speech-enabling XHTML (and WML) without requiring changes to the markup language
  – A new ECMAScript Speech object?

• Loose coupling of VoiceXML with externally defined pages written in XHTML, SMIL, etc.
  – A turn-driven synchronization protocol based on SIP?

• Distributed speech processing
  – Reduce load on the wireless network and speech servers
  – Increase recognition accuracy in the presence of noise
  – ETSI work on Aurora

• Using pen-based gestures to constrain ASR (click and speak)

W3C is seeking detailed proposals with broad industry support as the basis for chartering a multimodal working group.

Page 40: From Voice Browsers to Multimodal Systems


VoiceXML IP Issues

• Technical work on VoiceXML 2.0 is proceeding well

• Publication of the VoiceXML 2.0 working draft is held up over IP issues (although an internal version is accessible to W3C Members)

• Related specifications for grammar, speech synthesis, natural language semantics, lexicon, and call control have been, or shortly will be, published

• W3C and the VoiceXML Forum management are developing a formal Memorandum of Understanding

• W3C is convening a Patent Advisory Group to recommend an IP policy for re-chartering the Voice Browser Activity
  – Draw inspiration from the IETF, ECTF, ETSI and other bodies; e.g. require all WG members to license essential IP under openly specified RAND terms, with operational criteria for effective terms expressed as exit criteria for the Candidate Recommendation phase. No requirement for advance disclosure of IP

Page 41: From Voice Browsers to Multimodal Systems


Discussion?