
From Voice Browsers to Multimodal Systems


Page 1: From Voice Browsers to Multimodal Systems

1/41 W3C AC/WWW10, Hong Kong, May 2001

From Voice Browsers to Multimodal Systems

Dave Raggett

W3C Lead for Voice/Multimodal

W3C & Openwave

[email protected]

http://www.w3.org/Voice

With thanks to Jim Larson

The W3C Speech Interface Framework

Page 2: From Voice Browsers to Multimodal Systems


Voice – The Natural Interface, available from over a billion phones

• Personal assistant functions
  – Name dialing and search
  – Personal information management
  – Unified messaging (mail, fax & IM)
  – Call screening & call routing

• Voice portals
  – Access to news, information, entertainment, customer service and V-commerce (e.g. find a friend, wine tips, flight info, find a hotel room, buy ringing tones, track a shipment)

• Front-ends for call centers
  – 90% cost savings over human agents
  – Reduced call abandonment rates (IVR)
  – Increased customer satisfaction

(Portal Demo)

Page 3: From Voice Browsers to Multimodal Systems


W3C Voice Browser Working Group
http://www.w3.org/Voice/Group

• Founded May 1999, following a workshop in October 1998

• Mission
  – Prepare and review markup languages to enable Internet-based speech applications

• Has published requirements and specifications for the languages in the W3C Speech Interface Framework

• Is now due to be re-chartered with a clarified IP policy

Page 4: From Voice Browsers to Multimodal Systems


Voice Browser WG Membership

Alcatel, AnyDevice, Ask Jeeves, AT&T, Avaya, BeVocal, Brience, BT, Canon, Cisco, Comverse, Conversay, EDF, France Telecom, General Magic, Hitachi, HP, IBM, Informio, Intel, IsSound, Lernout & Hauspie, Locus Dialogue, Lucent, Microsoft, Milo, Mitre, Motorola, Nokia, Nortel Networks, Nuance, Philips, Openwave, PipeBeach, SpeechHost, SpeechWorks, Sun Microsystems, Telecom Italia, Telera, Tellme, Unisys, Verascape, VoiceGenie, Voxeo, VoxSurf, Yahoo

Page 5: From Voice Browsers to Multimodal Systems


W3C Speech Interface Framework

[Architecture diagram: the user connects through the telephone system to a voice browser whose components — ASR, DTMF tone recognizer, language understanding, context interpretation, dialog manager, media planning, language generation, TTS, prerecorded audio player, lexicon, and call control — draw content from the World Wide Web. The associated markup languages are the Speech Synthesis ML, Speech Recognition Grammar ML, N-gram Grammar ML, Natural Language Semantics ML, VoiceXML 2.0, and Reusable Components.]

Page 6: From Voice Browsers to Multimodal Systems


W3C Speech Interface Framework Published Documents

[Status chart: for each document — Dialog (VoiceXML), Speech Synthesis, Speech Grammar, N-gram, NL Semantics, Reusable Components, Lexicon, and Call Control — progress is tracked through the stages REQ, WD, LC, CR, PR and REC. Requirements drafts appeared between 12-99 and 5-00; working drafts between 1-01 and 4-01, with more due soon.]

Documents available at http://www.w3.org/Voice

Page 7: From Voice Browsers to Multimodal Systems


Voice User Interfaces and VoiceXML

• Why use voice as a user interface?
  – Far more phones than PCs
  – More wireless phones than PCs
  – Hands- and eyes-free operation

• Why do we need a language for specifying voice dialogs?
  – A high-level language simplifies application development
  – Separates the voice interface from the application server
  – Leverages existing Web application development tools

• What does VoiceXML describe?
  – Conversational dialogs: system and user take turns to speak
  – Dialogs based on a form-filling metaphor, plus events and links

• W3C is standardizing VoiceXML based upon the VoiceXML 1.0 submission by AT&T, IBM, Lucent and Motorola

Page 8: From Voice Browsers to Multimodal Systems


VoiceXML Architecture

[Architecture diagram: any phone connects over the PSTN or VoIP to a carrier-hosted VoiceXML gateway, which exchanges speech and DTMF with the caller and fetches VoiceXML, grammars and audio files from a consumer or corporate Web site.]

Brings the power of the Web to Voice

Page 9: From Voice Browsers to Multimodal Systems


Reaching Out to Multiple Channels

[Diagram: an applications database serves XML, images, audio, etc. through a content-adaptation layer — adjusting as needed for each device and user — to XHTML, VoiceXML, and WML/HDML clients.]
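The content-adaptation step described above can be sketched as a server-side dispatch on the requesting device. This is a minimal illustration only; the User-Agent markers and the mapping below are invented for the example and are not part of any W3C specification:

```python
# Hypothetical sketch: choose an output markup family from a few
# illustrative User-Agent substrings. Real content adaptation would
# also consult user preferences and device capability profiles.

DEVICE_MARKUP = {
    "voice-gateway": "VoiceXML",   # telephony gateway fetching dialogs
    "wap": "WML/HDML",             # early wireless handsets
}

def select_markup(user_agent: str) -> str:
    """Return the markup family to render for this client."""
    ua = user_agent.lower()
    for marker, markup in DEVICE_MARKUP.items():
        if marker in ua:
            return markup
    return "XHTML"  # default for desktop browsers

print(select_markup("Mozilla/5.0 (Windows)"))   # XHTML
print(select_markup("Acme-WAP-Browser/1.0"))    # WML/HDML
```

The same application data is thus rendered once per channel, rather than authored separately for each device.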

Page 10: From Voice Browsers to Multimodal Systems


VoiceXML Features

• Menus, forms, sub-dialogs
  – <menu>, <form>, <subdialog>
• Inputs
  – Speech recognition: <grammar>
  – Recording: <record>
  – Keypad: <dtmf>
• Output
  – Audio files: <audio>
  – Text-to-speech
• Variables
  – <var>, <script>
• Events
  – <nomatch>, <noinput>, <help>, <catch>, <throw>
• Transition & submission
  – <goto>, <submit>
• Telephony
  – Call transfer
  – Telephony information
• Platform
  – Objects
• Performance
  – Fetch

Page 11: From Voice Browsers to Multimodal Systems


Example VoiceXML

<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel. Do you want to fly to
      <emphasis> New York </emphasis>
      or
      <emphasis> Washington </emphasis>
    </speak>
  </prompt>

  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>

  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>

Page 12: From Voice Browsers to Multimodal Systems


Example VoiceXML

<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>

Page 13: From Voice Browsers to Multimodal Systems


VoiceXML Implementations

• BeVocal

• General Magic

• HeyAnita

• IBM

• Lucent

• Motorola

• Nuance

• PipeBeach

• SpeechWorks

• Telera

• Tellme

• VoiceGenie

See http://www.w3.org/Voice

These are the companies that asked to be listed on the W3C Voice page.

Page 14: From Voice Browsers to Multimodal Systems


Reusable Components

[Diagram: voice application developers draw on a library of reusable components when writing VoiceXML scripts, which are then executed by the dialog manager.]

Page 15: From Voice Browsers to Multimodal Systems


Reusable Dialog Modules

• Express the application at the task level rather than the interaction level
• Save development time by reusing tried and effective modules
• Increase consistency among applications

Examples include:

Credit card number

Date

Name

Address

Telephone number

Yes/No question

Shopping cart

Order status

Weather

Stock quotes

Sport scores

Word games

Page 16: From Voice Browsers to Multimodal Systems


Speech Grammar ML

• Specifies the words and patterns of words for which a speaker-independent recognizer can listen

• May be specified
  – Inline, as part of a VoiceXML page
  – By reference, stored separately on Web servers

• Three variants: XML, ABNF, N-gram
• Action tags for “semantic processing”

Page 17: From Voice Browsers to Multimodal Systems


Three forms of the Grammar ML

• XML
  – Modeled after the Java Speech Grammar Format
  – Mandatory for Dialog ML interpreters
  – Manually specified by the developer

• Augmented BNF syntax (ABNF)
  – Modeled after the Java Speech Grammar Format
  – Optional for Dialog ML interpreters
  – May be mapped to and from XML grammars
  – Manually specified by the developer

• N-grams
  – Optional for Dialog ML interpreters
  – Used for larger vocabularies
  – Generated statistically

XML form:

<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item> Maine </item>
  </one-of>
</rule>

ABNF form:

public $state = Oregon | Maine

Page 18: From Voice Browsers to Multimodal Systems


Action Tags

• Specify what VoiceXML variables to set when grammar rules are matched to user input

• Based upon subset of ECMAScript

$drink = coke | pepsi | coca cola {"coke"};

// medium is the default if nothing is said
$size = {"medium"} [small | medium | large | regular {"medium"}]

Page 19: From Voice Browsers to Multimodal Systems


N-Gram Language Models

• Give the likelihood of a given word following certain others
• Used as a linguistic model to identify the most likely sequence of words matching the spoken input
• N-grams are computed automatically from a corpus of many inputs
• The N-Gram Markup Language serves as an interchange format, carrying the automatic analysis of words and phrases to a dictation ASR engine
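As a rough illustration of how such models are computed from a corpus — a toy bigram estimator, not the N-Gram Markup Language itself:

```python
from collections import Counter, defaultdict

def bigram_model(corpus):
    """Estimate P(next_word | word) from a list of tokenized sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for w1, w2 in zip(sentence, sentence[1:]):
            counts[w1][w2] += 1
    model = {}
    for w1, followers in counts.items():
        total = sum(followers.values())
        model[w1] = {w2: n / total for w2, n in followers.items()}
    return model

# Tiny illustrative corpus; real models use many thousands of utterances.
corpus = [
    ["show", "me", "flights"],
    ["show", "me", "hotels"],
    ["show", "flights"],
]
model = bigram_model(corpus)
print(model["show"])  # "me" follows "show" in 2 of 3 cases
```

The recognizer combines these probabilities with its acoustic scores to pick the most likely word sequence.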

Page 20: From Voice Browsers to Multimodal Systems


Speech synthesis process

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

IN: Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He plays bass in a blues band. He also likes to fish; last week he caught a 20 lb. bass.

OUT: Doctor Jones lives at one seventy-five Park Drive. He weighs one hundred and seventy-five pounds. He plays “base” in a blues band. He likes to fish; last week he caught a twenty-pound bass.

(Modeled after Sun’s Java Speech Markup Language)

Page 21: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Non-markup behavior: infer structure by automated text analysis.
Markup support: paragraph, sentence.

<paragraph>
  <sentence> This is the first sentence. </sentence>
  <sentence> This is the second sentence. </sentence>
</paragraph>

Page 22: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Non-markup behavior: automatically identify and convert constructs.
Markup support: sayas for dates, times, etc.

Examples

<sayas sub="World Wide Web Consortium"> W3C </sayas>
<sayas type="number:digits"> 175 </sayas>

Page 23: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Non-markup behavior: look up in a pronunciation dictionary.
Markup support: phoneme, sayas.

Example

<phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>

International Phonetic Alphabet (IPA) using character entities

Phonetic alphabets:
• International Phonetic Alphabet
• Worldbet
• X-SAMPA

Page 24: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax.
Markup support: emphasis, break, prosody.

Examples

<emphasis> Hi </emphasis>
<break time="3s"/>
<prosody rate="slow"/>

Prosody element attributes:
• pitch: high, medium, low, default
• contour
• range: high, medium, low, default
• rate: fast, medium, slow, default
• volume: silent, soft, medium, loud, default

Page 25: From Voice Browsers to Multimodal Systems


Speech Synthesis ML

Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Markup support: voice, audio.

Examples

<audio src="laughter.wav"> [laughter] </audio>
<voice age="child"> Mary had a little lamb </voice>

Voice element attributes:
• gender: male, female, neutral
• age: child, teenager, adult, elder, (integer)
• variant: different, (integer)
• name: default, (voice-name)

Page 26: From Voice Browsers to Multimodal Systems


LexiconML – Why?

[Diagram: the voice application developer supplies a pronunciation lexicon —

<lexicon>
  either /iy th r/
  either /ay th r/
</lexicon>

— from which TTS draws a single pronunciation for “either” (/ay th r/), while ASR accepts both variants (/ay th r/ and /iy th r/).]

• Accurate pronunciations are essential in EVERY speech application
• Platform default lexicons do not give 100% coverage of user speech
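The asymmetry in the figure — synthesis picks one pronunciation while recognition should accept all of them — can be sketched as follows. The phoneme strings and function names are illustrative assumptions, not LexiconML itself:

```python
# Toy pronunciation lexicon: multiple pronunciations per orthographic word.
LEXICON = {
    "either": ["iy th r", "ay th r"],  # illustrative phoneme strings
}

def tts_pronunciation(word: str) -> str:
    """Synthesis must commit to one pronunciation (here: first listed)."""
    return LEXICON[word][0]

def asr_matches(word: str, heard_phonemes: str) -> bool:
    """Recognition should accept any listed pronunciation."""
    return heard_phonemes in LEXICON[word]

print(tts_pronunciation("either"))        # iy th r
print(asr_matches("either", "ay th r"))   # True
```

This is why a single lexicon format must serve both engines: one entry, two different consumers.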

Page 27: From Voice Browsers to Multimodal Systems


LexiconML – Key Requirements

• Meets both synthesis and recognition requirements

• Pronunciations for any language (including tonal)
  – Reuse standard alphabets; support for suprasegmentals

• Multiple pronunciations per word

• Alternate orthographies
  – Spelling variations — “colour” and “color”
  – Alternative writing systems — Japanese Kanji and Kana
  – Abbreviations and acronyms — e.g. Dr., BT

• Homophones, e.g. “read” and “reed” (same sound)

• Homographs, e.g. “read” and “read” (same spelling)

Page 28: From Voice Browsers to Multimodal Systems


Interaction Style

• Voice user interfaces needn't be dull
• Choose prompts to reflect an explicit choice of personality
• Introduce variety in prompts rather than always repeating the same thing
• Politeness, helpfulness and a sense of humor
• Target different groups of users, e.g. Gen Y
• Allow users to select a personality (“skin”)

(Personality Demo)

Page 29: From Voice Browsers to Multimodal Systems


Call Control

[Diagram: the voice application developer authors both VoiceXML for the dialog manager and separate call-control instructions; together they mediate the user's telephone session.]

(Call control Demo)

Page 30: From Voice Browsers to Multimodal Systems


Call Control Requirements

• Call management — place an outbound call, conditionally answer an inbound call, send an outbound fax
• Call leg management — create, redirect, interact while on hold
• Conference management — create, join, exit
• Intersession communication — asynchronous events
• Interpreter context — invoke, terminate
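A toy model of the conference-management operations listed above. The class and method names are invented for illustration; the actual call-control markup expresses these operations declaratively:

```python
class Conference:
    """Minimal sketch of conference management: create, join, exit.
    A real implementation would manage media mixing and signaling."""

    def __init__(self):
        self.legs = set()  # identifiers of the call legs currently joined

    def join(self, leg_id: str) -> None:
        self.legs.add(leg_id)

    def exit(self, leg_id: str) -> None:
        self.legs.discard(leg_id)

conf = Conference()          # create
conf.join("caller-1")        # two legs join
conf.join("agent-1")
conf.exit("caller-1")        # one leg exits
print(sorted(conf.legs))     # ['agent-1']
```

The point of the requirement is that these operations be scriptable from the same Web-served application that drives the dialog.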

Page 31: From Voice Browsers to Multimodal Systems


Natural Language Semantics ML

[Diagram: ASR output flows through language understanding and context interpretation; the voice application developer supplies the grammar and semantic tags that map recognized text to NL Semantics.]

Page 32: From Voice Browsers to Multimodal Systems


Natural Language Semantics ML

• Represents semantic interpretations of an utterance
  – Speech
  – Natural language text
  – Other forms (e.g. handwriting, OCR, DTMF)

• Used primarily as an interchange format among voice browser components

• Usually generated automatically and not authored directly by developers

• Goal is to use XForms as a data model

Page 33: From Voice Browsers to Multimodal Systems


Result Interpretation

[Diagram: the NL Semantics ML result structure. The top-level element carries grammar, x-model and xmlns attributes plus a confidence score. Incoming data is classified as nomatch, noinput, or input; input text carries mode, timestamp-start, timestamp-end and confidence attributes. The meaning is expressed via xf:model and xf:instance, with application-specific elements defined by an XForms data model.]

Page 34: From Voice Browsers to Multimodal Systems


What toppings do you have?

<interpretation grammar="http://toppings"
                xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>

Page 35: From Voice Browsers to Multimodal Systems


Richer Natural Language

• Most current voice apps restrict users to keywords or short phrases

• The application does most of the talking

• The alternative is to use open grammars with word spotting and let the user do the talking

• Rules for figuring out what the user said, and why, form the basis for asking the next question

(GM/AskJeeves Demo)
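Word spotting over an open grammar can be sketched as scanning free-form input for the keywords the application cares about, ignoring everything else. This is a deliberately naive illustration; the slot names and keyword lists are assumptions:

```python
def spot(utterance: str, keywords: dict) -> dict:
    """Return slot values for any keywords found anywhere in the input,
    ignoring the surrounding free-form speech."""
    found = {}
    tokens = utterance.lower().split()
    for slot, words in keywords.items():
        for w in words:
            if w in tokens:
                found[slot] = w
    return found

# Hypothetical application vocabulary.
KEYWORDS = {
    "city": ["boston", "seattle"],
    "intent": ["weather", "flights"],
}

print(spot("um I'd like the weather for Seattle please", KEYWORDS))
# {'city': 'seattle', 'intent': 'weather'}
```

The dialog rules then decide, from which slots were filled and which are missing, what question to ask next.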

Page 36: From Voice Browsers to Multimodal Systems


Multimodal = Voice + Displays

• Say which City you want weather for and see the information on your phone

• Say which bands/CDs you want to buy and confirm the choices visually

What is the weather in San Francisco?

I want to place an orderfor “Hotshot” by Shaggy.

Page 37: From Voice Browsers to Multimodal Systems


Multimodal Interaction

• Multimodal applications
  – Voice + display + keypad + stylus, etc.
  – The user is free to switch between voice interaction and use of the display, keypad, clicking and handwriting

• July 2000: published the Multimodal Requirements draft
• Demonstrations of multimodal prototypes at the Paris face-to-face meeting of the Voice Browser WG
• September 2000: joint W3C/WAP Forum workshop on multimodal, Hong Kong
• February 2001: W3C publishes the multimodal Request for Proposals
• Plan to set up a Multimodal Working Group later this year, assuming we get appropriate submission(s)

Page 38: From Voice Browsers to Multimodal Systems


Multimodal Interaction

• Primary market is mobile wireless
  – Cell phones, personal digital assistants and cars

• Timescale is driven by the deployment of 3G networks

• Input modes:
  – Speech, keypads, pointing devices, and electronic ink

• Output modes:
  – Speech, audio, and bitmapped or character-cell displays

• Architecture should allow for both local and remote speech processing

Page 39: From Voice Browsers to Multimodal Systems


Some Ideas …

• Speech-enabling XHTML (and WML) without requiring changes to the markup language
  – A new ECMAScript Speech object?

• Loose coupling of VoiceXML with externally defined pages written in XHTML, SMIL, etc.
  – A turn-driven synchronization protocol based on SIP?

• Distributed speech processing
  – Reduce load on the wireless network and speech servers
  – Increase recognition accuracy in the presence of noise
  – ETSI work on Aurora

• Using pen-based gestures to constrain ASR (click and speak)

W3C is seeking detailed proposals with broad industry support as the basis for chartering a multimodal working group.

Page 40: From Voice Browsers to Multimodal Systems


VoiceXML IP Issues

• Technical work on VoiceXML 2.0 is proceeding well

• Publication of the VoiceXML 2.0 working draft is held up over IP issues (although an internal version is accessible to W3C Members)

• Related specifications for grammar, speech synthesis, natural language semantics, lexicon, and call control have been, or shortly will be, published

• W3C and the VoiceXML Forum management are developing a formal Memorandum of Understanding

• W3C is convening a Patent Advisory Group to recommend an IP policy for re-chartering the Voice Browser Activity
  – Draw inspiration from the IETF, ECTF, ETSI and other bodies; e.g. require all WG members to license essential IP under openly specified RAND terms, with operational criteria for effective terms expressed as exit criteria for the Candidate Recommendation phase. No requirement for advance disclosure of IP

Page 41: From Voice Browsers to Multimodal Systems


Discussion?