1/41 W3C AC/WWW10, Hong Kong, May 2001
From Voice Browsers to Multimodal Systems
Dave Raggett
W3C Lead for Voice/Multimodal
W3C & Openwave
http://www.w3.org/Voice
With thanks to Jim Larson
The W3C Speech Interface Framework
Voice – The Natural Interface
available from over a billion phones
• Personal assistant functions:
  – Name dialing and Search
  – Personal Information Management
  – Unified Messaging (mail, Fax & IM)
  – Call screening & call routing
• Voice Portals
  – Access to news, information, entertainment, customer service and V-commerce (e.g. Find a friend, Wine Tips, Flight info, Find a hotel room, Buy ringing tones, Track a shipment)
• Front-ends for Call Centers
  – 90% cost savings over human agents
  – Reduced call abandonment rates (IVR)
  – Increased customer satisfaction
(Portal Demo)
W3C Voice Browser Working Group
http://www.w3.org/Voice/Group
• Founded: May 1999 following workshop in October 1998
• Mission
  – Prepare and review markup languages to enable Internet-based speech applications
• Has published requirements and specifications for languages in the W3C Speech Interface Framework
• Is now due to be re-chartered with clarified IP policy
Voice Browser WG Membership
Alcatel, AnyDevice, Ask Jeeves, AT&T, Avaya, BeVocal, Brience, BT, Canon, Cisco, Comverse, Conversay, EDF, France Telecom, General Magic,
Hitachi, HP, IBM, Informio, Intel, IsSound, Lernout & Hauspie, Locus Dialogue, Lucent, Microsoft, Milo, Mitre, Motorola, Nokia, Nortel Networks,
Nuance, Philips, Openwave, PipeBeach, SpeechHost, SpeechWorks, Sun Microsystems, Telecom Italia, Telera, Tellme, Unisys, Verascape, VoiceGenie, Voxeo, VoxSurf, Yahoo
W3C Speech Interface Framework
[Figure: block diagram connecting the User, through the Telephone System, to the World Wide Web. Components: ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, Media Planning, Language Generation, TTS, Prerecorded Audio Player, Lexicon, Reusable Components, Call Control. Markup languages: Speech Synthesis ML, Speech Recognition Grammar ML, N-gram Grammar ML, Natural Language Semantics ML, VoiceXML 2.0.]
W3C Speech Interface Framework Published Documents
[Table: publication status by specification. Rows (W3C stages): REQ, WD, WD, LC, CR, PR, REC. Columns: Dialog, Speech Synthesis, Speech Grammar, N-gram, NL Semantics, Reusable Comp'ts, Lexicon, Call Control. Dates shown: 12-99, 5-00, 1-01, 2-01, 4-01, and "Soon".]
Documents available at http://www.w3.org/Voice
Voice User Interfaces and VoiceXML
• Why use voice as a user interface?
  – Far more phones than PCs
  – More wireless phones than PCs
  – Hands and eyes free operation
• Why do we need a language for specifying voice dialogs?
  – High-level language simplifies application development
  – Separates Voice interface from Application server
  – Leverage existing Web application development tools
• What does VoiceXML describe?
  – Conversational dialogs: System and user take turns to speak
  – Dialogs based on form-filling metaphor plus events and links
• W3C is standardizing VoiceXML based upon VoiceXML 1.0 submission by AT&T, IBM, Lucent and Motorola
VoiceXML Architecture
Brings the power of the Web to Voice
[Figure: Any Phone connects over PSTN or VoIP, using Speech + DTMF, to a VoiceXML Gateway (labels: Carrier, Corporation). The gateway retrieves VoiceXML, Grammars and Audio files from a Consumer or Corporate Web site.]
Reaching Out to Multiple Channels
[Figure: Applications and Database supply XML, Images, Audio, …, which Content Adaptation delivers as XHTML, VoiceXML and WML/HDML.]
Content Adaptation: adjust as needed for each device & user
VoiceXML Features
• Menus, Forms, Sub-dialogs
  – <menu>, <form>, <subdialog>
• Inputs
  – Speech Recognition <grammar>
  – Recording <record>
  – Keypad <dtmf>
• Output
  – Audio files <audio>
  – Text-To-Speech
• Variables
  – <var>, <script>
• Events
  – <nomatch>, <noinput>, <help>, <catch>, <throw>
• Transition & submission
  – <goto>, <submit>
• Telephony
  – Call transfer
  – Telephony information
• Platform
  – Objects
• Performance
  – Fetch
Example VoiceXML
<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel. Do you want to fly to
      <emphasis>New York</emphasis>
      or
      <emphasis>Washington</emphasis>
    </speak>
  </prompt>
  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
Example VoiceXML
<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>
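The form-filling behavior of the weather form can be sketched outside VoiceXML. The Python below is a minimal illustration only, not the VoiceXML form interpretation algorithm: `FIELDS`, `run_form`, `say` and `listen` are hypothetical names, speech recognition is replaced by scripted answers, and grammars and the final <submit> are left out.

```python
# Hypothetical stand-ins for the two <field> elements of the weather form.
FIELDS = [
    {"name": "country", "prompt": "What country?",
     "help": "Please say the country for which you want the weather."},
    {"name": "city", "prompt": "What city?",
     "help": "Please say the city for which you want the weather."},
]


def run_form(fields, say, listen):
    """Visit fields in document order until each one has a value."""
    values = {}
    for field in fields:
        while field["name"] not in values:
            say(field["prompt"])
            heard = listen()
            if heard == "help":
                # Plays the role of <catch event="help">, then re-prompts.
                say(field["help"])
            else:
                values[field["name"]] = heard
    return values


def demo():
    """Run the form against scripted user answers (an assumption for testing)."""
    spoken = []
    answers = iter(["help", "France", "Paris"])
    values = run_form(FIELDS, spoken.append, lambda: next(answers))
    return values, spoken
```

Calling `demo()` fills both fields; the returned dictionary corresponds to the `city` and `country` values that the form's namelist would submit.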
VoiceXML Implementations
• BeVocal
• General Magic
• HeyAnita
• IBM
• Lucent
• Motorola
• Nuance
• PipeBeach
• SpeechWorks
• Telera
• Tellme
• VoiceGenie
See http://www.w3.org/Voice
These are the companies who asked to be listed on the W3C Voice page
Reusable Components
[Figure: the Voice Application Developer provides Reusable Components and VoiceXML Scripts to the Dialog Manager.]
Reusable Dialog Modules
• Express application at task level rather than interaction level
• Save development time by reusing tried and effective modules
• Increase consistency among applications
Examples include:
Credit card number
Date
Name
Address
Telephone number
Yes/No question
Shopping cart
Order status
Weather
Stock quotes
Sport scores
Word games
Speech Grammar ML
• Specifies the words and patterns of words for which a speaker-independent recognizer can listen
• May be specified
  – Inline as part of a VoiceXML page
  – Referenced and stored separately on Web servers
• Three variants: XML, ABNF, N-Gram
• Action Tags for “semantic processing”
Three forms of the Grammar ML
• XML
  – Modeled after Java Speech Grammar Format
  – Mandatory for Dialog ML interpreters
  – Manually specified by developer
• Augmented BNF syntax (ABNF)
  – Modeled after Java Speech Grammar Format
  – Optional for Dialog ML interpreters
  – May be mapped to and from XML grammars
  – Manually specified by developer
• N-grams
  – Optional for Dialog ML interpreters
  – Used for larger vocabularies
  – Generated statistically
<rule id="state" scope="public">
<one-of>
<item> Oregon </item>
<item>Maine </item>
</one-of>
</rule>
public $state = Oregon | Maine
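The XML and ABNF forms of the same rule carry the same information, so one can be derived from the other. The Python below is a rough sketch of that mapping for the single `state` rule shown here, assuming a rule that contains only one <one-of> list of plain items; `rule_to_abnf` is a hypothetical helper, not part of any grammar toolkit.

```python
import xml.etree.ElementTree as ET

# The XML form of the rule, copied from the slide.
XML_RULE = """
<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item>Maine </item>
  </one-of>
</rule>
"""


def rule_to_abnf(xml_text):
    """Render a flat <one-of> rule in the ABNF style used on the slide."""
    rule = ET.fromstring(xml_text.strip())
    # Each <item> becomes one alternative, separated by "|".
    alternatives = [item.text.strip() for item in rule.findall("./one-of/item")]
    # Assumption: only a "public" scope is written out as a keyword.
    prefix = "public " if rule.get("scope") == "public" else ""
    return f"{prefix}${rule.get('id')} = {' | '.join(alternatives)}"


abnf_text = rule_to_abnf(XML_RULE)
```

Here `abnf_text` matches the ABNF line on the slide. Nested rules, repeats, and rule references would need a fuller translation.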
Action Tags
• Specify what VoiceXML variables to set when grammar rules are matched to user input
• Based upon subset of ECMAScript
$drink = coke | pepsi | coca cola {"coke"};
// medium is default if nothing said
$size = {"medium"} [small | medium | large | regular {"medium"}]
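The effect of the two tagged rules can be imitated with plain lookup tables. This Python is a sketch under stated assumptions, not the action-tag semantics of the specification: an alternative without a tag is assumed to yield its own text, an out-of-grammar phrase returns `None`, and `drink_value` and `size_value` are hypothetical names.

```python
# Spoken alternative -> value assigned when the rule matches.
# "coca cola" carries the tag {"coke"}; untagged words map to themselves (assumed).
DRINK = {"coke": "coke", "pepsi": "pepsi", "coca cola": "coke"}

# "regular" carries the tag {"medium"}; the other sizes map to themselves (assumed).
SIZE = {"small": "small", "medium": "medium", "large": "large", "regular": "medium"}
SIZE_DEFAULT = "medium"  # the leading {"medium"} tag: used if no size is said


def drink_value(phrase):
    """Return the value for $drink, or None if the phrase is not in the grammar."""
    return DRINK.get(phrase.strip().lower())


def size_value(phrase=None):
    """Return the value for $size; the size words are optional in the rule."""
    if phrase is None or not phrase.strip():
        return SIZE_DEFAULT
    return SIZE.get(phrase.strip().lower())
```

For example, `drink_value("coca cola")` and `drink_value("coke")` both yield "coke", and `size_value()` with nothing said yields "medium".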
N-Gram Language Models
• Likelihood of a given word following certain others
• Used as a linguistic model to identify most likely sequence of words that matches the spoken input
• N-Grams are computed automatically from a corpus of many inputs
• The N-Gram Markup Language is used as an interchange format for supplying the results of automatic analysis of words and phrases to a dictation ASR engine
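As a small illustration of how such likelihoods are computed from a corpus, the Python below estimates bigram (N = 2) probabilities by counting. The three-sentence `CORPUS` is invented for the example, the function names are hypothetical, and no smoothing is applied, so unseen word pairs get probability 0.

```python
from collections import Counter, defaultdict

# A toy corpus, invented for illustration.
CORPUS = [
    "what is the weather in boston",
    "what is the weather in paris",
    "what is the time",
]


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]  # sentence boundary markers
        for previous, word in zip(words, words[1:]):
            counts[previous][word] += 1
    return counts


def probability(counts, previous, word):
    """Maximum-likelihood estimate of P(word | previous)."""
    total = sum(counts[previous].values())
    return counts[previous][word] / total if total else 0.0


def most_likely_next(counts, previous):
    """Return the most frequent follower of `previous`, or None if unseen."""
    followers = counts[previous].most_common(1)
    return followers[0][0] if followers else None


counts = train_bigrams(CORPUS)
```

In this corpus "the" is followed by "weather" twice and "time" once, so `probability(counts, "the", "weather")` is 2/3.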
Speech synthesis process
Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production
IN:
• Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He plays bass in a blues band. He also likes to fish; last week he caught a 20 lb. bass.
OUT:
• Doctor Jones lives at one seventy-five Park Drive. He weighs one hundred and seventy-five pounds. He plays base in a blues band. He also likes to fish; last week he caught a twenty-pound bass.
Modeled after Sun’s Java Speech Markup Language
Speech Synthesis ML
Non-markup behavior: infer structure by automated text analysis
Markup support: paragraph, sentence
<paragraph>
  <sentence>This is the first sentence.</sentence>
  <sentence>This is the second sentence.</sentence>
</paragraph>
Speech Synthesis ML
Non-markup behavior: automatically identify and convert constructs
Markup support: sayas for dates, times, etc.
Examples
<sayas sub="World Wide Web Consortium"> W3C </sayas>
<sayas type="number:digits"> 175 </sayas>
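What a synthesizer might do with the two sayas examples can be sketched in a few lines of Python. This is an illustration under assumptions, not the behavior required by the specification: `say_as_digits` and `say_as_sub` are hypothetical helpers, and only the `number:digits` type and the `sub` attribute from the slide are covered.

```python
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def say_as_digits(content):
    """type="number:digits": read the number one digit at a time."""
    return " ".join(DIGIT_NAMES[int(ch)] for ch in content if ch in "0123456789")


def say_as_sub(content, sub):
    """sub="...": speak the substitute text instead of the element content."""
    return sub
```

With these helpers, the content " 175 " is read as "one seven five" rather than "one hundred and seventy-five", and "W3C" is spoken as "World Wide Web Consortium".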
Speech Synthesis ML
Non-markup behavior: look up in a pronunciation dictionary
Markup support: phoneme, sayas
Example
<phoneme alphabet="ipa" ph="tɒmɑtoʊ"> tomato </phoneme>
International Phonetic Alphabet (IPA) using character entities
Phonetic Alphabets
• International Phonetic Alphabet
• Worldbet
• X-SAMPA
Speech Synthesis ML
Non-markup behavior: automatically generates prosody through analysis of document structure and sentence syntax
Markup support: emphasis, break, prosody
Examples
<emphasis> Hi </emphasis>
<break time="3s"/>
<prosody rate="slow"/>
Prosody element
pitch: high, medium, low, default
contour
range: high, medium, low, default
rate: fast, medium, slow, default
volume: silent, soft, medium, loud, default
Speech Synthesis ML
Markup support: voice, audio
Examples
<audio src="laughter.wav">[laughter]</audio>
<voice age="child"> Mary had a little lamb </voice>
Attributes
gender: male, female, neutral
age: child, teenager, adult, elder, (integer)
variant: different, (integer)
name: default, (voice-name)
LexiconML - Why?
<lexicon>
  either /iy th r/
  either /ay th r/
</lexicon>
[Figure: the Voice Application Developer supplies a Pronunciation Lexicon used by both TTS and ASR. For the word "either", TTS is shown with /ay th r/ and ASR with both /ay th r/ and /iy th r/.]
• Accurate pronunciations are essential in EVERY speech application
• Platform default lexicons do not give 100% coverage of user speech
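The asymmetry between synthesis and recognition can be shown with a tiny lookup table: TTS needs one pronunciation to speak, while ASR should accept any listed variant. The Python below is a sketch only; `LEXICON`, `tts_pronunciation` and `asr_accepts` are hypothetical names, and taking the first entry as the TTS default is an assumption.

```python
# Word -> list of pronunciations, using the phoneme strings from the slide.
LEXICON = {"either": ["iy th r", "ay th r"]}


def tts_pronunciation(word, preferred=None):
    """Pick the one pronunciation TTS will speak (first entry unless overridden)."""
    pronunciations = LEXICON[word.lower()]
    return preferred if preferred in pronunciations else pronunciations[0]


def asr_accepts(word, heard):
    """ASR should match any pronunciation listed for the word."""
    return heard in LEXICON.get(word.lower(), [])
```

A developer who wants TTS to say /ay th r/ passes it as `preferred`, while `asr_accepts` returns True for either variant spoken by the user.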
LexiconML - Key Requirements
• Meets both synthesis and recognition requirements
• Pronunciations for any language (including tonal)
  – reuse standard alphabets, support for suprasegmentals
• Multiple pronunciations per word
• Alternate orthographies
  – Spelling variations: “colour” and “color”
  – Alternative writing systems: Japanese Kanji and Kana
  – Abbreviations and Acronyms, e.g. Dr., BT
• Homophones, e.g. “read” and “reed” (same sound)
• Homographs, e.g. “read” and “read” (same spelling)
Interaction Style
• Voice user interfaces needn't be dull
• Choose prompts to reflect an explicit choice of personality
• Introduce variety in prompts rather than always repeating the same thing
• Politeness, helpfulness and sense of humor
• Target different groups of users, e.g. Gen Y
• Allow users to select personality (skin)
(Personality Demo)
Call Control
[Figure: the Voice Application Developer provides VoiceXML to the Dialog Manager and Call Control markup to the Call Control component, which serve the User.]
(Call control Demo)
Call Control Requirements
• Call management: Place outbound call, conditionally answer inbound call, outbound fax
• Call leg management: Create, redirect, interact while on hold
• Conference management: Create, join, exit
• Intersession communication: Asynchronous events
• Interpreter context: Invoke, terminate
Natural Language Semantics ML
[Figure: ASR passes Text to Language Understanding, which passes NL Semantics to Context Interpretation. The Voice Application Developer supplies the Grammar and semantic tags.]
Natural Language Semantics ML
• Represent semantic interpretations of an utterance
  – Speech
  – Natural language text
  – Other forms (e.g., handwriting, OCR, DTMF)
• Used primarily as an interchange format among voice browser components
• Usually generated automatically and not authored directly by developers
• Goal is to use XForms as a data model
Result Interpretation
[Figure: NL Semantics ML structure. Attribute groups shown: grammar, x-model, xmlns; confidence, grammar, x-model, xmlns; mode, timestamp-start, timestamp-end, confidence. Incoming data: Input containing Text, Nomatch, Noinput, or nested Input. Meaning: xf:model (XForms definition) and xf:instance (application-specific elements defined by the XForms data model).]
What toppings do you have?
<interpretation grammar="http://toppings" xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>
Richer Natural Language
• Most current voice apps restrict users to keywords or short phrases
• The application does most of the talking
• Alternative is to use open grammars with word spotting and let user do the talking
• Rules for figuring out what the user said and why as basis for asking next question
(GM/AskJeeves Demo)
Multimodal = Voice + Displays
• Say which city you want weather for and see the information on your phone
• Say which bands/CDs you want to buy and confirm the choices visually
“What is the weather in San Francisco?”
“I want to place an order for ‘Hotshot’ by Shaggy.”
Multimodal Interaction
• Multimodal applications
  – Voice + Display + Key pad + Stylus etc.
  – User is free to switch between voice interaction and use of display/key pad/clicking/handwriting
• July 2000 – Published Multimodal Requirements Draft
• Demonstrations of Multimodal prototypes at Paris face to face meeting of Voice Browser WG
• Joint W3C/WAP Forum workshop on Multimodal – Hong Kong September 2000
• February 2001 – W3C publishes Multimodal Request for Proposals
• Plan to set up Multimodal Working Group later this year assuming we get appropriate submission(s)
Multimodal Interaction
• Primary market is mobile wireless
  – cell phones, personal digital assistants and cars
• Timescale is driven by deployment of 3G networks
• Input modes:
  – speech, keypads, pointing devices, and electronic ink
• Output modes:
  – speech, audio, and bitmapped or character cell displays
• Architecture should allow for both local and remote speech processing
Some Ideas …
• Speech enabling XHTML (and WML) without requiring changes to markup language
– New ECMAScript Speech Object?
• Loose coupling of VoiceXML with externally defined pages written in XHTML, SMIL, etc.
– Turn-driven synchronization protocol based on SIP?
• Distributed Speech Processing
– Reduce load on wireless network and speech servers
– Increase recognition accuracy in presence of noise
– ETSI work on Aurora
• Using pen-based gestures to constrain ASR (click and speak)
W3C is seeking detailed proposals with broad industry support as basis for chartering multimodal working group
VoiceXML IP Issues
• Technical work on VoiceXML 2.0 is proceeding well
• Publication of VoiceXML 2.0 working draft held up over IP issues (although internal version is accessible to W3C Members)
• Related specifications for grammar, speech synthesis, natural language semantics, lexicon, and call control have been or shortly will be published
• W3C and VoiceXML Forum Management are in process of developing a formal Memorandum of Understanding
• W3C is convening a Patent Advisory Group to recommend IP Policy for re-chartering the Voice Browser Activity
  – Draw inspiration from IETF, ECTF, ETSI and other bodies, e.g. require all WG members to license essential IP under openly specified RAND terms, with operational criteria for effective terms expressed in terms of exit criteria for Candidate Recommendation phase. No requirement for advance disclosure of IP
Discussion?