Introduction and overview. Outline A short history of the field Speech synthesis (TTS) Automatic speech recognition (ASR) Dialog system architectures

Introduction and overview

Outline

• A short history of the field
• Speech synthesis (TTS)
• Automatic speech recognition (ASR)
• Dialog system architectures
• Voice on the Web (perhaps show the Siri video)
• Voice on the Web and W3C Standards
• Relation to linguistic theory
• A brief look at the course plan
• What this course is not about
• What this course could mean to you
• Introduction to lab assignments and platforms
• Designing and developing spoken dialog systems
• Present project
• Give home assignment 1: Call flow design and evaluation
• Present lab assignment 1

A short history of the field

• 1966, Joseph Weizenbaum, Eliza

• Sundial, ATIS, Verbmobil

• AIML NLP system

• VXML ("VoiceXML"), a dialog markup language (primarily for telephony), developed initially by AT&T, then administered by an industry consortium, and finally published as a W3C specification.

• Voxeo, Tropo

Speech synthesis

text → speech

Speech recognition

speech → text (or some semantic representation)

Dialog management

• Finite-state based dialog management

• Frame based (form-based) dialog management

• Information-state based dialog management

• Plan based dialog management
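
To make the first two approaches in the list above concrete, here is a minimal sketch in Python. It is not taken from the course material: the travel slots (origin, destination, date), the prompts, and the names `FSM`, `run_finite_state` and `Frame` are all assumptions chosen for illustration. The finite-state version asks its questions in a fixed order, while the frame-based version accepts slots in any order and only asks for what is still missing.

```python
from dataclasses import dataclass, field

# Finite-state (hypothetical example): each state names the slot it fills,
# the prompt it would play, and the state that follows.
FSM = {
    "ask_origin": ("origin", "Where are you leaving from?", "ask_destination"),
    "ask_destination": ("destination", "Where are you going?", "ask_date"),
    "ask_date": ("date", "What day do you want to travel?", "done"),
}


def run_finite_state(answers):
    """Walk the states in their fixed order, consuming one answer per state."""
    state, filled = "ask_origin", {}
    answers = iter(answers)
    while state != "done":
        slot, _prompt, next_state = FSM[state]
        filled[slot] = next(answers)
        state = next_state
    return filled


@dataclass
class Frame:
    """Frame-based (hypothetical example): slots can be filled in any order."""
    slots: dict = field(
        default_factory=lambda: {"origin": None, "destination": None, "date": None}
    )

    def update(self, parsed):
        # Accept whatever slots the user happened to mention in this turn.
        for name, value in parsed.items():
            if name in self.slots:
                self.slots[name] = value

    def next_prompt(self):
        # Ask only for the first slot that is still empty.
        for name, value in self.slots.items():
            if value is None:
                return f"Please give the {name}."
        return None  # frame complete
```

With the frame, a user who says "from Gothenburg on Friday" fills two slots in one turn, and the system only has to ask for the destination. The finite-state version would have asked all three questions in turn.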

Spoken dialogue system

Why voice

• Wireless devices have small screens and limited input capabilities.
• Telephone keypad can give users only a limited number of choices.
• Speech technology is improving.
• The exchange of information between a person and a computer is becoming more like a real conversation.
• Users want hands-free or eyes-free use.
• From a business viewpoint, voice applications open up a host of new revenue opportunities.
• There exist many more telephones than computers with the potential to access the Internet.

Traditional Interactive Voice Response (IVR)

Speech versus Touch Tone

Applications

• Information providing systems:
– weather reports
– stock quotes
– timetables

• Transaction-based systems:
– calendar functions
– shopping
– financial transactions
– travel reservations

Architecture 1

Architecture 2

Components

• Natural language understanding
– proper name identification
– part of speech tagging
– parser
• dialog manager
• output generator
– natural language generator
– gesture generator
– layout engine
• input recognizer/decoder
– automatic speech recognizer
– gesture recognizer
– handwriting recognizer
• output renderer
– text-to-speech engine
– talking head
– robot or avatar
• multi-modal fusion
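
As a rough illustration of how these components hand data to each other, the sketch below chains toy stand-ins for the recognizer, the understanding component, the dialog manager, the output generator and the renderer. Everything in it is an assumption made for the example: the function names, the keyword table and the canned replies are hypothetical, and a real system would replace each function with an actual engine.

```python
# Hypothetical keyword table standing in for a real NLU component.
KEYWORDS = {
    "weather": "weather report",
    "stock": "stock quotes",
    "timetable": "timetable",
}


def recognize(audio):
    """Stand-in for the speech recognizer: the 'audio' is already a transcript."""
    return audio.lower().strip()


def understand(text):
    """Toy understanding step: keyword spotting instead of tagging and parsing."""
    for keyword, topic in KEYWORDS.items():
        if keyword in text:
            return {"intent": "request", "topic": topic}
    return {"intent": "unknown"}


def manage(meaning):
    """Dialog manager: decide on the next system act."""
    if meaning["intent"] == "unknown":
        return {"act": "ask_repeat"}
    return {"act": "inform", "topic": meaning["topic"]}


def generate(act):
    """Output generator: turn the system act into a sentence."""
    if act["act"] == "ask_repeat":
        return "Sorry, could you rephrase that?"
    return f"Here is the {act['topic']} you asked for."


def render(text):
    """Stand-in for the text-to-speech engine: returns what would be spoken."""
    return text


def respond(audio):
    """Run one user turn through the whole pipeline."""
    return render(generate(manage(understand(recognize(audio)))))
```

Calling `respond("What is the WEATHER like?")` passes the input through all five stages. Only the dialog manager decides what to do, while the other stages convert between representations.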

Types of systems

• by modality
– text-based
– spoken dialog system
– graphical user interface
– multi-modal

• by device
– telephone-based systems
– PDA systems
– in-car systems
– robot systems
– desktop/laptop systems

• native
• in-browser systems
• in-virtual machine
– in-virtual environment
– robots

• by style
– command-based
– menu-driven
– natural language
– speech graffiti

• by initiative
– system initiative
– user initiative
– mixed initiative

• by application
– information service
– command-and-control
– entertainment
– education/tutorial
– edutainment
– reminder systems
– companion systems
– healthcare
– eldercare
– assistive/access systems

Mobile voice apps

• Voice on the Web
• http://www.youtube.com/watch?v=OURZpqh-35A&eurl=&feature=player_embedded

Relation to other fields

• Phonetics
• Phonology
• Syntax
• Semantics
• Pragmatics

• spoken language understanding

• psycholinguistics
• human communication
• discourse analysis

• human-computer interaction
• computational linguistics
• NL-parsing
• NL-generation
• language modeling
• multi-modal fusion
• multi-modal fission
• psychology
• cognitive science
• affective dialog
• user modeling
• embodied communication

A brief look at the course plan

What this course is not about

• Sophisticated dialog management
• Multi-modal systems
• Non-spoken dialog systems

What this course could mean to you

• Will prepare you for writing a thesis in the area of dialog systems (if you so choose)

• Will prepare you for work in the industry
• A link to the LinkedIn page

Is this something for a linguist?

Roles in the process

• Dialog designer
• VoiceXML programmer
• Voice talent
• Grammar writer
• TTS specialist
• Speech recognition specialist
• Quality assurance specialist
• Server specialist
• Manager

Who are the big players in the area?

• Google
– http://googleblog.blogspot.com/2010/12/can-we-talk-better-speech-technology.html
• Microsoft
– http://gigaom.com/2010/12/06/microsoft-claims-its-place-in-a-voice-enabled-world/
• Apple
– http://www.dailyfinance.com/story/company-news/apples-siri-purchase-heats-up-the-race-toward-a-voice-activated/19458344/
• IBM
– http://www.ibm.com/news/in/en/2010/08/20/a896686u56875f96.html
• Nuance
– http://gigaom.com/2011/01/19/nuance-releases-mobile-sdk-to-speechify-apps/
• Voxeo
• AT&T

• The Emergence of Speech as a Mobile Platform
• Market Trends
• Speech-Enabled Mobile Apps Gaining Acceptance
– Voice Control in a Mission-Critical Environment
– Search Engine for Audio-Visual Content
– Instantaneous Language Translation
– IBM's Spoken Web

• What's Driving Speech as a Mobile Platform?
– Mobile Devices and Peripherals
– Cloud Computing
– Open Technologies
– Mashups and the Programmable Web
– Legislation

• Closing the (Mobile) Digital Divide
• An Overview of Emerging SaaP Applications
• Current Speech-Equipped Devices are Merely the Tip of the Iceberg
• SaaP Enables New Application Interaction
– Spoken Alerts
– Mobile Reminders
– Synthesized Speech
– Email and Text Messages
– Speech-to-Text for Voicemail

• SaaP Enables Voice User Interfaces
• Speech Recognition: The Foundation of Speech-Enabled Apps
• Constrained vs. Natural Language Processing
• Automated vs. Hybrid Speech Recognition
• Applications for Speech Recognition
– Speaker Authentication
– Email and Text Messages Composition
– Launch and Control Mobile Apps

• Special Case: Voice Activation

Call flow and call flow diagrams

Evaluating speech and dialog technology

W3C Speech Standards

Torbjörn Lager

The big picture

Web browsers

HTML

Web servers

The place of speech technology

• … speech technology itself has a very long way to go. … the most important thing may turn out to be not the speech technology itself, but the way in which speech technology connects to all the other technologies.

Tim Berners-Lee

The big picture

Web servers

VoiceXML browser (ASR, TTS)

VoiceXML

HTML browser

HTML
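
The parallel in this picture is that a web server can hand out VoiceXML documents to a VoiceXML browser in the same way it hands out HTML to an HTML browser. As a hedged sketch of what such a document might look like when generated on the server side, the Python below builds a one-prompt VoiceXML 2.0 page with the standard library. The function name `hello_vxml` and the prompt text are assumptions made for the example. A real application would serve the resulting string over HTTP to the voice browser.

```python
import xml.etree.ElementTree as ET

# Namespace of VoiceXML 2.0 documents.
VXML_NS = "http://www.w3.org/2001/vxml"


def hello_vxml(prompt_text):
    """Build a minimal VoiceXML document that speaks a single prompt."""
    # Serialize with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", VXML_NS)
    root = ET.Element(f"{{{VXML_NS}}}vxml", {"version": "2.0"})
    form = ET.SubElement(root, f"{{{VXML_NS}}}form")
    block = ET.SubElement(form, f"{{{VXML_NS}}}block")
    prompt = ET.SubElement(block, f"{{{VXML_NS}}}prompt")
    prompt.text = prompt_text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(hello_vxml("Welcome to the timetable service."))
```

The VoiceXML browser would render the `prompt` element with its TTS engine, just as an HTML browser renders a paragraph on screen.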

The What and Why of Standards

• Software standards include terminology, languages and protocols specified by committees of experts for widespread use in the software industry. Software standards have both advantages and disadvantages.

• Advantages:
– developers can create applications using the standard languages that are portable across a variety of platforms;
– products from different vendors are able to interact with each other;
– a community of experts evolves around the standard and is available to develop products and services based on the standard.

• Disadvantages:
– some developers feel that standards may inhibit creativity and stall the introduction of superior technology.

• However, in the area of speech, vendors are enthusiastic about standards and frequently complain that standards are not developed fast enough.

• Emerging speech-technology standards could give a boost to an industry hampered by proprietary software and hardware.

World Wide Web Consortium

http://www.w3.org/

W3C Speech Standards

• Speech Recognition Grammar Specification (SRGS)
– What the user can say

• Semantic Interpretation for Speech Recognition (SISR)
– What the user means

• Speech Synthesis Markup Language (SSML)
– What the user hears

• Pronunciation Lexicon Specification (PLS)
– How words are pronounced
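
To give a feel for two of these standards, the sketch below reads a small SRGS grammar ("what the user can say") and wraps a reply in SSML ("what the user hears"). It is an illustration under assumptions, not course material: the city grammar, the half-second pause and the helper names `allowed_phrases` and `ssml` are invented for the example. Only the element names and namespaces come from the W3C specifications.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A hypothetical SRGS grammar (XML form) with one rule listing three cities.
SRGS = """<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" root="city" mode="voice">
  <rule id="city">
    <one-of>
      <item>Gothenburg</item>
      <item>Stockholm</item>
      <item>Malmo</item>
    </one-of>
  </rule>
</grammar>"""


def allowed_phrases(srgs_xml):
    """List the alternatives of the root rule, i.e. what the user can say."""
    ns = {"g": "http://www.w3.org/2001/06/grammar"}
    root = ET.fromstring(srgs_xml)
    rule = root.find(f"g:rule[@id='{root.get('root')}']", ns)
    return [item.text.strip() for item in rule.findall("g:one-of/g:item", ns)]


def ssml(text, lang="en-US"):
    """Wrap text in a minimal SSML document followed by a short pause."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{escape(text)}<break time="500ms"/></speak>'
    )


if __name__ == "__main__":
    print(allowed_phrases(SRGS))
    print(ssml("Your train leaves at nine."))
```

This helper only handles a flat `one-of` list. Real grammars also use rule references, repeats and SISR `tag` elements, which this sketch ignores.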

Intro to XML

• Standard for storage and transportation of data

• Maintained by W3C (w3.org/TR/REC-xml)
• Elements and tags
• Well-formedness
• Validity
• DTD
• Editor (TextMate + XMLmate)
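
Well-formedness can be checked mechanically: a document is well-formed if an XML parser accepts it. The following is a minimal sketch using Python's standard library. The helper name `is_well_formed` is an assumption for the example. Note that this parser does not validate, so it says nothing about validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<prompt>Hello <emphasis>world</emphasis></prompt>"))
    # The next document is rejected: <emphasis> is never closed.
    print(is_well_formed("<prompt>Hello <emphasis>world</prompt>"))
```

Checking validity would require a validating parser and the DTD (or schema) that defines which elements and attributes are allowed.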

Speech synthesis

Speech synthesis

text → speech

Parameters: lang, voice, persona

A peek inside the black box

• http://www.explainthatstuff.com/how-speech-synthesis-works.html
