34
Speech Processing 11-492/18-492 Speech Processing 11-492/18-492 Spoken Dialog Systems Conversing with machines

Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

  • Upload
    others

  • View
    15

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Speech Processing 11-492/18-492Speech Processing 11-492/18-492

Spoken Dialog SystemsConversing with machines

Page 2: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Spoken Dialog SystemsSpoken Dialog Systems

Not just ASR bolted onto TTSNot just ASR bolted onto TTS Different styles of interactionDifferent styles of interaction

Question/response systemsQuestion/response systems Mixed initiative systemsMixed initiative systems ““How May I Help You?” open questionsHow May I Help You?” open questions True conversational machine-human interactionTrue conversational machine-human interaction

Page 3: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

SDS OverviewSDS Overview

IntroductionIntroduction Building simple dialog systemsBuilding simple dialog systems VoiceXMLVoiceXML

A language for writing systemsA language for writing systems Beyond tree-based systemsBeyond tree-based systems Beyond spoken languageBeyond spoken language Non-task-oriented systemsNon-task-oriented systems Real-world deployment considerationsReal-world deployment considerations

Page 4: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

SDS ApplicationsSDS Applications

Information giving/requestInformation giving/request Flights, buses, stocks and weatherFlights, buses, stocks and weather Driving directionsDriving directions Answer questions, newsAnswer questions, news

TransactionalTransactional Reply your emailReply your email Credit card and bank enquiries, product purchaseCredit card and bank enquiries, product purchase

MaintenanceMaintenance Technical supportTechnical support Customer serviceCustomer service

Page 5: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog
Page 6: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

SDS ApplicationsSDS Applications

EntertainmentEntertainment Game characters (NPC), toys, robotsGame characters (NPC), toys, robots

TutoringTutoring Math, scienceMath, science Language learningLanguage learning

Health careHealth care Depression screeningDepression screening Aphasia therapy Aphasia therapy

Page 7: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Dialog TypesDialog Types System initiativeSystem initiative

Form-filling paradigmForm-filling paradigm Can switch language models at each turnCan switch language models at each turn Can “know” which is likely to be saidCan “know” which is likely to be said

Mixed initiativeMixed initiative Users can go where they likeUsers can go where they like System or user can lead the discussionSystem or user can lead the discussion

Classifying:Classifying: Users can say what they likeUsers can say what they like But really only “N” operations possibleBut really only “N” operations possible E.g. AT&T? “How may I help you?”E.g. AT&T? “How may I help you?”

Page 8: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

System Initiative System Initiative

Most commonMost common Machine controls the callMachine controls the call Few choices in the dialogFew choices in the dialog

Simple form filling:Simple form filling: What is your bank account numberWhat is your bank account number

Advantages:Advantages: You know what users will say (sort of)You know what users will say (sort of) Hard for user to get confusedHard for user to get confused Hard for system to get confusedHard for system to get confused Easy to buildEasy to build

Disadvantages:Disadvantages: Limited flexibility in interactionLimited flexibility in interaction Fixed dialog structureFixed dialog structure

Most reliable, but many turnsMost reliable, but many turns

Page 9: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

System InitiativeSystem Initiative

Let’s Go Bus InformationLet’s Go Bus Information 412 268 3526 (Anytime)412 268 3526 (Anytime) Provides bus information for PittsburghProvides bus information for Pittsburgh

Tell MeTell Me Company getting others to build systemsCompany getting others to build systems Stocks, weather, entertainmentStocks, weather, entertainment

Page 10: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Mixed InitiativeMixed Initiative

User or system takes initiativeUser or system takes initiative More interesting dialogsMore interesting dialogs ““jump” through different parts of dialog statejump” through different parts of dialog state

AdvantagesAdvantages More realistic dialog More realistic dialog Can do more complex tasksCan do more complex tasks

DisadvantagesDisadvantages Can get confusingCan get confusing Can miss important partsCan miss important parts

Page 11: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Classification DialogsClassification Dialogs

Sort out from N thingsSort out from N things User says “anything” and system directs themUser says “anything” and system directs them ReceptionistReceptionist

I have a problem with my billI have a problem with my bill What’s the area code for MiamiWhat’s the area code for Miami Did you know I can see the beach from hereDid you know I can see the beach from here

AdvantagesAdvantages (Apparently) complex understanding(Apparently) complex understanding Solves a very common taskSolves a very common task

DisadvantagesDisadvantages Actually quite restrictiveActually quite restrictive Needs data to train fromNeeds data to train from Needs to be updated Needs to be updated

Page 12: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Beyond TelephonesBeyond Telephones

TelematicsTelematics Voice communication in carsVoice communication in cars CPS, music selection etc CPS, music selection etc

Web-based dialog systemsWeb-based dialog systems Robot InteractionRobot Interaction

Robot-robot and robot-human interactionRobot-robot and robot-human interaction Animated talking headAnimated talking head

Non-player characters – web agentsNon-player characters – web agents Speech to Speech translationSpeech to Speech translation CMU Skylar: integrating many dialog CMU Skylar: integrating many dialog

systemssystems

Page 13: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Team TalkTeam Talk

Using speech to control multiple robotsUsing speech to control multiple robots Robots have names and distinct voicesRobots have names and distinct voices They report to each other and to you in voiceThey report to each other and to you in voice

Page 14: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Other SDS Other SDS

Microsoft: Situated InteractionMicrosoft: Situated Interaction Talking Head that follows youTalking Head that follows you

CMU SV: AidasCMU SV: Aidas Restaurant recommendations in situRestaurant recommendations in situ

Page 15: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

True conversationTrue conversation

Requires more than just speechRequires more than just speech Non-verbal noises: laughing, er, um, etcNon-verbal noises: laughing, er, um, etc Eye gazeEye gaze Proper timing (not waiting 500ms before Proper timing (not waiting 500ms before

speaker)speaker) Back-channelingBack-channeling MovementMovement Talking about nothingTalking about nothing

Page 16: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

RoboreceptionistRoboreceptionist

Entrance to NSHEntrance to NSH Keyboard (no ASR)Keyboard (no ASR) TTS, face, movementTTS, face, movement Range finder to detect peopleRange finder to detect people Significant background Significant background

charactercharacter Mostly talks about nothingMostly talks about nothing

Page 17: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Personal Intelligent SystemsPersonal Intelligent Systems

Example: Apple Siri, Google Now, Microsoft Example: Apple Siri, Google Now, Microsoft Cortana, Amazon Echo, etc.Cortana, Amazon Echo, etc.

Hub of all applicationsHub of all applications ExtendableExtendable PersonalizationPersonalization Cross-LanguageCross-Language Cross-CulturalCross-Cultural Future: interface-> true companionFuture: interface-> true companion

Page 18: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Speech Processing 11-492/18-492Speech Processing 11-492/18-492

Spoken Dialog SystemsSDS components

Page 19: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Spoken Dialog SystemsSpoken Dialog Systems

More than just ASR and TTSMore than just ASR and TTS RecognitionRecognition Language understandingLanguage understanding Manipulation of utterancesManipulation of utterances Generation of new informationGeneration of new information Text generationText generation SynthesisSynthesis

Page 20: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

SDS ArchitectureSDS Architecture

Language Generation

ASRLanguage

Understanding

Synthesis

Dialog Manager

Error Handling Strategies

Non Understanding

Page 21: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

SDS InternalsSDS Internals

Language UnderstandingLanguage Understanding From words to structureFrom words to structure

Dialog ManagerDialog Manager State of dialog (who is talking)State of dialog (who is talking) Direction of dialog (what next)Direction of dialog (what next) References, user profile etcReferences, user profile etc Interaction of database/internetInteraction of database/internet

Language GenerationLanguage Generation From structure to wordsFrom structure to words

Page 22: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Language UnderstandingLanguage Understanding

Parsing of SPEECH not TEXTParsing of SPEECH not TEXT Eh, I wanna go, wanna go to Boston tomorrowEh, I wanna go, wanna go to Boston tomorrow If its not too much trouble I’d be very grateful if If its not too much trouble I’d be very grateful if

one might be able to aid me in arranging my one might be able to aid me in arranging my travel arrangements to Boston, Logan airport, travel arrangements to Boston, Logan airport, at sometime tomorrow morning, thank you.at sometime tomorrow morning, thank you.

Boston, tomorrowBoston, tomorrow

Page 23: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Parsing: Output structureParsing: Output structure

““I wanna go to Boston, tomorrow”I wanna go to Boston, tomorrow” Destination: BOSDestination: BOS Departure: 20081028, AMDeparture: 20081028, AM Airline: unspecifeeAirline: unspecifee Special: unspecifeeSpecial: unspecifee

Convert speech to structureConvert speech to structure Sufficient for further processing/querySufficient for further processing/query

Page 24: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Interaction ExampleInteraction Example

Intelligent Agent

Cheap Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose? I can help you go there.

find a cheap eating place oor taiwanese oood

User

Page 25: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

SDS ProcessSDS Process

find a cheap eating place oor taiwanese oood

User

target

ooodpriceAMOD

NN

seeking

PREP_FORIntelligent

Agent

Page 26: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

SDS ProcessSDS Process

User

target

ooodpriceAMOD

NN

seeking

PREP_FOR

Organized Domain Knowledge

Intelligent Agent

Ontology Induction

(semantic slot)

find a cheap eating place oor taiwanese oood

Page 27: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

SDS ProcessSDS Process

User

target

ooodpriceAMOD

NN

seeking

PREP_FOR

Organized Domain Knowledge

Intelligent Agent

Ontology Induction

(semantic slot)

Structure Learning

(inter-slot relation)

find a cheap eating place oor taiwanese oood

Page 28: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

SDS ProcessSDS Process

User

target

ooodpriceAMOD

NN

seeking

PREP_FORIntelligent

Agent

seeking=“find”target=“eating place”price=“cheap”oood=“taiwanese”

find a cheap eating place oor taiwanese oood

Page 29: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

find a cheap eating place oor taiwanese oood

SDS ProcessSDS Process

User

target

ooodpriceAMOD

NN

seeking

PREP_FORIntelligent

Agent

seeking=“find”target=“eating place”price=“cheap”oood=“taiwanese”

Semantic Decoding

Page 30: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Automatic Slot IneuctionAutomatic Slot Ineuction

Chen et al. ASRU’13Chen et al. ASRU’13

can i have a cheap restaurant

Frame: capability

Frame: expensiveness

Frame: locale by use

Domain

DomainGeneral

slot candidate

32

can i have a cheap restaurant

Page 31: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Parsing vs Language ModelParsing vs Language Model

Language ModelLanguage Model Model what actually gets saidModel what actually gets said

Parsing Parsing Extract the information you wantExtract the information you want

Models *can* be sharedModels *can* be shared Only accept things in the grammarOnly accept things in the grammar Can be over limitingCan be over limiting

Page 32: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Neural Networks for SLUNeural Networks for SLU

RNN for Slot FillingRNN for Slot Filling

Step 1: word embedding Step 1: word embedding Step 2: short-term dependencies capturingStep 2: short-term dependencies capturing Step 3: long-term dependencies capturingStep 3: long-term dependencies capturing Step 4: different types of neural architectureStep 4: different types of neural architecture

http://deeplearning.net/tutorial/rnnslu.html#rnnsluhttp://deeplearning.net/tutorial/rnnslu.html#rnnslu

Mesnil et al. 2013

Page 33: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Interactive Learning for SLUInteractive Learning for SLU

Luis : Interactive machine learning for Luis : Interactive machine learning for language understandinglanguage understanding

Advantages: Advantages: Non-expert could add in knowledge in Non-expert could add in knowledge in feature engineeringfeature engineeringActive-learning reduces heavy labelingActive-learning reduces heavy labeling

https://www.luis.ai/ https://www.luis.ai/

Williams et al. 2016

Page 34: Speech Processing 11-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/sds_intro_comp.pdfSDS Internals Language Understanding From words to structure Dialog Manager State of dialog

Dialog ManagerDialog Manager

Maintain stateMaintain state Where are we in the dialogWhere are we in the dialog Whose turn is itWhose turn is it

Waiting for speakerWaiting for speaker Waiting for database query (stall user)Waiting for database query (stall user)

Deal with barge-inDeal with barge-in