Application of Speech Recognition, Synthesis, Dialog


© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann, Carnegie Mellon

Speech for communication

• The difference between speech and language
• Speech recognition and speech understanding


Speech recognition can only identify words

System does not know what you want

System does not know who you are


Speech and Audio Processing

• Signal processing:
  • Convert the audio wave into a sequence of feature vectors
• Speech recognition:
  • Decode the sequence of feature vectors into a sequence of words
• Semantic interpretation:
  • Determine the meaning of the recognized words
• Dialog management:
  • Correct errors and help get the task done
• Response generation:
  • What words to use to maximize user understanding
• Speech synthesis:
  • Generate synthetic speech from a ‘marked-up’ word string


Data Flow

[Diagram: Signal Processing, Speech Recognition, Semantic Interpretation, Discourse Interpretation, Dialog Management, Response Generation, Speech Synthesis; the components are grouped into Part I and Part II]


Semantic Interpretation: Word Strings

• Content is just words
  • System: What is your address?
  • User: My address is fourteen eleven main street
• Need concept extraction / keyword(s) spotting
• Applications
  • template filling
  • directory services
  • information retrieval


Semantic Interpretation: Pattern-Based

• Simple (typically regular) patterns specify content
• ATIS (Air Travel Information System) task:
  • System: What are your travel plans?
  • User: [On Monday], I’m going [from Boston] [to San Francisco].
  • Content: [DATE=Monday, ORIGIN=Boston, DESTINATION=SFO]
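The pattern-based extraction above can be sketched with ordinary regular expressions. This is only an illustration: the day and city lists, the slot patterns, and the function name `extract_slots` are assumptions, and the sketch returns city names rather than airport codes such as SFO.

```python
import re

# Hypothetical vocabularies, chosen only to cover the slide's example.
DAYS = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
CITY = r"(?:San Francisco|Boston|Pittsburgh)"

# One regular pattern per slot (slot names follow the slide).
PATTERNS = {
    "DATE": re.compile(rf"\bon ({DAYS})\b", re.IGNORECASE),
    "ORIGIN": re.compile(rf"\bfrom ({CITY})\b", re.IGNORECASE),
    "DESTINATION": re.compile(rf"\bto ({CITY})\b", re.IGNORECASE),
}


def extract_slots(utterance):
    """Return the slots whose pattern matches; unmatched slots are omitted."""
    slots = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[name] = match.group(1)
    return slots


print(extract_slots("On Monday, I'm going from Boston to San Francisco."))
```

Because each slot is matched independently, an utterance that mentions only some of the slots still yields a partial result.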


Robustness and Partial Success

• Controlled Speech
  • limited task vocabulary; limited task grammar
• Spontaneous Speech
  • Can have high out-of-vocabulary (OOV) rate
  • Includes restarts, word fragments, omissions, phrase fragments, disagreements, and other disfluencies
  • Contains much grammatical variation
  • Causes high word error-rate in recognizer
• Interpretation is often partial, allowing:
  • omission
  • parsing fragments
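Word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between the recognizer output and a reference transcript, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization; the function name is illustrative:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)
```

Since insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.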


Speech Dialog Management


Discourse & Dialog Processing

• Discourse interpretation:
  • Understand what the user really intends by interpreting utterances in context
• Dialog management:
  • Determine system goals in response to user utterances based on user intention
• Response generation:
  • Generate natural language utterances to achieve the selected goals


Discourse Interpretation

• Goal: understand what the user really intends
• Example: Can you move it?
  • What does “it” refer to?
  • Is the utterance intended as a simple yes-no query or a request to perform an action?
• Issues addressed:
  • Reference resolution
  • Intention recognition
• Interpret user utterances in context


Reference Resolution

U: Where is A Bug’s Life playing in Monroeville?
S: A Bug’s Life is playing at the Carmike theater.
U: When is it playing there?
S: It’s playing at 2pm, 5pm, and 8pm.
U: I’d like 1 adult and 2 children for the first show. How much would that be?

• Knowledge sources:
  • Domain knowledge
  • Discourse knowledge
  • World knowledge


Reference Resolution

• Focus stacks:
  • Maintain recent objects in stack
  • Select objects that satisfy semantic/pragmatic constraints starting from top of stack
  • Take into account discourse structure
• Rule-based filtering & ranking of objects for pronoun resolution
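A focus stack can be sketched as a list of recently mentioned objects searched from the most recent one downward. The class, the entity dictionaries, and the type-based constraints below are assumptions made for illustration; discourse structure is not modeled.

```python
class FocusStack:
    """Recently mentioned entities, most recent on top."""

    def __init__(self):
        self._items = []  # the end of the list is the top of the stack

    def mention(self, entity):
        # Re-mentioning an entity moves it back to the top.
        if entity in self._items:
            self._items.remove(entity)
        self._items.append(entity)

    def resolve(self, constraint):
        """Return the topmost entity satisfying the constraint, or None."""
        for entity in reversed(self._items):
            if constraint(entity):
                return entity
        return None


# Mentions in the order they occur in the movie dialog.
stack = FocusStack()
stack.mention({"name": "A Bug's Life", "type": "movie"})
stack.mention({"name": "Monroeville", "type": "city"})
stack.mention({"name": "A Bug's Life", "type": "movie"})
stack.mention({"name": "Carmike theater", "type": "theater"})

# "When is it playing there?"
it = stack.resolve(lambda e: e["type"] == "movie")
there = stack.resolve(lambda e: e["type"] in ("theater", "city"))
print(it["name"], "/", there["name"])
```

Here the semantic constraint is reduced to an entity type: "it" must be something that can be playing, and "there" must be a place.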


Dialog Management: Motivating Example

S: Would you like movie showtime or theater playlist information?
U: Movie showtime.
S: What movie do you want showtime information about?
U: Saving Private Ryan.
S: At what theater do you want to see Saving Private Ryan?
U: Carmike.
S: Saving Private Ryan is not playing at the Carmike theater.


Interacting with the user

[Diagram: a dialog manager connected to three domain agents]

• Guide interaction through task
• Map user inputs and system state into actions
• Interact with back-end(s)
• Interpret information using domain knowledge


Dialog Management

• Goal: determine what to accomplish in response to user utterances, e.g.:
  • Answer user question
  • Solicit further information
  • Confirm/Clarify user utterance
  • Notify invalid query
  • Notify invalid query and suggest alternative
• Interface between user/language processing components and system knowledge base


Graph-based systems

Welcome to Bank ABC! Please say one of the following:

Balance, Hours, Loan, ...

What type of loan are you interested in? Please say one of the following:

Mortgage, Car, Personal, ...

. . . .
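A graph-based dialog can be sketched as a table of nodes, each with a prompt and keyword-triggered transitions. The node names and the `next_state` function are hypothetical, only the menu choices shown on the slide are included, and the nodes reached from them are omitted.

```python
# Each node has a prompt and a keyword -> next-node table.
GRAPH = {
    "main": {
        "prompt": "Welcome to Bank ABC! Please say one of the following: "
                  "Balance, Hours, Loan",
        "transitions": {"balance": "balance", "hours": "hours", "loan": "loan"},
    },
    "loan": {
        "prompt": "What type of loan are you interested in? Please say one of "
                  "the following: Mortgage, Car, Personal",
        "transitions": {
            "mortgage": "mortgage",
            "car": "car_loan",
            "personal": "personal_loan",
        },
    },
}


def next_state(state, utterance):
    """Follow the first transition whose keyword occurs in the utterance.

    If no keyword is recognized, stay in the same node so the prompt
    can be repeated.
    """
    words = utterance.lower().split()
    for keyword, target in GRAPH[state]["transitions"].items():
        if keyword in words:
            return target
    return state


state = next_state("main", "I would like a loan please")
print(GRAPH[state]["prompt"])
```

The system keeps the initiative throughout: the user can only move along the edges that the current node offers.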


Frame-based systems

[Figure: a form (frame) with several named slots, each still blank]
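The slide shows only a blank form, so the following is a sketch under the usual reading of a frame-based system: the user may fill slots in any order, and the system asks about whichever slot is still empty. The `Frame` class, the slot names, and the prompts are all assumptions.

```python
class Frame:
    """A form whose slots can be filled in any order."""

    def __init__(self, slot_prompts):
        self.prompts = dict(slot_prompts)  # slot name -> question to ask
        self.values = {}                   # slot name -> filled value

    def update(self, extracted):
        """Store extracted values for slots that belong to this frame."""
        for slot, value in extracted.items():
            if slot in self.prompts:
                self.values[slot] = value

    def next_prompt(self):
        """Return the question for the first unfilled slot, or None if done."""
        for slot, prompt in self.prompts.items():
            if slot not in self.values:
                return prompt
        return None


frame = Frame({
    "ORIGIN": "Where are you leaving from?",
    "DESTINATION": "Where are you going?",
    "DATE": "What day are you traveling?",
})
# One utterance may fill several slots at once.
frame.update({"DESTINATION": "San Francisco", "DATE": "Monday"})
print(frame.next_prompt())
```

Unlike a graph, the frame does not fix the order of questions; the order emerges from what the user has already supplied.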










Transition on keyword or phrase


Application Task Complexity

• Examples:

[Figure: example applications placed on a scale from Simple to Complex: Call Routing, Travel Planning, Weather Information, ATIS, Automatic Banking, University Course Advising]

• Directly affects:
  • Types and quantity of system knowledge
  • Complexity of system’s reasoning abilities


Dialog Complexity

• Determines what can be talked about:
  • The task only
  • Subdialog: e.g., clarification, confirmation
  • The dialog itself: meta-dialog
    • Could you hold on for a minute?
    • What was that click? Did you hear it?
• Determines who can talk about them:
  • System only
  • User only
  • Both participants


Dialogue Management: Process

• Determines how the system will go about selecting among the possible goals
• At the dialogue level, determined by system designer in terms of initiative strategies:
  • System-initiative: system always has control, user only responds to system questions
  • User-initiative: user always has control, system passively answers user questions
  • Mixed-initiative: control switches between system and user using fixed rules
  • Variable-initiative: control switches between system and user dynamically based on participant roles, dialogue history, etc.


Response Generation

U: Is Saving Private Ryan playing at the Chatham cinema?

S: No, it’s not.

• S provides elliptical response

S: No, Saving Private Ryan is not playing at the Chatham cinema.

• S provides full response (which provides grounding information)

S: No, Saving Private Ryan is not playing at the Chatham cinema; the theater’s under renovation.

• S provides full response and supporting evidence


Communicating with the user

[Diagram: Language Generator, Speech Synthesizer, Display Generator, Action Generator]

• Decide what to say to user (and how to phrase it)
• Construct sounds and intonation


Response Generation

• Goal: generate natural language utterances to achieve goal(s) selected by the dialogue manager
• Issues:
  • Content selection: determining what to say
  • Surface realization: determining how to say it
• Generation gap: discrepancy between the actual output of the content selection process and the expected input of the surface realization process


Language generation

• Template-based systems
  • Sentence templates with variables
• “Linguistic” systems
  • Generate surface from meaning representation
• Stochastic approaches
  • Statistical models of domain-expert speech
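The template-based approach is the easiest of the three to illustrate: each dialog act maps to a sentence template with variables. A minimal sketch using Python's standard `string.Template`; the act names and template wording are assumptions modeled on the movie examples in these slides.

```python
from string import Template

# Hypothetical dialog acts mapped to sentence templates with variables.
TEMPLATES = {
    "not_playing": Template("No, $movie is not playing at the $theater."),
    "showtimes": Template("$movie is playing at $times."),
}


def generate(act, **slots):
    """Fill the template for a dialog act; raises KeyError if a slot is missing."""
    return TEMPLATES[act].substitute(**slots)


print(generate("not_playing",
               movie="Saving Private Ryan",
               theater="Chatham cinema"))
```

Content selection happens before this step (choosing the act and the slot values); the template handles only surface realization.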


Dialog Evaluation

• Goal: determine how “well” a dialogue system performs
• Main difficulties:
  • No strict right or wrong answers
  • Difficult to determine what features make a dialogue system better than another
  • Difficult to select metrics that contribute to the overall “goodness” of the system
  • Difficult to determine how the metrics compensate for one another
  • Expensive to collect new data for evaluating incremental improvement of systems


Dialog Evaluation (Cont’d)

• System-initiative, explicit confirmation
  • better task success rate
  • lower WER
  • longer dialogs
  • fewer recovery subdialogs
  • less natural
• Mixed-initiative, no confirmation
  • lower task success rate
  • higher WER
  • shorter dialogs
  • more recovery subdialogs
  • more natural


Speech Synthesis


Speech Synthesis (Text-to-Speech, TTS)

• Prior knowledge
  • Vocabulary from words to sounds; surface markup
• Recorded prompts
• Formant synthesis
  • Model vocal tract as source and filters
• Concatenative synthesis
  • Record and segment expert’s voice
  • Splice appropriate units into full utterances
• Intonation modeling


Recorded Prompts

• The simplest (and most common) solution is to record prompts spoken by a (trained) human
• Produces human quality voice
• Limited by number of prompts that can be recorded
• Can be extended by limited cut-and-paste or template filling


The Source-Filter Model of Formant Synthesis

• Model of features to be extracted and fitted
• Excitation or Voicing Source(s) to model sound source
  • standard wave of glottal pulses for voiced sounds
  • randomly varying noise for unvoiced sounds
  • modification of airflow due to lips, etc.
  • high frequency (F0 rate), quasi-periodic, choppy
  • modeled with vector of glottal waveform patterns in voiced regions
• Acoustic Filter(s)
  • shapes the frequency character of vocal tract and radiation character at the lips
  • relatively slow (samples around 5ms suffice) and stationary
  • modeled with LPC (linear predictive coding)
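The source-filter idea can be sketched in a few lines: an excitation signal (a pulse train for voiced sounds, noise for unvoiced sounds) is passed through an all-pole filter of the kind LPC estimates. This is a toy illustration, not a full formant synthesizer; the function names, the single resonance, and the sample values (8 kHz sampling, 100 Hz F0, a 500 Hz formant) are assumptions.

```python
import math
import random


def excitation(n_samples, sample_rate, f0, voiced, seed=0):
    """Source: impulse train at F0 for voiced sounds, white noise otherwise."""
    if voiced:
        period = int(round(sample_rate / f0))
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator_coeffs(center_hz, bandwidth_hz, sample_rate):
    """Feedback coefficients of a two-pole resonance (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * center_hz / sample_rate
    return [2.0 * r * math.cos(theta), -r * r]


def all_pole_filter(source, coeffs):
    """Filter: y[n] = x[n] + sum_k coeffs[k] * y[n-1-k] (LPC-style all-pole)."""
    out = []
    for n, x in enumerate(source):
        y = x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]
        out.append(y)
    return out


# 50 ms of a voiced sound with one formant.
voiced = all_pole_filter(
    excitation(400, 8000, 100.0, voiced=True),
    resonator_coeffs(500.0, 80.0, 8000),
)
```

In practice several resonances are combined and the filter coefficients are updated every few milliseconds, which is the "relatively slow" variation the slide mentions.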


Concatenative Synthesis

• Record basic inventory of sounds
• Retrieve appropriate sequence of units at run time
• Concatenate and adjust durations and pitch
• Synthesize waveform


Diphone and Polyphone Synthesis

• Phone sequences capture co-articulation
• Cut speech in positions that minimize context contamination
• Need single phones, diphones and sometimes triphones
• Reduce number collected by
  • phonotactic constraints
  • collapsing in cases of no co-articulation
• Data Collection Methods
  • Collect data from a single (professional) speaker
  • Select text with maximal coverage (typically with greedy algorithm), or
  • Record minimal pairs in desired contexts (real words or nonsense)
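The greedy selection of recording text can be sketched as a set-cover loop: repeatedly pick the sentence that adds the most diphones not yet covered. The function names and the toy phone sequences are assumptions; a real system would obtain phone sequences from a pronunciation lexicon.

```python
def diphones(phones):
    """Set of adjacent phone pairs in a phone sequence."""
    return set(zip(phones, phones[1:]))


def greedy_select(sentences):
    """Greedy set cover over diphones.

    sentences maps a sentence id to its phone sequence. Returns the chosen
    ids in selection order and the set of diphones they cover.
    """
    remaining = dict(sentences)
    covered = set()
    chosen = []
    while remaining:
        # Sentence contributing the most diphones not covered so far.
        best = max(remaining,
                   key=lambda s: len(diphones(remaining[s]) - covered))
        gain = diphones(remaining[best]) - covered
        if not gain:
            break  # nothing left adds new coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Toy corpus with made-up phone sequences.
corpus = {
    "s1": ["a", "b", "c", "d"],
    "s2": ["a", "b"],
    "s3": ["c", "d", "e"],
}
chosen, covered = greedy_select(corpus)
print(chosen, len(covered))
```

Greedy selection is not guaranteed to find the smallest script, but it is simple and keeps the amount of speech to record small.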


Signal Processing for Concatenative Synthesis

• Diphones recorded in one context must be generated in other contexts
• Features are extracted from recorded units
• Signal processing manipulates features to smooth boundaries where units are concatenated
• Signal processing modifies signal via ‘interpolation’
  • intonation
  • duration
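As a simple stand-in for boundary smoothing, the sketch below joins two units with a linear cross-fade over a short overlap. This is not the feature-domain method the slide refers to; it interpolates raw waveform samples, and the function name and overlap length are assumptions.

```python
def crossfade_concat(left, right, overlap):
    """Join two sample sequences, linearly blending `overlap` samples."""
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both units")
    if overlap == 0:
        return list(left) + list(right)

    head = list(left[:-overlap])   # part of the left unit kept unchanged
    tail = list(right[overlap:])   # part of the right unit kept unchanged
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight rises from left toward right
        l_sample = left[len(left) - overlap + i]
        blended.append((1.0 - w) * l_sample + w * right[i])
    return head + blended + tail


print(crossfade_concat([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 1))
```

The joined signal is shorter than the two units together by the overlap length, so durations must be adjusted accordingly.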


Intonation in Bell Labs TTS

• Generate a sequence of F0 targets for synthesis
• Example:
  • We were away a year ago.
  • phones: w E w R & w A & y E r & g O

source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998


What you can do with Speech Recognition

• Transcription
  • dictation, information retrieval
• Command and control
  • data entry, device control, navigation, call routing
• Information access
  • airline schedules, stock quotes, directory assistance
• Problem solving
  • travel planning, logistics


Human-machine interface is critical

Speech recognition is NOT the core function of most applications

Errorful recognition is a fact of life

Speech is a feature of applications that offers specific advantages


Properties of Recognizers

• Speaker Independent vs. Speaker Dependent
• Large Vocabulary (2K-200K words) vs. Limited Vocabulary (2-200)
• Continuous vs. Discrete
• Speech Recognition vs. Speech Verification
• Real Time vs. multiples of real time
• Spontaneous Speech vs. Read Speech
• Noisy Environment vs. Quiet Environment
• High Resolution Microphone vs. Telephone vs. Cellphone
• Push-and-hold vs. push-to-talk vs. always-listening
• Adapt to speaker vs. non-adaptive
• Low vs. High Latency
• With online incremental results vs. final results
• Dialog Management


Speech Recognition vs. Touch Tone

• Shorter calls
• Choices mean something
• Automate more tasks
• Reduces annoying operations
• Available


Transcription and Dictation

• Transcription is transforming a stream of human speech into computer-readable form
  • Medical reports, court proceedings, notes
  • Indexing (e.g., broadcasts)
• Dictation is the interactive composition of text
  • Report, correspondence, etc.


SpeechWear

• Vehicle inspection task
  • USMC mechanics, fixed inspection form
  • Wearable computer (COTS components)
  • html-based task representation
• film clip: http://www.speech.cs.cmu.edu/Video/SpeechWear.rm


Speech recognition and understanding

• Sphinx system
  • speaker-independent
  • continuous speech
  • large vocabulary
• ATIS system
  • air travel information retrieval
  • context management
• film clip (1994): http://www.speech.cs.cmu.edu/Video/CSR_ATIS.rm


Sample Market: Call Centers

Automate services, lower payroll

Shorten time on hold

Shorten agent and client call time

Reduce fraud

Improve customer service


Interface guidelines

• State transparency
• Input control
• Error recovery
• Error detection
• Error correction
• Log performance
• Application integration


Applications related to Speech Recognition

• Speech Recognition: Figure out what a person is saying.
• Speaker Verification: Authenticate that a person is who she/he claims to be.
  • Limited speech patterns
• Speaker Identification: Assigns an identity to the voice of an unknown person.
  • Arbitrary speech patterns
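The difference between verification and identification can be sketched as two decision rules over the same voice representation: verification compares a sample against one claimed identity and applies a threshold, while identification picks the best match among all enrolled speakers. Representing a voice as a fixed-length vector, scoring with cosine similarity, and the threshold of 0.8 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def verify(sample, claimed_id, enrolled, threshold=0.8):
    """Verification: accept or reject a single claimed identity."""
    return cosine(sample, enrolled[claimed_id]) >= threshold


def identify(sample, enrolled):
    """Identification: pick the closest enrolled speaker."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))


# Toy voiceprints (hypothetical two-dimensional vectors).
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
sample = [0.9, 0.1]
print(identify(sample, enrolled), verify(sample, "bob", enrolled))
```

Verification is a yes/no decision whose cost does not grow with the number of enrolled speakers, whereas identification must compare against every one of them.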


Three Types of Security

• What You Have: key, card, token
• What You Know: password, PIN, maiden name
• Who You Are

Stronger Authentication


Family Tree: Voice Biometrics

[Tree diagram. Speech processing nodes: Speech Processing, Digitized Speech, Speech Synthesis, Speech Recognition, Speaker Verification, Speaker Identification (branches labeled Input and Output). Biometrics nodes: Voice Biometrics, Signature Verification, Typing Dynamics, Face Recognition, Finger Geometry, Fingerprinting, Hand Geometry, Iris/Retina Scan, DNA, …]


Carnegie Mellon Speech Demos

• CMU Communicator: http://www.speech.cs.cmu.edu/Communicator/
  • Call: 1-877-CMU-PLAN (268-7526), also 268-5144, or x8-1084
  • the information is accurate; you can use it for your own travel planning…
• CMU Universal Speech Interface (USI): http://www.speech.cs.cmu.edu/usi/
• CMU Movie Line: http://www.speech.cs.cmu.edu/Movieline/
  • Seems to be about apartments now…
  • Call: (412) 268-1185


Telephone Demos

• Nuance http://www.nuance.com
  • Banking: 1-650-847-7438
  • Travel Planning: 1-650-847-7427
  • Stock Quotes: 1-650-847-7423
• SpeechWorks http://www.speechworks.com/demos/demos.htm
  • Banking: 1-888-729-3366
  • Stock Trading: 1-800-786-2571
• MIT Spoken Language Systems Laboratory http://www.sls.lcs.mit.edu/sls/whatwedo/applications.html
  • Travel Plans (Pegasus): 1-877-648-8255
  • Weather (Jupiter): 1-888-573-8255
• IBM http://www-3.ibm.com/software/speech/
  • Mutual Funds, Name Dialing: 1-877-VIA-VOICE


Questions?
