Application of Speech Recognition, Synthesis, Dialog


Page 1: Application of Speech Recognition, Synthesis, Dialog

Application of Speech Recognition, Synthesis, Dialog

Page 2: Application of Speech Recognition, Synthesis, Dialog

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 2 Carnegie Mellon

Speech for communication

• The difference between speech and language
• Speech recognition and speech understanding

Page 3: Application of Speech Recognition, Synthesis, Dialog


Speech recognition can only identify words

System does not know what you want

System does not know who you are

Page 4: Application of Speech Recognition, Synthesis, Dialog

Speech and Audio Processing

• Signal processing:
  • Convert the audio wave into a sequence of feature vectors
• Speech recognition:
  • Decode the sequence of feature vectors into a sequence of words
• Semantic interpretation:
  • Determine the meaning of the recognized words
• Dialog management:
  • Correct errors and help get the task done
• Response generation:
  • What words to use to maximize user understanding
• Speech synthesis:
  • Generate synthetic speech from a ‘marked-up’ word string

Page 5: Application of Speech Recognition, Synthesis, Dialog

Data Flow

[Figure: data-flow diagram connecting Signal Processing, Speech Recognition, Semantic Interpretation, Discourse Interpretation, Dialog Management, Response Generation, and Speech Synthesis, divided into Part I and Part II]

Page 6: Application of Speech Recognition, Synthesis, Dialog

Semantic Interpretation: Word Strings

• Content is just words
  • System: What is your address?
  • User: My address is fourteen eleven main street
• Need concept extraction / keyword(s) spotting
• Applications
  • template filling
  • directory services
  • information retrieval
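The address exchange above can be sketched as keyword spotting followed by template filling. This is a minimal illustration, not the method used in the slides: the function name `extract_address`, the small number-word tables, and the `STREET_TYPES` set are all assumptions made for the example.

```python
# Illustrative subset of spoken number words; a real grammar would cover more.
UNITS = ("zero one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
NUMBER_WORDS = {word: value for value, word in enumerate(UNITS)}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
STREET_TYPES = {"street", "avenue", "road", "drive"}  # hypothetical keyword list


def extract_address(utterance):
    """Fill a {number, street} template from a recognized word string."""
    tokens = utterance.lower().split()
    i = 0
    # Keyword spotting: skip filler words until the first number word.
    while i < len(tokens) and tokens[i] not in NUMBER_WORDS and tokens[i] not in TENS:
        i += 1
    digits = ""
    while i < len(tokens):
        tok = tokens[i]
        if tok in TENS:
            value = TENS[tok]
            # "twenty one" -> 21: absorb a following unit word 1-9.
            if i + 1 < len(tokens) and 1 <= NUMBER_WORDS.get(tokens[i + 1], 0) <= 9:
                value += NUMBER_WORDS[tokens[i + 1]]
                i += 1
            digits += str(value)
        elif tok in NUMBER_WORDS:
            digits += str(NUMBER_WORDS[tok])
        else:
            break
        i += 1
    # The street name runs up to and including a street-type keyword.
    street = []
    for tok in tokens[i:]:
        street.append(tok)
        if tok in STREET_TYPES:
            break
    else:
        street = []  # no street-type keyword found
    if not digits or not street:
        return None
    return {"number": digits, "street": " ".join(street)}


print(extract_address("My address is fourteen eleven main street"))
```

The recognizer only delivers the words; the template (house number plus street) is what turns "fourteen eleven" into the single concept 1411.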

Page 7: Application of Speech Recognition, Synthesis, Dialog

Semantic Interpretation: Pattern-Based

• Simple (typically regular) patterns specify content
• ATIS (Air Travel Information System) Task:
  • System: What are your travel plans?
  • User: [On Monday], I’m going [from Boston] [to San Francisco].
  • Content: [DATE=Monday, ORIGIN=Boston, DESTINATION=SFO]
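A pattern-based interpreter for the travel example can be sketched with regular expressions. The slot names follow the slide; the patterns themselves, the `extract_slots` name, and the choice to return city names rather than airport codes are assumptions for illustration.

```python
import re

DAYS = "monday|tuesday|wednesday|thursday|friday|saturday|sunday"
# A slot value ends at the next slot keyword, at punctuation, or at end of input.
END = r"(?= to\b| from\b| on\b|[.,]|$)"
PATTERNS = {
    "DATE": re.compile(rf"\bon ({DAYS})\b"),
    "ORIGIN": re.compile(rf"\bfrom ([a-z ]+?){END}"),
    "DESTINATION": re.compile(rf"\bto ([a-z ]+?){END}"),
}


def extract_slots(utterance):
    """Return whichever slots the patterns can find; missing slots are omitted."""
    text = utterance.lower()
    slots = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            slots[name] = match.group(1).title()
    return slots


print(extract_slots("On Monday, I'm going from Boston to San Francisco."))
print(extract_slots("uh to denver"))  # a fragment still yields a partial result
```

Because each pattern is matched independently, a fragmentary utterance fills only the slots it mentions. Patterns this simple are easy to fool (for example by "I want to go to Denver"), which is one reason practical systems use larger grammars.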

Page 8: Application of Speech Recognition, Synthesis, Dialog

Robustness and Partial Success

• Controlled Speech
  • limited task vocabulary; limited task grammar
• Spontaneous Speech
  • Can have high out-of-vocabulary (OOV) rate
  • Includes restarts, word fragments, omissions, phrase fragments, disagreements, and other disfluencies
  • Contains much grammatical variation
  • Causes high word error-rate in recognizer
• Interpretation is often partial, allowing:
  • omission
  • parsing fragments

Page 9: Application of Speech Recognition, Synthesis, Dialog

Speech Dialog Management

Page 10: Application of Speech Recognition, Synthesis, Dialog

Discourse & Dialog Processing

• Discourse interpretation:
  • Understand what the user really intends by interpreting utterances in context
• Dialog management:
  • Determine system goals in response to user utterances based on user intention
• Response generation:
  • Generate natural language utterances to achieve the selected goals

Page 11: Application of Speech Recognition, Synthesis, Dialog

Discourse Interpretation

• Goal: understand what the user really intends
• Example: Can you move it?
  • What does “it” refer to?
  • Is the utterance intended as a simple yes-no query or a request to perform an action?
• Issues addressed:
  • Reference resolution
  • Intention recognition
• Interpret user utterances in context

Page 12: Application of Speech Recognition, Synthesis, Dialog

Reference Resolution

U: Where is A Bug’s Life playing in Monroeville?
S: A Bug’s Life is playing at the Carmike theater.
U: When is it playing there?
S: It’s playing at 2pm, 5pm, and 8pm.
U: I’d like 1 adult and 2 children for the first show. How much would that be?

• Knowledge sources:
  • Domain knowledge
  • Discourse knowledge
  • World knowledge

Page 13: Application of Speech Recognition, Synthesis, Dialog

Reference Resolution

• Focus stacks:
  • Maintain recent objects in stack
  • Select objects that satisfy semantic/pragmatic constraints starting from top of stack
  • Take into account discourse structure
• Rule-based filtering & ranking of objects for pronoun resolution
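A focus stack can be sketched in a few lines. The class below is a simplified illustration: the `FocusStack` name and the single `kind` feature used as the semantic constraint are assumptions, and discourse structure is ignored.

```python
class FocusStack:
    """Keeps recently mentioned entities, most recent on top."""

    def __init__(self):
        self._items = []

    def push(self, name, **features):
        self._items.append({"name": name, **features})

    def resolve(self, **constraints):
        # Search from the top of the stack for the first entity
        # that satisfies every constraint.
        for entity in reversed(self._items):
            if all(entity.get(key) == value for key, value in constraints.items()):
                return entity["name"]
        return None


stack = FocusStack()
stack.push("Monroeville", kind="place")
stack.push("A Bug's Life", kind="movie")
stack.push("Carmike theater", kind="place")

# "When is it playing there?"
print(stack.resolve(kind="movie"))  # candidate referent for "it"
print(stack.resolve(kind="place"))  # candidate referent for "there"
```

Searching from the top means "there" resolves to the most recently mentioned place, the Carmike theater, rather than Monroeville.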

Page 14: Application of Speech Recognition, Synthesis, Dialog

Dialog Management: Motivating Example

S: Would you like movie showtime or theater playlist information?
U: Movie showtime.
S: What movie do you want showtime information about?
U: Saving Private Ryan.
S: At what theater do you want to see Saving Private Ryan?
U: Carmike.
S: Saving Private Ryan is not playing at the Carmike theater.

Page 15: Application of Speech Recognition, Synthesis, Dialog

Interacting with the user

[Figure: a dialog manager connected to several domain agents]

• Guide interaction through task
• Map user inputs and system state into actions
• Interact with back-end(s)
• Interpret information using domain knowledge

Page 16: Application of Speech Recognition, Synthesis, Dialog

Dialog Management

• Goal: determine what to accomplish in response to user utterances, e.g.:
  • Answer user question
  • Solicit further information
  • Confirm/Clarify user utterance
  • Notify invalid query
  • Notify invalid query and suggest alternative
• Interface between user/language processing components and system knowledge base
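The goal selection listed above can be sketched as a function from the interpreted query and the knowledge base to an action. Everything named here is hypothetical: the `SHOWTIMES` data, the "Showcase" theater, the action labels, and the order of the checks. The confirm/clarify goal is left out for brevity.

```python
# Hypothetical knowledge base: movie -> theater -> showtimes.
SHOWTIMES = {
    "A Bug's Life": {"Carmike": ["2pm", "5pm", "8pm"]},
    "Saving Private Ryan": {"Showcase": ["7pm"]},
}


def choose_action(query, showtimes=SHOWTIMES):
    """Pick a system goal for a (possibly incomplete) showtime query."""
    movie, theater = query.get("movie"), query.get("theater")
    if movie is None:
        return ("solicit", "movie")
    if movie not in showtimes:
        return ("notify_invalid", movie)
    if theater is None:
        return ("solicit", "theater")
    times = showtimes[movie].get(theater)
    if times:
        return ("answer", times)
    alternatives = sorted(showtimes[movie])
    if alternatives:
        return ("notify_invalid_and_suggest", alternatives)
    return ("notify_invalid", theater)


print(choose_action({"movie": "Saving Private Ryan", "theater": "Carmike"}))
```

For the motivating example, this sketch goes one step further than the system in that dialog: instead of only reporting that the movie is not playing at the Carmike, it offers a theater where it is.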

Page 17: Application of Speech Recognition, Synthesis, Dialog

Graph-based systems

Welcome to Bank ABC! Please say one of the following:

Balance, Hours, Loan, ...

What type of loan are you interested in? Please say one of the following:

Mortgage, Car, Personal, ...

. . . .
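A graph-based dialog like the bank menu above can be sketched as a table of states, each with a prompt and keyword-labeled transitions. The state names and the `step` function are assumptions for illustration; only the prompts come from the slide.

```python
# Each node has a prompt and outgoing edges labeled with keywords.
GRAPH = {
    "welcome": {
        "prompt": "Welcome to Bank ABC! Please say one of the following: "
                  "Balance, Hours, Loan",
        "next": {"balance": "balance", "hours": "hours", "loan": "loan"},
    },
    "loan": {
        "prompt": "What type of loan are you interested in? Please say one of "
                  "the following: Mortgage, Car, Personal",
        "next": {"mortgage": "mortgage", "car": "car_loan", "personal": "personal_loan"},
    },
}


def step(state, utterance):
    """Follow the first edge whose keyword occurs in the utterance."""
    words = utterance.lower().split()
    edges = GRAPH.get(state, {}).get("next", {})
    for keyword, target in edges.items():
        if keyword in words:
            return target
    return state  # no keyword matched: stay here and re-prompt


state = step("welcome", "I need a loan")
print(state, "->", step(state, "car please"))
```

The system keeps the initiative throughout: the user can only move along edges that the designer drew.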

Page 18: Application of Speech Recognition, Synthesis, Dialog

Frame-based systems

[Figure: several frames, each a form of labeled slots to be filled; the slot labels in the original figure are placeholder text]

Transition on keyword or phrase
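A frame can be sketched as a set of named slots plus a question for each one. The `Frame` class and its slot names are hypothetical; the point is that slots may be filled in any order, and the system asks only for what is still missing.

```python
class Frame:
    """A form with named slots; the dialog ends when all are filled."""

    def __init__(self, prompts):
        self.prompts = prompts  # slot name -> question to ask
        self.values = {slot: None for slot in prompts}

    def update(self, extracted):
        # Accept any known slot the user happened to mention.
        for slot, value in extracted.items():
            if slot in self.values and value is not None:
                self.values[slot] = value

    def next_prompt(self):
        for slot, question in self.prompts.items():
            if self.values[slot] is None:
                return question
        return None  # frame complete


frame = Frame({
    "movie": "What movie do you want showtime information about?",
    "theater": "At what theater do you want to see it?",
})
frame.update({"theater": "Carmike"})  # user volunteers the theater first
print(frame.next_prompt())
```

Compared with a fixed graph, the order of questions is driven by which slots are empty rather than by a predetermined path.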

Page 19: Application of Speech Recognition, Synthesis, Dialog

Application Task Complexity

• Examples:

[Figure: example applications arranged along an axis from Simple to Complex: Call Routing, Travel Planning, Weather Information, ATIS, Automatic Banking, University Course Advising]

• Directly affects:
  • Types and quantity of system knowledge
  • Complexity of system’s reasoning abilities

Page 20: Application of Speech Recognition, Synthesis, Dialog

Dialog Complexity

• Determines what can be talked about:
  • The task only
  • Subdialog: e.g., clarification, confirmation
  • The dialog itself: meta-dialog
    • Could you hold on for a minute?
    • What was that click? Did you hear it?
• Determines who can talk about them:
  • System only
  • User only
  • Both participants

Page 21: Application of Speech Recognition, Synthesis, Dialog

Dialogue Management: Process

• Determines how the system will go about selecting among the possible goals
• At the dialogue level, determined by system designer in terms of initiative strategies:
  • System-initiative: system always has control, user only responds to system questions
  • User-initiative: user always has control, system passively answers user questions
  • Mixed-initiative: control switches between system and user using fixed rules
  • Variable-initiative: control switches between system and user dynamically based on participant roles, dialogue history, etc.

Page 22: Application of Speech Recognition, Synthesis, Dialog

Response Generation

U: Is Saving Private Ryan playing at the Chatham cinema?

Page 23: Application of Speech Recognition, Synthesis, Dialog

S: No, it’s not.

• S provides elliptical response

Page 24: Application of Speech Recognition, Synthesis, Dialog

S: No, Saving Private Ryan is not playing at the Chatham cinema.

• S provides full response (which provides grounding information)

Page 25: Application of Speech Recognition, Synthesis, Dialog

S: No, Saving Private Ryan is not playing at the Chatham cinema; the theater’s under renovation.

• S provides full response and supporting evidence

Page 26: Application of Speech Recognition, Synthesis, Dialog

Communicating with the user

[Figure: output components: Language Generator, Speech Synthesizer, Display Generator, Action Generator]

• Decide what to say to user (and how to phrase it)
• Construct sounds and intonation

Page 27: Application of Speech Recognition, Synthesis, Dialog

Response Generation

• Goal: generate natural language utterances to achieve goal(s) selected by the dialogue manager
• Issues:
  • Content selection: determining what to say
  • Surface realization: determining how to say it
• Generation gap: discrepancy between the actual output of the content selection process and the expected input of the surface realization process

Page 28: Application of Speech Recognition, Synthesis, Dialog

Language generation

• Template-based systems
  • Sentence templates with variables
• “Linguistic” systems
  • Generate surface from meaning representation
• Stochastic approaches
  • Statistical models of domain-expert speech
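The template-based approach is the easiest of the three to sketch. The example below reuses the three responses from the Chatham cinema slides as sentence templates with variables; the style labels and the `realize` function are assumptions for illustration.

```python
# Sentence templates with variables, one per response style.
TEMPLATES = {
    "elliptical": "No, it's not.",
    "full": "No, {movie} is not playing at the {theater}.",
    "full_with_evidence": "No, {movie} is not playing at the {theater}; {reason}.",
}


def realize(style, **content):
    """Surface realization: fill the chosen template with the selected content."""
    return TEMPLATES[style].format(**content)


content = {
    "movie": "Saving Private Ryan",
    "theater": "Chatham cinema",
    "reason": "the theater's under renovation",
}
for style in TEMPLATES:
    print(realize(style, **content))
```

Content selection decides which facts go into `content` and which style to use; the template performs only the surface realization.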

Page 29: Application of Speech Recognition, Synthesis, Dialog

Dialog Evaluation

• Goal: determine how “well” a dialogue system performs
• Main difficulties:
  • No strict right or wrong answers
  • Difficult to determine what features make a dialogue system better than another
  • Difficult to select metrics that contribute to the overall “goodness” of the system
  • Difficult to determine how the metrics compensate for one another
  • Expensive to collect new data for evaluating incremental improvement of systems

Page 30: Application of Speech Recognition, Synthesis, Dialog

Dialog Evaluation (Cont’d)

• System-initiative, explicit confirmation
  • better task success rate
  • lower WER
  • longer dialogs
  • fewer recovery subdialogs
  • less natural
• Mixed-initiative, no confirmation
  • lower task success rate
  • higher WER
  • shorter dialogs
  • more recovery subdialogs
  • more natural
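WER (word error rate), one of the metrics compared above, is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch, with the function name chosen here for illustration:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("fourteen eleven main street", "forty eleven main street"))
```

One substituted word out of four gives a WER of 0.25, yet the extracted address would be wrong, which illustrates why WER alone does not measure task success.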

Page 31: Application of Speech Recognition, Synthesis, Dialog

Speech Synthesis

Page 32: Application of Speech Recognition, Synthesis, Dialog

Speech Synthesis (Text-to-Speech, TTS)

• Prior knowledge
  • Vocabulary from words to sounds; surface markup
• Recorded prompts
• Formant synthesis
  • Model vocal tract as source and filters
• Concatenative synthesis
  • Record and segment expert’s voice
  • Splice appropriate units into full utterances
• Intonation modeling

Page 33: Application of Speech Recognition, Synthesis, Dialog

Recorded Prompts

• The simplest (and most common) solution is to record prompts spoken by a (trained) human
• Produces human quality voice
• Limited by number of prompts that can be recorded
• Can be extended by limited cut-and-paste or template filling

Page 34: Application of Speech Recognition, Synthesis, Dialog

The Source-Filter Model of Formant Synthesis

• Model of features to be extracted and fitted
• Excitation or Voicing Source(s) to model sound source
  • standard wave of glottal pulses for voiced sounds
  • randomly varying noise for unvoiced sounds
  • modification of airflow due to lips, etc.
  • high frequency (F0 rate), quasi-periodic, choppy
  • modeled with vector of glottal waveform patterns in voiced regions
• Acoustic Filter(s)
  • shapes the frequency character of vocal tract and radiation character at the lips
  • relatively slow (samples around 5ms suffice) and stationary
  • modeled with LPC (linear predictive coding)
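The source-filter idea can be sketched with a pulse train as the voiced source and a single two-pole resonator as the filter. This is a simplified illustration under stated assumptions: the formant frequency and bandwidth are set by hand rather than estimated with LPC, only one formant is modeled, and the unvoiced (noise) source is omitted.

```python
import math


def glottal_pulses(f0_hz, sample_rate, n_samples):
    """Voiced source: one unit impulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(signal, formant_hz, bandwidth_hz, sample_rate):
    """All-pole filter with one pole pair: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    a1 = 2 * r * math.cos(2 * math.pi * formant_hz / sample_rate)
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


SAMPLE_RATE = 8000  # assumed telephone-band rate
source = glottal_pulses(f0_hz=100, sample_rate=SAMPLE_RATE, n_samples=800)
speech = resonator(source, formant_hz=500, bandwidth_hz=80, sample_rate=SAMPLE_RATE)
print(len(speech), max(abs(s) for s in speech))
```

Changing `f0_hz` alters the pitch without touching the filter, and changing the formant alters the vowel-like quality without touching the source, which is the separation the model is built on.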

Page 35: Application of Speech Recognition, Synthesis, Dialog

Concatenative Synthesis

• Record basic inventory of sounds
• Retrieve appropriate sequence of units at run time
• Concatenate and adjust durations and pitch
• Synthesize waveform

Page 36: Application of Speech Recognition, Synthesis, Dialog

Diphone and Polyphone Synthesis

• Phone sequences capture co-articulation
• Cut speech in positions that minimize context contamination
• Need single phones, diphones and sometimes triphones
• Reduce number collected by
  • phonotactic constraints
  • collapsing in cases of no co-articulation
• Data Collection Methods
  • Collect data from a single (professional) speaker
  • Select text with maximal coverage (typically with greedy algorithm), or
  • Record minimal pairs in desired contexts (real words or nonsense)
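The greedy text selection mentioned above can be sketched as a set-cover loop: repeatedly pick the sentence that adds the most diphones not yet covered. The function names, the `budget` parameter, and the toy phone sequences are assumptions for illustration.

```python
def diphones(phones):
    """Adjacent phone pairs in a phone sequence."""
    return set(zip(phones, phones[1:]))


def greedy_select(candidates, budget):
    """Pick up to `budget` sentences, each adding the most uncovered diphones.

    `candidates` maps a sentence to its phone sequence.
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda s: len(diphones(remaining[s]) - covered))
        gain = diphones(remaining[best]) - covered
        if not gain:
            break  # nothing left adds new coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Toy corpus with made-up phone strings.
corpus = {
    "cat": ["k", "a", "t"],
    "cats": ["k", "a", "t", "s"],
    "dog": ["d", "o", "g"],
}
chosen, covered = greedy_select(corpus, budget=3)
print(chosen, len(covered))
```

Here "cat" is never recorded because "cats" already covers both of its diphones, which is how the greedy pass keeps the recording script short.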

Page 37: Application of Speech Recognition, Synthesis, Dialog

Signal Processing for Concatenative Synthesis

• Diphones recorded in one context must be generated in other contexts
• Features are extracted from recorded units
• Signal processing manipulates features to smooth boundaries where units are concatenated
• Signal processing modifies signal via ‘interpolation’
  • intonation
  • duration
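One simple way to smooth a unit boundary is a linear crossfade over a short overlap, sketched below on raw sample lists. This is an illustrative stand-in, not the feature-domain processing the slide describes; the function name and the `overlap` parameter are assumptions.

```python
def crossfade_concat(left, right, overlap):
    """Join two units, linearly blending the last/first `overlap` samples."""
    if not 1 <= overlap <= min(len(left), len(right)):
        raise ValueError("overlap must fit inside both units")
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the right-hand unit
        blended.append((1 - w) * left[start + i] + w * right[i])
    return left[:start] + blended + right[overlap:]


unit_a = [1.0, 1.0, 1.0, 1.0]
unit_b = [0.0, 0.0, 0.0, 0.0]
print(crossfade_concat(unit_a, unit_b, overlap=2))
```

The abrupt jump from 1.0 to 0.0 that plain concatenation would produce is replaced by a gradual ramp across the overlap region.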

Page 38: Application of Speech Recognition, Synthesis, Dialog

Intonation in Bell Labs TTS

• Generate a sequence of F0 targets for synthesis
• Example:
  • We were away a year ago.
  • phones: w E w R & w A & y E r & g O

source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998

Page 39: Application of Speech Recognition, Synthesis, Dialog

What you can do with Speech Recognition

• Transcription
  • dictation, information retrieval
• Command and control
  • data entry, device control, navigation, call routing
• Information access
  • airline schedules, stock quotes, directory assistance
• Problem solving
  • travel planning, logistics

Page 40: Application of Speech Recognition, Synthesis, Dialog


Human-machine interface is critical

Speech recognition is NOT the core function of most applications

Errorful recognition is a fact of life

Speech is a feature of applications that offers specific advantages

Page 41: Application of Speech Recognition, Synthesis, Dialog

Properties of Recognizers

• Speaker Independent vs. Speaker Dependent
• Large Vocabulary (2K-200K words) vs. Limited Vocabulary (2-200)
• Continuous vs. Discrete
• Speech Recognition vs. Speech Verification
• Real Time vs. multiples of real time
• Spontaneous Speech vs. Read Speech
• Noisy Environment vs. Quiet Environment
• High Resolution Microphone vs. Telephone vs. Cellphone
• Push-and-hold vs. push-to-talk vs. always-listening
• Adapt to speaker vs. non-adaptive
• Low vs. High Latency
• With online incremental results vs. final results
• Dialog Management

Page 42: Application of Speech Recognition, Synthesis, Dialog

Speech Recognition vs. Touch Tone

Shorter calls

Choices mean something

Automate more tasks

Reduces annoying operations

Available

Page 43: Application of Speech Recognition, Synthesis, Dialog

Transcription and Dictation

• Transcription is transforming a stream of human speech into computer-readable form
  • Medical reports, court proceedings, notes
  • Indexing (e.g., broadcasts)
• Dictation is the interactive composition of text
  • Report, correspondence, etc.

Page 44: Application of Speech Recognition, Synthesis, Dialog

SpeechWear

• Vehicle inspection task
  • USMC mechanics, fixed inspection form
  • Wearable computer (COTS components)
  • html-based task representation
• film clip

Page 45: Application of Speech Recognition, Synthesis, Dialog
Page 46: Application of Speech Recognition, Synthesis, Dialog

Speech recognition and understanding

• Sphinx system
  • speaker-independent
  • continuous speech
  • large vocabulary
• ATIS system
  • air travel information retrieval
  • context management
• film clip (1994)

Page 47: Application of Speech Recognition, Synthesis, Dialog
Page 48: Application of Speech Recognition, Synthesis, Dialog

Sample Market: Call Centers

Automate services, lower payroll

Shorten time on hold

Shorten agent and client call time

Reduce fraud

Improve customer service

Page 49: Application of Speech Recognition, Synthesis, Dialog

Interface guidelines

• State transparency
• Input control
• Error recovery
• Error detection
• Error correction
• Log performance
• Application integration

Page 50: Application of Speech Recognition, Synthesis, Dialog

Applications related to Speech Recognition

Speech Recognition: Figure out what a person is saying.

Speaker Verification: Authenticate that a person is who she/he claims to be.
Limited speech patterns

Speaker Identification: Assigns an identity to the voice of an unknown person.
Arbitrary speech patterns
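The difference between verification and identification can be sketched as a threshold test against one enrolled speaker versus a best-match search over all of them. The two-dimensional "voiceprints", the cosine score, and the 0.8 threshold are toy assumptions; real systems use far richer speaker models.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def verify(sample, claimed, voiceprints, threshold=0.8):
    """Verification: does the sample match the one claimed identity?"""
    return cosine(sample, voiceprints[claimed]) >= threshold


def identify(sample, voiceprints):
    """Identification: which enrolled speaker matches the sample best?"""
    return max(voiceprints, key=lambda name: cosine(sample, voiceprints[name]))


# Hypothetical enrolled speakers with made-up feature vectors.
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
sample = [0.9, 0.1]
print(verify(sample, "alice", enrolled), verify(sample, "bob", enrolled))
print(identify(sample, enrolled))
```

Verification is a yes/no decision about a single claim, while identification must choose among all enrolled speakers and so becomes harder as their number grows.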

Page 51: Application of Speech Recognition, Synthesis, Dialog

Three Types of Security

What You Have: key, card, token

What You Know: password, PIN, maiden name

Who You Are

Stronger Authentication

Page 52: Application of Speech Recognition, Synthesis, Dialog

Family Tree: Voice Biometrics

[Figure: family tree relating Voice Biometrics to Speech Processing and to Biometrics. Node labels: Speech Processing, Digitized Speech, Speech Recognition, Speech Synthesis, Input, Output, Voice Biometrics, Speaker Verification, Speaker Identification, Biometrics, Signature Verif., Typing Dynamics, Face Recognition, Finger Geometry, Fingerprinting, Hand Geometry, Iris/Retina Scan, DNA]

Page 53: Application of Speech Recognition, Synthesis, Dialog

Carnegie Mellon Speech Demos

• CMU Communicator
  • Call: 1-877-CMU-PLAN (268-7526), also 268-5144, or x8-1084
  • the information is accurate; you can use it for your own travel planning…
• CMU Universal Speech Interface (USI)
• CMU Movie Line
  • Seems to be about apartments now…
  • Call: (412) 268-1185

Page 54: Application of Speech Recognition, Synthesis, Dialog

Telephone Demos

• Nuance http://www.nuance.com
  • Banking: 1-650-847-7438
  • Travel Planning: 1-650-847-7427
  • Stock Quotes: 1-650-847-7423
• SpeechWorks http://www.speechworks.com/demos/demos.htm
  • Banking: 1-888-729-3366
  • Stock Trading: 1-800-786-2571
• MIT Spoken Language Systems Laboratory http://www.sls.lcs.mit.edu/sls/whatwedo/applications.html
  • Travel Plans (Pegasus): 1-877-648-8255
  • Weather (Jupiter): 1-888-573-8255
• IBM http://www-3.ibm.com/software/speech/
  • Mutual Funds, Name Dialing: 1-877-VIA-VOICE

Page 55: Application of Speech Recognition, Synthesis, Dialog

Questions?