Building Ubiquitous and Robust Speech and Natural Language Interfaces I
Gary Geunbae Lee, Ph.D., Professor, Dept. of CSE, POSTECH

IUI 2007 tutorial. Source: isoft.postech.ac.kr/publication/presentation/iuitutorial-gary.pdf


Page 1

Building Ubiquitous and Robust Speech and Natural Language Interfaces I

Gary Geunbae Lee, Ph.D., Professor
Dept. CSE, POSTECH

Page 2

Contents

• PART-I: Statistical Speech/Language Processing (60 min)
  – Natural Language Processing – short intro
  – Automatic Speech Recognition
  – (Spoken) Language Understanding

• PART-II: Technology of Spoken Dialog Systems (80 min)
  – Spoken Dialog Systems
  – Dialog Management
  – Dialog Studio
  – Information Access Dialog
  – Emotional & Context-sensitive Chatbot
  – Multi-modal Dialog
  – Conversational Text-to-Speech

• PART-III: Statistical Machine Translation (40 min)
  – Statistical Machine Translation
  – Phrase-based SMT
  – Speech Translation

Page 3

Ubiquitous computing

• Ubiquitous computing = network + sensor + computing
• Pervasive computing
• Third-paradigm computing
• Calm technology
• Invisible computing

• I, Robot-style interface – human language + hologram

Page 4

Ubiquitous computer interface?

• Computer – robot, home appliances, audio, telephone, fax machine, toaster, coffee machine, etc. (every object)

• Universal speech interface project (CMU)

• VoiceBox commercial systems

• Telematics Dialog Interface (POSTECH, LG, DiQuest)

Page 5

Example Domain

• Tele-service
• Car-navigation
• Home networking
• Robot interface

Page 6

What’s hard – ambiguities, ambiguities, all different levels of ambiguities

John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. [from J. Eisner lecture note]

- donut: To get a donut (doughnut; spare tire) for his car?
- Donut store: store where donuts shop? or is run by donuts? or looks like a big donut? or made of donut?
- From work: Well, actually, he stopped there from hunger and exhaustion, not just from work.
- Every few hours: That’s how often he thought it? Or that’s for coffee?
- it: the particular coffee that was good every few hours? the donut store? the situation?
- Too expensive: too expensive for what? what are we supposed to conclude about what John did?

Page 7

Structural vs. Statistical: Technology Innovation through Dialectic

• Structural analysis: rule-driven, rational, symbolic – NLU, Chomskyan, Schankian, AI community
• Statistical analysis: data-driven, empirical, connectionist – speech community

Page 8

Structural NLP

• Grammar rules + lexicons
  – Grammatical category (POS, syntactic category)
  – Unification features (connectivity, agreement, semantics, ...)
• Chart parsing
• Compositional semantics

• Limitation: enormous ambiguity
  – “List the sales of the products produced in 1973 with the products produced in 1972” ==> 455 parses (Martin et al., 1981)

Page 9

Statistical NLP

• Grammar’s role? – estimating which word sequence is legal
  – Pr(w1, w2, ..., wn) = pr(w1) pr(w2|w1) pr(w3|w1 w2) ... pr(wn|w1, ..., wn-1)
  – pr(w2|w1) = count(w1 w2) / count(w1) [MLE]
  – e.g., “the (big, pig) dog”
  – Shannon game – predicting the next word given a word sequence

• Language modeling – probability matrix
  – Language model evaluation – cross entropy: −Σ pr(w1,n) log prM(w1,n)
  – When prM(w1,n) = pr(w1,n), the cross entropy is minimized and language model M is perfect
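The MLE bigram estimate and the chain-rule decomposition above can be sketched directly (the toy corpus is invented for illustration):

```python
from collections import Counter

corpus = "the big dog barked the pig dog barked the big dog slept".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """MLE estimate: pr(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

def sentence_prob(words):
    """Chain rule with a bigram approximation:
    Pr(w1..wn) ~ pr(w1) * prod_i pr(wi | wi-1)."""
    p = unigrams[words[0]] / len(corpus)
    for w1, w2 in zip(words, words[1:]):
        p *= p_bigram(w2, w1)
    return p

print(p_bigram("big", "the"))               # count('the big')/count('the') = 2/3
print(sentence_prob("the big dog".split()))
```

This is exactly the Shannon-game view: given "the", the model predicts "big" with probability 2/3 and "pig" with 1/3.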

Page 10

Contents

• PART-I: Statistical Speech/Language Processing
  – Natural Language Processing – short intro
  – Automatic Speech Recognition
  – (Spoken) Language Understanding

• PART-II: Technology of Spoken Dialog Systems
  – Spoken Dialog Systems
  – Dialog Management
  – Dialog Studio
  – Information Access Dialog
  – Emotional & Context-sensitive Chatbot
  – Multi-modal Dialog
  – Conversational Text-to-Speech

• PART-III: Statistical Machine Translation
  – Statistical Machine Translation
  – Phrase-based SMT
  – Speech Translation

Page 11

The Noisy Channel Model

• Automatic speech recognition (ASR) is a process by which an acoustic speech signal is converted into a set of words [Rabiner and Juang, 1993]

• The noisy channel model [Lee et al., 1996]
  – The acoustic input is treated as a noisy version of a source sentence

[Diagram] Source sentence → Noisy Channel → noisy sentence → Decoder → guess at original sentence
(e.g., the source “Where is the bus stop?” passes through the channel and is recovered as “Where is the bus stop?”)

Page 12

The Noisy Channel Model

• What is the most likely sentence out of all sentences in the language L given some acoustic input O?

• Treat acoustic input O as sequence of individual observations – O = o1,o2,o3,…,ot

• Define a sentence as a sequence of words:– W = w1,w2,w3,…,wn

Ŵ = argmax_{W ∈ L} P(W|O)                      (goal)
  = argmax_{W ∈ L} P(O|W) P(W) / P(O)          (Bayes rule)
  = argmax_{W ∈ L} P(O|W) P(W)                 (golden rule: P(O) is constant over W)

Page 13

Speech Recognition Architecture Meets Noisy Channel

[Diagram] Speech signals O (“Where is the bus stop?”) → Feature Extraction → Decoding → word sequence Ŵ = argmax_{W ∈ L} P(O|W) P(W) (recognized output, possibly with errors: “Wher is the bus stop?”).
Decoding draws on three knowledge sources, combined by network construction:
  – Acoustic Model: HMM estimation from a speech DB
  – Pronunciation Model: G2P (grapheme-to-phoneme)
  – Language Model: LM estimation from text corpora

Page 14

Network Construction

[Diagram] Acoustic model, pronunciation model, and language model for Korean digit words (일 “one”, 이 “two”, 삼 “three”, 사 “four”): phone sequences such as I-L (일) and S-A-M (삼) are built from HMM states and connected by intra-word and between-word transitions; at word transitions the LM is applied as P(일|x), P(이|x), P(삼|x), P(사|x), from start to end of the search network.

• Expanding every word to the state level, we get a search network [Demuynck et al., 1997]

Page 15

References (1/2)

• L. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition, ICASSP, pp.49–52.

• C. Beaujard and M. Jardino, 1999. Language Modeling based on Automatic Word Concatenations, In Proceedings of 8th European Conference on Speech Communication and Technology, vol. 4, pp.1563-1566.

• K. Demuynck, J. Duchateau, and D. V. Compernolle, 1997. A static lexicon network representation for cross-word context dependent phones, Proceedings of the 5th European Conference on Speech Communication and Technology, pp.143–146.

• T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation, Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp.99–104.

• M. Mohri, F. Pereira, and M. Riley, 2002. Weighted finite-state transducers in speech recognition, Computer Speech and Language, vol.16, no.1, pp.69–88.

Page 16

References (2/2)

• B. H. Juang, S. E. Levinson, and M. M. Sondhi, 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory, vol.32, no.2, pp.307–309.

• C. H. Lee, F. K. Soong, and K. K. Paliwal, 1996. Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer Academic Publishers.

• K. K. Paliwal, 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing, vol.2, pp.157–173.

• L. R. Rabiner, 1989, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol.77, no.2, pp.257–286.

• L. R. Rabiner and B. H. Juang, 1993. Fundamentals of Speech Recognition, Prentice-Hall.

• S. J. Young, N. H. Russell, and J. H. S. Thornton, 1989. Token passing: a simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.

• S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, 1996. The HTK book. Entropic Cambridge Research Lab., Cambridge, UK.

Page 17

Contents

• PART-I: Statistical Speech/Language Processing
  – Natural Language Processing – short intro
  – Automatic Speech Recognition
  – (Spoken) Language Understanding

• PART-II: Technology of Spoken Dialog Systems
  – Spoken Dialog Systems
  – Dialog Management
  – Dialog Studio
  – Information Access Dialog
  – Emotional & Context-sensitive Chatbot
  – Multi-modal Dialog
  – Conversational Text-to-Speech

• PART-III: Statistical Machine Translation
  – Statistical Machine Translation
  – Phrase-based SMT
  – Speech Translation

Page 18

Spoken Language Understanding (SLU)

• Spoken language understanding maps natural language speech to a frame structure encoding its meaning [Wang et al., 2005]

• What’s the difference between NLU and SLU?
  – Robustness: noisy and ungrammatical spoken language
  – Domain dependence: further deep-level semantics (e.g., Person vs. Cast)
  – Dialog: dialog-history dependent, utterance-by-utterance analysis

• Traditional approaches; natural language to SQL conversion

[Diagram] A typical ATIS system (from [Wang et al., 2005]): Speech → ASR → Text → SLU → Semantic Frame → SQL Generation → SQL → Database → Response

Page 19

Semantic Representation

• Semantic frame (frame and slot/value structure) [Gildea and Jurafsky, 2002]
  – An intermediate semantic representation that serves as the interface between the user and the dialog system
  – Each frame contains several typed components called slots; the type of a slot specifies what kind of fillers it expects

“Show me flights from Seattle to Boston”

Hierarchical representation:
  ShowFlight
    Subject: FLIGHT
    Flight
      Departure_City: SEA
      Arrival_City: BOS

XML format:
  <frame name='ShowFlight' type='void'>
    <slot type='Subject'>FLIGHT</slot>
    <slot type='Flight'>
      <slot type='DCity'>SEA</slot>
      <slot type='ACity'>BOS</slot>
    </slot>
  </frame>

Semantic representation on the ATIS task: XML format and hierarchical representation [Wang et al., 2005]
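One advantage of this representation is that it is trivially machine-readable. A small sketch using Python's standard `xml.etree`: it parses a ShowFlight frame in the slide's XML style and flattens it into nested dictionaries (`frame_to_dict` is my own illustrative helper, not part of the tutorial's system).

```python
import xml.etree.ElementTree as ET

# A frame instance in the slide's XML style
frame_xml = """
<frame name='ShowFlight' type='void'>
  <slot type='Subject'>FLIGHT</slot>
  <slot type='Flight'>
    <slot type='DCity'>SEA</slot>
    <slot type='ACity'>BOS</slot>
  </slot>
</frame>
"""

def frame_to_dict(elem):
    """Recursively flatten a frame/slot tree into nested dicts."""
    children = list(elem)
    if not children:                      # leaf slot: its text is the filler
        return elem.text.strip() if elem.text else None
    return {child.get("type"): frame_to_dict(child) for child in children}

root = ET.fromstring(frame_xml)
print(root.get("name"), frame_to_dict(root))
# ShowFlight {'Subject': 'FLIGHT', 'Flight': {'DCity': 'SEA', 'ACity': 'BOS'}}
```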

Page 20

Knowledge-based Systems

• Knowledge-based systems:
  – Developers write a syntactic/semantic grammar
  – A robust parser analyzes the input text with the grammar
  – No large amount of training data is needed

• Previous works
  – MIT: TINA (natural language understanding) [Seneff, 1992]
  – CMU: PHOENIX [Pellom et al., 2000]
  – SRI: GEMINI [Dowding et al., 1993]

• Disadvantages
  1) Grammar development is an error-prone process
  2) It takes multiple rounds to fine-tune a grammar
  3) Combined linguistic and engineering expertise is required to construct a grammar with good coverage and optimized performance
  4) Such a grammar is difficult and expensive to maintain

Page 21

Statistical Systems

• Statistical SLU approaches:
  – The system can automatically learn from example sentences paired with their corresponding semantics
  – The annotations are much easier to create and do not require specialized knowledge

• Previous works
  – Microsoft: HMM/CFG composite model [Wang et al., 2005]
  – AT&T: CHRONUS (finite-state transducers) [Levin and Pieraccini, 1995]
  – Cambridge Univ.: Hidden Vector State model [He and Young, 2005]
  – POSTECH: semantic frame extraction using statistical classifiers [Eun et al., 2004; Eun et al., 2005; Jeong and Lee, 2006]

• Disadvantages
  1) Data-sparseness problem: the system requires a large corpus
  2) Lack of domain knowledge

Page 22

Reducing the Effort of Human Annotation

• Active + Semi-supervised learning for SLU [Tur et al., 2005]– Use raw data, and divide them into two sets Sraw = Sactive + Ssemi

[Diagram] A model is trained on a small labeled data set, then used to predict labels and confidence estimates for the raw data. Samples with confidence below a threshold go to active learning (human labeling); samples above the threshold are filtered and kept as machine-labeled samples. Both are added to the augmented training data.
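The split described above can be sketched as a loop. Everything here is a stand-in: the keyword-overlap "model", the threshold, and the toy utterances replace the real statistical classifier of [Tur et al., 2005].

```python
# Toy data: utterances labeled by dialog act; the raw pool is unlabeled.
labeled = [("show me flights", "REQUEST"), ("good bye", "CLOSE")]
raw = ["list flights to boston", "bye bye now", "flights from denver",
       "thanks good bye", "uh I mean the other one"]

def train(data):
    """Toy 'model': one keyword set per label."""
    model = {}
    for text, label in data:
        model.setdefault(label, set()).update(text.split())
    return model

def predict(model, text):
    """Return (best_label, confidence) by word-overlap ratio."""
    words = set(text.split())
    scores = {lab: len(words & kw) / len(words) for lab, kw in model.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]

THRESHOLD = 0.5
model = train(labeled)
to_human, machine_labeled = [], []
for utt in raw:
    label, conf = predict(model, utt)
    if conf < THRESHOLD:
        to_human.append(utt)                   # active learning: ask an annotator
    else:
        machine_labeled.append((utt, label))   # semi-supervised: trust the model

model = train(labeled + machine_labeled)       # retrain on the augmented data
print(to_human, machine_labeled)
```

The low-confidence utterances (where the model would gain most from a human label) go to the annotator; confident predictions are recycled as training data for free.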

Page 23

Semantic Frame Extraction

[Diagram] Overall architecture of the semantic analyzer: feature extraction/selection over the information source feeds three components – dialog act identification, frame-slot extraction, and relation extraction – whose outputs are combined by unification.

Examples of semantic frame structure:
  “I like DisneyWorld.”
    Domain: Chat / Dialog Act: Statement / Main Action: Like / Object.Location=DisneyWorld
  “How to get to DisneyWorld?”
    Domain: Navigation / Dialog Act: WH-question / Main Action: Search / Object.Location.Destination=DisneyWorld

• Semantic frame extraction (~ information extraction approach)
  1) Dialog act / main action identification ~ classification
  2) Frame-slot object extraction ~ named entity recognition
  3) Object-attribute attachment ~ relation extraction
  – 1) + 2) + 3) ~ unification

Page 24

Frame-Slot Object Extraction

• Frame-slot extraction ~ NER = sequence labeling problem
• A probabilistic model for sequence labeling inference: Conditional Random Fields [Lafferty et al., 2001]

[Diagram] Linear-chain CRF: labels y_{t-1}, y_t, y_{t+1} over observations x_{t-1}, x_t, x_{t+1}; a CRF is an undirected graphical model.
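Decoding in a linear-chain model of this shape is Viterbi search over additive transition and emission scores. A hand-rolled sketch, with made-up scores standing in for learned CRF feature weights (this is not the actual model of [Jeong and Lee, 2006]); it labels cities in an ATIS-style utterance with BIO-style tags:

```python
TAGS = ["O", "B-CITY"]
emit = {  # score(tag, word): hypothetical emission weights
    ("B-CITY", "denver"): 2.0, ("B-CITY", "chicago"): 2.0,
    ("O", "fly"): 1.0, ("O", "from"): 1.0, ("O", "to"): 1.0,
}
trans = {("O", "O"): 0.5, ("O", "B-CITY"): 0.5,
         ("B-CITY", "O"): 0.5, ("B-CITY", "B-CITY"): -1.0}

def viterbi(words):
    """Best tag sequence under additive transition + emission scores."""
    # best[tag] = (score of best partial path ending in tag, that path)
    best = {t: (emit.get((t, words[0]), 0.0), [t]) for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[p][0] + trans[(p, t)])
            score = best[prev][0] + trans[(prev, t)] + emit.get((t, w), 0.0)
            new[t] = (score, best[prev][1] + [t])
        best = new
    return max(best.values())[1]

print(viterbi("fly from denver to chicago".split()))
# ['O', 'O', 'B-CITY', 'O', 'B-CITY']
```

A real CRF computes these scores from weighted feature functions over (y_{t-1}, y_t, x, t) and trains the weights by maximizing conditional log-likelihood; the dynamic program is the same.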

Page 25

Long-distance Dependency in NER

… fly from denver to chicago on dec. 10th 1999 → dec = DEPART.MONTH
… return from denver to chicago on dec. 10th 1999 → dec = RETURN.MONTH

• A solution: trigger-induced CRF [Jeong and Lee, 2006]
  – Basic idea: add only bundles of (trigger) features that increase the log-likelihood of the training data
  – Measure feature gain to evaluate the (trigger) features using Kullback-Leibler divergence

Page 26

References (1/2)

• J. Dowding, J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran. 1993. Gemini: A natural language system for spoken language understanding. ACL, 54-61.

• J. Eun, C. Lee, and G. G. Lee, 2004. An information extraction approach for spoken language understanding. ICSLP.

• J. Eun, M. Jeong, and G. G. Lee, 2005. A Multiple Classifier-based Concept-Spotting Approach for Robust Spoken Language Understanding. Interspeech 2005-Eurospeech.

• D. Gildea, and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.

• Y. He, and S. Young. January 2005. Semantic processing using the Hidden Vector State model. Computer Speech and Language, 19(1):85-106.

• M. Jeong, and G. G. Lee. 2006. Exploiting non-local features for spoken language understanding. COLING/ACL.

• J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. ICML.

Page 27

References (2/2)

• E. Levin, and R. Pieraccini. 1995. CHRONUS, the next generation. In Proceedings of the 1995 ARPA Spoken Language Systems Technical Workshop, 269-271, Austin, Texas.

• B. Pellom, W. Ward, and S. Pradhan. 2000. The CU Communicator: An Architecture for Dialogue Systems. ICSLP.

• R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta. 2002. Incorporating prior knowledge into boosting. ICML, pp. 538-545.

• S. Seneff. 1992. TINA: a natural language system for spoken language applications. Computational Linguistics, 18(1):61-86.

• G. Tur, D. Hakkani-Tur, and R. E. Schapire. 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45:171-186.

• Y. Wang, L. Deng, and A. Acero. September 2005. Spoken Language Understanding: An introduction to the statistical framework. IEEE Signal Processing Magazine, 27(5).

Page 28

Contents

• PART-I: Statistical Speech/Language Processing
  – Natural Language Processing – short intro
  – Automatic Speech Recognition
  – (Spoken) Language Understanding

• PART-II: Technology of Spoken Dialog Systems
  – Spoken Dialog Systems
  – Dialog Management
  – Dialog Studio
  – Information Access Dialog
  – Emotional & Context-sensitive Chatbot
  – Multi-modal Dialog
  – Conversational Text-to-Speech

• PART-III: Statistical Machine Translation
  – Statistical Machine Translation
  – Phrase-based SMT
  – Speech Translation

Page 29

[Demo] Dialog for EPG (POSTECH)

[Demo] Unified Chatting and Goal-oriented Dialog (POSTECH)

Page 30

Spoken Dialog System

[Diagram] Pipeline: ASR → SLU → DM → RG, each component backed by models and rules.
  – User speech: “I need a flight from Washington DC to Denver roundtrip”
  – Automatic Speech Recognition → recognized sentence
  – Spoken Language Understanding → semantic meaning: ORIGIN_CITY: WASHINGTON, DESTINATION_CITY: DENVER, FLIGHT_TYPE: ROUNDTRIP
  – Dialog Management → system action: GET DEPARTURE_DATE
  – Response Generation → system speech: “Which date do you want to fly from Washington to Denver?”

Page 31

VoiceXML-based System

• What is VoiceXML? – The HTML(XML) of the voice web. [W3C, working draft]– The open standard markup language for voice application

• Can do– Rapid implementation and management– Integrated with World Wide Web– Mixed-Initiative dialog– Able to input push button on telephone– Simple dialog implementation solution

• VoiceXML dialogs are built from – <menu>, <form> (similar to “Slot & Filling” system)

• Limiting User’s Response– Verification, and Help for invalid response– Good speech recognition accuracy

Page 32

Example – <Form>

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="login">
    <field name="phone_number" type="phone">
      <prompt>Please say your complete phone number</prompt>
    </field>
    <field name="pin_code" type="digits">
      <prompt>Please say your PIN code</prompt>
    </field>
    <block>
      <submit next="http://www.example.com/servlet/login"
              namelist="phone_number pin_code"/>
    </block>
  </form>
</vxml>

Browser: Please say your complete phone number
User: 800-555-1212
Browser: Please say your PIN code
User: 1 2 3 4

Page 33

Frame-based Approach

• Frame-based system [McTear, 2004]
  – Asks the user questions to fill slots in a template in order to perform a task (form-filling task)
  – Permits the user to respond more flexibly to the system’s prompts (as in Example 2)
  – Recognizes the main concepts in the user’s utterance

Example 1)
  System: What is your destination?
  User: London.
  System: What day do you want to travel?
  User: Friday.

Example 2)
  System: What is your destination?
  User: London on Friday around 10 in the morning.
  System: I have the following connection …
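A frame-based manager of the kind shown in Example 2 can be sketched in a few lines. Keyword spotting stands in for a real SLU component, and the prompts, slots, and vocabulary are all invented for illustration:

```python
# Slots in the template and the prompt used when each is unfilled
PROMPTS = {"destination": "What is your destination?",
           "day": "What day do you want to travel?"}
KEYWORDS = {"destination": ["london", "paris"],
            "day": ["monday", "friday"]}

def understand(utterance):
    """Spot slot values mentioned anywhere in the utterance."""
    words = utterance.lower().replace(".", "").split()
    return {slot: w for slot, vals in KEYWORDS.items()
            for w in words if w in vals}

def run_dialog(user_turns):
    """Prompt for unfilled slots until the frame is complete (sketch only)."""
    frame = {"destination": None, "day": None}
    log = []
    turns = iter(user_turns)
    while any(v is None for v in frame.values()):
        slot = next(s for s, v in frame.items() if v is None)
        log.append(PROMPTS[slot])
        frame.update(understand(next(turns)))   # an answer may fill several slots
    return frame, log

frame, log = run_dialog(["London on Friday around 10 in the morning."])
print(frame)  # both slots filled from a single flexible answer
```

Because the update step accepts any slot values found in the answer, one over-informative reply fills the whole frame, which is exactly the flexibility the slide contrasts with a rigid state-machine dialog.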

Page 34

Agent-Based Approach

• Properties [Allen et al., 1996]
  – Complex communication using unrestricted natural language
  – Mixed-initiative
  – Co-operative problem solving
  – Theorem proving, planning, distributed architectures
  – Conversational agents

• An example

• System attempts to provide a more co-operative response that might address the user’s needs.

User: I’m looking for a job in the Calais area. Are there any servers?

System: No, there aren’t any employment servers for Calais. However, there is an employment server for Pas-de-Calais and an employment server for Lille. Are you interested in one of these?

Page 35

Galaxy Communicator Framework

• The Galaxy Communicator software infrastructure is a distributed, message-based, hub-and-spoke infrastructure optimized for constructing spoken dialog systems. [Bayer et al., 2001]

• An open-source architecture for constructing dialog systems
• History: the MIT Galaxy system, developed and maintained by MITRE
  – Message-passing protocol
  – Hub-and-clients architecture

Page 36

References (1/2)

• J. F. Allen, B. Miller, E. Ringger and T. Sikorski. 1996. A Robust System for Natural Spoken Dialogue, ACL.

• S. Bayer, C. Doran, and B. George. 2001. Dialogue Interaction with the DARPA Communicator Infrastructure: The Development of Useful Software. HLT Research.

• R. Cole, editor., Survey of the state of the art in human language technology, Cambridge University Press, New York, NY, USA, 1997.

• G. Ferguson, and J. F. Allen. 1998. TRIPS: An Integrated Intelligent Problem-Solving Assistant, AAAI, pp26-30.

• K. Komatani, F. Adachi, S. Ueno, T. Kawahara, and H. Okuno. 2003. Flexible Spoken Dialogue System based on User Models and Dynamic Generation of VoiceXML Scripts. SIGDIAL.

• S. Larsson, and D. Traum. 2000. Information state and dialogue management in the TRINDI Dialogue Move Engine Toolkit, Natural Language Engineering, 6(3-4).

• S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, and G. Sagerer. 2003. Providing the basis for human-robot interaction: A multi-modal attention system for a mobile robot. ICMI. pp. 28–35.

Page 37

References (2/2)

• E. Levin, R. Pieraccini, and W. Eckert. 2000, A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing. 8(1):11-23

• C. Lee, S. Jung, J. Eun, M. Jeong, and G. G. Lee. 2006. A Situation-based Dialogue Management using Dialogue Examples. ICASSP.

• M. Walker, L. Hirschman, and J. Aberdeen. 2000. Evaluation for DARPA Communicator Spoken Dialogue Systems. LREC.

• M. F. McTear. Spoken Dialogue Technology, Springer, 2004.

• I. O’Neill, P. Hanna, X. Liu, D. Greer, and M. McTear. 2005. Implementing advanced spoken dialogue management in Java. Science of Computer Programming, 54(1):99-124.

• B. Pellom, W. Ward., and S. Pradhan. 2000. The CU Communicator: An Architecture for Dialogue Systems. ICSLP.

• A. Rudnicky, E. Thayer, P. Constantinides, C. Tchou, R. Shern, K. Lenzo, W. Xu, and A. Oh. 1999. Creating natural dialogs in the Carnegie Mellon Communicator system. Eurospeech, 4, pp1531-1534.

• W3C, Voice Extensible Markup Language (VoiceXML) Version 2.0 Working Draft, http://www.w3c.org/TR/voicexml20/

Page 38

Contents

• PART-I: Statistical Speech/Language Processing
  – Natural Language Processing – short intro
  – Automatic Speech Recognition
  – (Spoken) Language Understanding

• PART-II: Technology of Spoken Dialog Systems
  – Spoken Dialog Systems
  – Dialog Management
  – Dialog Studio
  – Information Access Dialog
  – Emotional & Context-sensitive Chatbot
  – Multi-modal Dialog
  – Conversational Text-to-Speech

• PART-III: Statistical Machine Translation
  – Statistical Machine Translation
  – Phrase-based SMT
  – Speech Translation

Page 39

The Role of Dialog Management

• For example, in the flight reservation system:
  – System: Welcome to the Flight Information Service. Where would you like to travel to?
  – Caller: I would like to fly to London on Friday, arriving around 9 in the morning.
  – System: ????????????????????

In order to process this utterance, the system has to engage in the following processes:

1) Recognize the words that the caller said. (Speech Recognition)
2) Assign a meaning to these words. (Language Understanding)
3) Determine how the utterance fits into the dialog so far and decide what to do next. (Dialog Management)

→ There is a flight that departs at 7:45 a.m. and arrives at 8:50 a.m.

Page 40

Information State Update Approach – Rule-based DM [Larsson and Traum, 2000]

• A method of specifying a dialogue theory in a way that is straightforward to implement
• Consists of the following five constituents:
  – Information components
    – Including aspects of common context (e.g., participants, common ground, linguistic and intentional structure, obligations and commitments, beliefs, intentions, user models, etc.)
  – Formal representations
    – How to model the information components (e.g., as lists, sets, typed feature structures, records, etc.)

Page 41

Information State Approach

  – Dialogue moves
    – Trigger the update of the information state
    – Correlated with externally performed actions
  – Update rules
    – Govern the updating of the information state
  – Update strategy
    – Decides which rules to apply at a given point from the set of applicable ones
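These constituents map naturally onto code: a state record, move-triggered update rules, and a strategy for choosing among applicable rules. A minimal sketch; the state fields, rule set, and first-match strategy are illustrative inventions, not the TRINDI toolkit's actual design:

```python
# Information state as a record (dict): the formal representation
state = {"common_ground": [], "obligations": [], "last_move": None}

def on_ask(state, move):
    """An 'ask' move obliges someone to answer it."""
    state["obligations"].append(("answer", move["content"]))

def on_answer(state, move):
    """An 'answer' move grounds its content and discharges the obligation."""
    state["common_ground"].append(move["content"])
    if state["obligations"] and state["obligations"][-1][0] == "answer":
        state["obligations"].pop()

RULES = [  # update rules as (precondition, effect) pairs
    (lambda s, m: m["type"] == "ask", on_ask),
    (lambda s, m: m["type"] == "answer", on_answer),
]

def update(state, move):
    """Update strategy: apply the first applicable rule for the dialogue move."""
    for applies, effect in RULES:
        if applies(state, move):
            effect(state, move)
            break
    state["last_move"] = move["type"]

update(state, {"type": "ask", "content": "destination?"})
update(state, {"type": "answer", "content": "London"})
print(state["common_ground"], state["obligations"])  # ['London'] []
```

Richer strategies (rule priorities, applying all applicable rules, nondeterministic choice) slot into `update` without touching the rules themselves, which is the modularity the ISU approach is after.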

Page 42

Example Dialogue

Page 43

Example Dialogue

Page 44

• A tree branching for every possible situation – it can become very complex.

[Diagram] From Start, the state tree branches to Information+Origin, Information+Destination, Information+Date, and Flight #; these combine into Information+Origin+Dest., Information+Origin+Date, Information+Dest+Date, and Information+Origin+Dest+Date, and into Flight #+Date, Flight #+Information, and Flight #+Reservation.

⇒ The hand-crafted dialog model is not domain portable.

Page 45

An Optimization Problem

• Dialog Management as an Optimization Problem
– Optimization goal: achieve the application goal while minimizing a cost function (= objective function)
– In general: minimize the number of user-system turns and DB accesses needed to fill all slots

– Simple example, the month-and-day problem: design a dialog system that gets a correct date (month and day) from the user through the shortest possible interaction
– Objective function:

C_D = ω_i · (#interactions) + ω_e · (#errors) + ω_f · (#unfilled slots)

• How to formalize this mathematically? → Markov Decision Process (MDP)

Page 46

Mathematical Formalization

• Markov Decision Process (MDP) (Levin et al. 2000)
– Problems with a cost (or reward) objective function are well modeled as a Markov Decision Process:
– the specification of a sequential decision problem for a fully observable environment that satisfies the Markov assumption and yields additive rewards.

[Diagram] The Dialog Manager sends a dialog action (prompts, queries, etc.) to the environment (user, external DB, or other servers) and receives back the dialog state and a cost (turn, error, DB access, etc.).

Page 47

Month and Day Example

Optimal strategy is the one that minimizes the cost.

Strategy 1: "Good bye." (ask nothing) — 1 interaction, 2 unfilled slots, no errors:

C_1 = 1·ω_i + 2·ω_f

Strategy 2: "Which date?" → (Month, Day) → "Good bye." — 2 interactions, error rate P_1 on the open prompt:

C_2 = 2·ω_i + 2·P_1·ω_e + 0·ω_f

Strategy 3: "Which day?" → "Which month?" → "Good bye." — 3 interactions, error rate P_2 per directed prompt:

C_3 = 3·ω_i + 2·P_2·ω_e + 0·ω_f

Strategy 1 is optimal if ω_i + P_2·ω_e − ω_f > 0, i.e., when the recognition error rate is too high for asking to be worthwhile.
Strategy 3 is optimal if 2·(P_1 − P_2)·ω_e − ω_f·0 − ω_i > 0, i.e., when P_1 is much higher than P_2, so the saved error cost outweighs the cost of a longer interaction.
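The trade-off between the three strategies can be checked numerically. A minimal sketch, with hypothetical weights and error rates (none of these numbers come from the tutorial):

```python
# Hypothetical weights and error rates for the month-and-day example
# (all numbers are made up for illustration).
# w_i: per-interaction cost, w_e: per-error cost, w_f: per-unfilled-slot cost.
w_i, w_e, w_f = 1.0, 5.0, 10.0
p1, p2 = 0.30, 0.10  # error rate of the open prompt vs. a directed prompt

costs = {
    # Strategy 1: say goodbye immediately -> 1 turn, 2 unfilled slots.
    "strategy1": 1 * w_i + 2 * w_f,
    # Strategy 2: "Which date?" then goodbye -> 2 turns, 2 values at error rate p1.
    "strategy2": 2 * w_i + 2 * p1 * w_e,
    # Strategy 3: ask day and month separately -> 3 turns, 2 values at p2.
    "strategy3": 3 * w_i + 2 * p2 * w_e,
}
best = min(costs, key=costs.get)
print(costs, best)
```

With these numbers 2·(P_1 − P_2)·ω_e − ω_i = 1 > 0, so the two directed questions win despite the extra turn.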

Page 48

POMDP (Young 2002)

• Partially Observable Markov Decision Process (POMDP)
– A POMDP extends the Markov Decision Process by removing the requirement that the system knows its current state precisely.
– Instead, the system makes observations about the outside world that give incomplete information about the true current state.
– Belief state: a distribution over MDP states, maintained in the absence of exact knowledge of the state.

In place of the MDP's current state s, next state s′, and reward function r(s, a), a POMDP maintains a belief b(s). After taking action a and observing o, the belief is updated as

b′(s′) = p(s′ | o, a, b) = p(o | s′, a) · Σ_{s∈S} p(s′ | s, a) · b(s) / p(o | a, b)

and the reward is computed in expectation over the belief:

ρ(b, a) = Σ_{s∈S} b(s) · r(s, a)
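The belief update can be sketched directly. The two-state transition and observation tables below are hypothetical illustrations, not part of the tutorial:

```python
# Sketch of a POMDP belief update for a single fixed action a and a single
# observed o. States, probabilities, and the initial belief are made up.
S = ["slot_filled", "slot_empty"]
P_trans = {  # p(s' | s, a)
    "slot_filled": {"slot_filled": 0.9, "slot_empty": 0.1},
    "slot_empty":  {"slot_filled": 0.4, "slot_empty": 0.6},
}
P_obs = {"slot_filled": 0.8, "slot_empty": 0.3}  # p(o | s', a) for the observed o

def belief_update(b):
    """b'(s') = p(o|s',a) * sum_s p(s'|s,a) * b(s), normalized by p(o|a,b)."""
    unnorm = {s2: P_obs[s2] * sum(P_trans[s][s2] * b[s] for s in S) for s2 in S}
    z = sum(unnorm.values())  # = p(o | a, b), the normalizer
    return {s2: v / z for s2, v in unnorm.items()}

b = {"slot_filled": 0.5, "slot_empty": 0.5}
print(belief_update(b))
```

The observation favors "slot_filled", so the updated belief shifts mass toward that state while still summing to one.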

Page 49

Example-based Dialog Model Learning (Lee et al 2006)

• Example-Based Dialog Modeling
– The model is built automatically from a dialog corpus.
– Example-based techniques use a dialog example database (DEDB).
– The model is simple and domain portable.

– DEDB indexing and searching: query keys are user intention, semantic frames, and discourse history.

– Tie-breaking by utterance similarity measures:
– Lexico-semantic similarity: normalized edit distance
– Discourse history similarity: cosine similarity
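The two tie-breaking measures can be sketched as follows; the token sequences and slot-filling vectors are illustrative stand-ins for the corpus examples:

```python
# Sketch of the two tie-breaking measures: normalized edit distance over
# lexico-semantic token sequences, and cosine similarity over slot-filling
# vectors. The sample tokens and vectors below are made up.
import math

def lexico_semantic_similarity(a, b):
    """1 - normalized Levenshtein distance over token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return 1.0 - d[m][n] / max(m, n)

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

utt = ["[channel]", "[genre]", "when", "showing"]
ex  = ["[date]", "[genre]", "what", "time", "showing"]
print(lexico_semantic_similarity(utt, ex))
print(cosine_similarity([1, 0, 1, 0], [1, 0, 0, 1]))
```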

Page 50

• Indexing and Querying
– Semantic-based indexing for the dialog example database: a purely lexical example database would need many more examples.
– The SLU result is the most important index key.
– Indexing is done automatically from the dialog corpus.

Example-based Dialog Modeling

Indexing key (input: user utterance):
– Utterance: 그럼 SBS 드라마는 언제 하지? (So when is the SBS drama showing?)
– Dialog Act: wh-question
– Main Action: search_start_time
– Component Slots: [channel = SBS, genre = drama]
– Discourse History: [1,0,1,0,0,0,0,0,0]

Output (system concept):
– System Action: inform(date, start_time, program)

Page 51

Example-based Dialog Modeling

• Tie-breaking
– Lexico-semantic representation
– Utterance similarity measure

User utterance: 그럼 SBS 드라마는 언제 하지? (So when is the SBS drama showing?)
Component slots: [channel = SBS, genre = 드라마 (drama)]
Lexico-semantic representation: 그럼 [channel] [genre] 는 언제 하지

Current user utterance: 그럼 [channel] [genre] 는 언제 하지 — slot-filling vector [1,0,1,0,0,0,0,0,0]
Retrieved example: [date] [genre] 는 몇 시에 하니 (what time is the [genre] showing on [date]?) — slot-filling vector [1,0,0,1,0,0,0,0,0]

The two are compared by lexico-semantic similarity and discourse history similarity.

Page 52

Strategy of Example-based Dialog Modeling

[Diagram] Offline: a Dialogue Corpus is automatically indexed (with help from a Domain Expert) into the Dialogue Example DB.
At run time: the user's utterance is analyzed into a semantic frame and user intention; together with the discourse history these drive query generation and retrieval of dialogue examples. Ties are broken by utterance similarity (lexico-semantic similarity + discourse history similarity), and the best dialogue example determines the system responses.

Page 53

Multi-domain/genre Dialog Expert

[Diagram] USER: What is on TV now?
– Domain Spotter / Agent Spotter: Agent = Task, Domain = EPG
– Dialog Act Identification: Dialog Act = Wh-question
– Frame-Slot Extraction (EPG): Main Action = Search_Program, Start_Time = now
– Discourse Inference over the discourse history stack (previous user utterance, previous dialog act and semantic frame, previous slot-filling vector)
– Dialog examples are retrieved from the EPG DEDB (indexed from the EPG dialog corpus by an EPG expert) and ranked by utterance similarity; when no example is retrieved, EPG meta-rules are applied via an XML rule parser.
– The Database Manager queries the TV schedule database (built from Web contents) to produce the response.
SYSTEM: "XXX" is on SBS, …

Page 54

References

• S. Larsson and D. Traum, “Information state and dialogue management in the TRINDI Dialogue Move Engine Toolkit”, Natural Language Engineering, vol. 6, no. 3-4, pp. 323-340, 2000

• E. Levin, R. Pieraccini, and W. Eckert, “A stochastic model of human-machine interaction for learning dialog strategies”, IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 11-23, 2000.

• S. Young. 2002. Talking to Machines (Statistically Speaking). ICSLP, Denver.
• I. Lane and T. Kawahara. 2006. Verification of speech recognition results incorporating in-domain confidence and discourse coherence measures. IEICE Transactions on Information and Systems, 89(3):931-938.

• C. Lee, S. Jung, J. Eun, M. Jeong, and G. G. Lee. 2006. A Situation-based Dialogue Management using Dialogue Examples. ICASSP.

• C. Lee, S. Jung, M. Jeong, and G. G. Lee. 2006. Chat and Goal-Oriented Dialog Together: A Unified Example-based Architecture for Multi-Domain Dialog Management. Proceedings of the IEEE/ACL 2006 Workshop on Spoken Language Technology (SLT), Aruba.

• D. Litman and S. Pan. 1999. Empirically evaluating an adaptable spoken dialogue system. ICUM, pp55-64.

• M. F. McTear. Spoken Dialogue Technology. Springer, 2004.
• I. O’Neill, P. Hanna, X. Liu, D. Greer, and M. McTear. 2005. Implementing advanced spoken dialogue management in Java. Science of Computer Programming, 54(1):99-124.
• M. Walker, D. Litman, C. Kamm, and A. Abella. 1997. PARADISE: A general framework for evaluating spoken dialogue agents. ACL/EACL, pp. 271-280.

Page 55

Contents

• PART-I: Statistical Speech/Language Processing– Natural Language Processing – short intro– Automatic Speech Recognition– (Spoken) Language Understanding

• PART-II: Technology of Spoken Dialog Systems– Spoken Dialog Systems– Dialog Management– Dialog Studio– Information Access Dialog– Emotional & Context-sensitive Chatbot– Multi-modal Dialog– Conversational Text-to-Speech

• PART-III: Statistical Machine Translation– Statistical Machine Translation– Phrase-based SMT– Speech Translation

Page 56

• Motivation
– The biggest obstacle to using dialog systems in the field is that system maintenance is difficult.
– Practical dialog systems need:
– Easy and fast dialog modeling to handle new dialog patterns
– Easy integration of new information sources (e.g., a TV-guide domain needs a new TV schedule every day)
– Reduced human effort for maintenance: all dialog components should stay synchronized
– Easy tutoring of the system: semi-automatic learning is necessary, since humans cannot teach everything.

• Previous work
– Rapid application development: CSLU Toolkit [CSLU Toolkit]
– Schema design & management: SGStudio [Wang and Acero, 2005]
– Helping non-experts develop a user interface: SUEDE [Anoop et al., 2001]

Dialog Workbench/Studio

Page 57

Dialog Workbench

• Dialog Studio [Jung et al., 2006]
– A dialog workbench system for the example-based spoken dialog system. It can:
– Tutor the dialog system by adding and editing dialog examples
– Synchronize all dialog components: ASR + SLU + DM + information access
– Provide semi-automatic learning
– Reduce the human effort of building and maintaining dialog systems

– Key idea: generate possible dialog candidates from the corpus, predict the tagging information with the current model, and have a human approve or disapprove.

Page 58

Issue – Reducing Human Effort
• New dialog example tagging
– Can be supported by the system using the old models.
– The Dialog Utterance Pool (DUP) automatically generates candidate instances.
– The administrator can audit the DUP and modify the instances.
– ASR and SLU models are then retrained automatically.

[Diagram] A new dialog utterance is first handled by the old dialog manager; the result is displayed, and a human audits and modifies it. Accepted examples go into the Dialog Utterance Pool (automatically generated example candidates), which feeds dialog example editing, new corpus generation, and example-DB indexing; from these, the ASR model, SLU model, and example-based DM model are regenerated, and the system recommends annotations for the next round of audit and modification.

Page 59

POSTECH Dialog Studio Demo

Page 60

References (1/2)

• S. J. Cox, and S. Dasmahapatra. 2000. A semantically-based confidence measure for speech recognition. In Proc. of the ICSLP 2000, Beijing.

• J. Eun, C. Lee, and G. G. Lee. 2004. An information extraction approach for spoken language understanding. In: Proc. of the ICSLP, Jeju Korea.

• T. J. Hazen, J. Polifroni, and S. Seneff. 2002. Recognition confidence scoring and its use in speech language understanding systems. Computer Speech and Language, vol. 16, no. 1, pp. 49–67.

• T. J. Hazen, T. Burianek, J. Polifroni, and S. Seneff. 2000. Recognition confidence scoring for use in speech understanding systems. In Proc. of the ISCA ASR2000 Tutorial and Research Workshop, Paris.

• H. Jiang. 2005. Confidence measures for speech recognition. Speech Communication, vol. 45, no. 4, pp. 455–470.

• S. Jung, C. Lee, and G. G. Lee. 2006. Three Phase Verification for Spoken Dialog System. In Proc. IUI.

Page 61

References (2/2)

• M. McTear, I. O’Neill, P. Hanna, and X. Liu. 2005. Handling errors and determining confirmation strategies - an object-based approach. Speech Communication, vol. 45, no. 3, pp. 249–269.

• I. O’Neill, P. Hanna, X. Liu, D.Greer, and M. McTear. 2005. Implementing advanced spoken dialogue management in Java. Science of Computer Programming, vol. 54, no. 1, pp. 99–124.

• T. Paek, and E. Horvitz. 2000. Conversation as action under uncertainty. In Proc. of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 455-464.

• Ratnaparkhi, 1998. A Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. Dissertation. University of Pennsylvania.

• F. Torres, L.F. Hurtado, F.E Garcia, Sanchis, and E. Segarra. 2005. Error handling in a stochastic dialog system through confidence measures. Speech Communication, vol. 45, no. 3, pp. 211–229.

Page 62

References

• K. S. Anoop, R.K. Scott, J. Chen, A. Landay, and C. Chen, 2001. SUEDE: Iterative, Informal Prototyping for Speech Interfaces. Video poster in Extended Abstracts of Human Factors in Computing Systems: CHI, Seattle, WA, pp. 203-204.

• S. Jung, C. Lee, G. G. Lee. 2006. Dialog Studio: An Example Based Spoken Dialog System Development Workbench, Dialogs on dialog: Multidisciplinary Evaluation of Advanced Speech-based Interactive Systems, Interspeech2006-ICSLP satellite workshop

• Y. Wang, and A. Acero. 2005. SGStudio: Rapid Semantic Grammar Development for Spoken Language Understanding. Proceedings of the Eurospeech Conference. Lisbon, Portugal.

• CSLU Toolkit, http://cslu.cse.ogi.edu/toolkit/

Page 63

Contents

• PART-I: Statistical Speech/Language Processing– Natural Language Processing – short intro– Automatic Speech Recognition– (Spoken) Language Understanding

• PART-II: Technology of Spoken Dialog Systems– Spoken Dialog Systems– Dialog Management– Dialog Studio– Information Access Dialog– Emotional & Context-sensitive Chatbot– Multi-modal Dialog– Conversational Text-to-Speech

• PART-III: Statistical Machine Translation– Statistical Machine Translation– Phrase-based SMT– Speech Translation

Page 64

Information Access Dialog

[Diagram] The Dialog Manager turns the user's question into a query against the information sources and turns the returned result into an answer.

Page 65

Information Access Agent

[Diagram] The information access agent has two modules over its two information sources: an RDB access module over a relational database, and a question answering module over the Web.

Page 66

Building Relational DB from Unstructured Data

• A Relational DB Model is Equivalent to an Entity-Relationship Model

• We can build an ER Model with the Information Extraction Approach– Named-Entity Recognition (NER)– Relation Extraction


Page 67

• Named-Entity Recognition (NER)

– A task that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, etc. [Chinchor, 1998]

Hillary Clinton [Person] moved to New York [Geo-Political Entity] last year.
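As an illustration only, here is a toy gazetteer-based tagger for the example sentence. Real NER systems (e.g., the CRF-based approaches in the references) are trained on annotated corpora; the gazetteer below is a made-up assumption:

```python
# Toy dictionary-based named-entity tagger. The gazetteer is a made-up
# illustration; trained sequence models replace this in practice.
GAZETTEER = {
    "Hillary Clinton": "PERSON",
    "New York": "GPE",  # geo-political entity
}

def tag_entities(text):
    """Return (entity, label, start_offset) triples for gazetteer matches."""
    found = []
    for name, label in GAZETTEER.items():
        pos = text.find(name)
        if pos >= 0:
            found.append((name, label, pos))
    return sorted(found, key=lambda t: t[2])  # left-to-right order

print(tag_entities("Hillary Clinton moved to New York last year."))
```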

Page 68

Relation Extraction

• Relation Extraction

– A task that detects and classifies relations between named entities

Hillary Clinton [Person] moved to New York [Geo-Political Entity] last year. → AT.Residence relation between the two entities.

Page 69

Question Answering

• Question Answering System for the Information Access Dialog System
– SiteQ [Lee et al. 2001; Lee and Lee, 2002]
– Searches for answers, not documents

[Pipeline] Question → POS Tagging → Answer Type Identification → Query Formation → Document Retrieval → Dynamic Answer Passage Selection → Answer Finding (guided by the identified answer type) → Answer Justification → Answer

Page 70

References (1/2)

• C. Blaschke, L. Hirschman, and A.Yeh. 2004. BioCreative Workshop.• N. Chinchor. 1998. Overview of MUC-7/MET-2, MUC-7.• N. Kambhatla. 2004. Combining lexical, syntactic and semantic features with

Maximum Entropy models for extracting relations. ACL.• E. Kim, Y. Song, C. Lee, K. Kim, G. G. Lee, B. Yi, and J. Cha. 2006. Two-

phase learning for biological event extraction and verification. ACM TALIP 5(1):61-73

• J. Kim, T. Ohta, Y. Tsuruoka, and Y. Tateisi. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining, Bioinformatics, Vol 19 Suppl.1, pp. 180-182.

• J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labelling sequence data. ICML.

• G. G. Lee, J. Seo, S. Lee, H. Jung, B. H. Cho, C. Lee, B. Kwak, J. Cha, D. Kim, J. An, H. Kim, and K. Kim. 2001. SiteQ: Engineering High Performance QA system Using Lexico-Semantic Pattern Matching and Shallow NLP. TREC-10.

Page 71

References (2/2)

• S. Lee, and G. G. Lee. 2002. SiteQ/J: A question answering system for Japanese. NTCIR workshop 3 meeting: evaluation of information retrieval, automatic text summarization and question answering, QA tasks.

• A. McCallum, and W. Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, CoNLL.

• S. Soderland. 1999. Learning information extraction rules for semi-structured and free text. Machine Learning, 34, 233-72

• Y. Song, E. Kim, G. G. Lee, and B. Yi. 2005. POSBIOTM-NER: a trainable biomedical named-entity recognition system. Bioinformatics, 21 (11): 2794-2796.

• G. Zhou, J. Su, J. Zhang, M. Zhang. 2005. Exploring Various Knowledge in Relation Extraction. ACL.

Page 72

Contents

• PART-I: Statistical Speech/Language Processing– Natural Language Processing – short intro– Automatic Speech Recognition– (Spoken) Language Understanding

• PART-II: Technology of Spoken Dialog Systems– Spoken Dialog Systems– Dialog Management– Dialog Studio– Information Access Dialog– Emotional & Context-sensitive Chatbot– Multi-modal Dialog– Conversational Text-to-Speech

• PART-III: Statistical Machine Translation– Statistical Machine Translation– Phrase-based SMT– Speech Translation

Page 73

POSTECH Chatbot Demo

Page 74

Emotion Recognition

• Emotion Recognition

• Why is emotion recognition important in dialog systems?
– Emotion is a part of the user context.

– It has been recognized as one of the most significant factors in how people communicate with each other [T. Polzin, 2000].

– Applications: affective HCI (human-computer interaction) — home networking, intelligent robots, chatbots, …

“What's up?” — “I feel blue today.” — “Do you need a cheer-up music?”

Page 75

Traditional Emotion Recognition

[Diagram] USER: I am very happy.
The user's facial expression, speech, and text are processed by facial expression analysis, speech analysis, and linguistic analysis; a classifier combines the results for the final emotion decision and outputs an emotion hypothesis.

Page 76

Emotional Categories

• Emotional categories by system type

System | Categories
Emotional Speech DB | Positive (confident, encouraging, friendly, happy, interested); Negative (angry, anxious, bored, frustrated, sad, fear); Neutral — e.g., EPSaT (Emotional Prosody Speech and Transcription), SiTEC DB
Call Center | Positive vs. non-positive; anger, fear, satisfaction, excuse, neutral — e.g., HMIHY, stock exchange customer service center
Tutor System | Positive, negative, neutral — e.g., ITSpoke
Chat Messenger | Neutral, happy, sad, surprise, afraid, disgusted, bored, …

Page 77

Emotional Features

• Speech-to-Emotion
– Acoustic correlates of prosody — such as pitch, energy, and speaking rate of the utterance — have been used for recognizing emotions.
– In general, the features extracted from speech play a significant role in recognizing emotion.

Feature-Set | Description
Acoustic-Prosodic | fundamental frequency (f0) and energy: max, min, mean, standard deviation; speaking rate: voiced frames / total frames
Pitch Contour | ToBI contour, nuclear pitch accent, phrase and boundary tones
Voice Quality | spectral tilt

Page 78

Emotional Features

• Text-to-Emotion
– Basic idea: people tend to use specific words to express their emotions in spoken dialogs, because they have learned how certain words relate to the corresponding emotions.

– Psychologists have tried to identify the language of emotions by asking people to list the English words that describe specific emotions.
– They identified emotional keywords in spoken language; these are highly domain dependent.

Feature-Set | Description
Lexical | n-grams (unigram, bigram, trigram); non-speech human noise (laughter, sighs); filled pauses (“oh”); emotional keywords
Pragmatic | dialog act, user identifier
Context | observations from previous user turns

Page 79

Classifier

• Basic algorithm of the emotion recognizer
– Very similar to text categorization and topic detection: most emotion detection systems use the same basic algorithms as those tasks.

• Emotional keyword extraction
– Emotional salience, computed with mutual information
– TF·IDF with cosine normalization

• Emotion classifier
– Feature-based classification combining several features, using statistical classifiers:
– linear discriminant classifier (LDC), k-nearest neighbor (k-NN), support vector machine (SVM)

TF·IDF with cosine normalization:

w_{i,j} = tf_{i,j} · log(N / n_i) / sqrt( Σ_{i=1}^{t} [tf_{i,j} · log(N / n_i)]² )

Emotional salience of word w_n:

sal(w_n) = I(E; W = w_n) = Σ_{j=1}^{k} P(e_j | w_n) · i(w_n, e_j)
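The salience statistic can be sketched with made-up counts; the word, counts, and class priors below are hypothetical:

```python
# Sketch of emotional salience: sal(w) = sum_j P(e_j|w) * i(w, e_j), where
# i(w, e_j) = log( P(e_j|w) / P(e_j) ) is the pointwise mutual information.
# The counts and priors are made-up illustrations.
import math

counts = {"positive": 8, "negative": 2}      # c(w, e_j) for the word "happy"
priors = {"positive": 0.5, "negative": 0.5}  # P(e_j)

total = sum(counts.values())
salience = 0.0
for e, c in counts.items():
    p_e_given_w = c / total                      # P(e_j | w)
    salience += p_e_given_w * math.log(p_e_given_w / priors[e])
print(salience)
```

A word used evenly across classes would get salience near zero; this skewed word scores clearly above it.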

Page 80

Advanced ChatBot

• Context-Sensitive ChatBot
– Most chatbots cannot reflect the discourse context of previous user turns [ALICE, 2002; B. Shawar, 2002].

Context-Free ChatBot (obscure answering):
USER: Did you watch the Korean team's game yesterday?
SYSTEM: What game?
USER: How come you didn't know that?
SYSTEM: Well, I don't know.

Context-Sensitive ChatBot (more intelligent answering):
USER: Did you watch the Korean team's game yesterday?
SYSTEM: What game?
USER: How come you didn't know that?
SYSTEM: I am not interested in sports.

Which is better?

Page 81

Discourse History in ChatBot

Utterance 1 — USER: I had a date last week. (Dialog Act = statement-non-opinion, Main Goal = perform, Agent = chat, Domain = friend, date = last week, event = date) SYSTEM: Did you like her?
Utterance 2 — USER: Well, I think so, but… (Dialog Act = statement-opinion, Main Goal = judge, Agent = chat, Domain = friend) SYSTEM: But what?
Utterance 3 — USER: I don't know her real feeling. (Dialog Act = statement-opinion, Main Goal = know, Agent = chat, Domain = friend, theme = real feeling) SYSTEM: Why? What happened?

Context-free index keys record only the current turn, e.g. for utterance 2: DA = statement-opinion, MA = judge, DOMAIN = friend.

Context-sensitive index keys add an abstraction of the previous user turn, e.g. for utterance 2: Previous Semantics = “statement-non-opinion, perform”, Previous Keyword = “date”, Scenario Session = “2”, DA = statement-opinion, MA = judge, DOMAIN = friend.

Discourse coherence:

P(DA_t, MA_t | DA_{t−1}, MA_{t−1}) = P(statement-opinion, judge | statement-non-opinion, perform) > θ
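The discourse-coherence condition can be sketched as a thresholded bigram lookup; the probability table and threshold θ are hypothetical:

```python
# Sketch of the discourse-coherence check: the current (dialog act, main
# action) pair must be sufficiently probable given the previous pair.
# The bigram table and the threshold are made up for illustration.
P_NEXT = {  # P(DA_t, MA_t | DA_{t-1}, MA_{t-1})
    ("statement-non-opinion", "perform"): {
        ("statement-opinion", "judge"): 0.35,
        ("statement-opinion", "know"): 0.10,
    },
}
THETA = 0.05

def coherent(prev, cur):
    """True when the current pair is a probable continuation of the previous one."""
    return P_NEXT.get(prev, {}).get(cur, 0.0) > THETA

prev = ("statement-non-opinion", "perform")
print(coherent(prev, ("statement-opinion", "judge")))        # seen continuation
print(coherent(prev, ("wh-question", "search_start_time")))  # unseen pair
```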

Page 82

References

• ALICE. 2002. A.L.I.C.E, A.I. Foundation. http://www.alicebot.org/• L. Holzman and W. Pottenger, 2003. Classification of Emotions in Internet

Chat: An Application of Machine Learning Using Speech Phonemes, Technical Report LU-CSE-03-002, Lehigh University.

• J. Liscombe, 2006. Detecting and Responding to Emotion in Speech: Experiments in Three Domains, Ph.D. Thesis Proposal, Columbia University

• D. Litman and K. Forbes-Riley, 2005. Recognizing student emotions and attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors, Speech Communication, 48(5):559-590.

• C. M. Lee and S. S. Narayanan. 2005. Toward Detecting Emotions in Spoken Dialogs, IEEE Transactions on Speech and Audio Processing, 13(2):293-303.

• T. Polzin and A. Waibel. 2000. Emotion-sensitive human-computer interfaces. the ISCA Workshop on Speech and Emotion.

• B. Shawar and E. Atwell, 2002. A comparison between Alice and Elizabeth chatbot systems. School of Computing Research Report, University of Leeds

• X. Zhe and A. Boucouvalas, 2002. Text-to-Emotion Engine for Real Time Internet Communication, CSNDSP.

Page 83

Contents

• PART-I: Statistical Speech/Language Processing– Natural Language Processing – short intro– Automatic Speech Recognition– (Spoken) Language Understanding

• PART-II: Technology of Spoken Dialog Systems– Spoken Dialog Systems– Dialog Management– Dialog Studio– Information Access Dialog– Emotional & Context-sensitive Chatbot– Multi-modal Dialog– Conversational Text-to-Speech

• PART-III: Statistical Machine Translation– Statistical Machine Translation– Phrase-based SMT– Speech Translation

Page 84

POSTECH multimodal Dialog System Demo

Page 85

Multi-Modal Dialog

• Task performance and user preference for multi-modal over speech interfaces [Oviatt et al., 1997]

– 10% faster task completion,– 23% fewer words,– 35% fewer task errors,– 35% fewer spoken disfluencies

“What is a decent Japanese restaurant near here?”

Hard to represent using a single modality!

Page 86

Multi-Modal Dialog

• Components of a multi-modal dialog system [Chai et al., 2002]

[Diagram] Speech, gesture, and facial expression each pass through uni-modal understanding (spoken language understanding, gesture understanding, …), producing uni-modal interpretation frames. The multimodal integrator and discourse understanding then perform multi-modal understanding and reference analysis, producing a multi-modal interpretation frame for the dialog manager.


References (1/2)

• R. A. Bolt, 1980, “Put that there: Voice and gesture at the graphics interface,” Computer Graphics Vol. 14, no. 3, 262-270.

• J. Chai, S. Pan, M. Zhou, and K. Houck, 2002, Context-based Multimodal Understanding in Conversational Systems. Proceedings of the Fourth International Conference on Multimodal Interfaces (ICMI).

• J. Chai, P. Hong, and M. Zhou, 2004, A Probabilistic Approach to Reference Resolution in Multimodal User Interfaces. Proceedings of 9th International Conference on Intelligent User Interfaces (IUI-04), 70-77.

• J. Chai, Z. Prasov, J. Blaim, and R. Jin., 2005, Linguistic Theories in Efficient Multimodal Reference Resolution: an Empirical Investigation. Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 43-50.

• P.R. Cohen, M. Johnston, D.R. McGee, S.L. Oviatt, J.A. Pittman, I. Smith, L. Chen, and J. Clow, 1997, "QuickSet: Multimodal Interaction for Distributed Applications," Intl. Multimedia Conference, 31-40.


References (2/2)

• H. Holzapfel, K. Nickel, and R. Stiefelhagen. 2004. Implementation and Evaluation of a Constraint-Based Multimodal Fusion System for Speech and 3D Pointing Gestures. Proceedings of the International Conference on Multimodal Interfaces (ICMI).

• M. Johnston. 1998. Unification-based multimodal parsing. Proceedings of the International Joint Conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics, 624-630.

• M. Johnston, and S. Bangalore. 2000. Finite-state multimodal parsing and understanding. Proceedings of COLING-2000.

• M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor. 2002. MATCH: An architecture for multimodal dialogue systems. In Proceedings of ACL-2002.

• S. L. Oviatt , A. DeAngeli, and K. Kuhn, 1997, Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of Conference on Human Factors in Computing Systems: CHI '97.




POSTECH Conversational TTS Demo: Korean (Dialog)


Conversational Text-to-Speech

• Text-to-speech system [M. Beutnagel, et al., 1999; J. Schroeter, 2005]
  – Front end
    – Text normalization: take raw text and convert things like numbers and abbreviations into their written-out word equivalents
    – Linguistic analysis: POS-tagging, grapheme-to-phoneme conversion
    – Prosody generation: pitch, duration, intensity, pause
  – Back end
    – Unit selection: select the most similar units in the speech DB to produce the actual sound output

  Text → Text Normalization → Linguistic Analysis → Prosody Generation → (symbolic linguistic representation) → Synthesis Back-end (Unit Selection) → Speech
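The front-end text-normalization step can be sketched as a simple table-driven pass. This is a minimal illustration; the abbreviation and digit tables below are hypothetical, not from the tutorial's system:

```python
# Minimal text-normalization sketch: expand digits and common
# abbreviations into their written-out word equivalents before
# linguistic analysis.  The tables are hypothetical toy data.
ABBREVIATIONS = {"Dr.": "doctor", "St.": "street"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text: str) -> str:
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            # spell out each digit; a real system would verbalize numbers
            words.extend(DIGITS[d] for d in token)
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith lives at 42 Main St."))
# → doctor Smith lives at four two Main street
```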


Multilingual Grapheme-to-Phoneme Conversion

• Given an alphabet of spelling symbols (graphemes) and an alphabet of phonetic symbols (phonemes), a mapping must be achieved that transliterates strings of graphemes into strings of phonemes [W. Daelemans, et al., 1996]

• Alignment: each grapheme is aligned to one phoneme, with "_" for an empty slot, e.g.

  Graphemes: ㅎ ㅏ ㄱ ㄱ ㅛ _ ㅇ ㅔ
  Phonemes:  h  a  g  g  yo _ _  e

• Rule generation: Alignment → Rule extraction → Rule pruning → Rule association (with a Dictionary)
• G2P conversion: Input text → Text normalizer → canonical form of graphemes → Phonemes
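A toy version of the lookup done in the conversion stage might look like this. The grapheme-to-phoneme rule table is a hypothetical illustration, not the extracted rule set described in the slide:

```python
# Toy grapheme-to-phoneme conversion: longest-match lookup in a rule
# table that would normally be learned from aligned data.
# The rule table here is hypothetical.
RULES = {"ph": "f", "sh": "S", "th": "T", "a": "a", "e": "e",
         "o": "o", "s": "s", "t": "t", "p": "p", "n": "n"}

def g2p(graphemes: str) -> list:
    phonemes, i = [], 0
    while i < len(graphemes):
        # prefer the longest grapheme chunk that has a rule
        for size in (2, 1):
            chunk = graphemes[i:i + size]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += size
                break
        else:
            i += 1  # unknown grapheme: skip (a real system would back off)
    return phonemes

print(g2p("phone"))   # longest-match fires on "ph" → ['f', 'o', 'n', 'e']
```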


Break Index Prediction

• Predict the break index from a POS-tagged/syntax-analyzed sentence
• Break index [J. Lee, et al., 2002]
  – No break: phrase-internal word boundary and a juncture smaller than a word boundary
  – Minor break: minimal phrasal juncture such as an AP (accentual phrase) boundary
  – Major break: a strong phrasal juncture such as an IP (intonational phrase) boundary

• Probabilistic break index prediction: a trigram model (wtag wtag break wtag) tags the POS tag sequence with break indices, and a C4.5 decision tree then corrects errors in the break-index-tagged sequence


Pitch Prediction using K-ToBI

• Uses C4.5 (decision tree)
• Assumes that linguistic and lexical information influence the tone of a syllable
• IP tone label prediction [K. E. Dusterhoff, et al., 1999]
  – Assign one tone among "L%", "H%", "LH%", "HL%", "LHL%" and "HLH%" to the last syllable of the IP
  – Features: POS, punctuation type, the length of the phrase, onset, nucleus, coda
• AP tone label prediction
  – Assign either an "L" or an "H" tone to each syllable of the AP
  – Features: POS, the length of the phrase, the location in the prosodic phrase


Unit Selection

• Index of units: pitch, duration, position in syllable, neighboring phones
• Half-diphone synthesis [A. J. Hunt, 1996; A. Conkie, 1999]
  – The diphone cuts the units at points of relative stability (the center of a phonetic realization), rather than at the volatile phone-phone transition, where so-called coarticulatory effects appear.


References (1/2)

• M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal. 1999. The AT&T Next-Gen TTS System. Joint Meeting of ASA, EAA, and DAGA.

• A. Conkie. 1999. Robust Unit Selection System for Speech Synthesis. Joint Meeting of ASA, EAA, and DAGA.

• W. Daelemans. 1996. Language-Independent Data-Oriented Grapheme-to-Phoneme Conversion. Progress in Speech Synthesis, Springer Verlag, pp77-90.

• K. E. Dusterhoff, A. W. Black, and P. Taylor. 1999. Using decision trees within the tilt intonation model to predict f0 contours. Eurospeech-99.

• A. J. Hunt, and A. W. Black. 1996. Unit Selection in a concatenation speech synthesis system using a large speech database. ICASSP-96, vol. 1, pp 373-376.


References (2/2)

• S. Kim. 2000. K-ToBI (Korean ToBI) Labelling Conventions. UCLA Working Papers in Phonetics 99.

• S. Kim, J. Lee, B. Kim, and G. G. Lee. 2006. Incorporating Second-Order Information Into Two-Step Major Phrase Break Prediction for Korean. ICSLP-06.

• J. Lee, B. Kim, and G. G. Lee. 2002. Automatic Corpus-based Tone and Break-Index Prediction using K-ToBI Representation. ACM Transactions on Asian Language Information Processing (TALIP), Vol. 1, Issue 3, pp. 207-224.

• J. Lee, S. Kim, and G. G. Lee. 2006. Grapheme-to-Phoneme Conversion Using Automatically Extracted Associative Rules for Korean TTS System. ICSLP-06.

• J. Schroeter. 2005. Electrical Engineering Handbook, pp. 16(1)-16(12).




Statistical Machine Translation
POSTECH Statistical MT System Demo

Korean-English
Japanese-Korean
Speech-to-Speech


SMT Task

• SMT: Statistical Machine Translation
• Task:
  – Translate a sentence in one language into another language
  – using statistical features of data

  나는 생각한다, 고로 나는 존재한다.  →  I think thus I am.

  P(I | 나는) = 0.7, P(me | 나는) = 0.2, …
  P(think | 생각하다) = 0.5, P(think | 생각) = 0.4, …
  …


The Machine Translation Pyramid

  Interlingua
  Foreign semantics ↔ Native semantics
  Foreign syntax ↔ Native syntax
  Foreign sentence ↔ Native sentence

• An interlingua-based system requires syntactic analysis, semantic analysis, language generation, and so on: that is, all other NLP techniques and linguistics.


SMT in the Machine Translation Pyramid

• A statistical system operates directly at the bottom of the pyramid: foreign sentence → native sentence
• It requires nothing but data and statistics
• It does not require any other NLP techniques or linguistics


Statistical Model

• Statistical modeling (noisy channel view: Korean → Broken English → English)
  – Korean-English parallel text → statistical analysis → Translation model P(k|e)
  – English text → statistical analysis → Language model P(e)


Statistical Model

• Fundamental models
  – Language model: makes the English output fluent
  – Translation model: makes the translation correct
  – Decoding algorithm: finds the best sentence

  e_best = argmax_e P(e|k) = argmax_e P(k|e) · P(e)

  (Input → Decoding Algorithm, guided by the Translation Model and the Language Model → Output)
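The noisy-channel decision rule on this slide can be sketched in a few lines. The probability tables below are hypothetical toy numbers, not real model estimates:

```python
# Noisy-channel decoding sketch: pick the English sentence e that
# maximizes P(k|e) * P(e) for a Korean input k.
# Both tables are hypothetical toy values.
translation_model = {                    # P(k|e)
    ("나는 생각한다", "I think"): 0.5,
    ("나는 생각한다", "me thought"): 0.1,
}
language_model = {                       # P(e)
    "I think": 0.4,
    "me thought": 0.05,
}

def decode(k: str) -> str:
    candidates = [e for (kk, e) in translation_model if kk == k]
    return max(candidates,
               key=lambda e: translation_model[(k, e)] * language_model[e])

print(decode("나는 생각한다"))   # "I think" wins: 0.5*0.4 > 0.1*0.05
```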


Translation Model

• Gives a probability to a word/phrase pair
  – For a given word/phrase, list all the possible translations
  – Give a good translation a high probability
  – Give a poor translation a low probability

• Independence assumption
  – Word translations are independent of one another
  – Probability of a sentence translation = product of the word translation probabilities:

  P(K|E) = Π_i P(k_i|e_i)
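Under the independence assumption, the sentence probability is just a product over word pairs. A minimal sketch, assuming a one-to-one alignment and hypothetical word probabilities:

```python
import math

# Sentence translation probability under the independence assumption:
# the product of per-word translation probabilities.
# The word probabilities are hypothetical toy values.
word_prob = {("나는", "I"): 0.7, ("생각한다", "think"): 0.4}   # P(k|e)

def sentence_prob(korean: list, english: list) -> float:
    # one-to-one alignment assumed for this illustration
    return math.prod(word_prob[(k, e)] for k, e in zip(korean, english))

p = sentence_prob(["나는", "생각한다"], ["I", "think"])
print(round(p, 2))   # 0.7 * 0.4 = 0.28
```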


Decoding

• Search space: exponential in the length of the sentence
  – Pruning reduces the search space
  – Threshold pruning & beam search algorithms

• Hypothesis expansion (n: new English word, f: coverage of foreign words, P: score):
  – No word translated:   n: –,     f: -----, P: 1.0
  – One word translated:  n: I,     f: *----, P: 0.5   and   n: think, f: -*---, P: 0.4
  – Two words translated: n: think, f: **---, P: 0.25  and   n: am,    f: *---*, P: 0.13
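The pruning idea can be sketched as a stack-based beam search that expands partial hypotheses and keeps only the best few per step. The expansion table and scores below are hypothetical toy values:

```python
# Beam-search pruning sketch: expand partial hypotheses word by word
# and keep only the `beam` highest-scoring candidates per stack.
def beam_search(expansions, start, beam=2, steps=2):
    """expansions maps a hypothesis to a list of (next_hyp, prob) pairs."""
    stack = [(start, 1.0)]
    for _ in range(steps):
        grown = [(nxt, p * q)
                 for hyp, p in stack
                 for nxt, q in expansions.get(hyp, [])]
        # pruning: keep only the `beam` best hypotheses
        stack = sorted(grown, key=lambda x: -x[1])[:beam]
    return stack

toy = {                               # hypothetical expansion table
    "": [("I", 0.5), ("think", 0.4), ("am", 0.1)],
    "I": [("I think", 0.5), ("I am", 0.26)],
    "think": [("think I", 0.3)],
}
print(beam_search(toy, ""))   # best 2-word hypothesis is "I think" (0.25)
```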


Evaluation

• BLEU score
  – Most famous metric
  – Range 0-1
  – Higher score means better translation

  BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n )

  BP = 1             if c > r
  BP = e^(1 − r/c)   if c ≤ r

  – BP: factor related to the length of the candidate translation (brevity penalty)
  – p_n: n-gram precision, ignoring duplicate counts
  – N: maximum order of n-gram
  – w_n: weight
  – c: length of the candidate translation
  – r: length of the reference sentence
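The BLEU formula can be sketched directly for a single reference with uniform weights w_n = 1/N. A minimal illustration; real BLEU implementations handle multiple references and smoothing:

```python
import math
from collections import Counter

# BLEU sketch (single reference, uniform weights): clipped n-gram
# precision combined with the brevity penalty defined above.
def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    c, r = len(candidate), len(reference)
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clip counts so duplicate candidate n-grams are not over-counted
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        log_sum += (1.0 / max_n) * math.log(overlap / max(sum(cand.values()), 1))
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(log_sum)

ref = "I think thus I am".split()
print(round(bleu(ref, ref), 2))   # identical sentences score 1.0
```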




IBM Model

• Model 1
  – The source length depends only on the target length
  – Assumes a uniform probability for position alignment
  – A source word depends only on its aligned word
• Model 2
  – The target position depends on the source position
• Model 3
  – Adds a fertility model
• Model 4
  – Models the re-ordering of phrases
  – Deficient: the alignment can generate source positions outside the sentence length
• Model 5
  – Removes the deficiency of Model 4


GIZA++

• GIZA
  – Part of the SMT toolkit EGYPT
  – A word alignment tool
  – An implementation of IBM Model 4
• GIZA++
  – An extension of GIZA
  – Adds Model 5, the HMM alignment model, …


Phrase-based SMT

• Pharaoh [Philipp Koehn, 2003]
  – An implementation of statistical phrase-based machine translation
  – Phrase:
    – Not a syntactic phrase
    – A sequence of contiguous words
  – SMT, but the translation unit is the phrase


Pharaoh Overview

• Based on the noisy channel model (typical SMT)
• Language model p(e) is replaced with

  p_LM(e) = p(e) · ω^length(e)

  – A word cost ω is introduced to adjust the output length
  – ω > 1: prefer longer translations
  – ω = 1: indifferent to length
  – ω < 1: prefer shorter translations
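The effect of the word cost ω can be demonstrated with two candidate outputs. The sentence probabilities here are hypothetical toy numbers:

```python
# Word-cost sketch: rescore a language-model probability with
# omega ** length(e) to bias the output toward longer or shorter
# sentences.  The probabilities are hypothetical toy values.
def lm_with_word_cost(p_e: float, length: int, omega: float) -> float:
    return p_e * omega ** length

short = ("I think", 0.4)             # 2 words, higher raw LM score
long_ = ("I think thus I am", 0.2)   # 5 words, lower raw LM score

for omega in (0.8, 1.5):
    s = lm_with_word_cost(short[1], len(short[0].split()), omega)
    l = lm_with_word_cost(long_[1], len(long_[0].split()), omega)
    print(omega, "prefers the", "longer" if l > s else "shorter", "output")
# omega = 0.8 prefers the shorter output; omega = 1.5 prefers the longer one
```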


Pharaoh Overview

• Translation model p(k|e) is replaced with

  p(k_1^I | e_1^I) = Π_{i=1}^{I} φ(k_i | e_i) · d(a_i − b_{i−1})

• The input sentence is segmented into a sequence of I phrases k_1^I
  – Translation occurs phrase by phrase (k_i → e_i)
  – Each phrase translation is assumed independent
  – A distortion probability d() is introduced:

  d(a_i − b_{i−1}) = α^|a_i − b_{i−1} − 1|

  – a_i: start position of the foreign phrase that was translated into the ith English phrase
  – b_{i−1}: end position of the foreign phrase that was translated into the (i−1)th English phrase
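The distortion probability can be computed directly from the formula: monotone phrase order costs nothing, and jumps are penalized exponentially. α = 0.5 is a hypothetical value:

```python
# Distortion-probability sketch: d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1} - 1|
# Monotone phrase order (a_i == b_{i-1} + 1) gives the exponent 0;
# reordering jumps are penalized.  alpha is a hypothetical value.
def distortion(a_i: int, b_prev: int, alpha: float = 0.5) -> float:
    return alpha ** abs(a_i - b_prev - 1)

print(distortion(4, 3))   # monotone step: 0.5 ** 0 = 1.0
print(distortion(7, 3))   # jump of 3:     0.5 ** 3 = 0.125
```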


Pharaoh Training

• Alignment, intersection and union
  – Run GIZA++ in both directions on the sentence pair "생맥주 한 잔 주 세요 ." ↔ "A Draft Beer , Please ."
    – The K→E and E→K runs each produce one alignment matrix
  – Intersection: keep only the alignment points found in both directions (high precision)
  – Union: grow the intersection toward the union with a heuristic, recovering uncertain alignment points (e.g. "Beer ?", "Please ?")


Pharaoh Training

• Learn all phrase pairs that are consistent with the word alignment (for the aligned pair "생맥주 한 잔 주 세요 ." ↔ "A Draft Beer , Please ."):

• (A Draft | 생맥주) (Beer | 한 잔) (, | 주) (Please | 세요) (. | .)
• (A Draft Beer | 생맥주 한 잔) (Beer , | 한 잔 주) (, Please | 주 세요) (Please . | 세요 .)
• (A Draft Beer , | 생맥주 한 잔 주) (Beer , Please | 한 잔 주 세요) (, Please . | 주 세요 .)
• (A Draft Beer , Please | 생맥주 한 잔 주 세요) (Beer , Please . | 한 잔 주 세요 .)
• (A Draft Beer , Please . | 생맥주 한 잔 주 세요 .)
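The consistency check behind phrase extraction can be sketched over alignment points: a source span and the target span its links cover form a phrase pair only if no link escapes the box. The alignment below is a hypothetical three-word monotone example:

```python
# Phrase-pair extraction sketch: enumerate source spans and keep a
# pair when its alignment points are consistent (no link leaves the
# box).  Alignment points are (source_index, target_index) tuples.
def extract_phrases(alignment, src_len, tgt_len, max_len=3):
    pairs = []
    for ks in range(src_len):
        for ke in range(ks, min(ks + max_len, src_len)):
            # target positions linked to the source span [ks, ke]
            t_pos = [t for (s, t) in alignment if ks <= s <= ke]
            if not t_pos:
                continue
            ts, te = min(t_pos), max(t_pos)
            # consistency: no target word in [ts, te] links outside [ks, ke]
            if all(ks <= s <= ke for (s, t) in alignment if ts <= t <= te):
                pairs.append(((ks, ke), (ts, te)))
    return pairs

# hypothetical monotone alignment: 0-0, 1-1, 2-2
print(extract_phrases([(0, 0), (1, 1), (2, 2)], 3, 3))
```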


Techniques to improve

• Pre-processing
  – Normalize the input text into an "easy to translate" form
  – Reordering, tagging, paraphrasing, …

  Foreign → Normalization → Normalized Foreign → Translation → Native


Techniques to improve

• Post-processing
  – The translation may contain some errors
  – Perform error-correction decoding
  – Correct some trivial errors
    – e.g. a morpheme connectivity check

  Foreign → Translation → Native with errors → Error correction → Native


Add POS tag

• Approach
  – Add part-of-speech (POS) tags to the training data
• Effect
  – Distinguishes some of the homonyms
  – Changes the spacing unit
• Why useful?
  – For many languages, automatic POS tagging is available
  – The spacing unit becomes a unit of meaning


Delete Useless words

• Approach
  – For some language pairs, there are words that are useless for translation
  – Delete the useless words to help word alignment
• Effects
  – Reduces the number of misaligned pairs
• Example: Korean-English translation
  – English: the, a, an, -es
    – Korean tends not to distinguish number in nouns
  – Korean: some kinds of post-positions (은, 는, 이, 가, 을, 를, …)
    – English does not have case-markers
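This pre-processing step amounts to filtering tokens against a language-specific stop list before alignment. The stop lists below are illustrative, not the tutorial's actual lists:

```python
# Pre-processing sketch: strip words that have no counterpart in the
# other language before word alignment.  Stop lists are illustrative.
ENGLISH_USELESS = {"the", "a", "an"}
KOREAN_USELESS = {"은", "는", "이", "가", "을", "를"}

def strip_useless(tokens, useless):
    return [t for t in tokens if t not in useless]

print(strip_useless("I drank a draft beer".split(), ENGLISH_USELESS))
# article "a" removed before alignment → ['I', 'drank', 'draft', 'beer']
```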


Using Dictionary

• Approach
  – Just append the dictionary to the end of the parallel corpus
• Effects
  – Adds one count for each correct phrase pair in the dictionary
  – Increases vocabulary coverage
• Why useful?
  – Usually a dictionary is easily accessible
    – Already built for the web or other applications
  – Adding a dictionary gives a significant improvement


Dividing Language Model

• Train the translation model P(k|e) from Korean/English bilingual text, as in the basic noisy channel setup (Korean → Broken English → English)
• Train more than one language model (Language Model 1, Language Model 2) from different English texts
• At translation time, select which language model to use




Speech Translation

• ASR
  – Automatic Speech Recognizer
  – Generates text from a given speech signal
• TTS
  – Text-To-Speech
  – Synthesizes the sound of a given text
• Speech translation task
  – Translate a speech signal in one language into another language
  – By combining ASR, TTS and machine translation


Combining ASR, TTS and SMT

• Cascading approach
  – Connect ASR, SMT and TTS in a cascading manner
  – The ASR result becomes the input to the SMT system
  – The translation result from the SMT system becomes the input to the TTS system
  – Simple!

  Original Speech → ASR → Recognized Text → SMT → Translated Text → TTS → Translated Speech
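The cascade is literally function composition. The three stage functions below are stand-in stubs (hypothetical), showing only how each output feeds the next input:

```python
# Cascading speech-translation sketch: ASR → SMT → TTS.
# All three stages are hypothetical stubs; a real system would call
# actual recognition, translation and synthesis engines.
def asr(speech_signal: bytes) -> str:
    return "생맥주 한 잔 주세요"            # stub: recognized text

def smt(korean_text: str) -> str:
    return "A draft beer, please"          # stub: translated text

def tts(english_text: str) -> bytes:
    return english_text.encode("utf-8")    # stub: synthesized audio

def translate_speech(speech_signal: bytes) -> bytes:
    return tts(smt(asr(speech_signal)))    # the cascade itself

print(translate_speech(b"\x00\x01"))   # b'A draft beer, please'
```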


Combining ASR, TTS and SMT

• N-best rescoring approach
  – ASR produces an n-best list (ASR Result 1 … ASR Result n) for the input speech signal
  – The SMT system translates each ASR result (SMT Result 1 … SMT Result n)
  – Both the recognition and the translation are scored; the highest-score translation is passed to TTS to produce the output speech signal
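The n-best rescoring idea can be sketched by translating every ASR hypothesis and keeping the highest-scoring translation. The n-best list, translations and scores below are hypothetical toy values:

```python
# N-best rescoring sketch: translate each ASR hypothesis and return
# the translation with the best combined score.  All data is toy.
def rescore_nbest(asr_nbest, translate, score):
    best = max(asr_nbest, key=lambda hyp: score(translate(hyp), hyp))
    return translate(best)

asr_nbest = ["생맥주 한 잔", "생맥주 한 잔 주세요"]          # toy n-best list
translations = {"생맥주 한 잔": "a draft beer",
                "생맥주 한 잔 주세요": "a draft beer, please"}
scores = {"a draft beer": 0.3, "a draft beer, please": 0.6}

result = rescore_nbest(asr_nbest,
                       translate=lambda h: translations[h],
                       score=lambda t, h: scores[t])
print(result)   # "a draft beer, please"
```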


References (1/2)

• P.F. Brown, S.A. Della Pietra, V.J. Della Pietra and R.L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2, pages 263-311.

• C. Callison-Burch, P. Koehn and M. Osborne. 2006. Improved Statistical Machine Translation Using Paraphrases. In Proceedings of NAACL.

• M. Collins, P. Koehn, and I. Kucerova. 2005. Clause Restructuring for Statistical Machine Translation. ACL.

• P. Koehn, F.J. Och and D. Marcu. 2003. Statistical Phrase-Based Translation. In proceedings of HLT, Pages 127-133.

• P. Koehn. 2004. Pharaoh: a Beam Search Decoder for Phrase-Based SMT. In Proceedings of AMTA pages 115-124.

• P. Koehn. 2004. Pharaoh, User Manual and Description for Version 1.2. http://www.isi.edu/licensed-sw/pharaoh/.


References (2/2)

• P. Koehn, A. Axelrod, A. Birch Mayne, C. Callison-Burch, M. Osborne and D. Talbot. 2005. Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation. IWSLT.

• J. Lee, D. Lee, and G. G. Lee. 2006. Improving phrase-based Korean-English statistical machine translation. ICSLP-06

• F.J. Och and H. Ney. 2000. Improved Statistical Alignment Models. 38th Annual Meeting of the ACL, pages 440-447.

• K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. 40th Annual Meeting of the ACL, pages 311-318. Philadelphia, PA, Jul.

• R. Zhang, G. Kikui. 2006. Integration of Speech Recognition and Machine Translation: Speech Recognition word Lattice Translation. Speech Communication, Vol. 48, Issues 3-4.


Thanks To

• Minwoo Jung
• Cheongjae Lee
• SangKeun Jung
• Seungwon Kim
• Jinsik Lee
• Jonghun Lee
• Kyungdeok Kim
• Sukwhan Kim
• DonghHyeon Lee
• HyungJong Noh
• And others…


Thank you! Any questions?