70
Gary Geunbae Lee Dept of CSE/AIGS POSTECH Deep Learning-based Spoken Dialog Systems

Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Gary Geunbae LeeDept of CSE/AIGS

POSTECH

Deep Learning-based Spoken Dialog Systems

Page 2: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Virtual Personal Assistant: Trends

S/W

Ass

istan

t

Apple Google Now

MS CortanaSamsungS-voice

Amazon Dash & Echo

Social Robot: Jibo

Service Robot : PepperN

ewFo

rmFa

ctor

(IoT,

Wea

rabl

e,…

)Ro

bot

Facebook M

SamsungBixby

IKEA Smart table

Hotel Robot : Botlr

Apple Knowledge Navigator (‘87)

Google Home

~2012 2013 2014 2015 2016~

Page 3: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

AI for Personal Health Advisor

- From Medical Futurist report, “Artificial Intelligence Will Redesign Healthcare”

Page 4: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

AI for Shopping Assistant

- From Information Age report, “3 ways artificial intelligence is transforming e-commerce”

Page 5: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

AI for 24/7 After-sales Care

- From Wikipedia, “Customer support Automation”

Page 6: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

AI for Social Robot

- From Wikipedia, “Social robot”

Page 7: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

traditional pipeline architecture

7

Page 8: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Spoken Language Understanding

• Spoken language understanding is to map natural language speech to frame structure encoding of its meanings.

• What’s difference between NLU and SLU?► Robustness; noise and ungrammatical spoken language► Domain-dependent; further deep-level semantics (e.g. Person vs.

Cast)► Dialog; dialog history dependent and utt. by utt. analysis

• Traditional approaches; natural language to SQL conversion

ASRSpeech

SLU SQLGenerate Database

Text SemanticFrame SQL Response

A typical ATIS system (from [Wang et al., 2005])

8Intelligent Robot Lecture Note

Page 9: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Semantic Representation

• Two common components in semantic frame► Dialog acts (DA); the meaning of an utt. At the discourse level, and it

it approximately the equivalent of intent or subject slot in the practice.► Named entities (NE); the identifier of entity such as person, location,

organization, or time. In SLU, it represents domain-specific meaning of a word (or word group).

• Example (ATIS and EPG domain, simplified representation)

Show me flights from Denver to New York on Nov. 18thDIALOG_ACT = Show_FlightFROMLOC.CITY_NAME = DenverTOLOC.CITY_NAME = New YorkMONTH_NAME = Nov. DAY_NUMBER = 18th

I want to watch LOSTDIALOG_ACT = Search_ProgramPROGRAM = LOST

9Intelligent Robot Lecture Note

Page 10: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Semantic Frame Extraction

Dialog ActIdentification

Frame-SlotExtraction

RelationExtraction

Unification

Feature Extraction / Selection

Info.Source

+

+

+

+ +

Overall architecture for semantic analyzer

롯데월드에어떻게가나요?Domain: NavigationDialog Act: WH-questionMain Action: SearchObject.Location.Destination=롯데월드난롯데월드가너무좋아.Domain: ChatDialog Act: StatementMain Action: LikeObject.Location=롯데월드

Examples of semantic frame structure

• Semantic Frame Extraction (~ Information Extraction Approach)1) Dialog act / Main action Identification ~ Classification2) Frame-Slot Object Extraction ~ Named Entity Recognition3) Object-Attribute Attachment ~ Relation Extraction► 1) + 2) + 3) ~ Unification

10Intelligent Robot Lecture Note

Page 11: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Machine Learning for SLU

• Maximum Entropy (a.k.a logistic regression)► Conditional and discriminative manner► Unstructured! (no dependency in y)► Dialog act classification problem

• Conditional Random Fields [Lafferty et al. 2001]► Structured versions of MaxEnt (argmax search in inference)► Undirected graphical models► Popular in language and text processing► Linear-chain structure for practical implementation► Named entity recognition problem

yt-1 yt yt+1

xt-1 xt xt+1

z

x

fk

gk

hk

11Intelligent Robot Lecture Note

Page 12: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Semantic NER as Sequence Labeling

• Relational Learning for language processing► Left-to-right n-th order Markov model (linear-chain or sequence)► E.g. Part-of-speech tagging, Noun Phrase chunking, Information

Extraction, Speech Recognition, etc.► Very large size of feature space (e.g. state-of-the-art NP chunking >

1M)► Open problem is how to reduce the training cost (even 1st order

Markov)• Transformation to BIO representation [Ramshaw and Marcus, 1995]

► Begin of entity, Inside of entity, and Outside

Show me flight from Denver to New York on Nov. 18thO O O O F.CITY-B O T.CITY-B T.CITY-I O MONTH-B DAY-B

F.CITY T-CITY MONTH DAY

12Intelligent Robot Lecture Note

Page 13: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Long-distance Dependency Problem

… …fly from denver to chicago on dec. 10th 1999

DEPART.MONTH

… …return from denver to chicago on dec. 10th 1999

RETURN.MONTH

• Most practical NLP models employ the local feature set► Local context feature (sliding window)► E.g.) for “dec.”; current = dec., prev-2 = on, prev-1 = chicago, next+1

= 10th, next+2 = 1999, POS-tag = NN, chunk = PP► However, there is exactly same feature set for two states “dec.”

(even different labels)• Non-local feature or high-order dependency should be considered

13Intelligent Robot Lecture Note

Page 14: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Motivation: Joint System

• Joint Prediction of DA and NE [Jeong and Lee, 2006]► DA and NE are mutually dependent► An integration of DA and NE model encoding inter-dependency► GOAL is to improve both performances of DA and NE task.

Joint Inference

Classification(Dialog Act / Intent)

SequenceLabeling

(Named Entity /Frame Slot)

AutomaticSpeech

Recognition

DialogManagement

Joint Model(e.g. TriCRFs)

x x,y,z

14Intelligent Robot Lecture Note

Page 15: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Triangular-chain CRFs

• Modeling the inter-dependency (y ↔ z)

► Factorizing the potential

yt-1 yt yt+1

xt-1 xt xt+1

zxz

hk

gk

f1k

f2k

y,y dependency y,z dependency

In general, fk can be a function of triangulated cliques.However, we assume that NE state transition is independent from DA,i.e., DA operates as an observation feature to identify NE labels.

edge-transition NE-observation DA-observation

15Intelligent Robot Lecture Note

Page 16: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Active Learning

• Certainty-based method► Predict the candidate raw data► Estimate confidence (e.g. Prob) and use it

Raw data

SmallLabeled data Model

Predict &Estimate

Confidence

Selected samplesFilter

Labeledsamples

+

16Intelligent Robot Lecture Note

Page 17: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Semi-supervised Learning

• Augmenting the machine-labeled data

• Augmenting the classification model► Like as adaptation (model interpolation), classifier dependent method

Raw data

SmallLabeled data Model

Predict &Estimate

Confidence

Selected samplesFilter

Labeledsamples

+

17Intelligent Robot Lecture Note

Page 18: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

traditional pipeline architecture

18

Page 19: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dialog Management

Dialogue Management

• A system to provide interface between the user and a computer-based application

• Interact on turn-by-turn basis• Dialogue manager

► Control the flow of the dialogue► Main flow

◦ information gathering from user◦ communicating with external application◦ communicating information back to the user

► Three types of dialogue system◦ finite state- (or graph-) based◦ frame-based ◦ agent-based

19Intelligent Robot Lecture Note

Page 20: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dialog Management

Dialog System Architecture

• The DARPA Communicator program was designed to support the creation of speech-enabled interfaces that scale gracefully across modalities, from speech-only to interfaces that include graphics, maps, pointing and gesture.

20

AT&T

CMU

MIT

CU

BBN

Bell Lab

SRIDARPA

Intelligent Robot Lecture Note

Page 21: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dialog Management

21

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"><menu>

<prompt>Say one of: <enumerate/></prompt><choice next="http://www.example.com/sports.vxml">

Sports scores</choice><choice next="http://www.example.com/weather.vxml">

Weather information</choice><choice next="#login">

Log in</choice>

</menu></vxml>

Browser : Say one of: Sports scores; Weather information; Log in.

User : Sports scores

Intelligent Robot Lecture Note

Page 22: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dialog Management

22

Frame-based Approach

• Frame-based system► Asks the user questions to fill slots in a template in order to perform a

task (form-filling task)► Permits the user to respond more flexibly to the system’s prompts (as

in Example 2.)► Recognizes the main concepts in the user’s utterance

Example 1)• System: What is your destination?• User: London.• System: What day do you want to

travel?• User: Friday

Example 2) System: What is your destination? User: London on Friday around

10 in the morning. System: I have the following

connection …

Intelligent Robot Lecture Note

Page 23: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dialog Management

23

Agent-based Approach

• Properties► Complex communication using unrestricted natural language► Mixed-Initiative► Co-operative problem solving► Theorem proving, planning, distributed architectures► Conversational agents

• ExamplesUser : I’m looking for a job in the Calais area. Are there any servers?

System : No, there aren’t any employment servers for Calais. However, there is an employment server for Pasde-Calais and an employment server for Lille. Are you interested in one of these?

System attempts to provide a more co-operative response that might address the user’s needs.

Intelligent Robot Lecture Note

Page 24: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dialog Management

24

TRIPS Architecture

The TRIPS System ArchitectureIntelligent Robot Lecture Note

Page 25: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dialog Management

25

Information State Approach (plan-based)

• A method of specifying a dialogue theory that makes it straightforward to implement

• Consisting of following five constituents► Information Components

◦ Including aspects of common context◦ (e.g., participants, common ground, linguistic and intentional structure,

obligations and commitments, beliefs, intentions, user models, etc.)► Formal Representations

◦ How to model the information components◦ (e.g., as lists, sets, typed feature structures, records, etc.)

Intelligent Robot Lecture Note

Page 26: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dialog Management

26

Information State Approach

► Dialogue Moves◦ Trigger the update of the information state◦ Be correlated with externally performed actions

► Update Rules◦ Govern the updating of the information state

► Update Strategy◦ For deciding which rules to apply at a given point from the set of applicable

ones

Intelligent Robot Lecture Note

Page 27: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dialog Management

Dialog as a Markov Decision Process

27

User

SpeechUnderstanding

SpeechGeneration

StateEstimator

DialogPolicy Optimize

∑=k

kk rR γ

us

ua

ma

ua~

ma~ π

>=< duum ssas ~,~,~~MDP

usergoal

userdialog act

noisy estimate ofuser dialog act

dialoghistory

machinestate

machinedialog act

ReinforcementLearning

Reward

),( mm asrms~

ds

[S. Young, 2006]

Intelligent Robot Lecture Note

Page 28: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Spoken Language Understanding

Deep Learning-based Dialog System

28

Page 29: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

AI vs. ML vs. DL

AI: Artificial Intelligence ML: Machine Learning DL: Deep Learning

Artificial Intelligence

Machine Learning

Deep Learning

• Reasoning• Knowledge

Representation• Planning• Learning• Perception• Manipulation

• Supervised Learning with Teachers• Semi-supervised Learning with Teacher• Unsupervised Learning without Teacher• Reinforcement Learning with Rewards

InputText

Sound

Image

Video

Output Gen

erated Infor

mation

/ Autonomo

us Control

• RBM (Restricted Boltzmann Machine)• DBN (Deep Belief Network)• CNN (Convolutional Neural Network)• Deep Reinforcement Learning

Page 30: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Axon

Terminal Branchesof AxonDendrites

Σ

x1

x2w1

w2

wnxn

x3 w3

Artificial Neural Networks (ANN)

Page 31: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Layered Networks

Input nodesHidden nodes

Connections

Output nodes

∑ i j= f ( w x )

y = f ( w x +w x +w x ++w xm )

j

j

miiiOutput : 321

Page 32: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Deep learning Innovation

• Combining Feature Learning and Classification as UnifiedFramework (※ Learning what to learn, how to learn)

Feature learning aspect ofDNN based Image Classification

Page 33: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Vanilla recurrent neural networks (RNNs)

• RNNs have connections from the outputs of previous time steps to inputs of next time steps

• For sequential data, a RNN usually computes hidden state ℎ𝑡𝑡 from the previous hidden state ℎ𝑡𝑡−1 and the input 𝑥𝑥𝑡𝑡

• ℎ𝑡𝑡 = 𝜎𝜎 𝑊𝑊ℎℎ𝑡𝑡−1 + 𝑊𝑊𝑥𝑥𝑥𝑥𝑡𝑡 + 𝑏𝑏

33

Page 34: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Vanishing gradient problem

• ℎ𝑡𝑡 = 𝜎𝜎 𝑊𝑊ℎℎ𝑡𝑡−1 + 𝑊𝑊𝑥𝑥𝑥𝑥𝑡𝑡 + 𝑏𝑏

• Let’s assume 𝜎𝜎 is the identity function

34If all 𝜕𝜕ℎ

𝑡𝑡

𝜕𝜕ℎ𝑡𝑡−1< 1 𝜕𝜕𝐽𝐽𝑡𝑡

𝜕𝜕ℎ1≈ 0

Page 35: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Long short-term memory networks (LSTMs)

• LSTMs explicitly keep and update cell memory 𝑐𝑐(𝑡𝑡) by

• Removing the previous cell content 𝑐𝑐(𝑡𝑡−1) by multiplying it with 𝑓𝑓(𝑡𝑡)

• Adding the new cell content �̃�𝑐(𝑡𝑡) multiplied by 𝑖𝑖(𝑡𝑡)

• LSTMs produce output ℎ(𝑡𝑡) = 𝑜𝑜(𝑡𝑡)°tanh𝑐𝑐(𝑡𝑡)

35

Page 36: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Gated recurrent units (GRUs)

• GRUs keeps and update ℎ(𝑡𝑡)by two gates:

• Update gate 𝑢𝑢(𝑡𝑡)decides

• How much the old hidden representation ℎ(𝑡𝑡) is removed

• how much the new hidden representation �ℎ(𝑡𝑡) is added

• Reset gate 𝑟𝑟(𝑡𝑡) decides how much old representation ℎ(𝑡𝑡) is neededto compute new representation �ℎ(𝑡𝑡)

• GRUs also use less number of gates and have smaller parameters than LSTMs

36

Page 37: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Bidirectional Multi-Layer RNNs

37

Page 38: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Word Vector

• Represent words as vectors

38

Page 39: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Word Vector

• Distributional semantics : A word’s meaning is given by the words that frequently appear close-by

• “You shall know a word by the company it keeps”

• Word2vec objective function (skip-grams)

39

Page 40: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Contextual word embedding

• A word’s contextual embedding must consider its context

40

GloVe

the actor plays a show

𝜖𝜖(𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝)

Some contextual

method

the actor plays a show

𝜖𝜖 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝑡𝑡ℎ𝑡𝑡 𝑝𝑝𝑐𝑐𝑡𝑡𝑜𝑜𝑟𝑟 _ 𝑝𝑝 𝑝𝑝ℎ𝑜𝑜𝑠𝑠)

Page 41: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

ELMo: Embeddings from Language Model

• Multi-layer bidirectional LSTM language model

41

𝐡𝐡𝑘𝑘,0𝐿𝐿𝐿𝐿 = 𝑥𝑥𝑘𝑘𝐿𝐿𝐿𝐿 (token representation;

GloVe)

𝐡𝐡𝑘𝑘,𝑗𝑗𝐿𝐿𝐿𝐿 = �⃗�𝐡𝑘𝑘,𝑗𝑗

𝐿𝐿𝐿𝐿, �⃖�𝐡𝑘𝑘,𝑗𝑗𝐿𝐿𝐿𝐿 (LSTM state)

𝑥𝑥𝑡𝑡−1 𝑥𝑥𝑡𝑡 𝑥𝑥𝑡𝑡+1

𝑝𝑝𝑡𝑡−1 𝑝𝑝𝑡𝑡 𝑝𝑝𝑡𝑡+1

ℎ𝑡𝑡−1,1 ℎ𝑡𝑡,1 ℎ𝑡𝑡+1,1

ℎ𝑡𝑡−1,2 ℎ𝑡𝑡,2 ℎ𝑡𝑡+1,2

ℎ𝑡𝑡−1,1 ℎ𝑡𝑡,1 ℎ𝑡𝑡+1,1

ℎ𝑡𝑡−1,2 ℎ𝑡𝑡,2 ℎ𝑡𝑡+1,2

Layer 1

Layer 2

Layer 0

𝛾𝛾𝑡𝑡𝑡𝑡𝑡𝑡𝑘𝑘: scale (hyper-parameter)

𝑝𝑝𝑗𝑗𝑡𝑡𝑡𝑡𝑡𝑡𝑘𝑘: weight (learned)

Page 42: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

ELMo for MRC

• ELMo as a word embedding

42

Page 43: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Transformer

• Parallel self-attention

• Looks at self, and determines where to focus

43Vaswani, Ashish, et al. "Attention is all you need." NIPS 2017

Page 44: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

BERT: Bidirectional Encoder Representations from Transformers

• Training 1. Masked words prediction

• 15% of words are [MASK]ed

44

Page 45: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

• Training 2. Next sentence prediction

• To understand texts more than a sentence

45

BERT: Bidirectional Encoder Representations from Transformers

Page 46: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

BERT: Bidirectional Encoder Representations from Transformers

• BERT as universal pre-trained model for NLP

• BERT requires minimal additional layers and fine-tuning

46Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019

Page 47: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

GPT(Generative Pre-Training)

47

In pre-training, optimize 𝐿𝐿1 𝑢𝑢𝑢𝑢: Unlabeled datasetΘ: Model parameters

In fine-tuning, optimize 𝐿𝐿3 𝑐𝑐𝑐𝑐: Labeled dataset𝜆𝜆: Hyper-parameter weight

Page 48: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Parallel computing for Deep Learning

• History of parallel/ distributed systems for Deep Learning computing

Google taps 16k computers to look for cats –for Science!

Univ. of Toronto uses2 GPUs for 1.2M training Images for 1000 classes Image classification(※ ImageNet Large Scale Visual Recognition Challenge)

Stanford uses12 GPUs for Large-scale Video Classification With Convolutional Neural Networks(※ 10M Youtube video)

Google uses16K CPU cores for Training 22-layers Deep neural network (※GoogLeNet, 2014)

Baidu’s Artificial- Intelligence Supercomputer Beats Google at Image Recognition

Page 49: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Model 회사 언어파라미터

(vocab, layer, hidden, heads)성능 학습 데이터

KoBERT SKTbrain pytorch (8002, 12, 3072, 12)SA – 90.1

(google : 87.5)한국어 위키/뉴스

KoGPT SKTbrain pytorch (50000, 12, -, 12)SA – 89.9

(google : 87.5)한국어 위키/뉴스+ 기타(알수없음)

한국어BERT언어모델

ETRIpytorch / tensorflow

(30349, 12, 768, 12)SRL - 85.77

(google : 81.85)23GB 원시 말뭉치

HanBERT TwoBlock AI pytorch (54000, 12, 768, 12)KorQuad

17th문서 (일반, 특허)

SA : Sentiment AnalysisSRL : Semantic Role Labeling

Korean

Page 50: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Multi-domain Task-oriented Dialogue

• The system interacts with user to help the user achieve his/her goal

• e.g. Restaurant reservation, hotel reservation…

• Specific domains & specific goals

• Open domain (chit-chat)

• Multi-turns

• User goals are not limited to just one domain

2021-03-17 50

To book a hotel To book a restaurantnear the hotel

To book a taxi from the hotelto the restaurant

Page 51: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Dataset: MultiWOZ

• Standard benchmark dataset for end-to-end task-oriented dialogue systems

• Multi-domain dataset for task-oriented dialog system

• Virtual conversations between a tourist and a guide in Cambridge

• 7 domains: Hotel, restaurant, attraction, train, taxi, police, hospital

• Number of dialog: 10,000 / Avg. number of turns per dialog: 13

• Evaluation in MultiWOZ

• Inform rate

• How many entries provided by the system meet goals?

• Success rate

• How many user requests are provided by the system?

• BLEU

• Similarity between generated response and true response in the dataset

2021-03-17 51

Page 52: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

A short example in MultiWOZ

52

User: I am planning a trip to town and want to sight see a bit.System: There are 79 attractions in the city. Do you have a specific type of attraction?

User: I’d like a architecture.System: How about All Saints Church? It has fantastic architecture and free entrance.

User: Could I get the address? I would also like an expensive place to eat around it.System: The address is Jesus Lane. What type of food would you like to eat?

User: I'm in the mood for Turkish food.System: Anatolia is an expensive Turkish restaurant in the city centre.

User: Can I have the phone number?System: The phone number is 01223362372.

User: Thank you!System: You're welcome, feel free to contact us about anything else you need.

Page 53: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

DST(Dialogue State Tracking)

• The most important subtask of task-oriented dialogue system

• Dialogue state generally consists of slot-value pairs

• Slot means general class like food, area / value means specific value of slot

• In multi-domain, dialogue state consists of domain-slot-value triplets

• ex) (restaurant-pricerange-cheap), (attraction-type-museum)…

• Infer dialogue state each turn of dialogues between user and system

• There are ground truth dialogue states every turn

• Inferred state is used to generate next system action and response

• TRADE(TRAnsferable Dialogue statE generator)

• 2019 ACL @ Hong Kong Univ. & Salesforce [https://www.aclweb.org/anthology/P19-1078]

• DS-DST(Dual Strategy for DST)

• State-Of-The-Art in DST

• 2019 Arxiv @ Illinois Univ. & Salesforce [https://arxiv.org/abs/1910.03544]

2021-03-17 53

Page 54: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Existing model: SOLOIST

54

• An auto-regressive model for training (Language modeling)• Task 1: Predict dialogue state (slot-value)• Task 2: Predict system response• Task 3 (for Auxiliary loss)

• Replace dialogue state or system response in input sequence into negative samples randomly

• Then, predict input sequence is negative sample or not (binary classification)• Jointly train by sum of 3 losses

Page 55: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

DAMD(Domain Aware Multi-Decoder)

55

GRU encoder

Attention

User context

Action

GRU encoder

Attention

Action context

Belief

GRU encoder

Attention

Belief context

Delexicalization System Response

GRU encoder

Attention

Response context

GRU decoder

Concat

Softmax

New belief

User utterance

GRU encoder

Attention

User context

DB

GRU decoder

Concat

Softmax

New action

Pointer

GRU decoder

Concat

Softmax

New response

Page 56: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Domain state tracking

• For tracking the flow of conversation from domain perspective

• Contains binary values indicating whether domains are activated or not in the current conversation

• Changes over turns

2021-03-17 56

User: I would like to get some information about a restaurant and a hotel.System: Okay, lets start with a hotel, any preference of type,area, or price range?

User: The hotel that I am looking for is called Gonville.System: Do you want book the hotel?

User: I want a Italian restaurant near the hotel.System: How about Prezzo? It places centre city.

ℎ𝑜𝑜𝑡𝑡𝑡𝑡𝑝𝑝:𝑂𝑂𝑂𝑂𝑟𝑟𝑡𝑡𝑝𝑝𝑡𝑡𝑝𝑝𝑢𝑢𝑟𝑟𝑝𝑝𝑟𝑟𝑡𝑡:𝑂𝑂𝑂𝑂

ℎ𝑜𝑜𝑡𝑡𝑡𝑡𝑝𝑝:𝑂𝑂𝑂𝑂𝑟𝑟𝑡𝑡𝑝𝑝𝑡𝑡𝑝𝑝𝑢𝑢𝑟𝑟𝑝𝑝𝑟𝑟𝑡𝑡:𝑂𝑂𝑂𝑂𝑂𝑂

ℎ𝑜𝑜𝑡𝑡𝑡𝑡𝑝𝑝:𝑂𝑂𝑂𝑂𝑂𝑂𝑟𝑟𝑡𝑡𝑝𝑝𝑡𝑡𝑝𝑝𝑢𝑢𝑟𝑟𝑝𝑝𝑟𝑟𝑡𝑡:𝑂𝑂𝑂𝑂

Page 57: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Architecture w/ domain state tracking

• NLU module is shared for DST, POL, and NLG

• Darker blocks mean previous turn

• DB result contains the number of matched entries for each domain

2021-03-17 57

Cross entropy loss

Binary cross entropy loss

Bert

FC (FullyConnected)

GRU GRU GRU

Page 58: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Training / inference

• Inputs

• 𝑢𝑢𝑝𝑝𝑡𝑡𝑟𝑟𝑡𝑡 , 𝑡𝑡𝑢𝑢𝑟𝑟𝑟𝑟 𝑑𝑑𝑜𝑜𝑑𝑑𝑝𝑝𝑖𝑖𝑟𝑟𝑡𝑡−1, 𝑑𝑑𝑖𝑖𝑝𝑝𝑝𝑝𝑜𝑜𝑑𝑑 𝑝𝑝𝑡𝑡𝑝𝑝𝑡𝑡𝑡𝑡𝑡𝑡−1

• Outputs(Training)

• 𝑡𝑡𝑢𝑢𝑟𝑟𝑟𝑟 𝑑𝑑𝑜𝑜𝑑𝑑𝑝𝑝𝑖𝑖𝑟𝑟𝑡𝑡• 𝑑𝑑𝑖𝑖𝑝𝑝𝑝𝑝𝑜𝑜𝑑𝑑 𝑝𝑝𝑡𝑡𝑝𝑝𝑡𝑡𝑡𝑡𝑡𝑡• 𝑑𝑑𝑝𝑝𝑡𝑡𝑡𝑡𝑡𝑡• 𝑝𝑝𝑝𝑝𝑝𝑝𝑡𝑡𝑡𝑡𝑑𝑑 𝑝𝑝𝑐𝑐𝑡𝑡𝑖𝑖𝑜𝑜𝑟𝑟𝑡𝑡• 𝑝𝑝𝑝𝑝𝑝𝑝𝑡𝑡𝑡𝑡𝑑𝑑 𝑟𝑟𝑡𝑡𝑝𝑝𝑝𝑝𝑜𝑜𝑟𝑟𝑝𝑝𝑡𝑡𝑡𝑡

• Outputs(inference)

• 𝑡𝑡𝑢𝑢𝑟𝑟𝑟𝑟 𝑑𝑑𝑜𝑜𝑑𝑑𝑝𝑝𝑖𝑖𝑟𝑟𝑡𝑡• 𝑑𝑑𝑖𝑖𝑝𝑝𝑝𝑝𝑜𝑜𝑑𝑑 𝑝𝑝𝑡𝑡𝑝𝑝𝑡𝑡𝑡𝑡𝑡𝑡• 𝑝𝑝𝑝𝑝𝑝𝑝𝑡𝑡𝑡𝑡𝑑𝑑 𝑟𝑟𝑡𝑡𝑝𝑝𝑝𝑝𝑜𝑜𝑟𝑟𝑝𝑝𝑡𝑡𝑡𝑡

2021-03-17 58

domain loss

value loss

gate loss

action loss

response loss

Sum

Cross entropy

Adam

Page 59: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

End-to-end ASR

AudioInputsAudio

InputsAudioInputs

AudioInputsAudio

InputsTextOutputs

Beam SearchSpeech

Representation

CTC

Frontend (Preprocessing)

STFT, MEL

TransformerEncoder

TransformerDecoder

(Attention)

*Connectionist Temporal Classification (CTC)

Short-Time Fourier Transform (STFT)

Page 60: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Conformer Encoder

Conformer (Encoder)

Decoder - LSTM

https://arxiv.org/pdf/2005.08100.pdf*convolution-augmented transformer for speech recognition, Conformer

*Specaugment: A simple data augmentation method for automatic speech recognition (to log-mel spectrogram): time warping, frequency & time masking

Page 61: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Conformer + Wav2Vec 2.0 (Encoder)Decoder - LSTMhttps://arxiv.org/pdf/2010.10504v1.pdf

*Pre-training: The contrastive loss between the contextvectors from masked features and the quantization unit is optimized.

Page 62: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

CTC-attention decoder

Joint CTC-Attention Decoding

https://arxiv.org/pdf/1609.06773.pdf

*use intermediate label representationallowing repetitions of labels and blank labelsDo baum-welch optimization for Viterbi alignment S: decoder state

A: attention

*the shared encoder is trained by both CTC andattention model objectives simultaneously

Page 63: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

LAS (Listen, Attend and Spell)

LAS (Listen, Attend and Spell)

https://arxiv.org/pdf/1508.01211.pdf

*accepts filter bank spectra as inputs

*Attention-based recurrent network decoder(character decoding)*LAS does not make independenceassumptions in the label sequence unlike CTC

Page 64: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Tacotron2: Seq2seq with attention RNN + modified WaveNet

*location-sensitive attention: mitigating potential failure modes where some subsequences are repeated or ignored by the decoder*Auto-regressive decoder to generate mel spectrogram frame by frame; Previous frame input to pre-net & post-net (for residual)*stop token to stop generation

*wavenet: invert the mel spectrogram feature representation into time-domainwaveform samples (10 component mixture of logistic distributions (MoL) to generate 16-bitsamples at 24 kHz)

Page 65: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

MelGan

Change WaveNet vocoder to:- MelGan for faster inference

for different frequency range of the audio

Loss: GAN objective + feature matching objective(L1 distance between the discriminator feature maps of real and synthetic audio)

Page 66: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

66

Dialog System for Counseling (TOD Open-ended)

Paraphrasing Reflect Feeling

Correct Information Logical Answering

Listening Memorizing

Questioning Show Interest

Empathy Motivating

Sensibleness Attending

COUNSELING

Page 67: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

CBT Counseling Scenario

67

Initiation

Greeting

Explanation of Counseling

Method

Extraction information of

User

Situation

5W1H details

Automatic thoughts

Ground Belief

EmotionSupporting Evidence

No-Supporting Evidence

Refutation

Alternative thoughts

Emotion

Motivating

(Cognitive Behaviour Therapy)

*Automatic vs alternative

*For next visit

Page 68: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

68

VHRED:RNN Dialog Model

Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., & Bengio, Y. (2016). A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069..

*Variational hierarchical recurrent

Page 69: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

User : How are things?

Meena : They’re good. How about you?

User : I’m trying to decide what to watch

Meena : What are some of your favorite shows?

User : I like the Good Place

Meena : I haven’t watched that. My favorite shows are Supernatural, and Star trek : TNG.

User : What is TNG?

Meena : The Next Generation

Evolved Transformer Encoder Block

Evolved Transformer Decoder Block

x13

Meena : Transformer Dialog Model

Adiwardana, D., Luong, M. T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., ... & Le, Q. V. (2020). Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

69

x1

*Automatically learned architecture w/ NAS (neural architecture search)

Page 70: Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Summary

- NLP enjoys rapid progress in the last 10 years due to deep learning.

- NLP is reaching the point of having big social impact, making issues like bias and security increasingly important.

- Big model, big computation resources, huge training times are problematic, need to focus more light way of doing NLP (even embedded model)

70