Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung

Gary Geunbae LeeDept of CSE/AIGS

POSTECH

Deep Learning-based Spoken Dialog Systems

Virtual Personal Assistant: Trends

S/W

Ass

istan

t

Apple Google Now

MS CortanaSamsungS-voice

Amazon Dash & Echo

Social Robot: Jibo

Service Robot : PepperN

ewFo

rmFa

ctor

(IoT,

Wea

rabl

e,…

)Ro

bot

Facebook M

SamsungBixby

IKEA Smart table

Hotel Robot : Botlr

Apple Knowledge Navigator (‘87)

Google Home

~2012 2013 2014 2015 2016~

AI for Personal Health Advisor

- From Medical Futurist report, “Artificial Intelligence Will Redesign Healthcare”

AI for Shopping Assistant

- From Information Age report, “3 ways artificial intelligence is transforming e-commerce”

AI for 24/7 After-sales Care

- From Wikipedia, “Customer support Automation”

AI for Social Robot

- From Wikipedia, “Social robot”

traditional pipeline architecture

7

Spoken Language Understanding


• Spoken language understanding is to map natural language speech to frame structure encoding of its meanings.

• What’s difference between NLU and SLU?► Robustness; noise and ungrammatical spoken language► Domain-dependent; further deep-level semantics (e.g. Person vs.

Cast)► Dialog; dialog history dependent and utt. by utt. analysis

• Traditional approaches; natural language to SQL conversion

ASRSpeech

SLU SQLGenerate Database

Text SemanticFrame SQL Response

A typical ATIS system (from [Wang et al., 2005])

8Intelligent Robot Lecture Note


Semantic Representation

• Two common components in semantic frame► Dialog acts (DA); the meaning of an utt. At the discourse level, and it

it approximately the equivalent of intent or subject slot in the practice.► Named entities (NE); the identifier of entity such as person, location,

organization, or time. In SLU, it represents domain-specific meaning of a word (or word group).

• Example (ATIS and EPG domain, simplified representation)

Show me flights from Denver to New York on Nov. 18thDIALOG_ACT = Show_FlightFROMLOC.CITY_NAME = DenverTOLOC.CITY_NAME = New YorkMONTH_NAME = Nov. DAY_NUMBER = 18th

I want to watch LOSTDIALOG_ACT = Search_ProgramPROGRAM = LOST



Semantic Frame Extraction

Dialog ActIdentification

Frame-SlotExtraction

RelationExtraction

Unification

Feature Extraction / Selection

Info.Source

+

+

+

+ +

Overall architecture for semantic analyzer

롯데월드에어떻게가나요?Domain: NavigationDialog Act: WH-questionMain Action: SearchObject.Location.Destination=롯데월드난롯데월드가너무좋아.Domain: ChatDialog Act: StatementMain Action: LikeObject.Location=롯데월드

Examples of semantic frame structure

• Semantic Frame Extraction (~ Information Extraction Approach)1) Dialog act / Main action Identification ~ Classification2) Frame-Slot Object Extraction ~ Named Entity Recognition3) Object-Attribute Attachment ~ Relation Extraction► 1) + 2) + 3) ~ Unification



Machine Learning for SLU

• Maximum Entropy (a.k.a logistic regression)► Conditional and discriminative manner► Unstructured! (no dependency in y)► Dialog act classification problem

• Conditional Random Fields [Lafferty et al. 2001]► Structured versions of MaxEnt (argmax search in inference)► Undirected graphical models► Popular in language and text processing► Linear-chain structure for practical implementation► Named entity recognition problem

yt-1 yt yt+1

xt-1 xt xt+1

z

x

fk

gk

hk



Semantic NER as Sequence Labeling

• Relational Learning for language processing► Left-to-right n-th order Markov model (linear-chain or sequence)► E.g. Part-of-speech tagging, Noun Phrase chunking, Information

Extraction, Speech Recognition, etc.► Very large size of feature space (e.g. state-of-the-art NP chunking >

1M)► Open problem is how to reduce the training cost (even 1st order

Markov)• Transformation to BIO representation [Ramshaw and Marcus, 1995]

► Begin of entity, Inside of entity, and Outside

Show me flight from Denver to New York on Nov. 18thO O O O F.CITY-B O T.CITY-B T.CITY-I O MONTH-B DAY-B

F.CITY T-CITY MONTH DAY



Long-distance Dependency Problem

… …fly from denver to chicago on dec. 10th 1999

DEPART.MONTH

… …return from denver to chicago on dec. 10th 1999

RETURN.MONTH

• Most practical NLP models employ the local feature set► Local context feature (sliding window)► E.g.) for “dec.”; current = dec., prev-2 = on, prev-1 = chicago, next+1

= 10th, next+2 = 1999, POS-tag = NN, chunk = PP► However, there is exactly same feature set for two states “dec.”

(even different labels)• Non-local feature or high-order dependency should be considered



Motivation: Joint System

• Joint Prediction of DA and NE [Jeong and Lee, 2006]► DA and NE are mutually dependent► An integration of DA and NE model encoding inter-dependency► GOAL is to improve both performances of DA and NE task.

Joint Inference

Classification(Dialog Act / Intent)

SequenceLabeling

(Named Entity /Frame Slot)

AutomaticSpeech

Recognition

DialogManagement

Joint Model(e.g. TriCRFs)

x x,y,z



Triangular-chain CRFs

• Modeling the inter-dependency (y ↔ z)

► Factorizing the potential

yt-1 yt yt+1

xt-1 xt xt+1

zxz

hk

gk

f1k

f2k

y,y dependency y,z dependency

In general, fk can be a function of triangulated cliques.However, we assume that NE state transition is independent from DA,i.e., DA operates as an observation feature to identify NE labels.

edge-transition NE-observation DA-observation



Active Learning

• Certainty-based method► Predict the candidate raw data► Estimate confidence (e.g. Prob) and use it

Raw data

SmallLabeled data Model

Predict &Estimate

Confidence

Selected samplesFilter

Labeledsamples

+



Semi-supervised Learning

• Augmenting the machine-labeled data

• Augmenting the classification model► Like as adaptation (model interpolation), classifier dependent method

Raw data

SmallLabeled data Model

Predict &Estimate

Confidence

Selected samplesFilter

Labeledsamples

+


traditional pipeline architecture

18

Dialog Management

Dialogue Management

• A system to provide interface between the user and a computer-based application

• Interact on turn-by-turn basis• Dialogue manager

► Control the flow of the dialogue► Main flow

◦ information gathering from user◦ communicating with external application◦ communicating information back to the user

► Three types of dialogue system◦ finite state- (or graph-) based◦ frame-based ◦ agent-based


Dialog Management

Dialog System Architecture

• The DARPA Communicator program was designed to support the creation of speech-enabled interfaces that scale gracefully across modalities, from speech-only to interfaces that include graphics, maps, pointing and gesture.

20

AT&T

CMU

MIT

CU

BBN

Bell Lab

SRIDARPA

Intelligent Robot Lecture Note

Dialog Management

21

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"><menu>

<prompt>Say one of: <enumerate/></prompt><choice next="http://www.example.com/sports.vxml">

Sports scores</choice><choice next="http://www.example.com/weather.vxml">

Weather information</choice><choice next="#login">

Log in</choice>

</menu></vxml>

Browser : Say one of: Sports scores; Weather information; Log in.

User : Sports scores


Dialog Management

22

Frame-based Approach

• Frame-based system► Asks the user questions to fill slots in a template in order to perform a

task (form-filling task)► Permits the user to respond more flexibly to the system’s prompts (as

in Example 2.)► Recognizes the main concepts in the user’s utterance

Example 1)• System: What is your destination?• User: London.• System: What day do you want to

travel?• User: Friday

Example 2) System: What is your destination? User: London on Friday around

10 in the morning. System: I have the following

connection …


Dialog Management

23

Agent-based Approach

• Properties► Complex communication using unrestricted natural language► Mixed-Initiative► Co-operative problem solving► Theorem proving, planning, distributed architectures► Conversational agents

• ExamplesUser : I’m looking for a job in the Calais area. Are there any servers?

System : No, there aren’t any employment servers for Calais. However, there is an employment server for Pasde-Calais and an employment server for Lille. Are you interested in one of these?

System attempts to provide a more co-operative response that might address the user’s needs.


Dialog Management

24

TRIPS Architecture

The TRIPS System ArchitectureIntelligent Robot Lecture Note

Dialog Management

25

Information State Approach (plan-based)

• A method of specifying a dialogue theory that makes it straightforward to implement

• Consisting of following five constituents► Information Components

◦ Including aspects of common context◦ (e.g., participants, common ground, linguistic and intentional structure,

obligations and commitments, beliefs, intentions, user models, etc.)► Formal Representations

◦ How to model the information components◦ (e.g., as lists, sets, typed feature structures, records, etc.)


Dialog Management

26

Information State Approach

► Dialogue Moves◦ Trigger the update of the information state◦ Be correlated with externally performed actions

► Update Rules◦ Govern the updating of the information state

► Update Strategy◦ For deciding which rules to apply at a given point from the set of applicable

ones


Dialog Management

Dialog as a Markov Decision Process

27

User

SpeechUnderstanding

SpeechGeneration

StateEstimator

DialogPolicy Optimize

∑=k

kk rR γ

us

ua

ma

ua~

ma~ π

>=< duum ssas ~,~,~~MDP

usergoal

userdialog act

noisy estimate ofuser dialog act

dialoghistory

machinestate

machinedialog act

ReinforcementLearning

Reward

),( mm asrms~

ds

[S. Young, 2006]



Deep Learning-based Dialog System

28

AI vs. ML vs. DL

AI: Artificial Intelligence ML: Machine Learning DL: Deep Learning

Artificial Intelligence

Machine Learning

Deep Learning

• Reasoning• Knowledge

Representation• Planning• Learning• Perception• Manipulation

• Supervised Learning with Teachers• Semi-supervised Learning with Teacher• Unsupervised Learning without Teacher• Reinforcement Learning with Rewards

InputText

Sound

Image

Video

Output Gen

erated Infor

mation

/ Autonomo

us Control

• RBM (Restricted Boltzmann Machine)• DBN (Deep Belief Network)• CNN (Convolutional Neural Network)• Deep Reinforcement Learning

Axon

Terminal Branchesof AxonDendrites

Σ

x1

x2w1

w2

wnxn

x3 w3

Artificial Neural Networks (ANN)

Layered Networks

Input nodesHidden nodes

Connections

Output nodes

∑ i j= f ( w x )

y = f ( w x +w x +w x ++w xm )

j

j

miiiOutput : 321

Deep learning Innovation

• Combining Feature Learning and Classification as UnifiedFramework (※ Learning what to learn, how to learn)

Feature learning aspect ofDNN based Image Classification

Vanilla recurrent neural networks (RNNs)

• RNNs have connections from the outputs of previous time steps to inputs of next time steps

• For sequential data, a RNN usually computes hidden state ℎ𝑡𝑡 from the previous hidden state ℎ𝑡𝑡−1 and the input 𝑥𝑥𝑡𝑡

• ℎ𝑡𝑡 = 𝜎𝜎 𝑊𝑊ℎℎ𝑡𝑡−1 + 𝑊𝑊𝑥𝑥𝑥𝑥𝑡𝑡 + 𝑏𝑏

33

Vanishing gradient problem

• ℎ𝑡𝑡 = 𝜎𝜎 𝑊𝑊ℎℎ𝑡𝑡−1 + 𝑊𝑊𝑥𝑥𝑥𝑥𝑡𝑡 + 𝑏𝑏

• Let’s assume 𝜎𝜎 is the identity function

34If all 𝜕𝜕ℎ

𝑡𝑡

𝜕𝜕ℎ𝑡𝑡−1< 1 𝜕𝜕𝐽𝐽𝑡𝑡

𝜕𝜕ℎ1≈ 0

Long short-term memory networks (LSTMs)

• LSTMs explicitly keep and update cell memory 𝑐𝑐(𝑡𝑡) by

• Removing the previous cell content 𝑐𝑐(𝑡𝑡−1) by multiplying it with 𝑓𝑓(𝑡𝑡)

• Adding the new cell content �̃�𝑐(𝑡𝑡) multiplied by 𝑖𝑖(𝑡𝑡)

• LSTMs produce output ℎ(𝑡𝑡) = 𝑜𝑜(𝑡𝑡)°tanh𝑐𝑐(𝑡𝑡)

35

Gated recurrent units (GRUs)

• GRUs keeps and update ℎ(𝑡𝑡)by two gates:

• Update gate 𝑢𝑢(𝑡𝑡)decides

• How much the old hidden representation ℎ(𝑡𝑡) is removed

• how much the new hidden representation �ℎ(𝑡𝑡) is added

• Reset gate 𝑟𝑟(𝑡𝑡) decides how much old representation ℎ(𝑡𝑡) is neededto compute new representation �ℎ(𝑡𝑡)

• GRUs also use less number of gates and have smaller parameters than LSTMs

36

Bidirectional Multi-Layer RNNs

37

Word Vector

• Represent words as vectors

38

Word Vector

• Distributional semantics : A word’s meaning is given by the words that frequently appear close-by

• “You shall know a word by the company it keeps”

• Word2vec objective function (skip-grams)

39

Contextual word embedding

• A word’s contextual embedding must consider its context

40

GloVe

the actor plays a show

𝜖𝜖(𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝)

Some contextual

method

the actor plays a show

𝜖𝜖 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝑡𝑡ℎ𝑡𝑡 𝑝𝑝𝑐𝑐𝑡𝑡𝑜𝑜𝑟𝑟 _ 𝑝𝑝 𝑝𝑝ℎ𝑜𝑜𝑠𝑠)

ELMo: Embeddings from Language Model

• Multi-layer bidirectional LSTM language model

41

𝐡𝐡𝑘𝑘,0𝐿𝐿𝐿𝐿 = 𝑥𝑥𝑘𝑘𝐿𝐿𝐿𝐿 (token representation;

GloVe)

𝐡𝐡𝑘𝑘,𝑗𝑗𝐿𝐿𝐿𝐿 = �⃗�𝐡𝑘𝑘,𝑗𝑗

𝐿𝐿𝐿𝐿, �⃖�𝐡𝑘𝑘,𝑗𝑗𝐿𝐿𝐿𝐿 (LSTM state)

𝑥𝑥𝑡𝑡−1 𝑥𝑥𝑡𝑡 𝑥𝑥𝑡𝑡+1

𝑝𝑝𝑡𝑡−1 𝑝𝑝𝑡𝑡 𝑝𝑝𝑡𝑡+1

ℎ𝑡𝑡−1,1 ℎ𝑡𝑡,1 ℎ𝑡𝑡+1,1




Layer 1

Layer 2

Layer 0

𝛾𝛾𝑡𝑡𝑡𝑡𝑡𝑡𝑘𝑘: scale (hyper-parameter)

𝑝𝑝𝑗𝑗𝑡𝑡𝑡𝑡𝑡𝑡𝑘𝑘: weight (learned)

ELMo for MRC

• ELMo as a word embedding

42

Transformer

• Parallel self-attention

• Looks at self, and determines where to focus

43Vaswani, Ashish, et al. "Attention is all you need." NIPS 2017

BERT: Bidirectional Encoder Representations from Transformers

• Training 1. Masked words prediction

• 15% of words are [MASK]ed

44

• Training 2. Next sentence prediction

• To understand texts more than a sentence

45



• BERT as universal pre-trained model for NLP

• BERT requires minimal additional layers and fine-tuning

46Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019

GPT(Generative Pre-Training)

47

In pre-training, optimize 𝐿𝐿1 𝑢𝑢𝑢𝑢: Unlabeled datasetΘ: Model parameters

In fine-tuning, optimize 𝐿𝐿3 𝑐𝑐𝑐𝑐: Labeled dataset𝜆𝜆: Hyper-parameter weight

Parallel computing for Deep Learning

• History of parallel/ distributed systems for Deep Learning computing

Google taps 16k computers to look for cats –for Science!

Univ. of Toronto uses2 GPUs for 1.2M training Images for 1000 classes Image classification(※ ImageNet Large Scale Visual Recognition Challenge)

Stanford uses12 GPUs for Large-scale Video Classification With Convolutional Neural Networks(※ 10M Youtube video)

Google uses16K CPU cores for Training 22-layers Deep neural network (※GoogLeNet, 2014)

Baidu’s Artificial- Intelligence Supercomputer Beats Google at Image Recognition

Model 회사 언어파라미터

(vocab, layer, hidden, heads)성능 학습 데이터

KoBERT SKTbrain pytorch (8002, 12, 3072, 12)SA – 90.1

(google : 87.5)한국어 위키/뉴스

KoGPT SKTbrain pytorch (50000, 12, -, 12)SA – 89.9

(google : 87.5)한국어 위키/뉴스+ 기타(알수없음)

한국어BERT언어모델

ETRIpytorch / tensorflow

(30349, 12, 768, 12)SRL - 85.77

(google : 81.85)23GB 원시 말뭉치

HanBERT TwoBlock AI pytorch (54000, 12, 768, 12)KorQuad

17th문서 (일반, 특허)

SA : Sentiment AnalysisSRL : Semantic Role Labeling

Korean

Multi-domain Task-oriented Dialogue

• The system interacts with user to help the user achieve his/her goal

• e.g. Restaurant reservation, hotel reservation…

• Specific domains & specific goals

• Open domain (chit-chat)

• Multi-turns

• User goals are not limited to just one domain

2021-03-17 50

To book a hotel To book a restaurantnear the hotel

To book a taxi from the hotelto the restaurant

Dataset: MultiWOZ

• Standard benchmark dataset for end-to-end task-oriented dialogue systems

• Multi-domain dataset for task-oriented dialog system

• Virtual conversations between a tourist and a guide in Cambridge

• 7 domains: Hotel, restaurant, attraction, train, taxi, police, hospital

• Number of dialog: 10,000 / Avg. number of turns per dialog: 13

• Evaluation in MultiWOZ

• Inform rate

• How many entries provided by the system meet goals?

• Success rate

• How many user requests are provided by the system?

• BLEU

• Similarity between generated response and true response in the dataset

2021-03-17 51

A short example in MultiWOZ

52

User: I am planning a trip to town and want to sight see a bit.System: There are 79 attractions in the city. Do you have a specific type of attraction?

User: I’d like a architecture.System: How about All Saints Church? It has fantastic architecture and free entrance.

User: Could I get the address? I would also like an expensive place to eat around it.System: The address is Jesus Lane. What type of food would you like to eat?

User: I'm in the mood for Turkish food.System: Anatolia is an expensive Turkish restaurant in the city centre.

User: Can I have the phone number?System: The phone number is 01223362372.

User: Thank you!System: You're welcome, feel free to contact us about anything else you need.

DST(Dialogue State Tracking)

• The most important subtask of task-oriented dialogue system

• Dialogue state generally consists of slot-value pairs

• Slot means general class like food, area / value means specific value of slot

• In multi-domain, dialogue state consists of domain-slot-value triplets

• ex) (restaurant-pricerange-cheap), (attraction-type-museum)…

• Infer dialogue state each turn of dialogues between user and system

• There are ground truth dialogue states every turn

• Inferred state is used to generate next system action and response

• TRADE(TRAnsferable Dialogue statE generator)

• 2019 ACL @ Hong Kong Univ. & Salesforce [https://www.aclweb.org/anthology/P19-1078]

• DS-DST(Dual Strategy for DST)

• State-Of-The-Art in DST

• 2019 Arxiv @ Illinois Univ. & Salesforce [https://arxiv.org/abs/1910.03544]

2021-03-17 53

https://www.aclweb.org/anthology/P19-1078/

https://arxiv.org/abs/1910.03544

Existing model: SOLOIST

54

• An auto-regressive model for training (Language modeling)• Task 1: Predict dialogue state (slot-value)• Task 2: Predict system response• Task 3 (for Auxiliary loss)

• Replace dialogue state or system response in input sequence into negative samples randomly

• Then, predict input sequence is negative sample or not (binary classification)• Jointly train by sum of 3 losses

DAMD(Domain Aware Multi-Decoder)

55

GRU encoder

Attention

User context

Action

GRU encoder

Attention

Action context

Belief

GRU encoder

Attention

Belief context

Delexicalization System Response

GRU encoder

Attention

Response context

GRU decoder

Concat

Softmax

New belief

User utterance

GRU encoder

Attention

User context

DB

GRU decoder

Concat

Softmax

New action

Pointer

GRU decoder

Concat

Softmax

New response

Domain state tracking

• For tracking the flow of conversation from domain perspective

• Contains binary values indicating whether domains are activated or not in the current conversation

• Changes over turns

2021-03-17 56

User: I would like to get some information about a restaurant and a hotel.System: Okay, lets start with a hotel, any preference of type,area, or price range?

User: The hotel that I am looking for is called Gonville.System: Do you want book the hotel?

…

User: I want a Italian restaurant near the hotel.System: How about Prezzo? It places centre city.

…

ℎ𝑜𝑜𝑡𝑡𝑡𝑡𝑝𝑝:𝑂𝑂𝑂𝑂𝑟𝑟𝑡𝑡𝑝𝑝𝑡𝑡𝑝𝑝𝑢𝑢𝑟𝑟𝑝𝑝𝑟𝑟𝑡𝑡:𝑂𝑂𝑂𝑂

ℎ𝑜𝑜𝑡𝑡𝑡𝑡𝑝𝑝:𝑂𝑂𝑂𝑂𝑟𝑟𝑡𝑡𝑝𝑝𝑡𝑡𝑝𝑝𝑢𝑢𝑟𝑟𝑝𝑝𝑟𝑟𝑡𝑡:𝑂𝑂𝑂𝑂𝑂𝑂

ℎ𝑜𝑜𝑡𝑡𝑡𝑡𝑝𝑝:𝑂𝑂𝑂𝑂𝑂𝑂𝑟𝑟𝑡𝑡𝑝𝑝𝑡𝑡𝑝𝑝𝑢𝑢𝑟𝑟𝑝𝑝𝑟𝑟𝑡𝑡:𝑂𝑂𝑂𝑂

Architecture w/ domain state tracking

• NLU module is shared for DST, POL, and NLG

• Darker blocks mean previous turn

• DB result contains the number of matched entries for each domain

2021-03-17 57

Cross entropy loss

Binary cross entropy loss

Bert

FC (FullyConnected)

GRU GRU GRU

Training / inference

• Inputs

• 𝑢𝑢𝑝𝑝𝑡𝑡𝑟𝑟𝑡𝑡 , 𝑡𝑡𝑢𝑢𝑟𝑟𝑟𝑟 𝑑𝑑𝑜𝑜𝑑𝑑𝑝𝑝𝑖𝑖𝑟𝑟𝑡𝑡−1, 𝑑𝑑𝑖𝑖𝑝𝑝𝑝𝑝𝑜𝑜𝑑𝑑 𝑝𝑝𝑡𝑡𝑝𝑝𝑡𝑡𝑡𝑡𝑡𝑡−1

• Outputs(Training)

• 𝑡𝑡𝑢𝑢𝑟𝑟𝑟𝑟 𝑑𝑑𝑜𝑜𝑑𝑑𝑝𝑝𝑖𝑖𝑟𝑟𝑡𝑡• 𝑑𝑑𝑖𝑖𝑝𝑝𝑝𝑝𝑜𝑜𝑑𝑑 𝑝𝑝𝑡𝑡𝑝𝑝𝑡𝑡𝑡𝑡𝑡𝑡• 𝑑𝑑𝑝𝑝𝑡𝑡𝑡𝑡𝑡𝑡• 𝑝𝑝𝑝𝑝𝑝𝑝𝑡𝑡𝑡𝑡𝑑𝑑 𝑝𝑝𝑐𝑐𝑡𝑡𝑖𝑖𝑜𝑜𝑟𝑟𝑡𝑡• 𝑝𝑝𝑝𝑝𝑝𝑝𝑡𝑡𝑡𝑡𝑑𝑑 𝑟𝑟𝑡𝑡𝑝𝑝𝑝𝑝𝑜𝑜𝑟𝑟𝑝𝑝𝑡𝑡𝑡𝑡

• Outputs(inference)

• 𝑡𝑡𝑢𝑢𝑟𝑟𝑟𝑟 𝑑𝑑𝑜𝑜𝑑𝑑𝑝𝑝𝑖𝑖𝑟𝑟𝑡𝑡• 𝑑𝑑𝑖𝑖𝑝𝑝𝑝𝑝𝑜𝑜𝑑𝑑 𝑝𝑝𝑡𝑡𝑝𝑝𝑡𝑡𝑡𝑡𝑡𝑡• 𝑝𝑝𝑝𝑝𝑝𝑝𝑡𝑡𝑡𝑡𝑑𝑑 𝑟𝑟𝑡𝑡𝑝𝑝𝑝𝑝𝑜𝑜𝑟𝑟𝑝𝑝𝑡𝑡𝑡𝑡

2021-03-17 58

domain loss

value loss

gate loss

action loss

response loss

Sum

Cross entropy

Adam

End-to-end ASR

AudioInputsAudio

InputsAudioInputs

AudioInputsAudio

InputsTextOutputs

Beam SearchSpeech

Representation

CTC

Frontend (Preprocessing)

STFT, MEL

TransformerEncoder

TransformerDecoder

(Attention)

*Connectionist Temporal Classification (CTC)

Short-Time Fourier Transform (STFT)

Conformer Encoder

Conformer (Encoder)

Decoder - LSTM

https://arxiv.org/pdf/2005.08100.pdf*convolution-augmented transformer for speech recognition, Conformer

*Specaugment: A simple data augmentation method for automatic speech recognition (to log-mel spectrogram): time warping, frequency & time masking

https://arxiv.org/pdf/2005.08100.pdf

Conformer + Wav2Vec 2.0 (Encoder)Decoder - LSTMhttps://arxiv.org/pdf/2010.10504v1.pdf

*Pre-training: The contrastive loss between the contextvectors from masked features and the quantization unit is optimized.

https://arxiv.org/pdf/2010.10504v1.pdf

CTC-attention decoder

Joint CTC-Attention Decoding


*use intermediate label representationallowing repetitions of labels and blank labelsDo baum-welch optimization for Viterbi alignment S: decoder state

A: attention

*the shared encoder is trained by both CTC andattention model objectives simultaneously


LAS (Listen, Attend and Spell)

LAS (Listen, Attend and Spell)


*accepts filter bank spectra as inputs

*Attention-based recurrent network decoder(character decoding)*LAS does not make independenceassumptions in the label sequence unlike CTC


Tacotron2: Seq2seq with attention RNN + modified WaveNet

*location-sensitive attention: mitigating potential failure modes where some subsequences are repeated or ignored by the decoder*Auto-regressive decoder to generate mel spectrogram frame by frame; Previous frame input to pre-net & post-net (for residual)*stop token to stop generation

*wavenet: invert the mel spectrogram feature representation into time-domainwaveform samples (10 component mixture of logistic distributions (MoL) to generate 16-bitsamples at 24 kHz)

MelGan

Change WaveNet vocoder to:- MelGan for faster inference

for different frequency range of the audio

Loss: GAN objective + feature matching objective(L1 distance between the discriminator feature maps of real and synthetic audio)

66

Dialog System for Counseling (TOD Open-ended)

Paraphrasing Reflect Feeling

Correct Information Logical Answering

Listening Memorizing

Questioning Show Interest

Empathy Motivating

Sensibleness Attending

COUNSELING

CBT Counseling Scenario

67

Initiation

Greeting

Explanation of Counseling

Method

Extraction information of

User

Situation

5W1H details

Automatic thoughts

Ground Belief

EmotionSupporting Evidence

No-Supporting Evidence

Refutation

Alternative thoughts

Emotion

Motivating

(Cognitive Behaviour Therapy)

*Automatic vs alternative

*For next visit

68

VHRED:RNN Dialog Model

Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., & Bengio, Y. (2016). A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069..

*Variational hierarchical recurrent

User : How are things?

Meena : They’re good. How about you?

User : I’m trying to decide what to watch

Meena : What are some of your favorite shows?

User : I like the Good Place

Meena : I haven’t watched that. My favorite shows are Supernatural, and Star trek : TNG.

User : What is TNG?

Meena : The Next Generation

Evolved Transformer Encoder Block

Evolved Transformer Decoder Block

x13

Meena : Transformer Dialog Model

Adiwardana, D., Luong, M. T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., ... & Le, Q. V. (2020). Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

69

x1

*Automatically learned architecture w/ NAS (neural architecture search)

Summary

- NLP enjoys rapid progress in the last 10 years due to deep learning.

- NLP is reaching the point of having big social impact, making issues like bias and security increasingly important.

- Big model, big computation resources, huge training times are problematic, need to focus more light way of doing NLP (even embedded model)

70

Documents

Deep Learning-based Spoken Dialog Systems · Deep Learning-based Spoken Dialog Systems. Virtual PersonalAssistant: Trends S/W A ssi st a n t Apple. Google Now. MS Cortana. Samsung