Dialogue State Induction Using Neural Latent Variable Models
Qingkai Min [1,2], Libo Qin [3], Zhiyang Teng [1,2], Xiao Liu [4] and Yue Zhang [1,2]
[1] School of Engineering, Westlake University
[2] Institute of Advanced Technology, Westlake Institute for Advanced Study
[3] Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology
[4] School of Computer Science and Technology, Beijing Institute of Technology


Page 1

Dialogue State Induction Using Neural Latent Variable Models

Qingkai Min [1,2], Libo Qin [3], Zhiyang Teng [1,2], Xiao Liu [4] and Yue Zhang [1,2]

[1] School of Engineering, Westlake University
[2] Institute of Advanced Technology, Westlake Institute for Advanced Study
[3] Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology
[4] School of Computer Science and Technology, Beijing Institute of Technology

Page 2

Outline

Motivation

Method

Experiments

Conclusion

Page 3

Motivation

Page 4

CHAPTER 1 What is a task-oriented dialogue system?

Pic from: Gao J., Galley M., Li L. Neural Approaches to Conversational AI. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018: 1371-1374.

Typical modular architecture

Assist user in solving a task

Page 5

CHAPTER 1 What is a task-oriented dialogue system?

End-to-end architecture

Assist user in solving a task

Pic from: Gao J., Galley M., Li L. Neural Approaches to Conversational AI. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018: 1371-1374.

Page 8

CHAPTER 1 What is dialogue state tracking (DST)?

The dialogue state represents what the user is looking for at the current turn of the conversation.

User: I want an expensive restaurant that serves Turkish food.
System: Anatolia serves Turkish food.
User: What is the area?

Turn 1: inform(price=expensive, food=Turkish)
Turn 2: inform(price=expensive, food=Turkish); request(area)

Page 11

CHAPTER 1 Current DST scenarios

Traditional DST: utterance → NLU → DST → dialogue state (uncertainty from the NLU module propagates into the tracker)

End-to-end DST: utterance → DST → dialogue state (a single DST component)

Page 16

CHAPTER 1 End-to-end DST

User: I want an expensive restaurant that serves Turkish food.
Manual labeling: inform(price=expensive, food=Turkish)

System: Anatolia serves Turkish food.
User: What is the area?
Manual labeling: inform(price=expensive, food=Turkish); request(area)

Ontology (optional):
price: [cheap, expensive, moderate, ...]
food: [Turkish, Italian, Polish, ...]
area: [south, north, center, ...]

DST output:
Turn 1: inform(price=expensive, food=Turkish)
Turn 2: inform(price=expensive, food=Turkish); request(area)
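The turn-level states above accumulate across the dialogue: informed slot-value pairs persist, and requested slots are added on top. A minimal sketch of that update (the dict layout is an illustrative assumption, not the paper's internal format):

```python
def update_state(state, inform=None, request=None):
    """Accumulate a dialogue state across turns: informed slot-value
    pairs persist, and requested slots join the request set."""
    new_state = {"inform": dict(state["inform"]),
                 "request": set(state["request"])}
    if inform:
        new_state["inform"].update(inform)
    if request:
        new_state["request"].update(request)
    return new_state

state = {"inform": {}, "request": set()}
# Turn 1: inform(price=expensive, food=Turkish)
state = update_state(state, inform={"price": "expensive", "food": "Turkish"})
# Turn 2: request(area); the earlier informs are carried over
state = update_state(state, request={"area"})
```

After the second turn, `state` holds both informed slots and the pending `area` request, matching the Turn 2 labeling above.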

Page 22

CHAPTER 1 Limitations of end-to-end DST

End-to-end DST paradigm: corpus → manual labeling → train DST model → test

Limitations of end-to-end DST:

• Costly and slow: 8,438 dialogues annotated by 1,249 workers in the MultiWOZ 2.0 dataset

• Error-prone: annotation errors affect around 40% of dialogue states in MultiWOZ 2.0 [Eric et al., 2019] and over 30% in MultiWOZ 2.1 [Zhang et al., 2019] (now updated to MultiWOZ 2.2)

[Eric et al., 2019] Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. MultiWOZ 2.1: Multi-domain dialogue state corrections and state tracking baselines. arXiv, 2019.
[Zhang et al., 2019] Jian-Guo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wan, Philip S. Yu, Richard Socher, and Caiming Xiong. Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv, 2019.

Page 27

CHAPTER 1 What is the problem?

End-to-end DST is successful in narrow domains with large annotated datasets, but the resulting models are limited to the domains they are trained on and do not generalize to new domains.

Page 29

CHAPTER 1 Dialogue State Induction (DSI)

Definition:

What is given? A set of customer service records without annotation.

User: I want an expensive restaurant that serves Turkish food.
System: Anatolia serves Turkish food.
User: What is the area?

What is the target? Automatically discover the information that the user is looking for at each turn.

Turn 1: inform(price=expensive, food=Turkish)
Turn 2: inform(price=expensive, food=Turkish); request(area)

Page 37

CHAPTER 1 Dialogue State Induction vs DST

DST input: dialogues with manual labeling, plus an optional ontology.
User: I want an expensive restaurant that serves Turkish food.
Manual labeling: inform(price=expensive, food=Turkish)
System: Anatolia serves Turkish food.
User: What is the area?
Manual labeling: inform(price=expensive, food=Turkish); request(area)
Ontology (optional):
price: [cheap, expensive, moderate, ...]
food: [Turkish, Italian, Polish, ...]
area: [south, north, center, ...]

DSI input: the same dialogues, with no manual labeling and no ontology.

Output in both settings:
Turn 1: inform(price=expensive, food=Turkish)
Turn 2: inform(price=expensive, food=Turkish); request(area)

Page 41

CHAPTER 1 DSI vs zero-shot DST

Zero-shot DST: support unseen domains (services)

Motivation: different domains (services) share similar schemas, with slots accompanied by natural language descriptions.

Example from the SGD dataset [Rastogi et al., 2019]:

Service: service_name: "Flights"
  description: "Search for cheap flights across multiple providers"
Slots:
  name: "origin", description: "City of origin for the flight"
  name: "destination", description: "City of destination for the flight"

Service: service_name: "Trains"
  description: "Service to find train journeys between cities"
Slots:
  name: "from", description: "Starting city for train journey"
  name: "to", description: "Ending city for the train journey"

[Rastogi et al., 2019] Rastogi A., Zang X., Sunkara S., et al. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855, 2019.

Page 51

CHAPTER 1 DSI vs zero-shot DST

Service: service_name: "Financial"
  description: "Search for financial products across multiple providers"
Slots:
  name: "deposit_rate", description: "the standard for calculating deposit interest"
  name: "loan_rate", description: "the interest rate that banks charge to borrowers"

Service: service_name: "Trains"
  description: "Service to find train journeys between cities"
Slots:
  name: "from", description: "Starting city for train journey"
  name: "to", description: "Ending city for the train journey"

Zero-shot DST limitations:
• Requires highly qualified (consistent) human annotation of schemas
• Fails to transfer to distant domains (services), where schemas are no longer consistent ✗

DSI features:
• Relieves the human annotation burden
• Data-driven: dialogue states are discovered automatically

Page 52

Method

Page 55

CHAPTER 2 How do we solve DSI?

Two steps:

Utterance: I need to take a train out of Chicago, I will be leaving Dallas on Wednesday.

• Candidate (value) extraction (POS tagging, NER, coreference): train, Chicago, Dallas, Wednesday
• Slot assignment with two neural latent variable models (DSI-base and DSI-GM): inform{train-departure=Chicago, train-destination=Dallas, train-leaveat=Wednesday}; train=None
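The first step can be approximated with a toy extractor. This is a rough stand-in for the POS/NER/coreference pipeline named above, not the paper's system: capitalized tokens approximate named entities, and a small hypothetical keyword list approximates domain nouns.

```python
import re

# Purely illustrative stand-in for a domain-noun lexicon; the real
# pipeline uses a POS tagger, NER, and coreference resolution instead.
DOMAIN_NOUNS = {"train", "restaurant", "hotel", "flight"}

def extract_candidates(utterance):
    """Collect candidate values: domain nouns and capitalized tokens
    (a crude proxy for named entities), in order of first appearance."""
    candidates = []
    for tok in re.findall(r"[A-Za-z]+", utterance):
        if tok.lower() in DOMAIN_NOUNS or (tok[0].isupper() and tok.lower() != "i"):
            if tok not in candidates:
                candidates.append(tok)
    return candidates

utt = "I need to take a train out of Chicago, I will be leaving Dallas on Wednesday."
print(extract_candidates(utt))  # ['train', 'Chicago', 'Dallas', 'Wednesday']
```

The second step, slot assignment, is what the latent variable models on the following slides handle.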

Page 56

CHAPTER 2 What is a VAE?

[Figure: autoencoder vs. variational autoencoder]

Pics from https://www.jianshu.com/p/ffd493e10751
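The VAE machinery the models build on reduces to two pieces: the reparameterization trick (so sampling stays differentiable) and a closed-form KL term against the standard-normal prior. A minimal plain-Python sketch for a diagonal Gaussian, not tied to the paper's implementation:

```python
import math
import random

def reparameterize(mu, log_var, rng=random.Random(0)):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1),
    so gradients can flow through mu and log_var."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian, in closed form:
    -0.5 * sum(1 + log_var - mu^2 - exp(log_var))."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

z = reparameterize([0.0, 1.0], [0.0, 0.0])  # one latent sample
kl = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # zero at the prior
```

Training minimizes the reconstruction loss plus this KL term (the negative ELBO); the slides that follow instantiate the encoder side for DSI.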

Page 63: taolusi.github.io · Dialogue State Induction Using Neural Latent Variable Models. Qingkai Min. 1;2 , Libo Qin. 3 , Zhiyang Teng. 1;2 , Xiao Liu. 4. and Yue Zhang. 1;2. 1. School

CHAPTER 2 Model: generative DSI-base

I need to take a train out of Chicago, I will be leaving Dallas on Wednesday.

0 1 0 1 … 0 0 1 1

vocab length (all candidates)

one-hot Pre-trained ELMo

train Chicago Dallas Wednesdaycontextualized

embedding

LT+SoftPlus+Dropout

AvgPooling+LT+SoftPlus+Dropout

encoded oh encoded ce

Encoder

Page 64: taolusi.github.io · Dialogue State Induction Using Neural Latent Variable Models. Qingkai Min. 1;2 , Libo Qin. 3 , Zhiyang Teng. 1;2 , Xiao Liu. 4. and Yue Zhang. 1;2. 1. School

CHAPTER 2 Model: generative DSI-base

I need to take a train out of Chicago, I will be leaving Dallas on Wednesday.

0 1 0 1 … 0 0 1 1

vocab length (all candidates)

one-hot Pre-trained ELMo

train Chicago Dallas Wednesdaycontextualized

embedding

LT+SoftPlus+Dropout

AvgPooling+LT+SoftPlus+Dropout

encoded oh encoded ce

LT+BatchNorm LT+BatchNorm

posterior µ 2posterior logσ

Encoder

Page 65: taolusi.github.io · Dialogue State Induction Using Neural Latent Variable Models. Qingkai Min. 1;2 , Libo Qin. 3 , Zhiyang Teng. 1;2 , Xiao Liu. 4. and Yue Zhang. 1;2. 1. School
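The two-branch encoder above can be sketched in a few lines of numpy; LT is a plain linear transform here, all weights are random stand-ins, and Dropout/BatchNorm are omitted:

```python
# Minimal sketch of the DSI-base encoder: one-hot branch, averaged-embedding
# branch, then two linear heads for posterior mu and log sigma^2.
import numpy as np

rng = np.random.default_rng(0)
vocab_len, feat_dim, hid, lat = 8, 4, 6, 3   # toy sizes

def softplus(x):
    return np.log1p(np.exp(x))

def encode(oh, ce):
    """oh: (vocab_len,) bag of candidates; ce: (num_cands, feat_dim)
    contextualized embeddings. Returns posterior mu and log sigma^2."""
    W_oh = rng.normal(size=(vocab_len, hid))
    W_ce = rng.normal(size=(feat_dim, hid))
    enc_oh = softplus(oh @ W_oh)               # LT + SoftPlus (Dropout omitted)
    enc_ce = softplus(ce.mean(axis=0) @ W_ce)  # AvgPooling + LT + SoftPlus
    h = np.concatenate([enc_oh, enc_ce])
    W_mu = rng.normal(size=(2 * hid, lat))
    W_ls = rng.normal(size=(2 * hid, lat))
    return h @ W_mu, h @ W_ls                  # LT (BatchNorm omitted)

oh = np.array([0, 1, 0, 1, 0, 0, 1, 1], dtype=float)
ce = rng.normal(size=(4, feat_dim))   # train, Chicago, Dallas, Wednesday
mu, log_var = encode(oh, ce)
```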

CHAPTER 2 Model: generative DSI-base — Sampling and Decoder

Sampling (reparameterization trick): z = μ + σ ∗ ε, ε ~ N(0, I)
z: latent state vector

z → LT + Softmax → s: latent slot vector (dimension: slot_num)

Decoder:
s → LT + BatchNorm → reconstructed oh, e.g. 0.01 0.2 0.01 0.08 … 0.03 0.02 0.3 0.4, over the vocab length (all candidates)
s → LT + BatchNorm → recon_μ (contextualized feature dimension)
s → LT + BatchNorm → recon_log σ² (contextualized feature dimension)
(recon_μ, recon_log σ²): reconstructed parameters to generate ce
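The sampling step is the standard reparameterization trick; a minimal sketch with toy shapes:

```python
# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
# Sampling this way keeps z differentiable w.r.t. mu and log_var.
import numpy as np

def reparameterize(mu, log_var, rng):
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)   # eps ~ N(0, I)
    return mu + sigma * eps

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(3), np.zeros(3), rng)   # one draw from N(0, I)
```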

CHAPTER 2 Loss

Loss = reconstruction term + regularization term

Reconstruction term:
• Cross-entropy between the reconstructed oh (0.01 0.2 0.01 0.08 … 0.03 0.02 0.3 0.4) and the input oh (0 1 0 1 … 0 0 1 1), over the vocab length (all candidates)
• Negative log-likelihood of each candidate's contextualized embedding under the multivariate Gaussian with parameters (recon_μ, recon_log σ²): the sum of -dist.log_prob(ce[train]), -dist.log_prob(ce[Chicago]), -dist.log_prob(ce[Dallas]), -dist.log_prob(ce[Wednesday])

Regularization term:
• KL divergence between posterior and prior: KL(N(μ, σ), N(0, 1))

Pic from https://www.datalearner.com/blog/1051485590815771
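A minimal numpy sketch of this loss, assuming a diagonal Gaussian for the embedding term and toy tensors throughout:

```python
# ELBO-style loss from the slide: one-hot cross-entropy + diagonal-Gaussian
# negative log-likelihood for the embeddings + closed-form KL to N(0, I).
import numpy as np

def cross_entropy(oh, recon_oh):
    return -np.sum(oh * np.log(recon_oh + 1e-12))

def gaussian_nll(ce, mu, log_var):
    """Sum over candidates of -log N(ce; mu, diag(sigma^2))."""
    var = np.exp(log_var)
    ll = -0.5 * (np.log(2 * np.pi) + log_var + (ce - mu) ** 2 / var)
    return -np.sum(ll)

def kl_to_standard_normal(mu, log_var):
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))

oh = np.array([0., 1., 0., 1.])
recon_oh = np.array([0.1, 0.4, 0.1, 0.4])
ce = np.zeros((4, 3))                      # 4 candidates, feature dim 3
loss = (cross_entropy(oh, recon_oh)
        + gaussian_nll(ce, np.zeros(3), np.zeros(3))
        + kl_to_standard_normal(np.zeros(2), np.zeros(2)))
```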

CHAPTER 2 DSI-base inference

Utterance: I need to take a train out of Chicago, I will be leaving Dallas on Wednesday.

Encoder → posterior μ, posterior log σ² → Sampling → z: latent state vector
z → LT + Softmax → s → p(s): slot distribution, [slot_num, 1]

For each candidate in {train, Chicago, Dallas, Wednesday}:

CHAPTER 2 What does the model learn? (oh branch)

s: slot vector (0.7 0.1 0.01 0.08 …), dimension slot_num
W: parameter matrix, slot_num × vocab_len

s × W = reconstructed oh (0.01 0.2 0.01 0.8 … 0.03 0.02 0.3 0.4)
Cross-entropy loss against oh (0 1 0 1 … 0 0 1 1)

After training, the rows of W give slot_num different slot-candidate distributions: p(c|si), i = 1, 2, …, slot_num
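This view can be made concrete with toy numbers, assuming each row of W is softmax-normalized so it acts as a slot-candidate distribution p(c|si):

```python
# Each (softmax-normalized) row of W is a distribution over candidates;
# the slot vector s mixes these rows into the reconstructed one-hot vector.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
slot_num, vocab_len = 3, 5
W = rng.normal(size=(slot_num, vocab_len))
p_c_given_s = softmax(W)          # one distribution over candidates per slot

s = np.array([0.7, 0.1, 0.2])     # latent slot vector (sums to 1)
recon_oh = s @ p_c_given_s        # mixture over slots -> reconstructed oh
```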

CHAPTER 2 DSI-base inference (continued)

For each candidate in {train, Chicago, Dallas, Wednesday}:

oh: the candidate's one-hot vector, 0 0 0 1 … 0 0 0 0 (length vocab_len)
oh with W (slot_num × vocab_len) → p(oh|s): [slot_num, 1]

ce: the candidate's contextualized embedding (e.g. Chicago)

CHAPTER 2 What does the model learn? (ce branch)

Decoder: s (slot vector, dimension slot_num) × W (parameter matrix, slot_num × feature_dim) = recon_μ; a second matrix of the same shape gives recon_log σ² — the reconstructed parameters to generate ce.

Row i of these matrices gives (μi, σi), so the model learns slot_num different multivariate Gaussian distributions (μi, σi), i = 1, 2, …, slot_num.
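Scoring one candidate's embedding under each of the slot_num Gaussians can be sketched as follows (diagonal covariance and toy values assumed):

```python
# Score a candidate embedding ce under each slot's Gaussian (mu_i, sigma_i).
import numpy as np

def log_prob_diag_gaussian(x, mu, log_var):
    """log N(x; mu, diag(sigma^2)), one value per row of mu/log_var."""
    var = np.exp(log_var)
    return -0.5 * np.sum(np.log(2 * np.pi) + log_var + (x - mu) ** 2 / var,
                         axis=-1)

slot_num, feat_dim = 4, 3
recon_mu = np.zeros((slot_num, feat_dim))
recon_log_var = np.zeros((slot_num, feat_dim))
recon_mu[2] = 1.0                     # slot 2's mean matches ce below

ce = np.ones(feat_dim)                # candidate embedding, e.g. "Chicago"
log_p_ce_given_s = log_prob_diag_gaussian(ce, recon_mu, recon_log_var)
best_slot = int(np.argmax(log_p_ce_given_s))   # -> 2
```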

CHAPTER 2 Multivariate Gaussian probability density

Each slot i defines a multivariate Gaussian with mean μi and scale σi over the contextualized feature dimension. Each candidate's contextualized embedding is scored under it: -dist.log_prob(ce[train]), -dist.log_prob(ce[Chicago]), -dist.log_prob(ce[Dallas]), -dist.log_prob(ce[Wednesday])

Pic from https://www.datalearner.com/blog/1051485590815771

CHAPTER 2 DSI-base inference (putting it together)

Utterance: I need to take a train out of Chicago, I will be leaving Dallas on Wednesday.

Encoder + Sampling → z: latent state vector → LT + Softmax → s: latent slot vector (dimension: slot_num)
p(s): [slot_num, 1]

For each candidate in {train, Chicago, Dallas, Wednesday}, e.g. Chicago:
• oh (0 0 0 1 … 0 0 0 0) with W (slot_num × vocab_len) → p(oh|s): [slot_num, 1]
• ce with (μi, σi), i = 1, 2, …, slot_num → p(ce|si): [slot_num, 1]
• p(s|Chicago) ∝ p(s) ∗ p(oh|s) ∗ p(ce|s): [slot_num, 1]
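The whole scoring pipeline for one candidate, with toy stand-ins for every tensor (random parameters, not the trained model's):

```python
# Combine the slot prior p(s), the one-hot likelihood p(oh|s) read off the
# decoder matrix, and the embedding likelihood p(ce|s_i), then take argmax.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def log_prob_diag_gaussian(x, mu, log_var):
    var = np.exp(log_var)
    return -0.5 * np.sum(np.log(2 * np.pi) + log_var + (x - mu) ** 2 / var,
                         axis=-1)

rng = np.random.default_rng(1)
slot_num, vocab_len, feat_dim = 3, 6, 2

p_s = softmax(rng.normal(size=slot_num))                     # p(s)
W = softmax(rng.normal(size=(slot_num, vocab_len)), axis=1)  # p(c|s_i) rows
recon_mu = rng.normal(size=(slot_num, feat_dim))
recon_log_var = np.zeros((slot_num, feat_dim))

cand_idx = 3                                 # the candidate's one-hot index
ce = rng.normal(size=feat_dim)               # its contextualized embedding

p_oh_given_s = W[:, cand_idx]                                # [slot_num]
p_ce_given_s = np.exp(log_prob_diag_gaussian(ce, recon_mu, recon_log_var))
score = p_s * p_oh_given_s * p_ce_given_s    # proportional to p(s|candidate)
p_s_given_cand = score / score.sum()
assigned_slot = int(np.argmax(p_s_given_cand))
```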

CHAPTER 2 Post-processing: slot mapping

predictions: Chicago → slot 0, Houston → slot 0, Washington → slot 0, Boston → slot 0, Magdalene College → slot 0
references: Chicago, Houston, Washington, Boston → train-departure (4 times); Magdalene College → attraction-name (1 time)

Mapping from slot indexes to labels? Majority vote: {0: train-departure}

With this mapping, every candidate assigned to slot 0 — including Magdalene College — is labeled train-departure.
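The majority-vote mapping itself is a few lines:

```python
# Majority vote from induced slot indexes to reference labels: slot 0
# co-occurs with train-departure 4 times and attraction-name once.
from collections import Counter, defaultdict

def map_slots(pred_ref_pairs):
    """pred_ref_pairs: iterable of (slot_index, reference_label)."""
    votes = defaultdict(Counter)
    for slot, label in pred_ref_pairs:
        votes[slot][label] += 1
    return {slot: c.most_common(1)[0][0] for slot, c in votes.items()}

pairs = [(0, "train-departure")] * 4 + [(0, "attraction-name")]
mapping = map_slots(pairs)   # -> {0: 'train-departure'}
```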

CHAPTER 2 What is the problem with DSI-base?

Multiple domains: train (departure, destination, day, …) vs. taxi (leave at, arrive by, …) — similar slots across domains.

Similar contexts: "I need to arrive by 19:45 and departs from Cambridge."

CHAPTER 2 How does DSI-GM work?

A single Gaussian prior: z ~ N(μ, σ). All slots (train-arrive by, train-leave at, train-day, train-destination, train-departure, taxi-arrive by, taxi-leave at, taxi-destination, taxi-departure) share one latent space, so similar slots from different domains overlap.

A Mixture-of-Gaussians prior: z drawn from components (μ1, σ1), …, (μK, σK) with mixture weights π. Each component can capture one domain, separating train-* slots from taxi-* slots in the latent space.
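Sampling from the two priors differs only in the extra component draw; a sketch of the Mixture-of-Gaussians prior with made-up component parameters:

```python
# Mixture-of-Gaussians prior on z: first draw a component (domain) from pi,
# then draw z from that component's Gaussian.
import numpy as np

def sample_gmm_prior(pi, mus, sigmas, rng):
    k = rng.choice(len(pi), p=pi)        # pick a domain component
    z = mus[k] + sigmas[k] * rng.standard_normal(mus[k].shape)
    return k, z

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])                # e.g. train vs. taxi
mus = np.array([[-3.0, -3.0], [3.0, 3.0]])
sigmas = np.array([[0.1, 0.1], [0.1, 0.1]])
k, z = sample_gmm_prior(pi, mus, sigmas, rng)
```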

CHAPTER 2 How does DSI-GM work?

Utterance: I need to take a train out of Chicago, I will be leaving Dallas on Wednesday.

DSI-base: Encoder + Sampling → z → Prediction → Chicago: taxi-departure (wrong domain)

DSI-GM: Encoder + Sampling → z, under a Mixture-of-Gaussians prior → first identify the domain (Domain: train) → Prediction → Chicago: departure (i.e. train-departure)

CHAPTER 2

I need to take a train out of Chicago, I will be leaving Dallas on Wednesday.

Encoder+Sampling

z

Prediction

Chicago: taxi-departureDSI-base

Encoder+Sampling

z

Prediction Chicago: departure

DSI-GM

A Mixture-of-Gaussians

Domain: train

Chicago: train-departure

How does DSI-GM work?

Experiments

CHAPTER 3 DSI results


CHAPTER 3 DSI-Based Response Generation

[Chen et al., 2019] Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, and William Yang Wang. Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In ACL, 2019.

CHAPTER 3 Analysis

Domain-level comparison of the latent representation z (figure).
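One common way to produce such a domain-level comparison is to project sampled latent vectors to 2-D. The sketch below is a generic illustration with synthetic z samples, not the paper's figure: the component means, latent dimension, and PCA projection are all assumptions. The point it demonstrates is that if the mixture components for two domains are well separated in latent space, they remain separated after projection.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical latent samples: utterances from two domains, each drawn from
# its own Gaussian component (as in a mixture-of-Gaussians prior).
z_train = rng.normal(loc=0.0, scale=0.5, size=(50, 8))
z_taxi = rng.normal(loc=3.0, scale=0.5, size=(50, 8))
Z = np.vstack([z_train, z_taxi])

# Project to 2-D with PCA (computed via SVD of the centered data).
Zc = Z - Z.mean(axis=0)
_, _, vt = np.linalg.svd(Zc, full_matrices=False)
Z2 = Zc @ vt[:2].T

# The two components stay separated along the first principal component.
sep = abs(Z2[:50, 0].mean() - Z2[50:, 0].mean())
print(Z2.shape, sep > 1.0)
```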

Conclusion

CHAPTER 4 Conclusion

• Dialogue state induction: a novel task to automatically identify dialogue states

• DSI-base/DSI-GM: two neural generative models with latent variables

• Challenging and promising: the unsupervised setting is very practical

• IJCAI review: "This problem is important and interesting, and this area should attract more attention. This work has great potential to motivate follow-up research."

Contact: [email protected]

Paper: https://www.ijcai.org/Proceedings/2020/0532.pdf

GitHub: https://github.com/taolusi/dialogue-state-induction
