

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 8107–8117, November 16–20, 2020. © 2020 Association for Computational Linguistics


Conversational Semantic Parsing for Dialog State Tracking

Jianpeng Cheng  Devang Agrawal  Héctor Martínez Alonso  Shruti Bhargava  Joris Driesen
Federico Flego  Dain Kaplan  Dimitri Kartsaklis  Lin Li  Dhivya Piraviperumal
Jason D Williams  Hong Yu  Diarmuid Ó Séaghdha  Anders Johannsen

Apple

{jianpeng.cheng,hmartinezalonso,shruti bhargava,joris driesen,fflego,dain kaplan,dkartsaklis,lli9,dhivyaprp,jason williams4,hong yu,doseaghdha,ajohannsen}@apple.com

Abstract

We consider a new perspective on dialog state tracking (DST), the task of estimating a user's goal through the course of a dialog. By formulating DST as a semantic parsing task over hierarchical representations, we can incorporate semantic compositionality, cross-domain knowledge sharing and co-reference. We present TreeDST, a dataset of 27k conversations annotated with tree-structured dialog states and system acts.1 We describe an encoder-decoder framework for DST with hierarchical representations, which leads to 20% improvement over state-of-the-art DST approaches that operate on a flat meaning space of slot-value pairs.

1 Introduction

Task-based dialog systems, for example digital personal assistants, provide a linguistic user interface for all kinds of applications: from searching a database, booking a hotel, checking the weather to sending a text message. In order to understand the user, the system must be able to both parse the meaning of an utterance and relate it to the context of the conversation so far. While lacking the richness of a conversation between two humans, the dynamics of human-machine interaction can still be complex: the user may change their mind, correct a misunderstanding or refer back to previously-mentioned information.

Language understanding for task-based dialog is often termed "dialog state tracking" (DST) (Williams et al., 2016), the mental model being that the intent of the user is a partially-observed state that must be re-estimated at every turn given new information.

1 The dataset is available at https://github.com/apple/ml-tree-dst.

The dialog state is typically modelled as a set of independent slots, and a standard DST system will maintain a distribution over values for each slot. In contrast, language understanding for other NLP applications is often formulated as semantic parsing, which is the task of converting a single-turn utterance to a graph-structured meaning representation. Such meaning representations include logical forms, database queries and other programming languages.

These two perspectives on language understanding, DST and semantic parsing, have complementary strengths and weaknesses. DST targets a fuller range of conversational dynamics but typically uses a simple and limiting meaning representation. Semantic parsing embraces a compositional view of meaning. By basing meaning on a space of combinable, reusable parts, compositionality can make the NLU problem space more tractable (repeated concepts must only be learned once) and more expressive (it becomes possible to represent nested intents). At the same time, most semantic parsing research treats a sentence as an isolated observation, detached from conversational context.

This work unifies the two perspectives by reformulating DST as conversational semantic parsing. As in DST, the task is to track a user's goal as it accumulates over the course of a conversation. The goal is represented using a structured formalism like those used in semantic parsing. Specifically, we adopt a hierarchical representation which captures domains, verbs, operators and slots within a rooted graph grounded to an ontology. The structured dialog state is capable of tracking nested intents and representing compositions in a single graph (Turn 5, Table 1). The formalism also naturally supports cross-domain slot sharing and cross-turn co-reference through incorporating the shared slots or the references as sub-graphs in the representation (Turn 3, Table 1).



Turn  Utterance and Annotation

1  User: Hi can you book me a flight to Paris please.
     user.flight.book.object.equals.destination.equals.location.equals.Paris
   System: Sure, when and where will you depart?
     system.prompt.flight.book.object.equals
       .source
       .departureDateTime

2  User: Tomorrow from London
     user.flight.book.object.equals
       .destination.equals.location.equals.Paris
       .source.equals.location.equals.London
       .departureDateTime.equals.date.equals.definedValue.equals.Tomorrow
   System: I found 5 flights for you. The earliest one departs at 10 AM with a cost of £105. Would you like it?
     system
       .prompt.flight.book.object.equals.departureDateTime.equals.time
       .inform.flight.find
         .count.equals.5
         .object.equals
           .departureDateTime.equals.time.equals
             .hour.equals.10
             .meridiem.equals.AM
           .price.equals.105

3  User: Do I have any calendar event on that day?
     user.calendarEvent.checkExistence.object.equals.dateTimeRange.equals.date.equals.definedValue.equals.Tomorrow
   System: No you don't have any event.
     system.inform.calendarEvent.find.notExisted

4  User: I would book the 10 AM flight for me please
     user.flight.book.object.equals
       .destination.equals.location.equals.Paris
       .source.equals.location.equals.London
       .departureDateTime.equals
         .date.equals.definedValue.equals.Tomorrow
         .time.equals
           .hour.equals.10
           .meridiem.equals.AM
   System: Here is your booking information. Please confirm.
     system.offer.flight.book

5  User: Direction to my next meeting.
     user.navigation.find.object.equals.destination.equals.reference.calendarEvent.object.equals.listOffset.equals.1
   System: Here is the direction to your meeting at the GlassHouse.
     system.navigation.inform.find.calendarEvent.inform.find.object.equals.location.equals.GlassHouse

Table 1: An example conversation in TreeDST with annotations. User and system utterances are marked in red and blue respectively. We use dots to represent tree edges and increased indentation levels to reveal multiple children attached to the same parent node. A side-by-side comparison between a dotted tree and its full drawing can be found in Appendix A.

Using a reverse annotation approach inspired by Shah et al. (2018) and Rastogi et al. (2019), we have collected a large dataset of task-oriented dialogs annotated with hierarchical meaning representations. Each dialog was generated through a two-step process. First, a generative dialog simulator produces a meaningful conversational flow and a template-based utterance for each turn in the conversation. Then the utterances are paraphrased by human annotators to render more realistic and natural conversations. The resulting dataset, which we call TreeDST, covers 27k conversations across 10 domains. Conversations in TreeDST are non-linear: they contain glitches which represent system failures and uncooperative user behaviors such as under- and over-specifying slot information. There are also use cases not addressed in existing slot-filling datasets, including compositional intents and multi-intent utterances.

The second contribution of this work is a conversational semantic parser that tackles DST as a graph generation problem. At each turn, the model encodes the current user utterance and representations of the dialog history, based upon which the decoder generates the updated dialog state with a mixture of generation and copy mechanisms. In order to track practical conversations with intent switching and resumption (Lee and Stent, 2016; El Asri et al., 2017), we adopt a stack (Rudnicky and Xu, 1999) to represent multiple tasks in the dialog history, and a parent-pointer decoder to speed up decoding. We conducted controlled experiments to factor out the impact of hierarchical representations from model architectures. Overall our approach leads to 20% improvement over state-of-the-art DST approaches that operate on a flat meaning space.

2 Related Work

Modeling  Traditional DST models apply discriminative classifiers over the space of slot-value combinations (Crook and Lemon, 2010; Henderson et al., 2014b; Williams et al., 2016). These models require feature extraction from user utterances based on manually constructed semantic dictionaries, making them vulnerable to language variations. Neural classification models (Mrkšić et al., 2017; Mrkšić and Vulić, 2018) alleviate the problem by learning distributed representations of user utterances. However, they still lack scalability to large, unbounded output spaces (Xu and Hu, 2018; Lee et al., 2019) and structured representations. To address these limitations, some recent work treats slot filling as a sequence generation task (Ren et al., 2019; Wu et al., 2019).

On the other hand, single-turn semantic parsers have long used structured meaning representations to address compositionality (Liang et al., 2013; Banarescu et al., 2013; Kollar et al., 2018; Yu et al., 2018; Gupta et al., 2018).


Solutions range from chart-based constituency parsers (Berant et al., 2013) to more recent neural sequence-to-sequence models (Jia and Liang, 2016). The general challenge of scaling semantic parsing to DST is that the dialog state, as an accumulation of conversation history, requires expensive context-dependent annotation. It is also unclear how utterance semantics can be aggregated and maintained in a structured way. In this work we provide a solution to unify DST with semantic parsing.

Data Collection  The most straightforward approach to building datasets for task-oriented dialog is to directly annotate human-system conversations (Williams et al., 2016). A limitation is that this approach requires a working system at hand, which causes a classic chicken-and-egg problem for improving user experience. The issue can be avoided with Wizard-of-Oz (WoZ) experiments to collect human-human conversations (El Asri et al., 2017; Budzianowski et al., 2018; Peskov et al., 2019; Byrne et al., 2019; Radlinski et al., 2019). However, dialog state annotation remains challenging and costly in WoZ, and the resulting distribution could be different from that of human-machine conversations (Budzianowski et al., 2018). One approach that avoids direct meaning annotation is to use a dialog simulator (Schatzmann et al., 2007; Li et al., 2016). Recently, Shah et al. (2018) and Rastogi et al. (2019) generate synthetic conversations which are subsequently paraphrased by crowdsourcing. This approach has been shown to provide better coverage while reducing the error and cost of dialog state annotation (Rastogi et al., 2019). We adopt a similar approach in our work, but focus on a structured meaning space.

3 Setup

3.1 Problem Statement

We use the following notions throughout the paper. A conversation X has a representation Y grounded to an ontology K: at turn t, every user utterance x^u_t is annotated with a user dialog state y^u_t, which represents an accumulated user goal up to time step t. Meanwhile, every system utterance x^s_t is annotated with a system dialog act y^s_t, which represents the system action in response to y^u_t. Both y^u_t and y^s_t adopt the same structured semantic formalism to encourage knowledge sharing between the user and the system.

From the perspective of the system, y^s_t is observed (the system knows what it has just said) and y^u_t must be inferred from the user's utterance. For the continuation of an existing goal, the old dialog state will keep being updated; however, when the user proposes a completely new goal during the conversation, a new dialog state will overwrite the old one. To track goal switching and resumption, a stack is used to store non-accumulable dialog states in the entire conversation history (in both data simulation and dialog state tracking).

This work has two goals: 1) building a conversational dataset with structured annotations that can effectively represent the joint distribution P(X, Y); and 2) building a dialog state tracker which estimates the conditional distribution of every dialog state given the current user input and the dialog history, P(y^u_t | x^u_t, X_{<t}, Y_{<t}).

3.2 Representation

We adopt a hierarchical and recursive semantic representation for user dialog states and system dialog acts. Every meaning is rooted at either a user or a system node to distinguish between the two classes. Non-terminals of the representation include domains, user verbs, system actions, slots, and operators. A domain is a group of activities such as creation and deletion of calendar events. A user verb represents the predicate of the user intent sharable across domains, such as create, delete, update, and find. A system action represents a type of system dialog act in response to a user intent. For example, the system could prompt for a slot value; inform the user about the information they asked for; and confirm if an intent or slot is interpreted correctly. Nested properties (e.g., time range) are represented as a hierarchy of slot-operator-argument triples, where the argument can be either a sub-slot, a terminal value node or a special reference node. The value node accepts either a categorical label (e.g., day of week) or an open value (e.g., content of a text message). The reference node allows a whole intent to be attached as a slot value, enabling the construction of cross-domain use cases (e.g., Turn 5 of Table 1). Meanwhile, co-reference to single slots is directly achieved by subtree copying (e.g., Turn 3 of Table 1).


Finally, conjunction is implicitly supported by allowing a set of arguments to be attached to the same slot. Overall, the representation presented above focuses on ungrounded utterance semantics to decouple understanding from execution. By incorporating domain-specific logic, the representation can be mapped to executable programs at a later stage.
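To make the dotted notation of Table 1 and Appendix A concrete, here is a minimal Python sketch (ours, not the authors' code; the Node class and its methods are illustrative) that assembles dotted paths into a rooted tree and prints it with one node per line:

from collections import OrderedDict

class Node:
    """A node in a dialog-state tree; children are keyed by their label."""
    def __init__(self, label):
        self.label = label
        self.children = OrderedDict()

    def add_path(self, labels):
        """Insert a dotted path (already split on '.') below this node."""
        if labels:
            child = self.children.setdefault(labels[0], Node(labels[0]))
            child.add_path(labels[1:])

    def pretty(self, indent=0):
        """Render the subtree, one node per line, indented by depth."""
        lines = ["  " * indent + self.label]
        for child in self.children.values():
            lines.extend(child.pretty(indent + 1))
        return lines

# The user state after Turn 1 of Table 1.
root = Node("user")
root.add_path("flight.book.object.equals.destination.equals.location.equals.Paris".split("."))
print("\n".join(root.pretty()))

Unlike the paper's dotted notation, which chains single-child edges on one line, this sketch simply prints every node on its own line.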

4 Data Elicitation

4.1 Overview

We sidestep the need for collecting real human-system conversations and annotating them with complex semantics by adopting a reverse data elicitation approach (Shah et al., 2018; Rastogi et al., 2019). We model the generative process P(X, Y) = P(Y) P(X | Y), where representations of conversation flows (Y) are first rendered by a dialog simulator, and then realised into natural dialog (X) by annotators. Two central aspects directly impact the quality of the resulting data: (1) the dialog simulator, which controls the coverage and naturalness of the conversations; and (2) the annotation instructions and quality control mechanism, which ensure that X and Y are semantically aligned after annotation.

4.2 Simulator

The most common approach to simulating a conversation flow is agenda-based (Schatzmann et al., 2007; Li et al., 2016; Shah et al., 2018; Rastogi et al., 2019). In this approach, a new goal is first defined in the form of slot-value pairs describing the user's requests and constraints, and an agenda is constructed by decomposing the user goal into a sequence of user actions. Although the approach ensures the user behaves in a goal-oriented manner, it constrains the output space with pre-defined agendas, which are hard to craft for complex user goals (Shi et al., 2019).

Arguably, a more natural solution to dialog simulation for a complex output space is a fully generative method. This matches the behavior of a real user, who may only have an initial goal at the start of the conversation, while the final dialog state cannot be foreseen in advance. The whole conversation can be defined generatively as follows:

    P(Y) = P(y^u_0) ∏_{t=0}^{n} P(y^s_t | y^u_t) ∏_{t=1}^{n} P(y^u_t | y^s_{<t}, y^u_{<t})    (1)

where Y is the conversation flow, y^u_t is the user dialog state at turn t and y^s_t the system dialog act. The decomposed probability of P(Y) captures the functional space of dialog state transitions with three components: 1) a module generating the initial user goal P(y^u_0), 2) a module generating the system act P(y^s_t | y^u_t), and 3) a module for user state update based on the dialog history P(y^u_t | y^s_{<t}, y^u_{<t}). The conversation terminates at time step n, which must be a finishing state (system success or failure).

Initial intent module P(y^u_0)  The dialog state y^u_0 representing the initial user goal is generated with a probabilistic tree substitution grammar (PTSG; Cohn et al., 2010) based on our semantic formalism. Non-terminal symbols in the PTSG can rewrite entire tree fragments to encode non-local context such as hierarchical and co-occurring slot combinations. As illustrated in the example below, the algorithm generates dialog states top-down by recursively substituting non-terminals (marked in green) with subtrees.

$createEvent → user.calendarEvent.create.$newEvent

$newEvent → object.equals
              .$attendees
              .$location

This example generates a calendar-event creation intent that contains two slots, attendees and location. In a statistical approach, the sampling probability of each production rule could be learned from dialog data. For the purpose of bootstrapping a new model, any prior distribution can be applied.

Response module P(y^s_t | y^u_t)  At turn t, the system act y^s_t is generated based on the current user dialog state y^u_t and domain-specific logic. We represent the probability P(y^s_t | y^u_t) with a probabilistic tree transformation grammar, which captures how an output tree (y^s_t) is generated from the input (y^u_t) through a mixture of generation and a copy mechanism. As shown in the example below, every production rule in the grammar is in the form A → B, where A is an input pattern that could contain both observed nodes and unobserved nodes (marked in red), and B is an output pattern to generate.


user.calendarEvent.create.object.equals
  .-dateTimeRange
→ system.prompt.calendarEvent.create.object.equals.dateTimeRange

Given a user dialog state, the simulator looks up production rules which result in a match of pattern A, and then derives a system act based on the pattern B. Like in the PTSG, probabilities of transformations can be either learned from data or specified a priori.
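The sketch below shows one way a single A → B response rule could be fired against a user state. It treats the state as a chain of dotted paths and uses the '-slot' part of the pattern as an absence test; the helper functions, the example state and the slot names are illustrative assumptions, not the paper's implementation:

def paths(state):
    """All node paths of a chain-shaped dotted state string."""
    tokens = state.split(".")
    return {".".join(tokens[:i]) for i in range(1, len(tokens) + 1)}

def apply_rule(user_state, required, absent, output):
    """Fire an A -> B rule: every 'required' path must be present and every
    'absent' path (the '-slot' part of the pattern) must be missing."""
    present = paths(user_state)
    if all(r in present for r in required) and all(a not in present for a in absent):
        return output
    return None

# Hypothetical user state: an event creation with a title but no time range.
state = "user.calendarEvent.create.object.equals.title.equals.Standup"
act = apply_rule(
    state,
    required=["user.calendarEvent.create.object.equals"],
    absent=["user.calendarEvent.create.object.equals.dateTimeRange"],
    output="system.prompt.calendarEvent.create.object.equals.dateTimeRange",
)
print(act)   # the simulator would prompt for the missing dateTimeRange slot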

State update module P(y^u_t | y^s_{<t}, y^u_{<t})  The generation of the updated dialog state is dependent on the dialog history. While the full space of y^u_t is unbounded, we focus on simulating three common types of updates, where a user introduces a new goal, continues with the previous goal y^u_{t-1}, or resumes an earlier unfinished goal (see Turns 2-4 of Table 1 for examples, respectively). To model dialog history, we introduce an empty stack when the conversation starts. A dialog state and the rendered system act are pushed onto the stack upon generation, dynamically updated during the conversation, and popped from the stack upon task completion. Therefore, the top of the stack always represents the most recent unfinished task y^{top,u}_{t-1} and the corresponding system act y^{top,s}_{t-1}.

We consider the top elements of the stack as the effective dialog history and use it to generate the next dialog state y^u_t. The generation interface is modeled with a similar tree transformation grammar, but every production rule has two inputs, in the form A, B → C:

A:  user.calendarEvent.create.object.equals
      .-dateTimeRange

B:  system.prompt.calendarEvent.create.object.equals.dateTimeRange

→ C:  user.calendarEvent.create.object.equals
        .$dateTimeRange

where A specifies a matching pattern of the user goal y^{top,u}_{t-1}, B is a matching pattern of the system act y^{top,s}_{t-1}, and C is an output pattern that represents how the updated dialog state is obtained through a mixture of grammar expansion (marked in green) and a copy mechanism from either of the two sources (marked in yellow).

4.3 Annotation

Following Shah et al. (2018) and Rastogi et al. (2019), every grammar production in the simulator is paired with a template whose slots are synchronously expanded. As a result, each dialog state or system act is associated with a template utterance. The purpose is to make each conversation flow minimally understandable, so that annotators can turn it into a natural conversation.

Instructions  Annotators generate a conversation based on the given templated utterances. The task proceeds turn by turn. For each turn, we instruct annotators to convey exactly the same intents and slots in each user or system utterance, in order to make sure the obtained utterance agrees with the programmatically generated semantic annotation. The set of open values (specially marked in brackets, such as event titles and text messages) must be preserved too. Beyond these restrictions, we give annotators the freedom to paraphrase, compress or expand the utterance in the given dialog context, to make conversations as natural as possible. While system utterances are guided to be expressed in a professional tone, we encourage annotators to introduce adequate syntactic variation and chit-chat in user utterances as long as they do not change the intent to be delivered.

Quality control  We enforce two quality control mechanisms, before and after the dialog rendering task. Before the task, we ask annotators to provide a binary label for each conversation flow, indicating whether the conversation contains any non-realistic interactions; this lets us filter out low-quality data produced by the simulator. After the task, we ask a different batch of annotators to evaluate whether each human-generated utterance preserves the meaning of the templated utterance; any conversation that fails this check is removed.

4.4 Statistics

The resulting TreeDST dataset consists of 10 domains: flight, train, hotel, restaurant, taxi, calendar, message, reminder, navigation, and phone, exhibiting nested properties for people, time, and location that are shared across domains.


                           DSTC2    WOZ2.0  FRAMES  M2M     MultiWOZ  SGD      TreeDST
Representation             Flat     Flat    Flat    Flat    Flat      Flat     Hierarchical
#Dialogs                   1,612    600     1,369   1,500   8,438     16,142   27,280
Total #turns               23,354   4,472   19,986  14,796  113,556   329,964  167,507
Avg. #turns/dialog         14.5     7.45    14.60   9.86    13.46     20.44    7.14
Avg. #tokens/utterance     8.54     11.24   11.24   8.24    13.13     9.75     7.59
#slots                     8        4       61      13      24        214      287
#values                    212      99      3,871   138     4,510     14,139   20,612
#multi-domain dialog       -        -       -       -       7,032     16,142   14,999
#compositional utterance   -        -       -       -       -         -        10,133
#cross-turn co-reference   -        -       -       -       -         -        9,609

Table 2: Comparison of TreeDST with pertinent datasets for task-oriented dialogue.

Table 2 shows a comparison of TreeDST with the following pertinent datasets: DSTC2 (Henderson et al., 2014a), WOZ2.0 (Wen et al., 2017), FRAMES (El Asri et al., 2017), M2M (Shah et al., 2018), MultiWOZ (Budzianowski et al., 2018) and SGD (Rastogi et al., 2019). Similar to our work, both M2M and SGD use a simulator to generate conversation flows; and both MultiWOZ and SGD contain multi-domain conversations. The difference is that all the previous work represents dialog states as flat slot-value pairs, which are not able to capture complex relations such as compositional intents.

5 Dialog State Tracking

The objective in the conversational semantic parsing task is to predict the updated dialog state at each turn given the current user input and the dialog history, P(y^u_t | x^u_t, X_{<t}, Y_{<t}). We tackle the problem with an encoder-decoder model: at turn t, the model encodes the current user utterance x^u_t and the dialog history, conditioned on which the decoder predicts the target dialog state y^u_t. We call this model the Tree Encoder-Decoder, or TED.

Dialog history  When a dialog session involves task switching, there will be multiple, non-accumulable dialog states in the conversation history. Since it is expensive to encode the entire history X_{<t}, Y_{<t}, whose size grows with the conversation, we compute a fixed-size history representation derived from the previous conversation flow Y_{<t}. Specifically, we reuse the notion of a stack to store past dialog states, and the top of the stack y^{top,u}_{t-1} tracks the most recent uncompleted task. The dialog history is then represented with the previous dialog state y^u_{t-1}, the dialog state on top of the stack y^{top,u}_{t-1}, and the previous system dialog act y^s_{t-1}. We merge the two dialog states y^u_{t-1} and y^{top,u}_{t-1} into a single tree Y^u_{t-1} for featurization.

Encoding  We adopt three encoders for the utterance x^u_t, the system act y^s_{t-1} and the dialog state Y^u_{t-1} respectively. For the user utterance x^u_t, a bidirectional LSTM encoder is used to convert the word sequence into an embedding list H^x = [h^x_1, h^x_2, ..., h^x_n], where n is the length of the word sequence. For both the previous system act y^s_{t-1} and the user state Y^u_{t-1}, we linearize them into strings through depth-first traversal (see Figure 1). Then the linearized y^s_{t-1} and Y^u_{t-1} are encoded with two separate bidirectional LSTMs. The outputs are two embedding lists: H^s = [h^s_1, h^s_2, ..., h^s_m], where m is the length of the linearized system act sequence, and H^u = [h^u_1, h^u_2, ..., h^u_l], where l is the length of the linearized dialog state sequence. The final outputs of encoding are H^x, H^s and H^u.

Decoding  After encoding, the next dialog state y^u_t is generated with an LSTM decoder as a linearized string which captures the depth-first traversal of the target graph (see Figure 1).

At decoding step i, the decoder feeds in the embedding of the previously generated token y^u_{t,i-1} and updates the decoder LSTM state to g_i:

    g_i = LSTM(g_{i-1}, y^u_{t,i-1})    (2)

An attention vector is computed between the state g_i and each of the three encoder outputs H^x, H^s and H^u. For each encoder memory H, the computation is defined as follows:

    a_{i,j} = attn(g_i, H)
    w_{i,j} = softmax(a_{i,j})
    h_i = ∑_{j=1}^{n} w_{i,j} h_j    (3)

where attn represents the feed-forward attention defined in Bahdanau et al. (2015) and the softmax is taken over index j.


Figure 1: An overview of the TED encoder-decoder architecture.

By applying the attention mechanism to all three sources, we get three attention vectors h^x_i, h^s_i, and h^u_i. These vectors are concatenated together with the state g_i to form a feature vector f_i, which is used to compute the probability of the next token through a mixture of generation and copy mechanism (Gu et al., 2016):

    λ = σ(W_i f_i + b_i)
    P_gen = softmax(W_v f_i + b_v)
    P_copy = softmax([a_i; c_i; e_i])
    P(y^u_{t,i}) = λ P_gen + (1 - λ) P_copy    (4)

where the W and b are all model parameters. λ is a soft gate controlling the proportion of generation and copy. P_gen is computed with a softmax over the generation vocabulary. a, c and e denote the attention logits computed for the three encoders. Since there are three input sources, we concatenate all logits and normalize them to compute the copy distribution P_copy. The model is optimised on the log-likelihood of the output distribution P(y^u_{t,i}). An overview of the model is shown in Figure 1.

5.1 Parent Pointer: a Faster Graph Decoder

One observation about the standard decoder is that it has to predict long strings with closing brackets to represent a tree structure in the linearization. Therefore the total number of decoding LSTM recursions is the number of tree nodes plus the number of non-terminals. We propose a modified parent pointer (PP) decoder which reduces the number of autoregressions to the number of tree nodes. This optimisation is applicable not only to our DST model, but to any decoder that treats tree decoding as sequence prediction in the spirit of Vinyals et al. (2015).

The central idea of the PP decoder is that at each decoding step, two predictions are made: one generates the next tree node, and the other selects its parent from the existing tree nodes. Eventually y^u_t can be constructed from a list of tree nodes n^u_t and a list of parent relations r^u_t. More specifically, at time step i, the decoder takes in the embeddings of the previous node n^u_{t,i-1} and its parent r^u_{t,i-1} to generate the hidden state g_i:

    g_i = LSTM(g_{i-1}, n^u_{t,i-1}, r^u_{t,i-1})    (5)

This state is then used as the input for two prediction layers. The first layer predicts the next-node probability P(n^u_{t,i}) with Equations 3 to 4, and the second layer selects the parent of the node by attending g_i to the previously generated nodes, which are represented with the decoder memory G_{i-1} = [g_1, ..., g_{i-1}]:

    f_{i,j} = attn(g_i, G_{i-1})
    P(r^u_{t,i}) = softmax(f_{i,j})    (6)

The model is optimised on the average negative log-likelihood of the distributions P(n^u_{t,i}) and P(r^u_{t,i}).
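To illustrate why the PP decoder only needs one autoregressive step per tree node, the sketch below reassembles a tree from a predicted node list and parent indices (our own minimal reconstruction, with index -1 standing for the root):

def build_tree(nodes, parents):
    """Rebuild the tree predicted by the PP decoder.
    nodes[i]  : label of the i-th generated node
    parents[i]: index of its parent among earlier nodes (-1 for the root)."""
    children = [[] for _ in nodes]
    root = 0
    for i, parent in enumerate(parents):
        if parent < 0:
            root = i
        else:
            children[parent].append(i)

    def render(i, depth=0):
        lines = ["  " * depth + nodes[i]]
        for c in children[i]:
            lines.extend(render(c, depth + 1))
        return lines

    return "\n".join(render(root))

# 'destination' and 'source' share the parent 'equals' (index 4).
nodes   = ["user", "flight", "book", "object", "equals", "destination", "source"]
parents = [-1, 0, 1, 2, 3, 4, 4]
print(build_tree(nodes, parents))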

6 Experiments

Setup  We split the TreeDST data into train (19,808), test (3,739) and development (3,733) sets. For evaluation, we measure turn-level dialog state exact match accuracy averaged over all turns in the test set. We evaluate the proposed model with its "vanilla" decoder (TED-VANILLA) and its parent-pointer variant (TED-PP). In both cases, the utterance encoder has 2 layers of 500 dimensions; the system act encoder and dialog state encoder have 2 layers of 200 dimensions; and the decoder has 2 LSTM layers of 500 dimensions. The dimensions of word and tree node embeddings are 200 and 50 respectively. Training uses a batch size of 50 and the Adam optimizer (Kingma and Ba, 2015). Validation is performed every 2 epochs and training stops when the validation error does not decrease in four consecutive evaluations. The hyper-parameters were selected empirically based on an additional dataset that does not overlap with TreeDST.
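For reference, the hyper-parameters reported above collected into a single configuration sketch (the dictionary layout and key names are ours, not the authors' training code):

TED_CONFIG = {
    "utterance_encoder":    {"layers": 2, "hidden_size": 500},
    "system_act_encoder":   {"layers": 2, "hidden_size": 200},
    "dialog_state_encoder": {"layers": 2, "hidden_size": 200},
    "decoder":              {"lstm_layers": 2, "hidden_size": 500},
    "word_embedding_dim": 200,
    "tree_node_embedding_dim": 50,
    "batch_size": 50,
    "optimizer": "Adam",
    "validate_every_n_epochs": 2,
    "early_stopping_patience": 4,   # consecutive validations without improvement
}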



Decoders                    Accuracy
TED-VANILLA                 0.622
TED-PP                      0.622
TED-FLAT                    0.535
COMER (Ren et al., 2019)    0.509
TRADE (Wu et al., 2019)     0.513

Table 3: Results on the TreeDST test set.

Baselines  In order to factor out the contribution of meaning representations from model changes in experiments, we additionally derive a version of our dataset where all meaning representations are flattened into slot-value pairs (details are described in the next paragraph). We then introduce a baseline, TED-FLAT, by training the same model as TED-VANILLA on the flattened dataset.

We also introduce as baselines two state-of-the-art slot-filling DST models based on encoder-decoders: COMER (Ren et al., 2019), which encodes the previous system response transcription and the previous user dialog state and decodes slot values; and TRADE (Wu et al., 2019), which encodes all utterances in the history. Since both TED-FLAT and the two baselines are trained with flattened slot-value representations, we can compare the various models in this setup.

TreeDST flattening  To flatten TreeDST, we collapse each path from the domain to the leaf nodes into a single slot. Verb nodes in the path are excluded to avoid slot explosion. Take the following tree as an example:

user.flight.book.object.equals
  .source.equals.location.equals.London
  .departureDateTime.equals
    .date.equals.definedValue.equals.Tomorrow
    .time.equals.hour.equals.5

Three slot-value pairs can be extracted:

(flight+object+source+location, London)
(flight+object+departureDateTime+date+definedValue, Tomorrow)
(flight+object+departureDateTime+time+hour, 5)

The operator equals is not shown in the slot names to make the names more concise.
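A sketch of this flattening over a nested-dict encoding of the example tree; the verb list and the dict encoding are illustrative assumptions (the 'user' root is omitted here, since it does not appear in the slot names):

VERBS = {"book", "create", "delete", "update", "find", "checkExistence"}

def flatten(tree, prefix=()):
    """Collapse root-to-leaf paths into (slot, value) pairs, skipping
    verb nodes and the 'equals' operator, as described above."""
    pairs = []
    for label, children in tree.items():
        if not children:                          # leaf node holds the value
            pairs.append(("+".join(prefix), label))
        else:
            keep = label not in VERBS and label != "equals"
            pairs.extend(flatten(children, prefix + ((label,) if keep else ())))
    return pairs

state = {
    "flight": {"book": {"object": {"equals": {
        "source": {"equals": {"location": {"equals": {"London": {}}}}},
        "departureDateTime": {"equals": {
            "date": {"equals": {"definedValue": {"equals": {"Tomorrow": {}}}}},
            "time": {"equals": {"hour": {"equals": {"5": {}}}}},
        }},
    }}}},
}
print(flatten(state))
# [('flight+object+source+location', 'London'),
#  ('flight+object+departureDateTime+date+definedValue', 'Tomorrow'),
#  ('flight+object+departureDateTime+time+hour', '5')]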

Results  The results are shown in Table 3. Overall, both the TED-VANILLA and TED-PP models bring 20% relative improvement over existing slot-filling-based DST models.

Pattern                                Accuracy
All turns                              0.647
Turns with intent change               0.552
Turns with compositional utterances    0.602
Turns with multi-intent utterances     0.478

Table 4: Results on the TreeDST development set, broken down by dialog phenomena.

Figure 2: Validation exact match accuracy by turn ID (vanilla model compared with an oracle that encodes the gold previous dialog state).

By further factoring out the impact of representation and model differences, we see that representation plays a more important role: the TED-FLAT variant, which differs only in that it was trained on flattened parses, is clearly outperformed by TED-VANILLA and TED-PP. We conclude that even if dialog states can be flattened into slot-value pairs, it is still more favorable to use a compact, hierarchical meaning representation. The advantage of the representation is that it improves knowledge sharing across different domains (e.g., message and phone), verbs (e.g., create and update), and dialog participators (user and system). The second set of comparisons is among different modeling approaches using the same flat meaning representation. TED-FLAT slightly outperforms COMER and TRADE. The major difference is that our model encodes both past user and system representations, while the other models used past transcriptions. We believe the gain of encoding representations is that they are unambiguous, and the encoding helps knowledge sharing between the user and the system.

The vanilla and PP decoders achieve the same exact match accuracy. The average training time per epoch for PP is 1,794 seconds compared to 2,021 for vanilla, i.e. PP leads to a 10% reduction in decoding time without any effect on accuracy. We believe that the efficiency of PP can be further optimized by parallelizing the two prediction layers of nodes and parents.

Analysis  First, Table 4 shows a breakdown of the TED-PP model performance by dialog behavior on the development set. While the model does fairly well on compositional utterances, states with intent switching and multiple intents are harder to predict.


We believe the reason is that the prediction of intent switching requires task reference resolution within the model, while multi-intent utterances tend to have more complex trees.

Second, Figure 2 shows the vanilla model results by turn index on the development set (black curve). This shows the impact of error propagation, as the model predicts the target dialog state based on past representations. To better understand the issue, we compare to an oracle model which always uses the gold previous dialog state for encoding (red curve). Vanilla model accuracy decreases with turn index, resulting in a gap with the oracle model. The error propagation problem can be alleviated by providing a more complete dialog history to the encoder for error recovery (Henderson et al., 2014b), which we consider as future work.

Finally, we would like to point out a limitation of our approach in tracking dialog history with a stack-based memory. While the stack is capable of memorizing and returning to a previously unfinished task, there are patterns which cannot be represented, such as switching between two ongoing tasks. We aim to explore a richer data structure for dialog history in the future.

7 Conclusion

This work reformulates dialog state tracking as a conversational semantic parsing task to overcome the limitations of slot filling. Dialog states are represented as rooted relational graphs to encode compositionality and encourage knowledge sharing across different domains, verbs, slot types and dialog participators. We demonstrated how a dialog dataset with structured labels for both user and system utterances can be collected with the aid of a generative dialog simulator. We then proposed a conversational semantic parser that performs DST with an encoder-decoder model and a stack-based memory. A parent-pointer decoder is further proposed to speed up tree prediction. Experimental results show that our DST solution outperforms slot-filling-based trackers by a large margin.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.

Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Ben Goodrich, Daniel Duckworth, Semih Yavuz, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4506–4517.

Trevor Cohn, Phil Blunsom, and Sharon Goldwater. 2010. Inducing tree-substitution grammars. Journal of Machine Learning Research, 11(Nov):3053–3096.

Paul A. Crook and Oliver Lemon. 2010. Representing uncertainty about complex user goals in statistical dialogue systems. In Proceedings of the SIGDIAL 2010 Conference, pages 209–212.

Layla El Asri, Hannes Schulz, Shikhar Kr Sarma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 207–219.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640.

Sonal Gupta, Rushin Shah, Mrinal Mohit, Anuj Kumar, and Mike Lewis. 2018. Semantic parsing for task oriented dialog using hierarchical representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.


Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014a. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272, Philadelphia, PA, U.S.A. Association for Computational Linguistics.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014b. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 292–299.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

Thomas Kollar, Danielle Berry, Lauren Stuart, Karolina Owczarzak, Tagyoung Chung, Lambert Mathias, Michael Kayser, Branrd Snow, and Spyros Matsoukas. 2018. The Alexa meaning representation language. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 177–184.

Hwaran Lee, Jinsik Lee, and Tae-Yoon Kim. 2019. SUMBT: Slot-utterance matching for universal and scalable belief tracking. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5478–5483.

Sungjin Lee and Amanda Stent. 2016. Task lineages: Dialog state tracking for flexible interaction. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 11–21.

Xiujun Li, Zachary C. Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788.

Nikola Mrkšić and Ivan Vulić. 2018. Fully statistical neural belief tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 108–113.

Denis Peskov, Nancy Clarke, Jason Krone, Brigi Fodor, Yi Zhang, Adel Youssef, and Mona Diab. 2019. Multi-domain goal-oriented dialogues (MultiDoGO): Strategies toward curating and annotating large scale dialogue data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4518–4528.

Filip Radlinski, Krisztian Balog, Bill Byrne, and Karthik Krishnamoorthi. 2019. Coached conversational preference elicitation: A case study in understanding movie preferences. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 353–360.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.

Liliang Ren, Jianmo Ni, and Julian McAuley. 2019. Scalable and accurate dialogue state tracking via hierarchical sequence generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1876–1885.

Alexander Rudnicky and Wei Xu. 1999. An agenda-based dialog management architecture for spoken language systems. In IEEE Automatic Speech Recognition and Understanding Workshop.

Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. 2007. Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 149–152. Association for Computational Linguistics.

Pararth Shah, Dilek Hakkani-Tür, Gokhan Tur, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871.

Weiyan Shi, Kun Qian, Xuewei Wang, and Zhou Yu. 2019. How to build user simulators to train RL-based dialog systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1990–2000.


Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Proceedings of NIPS 2015.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers.

Jason Williams, Antoine Raux, and Matthew Henderson. 2016. The dialog state tracking challenge series: A review. Dialogue & Discourse, 7(3):4–33.

Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 808–819.

Puyang Xu and Qi Hu. 2018. An end-to-end approach for handling unknown slot values in dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1448–1457.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921.

A Appendix. Dotted Tree

A dotted tree format is used throughout the paper to reduce drawing space and the mental load on readers. Dots represent edges between two tree nodes. When a node has multiple children attached to it, indentation is applied to reveal the hierarchy. The following is the dotted format for the tree in Figure 3.

user.flight.book.object.equals
  .source.equals.location.equals.London
  .destination.equals.location.equals.Paris
  .departureDateTime.equals
    .date.equals.definedValue.equals.Tomorrow
    .time.equals
      .hour.equals.10
      .meridiem.equals.AM

Figure 3: A tree representing the user intent "Book a ticket from London to Paris tomorrow 10 AM".