Value-Agnostic Conversational Semantic Parsing · 2021. 6. 14. · Value-Agnostic Conversational Semantic Parsing Emmanouil Antonios Platanios, Adam Pauls, Subhro Roy, Yuchen Zhang,

Value-Agnostic Conversational Semantic Parsing

Emmanouil Antonios Platanios, Adam Pauls, Subhro Roy, Yuchen Zhang, Alex Kyte, Alan Guo,Sam Thomson, Jayant Krishnamurthy, Jason Wolfe, Jacob Andreas, Dan Klein

Microsoft Semantic [email protected]

AbstractConversational semantic parsers map user ut-terances to executable programs given dia-logue histories composed of previous utter-ances, programs, and system responses. Ex-isting parsers typically condition on rich repre-sentations of history that include the completeset of values and computations previously dis-cussed. We propose a model that abstractsover values to focus prediction on type- andfunction-level context. This approach providesa compact encoding of dialogue histories andpredicted programs, improving generalizationand computational efficiency. Our model in-corporates several other components, includ-ing an atomic span copy operation and struc-tural enforcement of well-formedness con-straints on predicted programs, that are par-ticularly advantageous in the low-data regime.Trained on the SMCALFLOW and TREEDSTdatasets, our model outperforms prior workby 7.3% and 10.6% respectively in terms ofabsolute accuracy. Trained on only a thou-sand examples from each dataset, it outper-forms strong baselines by 12.4% and 6.4%.These results indicate that simple representa-tions are key to effective generalization in con-versational semantic parsing.

1 Introduction

Conversational semantic parsers, which translatenatural language utterances into executable pro-grams while incorporating conversational context,play an increasingly central role in systems forinteractive data analysis (Yu et al., 2019), instruc-tion following (Guu et al., 2017), and task-orienteddialogue (Zettlemoyer and Collins, 2009). An ex-ample of this task is shown in Figure 1. Typicalmodels are based on an autoregressive sequenceprediction approach, in which a detailed represen-tation of the dialogue history is concatenated to theinput sequence, and predictors condition on this se-quence and all previously generated components of

the output (Suhr et al., 2018). While this approachcan capture arbitrary dependencies between inputsand outputs, it comes at the cost of sample- andcomputational inefficiency.

We propose a new “value-agnostic” approach tocontextual semantic parsing driven by type-basedrepresentations of the dialogue history and function-based representations of the generated programs.Types and functions have long served as a founda-tion for formal reasoning about programs, but theiruse in neural semantic parsing has been limited,e.g., to constraining the hypothesis space (Krishna-murthy et al., 2017), guiding data augmentation (Jiaand Liang, 2016), and coarsening in coarse-to-finemodels (Dong and Lapata, 2018). We show thatrepresenting conversation histories and partial pro-grams via the types and functions they contain en-ables fast, accurate, and sample-efficient contextualsemantic parsing. We propose a neural encoder–decoder contextual semantic parsing model which,in contrast to prior work:

1. uses a compact yet informative representationof discourse context in the encoder that con-siders only the types of salient entities thatwere predicted by the model in previous turnsor that appeared in the execution results of thepredicted programs, and

2. conditions the decoder state on the sequenceof function invocations so far, without con-ditioning on any concrete values passed asarguments to the functions.

Our model substantially improves upon the bestpublished results on the SMCALFLOW (SemanticMachines et al., 2020) and TREEDST (Cheng et al.,2020) conversational semantic parsing datasets, im-proving model performance by 7.3% and 10.6%, re-spectively, in terms of absolute accuracy. In furtherexperiments aimed at quantifying sample efficiency,

ENTITY PROPOSERS

Number(2)Month.May

Propose entities from the current user utterance.

PREVIOUS PROGRAM EXTRACTED DIALOGUE HISTORY TYPES

[0] Constraint[Event]()[1] Constraint[Any]()[2] like(value = "shopping")[3] Time(hour = 2, meridiem = PM)[4] Constraint[Event](subject = [2], start = [3])[5] revise(oldLoc = [0], rootLoc = [1], new = [4])

Function Argument

Invocation Copy

EntityConstant

Reference(to the previous function invocation)

Value

PREDICTED PROGRAM

LINEARIZED REPRESENTATION

Can you delete my eventcalled holiday shopping ?

PREVIOUS USER UTTERANCE

I can’t find an eventwith that name.

PREVIOUS AGENT UTTERANCE

CURRENT USER UTTERANCE

Oh, it’s just called shopping.It may be at 2.

parsing

parsing

execution

delete(find(

Constraint[Event](subject = like("holiday shopping")

))

)

revise(oldLoc = Constraint[Event](),rootLoc = Constraint[Any](),new = Constraint[Event](

subject = like("shopping"),start = Time(

hour = 2, meridiem = PM)

))

Unit

Constraint[String]Constraint[Event]Event

String

EventNotFoundError

Set of salient types extractedfrom the dialogue history andused by the parser as acompact representation ofthe history to condition on.

The last type comes from theprogram execution results.

Representation predicted by the proposed model.

PARSER

Figure 1: Illustration of the conversational semantic parsing problem that we focus on and the representations thatwe use. The previous turn user utterance and the previous program are shown in blue on the top. The dialoguehistory representation extracted using our approach is shown on the top right. The current turn user utterance isshown in red on the bottom left. The current utterance, the set of proposed entities, and the extracted dialoguehistory representation form the input to our parser. Given this input, the parser predicts a program that is shown onthe bottom right (in red rectangles).

it improves accuracy by 12.4% and 6.4% respec-tively when trained on only a thousand examplesfrom each dataset. Our model is also effective atnon-contextual semantic parsing, matching state-of-the-art results on the JOBS, GEOQUERY, and ATIS

datasets (Dong and Lapata, 2016). This is achievedwhile also reducing the test time computationalcost by a factor of 10 (from 80ms per utterancedown to 8ms when running on the same machine;more details are provided in Appendix H), whencompared to our fastest baseline, which makes itusable as part of a real-time conversational system.

One conclusion from these experiments is thatmost semantic parses have structures that dependonly weakly on the values that appear in the dia-logue history or in the programs themselves. Ourexperiments find that hiding values alone resultsin a 2.6% accuracy improvement in the low-dataregime. By treating types and functions, rather thanvalues, as the main ingredients in learned represen-tations for semantic parsing, we improve modelaccuracy and sample efficiency across a diverse setof language understanding problems, while alsosignificantly reducing computational costs.

2 Proposed Model

Our goal is to map natural language utterances toprograms while incorporating context from dia-logue histories (i.e., past utterances and their asso-

ciated programs and execution results). We modela program as a sequenceo of function invocations,each consisting of a function and zero or moreargument values, as illustrated at the lower rightof Figure 1. The argument values can be eitherliteral values or references to results of previousfunction invocations. The ability to reference pre-vious elements of the sequence, sometimes calleda target-side copy, allows us to construct programsthat involve re-entrancies. Owing to this referen-tial structure, a program can be equivalently repre-sented as a directed acyclic graph (see e.g., Joneset al., 2012; Zhang et al., 2019).

We propose a Transformer-based (Vaswaniet al., 2017) encoder–decoder model that predictsprograms by generating function invocations se-quentially, where each invocation can draw its argu-ments from an inventory of values (§2.5)—possiblycopied from the utterance—and the results of pre-vious function invocations in the current program.The encoder (§2.2) transforms a natural languageutterance and a dialogue history to a continuousrepresentation. Subsequently, the decoder (§2.3)uses this representation to define an autoregressivedistribution over function invocation sequences andchooses a high-probability sequence by performingbeam search. As our experiments (§3) will show,a naıve encoding of the complete dialogue historyand program results in poor model accuracy.

CURRENT PROGRAM with revisionrevise(

oldLoc = Constraint[Event](),rootLoc = RoleConstraint(start),new = Time(hour = 3, meridiem = PM)

)

PREVIOUS PROGRAMdelete(find(

Constraint[Event](subject = like("holiday shopping"),start = Time(hour = 2, meridiem = PM),end = Time(hour = 5, meridiem = PM)

)))

CURRENT PROGRAM without revisiondelete(find(

Constraint[Event](subject = like("holiday shopping"),start = Time(hour = 3, meridiem = PM),end = Time(hour = 5, meridiem = PM)

)))

"It actually starts at 3pm."

Contains information thatis not mentioned in thecurrent utterance

Only contains informationthat is mentioned in thecurrent utterance

Figure 2: Illustration of the revise meta-computationoperator (§2.1) used in our program representations.This operator can remove the need to copy programfragments from the dialogue history.

2.1 PreliminariesOur approach assumes that programs have type an-notations on all values and function calls, similar tothe setting of Krishnamurthy et al. (2017).1 Further-more, we assume that program prediction is localin that it does not require program fragments to becopied from the dialogue history (but may still de-pend on history in other ways). Several formalisms,including the typed references of Zettlemoyer andCollins (2009) and the meta-computation opera-tors of Semantic Machines et al. (2020), make itpossible to produce local program annotations evenfor dialogues like the one depicted in Figure 2,which reuse past computations. We transformedthe datasets in our experiments to use such meta-computation operators (see Appendix C).

We also optionally make use of entity proposers,similar to Krishnamurthy et al. (2017), which an-notate spans from the current utterance with typedvalues. For example, the span “one” in “Changeit to one” might be annotated with the value 1

of type Number. These values are scored by thedecoder along with other values that it considers(§2.5) when predicting argument values for func-tion invocations. Using entity proposers aims to

1This requirement can be trivially satisfied by assigningall expressions the same type, but in practice defining a set oftype declarations for the datasets in our experiments was notdifficult (refer to Appendix C for details).

help the model generalize better to previously un-seen values that can be recognized in the utteranceusing hard-coded heuristics (e.g., regular expres-sions), auxiliary training data, or other runtime in-formation (e.g., a contact list). In our experimentswe make use of simple proposers that recognizenumbers, months, holidays, and days of the week,but one could define proposers for arbitrary values(e.g., song titles). As described in §2.5, certainvalues can also be predicted directly without theuse of an entity proposer.

2.2 Encoder

The encoder, shown in Figure 3, maps a naturallanguage utterance to a continuous representation.Like many neural sequence-to-sequence models,we produce a contextualized token representationof the utterance, Hutt ∈ RU×henc , where U is thenumber of tokens and henc is the dimensionality oftheir embeddings. We use a Transformer encoder(Vaswani et al., 2017), optionally initialized usingthe BERT pretraining scheme (Devlin et al., 2019).Next, we need to encode the dialogue history andcombine its representation with Hutt to producehistory-contextualized utterance token embeddings.

Prior work has incorporated history informationby linearizing it and treating it as part of the inpututterance (Cheng et al., 2018; Semantic Machineset al., 2020; Aghajanyan et al., 2020). While flexi-ble and easy to implement, this approach presentsa number of challenges. In complex dialogues, his-tory encodings can grow extremely long relativeto the user utterance, which: (i) increases the riskof overfitting, (ii) increases computational costs(because attentions have to be computed over longsequences), and (iii) necessitates using small batchsizes during training, making optimization difficult.

Thanks to the predictive locality of our repre-sentations (§2.1), our decoder (§2.3) never needsto retrieve values or program fragments fromthe dialogue history. Instead, context enters intoprograms primarily when programs use referringexpressions that point to past computations, orrevision expressions that modify them. Eventhough this allows us to dramatically simplifythe dialogue history representation, effectivegeneration of referring expressions still requiresknowing something about the past. For example,for the utterance “What’s next?” the model needsto determine what “What” refers to. Perhapsmore interestingly, the presence of dates in recent

DIALOGUE HISTORY TYPES

Unit

Constraint[String]Constraint[Event]Event

String

EventNotFoundError

embed

decoder

UTTERANCE ENCODER

DIALOGUE HISTORY ENCODER

USER UTTERANCE

"Oh, it's just called shopping.It may be at 2."

attentionK V Q

Figure 3: Illustration of our encoder (§2.2), using theexample of Figure 1. The utterance is processed by aTransformer-based (Vaswani et al., 2017) encoder andcombined with information extracted from the set ofdialogue history types using multi-head attention.

turns (or values that have dates, such as meetings)should make the decoder more eager to generatereferring calls that retrieve dates from the dialoguehistory; especially so if other words in the currentutterance hint that dates may be useful and yetdate values cannot be constructed directly from thecurrent utterance. Subsequent steps of the decoderwhich are triggered by these other words canproduce functions that consume the referred dates.

We thus hypothesize that it suffices to strip thedialogue history down to its constituent types, hid-ing all other information.2 Specifically, we extracta set T of types that appear in the dialogue historyup to m turns back, where m = 1 in our experi-ments.3 Our encoder then transformsHutt into a se-quence of history-contextualized embeddingsHencby allowing each token to attend over T . This ismotivated by the fact that, in many cases, dialoguehistory is important for determining the meaningof specific tokens in the utterance, rather than thewhole utterance. Specifically, we learn embeddingsT ∈ R|T |×htype for the extracted types, where htypeis the embedding size, and use the attention mecha-nism of Vaswani et al. (2017) to contextualizeHutt:

Henc ,Hutt + MHA(Hutt︸︷︷︸Queries

, T︸︷︷︸Keys

, T︸︷︷︸Values

), (1)

where “MHA” stands for multi-head attention, andeach head applies a separate linear transforma-tion to the queries, keys, and values. Intuitively,

2For the previous example, if the type List[Event] ap-peared in the history then we may infer that “What” probablyrefers to an Event.

3We experimented with different values of m and foundthat increasing it results in worse performance, presumablydue to overfitting.

[0] +( 1, 2)[1] +([0], 3)[2] +([1], 4)[3] +([2], 5)

[0] +(Number, Number)[1] +(Number, Number)[2] +(Number, Number)

Consider the following program representingthe expression 1 + 2 + 3 + 4 + 5:

While generating this invocation, thedecoder only gets to condition on thefollowing program prefix:

Argument values are masked out!

Figure 4: Illustration showing the way in which our de-coder is value-agnostic. Specifically, it shows whichpart of the generated program prefix, our decoder con-ditions on while generating programs (§2.3).

each utterance-contextualized token is further con-textualized in (1) by adding to it a mixture ofembeddings of elements in T , where the mix-ture coefficients depends only on that utterance-contextualized token. This encoder is illustratedin Figure 3. As we show in §3.1, using this mech-anism performs better than the naıve approach ofappending a set-of-types vector toHutt.

2.3 Decoder: ProgramsThe decoder uses the history-contextualized repre-sentationHenc of the current utterance to predict adistribution over the program π that correspondsto that utterance. Each successive “line” πi of πinvokes a function fi on an argument value tuple(vi1, vi2, . . . , viAi), where Ai is the number of (for-mal) arguments of fi. Applying fi to this orderedtuple results in the invocation fi(ai1 = vi1, ai2 =vi2, . . .), where (ai1, ai2, . . . , aiAi) name the for-mal arguments of fi. Each predicted value vij canbe the result of a previous function invocation, aconstant value, a value copied from the current ut-terance, or a proposed entity (§2.1), as illustrated inthe lower right corner of Figure 1. These differentargument sources are described in §2.5. Formally,the decoder defines a distribution of programs π:

p(π |Henc) =P∏i=1

p(πi | f<i,Henc), (2)

where P is the number of function invocations inthe program, and f<i , {f1, . . . , fi−1}. Addition-ally, we assume that argument values are condi-tionally independent given fi and f<i, resulting in:

p(πi | f<i) = p(fi |f<i)︸︷︷︸functionscoring

Ai∏j=1

p(vij |f<i, fi)︸︷︷︸argument value

scoring

, (3)

where we have elided the conditioning on Henc.Here, functions depend only on previous functions

FUNCTION EMBEDDER

from: CityNAME TYPE

argumentembedding

FUNCTION SIGNATURE

NAME TYPE TYPEARGUMENT ARGUMENTBook[Flight](from: City, to: City): Booking[Flight]

POOLING

function embedding

ARGUMENT EMBEDDER

Figure 5: Illustration of our function encoder (§2.4),using a simplified example function signature.

(not their argument values or results) and argumentvalues depend only on their calling function (noton one another or any of the previous argumentvalues).4 This is illustrated in Figure 4. In addi-tion to providing an important inductive bias, theseindependence assumptions allow our inference pro-cedure to efficiently score all possible function in-vocations at step i, given the ones at previous steps,at once (i.e., function and argument value assign-ments together), resulting in an efficient searchalgorithm (§2.6). Note that there is also a corre-sponding disadvantage (as in many machine trans-lation models) that a meaningful phrase in the utter-ance could be independently selected for multiplearguments, or not selected at all, but we did notencounter this issue in our experiments; we rely onthe model training to evade this problem throughthe dependence onHenc.

2.4 Decoder: Functions

In Equation 3, the sequence of functionsf1, f2, . . . in the current program is modeled by∏i p(fi |f<i,Henc). We use a standard autoregres-

sive Transformer decoder that can also attend tothe utterance encoding Henc (§2.2), as done byVaswani et al. (2017). Our decoder generates se-quences over the vocabulary of functions. Thismeans that each function fi needs an embeddingfi (used as both an input to the decoder and anoutput), which we construct compositionally.

We assume that each unique function f hasa type signature that specifies a name n, a listof type parameters {τ1, . . . , τT } (to support poly-morphism),5 a list of argument names and types((a1, t1), . . . , (aA, tA)), and a result type r. An

4We also tried defining a jointly normalized distributionover entire function invocations (Appendix A), but found thatit results in a higher training cost for no accuracy benefits.

5The type parameters could themselves be parameterized,but we ignore this here for simplicity of exposition.

example is shown in Figure 5. We encode thefunction and argument names using the utter-ance encoder of §2.2 and learn embeddings forthe types, to obtain (n, r), {τ1, . . . , τT }, and{(a1, t1), . . . , (aA, tA)}. Then, we construct anembedding for each function as follows:

a = Pool(a1 + t1, . . . ,aA + tA), (4)

f = n+ Pool(τ1, . . . , τT ) + a+ r, (5)

where “Pool” is the max-pooling operation whichis invariant to the arguments’ order.

Our main motivation for this function em-bedding mechanism is the ability to take cuesfrom the user utterance (e.g., due to a functionbeing named similarly to a word appearing in theutterance). If the functions and their argumentshave names that are semantically similar tocorresponding utterance parts, then this approachenables zero-shot generalization.6 However, thereis an additional potential benefit from parametersharing due to the compositional structure of theembeddings (see e.g., Baroni, 2020).

2.5 Decoder: Argument ValuesThis section describes the implementation of theargument predictor p(vij | f<i, fi). There are fourdifferent kinds of sources that can be used to filleach available argument slot: references to previ-ous function invocations, constants from a staticvocabulary, copies that copy string values from theutterance, and entities that come from entity pro-posers (§2.1). Many sources might propose thesame value, including multiple sources of the samekind. For example, there may be multiple spansin the utterance that produce the same string valuein a program, or an entity may be proposed that isalso available as a constant. To address this, wemarginalize over the sources of each value:

p(vij | f<i, fi)=∑

s∈S(vij)

p(vij , s |f<i, fi), (6)

where vij represents a possible value for the argu-ment named aij , and s ∈ S(vij) ranges over thepossible sources for that value. For example, giventhe utterance “Change that one to 1:30pm” and thevalue 1, the set S(1) may contain entities that cor-respond to both “one” and “1” from the utterance.

6The data may contain overloaded functions that have thesame name but different type signatures (e.g., due to optionalarguments). The overloads are given distinct identifiers f , butthey often share argument names, resulting in at least partiallyshared embeddings.

The argument scoring mechanism considers thelast-layer decoder state hidec that was used to pre-dict fi via p(fi |f<i) ∝ exp(f>i h

idec). We special-

ize this decoder state to argument aij as follows:

hi,aijdec , hidec � tanh(fi + aij), (7)

where � represents elementwise multiplication, fiis the embedding of the current function fi, aij isthe encoding of argument aij as defined in §2.4,and hdec is a projection of hdec to the necessarydimensionality. Intuitively, tanh(fi + aij) acts asa gating function over the decoder state, decidingwhat is relevant when scoring values for argumentaij . This argument-specific decoder state is thencombined with a value embedding to produce aprobability for each (sourced) value assignment:

p(v, s | f<i, fi) ∝

exp{v>(hi,adec +w

kind(s)a ) + bkind(s)

a

}, (8)

where a is the argument name aij , kind(s) ∈{REFERENCE, CONSTANT, COPY, ENTITY}, v is theembedding of (v, s) which is described next, andwka and bka are model parameters that are specific

to a and the kind of the source s.

References. References are pointers to the re-turn values of previous function invocations. Ifthe source s for the proposed value v is the resultof the kth invocation (where k < i), we take its em-bedding v to be a projection of hkdec that was usedto predict that invocation’s function and arguments.

Constants. Constants are values that are alwaysproposed, so the decoder always has the option ofgenerating them. If the source s for the proposedvalue v is a constant, we embed it by applying theutterance encoder on a string rendering of the value.The set of constants is automatically extracted fromthe training data (see Appendix B).

Copies. Copies are string values that correspondto substrings of the user utterance (e.g., personnames). String values can only enter the programthrough copying, as they are not in the set of con-stants (i.e., they cannot be “hallucinated” by themodel; see Pasupat and Liang, 2015; Nie et al.,2019). One might try to construct an approachbased on a standard token-based copy mechanism(e.g., Gu et al., 2016). However, this would al-low copying non-contiguous spans and would alsorequire marginalizing over identical tokens as op-posed to spans, resulting in more ambiguity. In-stead, we propose a mechanism that enables the

decoder to copy contiguous spans directly fromthe utterance. Its goal is to produce a score foreach of the U(U + 1)/2 possible utterance spans.Naıvely, this would result in a computational costthat is quadratic in the utterance length U , andso we instead chose a simple scoring model thatavoids it. Similar to Stern et al. (2017) and Kurib-ayashi et al. (2019), we assume that the score for aspan factorizes, and define the embedding of eachspan value as the concatenation of the contextualembeddings of the first and last tokens of the span,v = [hkstart

utt ;hkendutt ]. To compute the copy scores we

also concatenate hi,adec with itself in Equation 8.

Entities. Entities are treated the same way ascopies, except that instead of scoring all spans ofthe input, we only score spans proposed by theexternal entity proposers discussed in §2.1. Specif-ically, the proposers provide the model with a listof candidate entities that are each described by anutterance span and an associated value. The can-didates are scored using an identical mechanismto the one used for scoring copies. This meansthat, for example, the string “sept” could be linkedto the value Month.September even though thestring representations do not match perfectly.

Type Checking. When scoring argument valuesfor function fi, we know the argument types, asthey are specified in the function’s signature. Thisenables us to use a type checking mechanism thatallows the decoder to directly exclude values withmismatching types. For references, the value typescan be obtained by looking up the result types of thecorresponding function signatures. Additionally,the types are always pre-specified for constantsand entities, and copies are only supported for asubset of types (e.g., String, PersonName; seeAppendix B). The type checking mechanism setsp(vij | f<i, fi) = 0 whenever vij has a differenttype than the expected type for aij . Finally, becausecopies can correspond to multiple types, we alsoadd a type matching term to the copy score. Thisterm is defined as the inner product of the argumenttype embedding and a (learnable) linear projectionof hkstart

utt and hkendutt concatenated, where kstart and

kend denote the span start and end indices.

2.6 Decoder: SearchSimilar to other sequence-to-sequence models, weemploy beam search over the sequence of functioninvocations when decoding. However, in contrastto other models, our assumptions (§2.3) allow us to

Dataset SMCALFLOW TREEDSTV1.1 V2.0

Best Reported Result 66.5 68.2 62.2Our Model 73.8 75.3 72.8

Table 1: Test set exact match accuracy comparing ourmodel to the best reported results for SMCALFLOW(Seq2Seq model from the public leaderboard; SemanticMachines et al., 2020) and TREEDST (TED-PP model;Cheng et al., 2020). The evaluation on each datasetin prior work requires us to repeat some idiosyncrasiesthat we describe in Appendix D.

efficiently implement beam search over completefunction invocations, by leveraging the fact that:

maxπi

p(πi)=maxfi

{p(fi)

Ai∏j=1

maxvij

p(vij |fi)}, (9)

where we have omitted the dependence on f<i.This computation is parallelizable and it also allowsthe decoder to avoid choosing a function if there areno high scoring assignments for its arguments (i.e.,we are performing a kind of lookahead). This alsomeans that the paths explored during the search areshorter for our model than for models where eachstep corresponds to a single decision, allowing forsmaller beams and more efficient decoding.

3 Experiments

We first report results on SMCALFLOW (SemanticMachines et al., 2020) and TREEDST (Cheng et al.,2020), two recently released large-scale conversa-tional semantic parsing datasets. Our model makesuse of type information in the programs, so wemanually constructed a set of type declarations foreach dataset and then used a variant of the Hindley-Milner type inference algorithm (Damas and Mil-ner, 1982) to annotate programs with types. Asmentioned in §2.1, we also transformed TREEDSTto introduce meta-computation operators for ref-erences and revisions (more details can be foundin Appendix C).7 We also report results on non-conversational semantic parsing datasets in §3.2.We use the same hyperparameters across all ex-periments (see Appendix E), and we use BERT-medium (Turc et al., 2019) to initialize our encoder.

3.1 Conversational Semantic ParsingTest set results for SMCALFLOW and TREEDSTare shown in Table 1. Our model significantly out-performs the best published numbers in each case.

7The transformed datasets are available at https://github.com/microsoft/task_oriented_dialogue_as_dataflow_synthesis/tree/master/datasets.

Dataset SMCALFLOW TREEDST# Training Dialogues 1k 10k 33k 1k 10k 19k

Seq2Seq 36.8 69.8 74.5 28.2 47.9 50.3Seq2Tree 43.6 69.3 77.7 23.6 46.9 48.8Seq2Tree++ 48.0 71.9 78.2 74.8 75.4 86.9

w/o

BE

RT

Our Model 53.8 73.2 78.5 78.6 87.6 88.5Seq2Seq 44.6 64.1 67.8 28.6 40.2 47.2Seq2Tree 50.8 74.6 78.6 30.9 50.6 51.6

w/B

ER

T

Our Model 63.2 77.2 80.4 81.2 87.1 88.3

(a) Baseline comparison.

Dataset SMCALFLOW TREEDST# Training Dialogues 1k 10k 33k 1k 10k 19k

Our Model 63.2 77.2 80.4 81.2 87.1 88.3Value Dependence 60.6 76.4 79.4 79.3 86.2 86.5No Name Embedder 62.8 76.7 80.3 81.1 87.0 88.1No Types 62.4 76.5 79.9 80.6 87.1 88.3No Span Copy 60.2 76.2 79.8 79.0 86.7 87.4No Entity Proposers 59.6 76.4 79.8 80.5 86.9 88.2

Pars

er

All of the Above 58.9 75.8 77.3 72.9 80.2 80.6No History 59.0 70.0 73.8 68.3 75.0 76.5Previous Turn 61.3 75.9 77.4 80.5 86.9 87.4

His

tory

Linear Encoder 63.0 76.5 80.2 81.2 87.1 88.3

(b) Ablation study.

Table 2: Validation set exact match accuracy acrossvarying amounts of training data (each subset is sam-pled uniformly at random). The best results in eachcase are shown in bold red and are underlined.

In order to further understand the performance char-acteristics of our model and quantify the impact ofeach modeling contribution, we also compare to avariety of other models and ablated versions of ourmodel. We implemented the following baselines:

– Seq2Seq: The OpenNMT (Klein et al., 2017) im-plementation of a pointer-generator network (Seeet al., 2017) that predicts linearized plans repre-sented as S-expressions and is able to copy to-kens from the utterance while decoding. Thismodel is very similar to the model used by Se-mantic Machines et al. (2020) and represents thecurrent state-of-the-art for SMCALFLOW.8

– Seq2Tree: The same as Seq2Seq, except that itgenerates invocations in a top-down, pre-orderprogram traversal. Each invocation is embeddedas a unique item in the output vocabulary. Notethat SMCALFLOW contains re-entrant programsrepresented with LISP-style let bindings. Boththe Seq2Tree and Seq2Seq are unaware of thespecial meaning of let and predict calls to let

as any other function, and references to bound8Semantic Machines et al. (2020) used linearized plans to

represent the dialogue history, but our implementation usesprevious user and agent utterances. We found no difference inperformance.

https://github.com/microsoft/task_oriented_dialogue_as_dataflow_synthesis/tree/master/datasets



variables as any other literal.– Seq2Tree++: An enhanced version of the model

by Krishnamurthy et al. (2017) that predictstyped programs in a top-down fashion. UnlikeSeq2Seq and Seq2Tree, this model can only pro-duce well-formed and well-typed programs. Italso makes use of the same entity proposers(§2.1) similar to our model, and it can atomi-cally copy spans of up to 15 tokens by treatingthem as additional proposed entities. Further-more, it uses the linear history encoder that isdescribed in the next paragraph. Like our model,re-entrancies are represented as references to pre-vious outputs in the predicted sequence.

We also implemented variants of Seq2Seq andSeq2Tree that use BERT-base9 (Devlin et al., 2019)as the encoder. Our results are shown in Ta-ble 2a. Our model outperforms all baselines onboth datasets, showing particularly large gains inthe low data regime, even when using BERT. Fi-nally, we implemented the following ablations,with more details provided in Appendix G:

– Value Dependence: Introduces a unique functionfor each value in the training data (except forcopies) and transforms the data so that valuesare always produced by calls to these functions,allowing the model to condition on them.

– No Name Embedder: Embeds functions and con-stants atomically instead of using the approachof §2.4 and the utterance encoder.

– No Types: Collapses all types to a single type,which effectively disables type checking (§2.5).

– No Span Copy: Breaks up span-level copies intotoken-level copies which are put together us-ing a special concatenate function. Note thatour model is value-agnostic and so this ablatedmodel cannot condition on previously copiedtokens when copying a span token-by-token.

– No Entity Proposers: Removes the entity pro-posers, meaning that previously entity-linkedvalues have to be generated as constants.

– No History: SetsHenc =Hutt (§2.2).– Previous Turn: Replaces the type-based history

encoding with the previous turn user and systemutterances or linearized system actions.

– Linear Encoder: Replaces the history attention

9We found that BERT-base worked best for these baselines,but was no better than the smaller BERT-medium when usedwith our model. Also, unfortunately, incorporating BERT inSeq2Tree++ turned out to be challenging due to the way thatmodel was originally implemented.

Method DatasetJOBS GEO ATIS

Zettlemoyer and Collins (2007) — 86.1 84.6Wang et al. (2014) 90.7 90.4 91.3Zhao and Huang (2015) 85.0 88.9 84.2Saparov et al. (2017) 81.4 83.9 —

Dong and Lapata (2016) 90.0 87.1 84.6Rabinovich et al. (2017) 92.9 87.1 85.9Yin and Neubig (2018) — 88.2 86.2Dong and Lapata (2018) — 88.2 87.7Aghajanyan et al. (2020) — 89.3 —Our Model 91.4 91.4 90.2N

eura

lMet

hods

xNo BERT 91.4 90.0 91.3

Table 3: Validation set exact match accuracy for single-turn semantic parsing datasets. Note that Aghajanyanet al. (2020) use BART (Lewis et al., 2020), a large pre-trained encoder. The best results for each dataset areshown in bold red and are underlined.

mechanism with a linear function over a multi-hot embedding of the history types.

The results, shown in Table 2b, indicate that allof our features play a role in improving accuracy.Perhaps most importantly though, the “value de-pendence” ablation shows that our function-basedprogram representations are indeed important, andthe “previous turn” ablation shows that our type-based program representations are also important.Furthermore, the impact of both these modelingdecisions grows larger in the low data regime, asdoes the impact of the span copy mechanism.

3.2 Non-Conversational Semantic Parsing

Our main focus is on conversational semanticparsing, but we also ran experiments on non-conversational semantic parsing benchmarks toshow that our model is a strong parser irrespec-tive of context. Specifically, we manually anno-tated the JOBS, GEOQUERY, and ATIS datasetswith typed declarations (Appendix C) and ran ex-periments comparing with multiple baseline andstate-of-the-art methods. The results, shown in Ta-ble 3, indicate that our model meets or exceedsstate-of-the-art performance in each case.

4 Related Work

Our approach builds on top of a significant amountof prior work in neural semantic parsing and alsocontext-dependent semantic parsing.

Neural Semantic Parsing. While there was abrief period of interest in using unstructured se-quence models for semantic parsing (e.g., Andreas

et al., 2013; Dong and Lapata, 2016), most researchon semantic parsing has used tree- or graph-shapeddecoders that exploit program structure. Most suchapproaches use this structure as a constraint whiledecoding, filling in function arguments one-at-a-time, in either a top-down fashion (e.g., Dong andLapata, 2016; Krishnamurthy et al., 2017) or abottom-up fashion (e.g., Misra and Artzi, 2016;Cheng et al., 2018). Both directions can sufferfrom exposure bias and search errors during decod-ing: in top-down when there’s no way to realizean argument of a given type in the current context,and in bottom-up when there are no functions in theprogramming language that combine the predictedarguments. To this end, there has been some workon global search with guarantees for neural seman-tic parsers (e.g., Lee et al., 2016) but it is expensiveand makes certain strong assumptions. In contrastto this prior work, we use program structure notjust as a decoder constraint but as a source of in-dependence assumptions: the decoder explicitlydecouples some decisions from others, resulting ingood inductive biases and fast decoding algorithms.

Perhaps closest to our work is that of Dong andLapata (2018), which is also about decoupling de-cisions, but uses a dataset-specific notion of anabstracted program sketch along with different in-dependence assumptions, and underperforms ourmodel in comparable settings (§3.2). Also closeare the models of Cheng et al. (2020) and Zhanget al. (2019). Our method differs in that our beamsearch uses larger steps that predict functions to-gether with their arguments, rather than predictingthe argument values serially in separate dependentsteps. Similar to Zhang et al. (2019), we use atarget-side copy mechanism for generating refer-ences to function invocation results. However, weextend this mechanism to also predict constants,copy spans from the user utterance, and link ex-ternally proposed entities. While our span copymechanism is novel, it is inspired by prior attemptsto copy spans instead of tokens (e.g., Singh et al.,2020). Finally, bottom-up models with similaritiesto ours include SMBOP (Rubin and Berant, 2020)and BUSTLE (Odena et al., 2020).

Context-Dependent Semantic Parsing. Priorwork on conversational semantic parsing mainlyfocuses on the decoder, with few efforts on incor-porating the dialogue history information in the en-coder. Recent work on context-dependent semanticparsing (e.g., Suhr et al., 2018; Yu et al., 2019)

conditions on explicit representations of user utter-ances and programs with a neural encoder. Whilethis results in highly expressive models, it also in-creases the risk of overfitting. Contrary to this,Zettlemoyer and Collins (2009), Lee et al. (2014)and Semantic Machines et al. (2020) do not usecontext to resolve references at all. They insteadpredict context-independent logical forms that areresolved in a separate step. Our approach occu-pies a middle ground: when combined with localprogram representations, types, even without anyvalue information, provide enough information toresolve context-dependent meanings that cannot bederived from isolated sentences. The specific mech-anism we use to do this “infuses” contextual typeinformation into input sentence representations, ina manner reminiscent of attention flow models fromthe QA literature (e.g., Seo et al., 2016).

5 Conclusion

We showed that abstracting away values whileencoding the dialogue history and decoding pro-grams significantly improves conversational seman-tic parsing accuracy. In summary, our goal in thiswork is to think about types in a new way. Similarto previous neural and non-neural methods, typesare an important source of constraints on the behav-ior of the decoder. Here, for the first time, they arealso the primary ingredient in the representation ofboth the parser actions and the dialogue history.

Our approach, which is based on type-centricencodings of dialogue states and function-centricencodings of programs (§2), outperforms priorwork by 7.3% and 10.6%, on SMCALFLOW andTREEDST, respectively (§3), while also beingmore computationally efficient than competingmethods. Perhaps more importantly, it results ineven more significant gains in the low-data regime.This indicates that choosing our representationscarefully and making appropriate independenceassumptions can result in increased accuracy andcomputational efficiency.

6 Acknowledgements

We thank the anonymous reviewers for their helpfulcomments, Jason Eisner for his detailed feedbackand suggestions on an early draft of the paper, Abul-hair Saparov for helpful conversations and pointersabout semantic parsing baselines and prior work,and Theo Lanman for his help in scaling up someof our experiments.

ReferencesArmen Aghajanyan, Jean Maillard, Akshat Shrivas-

tava, Keith Diedrick, Michael Haeger, Haoran Li,Yashar Mehdad, Veselin Stoyanov, Anuj Kumar,Mike Lewis, and Sonal Gupta. 2020. Conversa-tional Semantic Parsing. In Proceedings of the 2020Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP), pages 5026–5035. As-sociation for Computational Linguistics.

Jacob Andreas, Andreas Vlachos, and Stephen Clark.2013. Semantic Parsing as Machine Translation. InProceedings of the 51st Annual Meeting of the As-sociation for Computational Linguistics (Volume 2:Short Papers), pages 47–52, Sofia, Bulgaria. Associ-ation for Computational Linguistics.

Marco Baroni. 2020. Linguistic Generalization andCompositionality in Modern Artificial Neural Net-works. Philosophical Transactions of the Royal So-ciety B, 375(1791):20190307.

Jianpeng Cheng, Devang Agrawal, HectorMartınez Alonso, Shruti Bhargava, Joris Driesen,Federico Flego, Dain Kaplan, Dimitri Kartsaklis,Lin Li, Dhivya Piraviperumal, Jason D. Williams,Hong Yu, Diarmuid O Seaghdha, and AndersJohannsen. 2020. Conversational Semantic Parsingfor Dialog State Tracking. In Proceedings of the2020 Conference on Empirical Methods in NaturalLanguage Processing (EMNLP), pages 8107–8117.Association for Computational Linguistics.

Jianpeng Cheng, Siva Reddy, Vijay Saraswat, andMirella Lapata. 2018. Learning an Executable Neu-ral Semantic Parser. Computational Linguistics,45(1):59–94. Publisher: MIT Press.

Luis Damas and Robin Milner. 1982. Principal Type-Schemes for Functional Programs. In Proceedingsof the 9th ACM SIGPLAN-SIGACT Symposium onPrinciples of Programming Languages, POPL ’82,page 207–212, New York, NY, USA. Association forComputing Machinery.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofDeep Bidirectional Transformers for Language Un-derstanding. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

Li Dong and Mirella Lapata. 2016. Language toLogical Form with Neural Attention. In Proceed-ings of the 54th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 33–43, Berlin, Germany. Associationfor Computational Linguistics.

Li Dong and Mirella Lapata. 2018. Coarse-to-Fine De-coding for Neural Semantic Parsing. In Proceedings

of the 56th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers),pages 731–742, Melbourne, Australia. Associationfor Computational Linguistics.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K.Li. 2016. Incorporating Copying Mechanism inSequence-to-Sequence Learning. In Proceedingsof the 54th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers),pages 1631–1640, Berlin, Germany. Association forComputational Linguistics.

Kelvin Guu, Panupong Pasupat, E. Liu, and PercyLiang. 2017. From language to programs: Bridg-ing reinforcement learning and maximum marginallikelihood. ArXiv, abs/1704.07926.

Robin Jia and Percy Liang. 2016. Data recombinationfor neural semantic parsing. In Proceedings of the54th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers), pages12–22, Berlin, Germany. Association for Computa-tional Linguistics.

Bevan Jones, Jacob Andreas, Daniel Bauer,Karl Moritz Hermann, and Kevin Knight. 2012.Semantics-Based Machine Translation with Hyper-edge Replacement Grammars. In Proceedings ofCOLING 2012, pages 1359–1376, Mumbai, India.The COLING 2012 Organizing Committee.

Diederik P. Kingma and Jimmy Ba. 2017.Adam: A Method for Stochastic Optimization.arXiv:1412.6980 [cs.LG]. ArXiv: 1412.6980.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senel-lart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. InProceedings of ACL 2017, System Demonstrations,pages 67–72, Vancouver, Canada. Association forComputational Linguistics.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gard-ner. 2017. Neural Semantic Parsing with Type Con-straints for Semi-Structured Tables. In Proceed-ings of the 2017 Conference on Empirical Methodsin Natural Language Processing, pages 1516–1526,Copenhagen, Denmark. Association for Computa-tional Linguistics.

Tatsuki Kuribayashi, Hiroki Ouchi, Naoya Inoue, PaulReisert, Toshinori Miyoshi, Jun Suzuki, and Ken-taro Inui. 2019. An Empirical Study of Span Rep-resentations in Argumentation Structure Parsing. InProceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics, pages 4691–4698, Florence, Italy. Association for ComputationalLinguistics.

Kenton Lee, Yoav Artzi, Jesse Dodge, and Luke Zettle-moyer. 2014. Context-dependent Semantic Parsingfor Time Expressions. In Proceedings of the 52ndAnnual Meeting of the Association for Computa-tional Linguistics (Volume 1: Long Papers), pages

https://doi.org/10.18653/v1/2020.emnlp-main.408


https://www.aclweb.org/anthology/P13-2009



https://doi.org/10.1162/coli_a_00342

https://doi.org/10.1162/coli_a_00342

https://doi.org/10.1145/582153.582176

https://doi.org/10.1145/582153.582176

https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.18653/v1/P16-1004

https://doi.org/10.18653/v1/P16-1004

https://doi.org/10.18653/v1/P18-1068

https://doi.org/10.18653/v1/P18-1068

https://doi.org/10.18653/v1/P16-1154

https://doi.org/10.18653/v1/P16-1154

https://doi.org/10.18653/v1/P16-1002

https://doi.org/10.18653/v1/P16-1002

https://www.aclweb.org/anthology/C12-1083

https://www.aclweb.org/anthology/C12-1083

http://arxiv.org/abs/1412.6980



https://doi.org/10.18653/v1/D17-1160

https://doi.org/10.18653/v1/D17-1160

https://doi.org/10.18653/v1/P19-1464

https://doi.org/10.18653/v1/P19-1464

https://doi.org/10.3115/v1/P14-1135

https://doi.org/10.3115/v1/P14-1135

1437–1447, Baltimore, Maryland. Association forComputational Linguistics.

Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2016.Global Neural CCG Parsing with Optimality Guar-antees. In Proceedings of the 2016 Conference onEmpirical Methods in Natural Language Process-ing, pages 2366–2376, Austin, Texas. Associationfor Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Mar-jan Ghazvininejad, Abdelrahman Mohamed, OmerLevy, Veselin Stoyanov, and Luke Zettlemoyer.2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Transla-tion, and Comprehension. In Proceedings of the58th Annual Meeting of the Association for Compu-tational Linguistics, pages 7871–7880. Associationfor Computational Linguistics.

Dipendra Kumar Misra and Yoav Artzi. 2016. NeuralShift-Reduce CCG Semantic Parsing. In Proceed-ings of the 2016 Conference on Empirical Methodsin Natural Language Processing, pages 1775–1786,Austin, Texas. Association for Computational Lin-guistics.

Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, andChin-Yew Lin. 2019. A Simple Recipe towards Re-ducing Hallucination in Neural Surface Realisation.In Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics, pages2673–2679, Florence, Italy. Association for Compu-tational Linguistics.

Augustus Odena, Kensen Shi, David Bieber, RishabhSingh, and Charles Sutton. 2020. BUSTLE: Bottom-up Program Synthesis Through Learning-guided Ex-ploration. arXiv:2007.14381 [cs, stat]. ArXiv:2007.14381.

Panupong Pasupat and Percy Liang. 2015. Composi-tional Semantic Parsing on Semi-Structured Tables.In Proceedings of the 53rd Annual Meeting of theAssociation for Computational Linguistics and the7th International Joint Conference on Natural Lan-guage Processing (Volume 1: Long Papers), pages1470–1480, Beijing, China. Association for Compu-tational Linguistics.

Maxim Rabinovich, Mitchell Stern, and Dan Klein.2017. Abstract Syntax Networks for Code Gener-ation and Semantic Parsing. In Proceedings of the55th Annual Meeting of the Association for Com-putational Linguistics (Volume 1: Long Papers),pages 1139–1149, Vancouver, Canada. Associationfor Computational Linguistics.

Ohad Rubin and Jonathan Berant. 2020. SmBoP:Semi-autoregressive Bottom-up Semantic Parsing.arXiv:2010.12412 [cs]. ArXiv: 2010.12412.

Abulhair Saparov, Vijay Saraswat, and Tom Mitchell.2017. Probabilistic Generative Grammar for Se-mantic Parsing. In Proceedings of the 21st Confer-ence on Computational Natural Language Learning

(CoNLL 2017), pages 248–259, Vancouver, Canada.Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning.2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computa-tional Linguistics.

Semantic Machines, Jacob Andreas, John Bufe, DavidBurkett, Charles Chen, Josh Clausman, Jean Craw-ford, Kate Crim, Jordan DeLoach, Leah Dorner, Ja-son Eisner, Hao Fang, Alan Guo, David Hall, KristinHayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Sm-riti Jha, Dan Klein, Jayant Krishnamurthy, TheoLanman, Percy Liang, Christopher H. Lin, IlyaLintsbakh, Andy McGovern, Aleksandr Nisnevich,Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth,Subhro Roy, Jesse Rusak, Beth Short, Div Slomin,Ben Snyder, Stephon Striplin, Yu Su, ZacharyTellman, Sam Thomson, Andrei Vorobev, IzabelaWitoszko, Jason Wolfe, Abby Wray, Yuchen Zhang,and Alexander Zotov. 2020. Task-Oriented Dia-logue as Dataflow Synthesis. Transactions of the As-sociation for Computational Linguistics, 8:556–571.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi,and Hannaneh Hajishirzi. 2016. BidirectionalAttention Flow for Machine Comprehension.arXiv:1611.01603 [cs.CL]. ArXiv: 1611.01603.

Abhinav Singh, Patrick Xia, Guanghui Qin, MahsaYarmohammadi, and Benjamin Van Durme. 2020.CopyNext: Explicit Span Copying and Alignmentin Sequence to Sequence Models. In Proceedingsof the Fourth Workshop on Structured Prediction forNLP, pages 11–16. Association for ComputationalLinguistics.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017.A Minimal Span-Based Neural Constituency Parser.In Proceedings of the 55th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1:Long Papers), pages 818–827, Vancouver, Canada.Association for Computational Linguistics.

Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018.Learning to Map Context-Dependent Sentences toExecutable Formal Queries. In Proceedings of the2018 Conference of the North American Chapter ofthe Association for Computational Linguistics: Hu-man Language Technologies, Volume 1 (Long Pa-pers), pages 2238–2249, New Orleans, Louisiana.Association for Computational Linguistics.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and KristinaToutanova. 2019. Well-Read Students Learn Better:On the Importance of Pre-training Compact Models.arXiv:1908.08962 [cs.CL]. ArXiv: 1908.08962.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is All

https://doi.org/10.18653/v1/D16-1262

https://doi.org/10.18653/v1/D16-1262

https://doi.org/10.18653/v1/2020.acl-main.703



https://doi.org/10.18653/v1/D16-1183

https://doi.org/10.18653/v1/D16-1183

https://doi.org/10.18653/v1/P19-1256

https://doi.org/10.18653/v1/P19-1256




https://doi.org/10.3115/v1/P15-1142

https://doi.org/10.3115/v1/P15-1142

https://doi.org/10.18653/v1/P17-1105

https://doi.org/10.18653/v1/P17-1105



https://doi.org/10.18653/v1/K17-1026

https://doi.org/10.18653/v1/K17-1026

https://doi.org/10.18653/v1/P17-1099

https://doi.org/10.18653/v1/P17-1099

https://doi.org/10.1162/tacl_a_00333

https://doi.org/10.1162/tacl_a_00333



https://doi.org/10.18653/v1/2020.spnlp-1.2

https://doi.org/10.18653/v1/2020.spnlp-1.2

https://doi.org/10.18653/v1/P17-1076

https://doi.org/10.18653/v1/N18-1203

https://doi.org/10.18653/v1/N18-1203

https://arxiv.org/abs/1908.08962

https://arxiv.org/abs/1908.08962

https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

you Need. In Advances in Neural Information Pro-cessing Systems, volume 30, pages 5998–6008. Cur-ran Associates, Inc.

Adrienne Wang, Tom Kwiatkowski, and Luke Zettle-moyer. 2014. Morpho-syntactic Lexical Generaliza-tion for CCG Semantic Parsing. In Proceedings ofthe 2014 Conference on Empirical Methods in Nat-ural Language Processing (EMNLP), pages 1284–1295, Doha, Qatar. Association for ComputationalLinguistics.

Pengcheng Yin and Graham Neubig. 2018. TRANX: ATransition-based Neural Abstract Syntax Parser forSemantic Parsing and Code Generation. In Proceed-ings of the 2018 Conference on Empirical Methodsin Natural Language Processing: System Demon-strations, pages 7–12, Brussels, Belgium. Associa-tion for Computational Linguistics.

Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi ChernTan, Xi Victoria Lin, Suyi Li, Heyang Er, IreneLi, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit,David Proctor, Sungrok Shim, Jonathan Kraft, Vin-cent Zhang, Caiming Xiong, Richard Socher, andDragomir Radev. 2019. SParC: Cross-Domain Se-mantic Parsing in Context. In Proceedings of the57th Annual Meeting of the Association for Com-putational Linguistics, pages 4511–4523, Florence,Italy. Association for Computational Linguistics.

Luke Zettlemoyer and Michael Collins. 2007. OnlineLearning of Relaxed CCG Grammars for Parsing toLogical Form. In Proceedings of the 2007 JointConference on Empirical Methods in Natural Lan-guage Processing and Computational Natural Lan-guage Learning (EMNLP-CoNLL), pages 678–687,Prague, Czech Republic. Association for Computa-tional Linguistics.

Luke Zettlemoyer and Michael Collins. 2009. Learn-ing Context-Dependent Mappings from Sentences toLogical Form. In Proceedings of the Joint Confer-ence of the 47th Annual Meeting of the ACL andthe 4th International Joint Conference on NaturalLanguage Processing of the AFNLP, pages 976–984,Suntec, Singapore. Association for ComputationalLinguistics.

Sheng Zhang, Xutai Ma, Kevin Duh, and BenjaminVan Durme. 2019. AMR Parsing as Sequence-to-Graph Transduction. In Proceedings of the 57th An-nual Meeting of the Association for ComputationalLinguistics, pages 80–94, Florence, Italy. Associa-tion for Computational Linguistics.

Kai Zhao and Liang Huang. 2015. Type-Driven Incre-mental Semantic Parsing with Polymorphism. InProceedings of the 2015 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,pages 1416–1421, Denver, Colorado. Associationfor Computational Linguistics.

https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

https://doi.org/10.3115/v1/D14-1135

https://doi.org/10.3115/v1/D14-1135

https://doi.org/10.18653/v1/D18-2002

https://doi.org/10.18653/v1/D18-2002

https://doi.org/10.18653/v1/D18-2002

https://doi.org/10.18653/v1/P19-1443

https://doi.org/10.18653/v1/P19-1443

https://www.aclweb.org/anthology/D07-1071






https://doi.org/10.18653/v1/P19-1009

https://doi.org/10.18653/v1/P19-1009

https://doi.org/10.3115/v1/N15-1162

https://doi.org/10.3115/v1/N15-1162

A Invocation Joint Normalization

Instead of the distribution in Equation 3, we can de-fine a distribution over fi and {vij}Ai

j=1 that factor-izes in the same way but is also jointly normalized:

p(πi | f<i) ∝ h(fi)Ai∏j=1

g(fi, vij), (10)

where h and g are defined as presented in §2.4and §2.5, respectively, before normalization. Thismodel has the same cost as the locally normalizedmodel at test time but is significantly more expen-sive at training time as we need to score all possiblefunction invocations, as opposed to always condi-tioning on the gold functions. It can in principleavoid some of the exposure bias problems of thelocally normalized model, but we observed no ac-curacy improvements in our experiments.

B Value Sources

In our model, the type of a value determines whatsources it can be generated from. We enforce thatvalues of certain types can only be copied or entity-linked. Any values that do not fall under theseconstraints are added to a static vocabulary of con-stants, and the model is always permitted to gener-ate them, as long as they pass type checking. Val-ues that fall under these constraints are not addedto this vocabulary so that they cannot be “halluci-nated” by the model. The specific constraints thatwe use are described in the following paragraphs.

Types that must be copied: Types for which themodel is only allowed to construct values directlyfrom string literals copied from the utterance. In§2.5 we noted that strings can be copied fromthe utterance to become string literals in the gen-erated program. For certain types t, argumentsof type t may also be willing to accept copiedstrings; in this case we generate a constructorcall that constructs a t object from the string lit-eral. For SMCALFLOW, these copyable typesare String, PersonName, RespondComment, andLocationKeyphrase. For the other datasets it isjust String. We declare training examples wherea value of a copyable type appears in the pro-gram, but is not a substring of the correspondingutterance, as likely annotation errors and ignorethem during training (but not during evaluation).Even though such examples are very rare for SM-CALFLOW (∼0.5% of the examples), they turned

out to be relatively frequent in TREEDST (∼6% ofthe examples), as we discuss in Appendix C.

Types that must be entity-linked: Types forwhich argument values can only be picked fromthe set of proposed entities (§2.1) and cannot beotherwise hallucinated from the model, or directlycopied from the utterance. The Number type istreated in a special way for all datasets, wherenumbers 0, 1, and 2 are allowed to be halluci-nated, but all other numbers must be entity-linked.Furthermore, for SMCALFLOW the set of typesthat must be entity-linked also contains the Month,DayOfWeek, and Holiday types. Based on this,we can detect probable annotation errors.

C Dataset Preparation

We now describe how we processed the datasetsto satisfy the requirements mentioned in §2.1. Wehave made the processed datasets available at https://github.com/microsoft/task_oriented_dialogue_

as_dataflow_synthesis/tree/master/datasets.

C.1 Type Declarations

We manually specified the necessary type declara-tions by inspection of all functions in the trainingdata. In some cases, we found it helpful to trans-form the data into an equivalent set of functioncalls that simplified the resulting programs, whilemaintaining a one-to-one mapping with the origi-nal representations. For example, SMCALFLOW

contains a function called get that takes in an ob-ject of some type and a Path, which specifies afield of that object, and acts as an accessor. Forexample, the object could be an Event and thespecified path may be "subject". We transformsuch invocations into invocations of functions thatare instantiated separately for each unique combi-nation of the object type and the provided path. Forthe aforementioned example, the correspondingnew function would be defined as:

def Event.subject(obj: Event): String

All such transformations are invertible, so we canconvert back to the original format after prediction.

C.2 Meta-Computation Operators

The meta-computation operators are only requiredfor the conversational semantic parsing datasets,and SMCALFLOW already makes use of them.Therefore, we only had to convert TREEDST. Tothis end, we introduced two new operators:




def refer[T](): T

def revise[T, R](root: Root[T],path: Path[R],revision: R => R,

): T

refer goes through the programs and system ac-tions in the dialogue history, starting at the mostrecent turn, finds the first sub-program that evalu-ates to type T, and replaces its invocation with thatsub-program. Similarly, revise finds the first pro-gram whose root matches the specified root, walksdown the tree along the specified path, and appliesthe provided revision on the sub-program rootedat the end of that path. It then replaces its invoca-tion with this revised program. We performed anautomated heuristic transformation of TREEDSTso that it makes use of these meta-operators. Weonly applied the extracted transformations whenexecuting them on the transformed programs usingthe gold dialogue history resulted in the originalprogram (i.e., before applying any of our transfor-mations). Therefore, when using the gold dialoguehistory, this transformation is also guaranteed tobe invertible. We emphasize that we execute thesemeta-computation operators before computing ac-curacy so that our final evaluation results are com-parable to prior work.

C.3 Annotation ErrorsWhile preparing the datasets for our experimentsusing our automated transformations, we noticedthat they contain some inconsistencies. For exam-ple, in TREEDST, the tree fragment:...restaurant.book.restaurant.book...

seemed to be interchangeable with:...restaurant.object.equals...

The annotation and checking mechanisms we em-ploy impose certain regularity requirements on thedata that are violated by such examples. There-fore, we had three choices for such examples: (i)we could add additional type declarations, (ii) wecould discard them, or (iii) we could collapse thetwo annotations together, resulting in a lossy con-version. We used our best judgment when choosingamong these options, preferring option (iii) whereit was possible to do so automatically. We believethat all such cases are annotation errors, but wecannot know for certain without more informationabout how the TREEDST dataset was constructed.Overall, about 122 dialogues (0.4%) did not pass

our checks for SMCALFLOW, and 585 dialogues(3.0%) for TREEDST. When converting back to theoriginal format, we tally an error for each discardedexample, and select the most frequent version ofany lossily collapsed annotation.

Our approach also provides two simple yet ef-fective consistency checks for the training data:(i) running type inference using the provided typedeclarations to detect ill-typed examples, and (ii)using the constraints described Appendix B to de-tect other forms of annotation errors. We foundthat these two checks together caught 68 poten-tial annotation errors (<0.5%) in SMCALFLOW

and ∼1,000 potential errors (∼6%) in TREEDST.TREEDST was particularly interesting as we founda whole class of examples where user utteranceswere replaced with system utterances.

Note that our model does not technically requireany of these checks. It is possible to generate typesignatures that permit arbitrary function/argumentpairs based on observed data and to configure ourmodel so that any observed value may be generatedas a constant (i.e., not imposing the constraintsdescribed in Appendix B). In practice we foundthat constraining the space of programs providesuseful sanity checks in addition to accuracy gains.

C.4 Non-Conversational Semantic Parsing

We obtained the JOBS, GEOQUERY, and ATIS

datasets from the repository of Dong and Lapata(2016). For each dataset, we defined a library thatspecifies function and type declarations.

D Evaluation Details

To compare with prior work for SMCALFLOW

(Semantic Machines et al., 2020) and TREEDST(Cheng et al., 2020), we replicated their setups. ForSMCALFLOW, we predict plans always condition-ing on the gold dialogue history for each utterance,but we consider any predicted plan wrong if therefer are correct flag is set to false. Thisflag is meant to summarize the accuracy of a hypo-thetical model for resolving calls to refer, but isnot relevant to the problem of program prediction.We also canonicalize plans by sorting keyword ar-guments and normalizing numbers (so that 30.0and 30 are considered equivalent, for example).For TREEDST, our model predicts programs thatuse the refer and revise operators, and we exe-cute them against the dialogue history that consistsof predicted programs and gold (oracle) system

actions (following Cheng et al. (2020)) when con-verting back to the original tree representation. Wecanonicalize the resulting trees by lexicographi-cally sorting the children of each node.

For our baseline comparisons and ablations(shown in Tables 2a and 2b), we decided to ignorethe refer are correct flag for SMCALFLOW

because it assumes that refer is handled by someother model and for these experiments we are onlyinterested in evaluating program prediction. Also,for TREEDST we use the gold plans for the di-alogue history in order to focus on the semanticparsing problem, as opposed to the dialogue statetracking problem. For the non-conversational se-mantic parsing datasets we replicated the evalua-tion approach of Dong and Lapata (2016), and sowe also canonicalize the predicted programs.

E Model Hyperparameters

We use the same hyperparameters for all of ourconversational semantic parsing experiments. Forthe encoder, we use either BERT-medium (Turcet al., 2019) or a non-pretrained 2-layer Trans-former (Vaswani et al., 2017) with a hidden sizeof 128, 4 heads, and a fully connected layer sizeof 512, for the non-BERT experiments. For the de-coder we use a 2-layer Transformer with a hiddensize of 128, 4 heads, and a fully connected layersize of 512, and set htype to 128, and harg to 512.For the non-conversational semantic parsing exper-iments we use a hidden size of 32 throughout themodel as the corresponding datasets are very small.We also use a dropout of 0.2 for all experiments.

For training, we use the Adam optimizer(Kingma and Ba, 2017), performing global gradi-ent norm clipping with the maximum allowed normset to 10. For batching, we bucket the training ex-amples by utterance length and adapt the batch sizeso that the total number of tokens in each batchis 10,240. Finally, we average the log-likelihoodfunction over each batch, instead of summing it.

Experiments with BERT. We use a pre-trainingphase for 2,000 training steps, where we freezethe parameters of the utterance encoder and onlytrain the dialogue history encoder and the decoder.Then, we train the whole model for another 8,000steps. This because our model is not simply addinga linear layer on top of BERT, and so, unless ini-tialized properly, we may end up losing some ofthe information contained in the pre-trained BERTmodel. During the pre-training phase, we linearly

warm up the learning rate to 2× 10−3 during thefirst 1,000 steps. We then decay it exponentiallyby a factor of 0.999 every 10 steps. During the fulltraining phase, we linearly warm up the learningrate to 1 × 10−4 during the first 1,000 steps, andthen decay it exponentially in the same fashion.

Experiments without BERT. We use a singletraining phase for 30,000 steps, where we linearlywarm up the learning rate to 5× 10−3 during thefirst 1,000 steps, and then we decay it exponen-tially by a factor of 0.999 every 10 steps. We needa larger number of training steps in this case be-cause none of the model components have beenpre-trained. Also, the encoder is now much smaller,meaning that we can afford a higher learning rate.

Even though these hyperparameters may seem veryspecific, we emphasize that our model is robust tothe choice of hyperparameters and this setup waschosen once and shared across all experiments.

F Baseline Models

Seq2Seq. This model predicts linearized, tok-enized S-expressions using the OpenNMT imple-mentation of a Transformer-based (Vaswani et al.,2017) pointer-generator network (See et al., 2017).For example, the following program:

+(length("some string"), 1)

would correspond to the space-separated sequence:

( + ( length " some string " ) 1 )

In contrast to the model proposed in this paper, inthis case tokens that belong to functions and values(i.e., that are outside of quotes) can also be copieddirectly from the utterance. Furthermore, thereis no guarantee that this baseline will produce awell-formed program.

Seq2Tree. This model uses the same underly-ing implementation as our Seq2Seq baseline—alsowith no guarantee that it will produce a well-formedprogram—but it predicts a different sequence. Forexample, the following program:

+(+(1, 2), 3)

would be predicted as the sequence:

+(<NT>, 3)+(1, 2)

Each item in the sequence receives a unique em-bedding in the output vocabulary and so, "+(1,2)"and "+(<NT>, 3)" share no parameters. <NT> is

a special placeholder symbol that represents a sub-stitution point when converting the linearized se-quence back to a tree. Furthermore, copies are notinlined into invocations, but broken out into tokensequences. For example, the following program:+(length("some string"), 1)

would be predicted as the sequence:+(<NT>, 1)length(<NT>)"somestring"

Seq2Tree++. This is a re-implementation of Kr-ishnamurthy et al. (2017) with some differences:(i) our implementation’s entity linking embeddingsare computed over spans, including type informa-tion (as in the original paper) and a span embed-ding computed based on the LSTM hidden stateat the start and end of each entity span, (ii) copiesare treated as entities by proposing all spans upto length 15, and (iii) we use the linear dialoguehistory encoder described in §3.1.

G Ablations

The “value dependence” and the “no span copy”ablations are perhaps the most important in ourexperiments, and so we provide some more detailsabout them in the following paragraphs.

Value Dependence. The goal of this ablation isto quantify the impact of the dependency structurewe propose in Equation 3. To this end, we firstconvert all functions to a curried form, where eachargument is provided as part of a separate functioninvocation. For example, the following invocation:[0] event(subject = s0, start = t0, end = t1)

is transformed to the following program fragment:[0] value(s0)[1] event_0(subject = [0])[2] value(t0)[3] event_1(curried = [1], start = [2])[4] value(t1)[5] event_2(curried = [3], end = [4])

When choosing a function, our decoder does notcondition on the argument values of the previousinvocations. In order to enable such condition-ing without modifying the model implementation,we also transform the value function invocationswhose underlying values are not copies, such thatthere exists a unique function for each unique value.This results in the following program:

[0] value_s0(s0)[1] event_0(subject = [0])[2] value_t0(t0)[3] event_1(curried = [1], start = [2])[4] value_t1(t1)[5] event_2(curried = [3], end = [4])

Note that we keep the value s0, value t0, andvalue t1 function arguments because they allowthe model to marginalize over multiple possiblevalue sources (§2.5). The reason we do not trans-form the value functions that correspond to copiesis because we attempted doing that on top of thespan copy ablation, but it performed poorly and wedecided that it may be a misrepresentation. Overall,this ablation offers us a way to obtain a bottom-upparser that maintains most properties of the pro-posed model, except for its dependency structure.

No Span Copy. In order to ablate the proposedspan copy mechanism we implemented a data trans-formation that replaces all copied values with refer-ences to the result of a copy function (for spans oflength 1) or the result of a concatenate functioncalled on the results of 2 or more calls to copy. Forexample, the function invocation:[0] event(subject = "water the plant")

is converted to:[0] copy("water")[1] copy("the")[2] concatenate([0], [1])[3] copy("plant")[4] concatenate([2], [3])[5] event(subject = [4])

When applied on its own and not combined withother ablation, the single token copies are furtherinlined to produce the following program:[0] concatenate("water", "the")[1] concatenate([0], "plant")[2] event(subject = [1])

H Computational Efficiency

For comparing model performance we computedthe average utterance processing time across all ofthe SMCALFLOW validation set, using a singleNvidia V100 GPU. The fastest baseline requiredabout 80ms per utterance, while our model onlyrequired about 8ms per utterance. This can beattributed to multiple reasons, such as the factsthat: (i) our independence assumptions allow us topredict the argument value distributions in parallel,(ii) we avoid enumerating all possible utterancespans when computing the normalizing constant forthe argument values, and (iii) we use ragged tensorsto avoid unnecessary padding and computation.

Documents

Value-Agnostic Conversational Semantic Parsing · 2021. 6. 14. · Value-Agnostic Conversational Semantic Parsing Emmanouil Antonios Platanios, Adam Pauls, Subhro Roy, Yuchen Zhang,