
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 95–106, Florence, Italy, July 28 – August 2, 2019. ©2019 Association for Computational Linguistics


Generating Logical Forms from Graph Representations of Text and Entities

Peter Shaw1, Philip Massey1, Angelica Chen1, Francesco Piccinno2, Yasemin Altun2

1Google  2Google Research
{petershaw,pmassey,angelicachen,piccinno,altun}@google.com

Abstract

Structured information about entities is critical for many semantic parsing tasks. We present an approach that uses a Graph Neural Network (GNN) architecture to incorporate information about relevant entities and their relations during parsing. Combined with a decoder copy mechanism, this approach provides a conceptually simple method for generating logical forms with entities. We demonstrate that this approach is competitive with the state-of-the-art across several tasks without pre-training, and outperforms existing approaches when combined with BERT pre-training.

1 Introduction

Semantic parsing maps natural language utterances into structured meaning representations. The representation languages vary between tasks, but typically provide a precise, machine-interpretable logical form suitable for applications such as question answering (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2007; Liang et al., 2013; Berant et al., 2013). The logical forms typically consist of two types of symbols: a vocabulary of operators and domain-specific predicates or functions, and entities grounded to some knowledge base or domain.

Recent approaches to semantic parsing have cast it as a sequence-to-sequence task (Dong and Lapata, 2016; Jia and Liang, 2016; Ling et al., 2016), employing methods similar to those developed for neural machine translation (Bahdanau et al., 2014), with strong results. However, special consideration is typically given to handling of entities. This is important to improve generalization and computational efficiency, as most tasks require handling entities unseen during training, and the set of unique entities can be large.

Some recent approaches have replaced surface forms of entities in the utterance with placeholders (Dong and Lapata, 2016). This requires a pre-processing step to completely disambiguate entities and replace their spans in the utterance. Additionally, for some tasks it may be beneficial to leverage relations between entities, multiple entity candidates per span, or entity candidates without a corresponding span in the utterance, while generating logical forms.

Other approaches identify only types and surface forms of entities while constructing the logical form (Jia and Liang, 2016), using a separate post-processing step to generate the final logical form with grounded entities. This ignores potentially useful knowledge about relevant entities.

Meanwhile, there has been considerable recent interest in Graph Neural Networks (GNNs) (Scarselli et al., 2009; Li et al., 2016; Kipf and Welling, 2017; Gilmer et al., 2017; Veličković et al., 2018) for effectively learning representations for graph structures. We propose a GNN architecture based on extending the self-attention mechanism of the Transformer (Vaswani et al., 2017) to make use of relations between input elements.

We present an application of this GNN architecture to semantic parsing, conditioning on a graph representation of the given natural language utterance and potentially relevant entities. This approach is capable of handling ambiguous and potentially conflicting entity candidates jointly with a natural language utterance, relaxing the need for completely disambiguating a set of linked entities before parsing. This graph formulation also enables us to incorporate knowledge about the relations between entities where available. Combined with a copy mechanism while decoding, this approach also provides a conceptually simple method for generating logical forms with grounded entities.

We demonstrate the capability of the proposed architecture by achieving competitive results across 3 semantic parsing tasks. Further improvements are possible by incorporating a pre-trained BERT (Devlin et al., 2018) encoder within the architecture.

Table 1: Example input utterances, x, and meaning representations, y, with entities underlined.

GEO
  x: which states does the mississippi run through ?
  y: answer ( state ( traverse_1 ( riverid ( mississippi ) ) ) )

ATIS
  x: in denver what kind of ground transportation is there from the airport to downtown
  y: ( _lambda $0 e ( _and ( _ground_transport $0 ) ( _to_city $0 denver:ci ) ( _from_airport $0 den:ap ) ) )

SPIDER
  x: how many games has each stadium held ?
  y: SELECT T1.id , count(*) FROM stadium AS T1 JOIN game AS T2 ON T1.id = T2.stadium_id GROUP BY T1.id

2 Task Formulation

Our goal is to learn a model for semantic parsing from pairs of natural language utterances and structured meaning representations. Let the natural language utterance be represented as a sequence x = (x_1, ..., x_{|x|}) of |x| tokens, and the meaning representation be represented as a sequence y = (y_1, ..., y_{|y|}) of |y| elements.

The goal is to estimate p(y | x), the conditional probability of the meaning representation y given utterance x, which is augmented by a set of potentially relevant entities.

Input Utterance  Each token x_i ∈ V^in is from a vocabulary of input tokens.

Entity Candidates  Given the input utterance x, we retrieve a set, e = {e_1, ..., e_{|e|}}, of potentially relevant entity candidates, with e ⊆ V^e, where V^e is the set of all entities for a given domain. We assume the availability of an entity candidate generator for each task to generate e given x, with details given in § 5.2.

For each entity candidate, e ∈ V^e, we require a set of task-specific attributes containing one or more elements from V^a. These attributes can be NER types or other characteristics of the entity, such as "city" or "river" for some of the entities listed in Table 1. Whereas V^e can be quite large for open domains, or even infinite if it includes sets such as the natural numbers, V^a is typically much smaller. Therefore, we can effectively learn representations for entities given their set of attributes, from our set of example pairs.

Edge Labels  In addition to x and e for a particular example, we also consider the (|x| + |e|)^2 pairwise relations between all tokens and entity candidates, represented as edge labels.

The edge label between tokens x_i and x_j corresponds to the relative sequential position, j − i, of the tokens, clipped to within some range.

The edge label between token x_i and entity e_j, and vice versa, corresponds to whether x_i is within the span of the entity candidate e_j, or not.

The edge label between entities e_i and e_j captures the relationship between the entities. These edge labels can have domain-specific interpretations, such as relations in a knowledge base, or any other type of entity interaction features. For tasks where this information is not available or useful, a single generic label between entity candidates can be used.
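The sketch below shows one way the edge labels described above could be assembled into a label matrix over all (|x| + |e|)^2 node pairs. The label names, clipping window, and relation encoding are illustrative assumptions; the paper does not prescribe a particular implementation.

```python
def build_edge_labels(num_tokens, entity_spans, entity_relations, clip=8):
    """Returns an (n x n) matrix of edge label strings, where n = num_tokens + num_entities.

    entity_spans:     list of (start, end) token spans, one per entity candidate (or None)
    entity_relations: dict mapping (i, j) entity-index pairs to a relation label string
    """
    n_ent = len(entity_spans)
    n = num_tokens + n_ent
    labels = [["none"] * n for _ in range(n)]
    for i in range(num_tokens):
        for j in range(num_tokens):
            # token-token edges: relative position j - i, clipped to [-clip, clip]
            labels[i][j] = f"rel_{max(-clip, min(clip, j - i))}"
    for e, span in enumerate(entity_spans):
        for t in range(num_tokens):
            inside = span is not None and span[0] <= t < span[1]
            labels[t][num_tokens + e] = "token_in_span" if inside else "token_not_in_span"
            labels[num_tokens + e][t] = "entity_spans_token" if inside else "entity_no_span"
    for (a, b), rel in entity_relations.items():
        # entity-entity edges: domain-specific relations (e.g., foreign keys);
        # unrelated entity pairs keep the generic default label
        labels[num_tokens + a][num_tokens + b] = rel
    return labels
```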

Output  We consider the logical form, y, to be a linear sequence (Vinyals et al., 2015b). We tokenize based on the syntax of each domain. Our formulation allows each element of y to be either an element of the output vocabulary, V^out, or an entity copied from the set of entity candidates e. Therefore, y_i ∈ V^out ∪ V^e. Some experiments in § 5.2 also allow elements of y to be tokens ∈ V^in from x that are copied from the input.
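As a concrete (and partly hypothetical) illustration using the GEO example from Table 1, an output sequence mixes elements of V^out with references into the entity candidate set e; the exact preprocessing of entity arguments may differ from this sketch.

```python
# Illustrative only: V^out symbols are plain strings; a copied entity candidate is a
# ("CopyEntity", index) reference into e rather than a surface form.
e = ["mississippi (river)"]                       # entity candidates retrieved for the utterance
y = ["answer", "(", "state", "(", "traverse_1", "(", "riverid", "(",
     ("CopyEntity", 0),                           # grounded entity copied from e
     ")", ")", ")", ")"]
```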

3 Model Architecture

Our model architecture is based on the Transformer (Vaswani et al., 2017), with the self-attention sub-layer extended to incorporate relations between input elements, and the decoder extended with a copy mechanism.

Figure 1: We use an example from SPIDER to illustrate the model inputs: tokens from the given utterance, x, a set of potentially relevant entities, e, and their relations. We selected two edge label types to highlight: edges denoting that an entity spans a token, and edges between entities that, for SPIDER, indicate a foreign key relationship between columns, or an ownership relationship between columns and tables.

3.1 GNN Sub-layer

We extend the Transformer's self-attention mechanism to form a Graph Neural Network (GNN) sub-layer that incorporates a fully connected, directed graph with edge labels.

The sub-layer maps an ordered sequence of node representations, u = (u_1, ..., u_{|u|}), to a new sequence of node representations, u' = (u'_1, ..., u'_{|u|}), where each node is represented in R^d. We use r_{ij} to denote the edge label corresponding to u_i and u_j.

We implement this sub-layer in terms of a function f(m, l) over a node representation m ∈ R^d and an edge label l that computes a vector representation in R^{d'}. We use n_heads parallel attention heads, with d' = d / n_heads. For each head k, the new representation for the node u_i is computed by

  u_i^{k'} = \sum_{j=1}^{|u|} \alpha_{ij} f(u_j, r_{ij}),    (1)

where each coefficient \alpha_{ij} is a softmax over the scaled dot products s_{ij},

  s_{ij} = \frac{(W^q u_i)^\top f(u_j, r_{ij})}{\sqrt{d'}},    (2)

and W^q is a learned matrix. Finally, we concatenate representations from each head,

  u'_i = W^h [ u_i^{1'} | \cdots | u_i^{n_heads'} ],    (3)

where W^h is another learned matrix and [ \cdots ] denotes concatenation.

If we implement f as,

  f(m, l) = W^r m,    (4)

where W^r ∈ R^{d'×d} is a learned matrix, then the sub-layer would be effectively identical to self-attention as initially proposed in the Transformer (Vaswani et al., 2017).

We focus on two alternative formulations of f that represent edge labels as learned matrices and learned vectors.

Edge Matrices  The first formulation represents edge labels as linear transformations, a common parameterization for GNNs (Li et al., 2016),

  f(m, l) = W^l m,    (5)

where W^l ∈ R^{d'×d} is a learned embedding matrix per edge label.

Edge Vectors  The second formulation represents edge labels as additive vectors, using the same formulation as Shaw et al. (2018),

  f(m, l) = W^r m + w_l,    (6)

where W^r ∈ R^{d'×d} is a learned matrix shared by all edge labels, and w_l ∈ R^{d'} is a learned embedding vector per edge label l.

3.2 Encoder

Input Representations  Before the initial encoder layer, tokens are mapped to initial representations using either a learned embedding table for V^in, or the output of a pre-trained BERT (Devlin et al., 2018) encoder. Entity candidates are mapped to initial representations using the mean of the embeddings for each of the entity's attributes, based on a learned embedding table for V^a. We also concatenate an embedding representing the node type, token or entity, to each input representation.

Figure 2: Our model architecture is based on the Transformer (Vaswani et al., 2017), with two modifications. First, the self-attention sub-layer has been extended to be a GNN that incorporates edge representations. In the encoder, the GNN sub-layer is conditioned on tokens, entities, and their relations. Second, the decoder has been extended to include a copy mechanism (Vinyals et al., 2015a). We can optionally incorporate a pre-trained model such as BERT to generate contextual token representations.

We assume some arbitrary ordering for entity candidates, generating a combined sequence of initial node representations for tokens and entities. We have edge labels between every pair of nodes, as described in § 2.
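A rough sketch of assembling the combined node sequence (token embeddings or BERT outputs, entity candidates as the mean of their attribute embeddings, plus a concatenated node-type embedding). Function and tensor names are illustrative assumptions.

```python
import torch

def build_input_nodes(token_embs, entity_attr_ids, attr_emb, type_emb):
    """token_embs:     [n_tok, d]   from an embedding table over V^in (or a BERT encoder)
    entity_attr_ids:   list of attribute-id lists, one per entity candidate
    attr_emb:          embedding module over V^a (e.g., torch.nn.Embedding)
    type_emb:          [2, d_type]  learned node-type embeddings (0 = token, 1 = entity)
    """
    # Entity candidates: mean of their attribute embeddings.
    entity_embs = torch.stack([attr_emb(torch.tensor(ids)).mean(dim=0)
                               for ids in entity_attr_ids])              # [n_ent, d]
    # Concatenate a node-type embedding to each representation.
    tok = torch.cat([token_embs, type_emb[0].expand(len(token_embs), -1)], dim=-1)
    ent = torch.cat([entity_embs, type_emb[1].expand(len(entity_embs), -1)], dim=-1)
    # Combined node sequence: tokens followed by entity candidates (arbitrary entity order).
    return torch.cat([tok, ent], dim=0)                                  # [n_tok + n_ent, d + d_type]
```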

Encoder Layers  Our encoder layers are essentially identical to the Transformer, except with the proposed extension to self-attention to incorporate edge labels. Therefore, each encoder layer consists of two sub-layers. The first is the GNN sub-layer, which yields new sets of token and entity representations. The second sub-layer is an element-wise feed-forward network. Each sub-layer is followed by a residual connection and layer normalization (Ba et al., 2016). We stack N_enc encoder layers, yielding a final set of token representations, w_x^{(N_enc)}, and entity representations, w_e^{(N_enc)}.

3.3 Decoder

The decoder auto-regressively generates output symbols, y_1, ..., y_{|y|}. It is similarly based on the Transformer (Vaswani et al., 2017), with the self-attention sub-layer replaced by the GNN sub-layer. Decoder edge labels are based only on the relative timesteps of the previous outputs. The encoder-decoder attention layer considers both encoder outputs w_x^{(N_enc)} and w_e^{(N_enc)}, jointly normalizing attention weights over tokens and entity candidates. We stack N_dec decoder layers to produce an output vector representation at each output step, z_j ∈ R^{d_z}, for j ∈ {1, ..., |y|}.

We allow the decoder to copy tokens or entity candidates from the input, effectively combining a Pointer Network (Vinyals et al., 2015a) with a standard softmax output layer for selecting symbols from an output vocabulary (Gu et al., 2016; Gulcehre et al., 2016; Jia and Liang, 2016). We define a latent action at each output step, a_j for j ∈ {1, ..., |y|}, using similar notation as Jia and Liang (2016). We normalize action probabilities with a softmax over all possible actions.

Generating Symbols  We can generate a symbol, denoted Generate[i],

  P(a_j = Generate[i] | x, y_{1:j-1}) ∝ exp(z_j^\top w_i^{out}),    (7)

where w_i^{out} is a learned embedding vector for the element ∈ V^out with index i. If a_j = Generate[i], then y_j will be the element ∈ V^out with index i.

Copying Entities  We can also copy an entity candidate, denoted CopyEntity[i],

  P(a_j = CopyEntity[i] | x, y_{1:j-1}) ∝ exp((z_j W^e)^\top w_{e_i}^{(N_enc)}),    (8)

where W^e is a learned matrix, and i ∈ {1, ..., |e|}. If a_j = CopyEntity[i], then y_j = e_i.
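A minimal sketch of how the Generate and CopyEntity scores could be combined under a single softmax (Eqs. 7-8). The parameter names follow the text, but the code itself is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def action_distribution(z_j, w_out, W_e, w_e_enc):
    """z_j:    [d_z]         decoder output at step j
    w_out:     [V_out, d_z]  learned output-symbol embeddings
    W_e:       [d_z, d]      learned projection for copying entities
    w_e_enc:   [n_ent, d]    final encoder representations of entity candidates
    Returns one distribution over V_out Generate actions followed by n_ent CopyEntity actions.
    """
    gen_logits = w_out @ z_j                  # z_j^T w_i^out        (Eq. 7)
    copy_logits = w_e_enc @ (W_e.T @ z_j)     # (z_j W^e)^T w_e_i    (Eq. 8)
    return F.softmax(torch.cat([gen_logits, copy_logits]), dim=-1)
```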


4 Related Work

Various approaches to learning semantic parsers from pairs of utterances and logical forms have been developed over the years (Tang and Mooney, 2000; Zettlemoyer and Collins, 2007; Kwiatkowski et al., 2011; Andreas et al., 2013). More recently, encoder-decoder architectures have been applied with strong results (Dong and Lapata, 2016; Jia and Liang, 2016).

Even for tasks with relatively small domains of entities, such as GEO and ATIS, it has been shown that some special consideration of entities within an encoder-decoder architecture is important to improve generalization. This has included extending decoders with copy mechanisms (Jia and Liang, 2016) and/or identifying entities in the input as a pre-processing step (Dong and Lapata, 2016).

Other work has considered open-domain tasks, such as WEBQUESTIONSSP (Yih et al., 2016). Recent approaches have typically relied on a separate entity linking model, such as S-MART (Yang and Chang, 2015), to provide a single disambiguated set of entities to consider. In principle, a learned entity linker could also serve as an entity candidate generator within our framework, although we do not explore such tasks in this work.

Considerable recent work has focused on constrained decoding of various forms within an encoder-decoder architecture to leverage the known structure of the logical forms. This has led to approaches that leverage this structure during decoding, such as using tree decoders (Dong and Lapata, 2016; Alvarez-Melis and Jaakkola, 2017) or other mechanisms (Dong and Lapata, 2018; Goldman et al., 2017). Other approaches use grammar rules to constrain decoding (Xiao et al., 2016; Yin and Neubig, 2017; Krishnamurthy et al., 2017; Yu et al., 2018b). We leave investigation of such decoder constraints to future work.

Many formulations of Graph Neural Networks (GNNs) that propagate information over local neighborhoods have recently been proposed (Li et al., 2016; Kipf and Welling, 2017; Gilmer et al., 2017; Veličković et al., 2018). Recent work has often focused on large graphs (Hamilton et al., 2017) and effectively propagating information over multiple graph steps (Xu et al., 2018). The graphs we consider are relatively small and are fully connected, avoiding some of the challenges posed by learning representations for large, sparsely connected graphs.

Other recent work related to ours has considered GNNs for natural language tasks, such as combining structured and unstructured data for question answering (Sun et al., 2018), or for representing dependencies in tasks such as AMR parsing and machine translation (Beck et al., 2018; Bastings et al., 2017). The approach of Krishnamurthy et al. (2017) similarly considers ambiguous entity mentions jointly with query tokens for semantic parsing, although it does not directly consider a GNN.

Previous work has interpreted the Transformer's self-attention mechanism as a GNN (Veličković et al., 2018; Battaglia et al., 2018), and extended it to consider relative positions as edge representations (Shaw et al., 2018). Previous work has also similarly represented edge labels as vectors, as opposed to matrices, in order to avoid over-parameterizing the model (Marcheggiani and Titov, 2017).

5 Experiments

5.1 Semantic Parsing Datasets

We consider three semantic parsing datasets, with examples given in Table 1.

GEO  The GeoQuery dataset consists of natural language questions about US geography along with corresponding logical forms (Zelle and Mooney, 1996). We follow the convention of Zettlemoyer and Collins (2005) and use 600 training examples and 280 test examples. We use logical forms based on Functional Query Language (FunQL) (Kate et al., 2005).

ATIS  The Air Travel Information System (ATIS) dataset consists of natural language queries about travel planning (Dahl et al., 1994). We follow Zettlemoyer and Collins (2007) and use 4,473 training examples and 448 test examples, and represent the logical forms as lambda expressions.

SPIDER  This is a large-scale text-to-SQL dataset that consists of 10,181 questions and 5,693 unique complex SQL queries across 200 database tables spanning 138 domains (Yu et al., 2018c). We use the standard training set of 8,659 training examples and development set of 1,034 examples, split across different tables.


5.2 Experimental Setup

Model Configuration  We configured hyperparameters based on performance on the validation set for each task, if provided, and otherwise cross-validated on the training set.

For the encoder and decoder, we selected the number of layers from {1, 2, 3, 4} and embedding and hidden dimensions from {64, 128, 256}, setting the feed-forward layer hidden dimensions 4× higher. We employed dropout at training time with P_dropout selected from {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. We used 8 attention heads for each task. We used a clipping distance of 8 for relative position representations (Shaw et al., 2018).

We used the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, and ε = 10^−9, and tuned the learning rate for each task. We used the same warmup and decay strategy for the learning rate as Vaswani et al. (2017), selecting a number of warmup steps up to a maximum of 3000. Early stopping was used to determine the total training steps for each task. We used the final checkpoint for evaluation. We batched training examples together, and selected the batch size from {32, 64, 128, 256, 512}. During training we used masked self-attention (Vaswani et al., 2017) to enable parallel decoding of output sequences. For evaluation, we used greedy search.
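For reference, the warmup-and-decay schedule of Vaswani et al. (2017) referenced above has the following closed form (d_model and warmup_steps are hyperparameters):

```python
def transformer_lr(step, d_model, warmup_steps):
    """Learning rate schedule from Vaswani et al. (2017): linear warmup for
    warmup_steps, then decay proportional to the inverse square root of the step."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```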

We used a simple strategy of splitting each input utterance on spaces to generate a sequence of tokens. We mapped any token that didn't occur at least 2 times in the training dataset to a special out-of-vocabulary token. For experiments that used BERT, we instead used the same wordpiece (Wu et al., 2016) tokenization as used for pre-training.
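A small sketch of the whitespace tokenization and out-of-vocabulary mapping described above (the OOV token name and function names are our own):

```python
from collections import Counter

def build_vocab(train_utterances, min_count=2):
    """Keep tokens occurring at least min_count times in the training data."""
    counts = Counter(tok for utt in train_utterances for tok in utt.split(" "))
    return {tok for tok, c in counts.items() if c >= min_count}

def tokenize(utterance, vocab, oov="<OOV>"):
    """Split on spaces and map rare or unseen tokens to a special OOV token."""
    return [tok if tok in vocab else oov for tok in utterance.split(" ")]
```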

BERT  For some of our experiments, we evaluated incorporating a pre-trained BERT (Devlin et al., 2018) encoder by effectively using the output of the BERT encoder in place of a learned token embedding table. We then continue to use graph encoder and decoder layers with randomly initialized parameters in addition to BERT, so there are many parameters that are not pre-trained. The additional encoder layers are still necessary to condition on entities and relations.

We achieved best results by freezing the pre-trained parameters for an initial number of steps, and then jointly fine-tuning all parameters, similar to existing approaches for gradual unfreezing (Howard and Ruder, 2018). When unfreezing the pre-trained parameters, we restart the learning rate schedule. We found this to perform better than keeping pre-trained parameters either entirely frozen or entirely unfrozen during fine-tuning.
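One way to realize this two-phase schedule, sketched with PyTorch-style parameters; the paper describes the strategy but not a specific implementation, so the names and structure here are assumptions.

```python
def gradual_unfreeze_training(model, bert, optimizer, lr_schedule, batches, frozen_steps):
    """Freeze pre-trained BERT parameters for frozen_steps, then unfreeze them and
    restart the learning rate schedule while continuing to train all parameters."""
    for p in bert.parameters():
        p.requires_grad = False
    schedule_step = 0
    for step, batch in enumerate(batches):
        if step == frozen_steps:
            for p in bert.parameters():
                p.requires_grad = True   # unfreeze pre-trained parameters
            schedule_step = 0            # restart the learning rate schedule
        schedule_step += 1
        for group in optimizer.param_groups:
            group["lr"] = lr_schedule(schedule_step)
        loss = model(batch)              # assumed to return the training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```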

We used BERT_LARGE (Devlin et al., 2018), which has 24 layers. For fine-tuning we used the same Adam optimizer with weight decay and learning rate decay as used for BERT pre-training. We reduced batch sizes to accommodate the significantly larger model size, and tuned the learning rate, warmup steps, and number of frozen steps for pre-trained parameters.

Entity Candidate Generator  We use an entity candidate generator that, given x, can retrieve a set of potentially relevant entities, e, for the given domain. Although all generators share a common interface, their implementation varies across tasks.

For GEO and ATIS we use a lexicon of entity aliases in the dataset and attempt to match with ngrams in the query. Each entity has a single attribute corresponding to the entity's type. We used binary-valued relations between entity candidates based on whether entity candidate spans overlap, but experiments did not show significant improvements from incorporating these relations.

For SPIDER, we generalize our notion of entities to include tables and table columns. We include all relevant tables and columns as entity candidates, but make use of Levenshtein distance between query ngrams and table and column names to determine edges between tokens and entity candidates. We use attributes based on the types and names of tables and columns. Edges between entity candidates capture relations between columns and the table they belong to, and foreign key relations.

For GEO, ATIS, and SPIDER, this leads to 19.5%, 32.7%, and 74.6% of examples containing at least one span associated with multiple entity candidates, respectively, indicating some entity ambiguity.

Further details on how entity candidate generators were constructed are provided in § A.1.

Output Sequences  We pre-processed output sequences to identify entity argument values, and replaced those elements with references to entity candidates in the input. In cases where our entity candidate generator did not retrieve an entity that was used as an argument, we dropped the example from the training data set or considered it incorrect if in the test set.

Method                        GEO    ATIS
Kwiatkowski et al. (2013)     89.0   —
Liang et al. (2013)           87.9   —
Wang et al. (2014)            —      91.3
Zhao and Huang (2015)         88.9   84.2
Jia and Liang (2016)          89.3   83.3
  − data augmentation         85.0   76.3
Dong and Lapata (2016) †      87.1   84.6
Rabinovich et al. (2017) †    87.1   85.9
Dong and Lapata (2018) †      88.2   87.7
Ours
  GNN w/ edge matrices        82.5   84.6
  GNN w/ edge vectors         89.3   87.1
  GNN w/ edge vectors + BERT  92.5   89.7

Method                        SPIDER
Xu et al. (2017)              10.9
Yu et al. (2018a)             8.0
Yu et al. (2018b)             24.8
  − data augmentation         18.9
Ours
  GNN w/ edge matrices        29.3
  GNN w/ edge vectors         32.1
  GNN w/ edge vectors + BERT  23.5

Table 2: We report accuracies on GEO, ATIS, and SPIDER for various implementations of our GNN sub-layer. For GEO and ATIS, we use † to denote neural approaches that disambiguate and replace entities in the utterance as a pre-processing step. For SPIDER, the evaluation set consists of examples for databases unseen during training.

Evaluation  To evaluate accuracy, we use exact match accuracy relative to gold logical forms. For GEO we directly compare output symbols. For ATIS, we compare normalized logical forms using canonical variable naming and sorting for unordered arguments (Jia and Liang, 2016). For SPIDER we use the provided evaluation script, which decomposes each SQL query and conducts set comparison within each clause without values. All accuracies are reported on the test set, except for SPIDER, where we report and compare accuracies on the development set.

Copying Tokens  To better understand the effect of conditioning on entities and their relations, we also conducted experiments that considered an alternative method for selecting and disambiguating entities, similar to Jia and Liang (2016). In this approach we use our model's copy mechanism to copy tokens corresponding to the surface forms of entity arguments, rather than copying entities directly.

  P(a_j = CopyToken[i] | x, y_{1:j-1}) ∝ exp((z_j W^x)^\top w_{x_i}^{(N_enc)}),    (9)

where W^x is a learned matrix, and where i ∈ {1, ..., |x|} refers to the index of token x_i ∈ V^in. If a_j = CopyToken[i], then y_j = x_i.

This allows us to ablate entity information in the input while still generating logical forms. When copying tokens, the decoder determines the type of the entity using an additional output symbol. For GEO, the actual entity can then be identified as a post-processing step, as a type and surface form are sufficient. For other tasks this could require a more complicated post-processing step to disambiguate entities given a surface form and type.

Method                        GEO
Copying Entities
  GNN w/ edge vectors + BERT  92.5
  GNN w/ edge vectors         89.3
Copying Tokens
  GNN w/ edge vectors         87.9
    − entity candidates, e    84.3
  BERT                        89.6

Table 3: Experimental results for copying tokens instead of entities when decoding, with and without conditioning on the set of entity candidates, e.

5.3 Results and Analysis

Accuracies on GEO, ATIS, and SPIDER are shown in Table 2.

GEO and ATIS  Without pre-training, and despite adding a bit of entity ambiguity, we achieve similar results to other recent approaches that disambiguate and replace entities in the utterance as a pre-processing step during both training and evaluation (Dong and Lapata, 2016, 2018). When incorporating BERT, we increase absolute accuracies over Dong and Lapata (2018) on GEO and ATIS by 3.2% and 2.0%, respectively. Notably, they also present techniques and results that leverage constrained decoding, which our approach would also likely further benefit from.

For GEO, we find that when ablating all entity information in our model and copying tokens instead of entities, we achieve similar results as Jia and Liang (2016) when also ablating their data augmentation method, as shown in Table 3. This is expected, since when ablating entities completely, our architecture essentially reduces to the same sequence-to-sequence task setup. These results demonstrate the impact of conditioning on the entity candidates, as it improves performance even on the token copying setup. It appears that leveraging BERT can partly compensate for not conditioning on entity candidates, but combining BERT with our GNN approach and copying entities achieves 2.9% higher accuracy than using only a BERT encoder and copying tokens.

For ATIS, our results are outperformed by Wang et al. (2014) by 1.6%. Their approach uses hand-engineered templates to build a CCG lexicon. Some of these templates attempt to handle the specific types of ungrammatical utterances in the ATIS task.

SPIDER  For SPIDER, a relatively new dataset, there is less prior work. Competitive approaches have been specific to the text-to-SQL task (Xu et al., 2017; Yu et al., 2018a,b), incorporating task-specific methods to condition on table and column information, and incorporating SQL-specific structure when decoding. Our approach improves absolute accuracy by +7.3% relative to Yu et al. (2018b) without using any pre-trained language representations or constrained decoding. Our approach could also likely benefit from some of the other aspects of Yu et al. (2018b), such as more structured decoding, data augmentation, and using pre-trained representations (they use GloVe (Pennington et al., 2014)) for tokens, columns, and tables.

Our results were surprisingly worse when attempting to incorporate BERT. Of course, successfully incorporating pre-trained representations is not always straightforward. In general, we found using BERT within our architecture to be sensitive to learning rates and learning rate schedules. Notably, the evaluation setup for SPIDER is very different than training, as examples are for tables unseen during training. Models may not generalize well to unseen tables and columns. It's likely that successfully incorporating BERT for SPIDER would require careful tuning of hyperparameters specifically for the database split configuration.

Entity Spans and Relations  Results for ablating span relations between entities and tokens for GEO and ATIS are shown in Table 4. The impact is more significant for ATIS, which contains many queries with multiple entities of the same type, such as "nonstop flights seattle to boston", where disambiguating the origin and destination entities requires knowledge of which tokens they are associated with, given that we represent entities based only on their types for these tasks. We leave for future work consideration of edges between entity candidates that incorporate relevant domain knowledge for these tasks.

Edge Ablations           GEO    ATIS
GNN w/ edge vectors      89.3   87.1
  − entity span edges    88.6   34.2

Table 4: Results for ablating information about entity candidate spans for GEO and ATIS.

For SPIDER, results ablating relations between entities and tokens, and relations between entities, are shown in Table 5. This demonstrates the importance of entity relations, as they include useful information for disambiguating entities, such as which columns belong to which tables, and which columns have foreign key relations.

Edge Ablations               SPIDER
GNN w/ edge vectors          32.1
  − entity span edges        27.8
  − entity relation edges    26.3

Table 5: Results for ablating information about relations between entity candidates and tokens for SPIDER.

Edge Representations  Using additive edge vectors outperforms using learned edge matrix transformations for implementing f, across all tasks. While the vector formulation is less expressive, it also introduces far fewer parameters per edge type, which can be an important consideration given that our graph contains many similar edge labels, such as those representing similar relative positions between tokens. We leave further exploration of more expressive edge representations to future work. Another direction to explore is a heterogeneous formulation of the GNN sub-layer, that employs different formulations for different subsets of nodes, e.g. for tokens and entities.

6 Conclusions

We have presented an architecture for semantic parsing that uses a Graph Neural Network (GNN) to condition on a graph of tokens, entities, and their relations. Experimental results have demonstrated that this approach can achieve competitive results across a diverse set of tasks, while also providing a conceptually simple way to incorporate entities and their relations during parsing.

As future directions, we are interested in exploring constrained decoding, better incorporating pre-trained language representations within our architecture, conditioning on additional relations between entities, and different GNN formulations.

More broadly, we have presented a flexible approach for conditioning on available knowledge in the form of entities and their relations, and demonstrated its effectiveness for semantic parsing.

Acknowledgments

We would like to thank Karl Pichotta, Zuyao Li, Tom Kwiatkowski, and Dipanjan Das for helpful discussions. Thanks also to Ming-Wei Chang and Kristina Toutanova for their comments, and to all who provided feedback in draft reading sessions. Finally, we are grateful to the anonymous reviewers for their useful feedback.

References

D. Alvarez-Melis and T. Jaakkola. 2017. Tree structured decoding with doubly recurrent neural networks. In International Conference on Learning Representations (ICLR).

Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 47–52.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967.

Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

Daniel Beck, Gholamreza Haffari, and Trevor Cohn. 2018. Graph-to-sequence learning using gated graph neural networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 273–283.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Deborah A Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, pages 43–48. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 33–43.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In ACL.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272.

Omer Goldman, Veronica Latcinnik, Udi Naveh, Amir Globerson, and Jonathan Berant. 2017. Weakly-supervised semantic parsing with abstract examples. arXiv preprint arXiv:1711.05240.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1631–1640.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 140–149.

Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 12–22.

Rohit J Kate, Yuk Wah Wong, and Raymond J Mooney. 2005. Learning to transform natural to formal languages. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 1062. AAAI Press / MIT Press.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1545–1556.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1512–1523. Association for Computational Linguistics.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2016. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR).

Percy Liang, Michael I Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 599–609.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1139–1149.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), volume 2, pages 464–468.

Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231–4242.

Lappoon R Tang and Raymond J Mooney. 2000. Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13, pages 133–141. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations (ICLR).

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015a. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015b. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2773–2781.

Adrienne Wang, Tom Kwiatkowski, and Luke Zettlemoyer. 2014. Morpho-syntactic lexical generalization for CCG semantic parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1284–1295.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Chunyang Xiao, Marc Dymetman, and Claire Gardent. 2016. Sequence-based structured prediction for semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1341–1350.

Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. In International Conference on Learning Representations (ICLR).

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Yi Yang and Ming-Wei Chang. 2015. S-MART: Novel tree-based structured learning algorithms applied to tweet entity linking. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 504–513.

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 201–206.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 440–450.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), volume 2, pages 588–594.

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018b. SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task. arXiv preprint arXiv:1810.05237.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018c. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.

John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2, pages 1050–1055. AAAI Press.

Luke S Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 658–666. AUAI Press.

Luke S Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. EMNLP-CoNLL 2007, page 678.

Kai Zhao and Liang Huang. 2015. Type-driven incremental semantic parsing with polymorphism. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1416–1421.


A Supplemental Material

A.1 Entity Candidate Generator Details

In this section we provide details of how we constructed entity candidate generators for each task.

GEO  The annotator was constructed from the geobase database, which provides a list of geographical facts. For each entry in the database, we extracted the name as the entity alias and the type (e.g., "state", "city") as its attribute. Since not all cities used in the GEO query set are listed as explicit entries, we also used cities in the state entries. Finally, geobase has no entries for countries, so we added relevant aliases for "USA" with a special "country" attribute. There was 1 example where an entity in the logical form did not appear in the input, leading to the example being dropped from the training set.

In lieu of task-specific edge relations, we used binary edge labels between entities that captured which annotations span the same tokens. However, experiments demonstrated that these edges did not significantly affect performance. We leave consideration of other types of entity relations for these tasks to future work.

ATIS  We constructed a lexicon mapping natural language entity aliases in the dataset (e.g., "newark international", "5pm") to unique entity identifiers (e.g., "ewr:ap", "1700:ti"). For ATIS, this lexicon required some manual construction. Each entity identifier has a two-letter suffix (e.g., "ap", "ti") that maps it to a single attribute (e.g., "airport", "time"). We allowed overlapping entity mentions when the entities referred to different entity identifiers. For instance, in the query span "atlanta airport", we include both the city of Atlanta and the Atlanta airport.

Notably, there were 9 examples where one of the entities used as an argument in the logical form did not have a corresponding mention in the input utterance. From manual inspection, many of the dropped examples appear to have incorrectly annotated logical forms. These examples were dropped from the training set or marked as incorrect if they appeared in the test set.

We use the same binary edge labels between entities as for GEO.

SPIDER  For SPIDER we generalize our notion of entities to consider tables and columns as entities. We attempt to determine spans for each table and column by computing normalized Levenshtein distance between table and column names and unigrams or bigrams in the utterance. The best alignment having a score > 0.75 is selected, and we use these generated alignments to populate the edges between tokens and entity candidates.
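A rough sketch of this alignment step. The appendix specifies only a normalized Levenshtein score with a 0.75 threshold over unigrams and bigrams; the normalization and helper names below are our own assumptions.

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def best_alignment(name, utterance_tokens, threshold=0.75):
    """Return (score, (start, end)) of the best-matching unigram or bigram above the threshold."""
    best = None
    for n in (1, 2):
        for i in range(len(utterance_tokens) - n + 1):
            ngram = " ".join(utterance_tokens[i:i + n])
            # Assumed normalization: 1 - edit_distance / max length of the compared strings.
            score = 1.0 - levenshtein(name.lower(), ngram.lower()) / max(len(name), len(ngram))
            if score > threshold and (best is None or score > best[0]):
                best = (score, (i, i + n))
    return best
```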

We generate a set of attributes for the table based on unigrams in the table name, and an attribute to identify the entity as a table. Likewise, for columns, we generate a set of attributes based on unigrams in the column name, as well as an attribute to identify the value type of the column. We also include attributes indicating whether an alignment was found between the entity and the input text.

We include 3 task-specific edge label types between entity candidates to denote bi-directional relations between column entities and the table entity they belong to, and to denote the presence of a foreign key relationship between columns.