arXiv:1909.04303v2 [cs.CL] 12 Sep 2019


Core Semantic First: A Top-down Approach for AMR Parsing∗

Deng Cai
The Chinese University of Hong Kong
[email protected]

Wai Lam
The Chinese University of Hong Kong

[email protected]

Abstract

We introduce a novel scheme for parsing a piece of text into its Abstract Meaning Representation (AMR): Graph Spanning based Parsing (GSP). One novel characteristic of GSP is that it constructs a parse graph incrementally in a top-down fashion. Starting from the root, at each step a new node and its connections to existing nodes are jointly predicted. The output graph spans the nodes by their distance to the root, following the intuition of first grasping the main ideas and then digging into details. This core-semantic-first principle emphasizes capturing the main ideas of a sentence, which is of great practical interest. We evaluate our model on the latest AMR sembank and achieve state-of-the-art performance without relying on any heuristic graph re-categorization. More importantly, the experiments show that our parser is especially good at obtaining the core semantics.

1 Introduction

Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a semantic formalism that encodes the meaning of a sentence as a rooted labeled directed graph. As illustrated by the example in Figure 1, AMR abstracts away from surface forms in text: the root serves as a rudimentary representation of the overall focus, while the details are elaborated as the depth of the graph increases. AMR has proved useful for many downstream NLP tasks, including text summarization (Liu et al., 2015; Hardy and Vlachos, 2018) and question answering (Mitra and Baral, 2016).

The task of AMR parsing is to map natural language strings to AMR semantic graphs automatically.

∗The work described in this paper is substantially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14204418). The first author is grateful for the discussions with Zhisong Zhang and Zhijiang Guo.

Figure 1: AMR for the sentence "During a time of prosperity and happiness, such a big earthquake suddenly struck.", where the subgraphs close to the root represent the core semantics.

Compared to constituent parsing (Zhang and Clark, 2009) and dependency parsing (Kubler et al., 2009), AMR parsing is considered more challenging due to the following characteristics: (1) the nodes in AMR have no explicit alignment to text tokens; (2) the graph structure is more complicated because of frequent reentrancies and non-projective arcs; (3) there is a large and sparse vocabulary of possible node types (concepts).

Many methods for AMR parsing have been developed in the past years, which can be categorized into three main classes. Graph-based parsing (Flanigan et al., 2014; Lyu and Titov, 2018) uses a pipeline design for concept identification and relation prediction. Transition-based parsing (Wang et al., 2016; Damonte et al., 2017; Ballesteros and Al-Onaizan, 2017; Guo and Lu, 2018; Liu et al., 2018; Wang and Xue, 2017) processes a sentence from left to right and constructs the graph incrementally. The third class is seq2seq-based parsing (Barzdins and Gosko, 2016; Konstas et al., 2017; van Noord and Bos, 2017), which views parsing as sequence-to-sequence transduction over a linearization (depth-first traversal) of the AMR graph.


While existing graph-based models cannot sufficiently model the interactions between individual decisions, the autoregressive nature of transition-based and seq2seq-based models makes them suffer from error propagation, where later decisions can easily go awry, especially given the complexity of AMR. Since capturing the core semantics of a sentence is arguably more important and useful in practice, it is desirable for a parser to have a global view and a priority for capturing the main ideas first. In fact, AMR graphs are organized in a hierarchy in which the core semantics stay close to the root, so a top-down parsing scheme can fulfill these desiderata. For example, in Figure 1, the subgraph in the red box already conveys the core meaning "an earthquake suddenly struck at a particular time", and the subgraph in the blue box further informs that "the earthquake was big" and "the time was of prosperity and happiness".

We propose a novel framework for AMR parsing known as Graph Spanning based Parsing (GSP). One novel characteristic of GSP is that, to our knowledge, it is the first top-down AMR parser.1 GSP performs parsing in an incremental, root-to-leaf fashion, yet maintains a global view of the sentence and the previously derived graph. At each step, it generates the connecting arcs between the existing nodes and the incoming new node, upon which the type of the new node (concept) is jointly decided. The output graph spans the nodes by their distance to the root, following the intuition of first grasping the main ideas and then digging into details. Compared to previous graph-based methods, our model is capable of capturing more complicated intra-graph interactions, while reducing the number of parsing steps to be linear in the sentence length.2 Compared to transition-based methods, our model removes the left-to-right restriction and avoids sophisticated oracle design for handling the complexity of AMR graphs.

Notably, most existing methods, including the state-of-the-art parsers, often rely on heavy graph re-categorization to reduce the complexity of the original AMR graphs. In graph re-categorization, specific subgraphs of AMR are grouped together and assigned to a single node with a new compound category (Werling et al., 2015; Wang and Xue, 2017; Foland and Martin, 2017; Lyu and Titov, 2018; Groschwitz et al., 2018; Guo and Lu, 2018). The hand-crafted rules for re-categorization are often non-trivial, requiring exhaustive screening and expert-level manual effort. For instance, in the re-categorization system of Lyu and Titov (2018), the graph fragment "temporal-quantity -:ARG3-of-> rate-entity-91 -:unit-> year -:quant-> 1" is replaced by one single nested node "rate-entity-3(annual-01)". There are hundreds of such manual heuristic rules. This kind of re-categorization has been shown to have considerable effects on performance (Wang and Xue, 2017; Guo and Lu, 2018). However, one issue is that the precise set of re-categorization rules differs among models, making it difficult to tell whether performance improvements come from model optimization or from carefully designed rules. In fact, some parsers become totally infeasible without this re-categorization step. For example, the parser of Lyu and Titov (2018) requires tight integration with this step, as it is built on the assumption that an injective alignment exists between sentence tokens and graph nodes.

1 Depth-first traversal in seq2seq models does not produce a strictly top-down order due to the reentrancies in AMR.

2 Since the size of an AMR graph is approximately linear in the length of the sentence.

We evaluate our parser on the latest AMR sembank and achieve results competitive with the state-of-the-art models. The result is remarkable since our parser directly operates on the original AMR graphs and requires no manual effort for graph re-categorization. The contributions of our work are summarized as follows:

• We propose a new method for AMR parsing that produces high-quality core semantics.

• Without the help of heuristic graph re-categorization, which requires expensive expert-level manual effort for designing re-categorization rules, our method achieves state-of-the-art performance.

2 Related Work

Currently, most AMR parsers can be categorized into three classes. (1) Graph-based methods (Flanigan et al., 2014, 2016; Werling et al., 2015; Foland and Martin, 2017; Lyu and Titov, 2018; Zhang et al., 2019) adopt a pipeline approach to graph construction: they first map continuous text spans into AMR concepts, then calculate the scores of possible edges and use a maximum spanning connected subgraph algorithm to select the final graph. The major deficiency is that concept identification and relation prediction are strictly performed in order, even though the interactions between them should benefit both sides (Zhou et al., 2016). In addition, for computational efficiency, usually only first-order information is considered for edge scoring. (2) Transition-based methods (Wang et al., 2016; Damonte et al., 2017; Wang and Xue, 2017; Ballesteros and Al-Onaizan, 2017; Liu et al., 2018; Peng et al., 2018; Guo and Lu, 2018; Naseem et al., 2019) borrow techniques from shift-reduce dependency parsing. Yet the non-trivial nature of AMR graphs (e.g., reentrancies and non-projective arcs) makes the transition system even more complicated and difficult to train (Guo and Lu, 2018). (3) Seq2seq-based methods (Barzdins and Gosko, 2016; Peng et al., 2017; Konstas et al., 2017; van Noord and Bos, 2017) treat AMR parsing as a sequence-to-sequence problem by linearizing AMR graphs, so that existing seq2seq models (Bahdanau et al., 2014; Luong et al., 2015) can be readily utilized. Despite its simplicity, the performance of current seq2seq models lags behind when training data is limited: seq2seq models are often less effective on smaller datasets, and the linearized AMRs make it harder to exploit graph structure information.

There are also some notable exceptions. Peng et al. (2015) introduce a synchronous hyperedge replacement grammar solution. Pust et al. (2015) regard the task as a machine translation problem, while Artzi et al. (2015) adapt combinatory categorial grammar. Groschwitz et al. (2018) and Lindemann et al. (2019) view AMR graphs as terms of the AM algebra.

Most AMR parsers require an explicit alignment between tokens in the sentence and nodes in the AMR graph during training. Since such information is not annotated, a pre-trained aligner (Flanigan et al., 2014; Pourdamghani et al., 2014; Liu et al., 2018) is often required. More recently, Lyu and Titov (2018) demonstrate that the alignments can be treated as latent variables in a joint probabilistic model.

3 Background and Overview

3.1 Background of Multi-head Attention

The multi-head attention mechanism introduced by Vaswani et al. (2017) is used as a basic building block in our framework. Multi-head attention consists of H attention heads, each of which learns a distinct attention function. Given a query vector x and a set of vectors {y1, y2, . . . , ym}, or y1:m for short, each attention head projects x and y1:m into distinct query, key, and value representations q ∈ R^d, K ∈ R^{m×d}, and V ∈ R^{m×d} respectively, where d is the dimension of the vector space. It then performs scaled dot-product attention (Vaswani et al., 2017):

a = softmax(Kq / √d)
attn = aV

where a ∈ R^m is the attention vector (a distribution over all inputs y1:m) and attn is the weighted sum of the value vectors. Finally, the outputs of all attention heads are concatenated and projected back to the original dimension of x. For brevity, we denote the whole attention procedure described above as a function T(x, y1:m).
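To make the notation concrete, here is a minimal NumPy sketch of the attention function T(x, y1:m) described above. The per-head projection matrices Wq, Wk, Wv and the output projection Wo are illustrative assumptions; the paper does not prescribe this exact parameterization.

```python
import numpy as np

def multi_head_attention(x, Y, Wq, Wk, Wv, Wo):
    """A minimal sketch of the attention function T(x, y_{1:m}).

    x  : query vector, shape (D,)
    Y  : memory vectors y_{1:m}, shape (m, D)
    Wq, Wk, Wv : per-head projections, shape (H, D, d)
    Wo : output projection, shape (H * d, D)
    """
    H, D, d = Wq.shape
    head_outputs = []
    for h in range(H):
        q = x @ Wq[h]                          # query, shape (d,)
        K = Y @ Wk[h]                          # keys, shape (m, d)
        V = Y @ Wv[h]                          # values, shape (m, d)
        scores = K @ q / np.sqrt(d)            # scaled dot-product scores, (m,)
        a = np.exp(scores - scores.max())
        a = a / a.sum()                        # attention distribution over y_{1:m}
        head_outputs.append(a @ V)             # weighted sum of values, (d,)
    # concatenate the H heads and project back to the dimension of x
    return np.concatenate(head_outputs) @ Wo
```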

Based on multi-head attention, the Transformer encoder (Vaswani et al., 2017) uses self-attention for context information aggregation when given a set of vectors (e.g., word embeddings in a sentence or node embeddings in a graph).

3.2 Overview

Figure 2 depicts the major neural components in our proposed framework. The Sentence Encoder component and the Graph Encoder component are designed for token-level sentence representation and node-level graph representation respectively. Given an input sentence w = (w1, w2, . . . , wn), where n is the sentence length, the Sentence Encoder first reads the whole sentence and encodes each word wi into a hidden state si. The initial graph G0 always consists of one dummy node d∗, and each previously generated concept cj is encoded into a hidden state vj by the Graph Encoder.

At each time step t, the Focus Selection component repeatedly reads both the sentence representation s1:n and the graph representation v0:t−1 of Gt−1, and generates the initial parser state ht.


Figure 2: Model architecture of GSP, together with the decoding procedure at time step t, where the read and write operations around the parser state ht follow the order (1) → (2) → (3) → (4) → (5).

The parser state carries the most useful information and serves as a writable memory during the expansion step. Next, the Relation Identification component decides which specific head nodes to expand by computing k head-specific attention score vectors a^{g_i}_t (i = 1, . . . , k) over the existing nodes; new arcs are generated according to these attention scores. Then the Concept Prediction component updates the parser state ht with the arc information, computes an attention vector a^s_t over the sentence, and accordingly chooses a specific part of it to generate the new concept ct. Finally, the Relation Classification component predicts the relation labels between the newly generated concept and its predecessors. Consequently, an updated graph Gt is produced, and Gt will be processed at the next time step. The whole decoding procedure terminates when the newly generated concept is the special stop concept.

Our method expands the graph in a root-to-leaf fashion: nodes with shorter distances to the root are introduced first. This follows a way similar to how humans grasp the meaning of a sentence: first seeking the main concepts, then proceeding to the sub-structures governed by certain head concepts (Banarescu et al., 2013).

During training, we use breadth-first search to decide the order of nodes. However, for nodes with multiple children, there still exist multiple valid orderings. In order to define a deterministic decoding process, we sort sibling nodes by their relations to the head node, as sketched below. We present more discussion on the choice of sibling order in Section 5.3.
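As a concrete illustration, the following sketch (under our own assumptions about the data structures; relation_freq is a hypothetical table of relation counts collected from the training set) shows one way to compute the breadth-first node order with siblings sorted by relation frequency.

```python
from collections import deque

def bfs_node_order(graph, root, relation_freq):
    """Order graph nodes for training: breadth-first from the root,
    with siblings sorted by the frequency of their relation to the head.

    graph         : dict mapping a node to a list of (relation, child) pairs.
    relation_freq : dict mapping a relation label to its training-set count.
    """
    order, visited = [], {root}
    queue = deque([root])
    while queue:
        head = queue.popleft()
        order.append(head)
        # more frequent relations are expanded first
        children = sorted(graph.get(head, []),
                          key=lambda rc: -relation_freq.get(rc[0], 0))
        for relation, child in children:
            if child not in visited:   # reentrant nodes are enqueued only once
                visited.add(child)
                queue.append(child)
    return order
```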

4 Framework Description

4.1 Sentence & Graph Representation

A Transformer encoder architecture is employed for both the Sentence Encoder and the Graph Encoder components. For sentence encoding, a special token is prepended to the input word sequence; its final hidden state s0 is regarded as an aggregated summary of the whole sentence and is used as the initial state in parsing steps.

The Graph Encoder component takes the previously generated concept sequence (c0, c1, . . . , ct−1) as input, where c0 is the dummy node d∗. For computational efficiency and to reduce error propagation, instead of encoding the edge information explicitly, we use the Transformer encoder to capture the interactions between nodes. Finally, the encoder outputs a sequence of node representations (v0, v1, . . . , vt−1).

4.2 Focus Selection

At each time step t, the Focus Selection component reads the sentence and the partially constructed graph repeatedly, gradually locating and collecting the most relevant information for the next expansion. We simulate the repeated reading by multiple levels of attention. Formally, the following recurrence is applied L times:

x_t^{(l+1),1} = LN( h_t^{(l)} + T^{(l+1),1}( h_t^{(l)}, s_{1:n} ) )
x_t^{(l+1),2} = LN( x_t^{(l+1),1} + T^{(l+1),2}( x_t^{(l+1),1}, v_{0:t-1} ) )
h_t^{(l+1)} = max(0, x_t^{(l+1),2} W_1^{(l+1)} + b_1^{(l+1)}) W_2^{(l+1)} + b_2^{(l+1)}

where T(·, ·) is the multi-head attention function, LN is layer normalization (Ba et al., 2016), and h_t^{(0)} is always initialized with s0. For clarity, we denote the last hidden state h_t^{(L)} as ht, the parser state at time step t. We now proceed to present the details of each decision stage of one parsing step, which is also illustrated in Figure 2.
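A minimal sketch of this recurrence, under illustrative assumptions: attend(x, Y, l, slot) stands for the layer-specific multi-head attention T^{(l),slot}, layer_norm implements LN, and ffn_weights holds the position-wise feed-forward parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # simple layer normalization over the feature dimension
    return (x - x.mean()) / (x.std() + eps)

def focus_selection(s0, s, v, attend, ffn_weights, num_layers):
    """Compute the parser state h_t by repeatedly attending to the
    sentence states s (n x D) and the graph states v (t x D).

    attend(x, Y, l, slot) is assumed to implement T^{(l),slot}(x, Y).
    ffn_weights[l] = (W1, b1, W2, b2) for the position-wise feed-forward net.
    """
    h = s0                                            # h_t^{(0)} is initialized with s0
    for l in range(num_layers):
        x1 = layer_norm(h + attend(h, s, l, 1))       # read the sentence
        x2 = layer_norm(x1 + attend(x1, v, l, 2))     # read the partial graph
        W1, b1, W2, b2 = ffn_weights[l]
        h = np.maximum(0.0, x2 @ W1 + b1) @ W2 + b2   # position-wise feed-forward
    return h                                          # parser state h_t
```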

4.3 Relation Identification

Our Relation Identification component is inspired by a recent attempt at exposing auxiliary supervision on the attention mechanism (Strubell et al., 2018). It can be considered as another attention layer over the existing graph, except that the attention weights explicitly indicate the likelihood of the new node being attached to a specific existing node. In other words, its aim is to answer the question of where to expand. Since a node can be attached to multiple nodes by playing different semantic roles, we utilize multi-head attention and take the maximum over different heads as the final arc probabilities.

Formally, through a multi-head attention mechanism taking ht and v0:t−1 as input, we obtain a set of attention weights a^{g_i}_t (i = 1, . . . , k), where k is the number of attention heads and a^{g_i}_t is the i-th probability vector. The probability of an arc between the new node and node vj is then computed as a^g_{t,j} = max_i a^{g_i}_{t,j}. Intuitively, each head is in charge of a set of possible relations (though not explicitly specified). If certain relations do not exist between the new node and any existing node, the probability mass is assigned to the dummy node d∗. The maximum pooling reflects that an arc should be built once any one relation is activated.3

The attention mechanism passes the arc decisions to later layers by updating the parser state as follows:

h_t = LN( h_t + W^{arc} Σ_{j=0}^{t-1} a^g_{t,j} v_j )

3 We also found that there may exist more than one relation between two distinct nodes; however, this rarely happens.
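The arc decisions can be sketched as follows, under illustrative assumptions: head_distributions collects the k per-head attention distributions over the existing nodes, and layer_norm and W_arc mirror the update equation above.

```python
import numpy as np

def arc_probabilities(head_distributions):
    """Max-pool the k per-head attention distributions into arc probabilities.

    head_distributions: array of shape (k, t); row i is a^{g_i}_t, a distribution
    over the t existing nodes (index 0 is the dummy node d*).
    Returns a^g_t of shape (t,): a^g_{t,j} is the probability of an arc between
    the new node and node v_j.
    """
    return head_distributions.max(axis=0)

def update_parser_state(h, node_states, arc_probs, W_arc, layer_norm):
    """h_t = LN(h_t + W^arc * sum_j a^g_{t,j} v_j): pass arc decisions onward."""
    pooled = arc_probs @ node_states     # (t,) @ (t, D) -> attention-pooled node vector
    return layer_norm(h + pooled @ W_arc)  # apply W^arc (D x D), residual, then LN
```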

4.4 Concept Prediction

Our Concept Prediction component uses a soft alignment between the words and the new concept. Concretely, a single-head attention a^s_t is computed based on the parser state ht and the sentence representation s1:n, where a^s_{t,i} denotes the attention weight of word wi at the current time step. This component then updates the parser state with the alignment information via the following equation:

h_t = LN( h_t + W^{conc} Σ_{i=1}^{n} a^s_{t,i} s_i )

The probability of generating a specific concept c from the concept vocabulary V is calculated as

gen(c | h_t) = exp(x_c^T h_t) / Σ_{c'∈V} exp(x_{c'}^T h_t),

where x_c (for c ∈ V) denotes model parameters. To address the data sparsity issue in concept prediction, we introduce a copy mechanism in a similar spirit to Gu et al. (2016). Besides generation, our model can either directly copy an input token wi (e.g., for entity names) or map wi to one concept m(wi) according to the alignment statistics4 in the training data (e.g., for "went", it would propose go). Formally, the prediction probability of a concept c is given by:

P(c | h_t) = P(copy | h_t) Σ_{i=1}^{n} a^s_{t,i} [[w_i = c]]
           + P(map | h_t) Σ_{i=1}^{n} a^s_{t,i} [[m(w_i) = c]]
           + P(gen | h_t) gen(c | h_t)

where [[·]] is the indicator function. P(copy | h_t), P(map | h_t), and P(gen | h_t) are the probabilities of the three prediction modes, computed by a single-layer neural network with softmax activation.

4 Based on the alignments provided by Liu et al. (2018), for each word, the most frequently aligned concept (or its lemma if it has an empty alignment) is used for direct mapping.
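A minimal sketch of the copy/map/generate mixture, under illustrative assumptions about its inputs: the mode probabilities, the attention vector a^s_t over tokens, and the vocabulary distribution gen(· | h_t) are assumed to be computed by the components described above.

```python
def concept_probability(concept, words, mapped, attn, mode_probs, gen_probs, vocab_index):
    """P(c | h_t) as a mixture of copy, map, and generate modes.

    words      : input tokens w_1..w_n
    mapped     : m(w_1)..m(w_n), the concept each token maps to by alignment statistics
    attn       : a^s_t, attention weights over the n tokens (sums to 1)
    mode_probs : dict with keys 'copy', 'map', 'gen' (softmax over the three modes)
    gen_probs  : list/array giving the generation distribution over the vocabulary
    vocab_index: dict mapping a concept to its index in gen_probs
    """
    copy_score = sum(a for w, a in zip(words, attn) if w == concept)
    map_score = sum(a for m, a in zip(mapped, attn) if m == concept)
    idx = vocab_index.get(concept)
    gen_score = gen_probs[idx] if idx is not None else 0.0
    return (mode_probs['copy'] * copy_score
            + mode_probs['map'] * map_score
            + mode_probs['gen'] * gen_score)
```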

4.5 Relation Classification

Lastly, the Relation Classification component employs a multi-class classifier for labeling the arcs detected by the Relation Identification component. The classifier uses a biaffine function to score each label, given the head concept representation vi and the child vector ht as input:

e^i_t = h_t^T W v_i + U^T h_t + V^T v_i + b

where W, U, V, and b are model parameters. As suggested by Dozat and Manning (2016), we project vi and ht to a lower dimension to reduce the computation cost and avoid overfitting. The label probabilities are computed by a softmax function over all label scores.
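A sketch of a biaffine label scorer of this form, under illustrative assumptions about parameter shapes (one bilinear form per relation label; the low-dimensional projections of vi and ht are assumed to have been applied already).

```python
import numpy as np

def biaffine_label_probabilities(h, v, W, U, V, b):
    """Score every relation label for the arc between head v_i and child c_t.

    h : child (new concept) vector, shape (d,)
    v : head concept vector, shape (d,)
    W : shape (L, d, d)  -- one bilinear form per label
    U : shape (L, d)     -- child-only term
    V : shape (L, d)     -- head-only term
    b : shape (L,)       -- per-label bias
    Returns the label probability distribution of shape (L,).
    """
    bilinear = np.einsum('d,lde,e->l', h, W, v)   # h^T W v for each label
    scores = bilinear + U @ h + V @ v + b
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                        # softmax over all label scores
```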

4.6 Reentrancies

AMR reentrancy occurs when a node participates in multiple semantic relations (i.e., has multiple parent nodes); this is why AMRs are graphs rather than trees. Reentrancies are often hard to treat. While previous work often either removes them (Guo and Lu, 2018) or relies on rule-based restoration in a postprocessing stage (Lyu and Titov, 2018; van Noord and Bos, 2017), our model provides a new and principled way to deal with reentrancies: when a new node is generated, all its connections to already existing nodes are determined by the multi-head attention. For example, for a node with k parent nodes, k different heads will point to those parent nodes respectively. For a better understanding of our model, pseudocode is presented in Algorithm 1.

4.7 Training and Inference

Our model is trained to maximize the log likelihood of the gold AMR graphs given the sentences, i.e., log P(G | w), which can be factorized as:

log P(G | w) = Σ_{t=1}^{m} ( log P(c_t | G_{t-1}, w)
             + Σ_{i ∈ pred(t)} log P(arc_{it} | G_{t-1}, w)
             + Σ_{i ∈ pred(t)} log P(rel_{arc_{it}} | G_{t-1}, w) )

where m is the total number of vertices. The set of predecessor nodes of c_t is denoted as pred(t), arc_{it} denotes the arc between c_i and c_t, and rel_{arc_{it}} indicates the arc label (relation type).
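The factorization translates directly into a per-step training objective; the sketch below assumes the per-step log-probabilities have already been computed by the components above (the field names are illustrative).

```python
def graph_log_likelihood(steps):
    """Sum the factorized log-probabilities over all m expansion steps.

    steps: list of dicts, one per generated concept c_t, with keys
      'concept_logp' : log P(c_t | G_{t-1}, w)
      'arc_logps'    : {i: log P(arc_it | G_{t-1}, w) for i in pred(t)}
      'label_logps'  : {i: log P(rel_arc_it | G_{t-1}, w) for i in pred(t)}
    """
    total = 0.0
    for step in steps:
        total += step['concept_logp']
        total += sum(step['arc_logps'].values())
        total += sum(step['label_logps'].values())
    return total  # maximize this (or minimize its negative) during training
```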

As mentioned, GSP is an autoregressive model, like seq2seq and transition-based models, but it factors the distribution according to a top-down graph structure rather than a depth-first traversal or a left-to-right chain. Meanwhile, GSP has a clear separation of node, arc, and relation-label probabilities, which interact in a more interpretable and tighter manner.

Algorithm 1 Graph Spanning based Parsing

Input: the input sentence w = (w1, w2, . . . , wn)
Output: the AMR graph G corresponding to w

▷ Learning sentence representation
1: w = (w0) + (w1, w2, . . . , wn), where w0 is a special sentence-initial token
2: s0, s1, s2, . . . , sn = Transformer(w)
▷ Initialization
3: initialize the graph G0 with the dummy node (c0 = d∗)
4: initialize the time step t = 1
▷ Main spanning loop
5: while True do
6:   v0, . . . , vt−1 = Transformer(c0:t−1)
7:   ht = FocusSelection(s0, v0:t−1, s1:n)
8:   ht = RelationIdentification(ht, v0:t−1)    ▷ decide the parent nodes pred(t) of ct
9:   ht = ConceptPrediction(ht, s1:n)           ▷ decide the concept type of ct
10:  if ct is the stop concept then
11:    break
12:  end if
13:  for i ∈ pred(t) do
14:    RelationClassification(ht, vi)           ▷ decide the edge label between ct and ci
15:  end for
16:  update Gt−1 to Gt and set t = t + 1
17: end while
18: return Gt−1

At test time, the prediction for an input w is obtained via G = argmax_{G'} P(G' | w). Rather than iterating over all possible graphs, we adopt beam search to approximate the best graph. Specifically, for each partially constructed graph, we only consider the top-K concepts with the best single-step probability (a product of the corresponding concept, arc, and relation label probabilities), where K is the beam size. Only the best K graphs at each time step are kept for the next expansion.
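A schematic sketch of this beam search, under illustrative assumptions: initial is the graph containing only the dummy node, expand(graph) is a hypothetical function returning the top-K single-step extensions with their log-probabilities, and is_complete(graph) checks whether the last predicted concept is the stop concept.

```python
import heapq

def beam_search(initial, expand, is_complete, beam_size, max_steps=100):
    """Keep the best K partial graphs at every step; return the best complete one.

    expand(graph)      -> list of (next_graph, step_logprob) for the top-K extensions
    is_complete(graph) -> True if the last predicted concept is the stop concept
    """
    beam = [(0.0, initial)]          # (cumulative log-probability, partial graph)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for logp, graph in beam:
            if is_complete(graph):
                finished.append((logp, graph))   # completed graphs leave the beam
                continue
            for next_graph, step_logp in expand(graph):
                candidates.append((logp + step_logp, next_graph))
        if not candidates:
            break
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    finished.extend(beam)            # fall back to partial graphs if none finished
    return max(finished, key=lambda c: c[0])[1]
```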

5 Experiments

5.1 Setup

We focus on the most recent LDC2017T10 dataset, as it is the largest AMR corpus. It consists of 36521, 1368, and 1371 sentences in the training, development, and testing sets respectively.

We use Stanford CoreNLP (Manning et al., 2014) for text preprocessing, including tokenization, lemmatization, part-of-speech tagging, and named-entity tagging. The input to the sentence encoder consists of randomly initialized lemma, part-of-speech tag, and named-entity tag embeddings, as well as the output of a learnable CNN over character embeddings. The graph encoder uses randomly initialized concept embeddings and another character-level CNN. Model hyper-parameters are chosen by experiments on the development set; the detailed settings are provided in the Appendix. During testing, we use a beam size of 8 for generating graphs.5

Conventionally, the quality of AMR parsing results is evaluated using the Smatch tool (Cai and Knight, 2013), which seeks the maximum number of overlapping triples between two AMR annotations after decomposing the AMR graphs into triples. However, the ordinary Smatch metric treats all triples equally, regardless of their roles in the composition of the whole sentence meaning. We refine the ordinary Smatch metric to take into consideration the notion of core semantics. Specifically, we compute:

• Smatch-weighted: This metric weights different triples by their importance for composing the core ideas. The root distance d of a triple is defined as the minimum root distance of the nodes it involves, and the weight of the triple is then computed as:

w = max(dthr - d, 1)

In other words, the weight decays linearly in the root distance until dthr. If two triples are matched, the minimum of their importance scores is used. In our experiments, dthr is set to 5.

• Smatch-core: This metric only compares the subgraphs representing the main meaning. Precisely, we truncate the AMR graphs by setting a maximum root distance dmax and only keep the nodes and edges within this threshold. dmax is set to 4 in our experiments; the remaining subgraphs still have broad coverage of the original meaning, as illustrated by the distribution of root distances in Figure 3.

Besides, we also evaluate quality by computing the following metrics.

5 Our code can be found at https://github.com/jcyk/AMR-parser.

Figure 3: The distribution of root distance of concepts in the test set (x-axis: root distance from 0 to ≥7; y-axis: percentage).

• complete-match (CM): This metric counts the number of parsing results that are completely correct.

• root-accuracy (RA): This metric measures the accuracy of root concept identification.

5.2 Main Results and Case Study

The main results are presented in Table 1. We compare our method with the best-performing models in each category, as discussed in Section 2.

Concretely, van Noord and Bos (2017) is a character-level seq2seq model that achieves a very competitive result. However, their model is data demanding, as it is trained on an additional 100K sentence-AMR pairs generated by other parsers. Guo and Lu (2018) is a transition-based parser with a refined search space for AMR; certain concepts and relations (e.g., reentrancies) are removed to reduce the burden during training. Lyu and Titov (2018) is a graph-based method that achieves the best reported result under the ordinary Smatch metric. Their parser uses different LSTMs for concept prediction, relation identification, and root identification sequentially, and the relation identification stage has a time complexity of O(m^2 log m), where m is the number of concepts. Groschwitz et al. (2018) view AMR as terms of the AM algebra (Groschwitz et al., 2017), which allows standard tree-based parsing techniques to be applied; the complexity of their projective decoder is O(m^5). Last but not least, all of these models except that of van Noord and Bos (2017) require hand-crafted heuristics for graph re-categorization.

We consider the Smatch-weighted metric the most suitable for measuring a parser's quality in capturing core semantics. The comparison shows that our method significantly outperforms all other methods. The Smatch-core metric also demonstrates the advantage of our method in capturing the core ideas.


Model                            Graph Re-ca.   Smatch-weighted   Smatch-core   Smatch-ordinary   RA(%)   CM(%)
Buys and Blunsom (2017)          No             -                 -             61.9              -       -
van Noord and Bos (2017) + 100K  No             68.8              67.6          71.0              75.8    10.2
Guo and Lu (2018)                Yes            63.5              62.3          69.8              63.6    9.4
Lyu and Titov (2018)             Yes            66.6              67.1          74.4              59.1    10.2
Groschwitz et al. (2018)         Yes            -                 -             71.0              -       -
Ours                             No             71.3              70.2          73.2              76.9    11.6

Table 1: Comparison with state-of-the-art methods (results on the test set). Smatch columns are in %. Models relying on heuristic rules for graph re-categorization are marked "Yes" in the Graph Re-ca. column.

Figure 4: Case study. Input: "The solution is functional human patterns and a balance between transport system capacity and land use generated travel demand." The figure compares the gold AMR with the outputs of our parser (Ours) and of Lyu and Titov (2018)'s parser (L'18).

Besides, our model achieves the highest root-accuracy (RA) and complete-match (CM), which further confirms the usefulness of a global view and the core-semantic-first principle.

Even evaluated by the ordinary Smatch metric, our model yields better results than all previously reported models except Lyu and Titov (2018), which relies on a tremendous amount of manual heuristics for designing graph re-categorization rules and adopts a pipeline approach. Note that our parser constructs the AMR graph in an end-to-end fashion with a better (quadratic) time complexity.

We present a case study in Figure 4, comparing our output to that of Lyu and Titov (2018)'s parser. As seen, both parsers make some mistakes. Specifically, our method fails to identify the concept generate-01. While Lyu and Titov (2018)'s parser successfully identifies it, it mistakenly treats it as the root of the whole AMR, a serious drawback that causes the sentence meaning to be interpreted in a wrong way. In contrast, our method shows a strong capacity for capturing the main idea "the solution is about some patterns and a balance".

Figure 5: Smatch-core scores under different maximum root distances (x-axis: maximum root distance from 0 to full; y-axis: Smatch-core). vN'17 is van Noord and Bos (2017)'s parser with 100K additional training pairs, GL'18 is Guo and Lu (2018)'s parser, and L'18 is Lyu and Titov (2018)'s parser.

However, on the ordinary Smatch metric, their graph obtains a higher score (68% vs. 66%), which indicates that the ordinary Smatch is not a proper metric for evaluating the quality of capturing core semantics. If we adopt the Smatch-weighted metric, our method achieves the better score (74% vs. 61%).

5.3 More Results

To reveal our parser's ability to grasp meanings at different levels of granularity, we plot the Smatch-core scores in Figure 5 while varying the maximum root distance dmax, comparing with several strong baselines and the state-of-the-art model. The plot demonstrates that our method is better at abstracting the core ideas of a sentence.

As discussed in Section 3, there can be multiple valid generation orders for sibling nodes in an AMR graph. We experiment with the following variants: (1) random, which sorts sibling nodes in completely random order; (2) relation freq., which sorts sibling nodes according to their relations to the head node, assigning higher priority to relations that occur more frequently, so that our parser always seeks the most common relation first; and (3) combined, which mixes the two strategies by using random and relation freq. with equal chance. As seen in Table 2, the deterministic order strategy (relation freq.) achieves better performance than the random order. Interestingly, the combined strategy significantly boosts performance.6 The reason is that the random order potentially produces a larger set of training pairs, since each random ordering can be considered a different training pair, while the deterministic order stabilizes maximum likelihood training. The combined strategy therefore benefits from both worlds.

6 Conclusion and Future Work

We presented the first top-down AMR parser. Our proposed parser builds an AMR graph incrementally in a root-to-leaf manner. Experiments show that our method has a better capability of capturing the core semantics of a sentence compared with previous state-of-the-art methods. In addition, we remove the need for the heuristic graph re-categorization employed in most previous work, which makes our method much more transferable to other semantic representations or languages.

Our method follows the intuition that humans tend to grasp the core meaning of a sentence first. However, some cognitive theories (Langacker, 2008) also suggest that human language understanding is often a circular, abductive process (the hermeneutic circle). It would be interesting to explore revision mechanisms for when the initial steps go wrong.

6 We note that there are many other ways to generate a deterministic order. For example, van Noord and Bos (2017) use the order of the aligned words in the sentence. We use the relation-frequency method for its simplicity and because it does not rely on external resources (e.g., an aligner).

Order           Smatch-weighted   Smatch-core   Smatch-ordinary
random          68.2              67.4          70.4
relation freq.  69.9              68.3          70.9
combined        71.3              70.2          73.2

Table 2: The effect of different sibling orders.

References

Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. 2015. Broad-coverage CCG semantic parsing with AMR. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1699-1710.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR.
Miguel Ballesteros and Yaser Al-Onaizan. 2017. AMR parsing using stack-LSTMs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1269-1275.
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178-186.
Guntis Barzdins and Didzis Gosko. 2016. RIGA at SemEval-2016 Task 8: Impact of Smatch extensions and character-level neural translation on AMR parsing accuracy. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1143-1147.
Jan Buys and Phil Blunsom. 2017. Oxford at SemEval-2017 Task 9: Neural AMR parsing with pointer-augmented attention. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 914-919.
Shu Cai and Kevin Knight. 2013. Smatch: an evaluation metric for semantic feature structures. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 748-752.
Marco Damonte, Shay B. Cohen, and Giorgio Satta. 2017. An incremental parser for Abstract Meaning Representation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 536-546.
Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
Jeffrey Flanigan, Chris Dyer, Noah A. Smith, and Jaime Carbonell. 2016. CMU at SemEval-2016 Task 8: Graph-based AMR parsing with infinite ramp loss. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1202-1206.
Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the Abstract Meaning Representation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426-1436.
William Foland and James H. Martin. 2017. Abstract Meaning Representation parsing using LSTM recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 463-472.
Jonas Groschwitz, Meaghan Fowlie, Mark Johnson, and Alexander Koller. 2017. A constrained graph algebra for semantic parsing with AMRs. In IWCS 2017 - 12th International Conference on Computational Semantics - Long Papers.
Jonas Groschwitz, Matthias Lindemann, Meaghan Fowlie, Mark Johnson, and Alexander Koller. 2018. AMR dependency parsing with a typed semantic algebra. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1831-1841.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631-1640.
Zhijiang Guo and Wei Lu. 2018. Better transition-based AMR parsing with a refined search space. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1712-1722.
Hardy Hardy and Andreas Vlachos. 2018. Guided neural language generation for abstractive summarization using Abstract Meaning Representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 768-773.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-sequence models for parsing and generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 146-157.
Sandra Kubler, Ryan McDonald, and Joakim Nivre. 2009. Dependency parsing. Synthesis Lectures on Human Language Technologies, 1(1):1-127.
Ronald W. Langacker. 2008. Cognitive Grammar: A Basic Introduction. Oxford University Press.
Matthias Lindemann, Jonas Groschwitz, and Alexander Koller. 2019. Compositional semantic parsing across graphbanks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4576-4585.
Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1077-1086.
Yijia Liu, Wanxiang Che, Bo Zheng, Bing Qin, and Ting Liu. 2018. An AMR aligner tuned by transition-based parser. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2422-2430.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421.
Chunchuan Lyu and Ivan Titov. 2018. AMR parsing as graph prediction with latent alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 397-407.
Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60.
Arindam Mitra and Chitta Baral. 2016. Addressing a question answering challenge by combining statistical methods with inductive rule learning and reasoning. In Thirtieth AAAI Conference on Artificial Intelligence.
Tahira Naseem, Abhishek Shah, Hui Wan, Radu Florian, Salim Roukos, and Miguel Ballesteros. 2019. Rewarding Smatch: Transition-based AMR parsing with reinforcement learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4586-4592.
Rik van Noord and Johan Bos. 2017. Neural semantic parsing by character-based translation: Experiments with Abstract Meaning Representations. arXiv preprint arXiv:1705.09980.
Xiaochang Peng, Daniel Gildea, and Giorgio Satta. 2018. AMR parsing with cache transition systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
Xiaochang Peng, Linfeng Song, and Daniel Gildea. 2015. A synchronous hyperedge replacement grammar based approach for AMR parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 32-41.
Xiaochang Peng, Chuan Wang, Daniel Gildea, and Nianwen Xue. 2017. Addressing the data sparsity issue in neural AMR parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 366-375.
Nima Pourdamghani, Yang Gao, Ulf Hermjakob, and Kevin Knight. 2014. Aligning English strings with Abstract Meaning Representation graphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 425-429.
Michael Pust, Ulf Hermjakob, Kevin Knight, Daniel Marcu, and Jonathan May. 2015. Parsing English into Abstract Meaning Representation using syntax-based machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1143-1154.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958.
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027-5038.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.
Chuan Wang, Sameer Pradhan, Xiaoman Pan, Heng Ji, and Nianwen Xue. 2016. CAMR at SemEval-2016 Task 8: An extended transition-based AMR parser. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1173-1178.
Chuan Wang and Nianwen Xue. 2017. Getting the most out of AMR parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1257-1268.
Keenon Werling, Gabor Angeli, and Christopher D. Manning. 2015. Robust subgraph generation improves Abstract Meaning Representation parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 982-991.
Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019. AMR parsing as sequence-to-graph transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 80-94.
Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese treebank using a global discriminative model. In Proceedings of the 11th International Conference on Parsing Technologies, pages 162-171.
Junsheng Zhou, Feiyu Xu, Hans Uszkoreit, Weiguang Qu, Ran Li, and Yanhui Gu. 2016. AMR parsing with an incremental joint model. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 680-689.


model component          hyper-parameter             value
char-level CNN           number of filters           256
                         width of filters            3
                         char embedding size         32
                         final hidden size           128
Transformer              number of heads             8
                         hidden state size           512
                         feed-forward hidden size    1024
Sentence Encoder         Transformer layers          4
                         lemma embedding size        200
                         POS tag embedding size      32
                         NER tag embedding size      16
Graph Encoder            Transformer layers          1
                         concept embedding size      300
Focus Selection          attention layers            3
Relation Identification  number of heads             8
Relation Classification  hidden state size           100

Table 3: Hyper-parameter settings.

A Implementation Details

In all experiments, we use the same char-level CNN settings in the sentence encoder and the graph encoder. In addition, all Transformer (Vaswani et al., 2017) layers in our model share the same hyper-parameter settings. For computational efficiency, we only allow each concept to attend to the previously generated concepts in the graph encoder.7 Table 3 summarizes the hyper-parameters chosen by tuning on the development set. To mitigate overfitting, we apply dropout (Srivastava et al., 2014) with a drop rate of 0.2 between different layers. We use a special UNK token to replace the input lemmas, POS tags, and NER tags with a rate of 0.33. Parameter optimization is performed with the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.999. The same learning rate schedule as Vaswani et al. (2017) is adopted in our experiments. We use early stopping on the development set to choose the best model.

Following Lyu and Titov (2018), for word sense disambiguation we simply use the most frequent sense in the training set, or -01 if the concept is not present. For wikification, we look up the most frequent wiki link in the training set and default to "-" otherwise.

7 Otherwise, we would need to re-compute the hidden states for all existing nodes at each parsing step.