
Densely Connected Graph Convolutional Networks for Graph-to-Sequence Learning

Zhijiang Guo1∗, Yan Zhang1∗, Zhiyang Teng1,2, Wei Lu1

1 Singapore University of Technology and Design, 8 Somapah Road, Singapore 487372

2 School of Engineering, Westlake University, China
{zhijiang guo,yan zhang,zhiyang teng}@mymail.sutd.edu.sg

[email protected], [email protected]

Abstract

We focus on graph-to-sequence learning, which can be framed as transducing graph structures to sequences for text generation. To capture structural information associated with graphs, we investigate the problem of encoding graphs using graph convolutional networks (GCNs). Unlike various existing approaches where shallow architectures were used for capturing local structural information only, we introduce a dense connection strategy, proposing a novel Densely Connected Graph Convolutional Network (DCGCN). Such a deep architecture is able to integrate both local and non-local features to learn a better structural representation of a graph. Our model outperforms the state-of-the-art neural models significantly on AMR-to-text generation and syntax-based neural machine translation.

1 Introduction

Graphs play an important role in natural language processing (NLP) as they are able to capture richer structural information than sequences and trees. Generally, the semantics of sentences can be encoded as graphs. For example, the abstract meaning representation (AMR) (Banarescu et al., 2013) is a directed, labeled graph as shown in Figure 1, where nodes in the graph denote semantic concepts and edges denote relations between concepts. Such graph representations can capture rich semantic-level structural information, and are attractive representations useful for semantics-related tasks such as semantic parsing (Guo and Lu, 2018) and natural language generation (Beck et al., 2018). In this paper, we focus on graph-to-sequence learning tasks, where we aim to learn representations for graphs that are useful for text generation.

* Contributed equally.

Graph convolutional networks (GCNs) (Kipf and Welling, 2017) are variants of convolutional neural networks (CNNs) that operate directly on graphs, where the representation of each node is iteratively updated based on those of its adjacent nodes in the graph through an information propagation scheme. For example, the first layer of GCNs can only capture the graph's adjacency information between immediate neighbors, while with the second layer one will be able to capture second-order proximity information (neighborhood information two hops away from one node), as shown in Figure 1. Formally, L layers will be needed in order to capture neighborhood information that is L hops away.

GCNs have been successfully applied to many NLP tasks (Bastings et al., 2017; Zhang et al., 2018b). Interestingly, although deeper GCNs with more layers will be able to capture richer neighborhood information of a graph, empirically it has been observed that the best performance is achieved with a 2-layer model (Li et al., 2018).

Therefore, recent efforts have explored recurrence-based graph neural networks as alternatives for encoding the structural information of graphs. Examples include graph-state long short-term memory (LSTM) networks (Song et al., 2018) and gated graph neural networks (GGNNs) (Beck et al., 2018). Deep architectures based on such recurrence-based models have been successfully built for tasks such as language generation, where the rich neighborhood information they capture was shown to be useful.

Figure 1: A 3-layer densely connected graph convolutional network. The example AMR graph here corresponds to the sentence "You guys know what I mean." Every layer encodes information about immediate neighbors, and 3 layers are needed to capture third-order neighborhood information (nodes that are 3 hops away from the current node). Each layer concatenates all preceding outputs as the input.

Compared with recurrent neural networks, convolutional architectures are highly parallelizable and are more amenable to hardware acceleration (Gehring et al., 2017). It is therefore worthwhile to explore the possibility of applying deeper GCNs that are able to capture more non-local information associated with the graph for graph-to-sequence learning. Prior efforts have tried to train deep GCNs by incorporating residual connections (Bastings et al., 2017). Xu et al. (2018) show that the vanilla residual connections proposed by He et al. (2016) are not effective for graph neural networks. They then attempt to resolve this issue by adding additional recurrent layers on top of graph convolutional layers. However, they are still confined to relatively shallow GCN architectures (at most 6 layers in their experiments), which may not be able to capture the rich non-local interactions for larger graphs.

In this paper, to better address the issue of learning deeper GCNs, we introduce dense connectivity to GCNs and propose the novel densely connected graph convolutional networks (DCGCNs), inspired by DenseNets (Huang et al., 2017), which distill insights from residual connections. The dense connectivity strategy is illustrated schematically in Figure 1. Direct connections are introduced from any layer to all its preceding layers. For example, the third layer receives the outputs of the first layer and the second layer, capturing the first-order, the second-order, and the third-order neighborhood information. With the help of dense connections, we are able to train multi-layer GCN models with a large depth, allowing rich local and non-local information to be captured for learning a better graph representation than those learned from the shallower GCN models.

Experiments show that our model is able to achieve better performance for graph-to-sequence learning tasks. For the AMR-to-text generation task, our model surpasses the current state-of-the-art neural models trained on LDC2015E86 and LDC2017T10 by 2 and 4.3 BLEU points, respectively. For the syntax-based neural machine translation task, our model is also consistently better than others, showing the effectiveness of the model on a large training set. Our code, implemented on top of MXNET (Chen et al., 2015) and the Sockeye toolkit (Felix et al., 2017), is available at https://github.com/Cartus/DCGCN.

2 Densely Connected GCNs

In this section, we will present the basic components used for constructing our DCGCN model.

2.1 GCNs

GCNs are neural networks that operate directly on graph structures (Kipf and Welling, 2017). Here we mathematically illustrate how multi-layer GCNs work on an undirected graph G = (V, E), where V and E are the set of nodes and edges, respectively. The convolution computation for node v at the l-th layer, which takes the input feature representation h^{(l-1)} as input and outputs the induced representation h_v^{(l)}, can be defined as:

h_v^{(l)} = \rho\Big( \sum_{u \in \mathcal{N}(v)} W^{(l)} h_u^{(l-1)} + b^{(l)} \Big)    (1)

where W^{(l)} is the weight matrix, b^{(l)} is the bias vector, \mathcal{N}(v) is the set of one-hop neighbors of node v, and \rho is an activation function (e.g., ReLU [Nair and Hinton, 2010]). h_v^{(0)} is the initial input x_v, where x_v \in \mathbb{R}^d and d is the input feature dimension.
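To make Equation (1) concrete, here is a minimal NumPy sketch of a single GCN layer on a toy undirected graph. The adjacency lists, dimensions, and random parameters are illustrative assumptions; this is a sketch, not the paper's MXNet implementation.

```python
import numpy as np

def gcn_layer(h, neighbors, W, b):
    """One GCN layer (Equation 1): for each node v, sum the transformed
    representations of its one-hop neighbors, add a bias, and apply ReLU."""
    out = np.zeros((h.shape[0], W.shape[1]))
    for v in range(h.shape[0]):
        agg = np.zeros(W.shape[1])
        for u in neighbors[v]:
            agg += h[u] @ W                    # W^{(l)} h_u^{(l-1)}
        out[v] = np.maximum(0.0, agg + b)      # rho = ReLU
    return out

# Toy undirected graph with 4 nodes, given as neighbor lists (an assumption).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
d = 8
h0 = np.random.randn(4, d)                     # initial inputs x_v
W1, b1 = 0.1 * np.random.randn(d, d), np.zeros(d)
h1 = gcn_layer(h0, neighbors, W1, b1)          # captures first-order neighborhood
h2 = gcn_layer(h1, neighbors, W1, b1)          # stacking reaches second-order neighborhood
```

Stacking the layer twice illustrates the point made above: information from nodes two hops away only becomes visible after the second application.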

GCNs with Residual Connections. Bastings et al. (2017) integrate residual connections (He et al., 2016) into GCNs to help information propagation. Specifically, each node is updated according to Equation (1) first, and then the resulting representation is combined with the node's representation from the last iteration:

h_v^{(l)} = \rho\Big( \sum_{u \in \mathcal{N}(v)} W^{(l)} h_u^{(l-1)} + b^{(l)} \Big) + h_v^{(l-1)}    (2)

GCNs with Layer Aggregations. Xu et al. (2018) propose layer aggregations for GCNs, in which the final representation of each node is computed by combining the node's representations from all GCN layers:

h_v^{final} = LA(h_v^{(l)}, h_v^{(l-1)}, \ldots, h_v^{(1)})    (3)

where the LA function can be concatenation, max-pooling, or LSTM-attention operations as defined in Xu et al. (2018).
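For later comparison with dense connectivity, the two baseline strategies in Equations (2) and (3) can be sketched as follows. The stand-in layer function and the concatenation-based LA are assumptions chosen for brevity; Xu et al. (2018) also consider max-pooling and LSTM-attention aggregators.

```python
import numpy as np

def gcn_residual(layer_fn, h):
    """Equation (2): the GCN layer output plus the previous representation."""
    return layer_fn(h) + h

def layer_aggregation(history):
    """Equation (3) with LA = concatenation: combine the outputs of all layers."""
    return np.concatenate(history, axis=-1)

# Stand-in layer: any map from (n_nodes, d) to (n_nodes, d) works here, e.g. the
# gcn_layer sketch above with fixed weights and neighbor lists.
layer_fn = lambda h: np.tanh(h)

h = np.random.randn(4, 8)
history = []
for _ in range(3):
    h = gcn_residual(layer_fn, h)
    history.append(h)
h_final = layer_aggregation(history)           # shape (4, 3 * 8)
```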

2.2 Dense Connectivity

Dense connectivity is the core component of the proposed DCGCN. With dense connectivity, node v in the l-th layer not only takes inputs from h^{(l-1)}, but also receives information from all the preceding layers, as shown in Figure 2. Mathematically, we first define g_u^{(l)} as the concatenation of the initial node representation and the node representations produced in layers 1, ..., l-1:

g_u^{(l)} = [x_u; h_u^{(1)}; \ldots; h_u^{(l-1)}]    (4)

Such a mechanism allows deeper layers to capture all previous information to alleviate the problem discussed in Section 1 for graph neural networks. Similar strategies are also proposed in previous work (He et al., 2016; Huang et al., 2017).

While dense connectivity allows training deeper neural networks, every intermediate layer is designated to be of very small size, so that only a small set of feature maps is added at each layer. The final classifier makes predictions based on all feature maps, which is called "collective knowledge" (Huang et al., 2017). Such a strategy improves the parameter efficiency. In practice, the dimension of these small hidden layers d_hidden is decided by the number of layers L and the input feature dimension d. In DCGCN, we use d_hidden = d/L.

For example, if we have a 3-layer (L = 3) DCGCN model and the input dimension is 300 (d = 300), the hidden dimension of each layer will be d_hidden = d/L = 300/3 = 100. We then concatenate the output of each layer to form the new representation. Since we have 3 layers, the output dimension is 300 (3 × 100). Different from the GCN model, whose hidden dimension is larger than or equal to the input dimension, the DCGCN model shrinks the hidden dimension as the number of layers increases in order to improve the parameter efficiency, similar to DenseNets (Huang et al., 2017).

Figure 2: Each DCGCN block has two sub-blocks. Both of them are densely connected graph convolutional layers with different numbers of layers. A linear transformation is used between the two sub-blocks, followed by a residual connection.

Accordingly, we modify the convolution computation of each layer as:

h_v^{(l)} = \rho\Big( \sum_{u \in \mathcal{N}(v)} W^{(l)} g_u^{(l)} + b^{(l)} \Big)    (5)

The column dimension of the weight matrix increases by d_hidden per layer, that is, W^{(l)} \in \mathbb{R}^{d_{hidden} \times d^{(l)}}, where d^{(l)} = d + d_{hidden} \times (l-1).
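A minimal sketch of Equations (4) and (5), assuming plain summation over neighbors (no attention or edge types yet): each layer reads the concatenation of the initial input and all earlier outputs and emits only d_hidden = d/L features, so the concatenated output recovers dimension d. Shapes, initialization, and the toy graph are illustrative assumptions.

```python
import numpy as np

def dcgcn_stack(x, neighbors, weights, biases):
    """Densely connected GCN layers (Eqs. 4-5).

    x: (n_nodes, d) initial node features.
    weights[l]: (d + l * d_hidden, d_hidden) for layer l = 0..L-1.
    Returns the concatenation of all L layer outputs, shape (n_nodes, d)."""
    n_nodes = x.shape[0]
    outputs = []
    for W, b in zip(weights, biases):
        g = np.concatenate([x] + outputs, axis=-1)       # g_u^{(l)} from Eq. (4)
        h = np.zeros((n_nodes, W.shape[1]))
        for v in range(n_nodes):
            agg = sum(g[u] @ W for u in neighbors[v])    # Eq. (5) without attention
            h[v] = np.maximum(0.0, agg + b)
        outputs.append(h)
    return np.concatenate(outputs, axis=-1)

# 3-layer example from the text: d = 300, d_hidden = 300 / 3 = 100.
d, L = 300, 3
d_hidden = d // L
neighbors = {0: [1], 1: [0, 2], 2: [1]}
x = np.random.randn(3, d)
weights = [0.01 * np.random.randn(d + l * d_hidden, d_hidden) for l in range(L)]
biases = [np.zeros(d_hidden) for _ in range(L)]
h_out = dcgcn_stack(x, neighbors, weights, biases)       # shape (3, 300)
```

Note how the weight matrix grows by d_hidden columns per layer, matching the d^{(l)} = d + d_hidden × (l − 1) bookkeeping above.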

2.3 Graph Attention

Attention mechanisms have become almost a de facto standard in many sequence-based tasks (Vaswani et al., 2017). In DCGCNs, we also incorporate the self-attention strategy by implicitly specifying different weights to different nodes in a neighborhood, similar to graph attention networks (Velickovic et al., 2018).



In order to perform self-attention on nodes, attention coefficients are required. The input for the calculation is a set of vectors, g^{(l)} = \{g_1^{(l)}, g_2^{(l)}, \ldots, g_n^{(l)}\}, after the node-wise feature transformation \hat{g}_u^{(l)} = W^{(l)} g_u^{(l)}. As an initial step, a shared linear projection parameterized by a weight matrix W_a \in \mathbb{R}^{d_{hidden} \times d_{hidden}} is applied to nodes in the graph. Attention coefficients can be computed as:

\alpha_{ij}^{(l)} = \frac{ \exp\big( \phi( a^{\top} [ W_a \hat{g}_i^{(l)} ; W_a \hat{g}_j^{(l)} ] ) \big) }{ \sum_{k \in \mathcal{N}_i} \exp\big( \phi( a^{\top} [ W_a \hat{g}_i^{(l)} ; W_a \hat{g}_k^{(l)} ] ) \big) }    (6)

where a \in \mathbb{R}^{2 d_{hidden}} is a weight vector and \phi is the activation function (here we use LeakyReLU [Girshick et al., 2014]). These coefficients are used to compute a linear combination of the node representations. Modifying the convolution computation for attention, we arrive at:

h_v^{(l)} = \rho\Big( \sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(l)} W^{(l)} g_u^{(l)} + b^{(l)} \Big)    (7)

where \alpha_{vu}^{(l)} are the normalized attention coefficients computed by the attention mechanism at the l-th layer. Note that these coefficients do not change the dimension of the output representations.
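The sketch below instantiates Equations (6) and (7) for one layer; the LeakyReLU slope, the shapes, and the random parameters are assumptions made for illustration rather than values taken from the paper.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def graph_attention_layer(g, neighbors, W, Wa, a, b):
    """Eqs. (6)-(7): self-attention over each node's neighborhood.

    g: (n_nodes, d_in) densely connected inputs g_u^{(l)};
    W: (d_in, d_hidden); Wa: (d_hidden, d_hidden); a: (2 * d_hidden,)."""
    gt = g @ W                                     # node-wise transform, hat{g}_u = W g_u
    out = np.zeros((g.shape[0], W.shape[1]))
    for i in range(g.shape[0]):
        nbrs = neighbors[i]
        # Unnormalized scores phi(a^T [Wa hat{g}_i ; Wa hat{g}_k]) for each neighbor k.
        scores = np.array([
            leaky_relu(a @ np.concatenate([gt[i] @ Wa, gt[k] @ Wa])) for k in nbrs
        ])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                       # softmax over the neighborhood, Eq. (6)
        agg = sum(alpha[j] * gt[u] for j, u in enumerate(nbrs))
        out[i] = np.maximum(0.0, agg + b)          # Eq. (7), rho = ReLU
    return out

# Example with 3 nodes, d_in = 20, d_hidden = 10 (illustrative sizes).
n, d_in, d_hidden = 3, 20, 10
g = np.random.randn(n, d_in)
W, Wa = np.random.randn(d_in, d_hidden), np.random.randn(d_hidden, d_hidden)
a, b = np.random.randn(2 * d_hidden), np.zeros(d_hidden)
h = graph_attention_layer(g, {0: [1, 2], 1: [0], 2: [0]}, W, Wa, a, b)
```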

3 Graph-to-Sequence Model

In the following we will explain the model architecture of the graph-to-sequence model. We leverage DCGCNs as the graph encoder, which directly models the graph structure without linearization.

3.1 Graph Encoder

The graph encoder is composed of DCGCN blocks, as shown in Figure 3. Within each DCGCN block, we design two types of multi-layer DCGCNs as two sub-blocks to capture graph structure at different abstraction levels. As Figure 2 shows, in each block the first sub-block has n layers and the second sub-block has m layers. This design shares the same spirit as the use of two different-sized filters in DenseNets (Huang et al., 2017).

Figure 3: The model concatenates node embeddings and positional embeddings as inputs. The encoder contains a stack of N identical blocks. The linear transformation layer combines the outputs of all blocks into hidden representations. These are fed into an attention mechanism, generating the context vector. The decoder, a 2-layer LSTM (Hochreiter and Schmidhuber, 1997), makes predictions based on the hidden representations and the context vector.

Linear Combination Layer. In addition to densely connected layers, we include a linear combination layer between multi-layer DCGCNs to filter the representations from different DCGCN layers, reaching a more expressive representation. This strategy is inspired by ELMo (Peters et al., 2018), which combines the hidden states from different LSTM layers. We also use a residual connection (He et al., 2016) to incorporate the initial inputs of the multi-layer GCNs into the linear combination layer; see Figure 3. Formally, the output of the linear combination layer is defined as:

h_{comb} = W_{comb} ( h_{out} + x_v ) + b_{comb}    (8)

where h_{out} is the output of the densely connected layers, obtained by concatenating the outputs from all previous L layers: h_{out} = [h^{(1)}; \ldots; h^{(L)}], with h_{out} \in \mathbb{R}^d. x_v is the input of the DCGCN layer; h_{out} and x_v share the same dimension d. W_{comb} \in \mathbb{R}^{d \times d} is a weight matrix and b_{comb} is a bias vector for the linear transformation. Both W_{comb} and b_{comb} differ across DCGCN layers. In addition, another linear combination layer is added to obtain the final representations, as shown in Figure 3.
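A short sketch of the linear combination layer in Equation (8): the concatenated layer outputs plus a residual connection to the block input, followed by a learned linear map. Parameter shapes here are assumptions for illustration.

```python
import numpy as np

def linear_combination(layer_outputs, x, W_comb, b_comb):
    """Eq. (8): h_comb = W_comb (h_out + x) + b_comb, where h_out concatenates
    all L layer outputs and x is the block input (both of dimension d)."""
    h_out = np.concatenate(layer_outputs, axis=-1)   # (n_nodes, d)
    return (h_out + x) @ W_comb + b_comb             # residual connection, then linear map

# Example: L = 3 layers with d_hidden = 100 each, so d = 300 (assumed sizes).
n, d_hidden, L = 5, 100, 3
d = d_hidden * L
layer_outputs = [np.random.randn(n, d_hidden) for _ in range(L)]
x = np.random.randn(n, d)
W_comb, b_comb = 0.01 * np.random.randn(d, d), np.zeros(d)
h_comb = linear_combination(layer_outputs, x, W_comb, b_comb)   # (n, d)
```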

3.2 Extended Levi Graph

In order to improve the information propagation process in graph structures such as AMR graphs and dependency trees, previous researchers enrich the original input graphs with additional transformations. Marcheggiani and Titov (2017) add reverse edges as well as self-loop edges for each node to the original graph. This strategy is similar to the bidirectional recurrent neural networks (RNNs) (Elman, 1990), which can enjoy information propagation from two directions. Beck et al. (2018) adapt this approach and additionally transform the directed input graphs into Levi graphs (Gross et al., 2013). Basically, edges in the original graphs are turned into additional nodes in Levi graphs. With this approach, we can encode the original edge labels and node inputs in the same way. Specifically, Beck et al. (2018) define three types of edge labels on the Levi graph: default, reverse, and self, which refer to the original edges, the new virtual edges that are reverse to the original edges, and the self-loop edges, respectively.

Scarselli et al. (2009) add another node that is connected to all other nodes. Zhang et al. (2018a) use a global sentence-level node to assemble and back-distribute information. Motivated by these works, we propose an extended Levi graph, which adds a global node to the Levi graph. For every node x in the original Levi graph, there is a new edge (global) from the global node to x. Figure 4 shows an example AMR graph and its corresponding extended Levi graph. The edge type vocabulary for the extended Levi graph of the AMR graph now becomes T = {default, reverse, self, global}. Our motivations are three-fold. First, the global node gives each node a global view of the input graph, which can make each node more aware of the non-local information. Second, the global node can serve as a hub to help node communications, which can facilitate the node information propagation process. Third, the output vectors of the global node in the encoder can be used as the initial states of the decoder, which are crucial for sequence-to-sequence learning tasks. Prior efforts average the representations of all nodes as the graph embedding to initialize the decoder. Instead, we directly use the learned representation of the global node, which captures the information from all nodes in the whole graph.

Figure 4: An AMR graph (top) and its corresponding extended Levi graph (bottom). The extended Levi graph contains an additional global node and four different types of edges.

The input to the syntax-based neural machine translation task is the dependency tree. Unlike the AMR graph, the sentence contains significant sequential information. Beck et al. (2018) inject this information by adding sequential connections to each token. In our model, we also add forward and backward sequential connections, as illustrated in Figure 5. Therefore, the edge type vocabulary for the extended Levi graph of the dependency tree becomes T = {default, reverse, self, global, forward, backward}.
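To illustrate the transformation, the following hedged sketch converts an edge-labeled directed graph into an extended Levi graph: each labeled edge becomes a node, and default, reverse, self, and global edges are added (forward and backward sequential edges for dependency trees would be added analogously). The function name, data structures, and toy AMR labels are illustrative assumptions, not the paper's preprocessing code.

```python
def build_extended_levi_graph(nodes, edges):
    """nodes: list of node labels; edges: list of (src, edge_label, tgt) triples.
    Returns (levi_nodes, levi_edges), where levi_edges are (u, type, v) triples
    with type in {default, reverse, self, global}."""
    levi_nodes = list(nodes)
    levi_edges = []
    for i, (src, label, tgt) in enumerate(edges):
        edge_node = f"{label}#{i}"          # the labeled edge becomes an extra node
        levi_nodes.append(edge_node)
        levi_edges += [(src, "default", edge_node), (edge_node, "default", tgt)]
        levi_edges += [(edge_node, "reverse", src), (tgt, "reverse", edge_node)]
    levi_edges += [(v, "self", v) for v in levi_nodes]           # self-loop edges
    levi_nodes.append("gnode")                                    # global node
    levi_edges += [("gnode", "global", v) for v in levi_nodes if v != "gnode"]
    return levi_nodes, levi_edges

# Tiny AMR-style example (concepts and relation labels are illustrative).
nodes = ["know-01", "you", "thing"]
edges = [("know-01", "ARG0", "you"), ("know-01", "ARG1", "thing")]
levi_nodes, levi_edges = build_extended_levi_graph(nodes, edges)
```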

Positional encodings about the relative or absolute positions of the tokens have been proved beneficial for sequence learning (Gehring et al., 2017). We also include positional encodings by concatenating them with the learned word embeddings. The positional encodings are indexed by integer values representing the minimum distance from the root node. For example, come-01 in Figure 4 is the root node of the AMR graph, so its index is 0, while and, the child node of come-01, has index 1. Note that we denote the index of the global node as -1.
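The positional index described above can be computed with a breadth-first search from the root; this sketch traverses the original directed edges and assigns -1 to the global node. Whether the distance should be measured on the original graph or on the extended Levi graph is left open in the text, so treat this as one plausible reading, and the edge labels below are placeholders.

```python
from collections import deque

def positional_indices(root, edges, global_node="gnode"):
    """Minimum distance of every node from the root via BFS over directed edges;
    the global node gets index -1, as described in the text."""
    children = {}
    for src, _, tgt in edges:
        children.setdefault(src, []).append(tgt)
    dist, queue = {root: 0}, deque([root])
    while queue:
        u = queue.popleft()
        for v in children.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    dist[global_node] = -1
    return dist

# Toy fragment inspired by Figure 4: the root gets index 0, its child gets 1.
edges = [("come-01", "op1", "and"), ("and", "op1", "interest-01")]
print(positional_indices("come-01", edges))
# {'come-01': 0, 'and': 1, 'interest-01': 2, 'gnode': -1}
```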



Figure 5: A dependency tree and its extended Levi graph.

3.3 Direction Aggregation

Directionality and edge labels play an important role in linguistic structures. Information from incoming edges, outgoing edges, and self edges should be treated differently by using separate weight matrices. Moreover, information from incoming edges that have different labels should have different weight matrices, too. Following this motivation, we incorporate the directionality of an edge directly into its label. For example, node learn-01 in Figure 4 has three incoming edges; these edges have three different types: default (from node op2), self (from node learn-01), and global (from node gnode). For the AMR graph we have four types of edges, while for dependency trees we have six, as mentioned in Section 3.2. Thus, considering the different types of edges, we modify the convolution computation as:

v_t^{(l)} = \rho\Big( \sum_{u \in \mathcal{N}(v),\ dir(u,v)=t} \alpha_{vu}^{(l)} W_t^{(l)} g_u^{(l)} + b_t^{(l)} \Big)    (9)

where dir(u, v) selects the weight matrix and bias term associated with the edge type t. For example, in the AMR generation task, there are four edge types: default, reverse, self, and global. Each type corresponds to a separate weight matrix and a separate bias term.

Now we need to aggregate the representations learned from different types of edges. A simple way to do this is averaging them to get the final representations. However, Hamilton et al. (2017) show that using a mean-based function to aggregate feature information from different nodes may not be satisfactory, since information from different sources should not be treated equally. Thus we assign different weights to information from different types of edges to integrate such information. Specifically, we concatenate the learned representations from all types of edges and perform a linear transformation, mathematically represented as:

f([v_1^{(l)}; \cdots; v_T^{(l)}]) = W_f [v_1^{(l)}; \cdots; v_T^{(l)}] + b_f    (10)

where W_f \in \mathbb{R}^{d' \times d_{hidden}} is the weight matrix and d' = T \times d_{hidden}. T is the size of the edge type vocabulary and d_{hidden} is the hidden dimension in DCGCN layers as described in Section 2.2. b_f \in \mathbb{R}^{d_{hidden}} is a bias vector. Finally, the convolution computation becomes:

h_v^{(l)} = \rho\big( f([v_1^{(l)}; \cdots; v_T^{(l)}]) \big)    (11)
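A hedged sketch of Equations (9)-(11): one aggregation per edge type, followed by concatenation and a linear transformation. For brevity the attention coefficients of Equation (6) are replaced by uniform weights, and the shapes and edge-type bookkeeping are simplifying assumptions.

```python
import numpy as np

EDGE_TYPES = ["default", "reverse", "self", "global"]    # AMR setting (Section 3.2)

def direction_aggregation(g, in_edges, W_t, b_t, W_f, b_f):
    """g: (n_nodes, d_in); in_edges[t][v]: neighbors u with dir(u, v) = t.
    W_t[t]: (d_in, d_hidden); W_f: (T * d_hidden, d_hidden); b_f: (d_hidden,)."""
    n_nodes, d_hidden = g.shape[0], b_f.shape[0]
    out = np.zeros((n_nodes, d_hidden))
    for v in range(n_nodes):
        per_type = []
        for t in EDGE_TYPES:
            nbrs = in_edges[t].get(v, [])
            alpha = 1.0 / max(len(nbrs), 1)              # uniform stand-in for Eq. (6) weights
            agg = sum(alpha * (g[u] @ W_t[t]) for u in nbrs) if nbrs else np.zeros(d_hidden)
            per_type.append(np.maximum(0.0, agg + b_t[t]))     # v_t^{(l)}, Eq. (9)
        concat = np.concatenate(per_type)                # [v_1; ...; v_T]
        out[v] = np.maximum(0.0, concat @ W_f + b_f)     # Eqs. (10)-(11)
    return out

# Example with 4 nodes, d_in = 12, d_hidden = 6 (illustrative sizes).
n, d_in, d_h = 4, 12, 6
g = np.random.randn(n, d_in)
in_edges = {t: {} for t in EDGE_TYPES}
in_edges["self"] = {v: [v] for v in range(n)}
in_edges["global"] = {v: [0] for v in range(n)}          # pretend node 0 is the global node
W_t = {t: np.random.randn(d_in, d_h) for t in EDGE_TYPES}
b_t = {t: np.zeros(d_h) for t in EDGE_TYPES}
W_f, b_f = np.random.randn(len(EDGE_TYPES) * d_h, d_h), np.zeros(d_h)
h = direction_aggregation(g, in_edges, W_t, b_t, W_f, b_f)
```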

3.4 Decoder

We use an attention-based LSTM decoder (Bahdanau et al., 2015). The initial state of the decoder is the representation of the global node described in Section 3.2. The decoder yields the natural language sequence by calculating a sequence of hidden states sequentially. Here we also include the coverage mechanism (Tu et al., 2016). Therefore, when generating the t-th token, the decoder considers five factors: the attention memory, the word embedding of the (t-1)-th token, the previous hidden state of the LSTM, the previous context vector, and the previous coverage vector.
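As a rough illustration of how a coverage-augmented attention step (Tu et al., 2016) can be wired, here is a small sketch; the score function, parameter shapes, and names are assumptions, since the paper does not spell out the decoder's exact parameterization.

```python
import numpy as np

def coverage_attention(memory, s_prev, coverage, Wh, Ws, wc, v):
    """One attention step over the encoder memory.

    memory: (n_nodes, d) encoder states; s_prev: (d,) previous decoder state;
    coverage: (n_nodes,) accumulated attention mass from earlier steps."""
    scores = np.tanh(memory @ Wh + s_prev @ Ws + np.outer(coverage, wc)) @ v
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    context = alpha @ memory                  # context vector fed to the LSTM
    return context, coverage + alpha          # coverage remembers attended nodes

# Illustrative sizes: 6 encoder states of size 300, attention size 128.
n, d, da = 6, 300, 128
memory, s_prev, coverage = np.random.randn(n, d), np.random.randn(d), np.zeros(n)
Wh, Ws = np.random.randn(d, da), np.random.randn(d, da)
wc, v = np.random.randn(da), np.random.randn(da)
context, coverage = coverage_attention(memory, s_prev, coverage, Wh, Ws, wc, v)
```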

4 Experiments

4.1 Experimental Setup

We assess the effectiveness of our models on two typical graph-to-sequence learning tasks, including AMR-to-text generation and syntax-based neural machine translation (NMT). For the AMR-to-text generation task, we use two benchmarks: the LDC2015E86 dataset (AMR15) and the LDC2017T10 dataset (AMR17). In these datasets, each instance contains a sentence and an AMR graph. We follow Konstas et al. (2017) to apply entity simplification in the preprocessing steps. We then transform each preprocessed AMR graph into its extended Levi graph as described in Section 3.2. For the syntax-based NMT task, we evaluate our model on both the En-De and the En-Cs News Commentary v11 dataset from the WMT16 translation task (http://www.statmt.org/wmt16/translation-task.html). We parse English sentences after tokenization to generate the dependency trees on the source side using SyntaxNet (Alberti et al., 2017; https://github.com/tensorflow/models/tree/master/research/syntaxnet). We tokenize Czech and German using the Moses tokenizer (https://github.com/moses-smt/mosesdecoder). On the target side, we use byte-pair encodings (Sennrich et al., 2016) with 8,000 merge operations to obtain subwords. We transform the labelled dependency trees into their corresponding extended Levi graphs as described in Section 3.2. Table 1 shows the statistics of these four datasets. The AMR-to-text datasets contain about 16K to 36K training instances. The NMT datasets are relatively large, consisting of around 200K training instances.

Dataset | Train | Dev | Test
AMR15 (LDC2015E86) | 16,833 | 1,368 | 1,371
AMR17 (LDC2017T10) | 36,521 | 1,368 | 1,371
English-Czech | 181,112 | 2,656 | 2,999
English-German | 226,822 | 2,169 | 2,999

Table 1: The number of sentences in the four datasets.

We tune model hyper-parameters using random layouts based on the results on the development set. We choose the number of DCGCN blocks (Block) from {1, 2, 3, 4}. We select the feature dimension d from {180, 240, 300, 360, 420}. We do not use pretrained embeddings. The encoder and the decoder share the training vocabulary. We adopt Adam (Kingma and Ba, 2015) with an initial learning rate of 0.0003 as the optimizer. The batch size (Batch) candidates are {16, 20, 24}. We determine when to stop training based on the perplexity change on the development set. For decoding, we use beam search with beam size 10. Through preliminary experiments, we find that the combinations (Block = 4, d = 360, Batch = 16) and (Block = 2, d = 360, Batch = 24) give the best results on the AMR and NMT tasks, respectively. Following previous work, we evaluate the results in terms of both BLEU (B) scores (Papineni et al., 2002) and sentence-level CHRF++ (C) scores (Popovic, 2017; Beck et al., 2018). In particular, we use case-insensitive BLEU scores for AMR and case-sensitive BLEU scores for NMT. For ensemble models, we train five models with different random seeds and then use Sockeye (Felix et al., 2017) to perform default ensemble decoding.

Model | T | #P | B | C
Seq2SeqB (Beck et al., 2018) | S | 28.4M | 21.7 | 49.1
GGNN2Seq (Beck et al., 2018) | S | 28.3M | 23.3 | 50.4
Seq2SeqB (Beck et al., 2018) | E | 142M | 26.6 | 52.5
GGNN2Seq (Beck et al., 2018) | E | 141M | 27.5 | 53.5
DCGCN (ours) | S | 18.5M | 27.6 | 57.3
DCGCN (ours) | E | 92.5M | 30.4 | 59.6

Table 2: Main results on AMR17. #P shows the model size in terms of parameters; "S" and "E" denote single and ensemble models, respectively.
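For quick reference, the settings selected above can be summarized in a small configuration sketch; the dictionary structure and names are illustrative and do not come from the released code.

```python
# Best configurations found via random search on the development sets.
BEST_CONFIG = {
    "amr": {"blocks": 4, "feature_dim": 360, "batch_size": 16},
    "nmt": {"blocks": 2, "feature_dim": 360, "batch_size": 24},
}
OPTIMIZER = {"name": "adam", "learning_rate": 3e-4}
BEAM_SIZE = 10
```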

4.2 Main Results on AMR-to-text Generation

We compare the performance of DCGCNs with three other kinds of models: (1) sequence-to-sequence (Seq2Seq) models, which use linearized graphs as inputs; (2) recurrent graph encoders (GGNN2Seq, GraphLSTM); and (3) models trained with external resources. For convenience, we denote the LSTM-based Seq2Seq models of Konstas et al. (2017) and Beck et al. (2018) as Seq2SeqK and Seq2SeqB, respectively. GGNN2Seq (Beck et al., 2018) is the model that leverages GGNNs as graph encoders.

Table 2 shows the results on AMR17. Our single model achieves 27.6 BLEU points, which is the new state-of-the-art result for single models. In particular, our single DCGCN model consistently outperforms Seq2Seq models by a significant margin when trained without external resources. For example, the single DCGCN model gains 5.9 more BLEU points than the single models of Seq2SeqB on AMR17. These results demonstrate the importance of explicitly capturing the graph structure in the encoder.

In addition, our single DCGCN model obtains better results than previous ensemble models. For example, on AMR17, the single DCGCN model is 1 BLEU point higher than the ensemble model of Seq2SeqB. Our model requires substantially fewer parameters (e.g., the parameter size is only 3/5 and 1/9 of those in GGNN2Seq and Seq2SeqB, respectively). The ensemble approach based on combining five DCGCN models initialized with different random seeds achieves a BLEU score of 30.4 and a CHRF++ score of 59.6.

Under the same setting, our model also consistently outperforms graph encoders based on recurrent neural networks or gating mechanisms. For GGNN2Seq, our single model is 3.3 and 0.1 BLEU points higher than their single and ensemble models, respectively. We also have similar observations in terms of CHRF++ scores for sentence-level evaluations. DCGCN also outperforms GraphLSTM by 2.0 BLEU points in the fully supervised setting, as shown in Table 3. Note that GraphLSTM uses char-level neural representations and pretrained word embeddings, whereas our model solely relies on word-level representations with random initializations. This empirically shows that, compared with recurrent graph encoders, DCGCNs can learn better representations for graphs.

Moreover, we compare our results with the state-of-the-art semi-supervised models on the AMR15 test set (Table 3), including non-neural methods such as TSP (Song et al., 2016), PBMT (Pourdamghani et al., 2016), Tree2Str (Flanigan et al., 2016), and SNRG (Song et al., 2017). All these non-neural models train language models on the whole Gigaword corpus. Our ensemble model gives 28.2 BLEU points without external data, which is better than these other methods.

Model | External | B
Seq2SeqK (Konstas et al., 2017) | − | 22.0
GraphLSTM (Song et al., 2018) | − | 23.3
DCGCN (single) | − | 25.7
DCGCN (ensemble) | − | 28.2
TSP (Song et al., 2016) | ALL | 22.4
PBMT (Pourdamghani et al., 2016) | ALL | 26.9
Tree2Str (Flanigan et al., 2016) | ALL | 23.0
SNRG (Song et al., 2017) | ALL | 25.6
Seq2SeqK (Konstas et al., 2017) | 0.2M | 27.4
GraphLSTM (Song et al., 2018) | 0.2M | 28.2
DCGCN (single) | 0.1M | 29.0
DCGCN (single) | 0.2M | 31.6
Seq2SeqK (Konstas et al., 2017) | 2M | 32.3
GraphLSTM (Song et al., 2018) | 2M | 33.6
Seq2SeqK (Konstas et al., 2017) | 20M | 33.8
DCGCN (single) | 0.3M | 33.2
DCGCN (ensemble) | 0.3M | 35.3

Table 3: Main results on AMR15, with and without external Gigaword sentences used as additional auto-parsed training data.

Following Konstas et al. (2017) and Song et al. (2018), we also evaluate our model using external Gigaword sentences as training data. We first use the additional data to pretrain the model, then fine-tune it on the gold data. Using an additional 0.1M data, the single DCGCN model achieves a BLEU score of 29.0, which is higher than Seq2SeqK (Konstas et al., 2017) and GraphLSTM (Song et al., 2018) trained with 0.2M additional data. When using the same amount of 0.2M data, the performance of DCGCN is 4.2 and 3.4 BLEU points higher than Seq2SeqK and GraphLSTM, respectively. The DCGCN model is able to achieve competitive BLEU points (33.2) by using 0.3M external data, while GraphLSTM achieves a score of 33.6 by using 2M data and Seq2SeqK achieves a score of 33.8 by using 20M data. These results show that our model is more effective in terms of using automatically generated AMR graphs. Using 0.3M additional data, our ensemble model achieves the new state-of-the-art result of 35.3 BLEU points.

4.3 Main Results on Syntax-based NMT

Table 4 shows the results for the English-German (En-De) and English-Czech (En-Cs) translation tasks. BoW+GCN, CNN+GCN, and BiRNN+GCN refer to utilizing the following encoders with a GCN layer on top, respectively: 1) a bag-of-words encoder, 2) a one-layer CNN, and 3) a bidirectional RNN. PB-SMT is the phrase-based statistical machine translation model using Moses (Koehn et al., 2007). Our single model achieves 19.0 and 12.1 BLEU points on the En-De and En-Cs tasks, respectively, significantly outperforming all the single models. For example, compared with the best GCN-based model (BiRNN+GCN), our single DCGCN model surpasses it by 2.7 and 2.5 BLEU points on the En-De and En-Cs tasks, respectively. Our models consist of full GCN layers, removing the burden of using a recurrent encoder to extract non-local contextual information in the bottom layers. Compared with non-GCN models, our single DCGCN model is 2.2 and 1.9 BLEU points higher than the current state-of-the-art single model (GGNN2Seq) on the En-De and En-Cs translation tasks, respectively. In addition, our single model is comparable to the ensemble results of Seq2SeqB and GGNN2Seq, whereas the number of parameters of our models is only about 1/6 of theirs. Additionally, the ensemble DCGCN models achieve 20.5 and 13.1 BLEU points on the En-De and En-Cs tasks, respectively. Our ensemble results are significantly higher than those of the state-of-the-art syntax-based ensemble models reported by GGNN2Seq (En-De: 20.5 vs. 19.6; En-Cs: 13.1 vs. 11.7 in terms of BLEU).

Model | Type | En-De #P | En-De B | En-De C | En-Cs #P | En-Cs B | En-Cs C
BoW+GCN (Bastings et al., 2017) | Single | − | 12.2 | − | − | 7.5 | −
CNN+GCN (Bastings et al., 2017) | Single | − | 13.7 | − | − | 8.7 | −
BiRNN+GCN (Bastings et al., 2017) | Single | − | 16.1 | − | − | 9.6 | −
PB-SMT (Beck et al., 2018) | Single | − | 12.8 | 43.2 | − | 8.6 | 36.4
Seq2SeqB (Beck et al., 2018) | Single | 41.4M | 15.5 | 40.8 | 39.1M | 8.9 | 33.8
GGNN2Seq (Beck et al., 2018) | Single | 41.2M | 16.7 | 42.4 | 38.8M | 9.8 | 33.3
DCGCN (ours) | Single | 29.7M | 19.0 | 44.1 | 28.3M | 12.1 | 37.1
Seq2SeqB (Beck et al., 2018) | Ensemble | 207M | 19.0 | 44.1 | 195M | 11.3 | 36.4
GGNN2Seq (Beck et al., 2018) | Ensemble | 206M | 19.6 | 45.1 | 194M | 11.7 | 35.9
DCGCN (ours) | Ensemble | 149M | 20.5 | 45.8 | 142M | 13.1 | 37.8

Table 4: Main results on the English-German and English-Czech datasets.

4.4 Additional Experiments

Layers in the Sub-block. Table 5 shows the effect of the number of layers of each sub-block on the AMR15 development set. DenseNets (Huang et al., 2017) use two kinds of convolution filters: 1 × 1 and 3 × 3. Similar to DenseNets, we choose the values of n and m for the layers from {1, 2, 3, 6}. We choose this value range by considering the scale of non-local nodes, the abstract information at different levels, and the calculation efficiency. For brevity, we only show representative configurations. We first investigate DCGCN with one block. In general, the performance increases when we gradually enlarge n and m. For example, when n = 1 and m = 1, the BLEU score is 17.6; when n = 6 and m = 6, the BLEU score becomes 22.0. We observe that the three settings (n = 6, m = 3), (n = 3, m = 6), and (n = 6, m = 6) give similar results for both 1 DCGCN block and 2 DCGCN blocks. Because the first two settings contain fewer parameters than the third setting, it is reasonable to choose either (n = 6, m = 3) or (n = 3, m = 6). For later experiments, we use (n = 6, m = 3).

Block | n | m | B | C
1 | 1 | 1 | 17.6 | 48.3
1 | 1 | 2 | 19.2 | 50.3
1 | 2 | 1 | 18.4 | 49.1
1 | 1 | 3 | 19.6 | 49.4
1 | 3 | 1 | 20.0 | 50.5
1 | 3 | 3 | 21.4 | 51.0
1 | 3 | 6 | 21.8 | 51.7
1 | 6 | 3 | 21.7 | 51.5
1 | 6 | 6 | 22.0 | 52.1
2 | 3 | 6 | 23.5 | 53.3
2 | 6 | 3 | 23.3 | 53.4
2 | 6 | 6 | 22.0 | 52.1

Table 5: The effect of the number of layers inside DCGCN sub-blocks on the AMR15 development set.

Comparisons with Baselines. The first block in Table 6 shows the performance of our two baseline models: multi-layer GCNs with residual connections (GCN+RC) and multi-layer GCNs with both residual connections and layer aggregations (GCN+RC+LA). In general, increasing the number of GCN layers from 2 to 9 boosts the model performance. However, when the layer number exceeds 10, the performance of both baseline models starts to drop. For example, GCN+RC+LA (10) achieves a BLEU score of 21.2, which is worse than GCN+RC+LA (9). In preliminary experiments, we cannot manage to train very deep GCN+RC and GCN+RC+LA models. In contrast, our DCGCN models can be trained using a large number of layers. For example, DCGCN4 contains 36 layers. When we increase the number of DCGCN blocks from 1 to 4, the model performance continues increasing on the AMR15 development set. We therefore choose DCGCN4 for the AMR experiments. Using a similar method, DCGCN2 is selected for the NMT tasks. When the layer number is 9, DCGCN1 is better than GCN+RC in terms of B/C scores (21.7/51.5 vs. 21.1/50.5). GCN+RC+LA (9) is slightly better than DCGCN1. However, when we set the number to 18, GCN+RC+LA achieves a BLEU score of 19.4, which is significantly worse than the BLEU score obtained by DCGCN2 (23.3). We also try GCN+RC+LA (27), but it does not converge. In conclusion, these results show the robustness and effectiveness of our DCGCN models.

GCN | B | C | GCN | B | C
+RC (2) | 16.8 | 48.1 | +RC+LA (2) | 18.3 | 47.9
+RC (4) | 18.4 | 49.6 | +RC+LA (4) | 18.0 | 51.1
+RC (6) | 19.9 | 49.7 | +RC+LA (6) | 21.3 | 50.8
+RC (9) | 21.1 | 50.5 | +RC+LA (9) | 22.0 | 52.6
+RC (10) | 20.7 | 50.7 | +RC+LA (10) | 21.2 | 52.9
DCGCN1 (9) | 22.9 | 53.0 | DCGCN3 (27) | 24.8 | 54.7
DCGCN2 (18) | 24.2 | 54.4 | DCGCN4 (36) | 25.5 | 55.4

Table 6: Comparisons with baselines. +RC denotes GCNs with residual connections. +RC+LA refers to GCNs with both residual connections and layer aggregations. DCGCNi represents our model with i blocks, containing i × (n + m) layers. The number of layers for each model is shown in parentheses.

Performance vs. Parameter Budget. We also evaluate the performance of the DCGCN model against different numbers of parameters on the AMR generation task. Results are shown in Figure 6. Specifically, we try four parameter budgets, including 11.8M, 14.0M, 16.2M, and 18.4M. These numbers correspond to the model sizes (in terms of the number of parameters) of DCGCN1, DCGCN2, DCGCN3, and DCGCN4, respectively. For each budget, we vary both the depth of the GCN models and the hidden vector dimensions of each node in the GCNs in order to exhaust the entire budget. For example, GCN(2)-512, GCN(3)-426, GCN(4)-372, and GCN(5)-336 contain about 11.8M parameters, where GCN(i)-d indicates a GCN model with i layers and hidden size d for each node. We compare DCGCN1 with these four models. DCGCN1 gives 22.9 BLEU points. For the GCN models, the best result is obtained by GCN(5)-336, which falls behind DCGCN1 by 2.0 BLEU points. We compare DCGCN2, DCGCN3, and DCGCN4 with their equal-sized GCN models in a similar way. The results show that DCGCN consistently outperforms GCN under the same parameter budget. When the parameter budget becomes larger, we can observe that the performance difference becomes more prominent. In particular, the BLEU margins between the DCGCN models and their best GCN counterparts are 2.0, 2.7, 2.7, and 3.4, respectively.

Performance vs. Layers. We compare DCGCN models with different numbers of layers under the same parameter budget. Table 7 shows the results. For example, when both DCGCN1 and DCGCN2 are limited to 10.9M parameters, DCGCN2 obtains 22.2 BLEU points, which is higher than DCGCN1 (20.9). Similarly, when DCGCN3 and DCGCN4 contain 18.6M and 18.4M parameters, respectively, DCGCN4 outperforms DCGCN3 by 1 BLEU point with a slightly smaller model. In general, we found that when the parameter budget is the same, deeper DCGCN models obtain better results than shallower ones.

Figure 6: Comparison of DCGCN and GCN over different numbers of parameters. a-b means the model has a layers (a blocks for DCGCN) and the hidden size is b (e.g., 5-336 means a 5-layer GCN with hidden size 336).

Model | D | #P | B | C
DCGCN(1) | 300 | 10.9M | 20.9 | 52.0
DCGCN(2) | 180 | | 22.2 | 52.3
DCGCN(2) | 240 | 11.3M | 22.8 | 52.8
DCGCN(4) | 180 | 11.4M | 23.4 | 53.4
DCGCN(1) | 420 | 12.6M | 22.2 | 52.4
DCGCN(2) | 300 | 12.5M | 23.8 | 53.8
DCGCN(3) | 240 | 12.3M | 23.9 | 54.1
DCGCN(2) | 360 | 14.0M | 24.2 | 54.4
DCGCN(3) | 300 | | 24.4 | 54.2
DCGCN(2) | 420 | 15.6M | 24.1 | 53.7
DCGCN(4) | 300 | | 24.6 | 54.8
DCGCN(3) | 420 | 18.6M | 24.5 | 54.6
DCGCN(4) | 360 | 18.4M | 25.5 | 55.4

Table 7: Comparisons of different DCGCN models under almost the same parameter budget.

Level of Density. Table 8 shows the ablation study of the level of density of our model. We use DCGCNs with 4 dense blocks as the full model. Then we remove dense connections gradually from the last block to the first block. In general, the performance of the model drops substantially as we remove more dense connections, until it cannot converge without dense connections. The full model gives 25.5 BLEU points on the AMR15 dev set. After removing the dense connections in the last block, the BLEU score becomes 24.8. Without using the dense connections in the last two blocks, the score drops to 23.8. Furthermore, excluding the dense connections in the last three blocks only gives 23.2 BLEU points. Although these four models have the same number of layers, dense connections allow the model to achieve much better performance. If all the dense connections are not considered, the model does not converge at all. These results indicate that dense connections do play a significant role in our model.

Ablation Study for Encoder and Decoder. Following Song et al. (2018), we conduct a further ablation study for the modules used in the graph encoder and the LSTM decoder on the AMR15 dev set, including the linear combination, the global node, direction aggregation, the graph attention mechanism, and the coverage mechanism, using the 4-block models and always keeping the dense connections.

Model | B | C
DCGCN4 | 25.5 | 55.4
-{4} dense block | 24.8 | 54.9
-{3, 4} dense blocks | 23.8 | 54.1
-{2, 3, 4} dense blocks | 23.2 | 53.1

Table 8: Ablation study for the density of connections on the dev set of AMR15. -{i} dense block denotes removing the dense connections in the i-th block.

Model | B | C
DCGCN4 | 25.5 | 55.4
Encoder Modules | |
-Linear Combination | 23.7 | 53.2
-Global Node | 24.2 | 54.6
-Direction Aggregation | 24.6 | 54.6
-Graph Attention | 24.9 | 54.7
-Global Node & Linear Combination | 22.9 | 52.4
Decoder Modules | |
-Coverage Mechanism | 23.8 | 53.0

Table 9: Ablation study for modules used in the graph encoder and the LSTM decoder.

Table 9 shows the results. For the encoder, we find that the linear combination and the global node contribute the most in terms of B/C scores. The results drop by 2/2.2 and 1.3/1.2 points, respectively, after removing them. Without these two components, our model gives a BLEU score of 22.6, which is still better than the best GCN+RC model (21.1) and the best GCN+RC+LA model (22.1). Adding either the global node or the linear combination improves the baseline models with only dense connections. This suggests that enriching input graphs with the global node and including the linear combination can help GCNs learn better information aggregations, producing more expressive graph representations. Results also show that the linear combination is more effective than the global node. Considering them together further enhances the model performance. After removing the graph attention module, our model gives 24.9 BLEU points. Similarly, excluding the direction aggregation module leads to a performance drop to 24.6 BLEU points. The coverage mechanism is also effective in our models. Without the coverage mechanism, the result drops by 1.7/2.4 points in terms of B/C scores.

Figure 7: CHRF++ scores with respect to the input graph size for three models.

4.5 Analysis and Discussion

Graph Size. Following Bastings et al. (2017), we show in Figure 7 the CHRF++ score variations according to the graph size |G| on the AMR15 development set, where |G| refers to the number of nodes in the extended Levi graph. We bin the graph size into five classes (≤ 30, (30, 40], (40, 50], (50, 60], > 60). We average the sentence-level CHRF++ scores of the sentences in the same bin to plot Figure 7. For small graphs (i.e., |G| ≤ 30), DCGCN obtains similar results to the baselines. For large graphs, DCGCN significantly outperforms the two baselines. In general, as the graph size increases, the gap between DCGCN and the two baselines becomes larger. In addition, we can also notice that the margin between GCN and GCN+LA is quite stable, while the margin between DCGCN and GCN+LA varies according to the graph size. The trend for BLEU scores is similar to that for CHRF++ scores. This suggests that DCGCN can perform better for larger graphs, as its deeper architecture can capture the long-distance dependencies. Dense connections facilitate information propagation in large graphs, while shallow GCNs might struggle to capture such dependencies.

(s / state-01
  :ARG0 (p / person
    :ARG0-of (h / have-org-role-91
      :ARG1 (i / intelligence
        :mod (c / country :wiki "united states"
          :name (n / name :op1 "u.s.")))
      :ARG2 (o / official)))
  :ARG1 (c2 / continue-01
    :ARG0 (p2 / person
      :ARG0-of (h2 / have-org-role-91
        :ARG2 (o2 / official
          :mod (c3 / country :wiki "north korea"
            :name (n2 / name :op1 "north" :op2 "korea")))))
    :ARG1 (t / trade-01
      :ARG1 (t2 / technology
        :purpose (w / weapon
          :ARG2-of (d / destroy-01
            :degree (m / mass))))
      :mod (g / globe))
    :ARG2-of (i2 / include-01
      :ARG1 (i3 / instruct-01
        :ARG3 (m2 / make-01
          :ARG1 (m3 / missile
            :ARG1-of (a / advanced-02)))))))

Reference: u.s. intelligence officials stated that north korean officials are continuing global trade in technology for weapons of mass destruction including instructions for making advanced missiles.

GCN+RC: a u.s. intelligence official stated that north korea officials continued the global trade for weapons of mass destruction by making advanced missiles to make advanced missiles.

GCN+RC+LA: a u.s. intelligence official stated that north korea officials continued global trade with weapons of mass destruction including making advanced missiles.

DCGCN: a u.s. intelligence official stated that north korea officials continue global trade on technology for weapons of mass destruction including instructions to make advanced missiles.

Table 10: Example outputs.

Example Output. Table 10 shows example outputs from three models for the AMR-to-text task, together with the corresponding AMR graph as well as the text reference. The word "technology" in the reference acts as a link between "global trade" and "weapons of mass destruction", offering the background knowledge needed to understand the context. The word "instructions" also plays a crucial role in the generated sentence; without it, the sentence has a significantly different meaning. Both GCN+RC and GCN+RC+LA fail to generate these two important words. The output from GCN+RC does not even appear to be grammatically correct. In contrast, DCGCN manages to generate both words. We believe this is because DCGCN is able to learn richer semantic information by capturing complex long-distance dependencies. GCN+RC+LA does generate an output that looks similar to the reference at the token level. However, the semantic information conveyed in the generated sentence largely differs from that of the reference. DCGCNs do not have this problem.

5 Related Work

Our work builds on a rich line of recent efforts on graph-to-sequence models, graph convolutional networks, and densely connected convolutional networks.

Graph-to-Sequence Learning. Early research efforts for graph-to-sequence learning are based on statistical methods. Lu et al. (2009) present a language generation model using the tree-structured meaning representation based on tree conditional random fields. Lu and Ng (2011) propose a model for language generation from lambda calculus expressions that can be represented as forest structures. Konstas and Lapata (2012, 2013) leverage hypergraphs for concept-to-text generation. Flanigan et al. (2016) transform a given AMR graph into a spanning tree, before translating it into a sentence using a tree-to-string transducer. Pourdamghani et al. (2016) adopt a phrase-based model for machine translation (Koehn et al., 2003) based on a linearized AMR graph. Song et al. (2017) leverage a synchronous node replacement grammar. Konstas et al. (2017) also linearize the input graph and feed it to the Seq2Seq model (Sutskever et al., 2014).

Sequence-based neural networks may lose structural information from the original graph because they require linearization of the input graph. Recent research efforts consider developing encoders with graph neural networks. Beck et al. (2018) use GGNNs (Li et al., 2016) as the encoder and introduce the Levi graph that allows nodes and edges to have their own hidden representations. Song et al. (2018) propose the graph-state LSTM to directly encode graph-level semantics. In order to capture non-local information, the encoder performs graph state transitions by information exchange between connected nodes. Their work belongs to the family of RNNs. Our graph encoder is built based on GCNs. Recurrent graph neural networks (Li et al., 2016; Song et al., 2018) use gated operations to update node states, whereas graph convolutional networks use linear transformations. The contrast between our model and theirs is reminiscent of the contrast between CNN and RNN.

Closest to our work, Bastings et al. (2017) stack GCNs upon an RNN or CNN encoder because 2-layer GCNs may not be able to capture non-local information, especially when the graph is large. Our graph encoder solely relies on the DCGCN model, whose deep network structure encodes richer local and non-local information for learning better graph representations.

Densely Connected Convolutional Networks. Intuitively, neural networks should be able to learn rich representations by stacking a large number of layers. However, empirical results often do not support such an intuition: useful information captured in earlier layers may get lost after passing through subsequent layers. Many recent efforts focus on resolving this issue. Highway Networks (Srivastava et al., 2015) use bypassing paths along with gating units to train networks. ResNets (He et al., 2016), in which identity mappings are used as bypassing paths, have achieved impressive performance on various tasks. DenseNets (Huang et al., 2017) refine this insight and propose a dense connectivity strategy, which connects all layers directly with each other to ensure maximum information flow between layers.

Graph Convolutional Networks. Early efforts that attempt to extend neural networks to deal with arbitrary structured graphs are introduced by Gori et al. (2005) and Scarselli et al. (2009), where the states of nodes are updated based on the states of their neighbors. Bruna (2014) then applies the convolution operation on graph Laplacians to construct efficient architectures in the spectral domain. Subsequent efforts improve its computational efficiency with local spectral convolution techniques (Henaff et al., 2015; Defferrard et al., 2016; Kipf and Welling, 2017).

Our approach is closely related to GCNs (Kipf and Welling, 2017), which restrict the filters to operate on a first-order neighborhood around each node. Recent improvements and extensions of GCNs include using additional aggregation methods such as vertex attention (Velickovic et al., 2018) or pooling mechanisms (Hamilton et al., 2017) to better summarize neighborhood states.

However, the best performance of GCNs is achieved with a 2-layer model, while deeper models perform worse even though they can potentially have access to more non-local information. Li et al. (2018) show that this issue is due to the over-smoothed output representations that impede distinguishing nodes from different clusters. Recent attempts to address this issue include the use of layer-aggregation functions (Xu et al., 2018), which combine learned features from all layers, and the use of co-training and self-training mechanisms that encourage exploration on the entire graph (Li et al., 2018).

6 Conclusion

We introduce the novel densely connected graph convolutional networks (DCGCNs) to learn structural graph representations. Experimental results show that DCGCNs can outperform state-of-the-art models on two tasks: AMR-to-text generation and syntax-based neural machine translation. Unlike previous designs of GCNs, DCGCNs scale naturally to significantly more layers without suffering from performance degradation and optimization difficulties, thanks to the introduced dense connectivity mechanism. Such a deep architecture allows the encoder to better capture the rich structural information of a graph, especially when it is large.

There are multiple avenues for future work. One natural question we would like to ask is how to make use of the proposed framework to perform improved graph representation learning for various graph-related tasks (Xu et al., 2018). On the other hand, we would also like to investigate how other NLP applications such as relation extraction (Zhang et al., 2018b) and semantic role labeling (Marcheggiani and Titov, 2017) can potentially benefit from our proposed approach.

Acknowledgments

We would like to thank the anonymous reviewers and our Action Editor Stefan Riezler for their comments and suggestions on this work. We would also like to thank Daniel Beck, Linfeng Song, Joost Bastings, Zuozhu Liu, and Yiluan Guo for their helpful suggestions. This work is supported by the Singapore Ministry of Education Academic Research Fund (AcRF) Tier 2 Project MOE2017-T2-1-156. This work is also partially supported by SUTD project PIE-SGP-AI-2018-01.

References

Chris Alberti, Daniel Andor, Ivan Bogatyy,Michael Collins, Daniel Gillick, Lingpeng Kong,

Terry Koo, Ji Ma, Mark Omernick, SlavPetrov, Chayut Thanapirom, Zora Tung, andDavid Weiss. 2017. Syntaxnet models forthe conll 2017 shared task. arXiv preprintarxiv:1703.04929.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of LAW@ACL.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of EMNLP.

Daniel Beck, Gholamreza Haffari, and Trevor Cohn. 2018. Graph-to-sequence learning using gated graph neural networks. In Proceedings of ACL.

Joan Bruna. 2014. Spectral networks and deep locally connected networks on graphs. In Proceedings of ICLR.

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.

Michael Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of NIPS.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A toolkit for neural machine translation. arXiv preprint arXiv:1712.05690.

Jeffrey Flanigan, Chris Dyer, Noah A. Smith, and Jaime G. Carbonell. 2016. Generation from abstract meaning representation using tree transducers. In Proceedings of NAACL-HLT.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of ICML.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of CVPR.

Marco Gori, Gabriele Monfardini, and Franco Scarselli. 2005. A new model for learning in graph domains. In Proceedings of IJCNN.

Jonathan L. Gross, Jay Yellen, and Ping Zhang. 2013. Handbook of Graph Theory, Second Edition. Chapman & Hall/CRC.

Zhijiang Guo and Wei Lu. 2018. Better transition-based AMR parsing with a refined search space. In Proceedings of EMNLP.

William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of NIPS.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR.

Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of CVPR.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL (Demo).

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL-HLT.

Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke S. Zettlemoyer. 2017. Neural AMR: Sequence-to-sequence models for parsing and generation. In Proceedings of ACL.

Ioannis Konstas and Mirella Lapata. 2012. Unsupervised concept-to-text generation with hypergraphs. In Proceedings of NAACL-HLT.

Ioannis Konstas and Mirella Lapata. 2013. Inducing document plans for concept-to-text generation. In Proceedings of EMNLP.

Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of AAAI.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2016. Gated graph sequence neural networks. In Proceedings of ICLR.

Wei Lu and Hwee Tou Ng. 2011. A probabilistic forest-to-string model for language generation from typed lambda calculus expressions. In Proceedings of EMNLP.

Wei Lu, Hwee Tou Ng, and Wee Sun Lee. 2009. Natural language generation with tree conditional random fields. In Proceedings of EMNLP.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML.


Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT.

Maja Popovic. 2017. chrF++: Words helping character n-grams. In Proceedings of WMT@ACL.

Nima Pourdamghani, Kevin Knight, and Ulf Hermjakob. 2016. Generating English from abstract meaning representations. In Proceedings of INLG.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL.

Linfeng Song, Xiaochang Peng, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2017. AMR-to-text generation with synchronous node replacement grammar. In Proceedings of ACL.

Linfeng Song, Yue Zhang, Xiaochang Peng, Zhiguo Wang, and Daniel Gildea. 2016. AMR-to-text generation as a traveling salesman problem. In Proceedings of EMNLP.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. A graph-to-sequence model for AMR-to-text generation. In Proceedings of ACL.

Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. 2015. Training very deep networks. In Proceedings of NIPS.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of ACL.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of ICLR.

Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. In Proceedings of ICML.

Yue Zhang, Qi Liu, and Linfeng Song. 2018a. Sentence-state LSTM for text representation. In Proceedings of ACL.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018b. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of EMNLP.
