Assessing the Ability of Self-Attention Networks to Learn Word Order

Baosong Yang† Longyue Wang‡ Derek F. Wong† Lidia S. Chao† Zhaopeng Tu‡∗
†NLP2CT Lab, Department of Computer and Information Science, University of Macau

[email protected], {derekfw,lidiasc}@umac.mo
‡Tencent AI Lab

{vinnylywang,zptu}@tencent.com

Abstract

Self-attention networks (SAN) have attracted a lot of interest due to their high parallelization and strong performance on a variety of NLP tasks, e.g. machine translation. Due to the lack of a recurrence structure such as that in recurrent neural networks (RNN), SAN is ascribed to be weak at learning positional information of words for sequence modeling. However, neither has this speculation been empirically confirmed, nor have explanations for the strong performance on machine translation tasks when “lacking positional information” been explored. To this end, we propose a novel word reordering detection task to quantify how well the word order information is learned by SAN and RNN. Specifically, we randomly move one word to another position, and examine whether a trained model can detect both the original and inserted positions. Experimental results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning the positional information even with the position embedding; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, in which position embedding plays a critical role. Although the recurrence structure makes the model more universally effective at learning word order, learning objectives matter more in downstream tasks such as machine translation.

1 Introduction

Self-attention networks (SAN, Parikh et al., 2016; Lin et al., 2017) have shown promising empirical results in a variety of natural language processing (NLP) tasks, such as machine translation (Vaswani et al., 2017), semantic role labelling (Strubell et al., 2018), and language representations (Devlin et al., 2019).

∗ Zhaopeng Tu is the corresponding author of the paper. This work was conducted when Baosong Yang was interning at Tencent AI Lab.

The popularity of SAN lies in its high parallelization in computation, and its flexibility in modeling dependencies regardless of distance by explicitly attending to all the signals. Position embedding (Gehring et al., 2017) is generally deployed to capture sequential information for SAN (Vaswani et al., 2017; Shaw et al., 2018).

Recent studies claimed that SAN with position embedding is still weak at learning word order information, due to the lack of the recurrence structure that is essential for sequence modeling (Shen et al., 2018a; Chen et al., 2018; Hao et al., 2019). However, such claims are mainly based on a theoretical argument, which has not been empirically validated. In addition, this cannot explain well why SAN-based models outperform their RNN counterpart in machine translation – a benchmark sequence modeling task (Vaswani et al., 2017).

Our goal in this work is to empirically assess the ability of SAN to learn word order. We focus on asking the following research questions:

Q1: Is the recurrence structure obligatory for learning word order, and does the conclusion hold in different scenarios (e.g., translation)?

Q2: Is the model architecture the critical factor for learning word order in downstream tasks such as machine translation?

Q3: Is position embedding powerful enough to capture word order information for SAN?

We approach these questions with a novel probing task – word reordering detection (WRD), which aims to detect the positions of randomly reordered words in the input sentence. We compare SAN with RNN, as well as directional SAN (DiSAN, Shen et al., 2018a) that augments SAN with recurrence modeling. In this study, we focus on the encoders implemented with different architectures, so as to investigate their abilities to learn word order information of the input sequence.


Figure 1: Illustration of (a) the position detector, where (b) the output layer is built upon a randomly initialized or pre-trained encoder. In this example, the word “hold” is moved to another place. The goal of this task is to predict the inserted position “I” and the original position “O” of “hold”.

The encoders are trained on objectives like detection accuracy and machine translation, to study the influence of learning objectives.

Our experimental results reveal that: (Q1) SAN indeed underperforms the architectures with recurrence modeling (i.e. RNN and DiSAN) on the WRD task, while this conclusion does not hold in machine translation: SAN trained with the translation objective outperforms both RNN and DiSAN on detection accuracy; (Q2) Learning objectives matter more than model architectures in downstream tasks such as machine translation; and (Q3) Position encoding is good enough for SAN in machine translation, while DiSAN is a more universally-effective mechanism to learn word order information for SAN.

Contributions The key contributions are:

• We design a novel probing task along with the corresponding benchmark model, which can assess the abilities of different architectures to learn word order information.1

• Our study dispels the doubt on the inability of SAN to learn word order information in machine translation, indicating that the learning objective can greatly influence the suitability of an architecture for downstream tasks.

2 Word Reordering Detection Task

In order to investigate the ability of self-attention networks to extract word order information, in this section we design an artificial task to evaluate the abilities of the examined models to detect erroneous word orders in a given sequence.

1The data and codes are released at: https://github.com/baosongyang/WRD.

Task Description Given a sentence $X = \{x_1, ..., x_i, ..., x_N\}$, we randomly pop a word $x_i$ and insert it into another position $j$ ($1 \leq i, j \leq N$ and $i \neq j$). The objective of this task is to detect both the position from which the word is popped out (labelled as “O”), as well as the position into which the word is inserted (labelled as “I”). As seen in the example in Figure 1 (a), the word “hold” is moved from the 2nd slot to the 4th slot. Accordingly, the 2nd and 4th slots are labelled as “O” and “I”, respectively. To exactly detect word reordering, the examined models have to learn to recognize both the normal and abnormal word order in a sentence.
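To make the construction concrete, the following minimal Python sketch (our own illustration, not the authors' released code; the function name make_wrd_instance is hypothetical) pops one random token and re-inserts it at a different slot, returning the reordered sentence together with the “I” and “O” slot indices:

import random

def make_wrd_instance(tokens, rng=random):
    """Pop one random token and insert it at a different slot.

    Returns (reordered tokens, "I" index, "O" index). Both indices refer to
    slots of the reordered sentence, following the example in Figure 1 (a);
    the exact bookkeeping when the word moves leftwards is our simplification.
    """
    assert len(tokens) > 1
    i = rng.randrange(len(tokens))                              # original slot ("O")
    j = rng.choice([p for p in range(len(tokens)) if p != i])   # target slot ("I")
    reordered = tokens[:i] + tokens[i + 1:]                     # pop x_i
    reordered.insert(j, tokens[i])                              # re-insert it at slot j
    return reordered, j, i

# Example input from Figure 1: "hold" may be moved, e.g., from the 2nd to the 4th slot.
reordered, ins_pos, orig_pos = make_wrd_instance("Bush hold a talk with Sharon .".split())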

Position Detector Figure 1 (a) depicts the architecture of the position detector. Let the sequential representations $H = \{h_1, ..., h_N\}$ be the output of each encoder noted in Section 3, which are fed to the output layer (Figure 1 (b)). Since only one pair of “I” and “O” labels should be generated in the output sequence, we cast the task as a pointer detection problem (Vinyals et al., 2015). To this end, we turn to an output layer that is commonly used in the reading comprehension task (Wang and Jiang, 2017; Du and Cardie, 2017), which aims to identify the start and end positions of the answer in the given text.2 The output layer consists of two sub-layers, which progressively predict the probabilities of each position being labelled as “I” and “O”.

2Contrary to reading comprehension, in which the start and end positions are ordered, “I” and “O” do not have to be ordered in our task; that is, the popped word can be inserted to either a left or right position.


The probability distribution of the sequence being labelled as “I” is calculated as:

$P_I = \mathrm{SoftMax}\big(U_I^{\top} \tanh(W_I H)\big) \in \mathbb{R}^{N}$    (1)

where $W_I \in \mathbb{R}^{d \times d}$ and $U_I \in \mathbb{R}^{d}$ are trainable parameters, and $d$ is the dimensionality of $H$.

The second layer aims to locate the original position “O”, which conditions on the predicted popped word at the position “I”.3 To make the learning process differentiable, we follow Xu et al. (2017) to use the weighted sum of hidden states as the approximate embedding $E$ of the popped word. The embedding subsequently serves as a query to attend to the sequence $H$ to find which position is most similar to the original position of the popped word. The probability distribution of the sequence being labelled as “O” is calculated as:

$E = P_I (W_Q H) \in \mathbb{R}^{d}$    (2)

$P_O = \mathrm{ATT}(E, W_K H) \in \mathbb{R}^{N}$    (3)

where $\{W_Q, W_K\} \in \mathbb{R}^{d \times d}$ are trainable parameters that transform $H$ to the query and key spaces, respectively. $\mathrm{ATT}(\cdot)$ denotes the dot-product attention (Luong et al., 2015; Vaswani et al., 2017).

Training and Predicting In the training process, the objective is to minimize the cross entropy of the true inserted and original positions, which is the sum of the negative log probabilities of the groundtruth indices under the predicted distributions:

$\mathcal{L} = Q_I^{\top} \log P_I + Q_O^{\top} \log P_O$    (4)

where $\{Q_I, Q_O\} \in \mathbb{R}^{N}$ are one-hot vectors that indicate the groundtruth indices of the inserted and original positions. During prediction, we choose the positions with the highest probabilities from the distributions $P_I$ and $P_O$ as “I” and “O”, respectively. Considering the instance in Figure 1 (a), the 4th position is labelled as the inserted position “I”, and the 2nd position as the original position “O”.
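For concreteness, below is a minimal PyTorch sketch of this two-step output layer (Eqs. 1–3). It is our own illustration under assumed batch-major tensor shapes, not the released implementation; the class name PositionDetectorHead is hypothetical, and the attention of Eq. (3) is realized as unscaled dot-product attention.

import torch
import torch.nn as nn

class PositionDetectorHead(nn.Module):
    """Pointer-style output layer sketch following Eqs. (1)-(3)."""
    def __init__(self, d):
        super().__init__()
        self.w_i = nn.Linear(d, d, bias=False)   # W_I
        self.u_i = nn.Linear(d, 1, bias=False)   # U_I
        self.w_q = nn.Linear(d, d, bias=False)   # W_Q
        self.w_k = nn.Linear(d, d, bias=False)   # W_K

    def forward(self, h):                        # h: [batch, N, d] encoder outputs
        # Eq. (1): distribution P_I over the N positions
        p_i = torch.softmax(self.u_i(torch.tanh(self.w_i(h))).squeeze(-1), dim=-1)
        # Eq. (2): approximate embedding E of the popped word (weighted sum of states)
        e = torch.bmm(p_i.unsqueeze(1), self.w_q(h)).squeeze(1)             # [batch, d]
        # Eq. (3): dot-product attention of E over the key space yields P_O
        scores = torch.bmm(self.w_k(h), e.unsqueeze(-1)).squeeze(-1)        # [batch, N]
        return p_i, torch.softmax(scores, dim=-1)

The loss of Eq. (4) then amounts to gathering the log-probabilities of the gold “I” and “O” indices from the two returned distributions.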

3 Experimental Setup

In this study, we strive to empirically test whether SAN is indeed weak at learning positional information, and to come up with a reason for the strong performance of SAN on machine translation. In response to the three research questions in Section 1, we adopt the following experimental settings:

3We tried to predict the position of “O” without feeding the approximate embedding, i.e. predicting “I” and “O” individually. It slightly underperforms the current model.

• Q1: We compare SAN with two recurrence architectures – RNN and DiSAN – on the WRD task, so as to quantify their abilities to learn word order (Section 3.1).

• Q2: To compare the effects of learning objectives and model architectures, we train each encoder under two scenarios, i.e. trained on objectives like WRD accuracy and on machine translation (Section 3.2).

• Q3: The strength of position encoding is appraised by ablating position encoding and recurrence modeling for SAN.

3.1 Encoder Setting


Figure 2: Illustration of (a) RNN; (b) SAN; and (c) DiSAN. Colored arrows denote parallel operations.

RNN and SAN are commonly used to produce sentence representations on NLP tasks (Cho et al., 2014; Lin et al., 2017; Chen et al., 2018). As shown in Figure 2, we investigate three architectures in this study. Mathematically, let $X = \{x_1, \ldots, x_N\} \in \mathbb{R}^{d \times N}$ be the embedding matrix of the input sentence, and $H = \{h_1, \ldots, h_N\} \in \mathbb{R}^{d \times N}$ be the output sequence of representations.

• RNN sequentially produces each state:

$h_n = f(h_{n-1}, x_n),$    (5)

where $f(\cdot)$ is a GRU (Cho et al., 2014) in this study. RNN is particularly hard to parallelize due to its inherent dependence on the previous state $h_{n-1}$.

• SAN (Lin et al., 2017) produces each hidden state in a parallel fashion:

$h_n = \mathrm{ATT}(q_n, K)\, V,$    (6)

where the query $q_n \in \mathbb{R}^{d}$ and the keys and values $(K, V) \in \mathbb{R}^{d \times N}$ are transformed from $X$. To imitate the order of the sequence, Vaswani et al. (2017) deployed position encodings (Gehring et al., 2017) into SAN.


• DiSAN (Shen et al., 2018a) augments SAN with the ability to encode word order:

$h_n = \mathrm{ATT}(q_n, K_{\leq n})\, V_{\leq n},$    (7)

where $(K_{\leq n}, V_{\leq n})$ indicate the leftward elements, e.g., $K_{\leq n} = \{k_1, \ldots, k_n\}$.
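To illustrate the difference between Eq. (6) and Eq. (7), the sketch below (our own, single-head and unscaled for brevity; not the authors' implementation) contrasts full self-attention with its forward-masked, directional variant:

import torch

def attention(q, k, v, mask=None):
    """Unscaled dot-product attention ATT(q, K) V; q, k, v have shape [N, d]."""
    scores = q @ k.transpose(0, 1)                      # [N, N] compatibility scores
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v            # [N, d]

N, d = 6, 8
x = torch.randn(N, d)                                   # toy queries/keys/values transformed from X
h_san = attention(x, x, x)                              # Eq. (6): each position attends to all positions
fwd = torch.tril(torch.ones(N, N, dtype=torch.bool))    # position n only sees positions <= n
h_disan = attention(x, x, x, mask=fwd)                  # Eq. (7): attention restricted to K_<=n, V_<=n

As noted below, the directional model is used in a bi-directional setting, i.e. an analogous backward (upper-triangular) mask is applied as well.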

To enable a fair comparison of architectures, we only vary the sub-layer of sequence modeling (e.g. the SAN sub-layer) in the Transformer encoder (Vaswani et al., 2017), and keep the other components the same for all architectures. We use a bi-directional setting for RNN and DiSAN, and apply position embedding for SAN and DiSAN. We follow Vaswani et al. (2017) to set the configurations of the encoders, each of which consists of 6 stacked layers with the layer size being 512.

3.2 Learning Objectives

In this study, we employ two strategies to train the encoders, which differ in the learning objectives and the data used to train the associated parameters. Note that in both strategies, the output layer in Figure 2 is trained on the WRD data with the word reordering detection objective.

WRD Encoders We first directly train the encoders on the WRD data, to evaluate the abilities of the model architectures. The WRD encoders are randomly initialized and co-trained with the output layer. Accordingly, the detection accuracy can be treated as the learning objective of this group of encoders. Meanwhile, we can investigate the reliability of the proposed WRD task by checking whether the performances of different architectures (i.e. RNN, SAN, and DiSAN) are consistent with previous findings on other benchmark NLP tasks (Shen et al., 2018a; Tang et al., 2018; Tran et al., 2018; Devlin et al., 2019).

NMT Encoders To quantify how well different architectures learn word order information with the learning objective of machine translation, we first train the NMT models (both encoder and decoder) on the bilingual corpora using the same configuration reported by Vaswani et al. (2017). Then, we fix the parameters of the encoder, and only train the parameters associated with the output layer on the WRD data. In this way, we can probe the representations learned by NMT models for their abilities to learn word order of input sentences.
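A minimal sketch of this probing setup, assuming hypothetical stand-in modules for the pre-trained encoder and the WRD output layer (not the actual Transformer encoder or released code): the encoder parameters are frozen and only the output layer is optimized on the WRD data.

import torch
import torch.nn as nn

d = 512
encoder = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # stand-in for the pre-trained NMT encoder
head = nn.Linear(d, 2)                                                # stand-in for the pointer-style output layer

for p in encoder.parameters():
    p.requires_grad = False   # freeze the pre-trained encoder: it is only probed, not updated

# Only the output-layer parameters are updated on the WRD data.
optimizer = torch.optim.Adam(head.parameters(), betas=(0.9, 0.98), eps=1e-9)

For the WRD encoders described above, the same optimizer is simply given both parameter groups and the encoder is left unfrozen.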

To cope with the WRD task, all the models were trained for 600K steps, each of which is allocated a batch of 500 sentences. The training set is shuffled after each epoch. We use Adam (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. The learning rate linearly warms up over the first 4,000 steps, and decreases thereafter proportionally to the inverse square root of the step number. We use a dropout rate of 0.1 on all layers.
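The warmup-then-decay schedule above can be written as a small helper. This is a sketch: the d_model^{-0.5} scale factor follows the standard Transformer schedule of Vaswani et al. (2017) and is our assumption, since the paper only states the warmup length and the decay shape.

def transformer_lr(step, d_model=512, warmup=4000):
    """Linear warmup over `warmup` steps, then decay proportional to the
    inverse square root of the step number (Vaswani et al., 2017)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate rises linearly until step 4,000 and then decays as 1/sqrt(step).
print(transformer_lr(1000), transformer_lr(4000), transformer_lr(160000))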

3.3 Data

Machine Translation We pre-train the NMT models on the benchmark WMT14 English⇒German (En⇒De) data, which consists of 4.5M sentence pairs. The validation and test sets are newstest2013 and newstest2014, respectively. To demonstrate the universality of the findings in this study, we also conduct experiments on the WAT17 English⇒Japanese (En⇒Ja) data. Specifically, we follow Morishita et al. (2017) to use the first two sections of the WAT17 dataset as the training data, which approximately consists of 2.0M sentence pairs. We use newsdev2017 as the validation set and newstest2017 as the test set.

Word Reordering Detection We conduct this task on English sentences, which are extracted from the source side of the WMT14 En⇒De data with a maximum length of 80. For each sentence in the different sets (i.e. training, validation, and test sets), we construct an instance by randomly moving a word to another position. Finally, we construct 7M, 10K and 10K samples for training, validation and testing, respectively. Note that a sentence can be sampled multiple times; thus each dataset in the WRD data contains more instances than its counterpart in the machine translation data.

All the English and German data are tokenized using the scripts in Moses. The Japanese sentences are segmented by the word segmentation toolkit KyTea (Neubig et al., 2011). To reduce the vocabulary size, all the sentences are processed by byte-pair encoding (BPE) (Sennrich et al., 2016) with 32K merge operations for all the data.

4 Experimental Results

We return to the central question originally posed, that is, whether SAN is indeed weak at learning positional information. Using the above experimental design, we give the following answers:

A1: The SAN-based encoder trained on the WRD data indeed has more difficulty learning positional information than the recurrence architectures (Section 4.1), while there is no evidence that SAN-based NMT encoders learn less word order information (Section 4.2);


Models    Insert    Original    Both
RNN       78.4      73.4        68.2
SAN       73.2      66.0        60.1
DiSAN     79.6      70.1        68.0

Table 1: Accuracy on the WRD task. “Insert” and “Original” denote the accuracies of detecting the inserted and original positions respectively, and “Both” denotes detecting both positions.

Figure 3: Learning curve of WRD encoders on the WRD task. The Y-axis denotes the accuracy on the validation set. Obviously, SAN has slower convergence.


A2: The learning objective plays a more crucial role than the architecture in learning word order in downstream tasks (Section 4.3);

A3: While the position encoding is powerful enough to capture word order information in machine translation, DiSAN is a more universally-effective mechanism (Table 2).

4.1 Results on WRD Encoders

We first check the performance of each WRD encoder on the proposed WRD task from two aspects: 1) WRD accuracy; and 2) learning ability.

WRD Accuracy The detection results are summarized in Table 1. As seen, both RNN and DiSAN significantly outperform SAN on our task, indicating that the recurrence structure (RNN) indeed performs better than parallelization (SAN) at capturing word order information in a sentence. Nevertheless, the drawback can be alleviated by applying directional attention functions. The comparable result between DiSAN and RNN confirms the hypothesis by Shen et al. (2018a) and Devlin et al. (2019) that directional SAN indeed improves the ability of SAN to learn word order. The consistency between prior studies and our results verifies the reliability of the proposed WRD task.

Learning Curve We visualize the learning curve of the training. As shown in Figure 3, SAN has much slower convergence than the others, showing that SAN has a harder time learning word order information than RNN and DiSAN. This is consistent with our intuition that a parallel structure has more difficulty learning word order information than models with a sequential process. Considering DiSAN, although it has a slightly slower learning speed at the early stage of training, it is able to achieve comparable accuracy to RNN in the mid and late phases of training.

4.2 Results on Pre-Trained NMT Encoders

We investigate whether SAN indeed lacks the ability to learn word order information in the machine translation context. The results are summarized in Table 2. We first report the effectiveness of the compared models on the translation tasks. For En⇒De translation, SAN outperforms RNN, which is consistent with the results reported in (Chen et al., 2018). The tendency is universal on En⇒Ja, which is a distant language pair (Bosch and Sebastian-Galles, 2001; Isozaki et al., 2010). Moreover, DiSAN incrementally improves the translation quality, demonstrating that modeling directional information benefits translation quality. The consistent translation performances make the following evaluation on WRD accuracy convincing.

Concerning the performances of the NMT encoders on the WRD task:

SAN-based NMT Encoder Performs Better It is surprising to see that SAN yields even higher accuracy on the WRD task than the other pre-trained NMT encoders, despite its lower translation quality compared with DiSAN. The results not only dispel the doubt on the inability of the SAN-based encoder to learn word order in machine translation, but also demonstrate that SAN learns to retain more features with respect to word order during the training of machine translation.

Learning Objectives Matter More In addition, both the NMT encoders underperform the WRD encoders on the detection task across models and language pairs.4 The only difference between the two kinds of encoders is the learning objective.

4The En⇒Ja pre-trained encoders yield lower accuracy on the WRD task than the En⇒De pre-trained encoders. We attribute this to the difference between the source sentences in the pre-training corpus (En⇒Ja) and those of the WRD data (from the En⇒De dataset). Despite this, the tendency of the results is consistent across language pairs.


Model         Translation            Detection
              En⇒De     En⇒Ja        En⇒De Enc.   En⇒Ja Enc.   WRD Enc.
RNN           26.8      42.9         33.9         29.0         68.2
SAN           27.3      43.6         41.6         32.8         60.1
  - Pos Emb   11.5      –            0.3          –            0.3
DiSAN         27.6      43.7         39.7         31.2         68.0
  - Pos Emb   27.0      43.1         40.1         31.0         62.8

Table 2: Performances of NMT encoders pre-trained on the WMT14 En⇒De and WAT17 En⇒Ja data. “Translation” denotes translation quality measured in BLEU scores, while “Detection” denotes the accuracies on the WRD task. “En⇒De Enc.” denotes the NMT encoder trained with the translation objective on the En⇒De data. We also list the detection accuracies of WRD encoders (“WRD Enc.”) for comparison. “- Pos Emb” indicates removing positional embeddings from the SAN- or DiSAN-based encoder. Surprisingly, the SAN-based NMT encoder achieves the best accuracy on the WRD task, which contrasts with the performances of the WRD encoders (the last column).


Figure 4: Accuracy of pre-trained NMT encoders according to various distances between the positions of “O” and “I” (X-axis). As seen, the performance of each WRD encoder (a) is stable across various distances, while the pre-trained (b) En⇒De and (c) En⇒Ja encoders consistently get lower accuracy as the distance increases.

This raises the hypothesis that the learning objective sometimes serves as a more critical factor than the model architecture in modeling word order.

Position Encoding VS. Recurrence Modeling In order to assess the importance of position encoding, we redo the experiments by removing the position encoding from SAN and DiSAN (“- Pos Emb”). Clearly, the SAN-based encoder without position embedding fails on both machine translation and our WRD task, indicating the necessity of position encoding for learning word order. It is encouraging to see that SAN yields a higher BLEU score and detection accuracy than “DiSAN - Pos Emb” in the machine translation scenario. This means that position embedding is more suitable for capturing word order information for machine translation than modeling recurrence for SAN.


Considering both scenarios, DiSAN-based encoders achieve comparable detection accuracies to the best models, revealing the effectiveness and universality of DiSAN in learning word order.

4.3 Analysis

In response to the above results, we provide further analyses to verify our hypothesis on NMT encoders. We discuss three questions in this section: 1) Does the learning objective indeed affect the extraction of word order information? 2) How does SAN derive word order information from position encoding? and 3) Is more retained word order information useful for machine translation?

Accuracy According to Distance We further investigate the accuracy on the WRD task according to various distances between the positions where the word is popped out and inserted. As shown in Figure 4 (a), WRD encoders only marginally reduce their performance with increasing distance.


Figure 5: Performance of each layer from (a) the pre-trained En⇒De encoder and (b) the pre-trained En⇒Ja encoder on the WRD task. The evaluation is conducted on the test set. Clearly, the accuracy of SAN gradually increases with the stacking of layers and consistently outperforms that of the other models across layers.

However, this kind of stability is destroyed when we pre-train each encoder with the learning objective of machine translation. As seen in Figure 4 (b) and (c), the performance of the pre-trained NMT encoders becomes obviously worse on long-distance cases across language pairs and model variants. This is consistent with prior observations on NMT systems that both RNN and SAN fail to fully capture long-distance dependencies (Tai et al., 2015; Yang et al., 2017; Tang et al., 2018).

From the perspective of the information bottleneck principle (Tishby and Zaslavsky, 2015; Alemi et al., 2016), our NMT models are trained to maximally maintain the relevant information between source and target, while abandoning irrelevant features in the source sentence, e.g. a portion of the word order information. Different NLP tasks have distinct requirements on linguistic information (Conneau et al., 2018). For machine translation, local patterns (e.g. phrases) matter more (Luong et al., 2015; Yang et al., 2018, 2019), while long-distance word order information plays a relatively trivial role in understanding the meaning of a source sentence. Recent studies also pointed out that abandoning irrelevant features in the source sentence benefits some downstream NLP tasks (Lei et al., 2016; Yu et al., 2017; Shen et al., 2018b). An immediate consequence of this kind of data processing inequality (Schumacher and Nielsen, 1996) is that information about word order that is lost in the encoder cannot be recovered in the detector, which consequently drops the performance on our WRD task. The results verify that the learning objective indeed affects learning word order information more than the model architecture does in our case.

Accuracy According to Layer Several researchers may doubt whether the parallel structure of SAN leads to a failure to capture word order information at higher layers, since the position embeddings are merely injected at the input layer. Accordingly, we further probe the representations at each layer on our WRD task to explore how SAN learns word order information. As seen in Figure 5, SAN achieves better performance than the other NMT encoders on the proposed WRD task across almost all the layers. The result dispels the doubt on the inability of position encoding and confirms the speculation by Vaswani et al. (2017) and Shaw et al. (2018), who suggested that SAN can profit from the use of residual networks which propagate the positional information to higher layers. Moreover, both SAN and RNN gradually increase their performance on our task with the stacking of layers. The same tendency demonstrates that position encoding is able to provide a similar learning manner to that of the recurrent structure with respect to word order for SAN. Both results confirm the strength of position encoding in bringing word order properties into SAN.

We strove to come up with a reason why SAN captures even more word order information in the machine translation task. Yin et al. (2017) and Tran et al. (2018) found that approaches with a recurrence structure (e.g. RNN) have an easier time learning syntactic information than models with a parallel structure (e.g. CNN, SAN). Inspired by their findings, we argue that SAN tries to partially countervail its disadvantage of parallel structure by preserving more word order information, so as to help the encoding of deeper linguistic properties required by machine translation.

Figure 6: The differences in translation performance when the pre-trained NMT models are fed with the original (“Golden”) and reordered (“Reorder”) source sentences. As seen, SAN and DiSAN perform better at handling noise in the form of erroneous word order.

Recent studies on multi-layer learning showed that different layers tend to learn distinct linguistic information (Peters et al., 2018; Raganato and Tiedemann, 2018; Li et al., 2019). The better accuracy achieved by SAN across layers indicates that SAN indeed tries to preserve more word order information during the learning of other linguistic properties for translation purposes.

Effect of Wrong Word Order Noises For humans, a small number of erroneous word orders in a sentence usually does not affect comprehension. For example, we can understand the meaning of the English sentence “Dropped the boy the ball.”, despite its erroneous word order. It is intriguing whether an NMT model has the ability to tackle such wrong-order noise. Accordingly, we inject erroneous word order noise into the English-German development set by moving one word to another position, and evaluate the drop in translation quality of each model. As shown in Figure 6, SAN and DiSAN yield smaller drops in translation quality than their RNN counterpart, demonstrating the effectiveness of self-attention in handling wrong-order noise. We attribute this to the fact that models (e.g. RNN-based models) will not learn to be robust to errors that are never observed during training (Sperber et al., 2017; Cheng et al., 2018). On the contrary, since the SAN-based NMT encoder is good at recognizing and preserving anomalous word order information in the NMT context, it may raise the ability of the decoder to handle noise occurring in the training set, and thus be more robust in translating sentences with anomalous word order.
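As a sketch of this robustness check (our own illustration; translate is a placeholder for decoding with a trained NMT model, and sacrebleu is our choice of BLEU scorer, not necessarily the one used in the paper):

import random
import sacrebleu

def perturb(sentence, rng=random):
    """Inject one wrong-word-order noise: move a random word to another position."""
    toks = sentence.split()
    if len(toks) < 2:
        return sentence
    i = rng.randrange(len(toks))
    w = toks.pop(i)
    toks.insert(rng.choice([p for p in range(len(toks) + 1) if p != i]), w)
    return " ".join(toks)

def bleu_drop(translate, sources, references):
    """BLEU on the original sources minus BLEU on the perturbed sources."""
    clean = sacrebleu.corpus_bleu(translate(sources), [references]).score
    noisy = sacrebleu.corpus_bleu(translate([perturb(s) for s in sources]), [references]).score
    return clean - noisy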

5 Related Work

Exploring Properties of SAN SAN has yielded strong empirical performance in a variety of NLP tasks (Vaswani et al., 2017; Tan et al., 2018; Li et al., 2018; Devlin et al., 2019). In response to these impressive results, several studies have emerged with the goal of understanding many properties of SAN. For example, Tran et al. (2018) compared SAN and RNN on language inference tasks, and pointed out that SAN is weaker at learning hierarchical structure than its RNN counterpart. Moreover, Tang et al. (2018) conducted experiments on subject-verb agreement and word sense disambiguation tasks. They found that SAN is good at extracting semantic properties, while it underperforms RNN on capturing long-distance dependencies. This is in contrast to our intuition that SAN is good at capturing long-distance dependencies. In this work, we focus on exploring the ability of SAN to model word order information.

Probing Task on Word Order To open the black box of networks, probing tasks are used as a first step to facilitate comparing different models at a much finer-grained level. Most work has focused on probing fixed-sentence encoders, e.g. sentence embeddings (Adi et al., 2017; Conneau et al., 2018). Among them, Adi et al. (2017) and Conneau et al. (2018) proposed to probe the sensitivity to legal word order by detecting whether there exists a pair of permuted words in a sentence given its sentence embedding. However, analysis on sentence encodings may introduce confounds, making it difficult to infer whether the relevant information is encoded within the specific position of interest or rather inferred from diffuse information elsewhere in the sentence (Tenney et al., 2019). In this study, we directly probe the token representations for word- and phrase-level properties, which has been widely used for probing token-level representations learned in neural machine translation systems, e.g. part-of-speech, semantic tags, morphology as well as constituent structure (Shi et al., 2016; Belinkov et al., 2017; Blevins et al., 2018).

6 Conclusion

In this paper, we introduce a novel word reordering detection task which can probe the ability of a model to extract word order information. With the help of the proposed task, we evaluate RNN, SAN and DiSAN upon the Transformer framework to empirically test the theoretical claims that SAN lacks the ability to learn word order.


The results reveal that RNN and DiSAN indeed perform better than SAN at extracting word order information when they are trained individually for our task. However, there is no evidence that SAN learns less word order information in the machine translation context.

Our further analyses of the encoders pre-trained on the NMT data suggest that: 1) the learning objective sometimes plays a crucial role in learning a specific feature (e.g. word order) in a downstream NLP task; 2) modeling recurrence is universally effective for learning word order information for SAN; and 3) RNN is more sensitive to erroneous word order noise in a machine translation system. These observations facilitate the understanding of different tasks and model architectures at a finer-grained level, rather than merely in terms of an overall score (e.g. BLEU). As our approach is not limited to NMT encoders, it is also interesting to explore how models trained on other NLP tasks learn word order information.

Acknowledgments

The work was partly supported by the National Natural Science Foundation of China (Grant No. 61672555), the Joint Project of the Macao Science and Technology Development Fund and National Natural Science Foundation of China (Grant No. 045/2017/AFJ) and the Multi-Year Research Grant from the University of Macau (Grant No. MYRG2017-00087-FST). We thank the anonymous reviewers for their insightful comments.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. In ICLR.

Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. 2016. Deep Variational Information Bottleneck. In ICLR.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do Neural Machine Translation Models Learn about Morphology? In ACL.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs Encode Soft Hierarchical Syntax. In ACL.

Laura Bosch and Nuria Sebastian-Galles. 2001. Evidence of Early Language Discrimination Abilities in Infants from Bilingual Environments. Infancy, 2(1):29–49.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. In ACL.

Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang Liu. 2018. Towards Robust Neural Machine Translation. In ACL.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What You Can Cram into A Single $&!#∗ Vector: Probing Sentence Embeddings for Linguistic Properties. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.

Xinya Du and Claire Cardie. 2017. Identifying Where to Focus in Reading Comprehension for Neural Question Generation. In EMNLP.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional Sequence to Sequence Learning. In ICML.

Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, and Zhaopeng Tu. 2019. Modeling Recurrence for Transformer. In NAACL.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic Evaluation of Translation Quality for Distant Language Pairs. In EMNLP.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing Neural Predictions. In EMNLP.

Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In EMNLP.

Jian Li, Baosong Yang, Zi-Yi Dou, Xing Wang, Michael R. Lyu, and Zhaopeng Tu. 2019. Information Aggregation for Multi-Head Attention with Routing-by-Agreement. In NAACL.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A Structured Self-attentive Sentence Embedding. In ICLR.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP.

Makoto Morishita, Jun Suzuki, and Masaaki Nagata. 2017. NTT Neural Machine Translation Systems at WAT 2017. In WAT.

Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis. In ACL.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. In EMNLP.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL.

Alessandro Raganato and Jörg Tiedemann. 2018. An Analysis of Encoder Representations in Transformer-Based Machine Translation. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Benjamin Schumacher and Michael A. Nielsen. 1996. Quantum Data Processing and Error Correction. Physical Review A, 54(4):2629.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In NAACL.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018a. DiSAN: Directional Self-attention Network for RNN/CNN-free Language Understanding. In AAAI.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Sen Wang, and Chengqi Zhang. 2018b. Reinforced Self-attention Network: A Hybrid of Hard and Soft Attention for Sequence Modeling. In IJCAI.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does String-Based Neural MT Learn Source Syntax? In EMNLP.

Matthias Sperber, Jan Niehues, and Alex Waibel. 2017. Toward Robust Neural Machine Translation for Noisy Input Sequences. In IWSLT.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In EMNLP.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In ACL.

Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep Semantic Role Labeling with Self-attention. In AAAI.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In EMNLP.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, et al. 2019. What Do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations. In ICLR.

Naftali Tishby and Noga Zaslavsky. 2015. Deep Learning and the Information Bottleneck Principle. In ITW.

Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The Importance of Being Recurrent for Modeling Hierarchical Structure. In EMNLP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. In NIPS.

Shuohang Wang and Jing Jiang. 2017. Machine Comprehension Using Match-LSTM and Answer Pointer. In ICLR.

Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, Xiaolong Wang, Zhuoran Wang, and Chao Qi. 2017. Neural Response Generation via GAN with an Approximate Embedding Layer. In EMNLP.

Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling Localness for Self-Attention Networks. In EMNLP.

Baosong Yang, Longyue Wang, Derek F. Wong, Lidia S. Chao, and Zhaopeng Tu. 2019. Convolutional Self-Attention Networks. In NAACL.

Baosong Yang, Derek F. Wong, Tong Xiao, Lidia S. Chao, and Jingbo Zhu. 2017. Towards Bidirectional Hierarchical Representations for Attention-based Neural Machine Translation. In EMNLP.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative Study of CNN and RNN for Natural Language Processing. arXiv preprint arXiv:1702.01923.

Adams Wei Yu, Hongrae Lee, and Quoc Le. 2017. Learning to Skim Text. In ACL.