
Learning to Actively Learn Neural Machine Translation

Ming Liu    Wray Buntine    Gholamreza Haffari
Faculty of Information Technology, Monash University
{ming.m.liu, wray.buntine, gholamreza.haffari}@monash.edu

Abstract

Traditional active learning (AL) methods for machine translation (MT) rely on heuristics. However, these heuristics are limited when the characteristics of the MT problem change due to, e.g., the language pair or the amount of the initial bitext. In this paper, we present a framework to learn sentence selection strategies for neural MT. We train the AL query strategy using a high-resource language pair based on AL simulations, and then transfer it to the low-resource language pair of interest. The learned query strategy capitalizes on the shared characteristics between the language pairs to make effective use of the AL budget. Our experiments on three language pairs confirm that our method is more effective than strong heuristic-based methods in various conditions, including cold-start and warm-start as well as small and extremely small data conditions.

1 Introduction

Parallel training bitext plays a key role in the quality of neural machine translation (NMT). Learning high-quality NMT models in bilingually low-resource scenarios is one of the key challenges, as NMT's quality degrades severely in such settings (Koehn and Knowles, 2017).

Recently, the importance of learning NMT models in scarce parallel bitext scenarios has gained attention. Unsupervised approaches try to learn NMT models without the need for parallel bitext (Artetxe et al., 2017; Lample et al., 2017). Dual learning/backtranslation tries to start off from a small amount of bilingual text, and leverage monolingual text in the source and target languages (Sennrich et al., 2015a; He et al., 2016). Zero/few-shot approaches attempt to transfer NMT models learned in rich bilingual settings to low-resource settings (Johnson et al., 2016; Gu et al., 2018).

In this paper, we approach this problem from the active learning (AL) perspective. Assuming the availability of an annotation budget and a pool of monolingual source text as well as a small training bitext, the goal is to select the most useful source sentences and query their translations from an oracle up to the annotation budget. The queried sentences need to be selected carefully to get the most value for the budget, i.e. the highest improvements in the translation quality of the retrained model. The AL approach is orthogonal to the aforementioned approaches to bilingually low-resource NMT, and can potentially be combined with them.

We present a framework to learn the sentence selection policy most suitable and effective for the NMT task at hand. This is in contrast to the majority of work in AL-MT, where hard-coded heuristics are used for query selection (Haffari and Sarkar, 2009; Bloodgood and Callison-Burch, 2010). More concretely, we learn the query policy based on a high-resource language pair sharing similar characteristics with the low-resource language pair of interest. Once trained, the policy is applied to the language pair of interest, capitalising on the learned signals for effective query selection. We make use of imitation learning (IL) to train the query policy. Previous work has shown that the IL approach leads to more effective policy learning (Liu et al., 2018) compared to reinforcement learning (RL) (Fang et al., 2017). Our proposed method effectively trains AL policies for the batch queries needed for NMT, as opposed to previous work on single query selection.

We conduct experiments on three language pairs: Finnish-English, German-English, and Czech-English. Simulating low-resource scenarios, we consider various settings, including cold-start and warm-start as well as small and extremely small data conditions.


The experiments show the effectiveness and superiority of our query policy compared to strong baselines.

2 Learning to Actively Learn MT

Active learning is an iterative process: firstly, a model is built using some initially available data. Then, the most worthwhile data points are selected from the unlabelled set for annotation by the oracle. The underlying model is then re-trained using the expanded labelled data. This process is repeated until the budget is exhausted. The main challenge is how to identify and select the most beneficial unlabelled data points during the AL iterations.

The AL strategy can be learned by attempting to actively learn on tasks sampled from a distribution over tasks (Bachman et al., 2017). We simulate the AL scenario on instances of a low-resource MT problem created using the bitext of the resource-rich language pair, where the translation of some part of the bitext is kept hidden. This allows us to have an automatic oracle that reveals the translations of the queried sentences, resulting in an efficient way to quickly evaluate an AL strategy. Once the AL strategy is learned on simulations, it is applied to real AL scenarios. The more closely the low-resource language pair in the real scenario is related to those used to train the AL strategy, the more effective the AL strategy would be.

We are interested in training a translation model $m_\phi$ which maps an input sentence from a source language $x \in \mathcal{X}$ to its translation $y \in \mathcal{Y}_x$ in a target language, where $\mathcal{Y}_x$ is the set of candidate translations for the input $x$ and $\phi$ is the parameter vector of the translation model. Let $D = \{(x, y)\}$ be a support set of parallel sentence pairs, which is randomly partitioned into parallel bitext $D^{lab}$, monolingual text $D^{unl}$, and evaluation $D^{evl}$ datasets. Repeated random partitioning creates multiple instances of the AL problem.
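As a concrete illustration of this simulation setup, the following is a minimal Python sketch of the repeated random partitioning; the function and variable names are ours, not the authors' implementation:

    import random

    def random_partition(corpus, n_lab, n_evl, seed=None):
        """Split a parallel corpus [(src, tgt), ...] into one simulated AL problem:
        a small labelled bitext D_lab, an evaluation set D_evl, and a pool D_unl
        whose target sides are hidden and only revealed by an automatic oracle."""
        rng = random.Random(seed)
        shuffled = corpus[:]
        rng.shuffle(shuffled)
        d_lab = shuffled[:n_lab]                              # initial bitext
        d_evl = shuffled[n_lab:n_lab + n_evl]                 # used to score retrained models
        hidden = shuffled[n_lab + n_evl:]
        d_unl = [src for src, tgt in hidden]                  # pool: source sentences only
        oracle = {src: tgt for src, tgt in hidden}            # reveals translations on query
        return d_lab, d_evl, d_unl, oracle

    # Repeating the partitioning with different seeds yields multiple AL problem instances.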

3 Hierarchical MDP Formulation

A crucial difference of our setting from previous work (Fang et al., 2017; Liu et al., 2018) is that the AL agent receives the reward from the oracle only after taking a sequence of actions, i.e. the selection of an AL batch, which may correspond to multiple training minibatches for the underlying NMT model. This fulfils the requirements for effective training of NMT, as minibatch updates are more effective than those of single sentence pairs.

Furthermore, it is presumably more efficient and practical to query the translation of an untranslated batch from a human translator, rather than one sentence in each AL round.

At each time step $t$ of an AL problem, the algorithm interacts with the oracle and queries the labels of a batch selected from the pool $D^{unl}_t$ to form $b_t$. As the result of this sequence of actions to select the sentences in $b_t$, the AL algorithm receives a reward $\mathrm{BLEU}(m^{b_t}_{\phi}, D^{evl})$, which is the BLEU score on $D^{evl}$ based on the NMT model $m^{b_t}_{\phi}$ retrained using the batch $b_t$.

Formally, this results in a hierarchical Markov decision process (HMDP) for batch sentence selection in AL. A state $s_t := \langle D^{lab}_t, D^{unl}_t, b_t, \phi_t \rangle$ of the HMDP at time step $t$ consists of the bitext $D^{lab}_t$, the monotext $D^{unl}_t$, the current text batch $b_t$, and the parameters $\phi_t$ of the currently trained NMT model. The high-level MDP consists of a goal set $G := \{\mathrm{retrain}, \mathrm{halt}_{HI}\}$, where setting a goal $g_t \in G$ corresponds to either halting the AL process, or handing the execution to the low-level MDP to collect a new batch of bitext $b_t$, re-training the underlying NMT model to get the updated parameters $\phi_{t+1}$, receiving the reward $R_{HI}(s_t, a_t, s_{t+1}) := \mathrm{BLEU}(m_{\phi_{t+1}}, D^{evl})$, and updating the new state as $s_{t+1} = \langle D^{lab}_t \cup b_t, D^{unl}_t, \emptyset, \phi_{t+1} \rangle$. The $\mathrm{halt}_{HI}$ goal is set in case the full AL annotation budget is exhausted; otherwise the retrain goal is set in the next time step.

The low-level MDP consists of primitive actions $a_t \in D^{unl}_t \cup \{\mathrm{halt}_{LO}\}$, corresponding to either selecting one of the monolingual sentences in $D^{unl}_t$, or halting the low-level policy and handing the execution back to the high-level MDP. The halt action is performed when the maximum amount of source text has been chosen for the current AL round, at which point the oracle is asked for the translations of the source sentences in the monolingual batch, which is then replaced by the resulting bitext. The sentence selection action, on the other hand, forms the next state by adding the chosen monolingual sentence to the batch and removing it from the pool of monolingual sentences. The underlying NMT model is not retrained as a result of taking an action in the low-level MDP, and the reward function is constant zero.

[Figure 1: The policy network. The figure shows the scoring network with its inputs: the candidate sentence pair (x, y), the global context c^global, and the local context c^local; the output is a preference score.]

A trajectory in our HMDP consists of $\sigma := (s_1, g_1, \tau_1, r_1, s_2, \ldots, s_H, g_H, r_H, s_{H+1})$, which is the concatenation of the interleaved high-level trajectory $\tau_{HI} := (s_1, g_1, r_1, s_2, \ldots, s_{H+1})$ and the low-level trajectories $\tau_h := (s_1, a_1, s_2, a_2, \ldots, s_T, a_T, s_{T+1})$. Clearly, the intermediate goals set by the top-level MDP in $\sigma$ are retrain, and only the last goal $g_H$ is $\mathrm{halt}_{HI}$, where $H$ is determined by checking whether the total AL budget $B_{HI}$ is exhausted. Likewise, the intermediate actions in $\tau_h$ are sentence selections, and only the last action $a_T$ is $\mathrm{halt}_{LO}$, where $T$ is determined by checking whether the per-round AL budget $B_{LO}$ is exhausted.

We aim to find the optimal AL policy prescribing which datapoints need to be queried in a given state to get the most benefit. The optimal policy is found by maximising the expected long-term reward, where the expectation is over the choice of the synthesised AL problems and other sources of randomness, i.e. the partitioning of $D$ into $D^{lab}$, $D^{unl}$, and $D^{evl}$. Following (Bachman et al., 2017), we maximise the sum of the rewards after each AL round to encourage anytime behaviour, i.e. the model should perform well after each batch query.
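Written out under the notation above, this training objective can be summarised as follows (a reconstruction from the surrounding description rather than a formula given explicitly in the text; the expectation is over the random partitioning into $D^{lab}$, $D^{unl}$, $D^{evl}$):

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{(D^{lab}, D^{unl}, D^{evl})}\left[\, \sum_{t=1}^{H} \mathrm{BLEU}\big(m_{\phi_{t+1}}, D^{evl}\big) \right],$$

where $\phi_{t+1}$ are the NMT parameters after retraining on the batch selected by the low-level policy $\pi$ in AL round $t$.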

4 Deep Imitation Learning for AL-NMT

The question remains of how to train the policy network to maximize the reward, i.e. the generalisation performance of the underlying NMT model. As the policy for the high-level MDP is fixed, we only need to learn the optimal policy for the low-level MDP. We formulate learning the AL policy as an imitation learning problem. More concretely, the policy is trained using an algorithmic expert, which can generate reasonable AL trajectories (batches) for each AL state in the high-level MDP. The algorithmic expert's trajectories, i.e. sequences of AL states paired with the expert's actions in the low-level MDP, are then used to train the policy network. As such, the policy network is a classifier, conditioned on a context summarising both global and local histories, that chooses the best sentence (action) among the candidates. After the AL policy is trained based on AL simulations, it is transferred to the real AL scenario.

For simplicity of presentation, the training algorithms are presented using a fixed number of AL iterations for the high-level and low-level MDPs. This corresponds to AL with a sentence-based budget. However, extending them to AL with a token-based budget is straightforward, and we experiment with both versions in §5.

Policy Network's Architecture  The policy scoring network is a fully-connected network with two hidden layers (see Figure 1). The input involves the representations of three elements: (i) the global context, which includes all previous AL batches; (ii) the local context, which summarises the previous sentences selected for the current AL batch; and (iii) the candidate sentence paired with its translation generated by the currently trained NMT model.

For each source sentence $x$ paired with its translation $y$, we denote the representation by $\mathrm{rep}(x, y)$. We construct it by simply concatenating the representations of the source and target sentences, each of which is built by summing the embeddings of its words. We found this simple method to work well, compared to more complicated methods, e.g. taking the last hidden state of the decoder in the underlying NMT model. The global context ($c^{global}$) and local context ($c^{local}$) are constructed by summing the representations of the previously selected batches and sentence pairs, respectively.
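To make the architecture concrete, here is a minimal PyTorch sketch of a scoring network of this shape; it is an illustration under our own assumptions (the hidden size and the helper functions sentence_rep and pair_rep are ours), not the authors' released implementation:

    import torch
    import torch.nn as nn

    def sentence_rep(token_embs):
        """Sum the word embeddings (num_tokens x emb_dim) to get a sentence vector."""
        return token_embs.sum(dim=0)

    def pair_rep(src_embs, tgt_embs):
        """rep(x, y): concatenate the summed source and target sentence vectors."""
        return torch.cat([sentence_rep(src_embs), sentence_rep(tgt_embs)], dim=-1)

    class PolicyScorer(nn.Module):
        """Fully-connected scorer with two hidden layers over
        [global context; local context; candidate rep(x, y)]."""
        def __init__(self, rep_dim, hidden_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3 * rep_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, c_global, c_local, cand_rep):
            # All three inputs are vectors of size rep_dim; the output is a scalar score.
            return self.net(torch.cat([c_global, c_local, cand_rep], dim=-1)).squeeze(-1)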

IL-based Training Algorithm  The IL-based training method is presented in Algorithm 1. The policy network is initialised randomly, and trained based on $T$ simulated AL problems (lines 3–20), by partitioning the available large bilingual corpus into three sets: (i) $D^{lab}$ as the growing training bitext, (ii) $D^{unl}$ as the pool of untranslated sentences, where we pretend the translations are not given, and (iii) $D^{evl}$ as the evaluation set used by our algorithmic expert.

For each simulated AL problem, Algorithm 1 executes $T_{HI}$ iterations (lines 7–19) to collect AL batches for training the underlying NMT model and the policy network. An AL batch is obtained either from the policy network (line 15) or from the algorithmic expert (lines 10–13), depending on tossing a coin (line 9). The latter also includes adding the selected batch, the candidate batches, and the relevant state information to the replay memory $M$, based on which the policy will be re-trained.


Algorithm 1  Learning AL-NMT Policy
Input: parallel corpus D; I_width, the width of the constructed search lattices; the coin parameter α; the number of sampled AL batches K
Output: policy π
 1: M ← ∅                                             ▷ replay memory
 2: initialise π with a random policy
 3: for T training iterations do
 4:   D^lab, D^evl, D^unl ← randomPartition(D)
 5:   φ ← trainModel(D^lab)
 6:   c^global ← 0
 7:   for t ← 1 to T_HI do                            ▷ MDP_HI
 8:     S ← searchLattice(D^unl, I_width)
 9:     if coinToss(α) = Head then
10:       B ← {samplePath(S, φ, c^global, π, β)}_1^K
11:       B ← B + samplePath(S, φ, c^global, π, 0)
12:       b ← argmax_{b' ∈ B} BLEU(m^{b'}_φ, D^evl)   ▷ expert
13:       M ← M ∪ {(c^global, φ, B, b)}
14:     else
15:       b ← samplePath(S, φ, c^global, π, 0)        ▷ policy
16:     D^lab ← D^lab + b
17:     D^unl ← D^unl − {x s.t. (x, y) ∈ b}
18:     φ ← retrainModel(φ, D^lab)
19:     c^global ← c^global ⊕ rep(b)
20:   π ← updatePolicy(π, M, φ)
21: return π

Algorithm 2  samplePath (selecting an AL batch)
Input: search lattice S; global context c^global; policy π; perturbation probability β
Output: selected AL batch b
 1: b ← ∅
 2: c^local ← 0
 3: for t ← 1 to T_LO do                              ▷ MDP_LO
 4:   if coinToss(β) = Head then
 5:     x_t ← π_0(S[t])                               ▷ perturbation policy
 6:   else
 7:     x_t ← argmax_{x ∈ S[t]} π(c^global, c^local, x)
 8:   y_t ← oracle(x_t)                               ▷ getting the gold translation
 9:   c^local ← c^local ⊕ rep(x_t, y_t)
10:   b ← b + (x_t, y_t)
11: return b

The selected batch is then used to retrain the underlying NMT model, update the training bilingual corpus and the pool of monotext, and update the global context vector (lines 16–19).

The mixture of the policy network and the algorithmic expert in batch collection on simulated AL problems is inspired by Dataset Aggregation (DAGGER) (Ross and Bagnell, 2014). This makes sure that the state-action pairs collected in the replay memory include situations encountered beyond executing only the algorithmic expert. This informs the trained AL policy how to act reasonably in new situations encountered at test time, where only the network policy is in charge and the expert does not exist.

[Figure 2: The search lattice and the selection of a batch based on the perturbed policy (a mixture of π and π_0). Each circle denotes a sentence from the pool of monotext. The number of sentences at each time step, denoted by S[t], is I_width. The black sentences are selected in this AL batch.]

Algorithmic Expert  At a given AL state, the algorithmic expert selects a reasonable batch from the pool $D^{unl}$ via:

$$\arg\max_{b \in B} \mathrm{BLEU}(m^{b}_{\phi}, D^{evl}) \qquad (1)$$

where $m^{b}_{\phi}$ denotes the underlying NMT model $\phi$ further retrained by incorporating the batch $b$, and $B$ denotes the possible batches from $D^{unl}$ (Liu et al., 2018). However, the number of possible batches is exponential in the size of $D^{unl}$, hence the above optimisation procedure would be very slow even for a moderately-sized pool.

We construct a search lattice $S$ from which the candidate batches in $B$ are sampled (see Figure 2). The search lattice is constructed by sampling a fixed number of candidate sentences $I_{width}$ from $D^{unl}$ for each position in a batch, whose size is $T_{LO}$. A candidate AL batch is then selected using Algorithm 2. It executes a mixture of the current AL policy $\pi$ and a perturbation policy $\pi_0$ (e.g. random sentence selection or any other heuristic) in the low-level MDP to sample a batch. After several such batches are sampled to form $B$, the best one is selected according to eqn (1).

We have carefully designed the search space to be able to incorporate the current policy's recommended batch and sampled deviations from it in $B$. This is inspired by the LOLS (Locally Optimal Learning to Search) algorithm (Chang et al., 2015): to invest effort in the neighbourhood of the current policy and improve it. Moreover, having to deal with only $I_{width}$ sentences at each selection stage makes the policy-based batch formation algorithm fast and efficient.
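As an illustration of the lattice construction, here is a minimal Python sketch under our own assumptions (uniform sampling of candidates per slot, with possible repetition of a sentence across slots; the text does not specify the exact sampling scheme):

    import random

    def build_search_lattice(pool, lattice_width, batch_size, seed=None):
        """Build a search lattice S: for each of the batch_size slots (T_LO positions),
        sample lattice_width (I_width) candidate source sentences from the pool.
        A batch is then formed by picking one sentence per slot, as in Algorithm 2."""
        rng = random.Random(seed)
        return [rng.sample(pool, lattice_width) for _ in range(batch_size)]

    # Example: 3 slots (T_LO = 3), each with 4 candidates (I_width = 4).
    pool = [f"sentence_{i}" for i in range(100)]
    S = build_search_lattice(pool, lattice_width=4, batch_size=3, seed=0)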

Re-training the Policy Network  To train our policy network, we turn sentence preference scores into probabilities over the candidate batches, and optimise the parameters to maximise the log-likelihood objective (Liu et al., 2018).


Algorithm 3  Policy Transfer
Input: bitext D^lab; monotext D^unl; pre-trained policy π
Output: NMT model φ
 1: initialise φ with D^lab
 2: c^global ← 0
 3: for t ← 1 to T_HI do
 4:   S ← searchLattice(D^unl, I_width)
 5:   b ← samplePath(S, φ, c^global, π, 0)
 6:   D^lab ← D^lab + b
 7:   D^unl ← D^unl − {x s.t. (x, y) ∈ b}
 8:   φ ← retrainModel(φ, D^lab)
 9:   c^global ← c^global ⊕ rep(b)
10: return φ


More specifically, let $(c^{global}, B, b)$ be a training tuple in the replay memory. We define the probability of the correct action/batch as

$$\Pr(b \mid B, c^{global}) := \frac{\mathrm{score}(b, c^{global})}{\sum_{b' \in B} \mathrm{score}(b', c^{global})}.$$

The preference score for a batch is the sum of its sentences' preference scores,

$$\mathrm{score}(b, c^{global}) := \sum_{t=1}^{|b|} \pi\big(c^{global}, c^{local}_{<t}, \mathrm{rep}(x_t, y_t)\big)$$

where $c^{local}_{<t}$ denotes the local context up to sentence $t$ in the batch.

To form the log-likelihood, we use recent tuples and randomly sample several older ones from the replay memory. We then use stochastic gradient descent (SGD) to maximise the training objective, where the gradients of the network parameters are calculated using the backpropagation algorithm.
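A minimal PyTorch sketch of this training step, assuming a scorer callable such as the hypothetical PolicyScorer sketched earlier (scoring one candidate batch at a time and the exact minibatching over replay tuples are our assumptions):

    import torch

    def batch_score(scorer, c_global, batch_reps):
        """score(b, c_global): sum of per-sentence policy scores, with the local context
        accumulated as the running sum of the previously added rep(x_t, y_t) vectors."""
        c_local = torch.zeros_like(batch_reps[0])
        total = 0.0
        for rep in batch_reps:
            total = total + scorer(c_global, c_local, rep)
            c_local = c_local + rep
        return total

    def policy_loss(scorer, c_global, candidate_batches, expert_index):
        """Negative log-likelihood of the expert's batch under the softmax over the
        candidate batch scores; minimising it maximises the log-likelihood objective."""
        scores = torch.stack([batch_score(scorer, c_global, b) for b in candidate_batches])
        log_probs = torch.log_softmax(scores, dim=0)
        return -log_probs[expert_index]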

Transferring the Policy  We now apply the policy learnt on the source language pair to AL in the target task (see Algorithm 3). To enable transferring the policy to a new language pair, we make use of pre-trained multilingual word embeddings. In our experiments, we either use the pre-trained word embeddings from (Ammar et al., 2016) or build them based on the available bitext and monotext in the source and target languages (c.f. §5.2). To retrain our NMT model, we make parameter updates based on the mini-batches from the AL batch as well as sampled mini-batches from the previous iterations.

5 Experiments

Datasets  Our experiments use the following language pairs in the news domain, based on WMT2018: English-Czech (EN-CS), English-German (EN-DE), and English-Finnish (EN-FI). For AL evaluation, we randomly sample 500K sentence pairs from the parallel corpora in WMT2018 for each of the three language pairs, and take 100K as the initially available bitext and the remaining 400K as the pool of untranslated sentences, pretending the translations are not available. During the AL iterations, the translation is revealed for the queried source sentences in order to retrain the underlying NMT model.

For pre-processing the text, we normalise the punctuation and tokenise using the Moses scripts (http://www.statmt.org/moses). The trained models are evaluated using BLEU on tokenised and case-sensitive test data from newstest2017.

NMT Model  Our baseline model consists of a 2-layer bi-directional LSTM encoder with an embedding size of 512 and a hidden size of 512. The 1-layer LSTM decoder with 512 hidden units uses an attention network with 128 hidden units. We use a multiplicative-style attention architecture (Luong et al., 2015). The model is optimized using Adam (Kingma and Ba, 2014) with a learning rate of 0.0001, and the dropout rate is set to 0.3. We set the mini-batch size to 200 and the maximum sentence length to 50. We train the base NMT models for 5 epochs on the initially available bitext, as the perplexity on the dev set does not improve with more training epochs. After getting new translated text in each AL iteration, we further sample 5× more bilingual sentences from the previously available bitext, and make one pass over this data to re-train the underlying NMT model. For decoding, we use beam search with a beam size of 3.
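For reference, the hyperparameters above can be collected into a single configuration sketch (a plain summary of the stated values; the key names are ours and do not correspond to any particular toolkit):

    # Hyperparameters as stated in the text; key names are illustrative only.
    NMT_CONFIG = {
        "encoder": {"type": "bi-LSTM", "layers": 2, "emb_size": 512, "hidden_size": 512},
        "decoder": {"type": "LSTM", "layers": 1, "hidden_size": 512},
        "attention": {"style": "multiplicative (Luong et al., 2015)", "hidden_size": 128},
        "optimizer": {"name": "Adam", "learning_rate": 1e-4},
        "dropout": 0.3,
        "minibatch_size": 200,
        "max_sentence_length": 50,
        "base_training_epochs": 5,
        "replay_sampling_ratio": 5,   # 5x extra bitext sampled per AL retraining pass
        "beam_size": 3,
    }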

Selection Strategies  We compare our policy-based sentence selection for NMT-AL with the following heuristics:

  • Random  We randomly select monolingual sentences up to the AL budget.

  • Length-based  We use the shortest/longest monolingual sentences up to the AL budget.

  • Total Token Entropy (TTE)  We sort monolingual sentences based on their TTE, which has been shown to be a strong AL heuristic (Settles and Craven, 2008) for sequence-prediction tasks. Given a monolingual sentence $x$, we compute the TTE as $\sum_{i=1}^{|y|} \mathrm{Entropy}[P_i(\cdot \mid y_{<i}, x, \phi)]$, where $y$ is the decoded translation based on the current underlying NMT model $\phi$, and $P_i(\cdot \mid y_{<i}, x, \phi)$ is the distribution over the vocabulary words for position $i$ of the translation given the source sentence and the previously generated words (a minimal sketch of this computation is given below). We also experimented with the normalised version of this measure, i.e. dividing TTE by $|y|$, and found that the difference is negligible, so we only report TTE results.
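A minimal sketch of the TTE computation, assuming access to the per-position output distributions of the decoder for the decoded translation (function and variable names are ours):

    import math

    def total_token_entropy(step_distributions):
        """TTE(x): sum over decoded positions i of the entropy of P_i(. | y_<i, x, phi).
        `step_distributions` is a list of probability vectors, one per decoded token."""
        tte = 0.0
        for probs in step_distributions:
            tte += -sum(p * math.log(p) for p in probs if p > 0.0)
        return tte

    # Monolingual sentences are then sorted by TTE and selected up to the AL budget.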


System               EN→DE   EN→FI   EN→DE   EN→CS   EN→CS   EN→FI
Base NMT (100K)       13.2    10.3    13.2     8.1     8.1    10.3

AL with 135K token budget
Random                13.9    11.2    13.9     8.3     8.3    11.2
Shortest              14.5    11.5    14.5     8.6     8.6    11.5
Longest               14.1    11.3    14.1     8.2     8.2    11.3
TTE                   14.2    11.3    14.2     8.5     8.5    11.3
Token Policy          15.5    12.8    14.8     8.5     9.0    12.5

AL with 677K token budget
Random                15.9    13.5    15.9     9.2     9.2    13.5
Shortest              15.8    13.7    15.8     8.9     8.9    13.7
Longest               15.6    13.5    15.6     8.5     8.5    13.5
TTE                   15.6    13.7    15.6     8.6     8.6    13.7
Token Policy          16.6    14.1    16.3     9.2    10.3    13.9
FULL bitext (500K)    20.5    18.3    20.5    12.1    12.1    18.3

Table 1: BLEU scores on test sets with different selection strategies; the budget is at the token level, with annotation for 135.45K tokens and 677.25K tokens respectively.


5.1 Translating from English

Setting  We train the AL policy on a language pair treating it as high-resource, and apply it to another language pair treated as low-resource. To transfer the policies across languages, we make use of pre-trained multilingual word embeddings learned from monolingual text and bilingual dictionaries (Ammar et al., 2016). Furthermore, we use these cross-lingual word embeddings to initialise the embedding table of the NMT model in the low-resource language pair. The source and target vocabularies for the NMT model in the low-resource scenario are constructed using the initially available 100K bitext, and are expanded during the AL iterations as more translated text becomes available.

Results  Table 1 shows the results. The experiments are performed with two limits on the token annotation budget: 135K and 677K, corresponding to selecting roughly 10K and 50K sentences in total in AL², respectively.

System               EN→DE   EN→FI
Base NMT (100K)       13.2    10.3

AL with 10K sentence budget
Random                14.2    11.3
Shortest              13.9    11.0
Longest               14.7    11.8
TTE                   14.3    11.2
Sent π_EN→CS          14.5    11.5
Token π_EN→CS         15.1    11.8

AL with 50K sentence budget
Random                16.1    13.5
Shortest              15.5    12.9
Longest               17.2    14.2
TTE                   16.6    13.5
Sent π_EN→CS          16.3    14.0
Token π_EN→CS         16.8    14.1

Table 2: BLEU scores on test sets for different language pairs with different selection strategies; the budget is at the sentence level, with annotation for 10K sentences and 50K sentences respectively.

The number of AL iterations is 50; hence the token annotation budget for each round is 2.7K and 13.5K tokens, respectively. As we can see, our policy-based AL method is very effective, and outperforms the strong AL baselines in all cases except when transferring the policy trained on EN→FI to EN→CS, where it is on par with the best baseline.

Sentence vs Token Budget  In our results in Table 1, we have taken the number of tokens in the selected sentences as a proxy for the annotation cost. Another option to measure the annotation cost is the number of selected sentences, which admittedly is not the best proxy.

² These token limit budgets are calculated using random selection of 10K and 50K sentences multiple times, and taking the average of the tokens across the sampled sets.


                     100K initial bitext            10K initial bitext
AL method            cold-start    warm-start       cold-start    warm-start
Base NMT             10.6/11.8     13.9/14.7        2.3/2.5       5.4/5.8
Random               12.9/13.3     15.1/16.2        5.5/5.6       9.3/9.6
Shortest             13.0/13.5     15.9/16.4        5.9/6.1       9.1/9.3
Longest              12.5/12.9     15.3/15.8        5.7/5.9       9.8/10.2
TTE                  12.8/13.2     15.8/16.1        5.9/6.2       9.8/10.1
π_CS→EN              13.9/14.2     16.8/17.3        6.3/6.5       10.5/10.9
π_FI→EN              13.5/13.8     16.5/16.9        6.1/6.2       10.2/10.3
π_EN→CS              13.3/13.6     16.4/16.5        5.1/5.7       10.3/10.5
π_EN→FI              13.2/13.5     15.9/16.3        5.1/5.6       9.8/10.2
Ensemble
π_CS,FI→EN           14.1/14.3     16.8/17.5        6.3/6.5       10.5/10.9
π_EN→CS,FI           13.6/13.8     16.5/16.9        5.8/5.9       10.3/10.5
Full Model (500K)    20.5/20.6     22.3/22.5        -             -

Table 3: BLEU scores on test sets using different selection strategies. The token-level annotation budget is 677K.

Nonetheless, one may be interested to see how different AL methods compare against each other based on this cost measure.

Table 2 shows the results based on the sentence-based annotation cost. We train a policy on EN→CS, and apply it to the EN→DE and EN→FI translation tasks. In addition to the token-based AL policy from Table 1, we train another policy based on the sentence budget. The token-based policy is competitive in EN→DE, where the longest-sentence heuristic achieves the best performance, presumably due to the enormous training signal obtained by translating long sentences. The token-based policy is on par with the longest-sentence heuristic in EN→FI for both the 10K and 50K sentence budgets, and both outperform the other methods.

5.2 Translating into English

Setting  We investigate the performance of the AL methods on DE→EN based on the policies trained on the other language pairs. In addition to the 100K training data condition, we assess the effectiveness of the AL methods in an extremely low-resource condition consisting of only 10K bilingual sentences as the initial bitext.

In addition to the source word embedding table, which we initialised in the previous section's experiments using the cross-lingual word embeddings, we are further able to initialise all of the other NMT parameters for DE→EN translation. This includes the target word embedding table and the decoder softmax, as the target language is the same (EN) in the language pairs used for both policy training and policy testing. We refer to this setting as warm-start, as opposed to cold-start, in which we only initialise the source embedding table with the cross-lingual embeddings. For the warm-start experiments, we transfer the NMT model trained on the 500K CS-EN bitext, based on which the policy is trained. We use byte-pair encoding (BPE) (Sennrich et al., 2015b) with 30K operations to segment the EN side. For the source side, we use words in order to use the cross-lingual word embeddings. All parameters of the transferred NMT model are frozen, except the ones corresponding to the bidirectional RNN encoder and the source word embedding table.

To make this experimental condition as realistic as possible, we learn the cross-lingual word embeddings for DE using large amounts of monolingual text and the initially available bitext, assuming a multilingual word embedding already exists for the languages used in the policy training phase. More concretely, we sample 5M DE text from the WMT2018 data³, and train monolingual word embeddings as part of a skip-gram language model using fastText⁴. We then create a bilingual EN-DE word dictionary based on the initially available bitext (either 100K or 10K) using word alignments generated by fast_align⁵.

³ We make sure that it does not include the DE sentences in the 400K pool used in the AL experiments.
⁴ https://github.com/facebookresearch/fastText
⁵ https://github.com/clab/fast_align


The bilingual dictionary is used to project the monolingual DE word embedding space into that of EN, hence aligning the spaces through the following orthogonal projection:

$$\arg\max_{Q} \sum_{i=1}^{m} e[y_i]^T \cdot Q \cdot e[x_i] \quad \text{s.t.} \quad Q^T \cdot Q = I$$

where $\{(y_i, x_i)\}_{i=1}^{m}$ is the bilingual dictionary consisting of pairs of DE-EN words⁶, $e[y_i]$ and $e[x_i]$ are the embeddings of the DE and EN words, and $Q$ is the orthogonal transformation matrix aligning the two embedding spaces. We solve the above optimisation problem using SVD as in (Smith et al., 2017). The cross-lingual word embedding for a DE word $y$ is then $e[y]^T \cdot Q$. We build two such cross-lingual embeddings based on the two bilingual dictionaries constructed from the 10K and 100K bitext, in order to use them in their corresponding experiments.
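A minimal NumPy sketch of this SVD-based solution (a Procrustes-style alignment as in Smith et al. (2017)); the matrix layout and function names are our assumptions:

    import numpy as np

    def align_embeddings(E_de, E_en):
        """Find the orthogonal Q maximising sum_i e[y_i]^T Q e[x_i] subject to Q^T Q = I.
        E_de, E_en: (m, d) arrays whose rows are e[y_i] and e[x_i] for the dictionary pairs."""
        # The maximiser is Q = U V^T, where U S V^T is the SVD of E_de^T E_en.
        U, _, Vt = np.linalg.svd(E_de.T @ E_en)
        return U @ Vt

    def project_de_word(e_de, Q):
        """Cross-lingual embedding of a DE word: e[y]^T Q."""
        return e_de @ Q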

Results  Table 3 presents the results, for the two conditions of 100K and 10K initial bilingual sentences. For each of these data conditions, we experiment with both cold-start and warm-start settings, using the pre-trained multilingual word embeddings from (Ammar et al., 2016) or those we have trained with the available bitext plus additional monotext. Firstly, the warm-start strategy of transferring the NMT system from CS→EN to DE→EN is very effective, particularly in the extremely low bilingual data condition of 10K sentence pairs. It is worth noting that our multilingual word embeddings are very effective, even though they are trained using a small bitext. Secondly, our policy-based AL methods are more effective than the baseline methods and lead to up to +1 BLEU score improvements.

We further take the ensemble of multiple trained policies to build a new AL query strategy. In the ensemble, we rank sentences based on each of the policies, and then produce a final ranking by combining these rankings: specifically, we sum the rank of each sentence according to each policy to get a rank score, and re-rank the sentences according to their rank scores. Table 3 shows that ensembling is helpful, but does not produce significant improvements compared to the best policy.
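A minimal sketch of this rank-sum combination (the function name and tie-breaking by sort order are our assumptions):

    def ensemble_rank(sentences, policy_scores):
        """Combine several policies by summing per-policy ranks.
        `policy_scores`: a list of dicts mapping sentence -> preference score (higher is better).
        Returns the sentences sorted by summed rank (lower is better)."""
        rank_sum = {s: 0 for s in sentences}
        for scores in policy_scores:
            ranked = sorted(sentences, key=lambda s: scores[s], reverse=True)
            for rank, s in enumerate(ranked):
                rank_sum[s] += rank
        return sorted(sentences, key=lambda s: rank_sum[s])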

5.3 Analysis

TTE is a competitive heuristic-based strategy, as shown in the above experiments.

⁶ One can incorporate human-curated bilingual lexicons into the automatically curated dictionaries as well.

[Figure 3: On the task of DE→EN, the plot shows the log fraction of words vs. the log frequency from the selected data returned by different strategies, in which we have a 677K token budget and do warm start with 100K initial bitext. The AL policy here is π_EN→CS.]

We compare the word frequency distributions of the selected source text returned by Random and TTE against our AL policy. The policy we use here is π_EN→CS, applied to the task of DE→EN, conducted in the warm-start scenario with 100K initial bitext and a 677K token budget.

Fig. 3 is a log-log plot of the fraction of vocabulary words (y axis) having a particular frequency (x axis). Our AL policy is less likely to select high-frequency words than the other two methods when given a fixed token budget.

6 Conclusion

We have introduced an effective approach for learning active learning policies for NMT, where the learner needs to make batch queries. We have provided a hierarchical MDP formulation of the problem, and proposed a policy network structure capturing the context in both MDP levels. Our policy training method uses imitation learning and a search lattice to carefully collect AL trajectories for further improvement of the current policy. We have provided experimental results on three language pairs, where the policies are transferred across languages using multilingual word embeddings. Our experiments confirm that our method is more effective than strong heuristic-based methods in various conditions, including cold-start and warm-start as well as small and extremely small data conditions.


References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

Philip Bachman, Alessandro Sordoni, and Adam Trischler. 2017. Learning algorithms for active learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 301–310, International Convention Centre, Sydney, Australia. PMLR.

Michael Bloodgood and Chris Callison-Burch. 2010. Bucking the trend: Large-scale cost-focused active learning for statistical machine translation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 854–864.

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In Proceedings of the 32nd International Conference on Machine Learning, pages 2058–2066.

Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O.K. Li. 2018. Universal neural machine translation for extremely low resource languages. arXiv preprint arXiv:1802.05368.

Gholamreza Haffari and Anoop Sarkar. 2009. Active learning for multilingual statistical machine translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL, pages 181–189.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39. Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Ming Liu, Wray Buntine, and Gholamreza Haffari. 2018. Learning how to actively learn: A deep imitation learning approach. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Stéphane Ross and J. Andrew Bagnell. 2014. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1070–1079. Association for Computational Linguistics.

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.