

Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo), pages 84–93, Hong Kong, China, November 3, 2019. © 2019 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17


Empirical Evaluation of Active Learning Techniques for Neural MT

Xiangkai Zeng1*   Sarthak Garg2   Rajen Chatterjee2   Udhyakumar Nallasamy2   Matthias Paulik2

1 Carnegie Mellon University    2 Apple Inc.

[email protected], {sarthak garg, rajen c, udhay, mpaulik}@apple.com

Abstract

Active learning (AL) for machine translation (MT) has been well-studied for the phrase-based MT paradigm. Several AL algorithms for data sampling have been proposed over the years. However, given the rapid advancement in neural methods, these algorithms have not been thoroughly investigated in the context of neural MT (NMT). In this work, we address this missing aspect by conducting a systematic comparison of different AL methods in a simulated AL framework. Our experimental setup to compare different AL methods uses: i) a state-of-the-art NMT architecture to achieve realistic results; and ii) the same dataset (WMT'13 English-Spanish) to ensure a fair comparison across different methods. We then demonstrate how recent advancements in unsupervised pre-training and paraphrastic embeddings can be used to improve existing AL methods. Finally, we propose a neural extension for an AL sampling method used in the context of phrase-based MT: Round Trip Translation Likelihood (RTTL). RTTL uses a bidirectional translation model to estimate the loss of information during translation, and it outperforms previous methods.

1 Introduction

Active learning (AL) is an iterative supervised learning procedure where the learner is able to query an oracle for labeling new data points. Since the learner chooses the data points for annotation, the amount of labeling needed to learn a concept can be much lower than annotating the whole unlabeled dataset (Balcan et al., 2009). This approach is useful in low-resource scenarios where unlabeled data is abundant but manual labeling is expensive. AL has been successfully applied to many areas of NLP like classification, sequence labeling, and spoken language understanding (Cohn et al., 1994; Guo and Greiner, 2007; Dagan and Engelson, 1995; Settles and Craven, 2008; Tur et al., 2005), as well as machine translation (MT) (Ambati, 2011; Haffari et al., 2009; Eck, 2008; Peris and Casacuberta, 2018; Zhang et al., 2018). In MT, most of the AL methods have been investigated under the phrase-based paradigm. Although neural MT has dominated the field (Barrault et al., 2019), there has been only limited effort to investigate and compare existing AL algorithms in this newer paradigm. The few recently published papers in this direction (Peris and Casacuberta, 2018; Zhang et al., 2018) use LSTM-based MT systems, whereas the latest state-of-the-art systems are based on the Transformer architecture (Vaswani et al., 2017). Moreover, these papers either investigate different algorithms of the same class or compare only a handful of methods from different classes. Thus, a global picture showing the effect of different AL methods on the same dataset for the state-of-the-art (SotA) MT system has been missing.

* Work done during an internship at Apple Inc.

In this work, we fill this gap by performing a comprehensive evaluation of different AL algorithms on a publicly available dataset (WMT'13) using the SotA NMT architecture. To make our analysis thorough, we take into account different evaluation metrics to avoid any bias arising from similarity between the evaluation metric and some component of the AL algorithm. Finally, we propose two extensions of existing AL algorithms. One leverages recent advances in paraphrastic embeddings (Wieting and Gimpel, 2018), and the other is based on round-trip translation, a neural variant of an approach proposed for phrase-based MT (Haffari et al., 2009). Both of these approaches outperform existing methods, with the latter showing the best results.


2 Active Learning Framework

We simulate AL in a batch setup because it is more practical to send batches of data for manual translation than a single sentence at disjoint intervals of time. Algorithm 1 summarizes the procedure. It expects: i) a labeled parallel corpus (L), which is used to train the NMT system (M); ii) an unlabeled monolingual corpus (U), which is used to sample new data points for manual translation; iii) a scoring function (ψ), which is used to estimate the importance of data points in U; and iv) a batch size (B), which indicates the number of data points to sample in each iteration.1 In practice, the AL algorithm will iterate until we exhaust the budget for annotation (step 2). However, in our simulation we already have reference translations for all the unlabeled data points (see Footnote 1); therefore, we iterate until we exhaust all the data points in U. In each iteration, we first train an NMT system from scratch using L (step 3). We then score all the sentences in U with ψ, which takes into account L, U, and M (steps 4-6). The ψ function is the key component of all AL algorithms and is discussed in detail, along with its variants, in the next section. We then select the B highest-scoring sentences for manual translation (steps 7-8). These sentences are removed from U (step 9) and added to L along with their reference translations (step 10). The algorithm then proceeds to step 2 for the next round.

Algorithm 1 Batch Active Learning for NMT

1:  Given: Parallel data L, monolingual source language data U, sampling strategy ψ(·), sampling batch size B.
2:  while Budget ≠ EMPTY do
3:      M = TrainNMTsystem(L)
4:      for x ∈ U do
5:          f(x) = ψ(x, U, L, M)
6:      end for
7:      X_B = TopScoringSamples(f(x), B)
8:      Y_B = HumanTranslation(X_B)
9:      U = U − X_B
10:     L = L ∪ {X_B, Y_B}
11: end while
12: return L

1 In our simulation, U is basically the source side of a parallel corpus L′ (L′ ≠ L), and to label new data points from U we simply extract the corresponding references from L′ rather than asking a human annotator.
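For concreteness, the sketch below mirrors Algorithm 1 in Python. The train_fn, score_fn, and oracle callables are hypothetical placeholders for the NMT trainer, a scoring function ψ from Section 3, and the simulated annotator of Footnote 1.

```python
import random

def batch_active_learning(labeled, unlabeled, score_fn, train_fn, oracle, batch_size):
    """Simulated batch AL (Algorithm 1). labeled: list of (src, tgt) pairs (L);
    unlabeled: list of src sentences (U); score_fn(x, U, L, M) -> float,
    higher = sampled first; oracle(x) -> reference translation."""
    while unlabeled:                                     # simulated budget: exhaust U
        model = train_fn(labeled)                        # step 3: retrain from scratch
        scored = sorted(((score_fn(x, unlabeled, labeled, model), i)
                         for i, x in enumerate(unlabeled)), reverse=True)  # steps 4-6
        top = {i for _, i in scored[:batch_size]}        # step 7
        batch = [(unlabeled[i], oracle(unlabeled[i])) for i in top]        # step 8
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in top]   # step 9
        labeled = labeled + batch                        # step 10
    return labeled

def random_baseline(x, unlabeled, labeled, model):
    """The random sampling baseline of Section 4: a uniformly random score."""
    return random.random()
```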

3 Methodology

In this section we outline the AL methods, i.e. the scoring functions (ψ), which have been proposed to work best for NMT, SMT and various sequence labeling tasks (Peris and Casacuberta, 2018; Zhang et al., 2018; Ambati, 2011; Haffari et al., 2009; Settles and Craven, 2008). These approaches can be broadly categorized into two classes: model-driven and data-driven.

Model-driven approaches sample instances based on the model, the labeled dataset and the unlabeled dataset, i.e. ψ(x, . . . ) = ψ(x, M, U, L). These methods receive a direct signal from the model, which can potentially help in sampling more sentences from regions of the input space where the model is weak. We first describe several model-driven approaches from the above works, all of which sample instances where the model M is least certain about the prediction. We then propose Round Trip Translation Likelihood, a neural extension of an existing method, which outperforms the other model-driven methods substantially.

Data-driven approaches, on the other hand, rely only on U and L to sample sentences, i.e. ψ(x, . . . ) = ψ(x, U, L). Since these methods are model independent, model training in step 3 of Algorithm 1 can be skipped, making these methods computationally faster. We summarize various existing data-driven approaches from the MT literature and demonstrate how these approaches can benefit considerably from sentence embeddings specifically trained to capture semantic similarity.

3.1 Model-Driven

In this class of methods, we explore uncertainty sampling (Lewis and Catlett, 1994) strategies that have been widely used in MT. In this strategy, an unlabeled example x is scored with some measure of uncertainty in the probability distribution over the label classes assigned by the model, p_M(y|x). In the case of classification tasks, using the entropy H(p_M(y|x)) is the most obvious choice, but in the case of structured prediction tasks, the space of all possible labels is usually exponential, making entropy calculation intractable. Settles and Craven (2008) found two heuristics, Least Confidence and N-best Sequence Entropy, to be the most effective estimators of model uncertainty across two sequence labeling tasks. In addition to these, we also investigate Coverage Sampling (Peris and Casacuberta, 2018), proposed for interactive NMT, and our version of Round Trip Translation Likelihood, inspired by work in phrase-based MT (Haffari et al., 2009).

3.1.1 Least Confidence (LC)

This method estimates the model uncertainty of a source sentence x by averaging the token-level log probabilities of the corresponding decoded translation y. In our formulation, we further add length normalization to avoid any bias towards the length of the translations:

\psi_{LC}(x, \mathcal{M}) = -\frac{1}{L} \log p_{\mathcal{M}}(y \mid x), \qquad (1)

where L denotes the length of y.

3.1.2 N-best Sequence Entropy (NSE)

Another tractable approximation of model uncertainty is the entropy of the n-best hypotheses. Corresponding to a source sentence x, let N = {y_1, y_2, ..., y_n} denote the set of n-best translations. The normalized probability P of each hypothesis is computed as:

\forall y \in \mathcal{N}, \quad P(y) = \frac{p_{\mathcal{M}}(y \mid x)}{\sum_{y' \in \mathcal{N}} p_{\mathcal{M}}(y' \mid x)}. \qquad (2)

Each source sentence is scored with the entropy of the probability distribution P:

\psi_{NSE}(x, \mathcal{M}) = -\sum_{y \in \mathcal{N}} P(y) \log P(y). \qquad (3)
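Equations 2 and 3 can be computed directly from the n-best log probabilities; a sketch (renormalizing in log space for numerical stability is an implementation choice, not part of the method):

```python
import math

def nbest_sequence_entropy(nbest_logprobs):
    """psi_NSE (Eqs. 2-3): entropy of the renormalized n-best distribution.
    nbest_logprobs[k] = log p(y_k | x) for the k-th beam hypothesis."""
    m = max(nbest_logprobs)
    exps = [math.exp(lp - m) for lp in nbest_logprobs]  # softmax over hypotheses
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```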

3.1.3 Coverage Sampling (CS)

Under-translation is a well known problem in NMT (Tu et al., 2016), wherein not all source tokens are translated during decoding. The attention mechanism in the LSTM-based encoder-decoder architecture (Bahdanau et al., 2015) can model word alignment between the translation and the source to some degree. The extent of coverage of the attention weights over the source sentence can therefore be an indicator of the quality of the translation. Peris and Casacuberta (2018) proposed Coverage Sampling (CS), which uses this coverage to estimate uncertainty. Formally:

\psi_{CS}(x, \mathcal{M}) = -\frac{\sum_{j=1}^{|x|} \log\left(\min\left(\sum_{i=1}^{|y|} \alpha_{i,j},\, 1\right)\right)}{|x|}, \qquad (4)

where x and y are the source sentence and the decoded translation respectively, |·| denotes the number of tokens, and α_{i,j} denotes the attention probability on the j-th word of x while predicting the i-th word of y.
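A sketch of Equation 4 given an attention matrix. For the Transformer, some reduction of the attention tensor to a single |y| × |x| matrix (e.g. averaging the heads of one decoder layer) has to be assumed, since Equation 4 does not prescribe one:

```python
import numpy as np

def coverage_sampling(attention):
    """psi_CS (Eq. 4). attention is a (|y|, |x|) matrix where attention[i, j]
    is the attention probability on source token j when predicting target
    token i."""
    coverage = attention.sum(axis=0)         # total attention mass per source token
    coverage = np.clip(coverage, 1e-9, 1.0)  # min(., 1), with a floor to avoid log(0)
    return -np.log(coverage).sum() / attention.shape[1]
```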

3.1.4 Round Trip Translation Likelihood (RTTL)

Ott et al. (2018) showed that even a well-trained NMT model does not necessarily assign higher probabilities to better translations. This behavior can be detrimental for methods like LC, in which sentences with highly probable translations are not selected for human translation. In this scenario, we assume that a low-quality translation will lose some source-side information, making it difficult to reconstruct the original source from the translation. To this end, we train models M and M_rev to translate from the source language to the target language and in the reverse direction, respectively. M_rev is identical to M except that it is trained on data obtained by flipping the source and target sentences in L. Formally, for any source sentence x of length L, we first translate it to a target sentence y using M. Then we translate y back using M_rev, but instead of decoding, we compute the probability of the original source sentence x and use it as a measure of uncertainty.

y \approx \operatorname*{argmax}_{y'} \; p_{\mathcal{M}}(y' \mid x) \quad \text{(beam search)} \qquad (5)

\psi_{RTTL}(x, \mathcal{M}, \mathcal{M}_{rev}) = -\frac{1}{L} \log p_{\mathcal{M}_{rev}}(x \mid y). \qquad (6)

RTTL is inspired by one of the methods proposed by Haffari et al. (2009), but differs from it in how uncertainty is modeled. In their formulation, x is first translated to y as in our approach, but instead of scoring the likelihood of x given y under M_rev, they use M_rev to translate y to a new source sentence x̂ and measure uncertainty using sentence-level BLEU between x and x̂. They showed that their approach did not perform better than a random baseline; in our experiments, however, RTTL outperforms the random baseline as well as all other model-driven methods. We suspect that this might be because model log probability is a much finer-grained metric than sentence-level BLEU.
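A sketch of Equations 5 and 6; the two model objects are hypothetical wrappers around the forward and reverse NMT systems, not the API of any particular toolkit:

```python
def rttl(x, model_fwd, model_rev):
    """psi_RTTL (Eqs. 5-6). model_fwd.translate(src) runs beam search;
    model_rev.logprob(tgt, src) returns log p(tgt | src) under M_rev."""
    y = model_fwd.translate(x)            # Eq. 5: decode x -> y with beam search
    L = len(x.split())                    # length of the source sentence x
    return -model_rev.logprob(x, y) / L   # Eq. 6: length-normalized reconstruction score
```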

3.2 Data-Driven

The data-driven approaches usually score sentences by optimizing either one of, or a trade-off between, the following two metrics:

• Density: This metric scores sentences based on how similar they are to the entire data in U. In other words, sentences with a higher likelihood under the data distribution of U are scored higher. This strategy assumes that the test set has the same distribution as U, which makes achieving good translations in the dense regions of U more important.

• Diversity: This metric complements the above and encourages sampling sentences which are less similar to the data in L. This eventually leads to L containing a diverse set of sentences, resulting in better generalization performance of the model M.

A key component of the above two metrics is how the similarity between two sentences is measured. The two common practices in the literature are n-gram overlap and cosine similarity between sentence embeddings. In the sections below, we describe the formulation of various data-driven methods according to how sentence similarity is measured.

3.2.1 N-gram Overlap

Ambati (2011) and Eck (2008) investigated density and diversity metrics using n-gram overlap for phrase-based MT and concluded that the best approach is to combine the two in the scoring function. Therefore, we select the Density Weighted Diversity method from the former and Static Sentence Sorting from the latter for our study. Both methods use the following notation:

• I: denotes the indicator function,

• n-gram(x): denotes the multiset of n-grams in a sentence (or a set of sentences) x,

• #(a|X): denotes the frequency of an n-gram a in n-gram(X).

Density Weighted Diversity (DWDS) combines the density and diversity metrics using a harmonic mean. Equations 7 and 8 respectively define the density (α) and diversity (β) metrics, which are combined in Equation 9 to obtain the DWDS scoring function.

\alpha(x, \mathcal{U}, \mathcal{L}) = \frac{\sum_{s \in \text{n-gram}(x)} \#(s|\mathcal{U})\, e^{-\lambda \#(s|\mathcal{L})}}{|\text{n-gram}(x)|\,|\text{n-gram}(\mathcal{U})|} \qquad (7)

Here, λ is a decay parameter used to discount n-grams which have already been seen in the bilingual data.

\beta(x, \mathcal{U}, \mathcal{L}) = \frac{\sum_{s \in \text{n-gram}(x)} I(s \notin \text{n-gram}(\mathcal{L}))}{|\text{n-gram}(x)|} \qquad (8)

\psi_{DWDS}(x, \mathcal{U}, \mathcal{L}) = \frac{\alpha(x, \mathcal{U}, \mathcal{L})\, \beta(x, \mathcal{U}, \mathcal{L})}{k\, \alpha(x, \mathcal{U}, \mathcal{L}) + \beta(x, \mathcal{U}, \mathcal{L})} \qquad (9)

Here, k controls the relative weighting of α and β.
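A sketch of Equations 7-9 over tokenized sentences. The n-gram counts over U and L would be precomputed once per AL iteration; k, λ and the tri-gram limit follow the settings in Section 4.2:

```python
import math
from collections import Counter

def ngrams(tokens, max_n=3):
    """Multiset of all n-grams of a token list, up to tri-grams (Section 4.2)."""
    return Counter(tuple(tokens[i:i + n])
                   for n in range(1, max_n + 1)
                   for i in range(len(tokens) - n + 1))

def dwds(x_tokens, u_counts, l_counts, u_ngram_total, k=1.0, lam=1.0):
    """psi_DWDS (Eqs. 7-9). u_counts / l_counts are n-gram Counters over U / L;
    u_ngram_total is |n-gram(U)|."""
    x_counts = ngrams(x_tokens)
    x_total = sum(x_counts.values())  # |n-gram(x)|
    alpha = sum(c * u_counts[s] * math.exp(-lam * l_counts[s])                 # Eq. 7
                for s, c in x_counts.items()) / (x_total * u_ngram_total)
    beta = sum(c for s, c in x_counts.items() if s not in l_counts) / x_total  # Eq. 8
    if alpha + beta == 0.0:
        return 0.0
    return alpha * beta / (k * alpha + beta)                                   # Eq. 9
```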

Static Sentence Sorting (SSS) is a much simpler formulation which samples sentences from dense regions of U that contain n-grams absent from L.

\psi_{SSS}(x, \mathcal{U}, \mathcal{L}) = \frac{\sum_{s \in \text{n-gram}(x)} I(s \notin \text{n-gram}(\mathcal{L}))\, \#(s|\mathcal{U})}{|x|} \qquad (10)
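Equation 10 in the same style, reusing the ngrams helper from the DWDS sketch above:

```python
def sss(x_tokens, u_counts, l_counts):
    """psi_SSS (Eq. 10): U-frequency-weighted mass of the n-grams of x that
    are unseen in L, normalized by the source length |x|."""
    x_counts = ngrams(x_tokens)
    new_mass = sum(c * u_counts[s] for s, c in x_counts.items()
                   if s not in l_counts)
    return new_mass / len(x_tokens)
```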

3.2.2 Cosine Similarity

Zhang et al. (2018) proposed S-Score (SS), which uses cosine similarity between sentence embeddings, rather than n-gram overlap, as the measure of sentence similarity. S-Score mainly relies on the diversity metric for selection: it samples sentences from U which are furthest from their nearest neighbors in L. Essentially, sentences which are semantically different from all the sentences in L are selected. Let e(x) denote the embedding vector of the sentence x and cos(·, ·) denote cosine similarity; then S-Score is defined as:

\psi_{SS}(x, \mathcal{L}) = \min_{y \in \mathcal{L}} \cos(e(x), e(y)) \qquad (11)
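Equation 11 taken literally, with L represented as a matrix of precomputed sentence embeddings:

```python
import numpy as np

def s_score(x_emb, labeled_embs):
    """psi_SS (Eq. 11): minimum cosine similarity between e(x) and the
    embeddings of the sentences in L (the rows of labeled_embs)."""
    x = x_emb / np.linalg.norm(x_emb)
    labeled = labeled_embs / np.linalg.norm(labeled_embs, axis=1, keepdims=True)
    return float((labeled @ x).min())
```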

Zhang et al. (2018) used learnt sentence embeddings starting from fasttext (Bojanowski et al., 2017) and fine-tuned using Paragraph Vector (Le and Mikolov, 2014).

To better understand how recent advances in unsupervised pre-training can benefit active learning, we perform an ablation study of the S-Score method, varying the source of embeddings. We experiment with the following three increasingly expressive sentence representations:

Bag of words (SS-BoW): This is the simplest method, in which the sentence embedding is computed by averaging all the word embeddings. The word embeddings are obtained from the fasttext tool.
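A sketch of SS-BoW; the word-vector lookup is assumed to be exported from fasttext, and the 300-dimensional zero fallback for sentences containing only out-of-vocabulary tokens is our assumption:

```python
import numpy as np

def bow_embedding(tokens, word_vectors, dim=300):
    """SS-BoW: the sentence embedding is the mean of the word embeddings.
    word_vectors is a token -> np.ndarray lookup (e.g. from fasttext)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```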


Contextual embedding (SS-CE): In this method, we leverage unsupervised pre-training techniques like BERT, which have significantly advanced the SotA in NLP (Devlin et al., 2019). Specifically, we train a Transformer encoder using the Masked Language Modeling (MLM) objective proposed in BERT (Devlin et al., 2019). We then compute the sentence embedding by averaging the outputs of the trained encoder corresponding to all the sentence tokens.

Paraphrastic embedding (SS-PE): The sentence embedding methods listed above and those used by Zhang et al. (2018) are all trained with the objective of predicting tokens based on their context. While this allows the embeddings to be somewhat correlated with the semantics, explicitly fine-tuning the embeddings on semantic similarity can be helpful for our use case. Therefore, we fine-tune the contextual embedding model discussed above on the paraphrase task proposed in Wieting and Gimpel (2018).

Wieting and Gimpel (2018) created a dataset2 containing pairs of English paraphrases by back-translating the Czech side of an English-Czech corpus. We fine-tune the embeddings of paraphrase pairs to be close to each other using a contrastive loss. We specifically choose this task because it does not use any supervised human-annotated corpus for semantic similarity, while achieving competitive performance on SemEval semantic textual similarity (STS) benchmarks.

2 https://drive.google.com/file/d/19NQ87gEFYu3zOIp_VNYQZgmnwRuSIyJd/view?usp=sharing
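To illustrate the fine-tuning objective, below is a numpy sketch of a contrastive margin loss in the spirit of Wieting and Gimpel (2018), using the margin of 0.4 from Section 4.2. It computes the loss value only and uses in-batch hardest negatives, whereas the actual training searches for negatives over megabatches and runs under an autograd framework:

```python
import numpy as np

def paraphrase_margin_loss(src_embs, para_embs, margin=0.4):
    """Margin loss for paraphrastic fine-tuning: pull each paraphrase pair
    (src_embs[i], para_embs[i]) together, push away the hardest in-batch
    negative."""
    s = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    p = para_embs / np.linalg.norm(para_embs, axis=1, keepdims=True)
    sim = s @ p.T                  # pairwise cosine similarities
    pos = np.diag(sim)             # similarity of the true pairs
    mask = np.eye(sim.shape[0], dtype=bool)
    neg = np.where(mask, -np.inf, sim).max(axis=1)  # hardest negative per row
    return float(np.maximum(0.0, margin - pos + neg).mean())
```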

We show that using contextual sentence embeddings does not give any noticeable gains over simply using bag-of-words embeddings; however, fine-tuning the embeddings on a semantic similarity task improves the performance of S-Score substantially, enabling it to outperform the other data-driven approaches.

4 Experiments

4.1 Dataset

Our setup is based on the WMT'13 English-Spanish news translation task. We use the Europarl and News Commentary corpora, consisting of ∼2M sentence pairs. We randomly sample 10% of the whole bilingual data to create the base parallel dataset L (∼200K sentence pairs), which is used to train an initial NMT model. We then randomly sample 50% of the remaining data as the unlabeled dataset U (∼1M sentences) used for simulating the AL experiments. Note that we do the random sampling just once and fix L and U across all experiments for a fair comparison. Since we experiment in a simulated AL framework, the target sentences in U are hidden while scoring source sentences with the different AL strategies. Once the AL algorithm samples a batch B containing 100k source sentences from U, the sampled sentences along with their corresponding "hidden" translations are added to L. We use newstest-2012 as the validation set and newstest-2013 as the test set, each consisting of about 3000 sentence pairs. For training the contextual embeddings, we use the English News Crawl corpus from the years 2007-17, consisting of ∼200M sentences. For preprocessing, we apply the Moses tokenizer (Koehn et al., 2007) without aggressive hyphen splitting and with punctuation normalization. We learn a joint source and target Byte-Pair-Encoding (BPE; Sennrich et al., 2016) on the whole bilingual data with 32k merge operations.

4.2 Training and Model Hyperparameters

For the NMT models in all the experiments, we use the base Transformer configuration with 6 encoder and decoder layers, 8 attention heads, an embedding size of 512, shared input and output embeddings, the ReLU activation function, and sinusoidal positional embeddings. We train with a batch size of 2000 tokens on 8 Volta GPUs using half precision for 30 epochs. Furthermore, we use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001, β1 = 0.9, β2 = 0.98, learning rate warmup over the first 4000 steps, and inverse square root learning rate scheduling. We also apply dropout and label smoothing of 0.1 each. We average the weights of the 5 checkpoints with the best validation loss and run inference with a beam size of 5.
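The warmup plus inverse-square-root schedule mentioned above can be written down directly; a sketch (the exact parameterization varies slightly across toolkits):

```python
def inverse_sqrt_lr(step, peak_lr=0.001, warmup_steps=4000):
    """Linear warmup to peak_lr over the first 4000 steps, then decay
    proportional to 1 / sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps ** 0.5) / (step ** 0.5)
```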

While training the Transformer encoder with the Masked Language Modeling (MLM) objective, we switch to the big configuration with an embedding size of 1024 and 16 attention heads. The masking probabilities are the same as described in Devlin et al. (2019); however, instead of pairs of sentences, we use text streams spanning multiple sentences, truncated at 256 tokens (Lample and Conneau, 2019). This model is trained on 64 Volta GPUs, each processing a batch of 128 segments. We use a learning rate of 0.0003, with the other hyperparameters similar to the NMT model. The above model is fine-tuned on the task of paraphrastic sentence embeddings. Specifically, we use a margin of 0.4, a learning rate of 0.0005, a batch size of 3500 tokens, and megabatches consisting of 32 batches. We train for 5 epochs on 8 Volta GPUs. Lastly, for DWDS, we set k = 1 and λ = 1. For both n-gram based methods, we consider up to tri-grams. In the case of NSE, we restrict the n-best list to size 5. Our baseline is a system that randomly selects sentences from the unlabeled data.

5 Results and Discussion

In this section, we compare the performance of different AL algorithms of each class: model-driven and data-driven. For a comprehensive comparison, we evaluate the best approaches of both classes on:

• Various MT evaluation metrics. N-gram overlap based metrics like BLEU (Papineni et al., 2002) might be biased towards AL methods based on n-gram similarity (DWDS, SSS). For a fair comparison, we evaluate the AL approaches on BLEU; TER (Snover et al., 2006), which is based on edit distance; and BEER (Stanojević and Sima'an, 2015), which uses a linear model trained on human evaluation data.

• Out-of-domain evaluation sets. Since AL algorithms are sensitive to the labeled (L) and unlabeled (U) data distributions, it is possible that some AL algorithms perform worse when evaluated on out-of-domain test sets. To compare the robustness of different AL approaches, we evaluate them on a test set sourced from the biomedical domain, which is very different from the training data distribution (parliament speech and news).

• Evaluation sets without any translationese source sentences. Translationese refers to the distinctive style of sentences that have been translated by human translators. These sentences tend to be simplistic, standardized, and explicit, and to lack complicated constructs like idioms. These properties might make it easy to reconstruct source sentences in the translationese domain, thus discouraging RTTL from sampling them. The presence of translationese source sentences in the test sets might therefore unfairly penalize RTTL.

5.1 Model-Driven Approaches

Figure 1 and Table 1 compare the results of the model-driven approaches and the random sampling baseline. We observe that CS performs worse than the random baseline. This is in contrast to the results reported in Peris and Casacuberta (2018), where it is amongst the best performing methods. The performance of CS depends on the assumption that attention probabilities are good at modeling word alignments. While this assumption is valid in the case of Peris and Casacuberta (2018), which uses attentional sequence-to-sequence models with LSTMs/GRUs, it breaks down in the presence of the multi-layered, multi-headed attention mechanism of the Transformer architecture. Upon closer inspection, we found that this method sampled very long sentences with rare words and many BPE splits, resulting in sub-optimal performance. LC performs slightly better than NSE, while RTTL consistently outperforms all the other methods by a non-trivial margin. This demonstrates that our proposed extension is an effective approximation of model uncertainty.

Figure 1: Results of model-driven approaches.

5.2 Data-Driven Approaches

Figure 2 shows the results of the n-gram based data-driven approaches. As illustrated in the figure, these computationally inexpensive methods consistently outperform the random baseline by a large margin. In spite of modeling density and diversity in very different ways, both methods achieve similar performance.


Table 1: Results for model-driven and data-driven approaches. We report the average BLEU scores across 20 active learning iterations. Best methods within each category are marked with an asterisk.

          | Model-Driven                 | Data-Driven (N-Gram) | Data-Driven (Embedding)
Random    | LC     CS     NSE    RTTL    | DWDS    SSS          | SS-BoW  SS-CE  SS-PE
29.54     | 30.06  29.35  29.93  *30.29  | *30.22  30.21        | 29.84   30.00  *30.17

Figure 3 shows the results of the embedding-based data-driven approaches (SS) with different sources of embeddings. It is noteworthy that using bag-of-words (SS-BoW) and contextual embeddings (SS-CE) results in roughly the same performance, barely beating the random baseline. However, fine-tuning the contextual embeddings on the paraphrase task (SS-PE) brings a large performance gain, emphasizing the effectiveness of fine-tuning on semantic similarity tasks for AL.

The above trends are in line with the results reported in Table 1 as well.

Figure 2: Results of data-driven approaches based on n-gram similarity.

5.3 Performance on Different Evaluation Metrics

Figure 4 compares the top three AL methods (RTTL, DWDS, SS-PE) using BLEU. All three methods are quite competitive, with RTTL and SS-PE performing slightly better than DWDS in the beginning. Figures 4 and 5 show consistent performance trends across all three metrics. It is worth noting from Figure 4 that all the methods are able to achieve the same BLEU as training on the whole bitext while using only 70% of it. This shows that AL can be quite effective for NMT.

Figure 3: Results of the SS method with various sources of embeddings.

Figure 4: Results of best performing approaches with BLEU.


Figure 5: Results of best performing approaches with the BEER (B-X) and TER (T-X) evaluation metrics.

5.4 Evaluation on Test Sets Without Translationese

Zhang and Toral (2019) brought to light the effect of having translationese on the source side of the evaluation sets used in WMT over the years. The situation is even worse for newstest2013, which contains, on the source side, translations of sentences from 5 different languages. We create a new test set by collecting ∼2000 sentences that are originally from English from newstest2009-13, excluding newstest2012 (since we use newstest2012 as the validation set). The BLEU scores of all the methods are much higher on this test set corrected for translationese (∼38 BLEU) than on newstest2013 (∼31 BLEU); however, the relative performance trends remain the same.

5.5 Evaluation on Out-of-Domain Test Sets

Figure 6 shows the results on an out-of-domain test set from the WMT16 Shared Task on Biomedical Translation. It can be observed from the figure that, while RTTL and DWDS are quite robust to the test domain and strongly outperform the random baseline, there is some degradation in the performance of SS-PE.

Figure 6: Results of the best performing approaches on the Biomedical Translation test set.

6 Related Work

Early work on AL for MT includes Ambati (2011), Eck (2008), and Haffari et al. (2009), among others. These papers investigated AL approaches for phrase-based MT systems. Given that the current SotA MT systems are neural, in this work we investigate the effectiveness of their proposed methods in the neural paradigm. Two works that did investigate AL methods for NMT are Peris and Casacuberta (2018) and Zhang et al. (2018). Both of these used an RNN/LSTM-based NMT architecture, whereas we use the latest SotA Transformer in our investigation. Peris and Casacuberta (2018) used an interactive NMT setup and mostly focused on model-driven approaches, disregarding data-driven methods. Zhang et al. (2018) did compare methods from both classes, but considered only a handful of methods. Our work is closest to Zhang et al. (2018), but we cover a much wider spectrum of AL methods. We also go one step further and show that the cosine similarity based methods proposed in Zhang et al. (2018) are more effective when the embeddings are optimized for the paraphrase task. As far as we know, most of the prior work concluded that data-driven methods outperform model-driven methods; however, our model-driven RTTL formulation obtains a slight gain over the best data-driven method.

7 Conclusion

In this work, we performed an empirical evaluation of different AL methods for the state-of-the-art neural MT architecture, a missing aspect of prior work. We explored two classes of approaches, data-driven and model-driven, and observed that all the methods outperform a random baseline, except coverage sampling, which relies on the attention mechanism. Coverage sampling was shown to be amongst the best approaches in prior work that used LSTM-based NMT models. Given the Transformer's more complex attention architecture (multi-headed and multi-layered), it appears that its attention scores are not reliable enough to be used with AL methods.

From our ablation study using different sources of embeddings, we discovered that optimizing the embeddings towards a semantic similarity task can give significant performance improvements in data-driven AL methods. Also, for the first time, we observed that a model-driven approach can outperform data-driven methods; the improvement was more evident in the out-of-domain evaluation results. This was made possible by our proposed neural extension, RTTL, which computes the likelihood of reconstructing the original source from its translation using a reverse translation model. Overall, we observed that the performance trends of the different AL methods were consistent across all three evaluation metrics (BLEU, BEER, and TER) and across different evaluation sets (in-domain and out-of-domain).

Acknowledgments

We would like to thank Tom Gunter, Andrew Finch, Russ Webb and the anonymous reviewers for their helpful comments. Many thanks to the Siri Machine Translation Team for helpful discussions and support.

References

Vamshi Ambati. 2011. Active learning and crowdsourcing for machine translation in low resource scenarios. Ph.D. thesis, Carnegie Mellon University.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, San Diego, CA, USA.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. 2009. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation, pages 128–188. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

David Cohn, Les Atlas, and Richard Ladner. 1994. Improving generalization with active learning. Machine Learning, Special Issue on Structured Connectionist Systems, 15(2):201–221.

Ido Dagan and Sean P. Engelson. 1995. Committee-based sampling for training probabilistic classifiers. In Machine Learning Proceedings 1995, pages 150–157. Elsevier, Tahoe City, California, USA.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.

Matthias Eck. 2008. Developing deployable spoken language translation systems given limited resources. Ph.D. thesis, publisher not identified.

Yuhong Guo and Russell Greiner. 2007. Optimistic active-learning using mutual information. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, volume 7, pages 823–829, Hyderabad, India.

Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 415–423, Boulder, Colorado, USA. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations, San Diego, CA, USA.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. Pages 177–180, Prague, Czech Republic.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, Beijing, China.

David D. Lewis and Jason Catlett. 1994. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994: Proceedings of the Eighth International Conference, pages 148–156, New Brunswick, NJ, USA. Morgan Kaufmann Publishers Inc.

Myle Ott, Michael Auli, David Grangier, et al. 2018. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning, pages 3953–3962, Stockholm, Sweden.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. Pages 311–318, Philadelphia, PA, USA.

Álvaro Peris and Francisco Casacuberta. 2018. Active learning for interactive neural machine translation of data streams. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 151–160, Brussels, Belgium.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. Pages 1715–1725, Berlin, Germany.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 1070–1079, Waikiki, Honolulu, Hawaii.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, Massachusetts.

Miloš Stanojević and Khalil Sima'an. 2015. BEER 1.1: ILLC UvA submission to metrics and tuning task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 396–401, Lisbon, Portugal.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85.

Gokhan Tur, Dilek Hakkani-Tür, and Robert E. Schapire. 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45(2):171–186.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Pages 1–11, Long Beach, CA, USA.

John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Melbourne, Australia.

Mike Zhang and Antonio Toral. 2019. The effect of translationese in machine translation test sets. arXiv preprint arXiv:1906.08069.

Pei Zhang, Xueying Xu, and Deyi Xiong. 2018. Active learning for neural machine translation. In 2018 International Conference on Asian Language Processing (IALP), pages 153–158, Bandung, Indonesia. IEEE.