

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 1519–1533, April 19–23, 2021. ©2021 Association for Computational Linguistics


The Source-Target Domain Mismatch Problem in Machine Translation

Jiajun Shen⋆† Peng-Jen Chen⋆† Matt Le† Junxian He•∗ Jiatao Gu† Myle Ott† Michael Auli† Marc’Aurelio Ranzato†

†Facebook AI Research  •Carnegie Mellon University

{jiajunshen,pipibjc,mattle,jgu}@fb.com {myleott,michaelauli,ranzato}@fb.com [email protected]

Abstract

While we live in an increasingly interconnected world, different places still exhibit strikingly different cultures, and many events we experience in our everyday life pertain only to the specific place we live in. As a result, people often talk about different things in different parts of the world. In this work we study the effect of local context in machine translation and postulate that this causes the domains of the source and target language to greatly mismatch. We first formalize the concept of source-target domain mismatch, propose a metric to quantify it, and provide empirical evidence for its existence. We conclude with an empirical study of how source-target domain mismatch affects the training of machine translation systems on low resource languages. While this mismatch may severely affect back-translation, the degradation can be alleviated by combining back-translation with self-training and by increasing the amount of target side monolingual data.

1 Introduction

The use of language greatly varies with geographic location (Firth, 1935; Johnstone, 2010). Even within places where people speak the same language (Britain, 2013), there is a lot of lexical variability due to changes in style and topic distribution, particularly when considering content posted on social media, blogs and news outlets. For instance, while a primary topic of discussion among British sports fans is cricket, American sports fans are more likely to discuss other sports such as baseball (Leech and Fallon, 1992).

The effect of local context in the use of language is even more extreme when considering regions where different languages are spoken. Despite the increasingly interconnected world we live in, people in different places tend to talk about different things. There are several reasons for this, from cultural differences due to geographic separation and history, to the local nature of many events we experience in our everyday life.

∗Work done during an internship at Facebook AI Research. ⋆Equal contribution.

This phenomenon has not only interesting sociolinguistic aspects but also strong implications for machine translation (MT) (Bernardini and Zanettin, 2004). In particular, machine translation aims at automatically translating content between two languages that are often spoken in very distant geographic locations by people with rather different cultures.

As of today, most MT research has been based on the often implicit assumption that content in the two languages is comparable. Sentences comprising the parallel dataset used for training are assumed to cover the same topic distribution, regardless of the originating language. Even when there exists a mismatch between the source and the target domain, the dataset creator is assumed to have made the effort to equalize the two distributions.

The major contribution of this work is to raise awareness in the MT community that this assumption may not hold in many real world settings of interest, which often involve distant and low-resource language pairs and content produced every day on the Internet by means of blogs, social platforms and news outlets. The goal of an MT system is to translate source sentences sampled from the source domain to the target language. Training an MT system on a comparable corpus may lead to poor generalization unless the test domain matches the domain of the training data. Likewise, training on an uncurated corpus exhibiting a mismatch between the source and the target domain may also work poorly when naïvely applying popular methods like back-translation (Sennrich et al., 2015).


Notice that there may very well be several factors contributing to such mismatch, such as a change of register, different functional use of the text in the two languages, and differences in the quality of the data generation process of the two languages. Regardless of what is contributing to the observed mismatch, it is important to be aware of its existence and its effect on MT algorithms.

The source-target domain mismatch (STDM) can be understood as an instance of multi-domain MT (see Fig. 1 for an illustration and §3 for a formal definition), whereby part of the parallel dataset and the source monolingual dataset are “in-domain” because they originate from the source domain, and the remaining part of the parallel dataset as well as the target monolingual data are “out-of-domain”. There already exist several techniques for domain adaptation, like domain tagging (Caswell et al., 2019) and dataset weighting (Edunov et al., 2018; Wang et al., 2017; van der Wees et al., 2017), which are applicable and which we also employ in this work. However, these may not be enough to improve generalization on low resource languages because STDM effectively decreases the already scarce amount of useful (in-domain) parallel data, hindering good generalization. It is therefore important to quantify STDM (§4) and consider how STDM affects methods that leverage monolingual data (§5).

For instance, STDM may negatively impact the effectiveness of back-translation because, even if the backward model were perfect, the back-translated data is out-of-domain relative to the source domain from which we aim to translate. Empirically, we found that this is the case both in a controlled setting (§6.1) and in realistic datasets (§6.2). However, this issue can be mitigated by adding more target-side monolingual data and by combining back-translation with self-training (Yarowsky, 1995).

2 Related Work

The observation that topic distributions and various kinds of lexical variability depend on the local context has been known and studied for a long time (Firth, 1935). For instance, Firth (1935) says “Most of the give-and-take of conversation in our everyday life is stereotyped and very narrowly conditioned by our particular type of culture”. In her seminal work, Johnstone (2010) analyzed the role of place in language, focusing on lexical variations within the same language, a subject further explored by Britain (2013). Some of these works were the basis for later studies that introduced computational models for how language changes with geographic location (Mei et al., 2006; Eisenstein et al., 2010).

In the field of topic modeling, a sub-field focusing on modeling multilingual corpora has emerged over the past 10 years (Mimno et al., 2009; Boyd-Graber and Blei, 2009; Gutierrez et al., 2016). However, only recently have researchers dropped assumptions on the use of parallel and comparable corpora (Hao and Paul, 2018; Yang et al., 2019). While some works do investigate issues related to STDM (Gutierrez et al., 2016), like how named entities receive a different distribution over words in different languages (Lin et al., 2018), none of these works have analyzed how the overall topic distributions of data originating in the source and target language differ.

In MT, researchers have often made an explicit assumption on the use of comparable corpora (Fung and Yee, 1998; Munteanu et al., 2004; Irvine and Callison-Burch, 2013), i.e., corpora in the two languages that roughly cover the same set of topics. Unfortunately, monolingual corpora are seldom comparable in practice. Leech and Fallon (1992) analyze two comparable corpora, one in American English and the other in British English, and demonstrate differences that reflect the cultures of origin. Similarly, Bernardini and Zanettin (2004) observe that parallel datasets built for MT exhibit strong biases in the selection of the original documents, making the text collections not quite comparable.

The non-comparable nature of MT datasets is even more striking when considering low resource language pairs, for which differences in local context and culture are often more pronounced. Recent studies (Søgaard et al., 2018; Neubig and Hu, 2018) have warned that removing the assumption of comparable corpora strongly deteriorates the performance of lexicon induction techniques, which are at the foundation of MT.

Back-translation (Sennrich et al., 2015) has been the workhorse of modern neural MT, enabling very effective use of target side monolingual data. Back-translation is beneficial because it helps regularize the model and adapt to new domains (Burlot and Yvon, 2018). However, the typical setting of current MT benchmarks as popularized by


Figure 1: Toy illustration of STDM in MT. There are two domains, the source domain D_S (top) and the target domain D_T (bottom). We postulate that in a latent concept space these two domains differ because the topic distributions are different (e.g., in the source domain politics is more popular than travel) and because for the same topic the word distributions are different (e.g., the word “Everest” is more common than “Yosemite” in the travel topic of the target domain). On the right hand side, we show how STDM manifests in machine translation datasets. All data originating in the source language belongs to the source domain; this includes a portion of the parallel dataset, the source side monolingual dataset and the test set we eventually would like to translate. Empty boxes represent human translated data in the parallel training dataset.

recent WMT competitions (Bojar et al., 2019) is a mismatch between training and test sets, as opposed to a mismatch between source and target domains as in this work. Self-training (Ueffing, 2006; Yarowsky, 1995; He et al., 2020) has then been employed to make better use of source side monolingual data, as this is in-domain with the text we would like to translate at test time. Finally, there is a vast literature on domain adaptation which has so far mostly focused on domain shift between training and test distributions, and on the presence of multiple domains. Finetuning (Freitag and Al-Onaizan, 2016), domain tagging (Caswell et al., 2019) and various kinds of dataset weighting (Wang et al., 2017; van der Wees et al., 2017) are among the most popular methods to cope with domain issues, and this is also what we use in this work.

3 The STDM Problem

In this section we formalize the definition of Source-Target Domain Mismatch (STDM); this is an intrinsic property of the data which is independent of the particular MT system under consideration. In practice, there might be several factors contributing to STDM. Here, we are going to consider only the difference in topic distributions, since this is what we can easily quantify.

We assume there exists a latent concept space shared across all languages. The process to generate a sentence follows the standard data generation process used in topic modeling, whereby we first sample a distribution over topics, π_i ∼ Π, where i is an index over topics, and then a distribution over words for each topic, w_ij ∼ π_i, where j indexes the words in the dictionary. Next, we assume there are two distinct domains, the source domain D_S and the target domain D_T. These two domains differ in both the distribution over topics Π, and the distribution over words given a certain topic π_i, as depicted in Fig. 1. For the sake of conciseness, we will refer to z_s and z_t as sentences in the concept space generated from domain D_S and D_T, respectively.

Let’s imagine now that we have generated two sets of sentences in each domain. What we observe in practice is their realization in each language, src(z_s) and tgt(z_t), where src and tgt map sentences from the concept space to the source and target language, respectively. Finally, let’s denote with h_{s→t} and h_{t→s} the functions representing human translations of source sentences into the target language and vice versa.
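To make the generative story concrete, the following is a toy numerical sketch of this process (purely illustrative; all variable names are ours, not part of the formalism above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, sent_len = 4, 100, 12

# A domain is characterized by a distribution over topics (pi ~ Pi) ...
topic_mix = rng.dirichlet(np.ones(n_topics))
# ... and by per-topic distributions over words (w_ij ~ pi_i).
word_dist = rng.dirichlet(np.ones(vocab_size), size=n_topics)

# Generate one latent "sentence" z: pick a topic per token, then a word.
topics = rng.choice(n_topics, size=sent_len, p=topic_mix)
z = [rng.choice(vocab_size, p=word_dist[t]) for t in topics]
```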

In the simplest setting, a machine translation dataset is composed of parallel and monolingual datasets. Using the notation introduced above, the parallel dataset is denoted by P = {(src(z_s), h_{s→t}(src(z_s)))}_{z_s∼D_S} ∪ {(h_{t→s}(tgt(z_t)), tgt(z_t))}_{z_t∼D_T}. The first set originates in the source language and belongs to the source domain, while the second set originates in the target language and belongs to the target domain. We then have a source side monolingual dataset, M_S = {src(z_s)}_{z_s∼D_S}, and a target side monolingual dataset, M_T = {tgt(z_t)}_{z_t∼D_T}, belonging to the source and target domains, respectively. Most importantly, the test set which we would like to eventually translate contains sentences in the source language, all belonging to the source domain. The existence of distinct source and target domains, and of datasets derived from these two domains as described above, defines the STDM problem.

While previous MT studies consider a mismatch between the training and the test distribution, STDM is a particular case of multi-domain training, with the test set matching one of the training domains. We would like to a) understand the effects of such mismatch and b) understand how to best leverage the out-of-domain data originating from the target language (the target monolingual dataset and the


Figure 2: Topic distribution of Wikipedia pages written in English and Chinese.

portion of the parallel dataset originating in the target language).

3.1 Empirical Evidence

In this section we first provide anecdotal evidence that documents originating in different languages possess different distributions over topics. We train two topic classifiers (see Appendix A for details), one for Chinese and the other for English, using the Wikipedia annotated data from Yuan et al. (2018). We apply these classifiers to 20,000 documents randomly sampled from English and Chinese Wikipedia. Fig. 2 shows that, according to these classifiers, English Wikipedia has more pages about entertainment and religion than Chinese Wikipedia, for instance.

Second, we refer the reader to Leech and Fallon (1992)’s study for evidence that corpora originating in different places may have different word distributions for the same set of topics. In that study, Leech and Fallon (1992) analyzed a British and an American corpus constructed using exactly the same distribution of topics, and yet the word distributions were different, reflecting the different cultural biases of the two countries.

4 Metric: The STDM Score

Given the framework and assumptions introduced in §3, in this section we discuss a practical way to measure STDM. Ideally, we would like to measure a distance between two sample distributions, z_s ∼ D_S and z_t ∼ D_T. Unfortunately, we have no access to such a latent space. What we observe are realizations in the source and target language. However, it is also an open research question (Hao and Paul, 2018; Yang et al., 2019) how to compare the distribution of {src(z_s)} against {tgt(z_t)}, since these are two possibly incomparable corpora in different languages.

In this work, we therefore leverage the existence of a parallel corpus and compare the distribution of A_T = {tgt(z_t)}_{z_t∼D_T} with A_S = {h_{s→t}(src(z_s))}_{z_s∼D_S}. The underlying assumptions are: a) we know the originating language of each training example, b) the effect of the change of the word distribution is negligible compared to the shift in topic distribution, and c) the effect of translationese (Baker, 1993; Zhang and Toral, 2019; Toury, 2012) is negligible compared to the actual STDM, and therefore we can ignore changes to the distribution brought by the mapping h_{s→t} (we validate this assumption in §C).

Under these assumptions, we define the score as a measure of the topic discrepancy between A_S and A_T. Let A = A_S ∪ A_T be the concatenation of the corpus originating in the source and target language. We first extract topics using LSA. Let A ∈ R^{(n_S+n_T)×k} be the TF-IDF matrix derived from A, where the first n_S rows are representations taken from A_S, the bottom n_T rows are representations of A_T, and k is the number of words in the dictionary. The SVD decomposition of A yields:

A = USV = (U√S)(√S V) = Ū V̄.

Matrix Ū collects topic representations of the original documents; let’s denote by Ū_S the first n_S rows corresponding to A_S and by Ū_T the remaining n_T rows corresponding to A_T. Let

C = Ū Ū′ = [ C^{SS}  C^{ST} ; C^{ST}′  C^{TT} ],

where C^{SS} = Ū_S Ū_S′, C^{ST} = Ū_S Ū_T′ and C^{TT} = Ū_T Ū_T′. The STDM score is defined as:

score = (s^{ST} + s^{TS}) / (s^{SS} + s^{TT}),  with  s^{AB} = (1 / (n_A n_B)) Σ_{i=1}^{n_A} Σ_{j=1}^{n_B} C^{AB}_{i,j}   (1)

where s^{AB} measures the average similarity between documents of set A and documents of set B. The score measures the cross-corpus similarity normalized by the within-corpus similarity. In the extreme setting where D_S and D_T are fully disjoint, the off-diagonal block C^{ST} is a zero matrix and therefore the score is equal to 0. When the two domains perfectly match instead, s^{SS} = s^{TT} = s^{ST} = s^{TS}, and therefore the score is equal to 1. In practice, we expect a score in the range [0, 1]. Note that we opted for this metric because of its simplicity, but other methods to extract topics and measure domain mismatch could have been used.
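As a concrete reference, the following is a minimal sketch of Eq. 1 using scikit-learn, assuming two lists of sentences corpus_s (A_S, the human translations of the source originating data) and corpus_t (A_T, the target originating data) that are large enough for the requested number of topics; all names are ours:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def stdm_score(corpus_s, corpus_t, n_topics=400):
    # TF-IDF matrix over the concatenated corpus A = A_S ∪ A_T.
    A = TfidfVectorizer().fit_transform(corpus_s + corpus_t)
    # Truncated SVD; fit_transform returns U*S, so rescale to Ū = U*sqrt(S).
    svd = TruncatedSVD(n_components=n_topics)
    U_bar = svd.fit_transform(A) / np.sqrt(svd.singular_values_)
    U_s, U_t = U_bar[:len(corpus_s)], U_bar[len(corpus_s):]

    # s^{AB}: average pairwise dot product between the two sets of topic
    # representations, computed without materializing C = Ū Ū'.
    def s(Ua, Ub):
        return Ua.sum(axis=0) @ Ub.sum(axis=0) / (len(Ua) * len(Ub))

    return (s(U_s, U_t) + s(U_t, U_s)) / (s(U_s, U_s) + s(U_t, U_t))
```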

4.1 A Controlled Setting

Similarly to Kilgarriff and Rose (1998), we introduce a synthetic benchmark to finely control the


α            0     0.25  0.5   0.75  1.0
STDM score   0.29  0.55  0.78  0.93  0.99

Table 1: STDM score as a function of the parameter α controlling the STDM in the synthetic setting.

domain of the target originating data, and therefore the amount of STDM. The objective is to assess whether the STDM score defined in Eq. 1 captures well the expected amount of mismatch, since we have full control over how the data was generated and its domain.

The key idea of this controlled setting is to use data from two very different domains, assigning the source domain to one of them and the target domain to a convex combination of the two. In this work we use EuroParl (Koehn, 2005) as our source originating data, while our target originating data contains a mix of data from EuroParl and OpenSubtitles (Lison and Tiedemann, 2016). Specifically, we consider a French to English translation task with a parallel dataset composed of 10,000 sentences from EuroParl (which “originates” in French) and 10,000 sentences from the target domain (which “originates” in English)¹. Let α ∈ [0, 1]; the domain of the target is set to: α EuroParl + (1−α) OpenSubtitles. α controls how similar the target domain is to the source domain.

For instance, when α = 0 the target domain (OpenSubtitles) is totally out-of-domain with respect to the source domain (EuroParl). When α = 1 instead, the target domain matches the source domain perfectly. For intermediate values of α, the match is only partial. Notice that even when α = 0, we assume that the parallel dataset consists of two halves, one originating from the EuroParl domain (the “French originating” data) and one from OpenSubtitles (the “English originating” data).
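A minimal sketch of this construction (hypothetical helper; europarl and opensubtitles are lists of sentences, assumed large enough to sample from):

```python
import random

def sample_target_data(europarl, opensubtitles, alpha, n=10000, seed=0):
    """Target originating data: alpha * EuroParl + (1 - alpha) * OpenSubtitles."""
    rng = random.Random(seed)
    n_in = round(alpha * n)  # in-domain (EuroParl) share
    return rng.sample(europarl, n_in) + rng.sample(opensubtitles, n - n_in)
```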

Next, we evaluate the STDM score as a function of α. As we can see from Table 1, and as desired, the STDM score increases fairly linearly as we increase the value of α. We refer the reader to Appendix B for experimental details.

4.2 STDM Score of Various Datasets

We now evaluate the STDM score on real data. We consider six language pairs: German-English,

¹Clearly, EuroParl does not all originate in French. However, the chosen domains are so distinct that the difference in topic distribution between EuroParl and OpenSubtitles will dominate discrepancies caused by other factors, such as the actual origin of the data.

        De-En  Fi-En  Ru-En  Ne-En  Zh-En  Ja-En
WMT     0.79   0.79   0.76   -      0.65   -
MTNT    -      -      -      -      -      0.69
SMD     0.81   0.71   0.71   0.64   0.71   0.61

Table 2: STDM score on several language pairs using parallel data from WMT, MTNT and from a social media platform (SMD) test sets.

Finnish-English, Russian-English, Nepali-English, Chinese-English and Japanese-English. We analyze datasets from WMT, MTNT (Michel and Neubig, 2018) and from a social media platform (SMD). For each language, we sample 5000 sentences from the WMT newstest sets and the MTNT dataset, and 20000 sentences from SMD. We then merge all these datasets and their English translations to compute a common set of topics, making STDM scores comparable across language pairs and datasets.

The results in Table 2 are striking. First, WMT datasets, except for Chinese, show relatively mild signs of STDM and negligible differences across language pairs, suggesting that the data curation process of WMT datasets has made source and target originating corpora rather comparable. The distribution of WMT Chinese originating data is instead rather different, because it contains much more local news, while the other languages mostly cover international news, which is largely language independent. Interestingly, De-En data derived from social media has even milder STDM, while Fi-En and Ru-En have more substantial STDM. MTNT and SMD instead exhibit strong signs of STDM for distant languages like Nepali, Chinese and Japanese. This agrees well with our intuition that STDM is more severe for more distant languages associated with more diverse cultures.

5 Machine Translation Baselines

In this section, we turn our attention to how STDM affects the training of MT systems. We consider state-of-the-art neural machine translation (NMT) systems based on the transformer architecture (Vaswani et al., 2017) with subword vocabularies learned via byte-pair encoding (BPE) (Sennrich et al., 2015). In order to adapt to the different domains, we employ domain tagging (Zheng et al., 2019) by adding a domain token to the input source sentence, and we cross-validate the weights between in-domain and out-of-domain data². We also

²In the controlled setting of §6.1 we found that tagging yields a small but consistent improvement of up to 1 BLEU point, and dataset weighting yields an improvement of up to 0.3 BLEU. ST and BT, which are the focus of this study, improve by more than 2 BLEU points instead.


use label smoothing (Szegedy et al., 2016) and dropout (Srivastava et al., 2014) to improve generalization, as we focus on low resource language pairs where models tend to severely overfit. Finally, we explore ways to leverage both target and source side monolingual data via back-translation and self-training, which we review next.

We simplify our notation and denote with x^s = src(z_s) and y^t = tgt(z_t) the source and target originating sentences, y^s = h_{s→t}(x^s) and x^t = h_{t→s}(y^t) the corresponding human translations, and ȳ^s and x̄^t the corresponding machine translations. The superscript always specifies the domain. We assume access to a parallel dataset P = {(x^s, y^s)} ∪ {(x^t, y^t)}, a source side monolingual dataset M_s = {x^s} and a target side monolingual dataset M_t = {y^t}. The test set consists of sentences from the source domain. We study STDM by training with the following algorithms:

Back-translation (BT) (Sennrich et al., 2015) is a very effective data augmentation technique that leverages M_t. The algorithm proceeds in three steps. First, a reverse MT system is trained from target to source using the provided parallel data: ←θ = arg max_θ E_{(x,y)∼P} log p(x|y; θ). Then, the reverse model is used to translate the target monolingual data: x̄^t ≈ arg max_z p(z|y^t; ←θ), for y^t ∼ M_t. The maximization is typically approximated by beam search. Finally, the forward model is trained over the concatenation of the original parallel and back-translated data: →θ = arg max_θ E_{(x,y)∼Q} log p(y|x; θ) with Q = P ∪ {(x̄^t, y^t)}_{y^t∼M_t}. In practice, the parallel data is weighted more in the loss, with a weight selected via hyper-parameter search on the validation set.
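The three BT steps can be summarized with the following schematic sketch. The helpers train_fn(pairs, weights) and translate_fn(model, sentence) are hypothetical stand-ins for a real NMT toolkit, not a specific API:

```python
def back_translation(parallel, mono_tgt, train_fn, translate_fn, pweight=2.0):
    # 1) train the reverse (target -> source) model on the parallel data
    reverse = train_fn([(y, x) for (x, y) in parallel], weights=None)
    # 2) back-translate the target monolingual data (beam search inside translate_fn)
    synthetic = [(translate_fn(reverse, y), y) for y in mono_tgt]
    # 3) retrain the forward model on parallel + synthetic pairs,
    #    upweighting the genuine parallel data in the loss
    weights = [pweight] * len(parallel) + [1.0] * len(synthetic)
    return train_fn(parallel + synthetic, weights=weights)
```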

Self-Training (ST) (He et al., 2020) is another method for data augmentation that instead leverages M_s; see Alg. 1 in Appendix D. This algorithm also proceeds in three steps. First, a forward MT system is trained from source to target using the provided parallel data: →θ = arg max_θ E_{(x,y)∼P} log p(y|x; θ). Then, this model is used to translate the source monolingual data: ȳ^s ≈ arg max_z p(z|x^s; →θ), for x^s ∼ M_s. Finally, the forward model is retrained over the concatenation of the original parallel and forward-translated data: →θ = arg max_θ E_{(x,y)∼Q} log p(y|x; θ) with Q = P ∪ {(x^s, ȳ^s)}_{x^s∼M_s}. As with BT, the parallel data is weighted more in the loss.


Figure 3: BLEU score in Fr-En as a function of the amount of STDM. The target domain is fully out-of-domain when α = 0, and fully in-domain when α = 1.

ST + BT also proceeds in three steps. First, we train an initial forward and reverse model using the parallel dataset. Second, we back-translate target side monolingual data using the reverse model and iteratively forward translate source side monolingual data using the forward model. We then retrain the forward model from random initialization using the union of the original parallel dataset, the synthetic back-translated data, and the synthetic forward-translated data from the last iteration of the ST algorithm. This combined algorithm aims at leveraging the strengths of both ST and BT: the use of in-domain source monolingual data and the use of synthetic data with correct targets, respectively.
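The following is a simplified, single-iteration sketch of ST + BT with the same hypothetical helpers as in the BT sketch above (the actual ST step is iterative; see Alg. 1 in Appendix D):

```python
def st_plus_bt(parallel, mono_src, mono_tgt, train_fn, translate_fn, pweight=2.0):
    forward = train_fn(parallel, weights=None)
    reverse = train_fn([(y, x) for (x, y) in parallel], weights=None)
    bt_pairs = [(translate_fn(reverse, y), y) for y in mono_tgt]  # correct targets, out-of-domain
    st_pairs = [(x, translate_fn(forward, x)) for x in mono_src]  # in-domain sources, noisy targets
    data = parallel + bt_pairs + st_pairs
    weights = [pweight] * len(parallel) + [1.0] * (len(bt_pairs) + len(st_pairs))
    # retrain the forward model from random initialization on the union
    return train_fn(data, weights=weights)
```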

6 Machine Translation Results

In this section, we first study the effect of STDM on NMT using the controlled setting introduced in §4.1, which enables us to assess the influence of various factors, such as the extent to which target originating data is out-of-domain, and the effect of monolingual data size. We then report experiments on genuine low resource language pairs, namely Nepali-English and English-Myanmar. We report SACREBLEU (Post, 2018).

6.1 Controlled Setting

In the default setting, we have a parallel dataset with 20,000 parallel sentences. 10,000 are in-domain source originating data (EuroParl) and the remaining 10,000 are target originating data from a mix of domains, controlled by α ∈ [0, 1]: α EuroParl + (1 − α) OpenSubtitles. The source side monolingual dataset has 100,000 French sentences from EuroParl. The target side monolingual dataset has 100,000 English sentences from: α EuroParl + (1 − α) OpenSubtitles. Finally, the test set consists of novel French sentences from EuroParl which we wish to translate to English.


Figure 4: BLEU as a function of the amount of monolingual data when α = 0.

We tune model hyper-parameters (e.g., number of layers and hidden state size) and BPE size on the validation set. Based on cross-validation, when training on datasets with fewer than 300k parallel sentences (including those from ST or BT), we use a 5-layer transformer with 8M parameters. The number of attention heads, embedding dimension and inner-layer dimension are 2, 256 and 512, respectively. When training on bigger datasets, we use a bigger transformer with 5 layers, 8 attention heads, 1024 embedding dimension, 2048 inner-layer dimension and a total of 110M parameters. We find that the bigger model can better utilize the monolingual data, but we do not find that it helps when training with fewer than 300k parallel sentences. The full list of hyper-parameters can be found in Appendix E.
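For reference, the two configurations can be summarized as follows (keys are ours, not a specific toolkit's flags):

```python
# Selected by cross-validation; used below / above ~300k parallel sentences.
small_model = dict(layers=5, attention_heads=2, embed_dim=256, ffn_dim=512)    # ~8M params
big_model   = dict(layers=5, attention_heads=8, embed_dim=1024, ffn_dim=2048)  # ~110M params
```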

Varying amount of STDM. In Fig. 3, we benchmark our baseline approaches while varying α (see §4.1 and Tab. 1), which controls the overlap between source and target domains.

First, we observe improved BLEU (Papineni et al., 2002) scores for all methods as we increase α. Second, there is a big gap between the baseline trained on parallel data only and methods which leverage monolingual data. Third, combining ST and BT works better than each individual method, confirming that these approaches are complementary. Finally, BT works better than ST, but the gap shrinks as the target domain becomes increasingly different from the source domain (small values of α). In the extreme case of STDM (α = 0), ST outperforms BT. In fact, we observe that the gain of BT over the baseline decreases as α decreases, even though the amount of monolingual and parallel data remains constant across these experiments, thus showing that BT is less effective in the presence of STDM.

Figure 5: BLEU when using only source originating in-domain data (blue bars) or also out-of-domain target originating data (green bars) for α = 0.

Varying amount of monolingual data. We next explore how the quantity of monolingual data affects performance, and whether the relative gain of ST over BT when α = 0 disappears as we provide BT with more monolingual data. The experiment in Fig. 4 shows that a) the gain in BLEU tapers off exponentially with the amount of data (notice the log-scale on the x-axis), b) for the same amount of monolingual data ST is always better than BT, and by roughly the same amount, and c) BT would require about 3 times more target monolingual data (which is out-of-domain) to match the performance of ST. Therefore, increasing the amount of data can compensate for domain mismatch.

Varying amount of in-domain data. Now we explore whether, in the presence of extreme STDM (α = 0), it may be worth restricting the training data to only contain in-domain source originating sentences. In this case, the parallel set is reduced to 10,000 EuroParl sentences, the target side monolingual data is removed and back-translation is performed on the target side of the parallel dataset. Fig. 5 demonstrates that in all cases it is better to include the out-of-domain data originating on the target side (green bars). Particularly in the low resource settings considered here, neural models benefit from all available examples even if these are out-of-domain.

Finally, we investigate how to construct a parallel dataset when STDM is significant (α = 0), i.e., the target domain is OpenSubtitles. If we have a translation budget of 20,000 sentences, is it best to translate 20,000 sentences from EuroParl, or to also include sentences from OpenSubtitles? This is not obvious when training with BT, since the backward model may benefit from in-domain OpenSubtitles data. In order to answer this question, we consider a parallel dataset with 20,000 sentences defined as: β EuroParl + (1−β) OpenSubtitles, with β ∈ [0, 1].


Figure 6: BLEU score as a function of the proportion of parallel data originating in the source and target domains. When β = 0 all parallel data originates from OpenSubtitles; when β = 1 all parallel data originates from EuroParl. Source and target monolingual corpora have 900,000 sentences from EuroParl and OpenSubtitles, respectively. The blue curves show BLEU in the forward direction (Fr-En translation of EuroParl data). The red curves show BLEU in the reverse direction (En-Fr translation of OpenSubtitles sentences).

When β = 0, the parallel dataset is out-of-domain; when β = 1 the parallel data is all in-domain. The target side monolingual dataset is fixed and contains 900,000 sentences from OpenSubtitles.

Fig. 6 shows that taking all sentences from EuroParl (β = 1) is optimal when translating from French (EuroParl) to English (blue curves). At high values of β, we observe a slight decrease in accuracy for models trained only on back-translated data (dotted line), confirming that BT loses its effectiveness when the reverse model is trained on out-of-domain data. However, this is compensated by the gains brought by the additional in-domain parallel sentences (dashed line). In the more natural setting in which the model is trained on both parallel and back-translated data (dash-dotted line), we see a monotonic improvement in accuracy with β. A similar trend is observed in the other direction (English to French, red curves). Therefore, if the goal is to maximize translation accuracy in both directions, an intermediate value of β (≈ 0.5) is more desirable.

6.2 Low-Resource MT

We now measure STDM on two low-resource language pairs and verify whether in practice BT's performance deteriorates as expected when the STDM score is low, while the combination of ST+BT offers better generalization. We consider two low-resource language pairs, Nepali-English (Ne-En) and English-Myanmar (En-My). Nepali and Myanmar are spoken in regions with unique local context that is very distinct from English-speaking regions, and thus these make good language pairs for studying the STDM setting in real life.

Model      Ne→En              En→My
           STDM score=0.64    STDM score=0.27
baseline   20.4               28.1
BT         22.3               30.0
ST         22.1               31.9
ST + BT    22.9               32.4

Table 3: BLEU scores for the Nepali to English and English to Myanmar translation tasks.


Data. The Ne-En parallel dataset is composed of 40,000 sentences originating in Nepali and only 7,500 sentences originating in English. There are 5,000 sentences in the validation and test sets, all originating in Nepali. We also have 1.8M monolingual sentences in Nepali and English, collected from public posts from a social media platform. This dataset closely resembles our idealized setting of Fig. 1. The STDM score of this dataset is 0.64 (see Tab. 2) and is analogous to our synthetic setting (§6.1) where α is low but β is large.

The En-My parallel data is taken from the Asian Language Treebank (ALT) corpus (Thu et al., 2016; Ding et al., 2018, 2019), with 18,088 training sentences all originating from English news. The validation and test sets have 1,000 sentences each, all originating from English. Following Chen et al. (2019), we use 5M English sentences from NewsCrawl as source side monolingual data and 100K Myanmar sentences from Common Crawl as target side monolingual data. The STDM score is even lower on this dataset, only 0.27. Compared to our controlled setting, this dataset would have β equal to 1 and presumably a small value of α, an ideal setting for ST.

Models. We run model hyper-parameter sweeps on the validation set and pick the best-performing model architecture (e.g., number of layers and hidden layer sizes). On both datasets, the parallel data baseline is a 5-layer transformer with 8 attention heads, 512 embedding dimensions and 2048 inner-layer dimensions, which consists of 42M parameters. When training with BT and ST, we use a 6-layer transformer with 8 attention heads, 1024 embedding dimensions and 2048 inner-layer dimensions, resulting in 186M parameters. The detailed hyper-parameter search ranges can be found in Appendix E.


Results. In Table 3, we observe that on the Ne-En task, augmenting the parallel dataset with either forward- or back-translated monolingual data achieves an almost 2 BLEU point improvement over the supervised baseline. On the En-My task, where STDM is more severe (with a value of 0.27, which is similar to α = 0 in Tab. 1), BT outperforms the baseline by 1.9 BLEU, while ST improves by twice as much. This is perhaps not surprising, since source side monolingual data is in-domain and abundant, while target side monolingual data is scarce and out-of-domain. On both tasks, combining ST and BT outperforms each individual method.

7 Conclusions

While the commonly used WMT datasets exhibit mild STDM, we find that less curated datasets, often in more distant and lower resource language pairs (§4.2), exhibit much stronger STDM. How can these findings inform us on how to better set up an MT system in practice? Our first recommendation is to be aware of possible STDM, and (i) check whether origin language information is available. If it is, then it may be possible to (ii) quantitatively measure STDM as described in §4. Next, (iii) be aware that when STDM is severe (the STDM score is low), BT performance suffers (Fig. 3). However, (iv) we may be able to combat this by increasing the amount of target side (out-of-domain) monolingual data (Fig. 4) and (v) by combining BT with ST (Fig. 3).

Of course, the relative ratio of monolingual data on the source and target side and the actual degradation brought by STDM depend on the particular language pair. The more distant the two languages, the more difficult the learning task and the more data is needed to learn it. And finally, the less parallel data there is, the more monolingual data will be needed to compensate. The intricate dependency between all these factors merits future investigation.

References

Mona Baker. 1993. Corpus linguistics and translation studies: Implications and applications. Text and technology: In honour of John Sinclair, 233:250.

Silvia Bernardini and Federico Zanettin. 2004. When is a universal not a universal. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: do they exist?, pages 51–62. John Benjamins.

Ondrej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proc. of WMT.

Jordan Boyd-Graber and David M. Blei. 2009. Multilingual topic models for unaligned text. In Conference on Uncertainty in Artificial Intelligence.

David Britain. 2013. Space, diffusion and mobility. Chapter 22 in J.K. Chambers and Natalie Schilling, editors. Wiley.

Franck Burlot and François Yvon. 2018. Using monolingual data in neural machine translation: a systematic study. In Empirical Methods in Natural Language Processing.

Isaac Caswell, Ciprian Chelba, and David Grangier. 2019. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 53–63, Florence, Italy. Association for Computational Linguistics.

Peng-Jen Chen, Jiajun Shen, Matt Le, Vishrav Chaudhary, Ahmed El-Kishky, Guillaume Wenzek, Myle Ott, and Marc’Aurelio Ranzato. 2019. Facebook AI's WAT19 Myanmar-English translation task submission. In Workshop on Asian Translation.

Chenchen Ding, Hnin Thu Zar Aye, Win Pa Pa, Khin Thandar Nwet, Khin Mar Soe, Masao Utiyama, and Eiichiro Sumita. 2019. Towards Burmese (Myanmar) morphological analysis: Syllable-based tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 19(1):5.

Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. 2018. NOVA: A feasible and flexible annotation system for joint tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 18(2):17.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Conference on Empirical Methods in Natural Language Processing.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Annual Meeting of the Association for Computational Linguistics.

John Rupert Firth. 1935. On sociological linguistics. Transactions of the Royal Society, pages 67–69.

Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In The 17th International Conference on Computational Linguistics.

E.D. Gutierrez, Ekaterina Shutova, Patricia Lichtenstein, Gerard de Melo, and Luca Gilardi. 2016. Detecting cross-cultural differences using a multilingual topic model. In Conference of the Association for Computational Linguistics.

Shudong Hao and Michael J. Paul. 2018. Learning multilingual topics from incomparable corpora. In International Conference on Computational Linguistics (COLING).

Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio Ranzato. 2020. Revisiting self-training for neural sequence generation. In International Conference on Learning Representations.

Ann Irvine and Chris Callison-Burch. 2013. Combining bilingual and comparable corpora for low resource machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 262–270.

Barbara Johnstone. 2010. Language and place. In R. Mesthrie and W. Wolfram, editors, Cambridge Handbook of Sociolinguistics. Cambridge University Press.

Adam Kilgarriff and Tony Rose. 1998. Measures for corpus similarity and homogeneity. In Conference on Empirical Methods in Natural Language Processing.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. In Empirical Methods in Natural Language Processing (EMNLP).

Geoffrey Leech and Roger Fallon. 1992. Computer corpora: What do they tell us about culture? ICAME Journal: Computers in English Linguistics, (16).

Bill Yuchen Lin, Frank F. Xu, Kenny Q. Zhu, and Seung-won Hwang. 2018. Mining cross-cultural differences and similarities in social media. In Conference of the Association for Computational Linguistics.

P. Lison and J. Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In 10th International Conference on Language Resources and Evaluation (LREC).

Qiaozhu Mei, Chao Liu, and Hang Su. 2006. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In WWW.

Paul Michel and Graham Neubig. 2018. MTNT: A testbed for machine translation of noisy text. In Conference on Empirical Methods in Natural Language Processing.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Conference on Empirical Methods in Natural Language Processing.

D.S. Munteanu, A. Fraser, and D. Marcu. 2004. Improved machine translation performance via parallel sentence extraction from comparable corpora. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2018. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning.

K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 86–96.

Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. In Conference of the Association for Computational Linguistics (ACL).

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition.

Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Introducing the Asian Language Treebank (ALT). In LREC.

Gideon Toury. 2012. Descriptive translation studies and beyond: Revised edition. John Benjamins Publishing.

Nicola Ueffing. 2006. Using monolingual source-language data to improve MT performance. In IWSLT.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NIPS.

Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2017. Instance weighting for neural machine translation domain adaptation. In Conference on Empirical Methods in Natural Language Processing.

Marlies van der Wees, Arianna Bisazza, and Christof Monz. 2017. Dynamic data selection for neural machine translation. In Conference on Empirical Methods in Natural Language Processing.

Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. 2019. A multilingual topic model for learning weighted topic links across corpora with low comparability. In Conference on Empirical Methods in Natural Language Processing.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Annual Meeting of the Association for Computational Linguistics.

Michelle Yuan, Benjamin Van Durme, and Jordan Boyd-Graber. 2018. Multilingual anchoring: Interactive topic modeling and alignment across languages. In Neural Information Processing Systems.

Mike Zhang and Antonio Toral. 2019. The effect of translationese in machine translation test sets. arXiv, abs/1906.08069.

Renjie Zheng, Hairong Liu, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019. Robust machine translation with domain sensitive pseudo-sources: Baidu-OSU WMT19 MT robustness shared task system report. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 559–564, Florence, Italy. Association for Computational Linguistics.


A Topic Classifiers

In this section we provide the experimental details of the findings reported in §3.1.

Dataset We use the WIKISHORT dataset provided by Yuan et al. (2018)³. The dataset contains thousands of English and Chinese Wikipedia articles, and each article is labeled with one of six categories: film, music, animals, politics, religion, and food. We use the train split in the dataset for training and validate with the test split. The train split includes 7730 English and 7095 Chinese articles, respectively. To make the classifier better differentiate articles that do not belong to one of the six categories, we uniformly sample 1500 Wikipedia articles from both the English and Chinese wikidumps⁴ and add them to the training data with an 'other' label.

Preprocessing WIKISHORT provides shortened text of Wikipedia articles. To generate the shortened text of articles with the 'other' label, we use the text from the first paragraph of the articles and follow similar preprocessing as in Yuan et al. (2018). For English articles, we lowercase the paragraph, tokenize by word, filter out stop words, lemmatize the words and remove all punctuation. For Chinese articles, we apply the Stanford CoreNLP tokenizer⁵ to segment the Chinese text.

Training We train two multinomial logistic regression models, one for English and one for Chinese. To extract features from the shortened text, we compute TF-IDF weights from the training data. We use the LogisticRegression implementation of SCIKIT-LEARN (Pedregosa et al., 2011) with default settings to train the models. We tune the regularization term C on the validation set. However, we do not observe significant accuracy differences between different C values, so we use the default setting C = 1.0. The accuracies of the English and Chinese classifiers on the test set are 92.5% and 77.9%, respectively.
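A minimal sketch of this training recipe (texts and labels are the preprocessed shortened articles and their categories, including 'other'; max_iter is raised for convergence, otherwise scikit-learn defaults):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_topic_classifier(texts, labels):
    # TF-IDF features followed by multinomial logistic regression with C=1.0.
    clf = make_pipeline(TfidfVectorizer(),
                        LogisticRegression(C=1.0, max_iter=1000))
    clf.fit(texts, labels)
    return clf
```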

Prediction on Wikipedia articles To estimate the category distribution of English and Chinese Wikipedia articles, we uniformly sampled 8500 articles from the wikidump for each language, followed the same preprocessing and feature extraction as for the training data, and used the classifiers to predict the category of each article. After removing articles labeled "other", we re-normalize the distribution of the predictions over the main six categories and report the values in Fig. 2.

³https://github.com/forest-snow/mtanchor_demo#data
⁴https://dumps.wikimedia.org/
⁵https://stanfordnlp.github.io/stanza/tokenize.html


B Computing STDM score

In this section we provide the experimental details of the STDM evaluations performed in §4.1 and §4.2.

Dataset In §4.1, we described the details for constructing datasets from EuroParl and OpenSubtitles with different amounts of STDM.

For the experiments in §4.2, we evaluate the STDM score on datasets with known language origins. We use three different data sources: WMT newstest sets, MTNT (Michel and Neubig, 2018) and a social media platform (SMD). For the WMT newstest sets, we combine the datasets from 2014 to 2019 and sample 5000 sentences for each language. For the analysis on the MTNT dataset, we use the train split and sample 5000 sentences for each language. For SMD, 20000 sentences are sampled for each language.

Preprocessing We use SentencePiece (Kudo and Richardson, 2018) to learn a BPE vocabulary of size 10000 over the combined English text corpus of all datasets. We preprocess sentences from all datasets with BPE and remove sentences with fewer than 10 BPE tokens.
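A sketch of this step with the sentencepiece Python bindings (file names are placeholders):

```python
import sentencepiece as spm

# Learn a BPE vocabulary of size 10000 over the combined English corpus.
spm.SentencePieceTrainer.train(input="combined_english.txt",
                               model_prefix="bpe",
                               vocab_size=10000,
                               model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="bpe.model")

sentences = open("dataset.txt", encoding="utf-8").read().splitlines()
encoded = [sp.encode(s, out_type=str) for s in sentences]
kept = [p for p in encoded if len(p) >= 10]  # drop sentences with fewer than 10 BPE tokens
```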

Topic Learning and STDM Score Computation  For each dataset, we derive the TF-IDF matrix from the preprocessed sentences and perform an SVD decomposition of the matrix. We retain the top 400 singular values and collect the corresponding topic representations of the original sentences. These topic representations are then used to calculate the STDM score as described in §4.
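A sketch of the topic-learning step with scikit-learn follows; the input path is a placeholder, and the final score computation itself follows §4:

# Derive per-sentence topic representations via TF-IDF + truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Load BPE-preprocessed sentences of one dataset (placeholder path).
sentences = open("dataset.bpe.txt").read().splitlines()

tfidf = TfidfVectorizer().fit_transform(sentences)
svd = TruncatedSVD(n_components=400)   # keep the top 400 singular values
topics = svd.fit_transform(tfidf)      # one 400-dim topic vector per sentence
# `topics` feeds the STDM score computation described in §4.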

C The Effect of Translationese

In §3 we made the assumption that the effect of translationese is negligible when estimating STDM. However, previous studies show clear artifacts in (human) translations (Baker, 1993; Zhang and Toral, 2019; Toury, 2012). In this section we assess whether our STDM score is affected by translationese.

We consider the WMT'17 De-En dataset from Ott et al. (2018), which contains double translations of source- and target-originating sentences.


From this, we construct paired inputs and labels, $\{(h_{s\to t}(h_{t\to s}(\mathrm{tgt}(z_t))), 1)\} \cup \{(\mathrm{tgt}(z_t), 0)\}$, and train two classifiers to predict whether or not the input is translationese. The first classifier takes as input a TF-IDF representation $w$ of the sentence, while the second classifier takes only the corresponding topic distribution, $Vw$. On this binary task, a linear classifier achieves 58% accuracy on the test set with TF-IDF input representations, and only 52% when given just the topic distribution. If we apply the same binary classifier in the topic space to discriminate between sentences originating in the source and target domain ($\mathrm{tgt}(z_t)$ vs. $h_{s\to t}(\mathrm{src}(z_s))$), the accuracy increases to 64%.

We conclude that once we control for the domain effect (by discriminating the same set of sentences in their original form versus their double-translationese form), the accuracy is much lower than previously reported (Zhang and Toral, 2019), and working in the topic space further removes translationese artifacts. Therefore, the STDM score computed in the topic space is unlikely to be affected by such artifacts and captures the desired discrepancy between the source and the target domains.
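The following sketch shows the two classifiers; the feature matrices and labels are placeholders loaded from disk (a dense TF-IDF matrix $w$, its topic projection $Vw$, and the translationese labels):

# Train the two translationese classifiers: one on TF-IDF features,
# one on the topic representations (all three inputs are placeholders).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tfidf = np.load("tfidf.npy")    # TF-IDF representations w
topics = np.load("topics.npy")  # topic representations V w
labels = np.load("labels.npy")  # 1 = double translationese, 0 = original

X_tr, X_te, t_tr, t_te, y_tr, y_te = train_test_split(
    tfidf, topics, labels, test_size=0.2, random_state=0)

# Linear classifier on raw TF-IDF inputs (~58% test accuracy in our setup).
acc_tfidf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
# Same classifier restricted to the topic space (~52%).
acc_topic = LogisticRegression(max_iter=1000).fit(t_tr, y_tr).score(t_te, y_te)
print(acc_tfidf, acc_topic)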

D Self-Training

1: Data: given a parallel dataset $P$ and a source monolingual dataset $M_s$ with $N_s$ examples;
2: Noise: let $n(x)$ be a function that adds noise to the input by dropping, swapping and blanking words;
3: Hyper-params: let $k$ be the number of iterations and $A_1 < \dots < A_k \leq N_s$ be the number of samples to add at each iteration;
4: train a forward model: $\overrightarrow{\theta} = \arg\max_\theta \mathbb{E}_{(x,y)\sim P} \log p(y|x; \theta)$;
5: for $t$ in $[1 \dots k]$ do
6:     forward-translate data: $(y_s, v) \approx \arg\max_z p(z|x_s; \overrightarrow{\theta})$ for $x_s \in M_s$, where $v$ is the model score;
7:     let $\tilde{M}_s \subset M_s$ contain the top-$A_t$ highest scoring examples according to $v$;
8:     re-train the forward model: $\overrightarrow{\theta} = \arg\max_\theta \mathbb{E}_{(x,y)\sim Q} \log p(y|x; \theta)$ with $Q = P \cup \{(n(x_s), y_s)\}_{x_s \sim \tilde{M}_s}$.
end

Algorithm 1: Self-Training algorithm.

Alg. 1 describes self-training, a data augmentation method that leverages $M_s$, using the notation of §5. First, a baseline forward model is trained on the parallel data (line 4). Second, this initial model is applied to the source monolingual data (line 6). Finally, the forward model is re-trained from random initialization by augmenting the original parallel dataset with the forward-translated data. As with BT, the parallel dataset receives more weight in the loss. In order to increase robustness to prediction mistakes, we make the algorithm iterative and add only the examples for which the model is most confident (line 3, the loop starting at line 5, and line 7). In our experiments we iterate three times. We also inject noise into the input sentences, in the form of word swaps and drops (Lample et al., 2018), to further improve generalization (line 8).
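To make the procedure concrete, the following Python sketch mirrors Alg. 1; `train` and `translate` are hypothetical helpers standing in for fitting an MT model and for decoding one source sentence together with its model score, and the noise function approximates the drop/blank/shuffle scheme of Lample et al. (2018):

# Schematic sketch of Alg. 1 (the `train` and `translate` helpers are
# hypothetical stand-ins for model fitting and scored decoding).
import random

def noise(tokens, p_drop=0.1, p_blank=0.1, k=3):
    # Word dropout, word blanking, and a local shuffle within a window of k.
    out = [t for t in tokens if random.random() > p_drop]
    out = [t if random.random() > p_blank else "<blank>" for t in out]
    keys = [i + random.uniform(0, k) for i in range(len(out))]
    return [t for _, t in sorted(zip(keys, out))]

def self_train(parallel, mono_src, sizes, train, translate):
    model = train(parallel)                                    # line 4
    for size in sizes:                                         # lines 3 and 5
        scored = [translate(model, x) + (x,) for x in mono_src]  # line 6
        scored.sort(key=lambda s: -s[1])                       # line 7: by score
        synth = [(noise(x), y) for y, v, x in scored[:size]]   # keep top-A_t
        model = train(parallel + synth)                        # line 8
    return model

The sketch omits the upweighting of the parallel data in the loss, which in practice is implemented by upsampling $P$ (see the upsampling ratios in Appendix E).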

E Hyper-parameters Used in MT Experiments

In this section, we report the hyper-parameters used in Sec. 6.

In all our experiments we follow the standard machine learning methodology of cross-validation to select hyper-parameters. We train models with several random combinations of hyper-parameters and select the best configuration based on performance on the validation set. We finally report results on the test set.

Data  The full list of datasets considered in this work, together with some basic statistics, is reported in Tab. 4.

We jointly learn BPEs on the source and target languages. First, we train an MT system on the parallel data for each BPE vocabulary size in the set {3000, 5000, 10000, 20000}. Then, we select the number of BPE tokens that yields the best performance on the validation set. We report below the value that worked best in each experiment.

Loss and Optimizer  All models are trained using the cross-entropy loss with label smoothing (Szegedy et al., 2016) equal to 0.2. The optimizer is Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, and 4000 warm-up steps. We share embeddings across encoder and decoder, and between the input and output lookup tables. We use a fixed batch size of 4000 tokens per GPU and train with 4 GPUs in fp16. Other optimization hyper-parameters are reported below.
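A minimal PyTorch sketch of this optimization setup follows; the model is a stand-in, and the inverse square-root decay after warm-up is our assumption of the schedule shape (the standard transformer schedule):

# Optimizer setup: Adam(0.9, 0.98), 4000 warm-up steps, label smoothing 0.2.
import torch

model = torch.nn.Linear(512, 10000)  # stand-in for the transformer

criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.2)  # PyTorch >= 1.10
optimizer = torch.optim.Adam(model.parameters(), lr=7e-4, betas=(0.9, 0.98))

def lr_factor(step, warmup=4000):
    # Linear warm-up for 4000 steps, then inverse square-root decay
    # (assumed schedule); the factor peaks at 1.0 at the end of warm-up.
    step = max(step, 1)
    return min(step ** -0.5, step * warmup ** -1.5) * warmup ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)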

Model Architecture and Training  To decide the model architecture hyper-parameters for different amounts of parallel and monolingual data, we use random search to find the setting that yields the best performance on the validation set. The ranges of values used in our random hyper-parameter search are:


MT Experiments | Language Pair | BPE size | Origin | Parallel data: Domain | # sents (train/valid/test) | Monolingual data: Domain | # sents
Controlled Setting | Fr->En | 5000 | source | Europarl | 10K / 10K / 10K | Europarl | 900K
 | | | target | OpenSubtitles | 10K / 10K / 10K | OpenSubtitles | 900K
Low-Resource MT | Ne->En | 5000 | source | Social Media | 40K / 5K / 5K | Social Media | 1.8M
 | | | target | Social Media | 7.5K / 5K / 5K | Social Media | 1.8M
 | En-My | 10000 | source | ALT (News) | 18K / 1K / 1K | NewsCrawl | 5M
 | | | target | - | - | CommonCrawl | 100K

Table 4: Datasets used for the machine translation experiments of Sec. 6.

• Layers: {4, 5, 6}

• Embedding dimension: {256, 512, 1024}

• Inner-layer dimension: {512, 1024, 2048, 4096}

• Attention Heads: {2, 4, 8, 16}

We report the model architecture of each experiment in §E.1.

For each data point, which corresponds to a particular combination of $(P, M_s, S_t, \alpha, \beta, \text{training procedure})$, we use random search to sweep over hyper-parameters. The hyper-parameters are the dropout rate, the learning rate, the source-side noise level for ST experiments, and the upsampling ratio between parallel, back-translated and self-translated data. For all experiments, the dropout rate takes values in {0.1, 0.2, 0.3, 0.4, 0.5}, and the learning rate takes values in {0.0007, 0.001, 0.003, 0.005}.
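A sketch of this random search follows; `evaluate` is a hypothetical callable returning the validation score of one configuration:

# Random search over the hyper-parameter grids reported above.
import random

GRID = {
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "lr": [0.0007, 0.001, 0.003, 0.005],
    "upsample_ratio": list(range(1, 9)),
}

def random_search(evaluate, n_trials=20):
    # `evaluate` is a hypothetical callable returning validation BLEU.
    configs = [{k: random.choice(v) for k, v in GRID.items()}
               for _ in range(n_trials)]
    return max(configs, key=evaluate)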

E.1 Hyper-parameters for Controlled Setting Experiments

We use the same BPEs for all the experiments in the controlled setting of §6.1. The BPE size is 5000 and the vocabulary is shared between English and French.

Varying amount of STDM.  We use embedding dimension 256, inner-layer dimension 512, and 2 attention heads. We sweep the upsampling ratio of parallel data, back-translated data and self-training data in a range between 1 and 8.

We train the ST system with 2 iterations (k = 2), where A1 = 30K and A2 = 100K. In each iteration, we use random search to sweep over different source-side noise levels, and we report the model with the best performance on the validation set. The values of input noise are as follows: word shuffling {0, 2, 3}, word dropout {0.0, 0.1, 0.2}, word blanking {0.0, 0.1, 0.2}.

Varying amount of monolingual data.  When the monolingual data has fewer than 300K sentences, we use embedding dimension 256, inner-layer dimension 512 and 2 attention heads. For bigger monolingual datasets, we use embedding dimension 1024, inner-layer dimension 2048 and 8 attention heads. We sweep the upsampling ratio of parallel data, back-translated data and self-training data in a range between 1 and 8.

We train the ST system with 5 iterations (k = 5), where A1 = 10K, A2 = 30K, A3 = 100K, A4 = 300K and A5 = 900K. The values of input noise are as follows: word shuffling {0, 2, 3}, word dropout {0.0, 0.1, 0.2}, word blanking {0.0, 0.1, 0.2}.

Varying amount of in-domain data.  When training only with parallel data, we use embedding dimension 256, inner-layer dimension 512, and 2 attention heads. For models trained with BT only and with parallel + BT, instead, we use embedding dimension 1024, inner-layer dimension 2048 and 8 attention heads. We sweep the parallel data upsampling ratio in a range between 1 and 16.

E.2 Hyper-parameters for Low-Resource MT Experiments

We use a joint BPE tokenization with 5000 tokens for Ne→En and 10000 tokens for En→My.

On both datasets (Ne→En and En→My), the baseline trained only on the parallel dataset is a 5-layer transformer model with 512 embedding dimensions, 2048 inner-layer dimensions and 8 attention heads. For models trained with BT and ST, we use a 6-layer transformer with 1024 embedding dimensions, 2048 inner-layer dimensions and 8 attention heads. We sweep the parallel data and ST data ratio between 1 and 32.

For both language pairs, we train the ST systems with 3 iterations (k = 3). For Ne→En, we use A1 = 600K, A2 = 1M and A3 = 1.8M. For En→My, we use A1 = 1M, A2 = 3M and A3 = 5M. The values of input noise are as follows: word shuffling {0, 2, 3}, word dropout {0.0, 0.1, 0.2}, word blanking {0.0, 0.1, 0.2}.

