
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation

Bryan Eikema, University of Amsterdam, [email protected]

Wilker Aziz, University of Amsterdam, [email protected]

Abstract

Recent studies have revealed a number of pathologies of neural machine translation (NMT) systems. Hypotheses explaining these mostly suggest that there is something fundamentally wrong with NMT as a model or its training algorithm, maximum likelihood estimation (MLE). Most of this evidence was gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at identifying the highest-scoring translation, i.e. the mode, under the model distribution. We argue that the evidence corroborates the inadequacy of MAP decoding more than casts doubt on the model and its training algorithm. In this work, we criticise NMT models probabilistically showing that stochastic samples following the model’s own generative story do reproduce various statistics of the training data well, but that it is beam search that strays from such statistics. We show that some of the known pathologies of NMT are due to MAP decoding and not to NMT’s statistical assumptions nor MLE. In particular, we show that the most likely translations under the model accumulate so little probability mass that the mode can be considered essentially arbitrary. We therefore advocate for the use of decision rules that take into account statistics gathered from the model distribution holistically. As a proof of concept we show that a straightforward implementation of minimum Bayes risk decoding gives good results outperforming beam search using as little as 30 samples, confirming that MLE-trained NMT models do capture important aspects of translation well in expectation.

1 Introduction

Recent findings in neural machine translation (NMT) suggest that modern translation systems have some serious flaws from a modelling perspective. This is based on observations such as: i) translations produced via beam search typically under-estimate sequence length (Sountsov and Sarawagi, 2016; Koehn and Knowles, 2017), the length bias; ii) translation quality generally deteriorates with better approximate search (Koehn and Knowles, 2017; Murray and Chiang, 2018; Ott et al., 2018; Kumar and Sarawagi, 2019), the beam search curse; iii) the true most likely translation under the model (i.e., the mode of the distribution) is empty in many cases (Stahlberg and Byrne, 2019) and a general negative correlation exists between likelihood and quality beyond a certain likelihood value (Ott et al., 2018), we call this the inadequacy of the mode problem.

A number of partly overlapping hypotheses have been formulated to explain these observations. They mostly suggest there is something fundamentally wrong with NMT as a model (i.e., its factorisation as a product of locally normalised distributions) or its most popular training algorithm (i.e., regularised maximum likelihood estimation, MLE for short). As we shall see these explanations make an unspoken assumption, namely, that identifying the mode of the distribution, also referred to as maximum a posteriori (MAP) decoding (Smith, 2011), is in some sense the obvious decision rule for predictions. While this assumption makes intuitive sense and works well in unstructured classification problems, it is hardly justified in NMT, where the most likely translations are roughly as likely as one another and together account for very little probability mass, a claim we shall defend conceptually and provide evidence for in experiments.

Unless the model distribution is extremely peaked about the mode for every plausible input, criticising the model in terms of properties of its mode can at best say something about the adequacy of MAP decoding. Unfortunately, as previous research has pointed out, this is seldom the case (Ott et al., 2018). Thus, pathologies about the mode cannot be unambiguously ascribed to NMT as a model (its factorisation) nor to MLE, and inadequacies about the mode cannot rule out the possibility that the model captures important aspects of translation well in expectation.

In this work, we criticise NMT models as probability distributions estimated via MLE in various settings: we vary language pairs (en-ne, en-si), translation direction, and the testing regime (in- and out-of-domain). We observe that MLE succeeds in representing statistics of the data well in expectation, and that some length and lexical biases are introduced by approximate MAP decoding (i.e., beam search). We demonstrate that translations produced via beam search are unreliable for they correspond to ‘rare events’, which is due to the model spreading probability mass over a large set of translations, especially so when test data stray from the domain used for training. This spread is most likely ascribed to NMT as a model since its factorisation assigns probability mass to an unbounded set. The empty string too is an event that has very little support under the model. Thus, despite it often being the mode, it is not a frequent problem. Finally, we observe that the set of likely samples obtained by stochastically following the model’s own generative story is of reasonable quality, as assessed by automatic evaluation metrics, and quality measurements do not exhibit large variance. This observation suggests we should base decisions on statistics gathered from the distribution holistically. One such decision rule is minimum Bayes risk (MBR) decoding (Bickel and Doksum, 1977; Kumar and Byrne, 2004). We show that a straightforward implementation of MBR leads to consistent improvements across testing conditions compared to beam search. In particular, MBR proves especially useful where NMT models are more uncertain, for example, on out-of-domain inputs. This also makes intuitive sense as the more uncertain the model becomes, the more arbitrary the mode of the distribution is.

To summarise our contributions: we criticise NMT models probabilistically showing that i) ancestral samples reproduce various statistics of the training data well concerning length, lexical and word order aspects of the data; we also show that ii) MAP decoding strays away from statistics of the training data; we show that iii) some, but not all, of the known pathologies of NMT are due to MAP decoding and not to NMT’s statistical assumptions or MLE, and that iv) MLE-trained NMT captures important aspects of translation well in expectation, motivating the exploration of sampling-based decision rules.

2 Observed Problems in NMT

Many studies have found that NMT suffers from a length bias: NMT underestimates length which hurts the adequacy of translations. Cho et al. (2014a) provide the first thorough account of this problem, finding that NMT systematically degrades in performance for longer sequences. Sountsov and Sarawagi (2016) identify the same bias in a chat suggestion task and argue that sequence to sequence models underestimate the margin between correct and incorrect translations due to local normalisation. Later studies have also confirmed the existence of this bias in NMT (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019; Kumar and Sarawagi, 2019).

Notably, all these studies are in the context of doing beam search decoding. In fact, some studies have linked the length bias to the beam search curse: the observation that large beam sizes hurt performance in NMT (Koehn and Knowles, 2017). Sountsov and Sarawagi (2016) already noted that larger beam sizes exacerbate the length bias. Later studies have also confirmed the same connection (Blain et al., 2017; Murray and Chiang, 2018; Yang et al., 2018; Kumar and Sarawagi, 2019). Murray and Chiang (2018) attribute both problems to local normalisation which they claim introduces label bias (Bottou, 1991; Lafferty et al., 2001) to NMT. Yang et al. (2018) show that model likelihood negatively correlates with translation length. These findings suggest that the mode suffers from length bias, likely thereby failing to sufficiently account for adequacy. In fact, Stahlberg and Byrne (2019) show that the mode indeed greatly suffers from length bias and that oftentimes the mode even is the empty sequence.

However, the connection with the length bias is not the only reason for the beam search curse. Ott et al. (2018) find that the presence of copies in the training data can cause the model to assign too much probability mass to copies of the input, and that with larger beam sizes this copying behaviour becomes more frequent.

Cohen and Beck (2019) show that translations obtained with larger beam sizes often consist of an unlikely prefix with an almost deterministic suffix. They empirically find such translations to be of lower quality. In open-ended generation, Zhang et al. (2020) correlate model likelihood with human judgements for a fixed sequence length, thus eliminating any possible length bias issues. They find that likelihood generally correlates positively with human judgements, up until an inflection point, after which the correlation becomes negative. This observation was also made in a translation context with BLEU rather than human judgements (Ott et al., 2018). We call this general failure of the mode to represent good translations in NMT the inadequacy of the mode problem. Yet we note that all these problems are very connected.

3 NMT and its Many Biases

Machine translation systems are trained on pairs of semantically equivalent sentences drawn from a bilingual parallel corpus. Each pair consists of a sequence x in the source language and a sequence y in the target language of length |y|. Most NMT models are conditional models,1 that is, only the target sentence is given random treatment. Target words are drawn in sequence from a product of locally normalised Categorical distributions without Markov assumptions (Sutskever et al., 2014; Cho et al., 2014b; Bahdanau et al., 2015):

p(y|x, \theta) = \prod_{j=1}^{|y|} \mathrm{Cat}\left(y_j \mid f(x, y_{<j}; \theta)\right) .   (1)

At each step, a neural network architecture f(·; θ) maps from the source sequence x and the prefix sequence y_{<j} (empty for y_1) to the parameters of a categorical distribution over the vocabulary of the target language.
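
The factorisation in Equation 1 can be made concrete with a short sketch. The snippet below is a minimal PyTorch illustration rather than the implementation used in this paper: it computes log p(y|x, θ) by summing per-step log-probabilities, where step_logits stands for the stacked outputs f(x, y_{<j}; θ) and is an assumed interface.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(step_logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """log p(y|x, theta) under Eq. (1): sum_j log Cat(y_j | f(x, y_<j; theta)).

    step_logits: [|y|, vocab_size] model outputs, one row per target position.
    y:           [|y|] target token ids.
    """
    log_probs = F.log_softmax(step_logits, dim=-1)       # normalise each step
    return log_probs.gather(-1, y.unsqueeze(-1)).sum()   # pick log p(y_j | ...) and sum
```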

NMT models are typically trained via regularised maximum likelihood estimation, which we refer to as MLE for short, where we search for the parameter θ_MLE that assigns maximum (regularised) likelihood to a dataset of observations D:

\theta_{\mathrm{MLE}} = \arg\max_{\theta} \; -\lambda R(\theta) + \sum_{(x,y) \in \mathcal{D}} \log p(y|x, \theta) .   (2)

Regularisation techniques include weight decay, where R(θ) is an L2 penalty on the parameters of the network, and dropout (Srivastava et al., 2014). A local optimum of the objective is obtained by a stochastic gradient-based optimisation algorithm (Robbins and Monro, 1951; Bottou and Cun, 2004), where gradients are computed automatically via backpropagation.
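
As an illustration, one stochastic update of the objective in Equation 2 might look as follows; this is a sketch only, assuming a hypothetical model(x, y) callable that returns the sentence log-likelihood log p(y|x, θ) and an optimiser whose weight decay supplies λR(θ).

```python
import torch

def mle_step(model, optimiser, batch):
    """One stochastic gradient step on the (regularised) MLE objective of Eq. (2)."""
    optimiser.zero_grad()
    # Negative log-likelihood of the mini-batch; weight decay configured in the
    # optimiser plays the role of the L2 penalty lambda * R(theta).
    loss = -torch.stack([model(x, y) for x, y in batch]).mean()
    loss.backward()      # gradients via backpropagation
    optimiser.step()     # stochastic gradient-based update
    return loss.item()
```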

For a trained model with parameters θ_MLE and a given input x, a translation is predicted by searching for the sequence y* that maximises log p(y|x, θ_MLE), the mode of the model distribution. This is a decision rule also known under the name maximum a posteriori (MAP) decoding (Smith, 2011). Exact MAP decoding is generally not tractable due to the unbounded and exponentially growing search space and the model’s factorisation without independence assumptions. Therefore, the beam search algorithm (Boulanger-Lewandowski et al., 2012; Sutskever et al., 2014) is employed as a viable approximation.
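
For reference, a bare-bones version of this approximation could look like the sketch below; it assumes a hypothetical next_token_logits(x, prefix) interface returning f(x, y_{<j}; θ) and omits refinements such as length penalties that practical systems (including ours) typically add.

```python
import torch.nn.functional as F

def beam_search(next_token_logits, x, eos_id, beam_size=5, max_len=256):
    """Approximate MAP decoding: keep the `beam_size` highest-scoring prefixes."""
    beams = [([], 0.0)]                                   # (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos_id:           # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            log_p = F.log_softmax(next_token_logits(x, prefix), dim=-1)
            top_lp, top_ids = log_p.topk(beam_size)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((prefix + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p and p[-1] == eos_id for p, _ in beams):  # all hypotheses finished
            break
    return beams[0][0]                                    # highest-scoring translation found
```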

It has been said that NMT suffers from a number of biases due to certain design decisions, ranging from its factorisation to parameter estimation to search strategy. We review those biases here, also establishing some connections with other structure prediction problems. We then discuss in Section 4 one bias that has received very little attention and which, we argue, underlies many biases in NMT and explains some of the inadequacies discussed in Section 2.

1Though fully generative accounts do exist (Shah and Barber, 2018; Eikema and Aziz, 2019).

Exposure bias MLE prescribes that model parameters be estimated conditioned on (ground-truth) observations sampled from the training data. Clearly, the ground-truth is not available at test time, when predictions are formed by searching through the model distribution. This mismatch in training distribution (i.e., samples from the data) and test distribution (i.e., samples from the model), known as exposure bias (Ranzato et al., 2016), has been linked to many of the pathologies of NMT and has motivated modifications or alternatives to MLE aimed at exposing the model to its own predictions already during training (Bengio et al., 2015; Ranzato et al., 2016; Shen et al., 2016; Wiseman and Rush, 2016; Zhang et al., 2019). While exposure bias has been a point of critique mostly against MLE, we remark that MAP decoding likely exacerbates its effects as we shall see.

Non-admissible heuristic search bias In beam search partial translations are ranked in terms of log-likelihood without regards to (or with crude approximations of) their future score, which may lead to good translations falling off the beam too early. In the context of pathfinding algorithms (Hart et al., 1968), this corresponds to searching with a non-admissible heuristic, that is, a heuristic that may underestimate the likelihood of completing partial translations. This biased search accounts for some of the pathologies of Section 2 and has motivated variants of the algorithm aimed at comparing partial translations more fairly (Huang et al., 2017; Shu and Nakayama, 2018). In parsing literature this is known as imbalanced probability search bias (Stanojevic and Steedman, 2020).

Label bias Where a conditional model makes independence assumptions regarding its inputs, local normalisation prevents the model from revising its decisions leading to what is known as the label bias problem (Bottou, 1991; Lafferty et al., 2001). This is a model specification problem which limits the class of probability distributions that a model can represent (Andor et al., 2016). In a nutshell, the problem concerns locally normalised models where access to deterministic inputs (i.e., the variables a model conditions on, but which are not generated, and thus not scored, by the model itself) is limited. While this is clearly the case in incremental parsing (Stern et al., 2017; Stanojevic and Steedman, 2020) and simultaneous translation (Gu et al., 2017), where inputs are incrementally made available for conditioning, this is not the case in standard NMT (Sountsov and Sarawagi, 2016, Section 5), where the input is available for conditioning in all generation steps. While one cannot exclude the possibility that local normalisation affects the kind of local optima we find in NMT, that is an issue orthogonal to label bias.

4 Biased Statistics and the Inadequacy of the Mode

Predictions in NMT are traditionally formed by searching for the mode of the distribution. We start by stating a statistical fact, namely, that the mode cannot be used to obtain unbiased statistics under the model distribution. In other words, a critique about the mode is not a statement about the model as a probability distribution, but rather a statement about the model as a means to derive a ranking of translations. Through MLE we learn an entire probability distribution, one over translations of the input sequence. By criticising the model by its mode we neglect a lot of potentially valuable information.

Consider a scenario where the 1,000 most likely outcomes under the model accumulate less than 1% of probability mass, something we will show to often be the case in our experiments. The entire set is rather unlikely: there is 1 chance in a hundred that we will hit the set by sampling from the model and the mode is only 1 of the one thousand ways in which we can hit the set. It is safe to say that the model does not express a clear preference for its mode. If the mode in this case happened to be inadequate, say an empty string, or some other inadequate translation, that would not be surprising. For example, the mode might be quite different from the other outcomes in the set, and more generally, from every other outcome that is roughly as likely under the model. The model never emphasised the mode, our search strategy did. In other words, MAP decoding overstates the importance of the mode. More generally, pathfinding algorithms overstate the importance of the paths they enumerate because they disregard the amount of probability in and outside the sets they cover.

MAP decoding algorithms and their approximations bias the statistics by which we criticise NMT. They restrict our observations about the model to a single or a handful of outcomes which on their own are rather rare. To gain insight about the model as a distribution, it seems more natural to use all of the information available to us, namely, all samples we can afford to collect, and search for frequent patterns in these samples.

Evidence found that way better represents the model and its beliefs.2

At the core of our analysis is the concept of an unbiased sample from the model, which we can obtain easily by ancestral sampling. That is, we iteratively sample from conditional distributions of the form Cat(f(x, y_{<j}; θ)), each time extending the generated prefix y_{<j} with an unbiased draw, until we generate the end-of-sequence symbol. Also note that by following the model’s own generative story, unlike what happens in MAP decoding, we are imitating the procedure by which the model was trained. Only we replace samples from the data by samples from the model, thus shedding light onto the model’s fit. In other words, if ancestral samples cannot reproduce statistics of the data well, we have an instance of poor fit.3 Another consequence of following the model’s own generative story is that there is no incremental search. In other words, ancestral sampling does not need a heuristic future score and thus is not susceptible to the non-admissible heuristic search bias. We remark, however, that ancestral sampling is not a decision rule. That is, predictions based on a single ancestral sample are not expected to outperform MAP decoding (or any other rule). Ancestral samples can be used to perform predictive checks that power diagnostics of model fit, as we do in Section 6, and to approximate alternative decision rules, as we do in Section 7.6.
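
The procedure is simple enough to state in a few lines. The sketch below assumes the same hypothetical next_token_logits(x, prefix) interface as before; it returns one unbiased draw from p(y|x, θ) together with its log-probability.

```python
import torch

def ancestral_sample(next_token_logits, x, eos_id, max_len=256):
    """Draw y ~ p(y|x, theta) by following the model's own generative story."""
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        logits = next_token_logits(x, prefix)                # f(x, y_<j; theta)
        step = torch.distributions.Categorical(logits=logits)
        token = step.sample()                                # unbiased draw for position j
        log_prob += step.log_prob(token).item()
        prefix.append(int(token))
        if int(token) == eos_id:                             # stop at end-of-sequence
            break
    return prefix, log_prob
```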

In sum, we argue that MAP decoding is a source of various problems and that it gives biased conclusions about NMT. In the next sections, we provide empirical evidence for these claims.

5 Data & System

We train our systems on Sinhala-English and Nepali-English. For both of these language pairs very little parallel data is available. We use the validation and test data collected in Guzman et al. (2019) and mimic their data setups for training data as well. Mimicking Guzman et al. (2019) we use Bible data (Christodouloupoulos and Steedman, 2015), GNOME/KDE/Ubuntu documentation and Global Voices (version 2018q4) data from the OPUS repository (Tiedemann, 2012). For English-Nepali we also use a translated version of the Penn Treebank4 and for English-Sinhala we additionally use Open Subtitles (Lison et al., 2018). We use a filtered crawl of Wikipedia and Common Crawl released in Guzman et al. (2019), as well as the released FLORES development and test sets.

As we found that the training data consisted of many duplicate sentence pairs (i.e. sentence pairs that occurred more than once in the data), we removed all sentence pair duplicates from the training data, but left in those where only one side (source or target) of the data is duplicate to allow for paraphrases. As the FLORES dataset is out-of-domain compared to the training data, we also reserve a subset of the training data obtained by using stratified sampling from the individual datasets as held-out validation and test sets. In this process we also removed any sentence that corresponded exactly to either the source or target side of a validation or test sentence from the training data.

We use a Transformer NMT system (Vaswani et al., 2017) using the same hyperparameters and optimisation hyperparameters as the supervised setups in Guzman et al. (2019). We train two identical systems for each language pair: one on all available training data using FLORES validation for early stopping, and one on a slightly smaller amount of training data with a held-out validation set for early stopping and held-out test set for experiments.

A note on label smoothing Label smoothing (Szegedy et al., 2016) is a modification to MLE where we promote some amount of probability mass to be uniformly distributed across every class competing with the observation. We found that label smoothing only improved performance on out-of-domain data using beam search (1.3 BLEU on average across language pairs and translation directions). On held-out in-domain data there was no clear benefit, in some cases hurting performance (−0.1 BLEU on average).

2As we see it, criticising NMT models through the lens of MAP decoding is not a problem per se. Where no other decision rule is viable, it makes sense to criticise NMT’s suitability to MAP decoding. That said, where we seek changes to NMT’s statistical assumptions, it makes sense to criticise the model with unbiased statistics.

3We prefer thinking of it as a case of bad fit, rather than exposure bias, since exposure bias seems to be acknowledged when there is a greater shift in distribution, such as due to mode-seeking search and its approximations.

4http://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm

We thus disable label smoothing in all our experiments. From the point of view of our analysis, the main reason for disabling it, however, is that it is not an MLE procedure and has too strong an impact on the distribution that a model learns (Muller et al., 2019). We found it to severely (and negatively) impact the fit of the model and greatly hurt performance on any sampling-based metric. Similar findings for sampling performance have been reported by Graca et al. (2019).
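
To make the modification concrete, a generic label-smoothed loss mixes the usual negative log-likelihood with a uniform term over the vocabulary; the sketch below is a common formulation with an assumed smoothing weight, not necessarily the exact variant in the toolkits we use.

```python
import torch
import torch.nn.functional as F

def label_smoothed_loss(logits, target, epsilon=0.1):
    """Cross-entropy with label smoothing (Szegedy et al., 2016).

    logits: [num_tokens, vocab_size] unnormalised scores; target: [num_tokens] token ids.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # standard MLE term
    smooth = -log_probs.mean(dim=-1)                               # uniform mass over all classes
    return ((1.0 - epsilon) * nll + epsilon * smooth).mean()
```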

6 Assessing the Fit of MLE-based NMT

We investigate the fit of the NMT model discussed in Section 5 on a held-out portion of the training data. This allows us to criticise MLE without confounders such as domain shift. We will turn to out-of-domain data in Section 7 when we further examine the learned distributions. We compare unbiased samples from the model with gold-standard references and analyse statistics of several aspects of the data. If the MLE solution is good, we would expect statistics of sampled data to closely match statistics of observed data. The same cannot be said about statistics of beam search outputs, at least not necessarily. Still, because NMT models are typically criticised that way, we include beam search outputs in the comparison. For that, we follow Guzman et al. (2019) and use a beam size of 5 and a length penalty of 1.2.

We model all statistics using hierarchical Bayesian models. We include statistics of the gold-standard references, sampled translations and beam search translations in our analysis. For each type of statistics, we formulate a joint model over these three groups and make comparisons between them by looking at the posterior distributions over the parameters of the analysis model. In order to reduce sparsity we also include statistics extracted from the training data in our analysis, and model the three test groups as a function of the inferences made on the training data statistics.5 Our methodology follows that advocated by Gelman et al. (2013) and Blei (2014).

6.1 Length Distribution

We extract target sequence lengths from references, samples and beam search outputs, as well as from training data. These observations are modelled using a hierarchical Gamma-Poisson model. We model the test groups such that the mean Poisson rate for each test group is the mean Poisson rate of training data scaled with a group-specific scaling variable. The scaling variables are tied by sharing a common prior among all test groups. We infer Gamma posterior approximations for all unknowns using stochastic variational inference (SVI) (Hoffman et al., 2013). After inferring the posteriors, we compare predictive samples to the observed data in terms of first to fourth order moments to verify that the model fits the observations well. Further details of the model are given in Appendix A.1.
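
A much-simplified sketch of this kind of analysis model is given below using NumPyro; the priors and the automatic guide are illustrative assumptions only, and the exact specification is the one in the paper’s Appendix A.1.

```python
import numpyro
import numpyro.distributions as dist
from numpyro import optim
from numpyro.infer import SVI, Trace_ELBO, autoguide

def length_model(train_lengths, group_lengths):
    # Poisson rate for training-data lengths.
    train_rate = numpyro.sample("train_rate", dist.Gamma(1.0, 0.1))
    numpyro.sample("train_obs", dist.Poisson(train_rate), obs=train_lengths)
    # Group-specific scaling variables tied through a shared prior.
    shared = numpyro.sample("shared", dist.Gamma(1.0, 1.0))
    with numpyro.plate("groups", len(group_lengths)):
        scale = numpyro.sample("scale", dist.Gamma(shared, 1.0))
    for g, lengths in enumerate(group_lengths):
        # Each test group's rate is the training rate scaled by its own variable.
        numpyro.sample(f"obs_{g}", dist.Poisson(train_rate * scale[g]), obs=lengths)

guide = autoguide.AutoNormal(length_model)
svi = SVI(length_model, guide, optim.Adam(0.01), Trace_ELBO())
# svi.run(jax.random.PRNGKey(0), 2000, train_lengths, group_lengths) would fit the posteriors.
```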

We make comparisons between test groups by looking at how the expected posterior Poisson rates for each group are distributed. Note that the Poisson rate parameter is also its mean, and thus can be interpreted as the mean target sequence length. In Figure 1 we show results on all four language pairs. We observe that samples generated by NMT represent the length statistics on held-out test data reasonably well, sometimes slightly overestimating and sometimes slightly underestimating them. Mostly, though, the expected posterior Poisson rates of samples and gold-standard references overlap a fair amount. Beam search outputs seem to stray a bit from the reference statistics, usually being less similar to them than samples from the NMT model. In three out of four language pairs we see that beam search outputs are on average shorter than ancestral sampling outputs, a known phenomenon of beam search (Koehn and Knowles, 2017).

6.2 Lexical Distribution

To probe the NMT model for its ability to capture lexical aspects of the data we compare unigram and bigram statistics of references, samples and beam search outputs. We extract unigram and bigram counts from target-side references, samples, beam search and training data and model all jointly in a hierarchical Bayesian model. Both unigram and bigram counts are modelled as Multinomial draws from a fixed vocabulary of BPEs. We put a Dirichlet prior on the Multinomial parameter and a Gamma prior on the Dirichlet concentration.

5Note that “training” data here only refers to the optimisation procedure of the NMT model. For the Bayesian analysis model there is no distinction between training and test data other than how we choose to treat it in the model specification.

Figure 1: Expected posterior Poisson rates for references, ancestral samples and beam search output. The Poisson rate is also the mean (and mode) of the distribution, thus can be interpreted as the mean target sequence length.

Figure 2: Comparison of unigram (top) and bigram (bottom) statistics on all language pairs. Shows the posterior agreement with training data (sg and mg) for each group. The x-axes are in the order of 1e3.

First, a posterior Dirichlet concentration is inferred on training data, then the test groups are modelled by scaling the expected posterior Dirichlet concentration on training data using group-specific scaling variables. Each group has one scaling variable sg for the unigram distribution, and one scaling variable mg for all bigram distributions. These scaling variables are amenable to interpretation: they represent agreement with the training posterior. Again, we use SVI to approximate posteriors for all unknowns. To confirm the fit of the analysis model, we compare posterior predictive samples to the observed data in terms of absolute frequency errors of unigrams and bigrams as well as ranking correlation. Further details of the probabilistic model can be found in Appendix A.2.

Figure 2 shows the posterior inferences for the scaling variables sg and mg. It is clear that references agree the most with training data, followed by samples from the model, followed by beam search outputs. It is difficult to conclude whether ancestral samples match the held-out test data statistics well. In some cases the amount of agreement with training data is notably less for samples than for held-out test data, in other cases they overlap. One thing we can observe consistently is that beam search statistics stray from gold-standard data statistics. In all cases beam search outputs show the least amount of agreement with training data, and are farthest away from the held-out test data. This shows that beam search also alters the lexical distribution and thereby again strays from the statistics of the data.

Figure 3: Comparison of skip-bigram statistics on all language pairs. Shows the posterior agreement with training data for the second word of the skip-bigram (mg) for each group. The x-axes are in the order of 1e3.

6.3 Word Order

To assess the ability of the NMT model to capture word order, we apply the model of Section 6.2 to skip-bigram counts. Skip-bigrams are constructed by taking any word in a sentence and pairing it with all words occurring later on in that same sentence. Skip-bigrams thus capture some word order information, more so than regular bigrams, as they preserve statistics about which words are likely to occur after another word irrespective of the distance in between them. Skip-bigrams have the benefit of not introducing additional sparsity that would make the analysis more difficult.
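
Extracting these counts is straightforward; a small sketch, assuming tokenised sentences given as lists of strings:

```python
from collections import Counter
from itertools import combinations

def skip_bigram_counts(sentences):
    """Count skip-bigrams: every word paired with every word occurring later in
    the same sentence, irrespective of the distance between them."""
    counts = Counter()
    for tokens in sentences:
        counts.update(combinations(tokens, 2))   # all ordered pairs (w_i, w_j) with i < j
    return counts
```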

We fit the model of Section 6.2 on the skip-bigram counts using SVI and performed the same predictive checks to confirm a good fit of the analysis model. The resulting posterior distributions for the bigram scaling variables mg are shown in Figure 3. We mostly observe the same trends as in Section 6.2. Gold-standard test data always shows the most agreement with training data of all groups, tying with ancestral samples of the NMT model for en-si. For three out of four language pairs ancestral samples of the model show a very similar amount of agreement with training data, indicating a good fit of the NMT model on this aspect of the data. For three out of four language pairs beam search yet again is least similar to training data. For si-en beam search shows more agreement with training data than ancestral sampling, something we also observed with length statistics in Section 6.1.

7 Examining the Model Distribution

In this set of experiments we further examine the probability distributions that the NMT models of Section 5 learned. We inspect the translations in a large unbiased sample from each trained model. To gain further insight we also consider out-of-domain data, in our case the FLORES dataset.

7.1 Number of Likely Translations

An NMT model, by the nature of its model specification, assigns probability mass to each and every possible sequence consisting of tokens in its vocabulary. Ideally, however, a well-trained NMT model assigns the bulk of its probability mass to good translations of the input sequence, and only a negligible amount to all other sequences.

We take 1,000 unbiased samples from the NMT model for each input sequence and count the cumulative probability mass of the unique translations sampled. Figure 4 shows the average cumulative probability mass for all test sentences with 1 standard deviation around it, as well as the final cumulative probability values for each input sequence. For the held-out in-domain data we observe that, on average, between 27.9% and 57.8% of the probability mass is covered after 1,000 samples. The large variance around the mean shows that in all language pairs we can find test sentences for which nearly all or barely any probability mass has been covered after 1,000 samples. This means that, even after taking 1,000 samples, only about half of the probability space has been explored. The situation is much more extreme when translating data that differ in domain from the data that the NMT model has seen during training, as can be seen in the bottom half of Figure 4 on the FLORES dataset.

Figure 4: Cumulative probability of 1,000 ancestral samples on the held-out in-domain (top) and FLORES (bottom) test sets. The dark blue line shows the average cumulative probability over all test sentences, the shaded area represents 1 standard deviation away from the average. The black dots to the right show the final cumulative probability for each individual test sentence.

Naturally, the NMT model is much more uncertain on out-of-domain data, and this is very clear from the amount of probability mass that has been covered after sampling 1,000 translations: on average, only between 0.2% and 0.9% of the probability space has been explored. This signifies that the set of likely translations under the model is very large and the probability distribution over those sentences mostly quite flat, especially so in the out-of-domain case. In fact, if we look at each input sequence individually, we even see that for 18.5% (en-ne), 15.7% (ne-en), 9.2% (en-si) and 3.3% (si-en) of them all 1,000 samples are unique. On the FLORES data these numbers increase to 52.1% (en-ne), 86.8% (ne-en), 84.6% (en-si) and 87.3% (si-en). For these input sequences, the probability distribution learned by the NMT model is so flat that in these 1,000 translations sampled no single translation stands out over the others.
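
The counting procedure behind Figure 4 amounts to the following sketch, assuming a sampler that returns an ancestral sample together with its log-probability (as in the earlier sampling sketch):

```python
import math

def cumulative_mass(sampler, num_samples=1000):
    """Cumulative probability mass of the unique translations sampled so far."""
    seen, mass, curve = set(), 0.0, []
    for _ in range(num_samples):
        y, log_p = sampler()                  # one ancestral sample and log p(y|x, theta)
        if tuple(y) not in seen:              # each unique translation is counted once
            seen.add(tuple(y))
            mass += math.exp(log_p)
        curve.append(mass)
    return curve
```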

Ott et al. (2018) performed a similar experiment on a high-resource English-French system trained on 35.5 million sentence pairs from WMT’14. They find that after drawing 1,000 samples from the WMT’14 validation set only about 20% of the probability space has been explored, and after drawing 10,000 samples still less than 25% has been explored.

7.2 Sampling the Beam Search Solution

A natural question to ask now is how likely it is to sample the beam search solution. In Figure 5 we show the fraction of input sequences for which the beam search output is contained within the set of samples as a function of sample size (measured by exact string matching). We find that for held-out in-domain data the beam search output is quite a likely event, oftentimes being sampled after only taking a few samples. For between 77.1% and 92.2% of the input sequences the beam search translation has been sampled after taking 1,000 samples. The situation is again very different in the out-of-domain case.

For the FLORES dataset the curve looks much more linear and increases very slowly. For only between 4.8% and 8.4% of the input sequences, the beam search output has been sampled after taking 1,000 samples. For the other input sequences, the beam search output wasn’t sampled at all. The beam search output is thus a much more arbitrary choice on out-of-domain data.

7.3 Empty Sequences

In a recent study Stahlberg and Byrne (2019) showed that oftentimes the true mode of a trained NMT system is the empty sequence. This is a worrying fact as the most common decision rule in machine translation is to approximately retrieve the mode using beam search.

Figure 5: Fraction of the held-out in-domain (top) and FLORES (bottom) test data for which the beam search solution is contained within the set of samples as a function of sample size.

                            BLEU                                      METEOR
Test              beam  single sample  MBRM  OracleM       beam  single sample  MBRM  OracleM
en-ne  ID         22.1  17.1 (0.2)     18.3  23.7 (0.2)    37.3  36.1 (0.1)     40.5  43.4 (0.0)
en-ne  FLORES      3.8   1.7 (0.1)      1.6   2.3 (0.1)    31.3  30.6 (0.1)     34.9  37.0 (0.0)
ne-en  ID         29.7  24.0 (0.2)     28.2  36.7 (0.1)    29.8  26.6 (0.1)     30.1  35.3 (0.1)
ne-en  FLORES      6.8   3.3 (0.1)      4.6   6.3 (0.0)    17.2  12.8 (0.1)     16.6  20.1 (0.0)
en-si  ID         11.3   7.1 (0.2)      9.5  16.1 (0.1)    34.3  31.0 (0.1)     36.3  41.5 (0.1)
en-si  FLORES      1.3   0.9 (0.1)      0.9   1.4 (0.1)    30.3  30.3 (0.1)     34.8  36.9 (0.0)
si-en  ID         25.5  17.3 (0.2)     24.1  33.7 (0.2)    28.7  24.1 (0.1)     28.9  36.0 (0.0)
si-en  FLORES      6.8   3.4 (0.1)      5.3   7.5 (0.0)    18.4  13.7 (0.1)     17.7  21.6 (0.0)

Table 1: BLEU and METEOR scores on all language pairs using different decision rules. Numbers between brackets represent 1 standard deviation from the average. MBRM shows MBR scores using 30 samples and sentence-level METEOR as the utility. OracleM uses sentence-level METEOR with the reference to select the optimal sample out of 30 random samples.

We find that for between 7.2% and 29.1% of input sequences for held-out in-domain data and between 2.8% and 33.3% of input sequences for FLORES data an empty sequence was sampled at least once in 1,000 samples. However, when an empty sequence was sampled it only occurred on average 1.3±0.7 (1 standard deviation) times in 1,000 samples. Even though it could well be, as the evidence that Stahlberg and Byrne (2019) provide is strong, that the true mode of the model distribution for our models is the empty sequence, it is still a rather unlikely outcome.

7.4 Automatic Evaluation

The number of translations that an NMT model assigns mass to can be very large as we have seen in Section 7.1. We now investigate what the average quality of these samples is. For quality assessments, we compute detokenised BLEU using SacreBLEU (Papineni et al., 2002; Post, 2018) and METEOR (Denkowski and Lavie, 2011) against a single reference. We translate the test sets using a single ancestral sample per input sentence and repeat the experiment 30 times to report average and 1 standard deviation in Table 1 (single sample). For reference, we also include beam search scores (beam), the de facto method to obtain translations in NMT. We see that, on average, samples of the model always perform worse than beam search translations under these metrics. This is no surprise, of course, as ancestral sampling is not a fully fledged decision rule, but simply a technique to unbiasedly explore the model distribution. Moreover, beam search itself does come with some adjustments to perform well (such as a specific beam size and length penalty). The gap between sampling and beam search is between 0.4 and 8.2 BLEU and between 0 and 4.7 METEOR. The gap can thus be quite large, but overall the quality of an average sample is reasonable compared to beam search. We also observe that the variance of the sample scores is small.

Figure 6: METEOR and BLEU scores for oracle-selected samples as a function of sample size on the held-out in-domain (top) and FLORES (bottom) test sets. For each sample size we repeat the experiment 4 times and show a box plot per sample size. The blue lines show beam search scores.

7.5 Selecting the Best Sample

So far we looked at the performance of single samples. If we had multiple sampled translations, however, we could design a decision rule to select the “best” sample from such a set. In this experiment we investigate what the maximum performance is that we could achieve, if we could select the very best-scoring sample under an automatic evaluation metric. For that, we employ an oracle selection procedure using sentence-level METEOR with the reference translation to select the best sample from a set of samples. We do this for a sample size varying from 5 to 30 samples and repeat each experiment four times. We show box plots of the results in Figure 6 where we report corpus-level METEOR and BLEU, but note that we have selected purely on sentence-level METEOR scores.
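
Per input sentence the oracle simply keeps the sample with the highest sentence-level score against the reference; sentence_meteor below is a stand-in for any sentence-level METEOR implementation.

```python
def oracle_select(samples, reference, sentence_meteor):
    """Oracle selection: return the sample scoring highest against the reference."""
    return max(samples, key=lambda hypothesis: sentence_meteor(reference, hypothesis))
```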

BLEU and METEOR scores steadily increase with sample size. For a given sample size we observe that variance is generally very small, showing again that samples are of consistent quality with respect to those metrics. Between 5 and 10 samples are required to outperform beam search on METEOR. On in-domain data this also holds for BLEU scores. On out-of-domain data BLEU still improves with increasing sample size, but requires more samples to meet beam search performance. Overall, this experiment shows that samples are of decent and consistent quality. For fewer than 30 samples the model could meet or outperform beam search performance in most cases, if we knew how to choose a best sample from the set. This is a motivating result for looking into sampling-based decision rules.

7.6 Minimum Bayes Risk Decoding

We have seen that the model distribution, especially on out-of-domain data, is quite flat over a large set of likely candidates. Yet, this set represents various statistics of the data well and holds potentially good translations.

Altogether, even though the model spreads probability mass over many translations, these translations do share statistics that correlate with the reference translation. This motivates a decision rule that exploits all information we have available about the distribution.

Let u(y, h) be a utility function, that is, a function that assesses a hypothesis h against a reference y. For a given utility, statistical decision theory (Bickel and Doksum, 1977) prescribes that the optimum decision is the one that maximises expected utility (i.e., negative Bayes risk) under the model, y* = argmax_{h ∈ H(x)} E_{p(y|x,θ)}[u(y, h)], where the maximisation is over the entire set of possible translations H(x). Note that there is no need for a human-annotated reference, expected utility is computed by having the model fill in reference translations. This decision rule, also known as minimum Bayes risk (MBR) decoding, enjoyed some popularity in days of statistical machine translation (Kumar and Byrne, 2004; Tromble et al., 2008). It is especially suited where we trust a model in expectation but not its mode in particular (Smith, 2011, Section 5.3).6 MBR decoding, much like MAP decoding, is clearly intractable. We can at best obtain unbiased estimates of expected utility (e.g., via Monte Carlo sampling) and we cannot search over the entirety of H(x). Still, a tractable approximation can be designed, albeit without any optimality guarantees. We use MC both to approximate expected utility, for a given h ∈ H(x), and to approximate the support H(x) of the distribution. In particular, we maximise over the support H(x) of the empirical distribution obtained by ancestral sampling:

y^{\star} = \arg\max_{h \in \mathcal{H}(x)} \frac{1}{S} \sum_{s=1}^{S} u(y^{(s)}, h) \quad \text{for } y^{(s)} \sim p(y|x, \theta) ,   (3)

which runs in time O(S²).7 Note that even though approximate, this decision rule has interesting properties. For example, our estimates improve with sample size, occasional pathologies in the set pose no threat, and the rule does not rely on a notion of incremental search.
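
A direct implementation of Equation 3 is only a few lines; utility below stands in for sentence-level METEOR (any sentence-level metric could be plugged in), and samples is the list of S ancestral samples that doubles as the hypothesis space.

```python
def mbr_decode(samples, utility):
    """Sampling-based MBR (Eq. 3): pick the sample with highest estimated expected utility."""
    def expected_utility(h):
        # Monte Carlo estimate of E_{p(y|x)}[u(y, h)] using the same samples.
        return sum(utility(y, h) for y in samples) / len(samples)
    return max(samples, key=expected_utility)   # O(S^2) utility evaluations in total
```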

As the utility function, we chose METEOR, which unlike BLEU is well-defined at the sentence level.8

This does mean we expect to see the largest gains on corpus-level METEOR, rather than corpus-level BLEU, nonetheless we report both. We approximate expected utility using S = 30 ancestral samples, and use the translations we sample to make up our approximation to H(x). Results are shown in Table 1. As expected, MBR considerably outperforms the average single sample performance in terms of METEOR and in most cases beats or is on par with beam search decoding (we observe similar results when comparing with beam search in label-smoothed systems, see Appendix B). The gap with oracle selection can be due to error in our MC approximation to MBR, or it signals that there is room for improvement in the model. In terms of BLEU we also observe considerable improvement over average sampling performance in most cases, but it does not match the performance of beam search. This is most likely due to the choice of utility function that favours statistics that METEOR rewards.

We show MBR to be effective with as few as 30 samples, once again showing that exploring the model as a probability distribution holds great potential. Previous approximations of MBR in NMT have employed beam search to guide the search and evaluate expected utility (with probabilities renormalised to sum to 1 in the beam); they have reported the need for very large beams (Stahlberg et al., 2017; Blain et al., 2017; Shu and Nakayama, 2017). In particular, Blain et al. (2017) investigate the quality of beam search outputs with very large beam sizes. They confirm the beam search curse, but find that larger beams do contain better translations using an oracle selection rule. This motivates them to implement an approximation to MBR known as consensus decoding (DeNero et al., 2009) for re-ranking NMT n-best lists. They find marginal improvements in BLEU when using BEER as the utility, but using human evaluations find that they do much better than the beam search output.

6MAP decoding is in fact MBR with a very strict utility function which evaluates to 1 if a translation exactly matches the reference, and 0 otherwise. As a community, we acknowledge by means of our evaluation strategies (manual or automatic) that exact matching is inadequate: translation, unlike many unstructured classification problems, admits multiple solutions.

7We remark that more practical approximations do exist. For example, DeNero et al. (2009) trade unbiased estimates of expected utility for linear-time scoring.

8Even though one can alter BLEU such that it is defined at the sentence level (for example, by adding a small positive constant to n-gram counts), this “smoothing” in effect biases BLEU’s sufficient statistics. Unbiased statistics are the key to MBR, thus we opt for a metric that is already defined at the sentence level.

They claim the inability to directly score better translations higher is a deficiency of the model scoring function. We argue this is another piece of evidence for the inadequacy of the mode.

8 Discussion

In this work, we discuss the inadequacy of the mode in neural machine translation systems and question the appropriateness of the use of MAP decoding in NMT.

Many recent studies have criticised NMT as a model, or its estimation procedure (MLE), on the basis of findings about its mode, either exact or approximate through beam search. We claim that inadequacies about the mode on their own are not sufficient evidence to criticise NMT as a model. Our results show that, when criticised as probability distributions using unbiased samples, NMT models are able to match data statistics well in expectation, and that it is beam search that strays away from statistics of the data. We also show that many biases that NMT is known to have are only or mostly there due to the use of MAP decoding.

We further claim that MAP decoding is hardly justified in NMT, because the most likely translations accumulate very little mass, making the ranking among them arbitrary. Our experiments verify this, showing that NMT’s probability distribution is extremely flat outside of its training domain and that the beam search output is a very rare event under this distribution. We show the same findings for the empty sequence, which a recent study has found to often be the true mode of the distribution.

We therefore argue that, as MLE is meant to capture the distribution as a whole, and not just the mode of a distribution, our decision rules should reflect this and consider the distribution more holistically. Our findings show that, on average, samples are of decent quality and contain translations that outperform the beam output even for a small number of samples, further motivating the use of sampling-based decision rules. We show that a straightforward implementation of a well-known sampling-based decision rule, minimum Bayes risk decoding, shows good results. This confirms that even though the set of likely translations under the model is large, these share many statistics that correlate well with the reference.

MLE-trained NMT models admit probabilistic interpretation and an advantage of the probabilistic framework is that a lot of methodology is already in place when it comes to model criticism as well as making predictions. We therefore advocate for criticising NMT models as probability distributions and making predictions using decision rules that take into account the distributions more holistically. We see no problems with improving beam search for MLE-trained NMT or changing the training objective to better fit beam search, but one needs to take care to criticise the model appropriately. We hope that our work paves the way for research into practical sampling-based decision rules and motivates researchers to assess model improvements to MLE-trained NMT systems from a probabilistic perspective.

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299 (GoURMET). We also thank Khalil Sima’an, Lina Murady and Milos Stanojevic for comments and helpful discussions.

A Titan Xp card used for this research was donated by the NVIDIA Corporation.

References

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442–2452, Berlin, Germany, August. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015, San Diego, USA.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1171–1179. Curran Associates, Inc.


Peter J Bickel and Kjell A Doksum. 1977. Mathematical statistics: basic ideas and selected topics. Holden-Day Inc., Oakland, CA, USA.

Frédéric Blain, Lucia Specia, and Pranava Madhyastha. 2017. Exploring hypotheses spaces in neural machine translation. In Asia-Pacific Association for Machine Translation (AAMT), editor, Machine Translation Summit XVI, Nagoya, Japan.

David M Blei. 2014. Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1:203–232.

Léon Bottou and Yann L. Cun. 2004. Large scale online learning. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 217–224. MIT Press.

Léon Bottou. 1991. Une approche théorique de l'apprentissage connexionniste; applications à la reconnaissance de la parole.

Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. 2012. High-dimensional sequence transduction. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 12.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, October. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49(2):375–395, Jun.

Eldan Cohen and J. Christopher Beck. 2019. Empirical analysis of beam search performance degradation in neural sequence models. In ICML.

John DeNero, David Chiang, and Kevin Knight. 2009. Fast consensus decoding over translation forests. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 567–575, Suntec, Singapore, August. Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of WMT, 2011, pages 85–91, Edinburgh, Scotland, July.

Bryan Eikema and Wilker Aziz. 2019. Auto-encoding variational neural machine translation. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 124–141, Florence, Italy, August. Association for Computational Linguistics.

Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2013. Bayesian Data Analysis. Chapman and Hall/CRC, 3rd ed. edition.

Miguel Graça, Yunsu Kim, Julian Schamper, Shahram Khadivi, and Hermann Ney. 2019. Generalizing back-translation in neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 45–52, Florence, Italy, August. Association for Computational Linguistics.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053–1062, Valencia, Spain, April. Association for Computational Linguistics.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc’Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6100–6113, Hong Kong, China, November. Association for Computational Linguistics.

P. E. Hart, N. J. Nilsson, and B. Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107.


Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14(4):1303–1347.

Liang Huang, Kai Zhao, and Mingbo Ma. 2017. When to finish? Optimal beam search for neural text generation (modulo beam size). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2134–2139, Copenhagen, Denmark, September. Association for Computational Linguistics.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver, August. Association for Computational Linguistics.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.

Aviral Kumar and Sunita Sarawagi. 2019. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. 2019. When does label smoothing help? In Advances in Neural Information Processing Systems 32, pages 4694–4703. Curran Associates, Inc.

Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 212–223, Brussels, Belgium, October. Association for Computational Linguistics.

Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2018. Analyzing uncertainty in neural machine translation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3956–3965, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul. PMLR.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, 2002, pages 311–318, Philadelphia, USA, July.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of WMT, 2018, pages 186–191, Brussels, Belgium.

Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407.

Harshil Shah and David Barber. 2018. Generative neural machine translation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1346–1355. Curran Associates, Inc.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1692, Berlin, Germany, August. Association for Computational Linguistics.

Raphael Shu and Hideki Nakayama. 2017. Later-stage minimum Bayes-risk decoding for neural machine translation. CoRR, abs/1704.03169.


Raphael Shu and Hideki Nakayama. 2018. Improving beam search by removing monotonic constraint for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 339–344, Melbourne, Australia, July. Association for Computational Linguistics.

Noah A. Smith. 2011. Linguistic Structure Prediction. Morgan and Claypool.

Pavel Sountsov and Sunita Sarawagi. 2016. Length bias in encoder decoder models and a case for global conditioning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1516–1525, Austin, Texas, November. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958.

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3347–3353, Hong Kong, China, November. Association for Computational Linguistics.

Felix Stahlberg, Adrià de Gispert, Eva Hasler, and Bill Byrne. 2017. Neural machine translation by minimising the Bayes-risk with respect to syntactic translation lattices. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 362–368, Valencia, Spain, April. Association for Computational Linguistics.

Miloš Stanojević and Mark Steedman. 2020. Max-Margin Incremental CCG Parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Mitchell Stern, Daniel Fried, and Dan Klein. 2017. Effective inference for generative neural parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1695–1700, Copenhagen, Denmark, September. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, NIPS, 2014, pages 3104–3112. Montreal, Canada.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, volume 2012, pages 2214–2218.

Roy Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice Minimum Bayes-Risk decoding for statistical machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 620–629, Honolulu, Hawaii, October. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 6000–6010.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306, Austin, Texas, November. Association for Computational Linguistics.

Yilin Yang, Liang Huang, and Mingbo Ma. 2018. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3054–3059, Brussels, Belgium, October–November. Association for Computational Linguistics.

Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. 2019. Bridging the gap between training and inference for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4334–4343, Florence, Italy, July. Association for Computational Linguistics.

Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and Arvind Neelakantan. 2020. Trading off diversity and quality in natural language generation. CoRR, abs/2004.10450.


A Analysis Models

A.1 Length Analysis

We model the training data using a hierarchical Gamma-Poisson model. Each target sequence length is modelled as a draw from a Poisson distribution with a Poisson rate parameter specific to that sequence.

All Poisson rates share a common population-level Gamma prior with population-level parameters α and β. The population-level parameters are given fixed Exponential priors.

α ∼ Exp(1)        β ∼ Exp(10)
λ_i ∼ Gamma(α, β)        y_i ∼ Poisson(λ_i)

Here, i indexes one particular data point. This model is very flexible, because we allow the model to assign each data point its own Poisson rate. We model test groups as an extension of the training group. Test group data points are also modelled as draws from a Gamma-Poisson model, but parameterised slightly differently.

µ = E[Gamma(α, β) | D_T]        η ∼ Exp(1)
s_g ∼ Exp(η)        t_g = 1/µ
λ_gi ∼ Gamma(s_g, t_g)        y_gi ∼ Poisson(λ_gi)

Here, i again indexes a particular data point, and g a group in {test, sampling, beam}. All Poisson rates are individual to each data point in each group. The Poisson rates do share a group-level Gamma prior, whose parameters are s_g and t_g. s_g shares a prior among all test groups and therefore ties all test groups together. t_g is derived from posterior inferences on the training data by taking the expected posterior Poisson rate in the training data and inverting it. This is done such that the mean Poisson rate for each test group is s_g · µ, where s_g can be seen as a parameter that scales the expected posterior training rate for each test group individually.
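
For concreteness, the sketch below writes the training-data part of this model in NumPyro. The choice of library, the inference algorithm, and the reading of Exp(1)/Exp(10) as rate parameterisations are assumptions made for illustration only; the test groups follow the same pattern, with s_g ∼ Exp(η) and a rate t_g = 1/µ fixed from the training posterior.

```python
# Minimal NumPyro sketch of the hierarchical Gamma-Poisson length model
# (library and inference choices are illustrative assumptions).
import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS


def length_model(lengths):
    # Population-level parameters with fixed Exponential priors.
    alpha = numpyro.sample("alpha", dist.Exponential(1.0))
    beta = numpyro.sample("beta", dist.Exponential(10.0))
    with numpyro.plate("data", lengths.shape[0]):
        # Each sequence draws its own Poisson rate from the shared Gamma prior.
        lam = numpyro.sample("lambda", dist.Gamma(alpha, beta))
        numpyro.sample("y", dist.Poisson(lam), obs=lengths)


# Toy usage: infer posteriors over alpha, beta, and the per-sequence rates.
lengths = jnp.array([12, 7, 23, 15, 9])  # observed target-side lengths
mcmc = MCMC(NUTS(length_model), num_warmup=500, num_samples=1000)
mcmc.run(jax.random.PRNGKey(0), lengths=lengths)
```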

A.2 Lexical & Word Order Analyses

The model for the training data is described below.

α ∼ Gamma(1, 1)        β ∼ Gamma(1, 1)
θ ∼ Dir(α)        ψ_u ∼ Dir(β)
u ∼ Multinomial(θ)        b|u ∼ Multinomial(ψ_u)

Here, we have one Gamma-Dirichlet-Multinomial model to model unigram counts u, and a separate Gamma-Dirichlet-Multinomial model for each u (the first word of a bigram) that b (the second word of a bigram) conditions on. This means that we effectively have V + 1 such models (where V is the BPE vocabulary size) in total to model the training data. We do collapsed inference for each Dirichlet-Multinomial (as we are not interested in assessing θ or ψ_u), and infer posteriors approximately using SVI with Gamma approximate posterior distributions.

We model the other three groups using the inferred posterior distributions on the training data D_T. We compute the expected posterior concentration of the Dirichlets in the training models and normalise it such that it sums to 1. The other groups are modelled by scaling this normalised concentration parameter using a scalar. This scalar, s_g for unigrams or m_g for bigrams, can be interpreted as the amount of agreement of each test group with the training group. The higher this scalar is, the more peaked the test group multinomials will be about the training group lexical distribution.


µ(α) = E[α | D_T]        µ(β) = E[β | D_T]
η_s ∼ Gamma(1, 0.2)        η_m ∼ Gamma(1, 0.2)
s_g ∼ Gamma(1, η_s)        m_g ∼ Gamma(1, η_m)
θ_g ∼ Dir(s_g · µ(α))        ψ_g ∼ Dir(m_g · µ(β))
u_g ∼ Multinomial(θ_g)        b_g|u_g ∼ Multinomial(ψ_g)
g ∈ {GS, AS, beam}
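
As an illustration, the sketch below expresses the unigram model for a single test group in NumPyro (again an assumed library choice, not our actual implementation; the bigram models follow the same pattern with m_g and µ(β)). Here mu_alpha stands for the normalised expected posterior concentration computed from the training model, and total_count for the total number of unigram tokens observed in the group.

```python
# Sketch of the test-group unigram model (NumPyro is an assumed choice of library).
import numpyro
import numpyro.distributions as dist


def unigram_group_model(counts, total_count, mu_alpha):
    # counts:      observed unigram counts for one group (length-V vector).
    # total_count: sum of those counts, passed in as a Python int.
    # mu_alpha:    normalised expected posterior Dirichlet concentration from
    #              the training model (length-V vector summing to 1).
    # In the paper eta_s is shared across all test groups; here we show one group.
    eta_s = numpyro.sample("eta_s", dist.Gamma(1.0, 0.2))
    # s_g scales the training concentration: the larger s_g, the more peaked the
    # group's lexical distribution is around the training distribution.
    s_g = numpyro.sample("s_g", dist.Gamma(1.0, eta_s))
    theta_g = numpyro.sample("theta_g", dist.Dirichlet(s_g * mu_alpha))
    numpyro.sample(
        "u_g",
        dist.Multinomial(total_count=total_count, probs=theta_g),
        obs=counts,
    )
```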

B Label Smoothing Results

See Table 2.

                          BLEU                                                METEOR
Pair    Test     LS   beam   single sample   MBR_M   Oracle_M      beam   single sample   MBR_M   Oracle_M
en-ne   ID       n    22.1   17.1 (0.2)      18.3    23.7 (0.2)    37.3   36.1 (0.1)      40.5    43.4 (0.0)
                 y    20.6    8.2 (0.2)      -       -             36.7   29.0 (0.3)      -       -
en-ne   FLORES   n     3.8    1.7 (0.1)       1.6     2.3 (0.1)    31.3   30.6 (0.1)      34.9    37.0 (0.0)
                 y     4.8    1.1 (0.1)      -       -             32.5   24.2 (0.2)      -       -
ne-en   ID       n    29.7   24.0 (0.2)      28.2    36.7 (0.1)    29.8   26.6 (0.1)      30.1    35.3 (0.1)
                 y    31.0   14.4 (0.3)      -       -             30.6   18.2 (0.1)      -       -
ne-en   FLORES   n     6.8    3.3 (0.1)       4.6     6.3 (0.0)    17.2   12.8 (0.1)      16.6    20.1 (0.0)
                 y     8.6    2.1 (0.1)      -       -             19.2    9.1 (0.1)      -       -
en-si   ID       n    11.3    7.1 (0.2)       9.5    16.1 (0.1)    34.3   31.0 (0.1)      36.3    41.5 (0.1)
                 y    10.9    3.2 (0.1)      -       -             34.2   26.5 (0.1)      -       -
en-si   FLORES   n     1.3    0.9 (0.1)       0.9     1.4 (0.1)    30.3   30.3 (0.1)      34.8    36.9 (0.0)
                 y     2.9    0.5 (0.1)      -       -             31.8   24.6 (0.1)      -       -
si-en   ID       n    25.5   17.3 (0.2)      24.1    33.7 (0.2)    28.7   24.1 (0.1)      28.9    36.0 (0.0)
                 y    25.8    8.7 (0.2)      -       -             28.9   15.5 (0.2)      -       -
si-en   FLORES   n     6.8    3.4 (0.1)       5.3     7.5 (0.0)    18.4   13.7 (0.1)      17.7    21.6 (0.0)
                 y     7.7    2.4 (0.1)      -       -             20.0    9.7 (0.1)      -       -

Table 2: BLEU and METEOR scores on all language pairs using different decision rules for NMT models trained with (y) and without (n) label smoothing (LS). Numbers in brackets represent 1 standard deviation from the average. MBR_M shows MBR scores using 30 samples and sentence-level METEOR as the utility. Oracle_M uses sentence-level METEOR with the reference to select the optimal sample out of 30 random samples. REMARKS: Label smoothing seems like a rather drastic change to MLE, as can be seen from the massive drop in single-sample performance, and because of that we do not perform MBR or Oracle experiments in that case. Interestingly, compare METEOR with MBR_M selection for a model trained without LS to METEOR with beam search selection for a model trained with LS: the average improvement due to LS is 0.1 (ID) and 1.5 (FLORES), whereas the average improvement due to MBR is 1.4 (ID) and 1.7 (FLORES). This is very motivating, as it suggests that better exploration of an MLE-trained model is competitive with changes to the objective.
